pub struct Unit(/* private fields */);
Unit represents a single unit of haystack for DFA based regex engines.
It is not expected for consumers of this crate to need to use this type unless they are implementing their own DFA. And even then, it’s not required: implementors may use other techniques to handle haystack units.
Typically, a single unit of haystack for a DFA would be a single byte. However, for the DFAs in this crate, matches are delayed by a single byte in order to handle look-ahead assertions (\b, $ and \z). Thus, once we have consumed the haystack, we must run the DFA through one additional transition using a unit that indicates the haystack has ended.
There is no way to represent a sentinel with a u8 since all possible values may be valid haystack units to a DFA. Therefore, this type explicitly adds room for a sentinel value.
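The extra end-of-input transition can be illustrated with a hand-written toy automaton. This is only a sketch: ToyUnit, State, step and toy_is_match are hypothetical names invented for this example, and the state machine is a simplified stand-in for ^[a-z]+$, not how the DFAs in this crate are built.

```rust
/// Hypothetical stand-in for a haystack unit: either a byte or the
/// end-of-input sentinel. It models the idea only, not the crate's type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ToyUnit {
    Byte(u8),
    Eoi,
}

/// States of a hand-written toy automaton for `^[a-z]+$`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Start,
    InWord,
    Match,
    Dead,
}

/// Advances the toy automaton by one haystack unit.
fn step(state: State, unit: ToyUnit) -> State {
    match (state, unit) {
        (State::Start, ToyUnit::Byte(b'a'..=b'z')) => State::InWord,
        (State::InWord, ToyUnit::Byte(b'a'..=b'z')) => State::InWord,
        // `$` can only be confirmed once the sentinel unit arrives.
        (State::InWord, ToyUnit::Eoi) => State::Match,
        _ => State::Dead,
    }
}

/// Feeds every byte of the haystack, then one additional sentinel unit.
fn toy_is_match(haystack: &[u8]) -> bool {
    let mut state = State::Start;
    for &b in haystack {
        state = step(state, ToyUnit::Byte(b));
    }
    // One extra transition after the haystack has been consumed.
    state = step(state, ToyUnit::Eoi);
    state == State::Match
}
```

In this sketch the match state is reachable only through the sentinel transition, which mirrors the "delayed by one unit" behavior described above.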
The sentinel EOI value is always its own equivalence class and is ultimately represented by adding 1 to the maximum equivalence class value. So for example, the regex ^[a-z]+$ might be split into the following equivalence classes:
0 => [\x00-`]
1 => [a-z]
2 => [{-\xFF]
3 => [EOI]
Where EOI is the special sentinel value that is always in its own singleton equivalence class.
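The partition above can be written out as a small lookup function. This is a sketch under the assumption that the classes are exactly the four listed; class_of, NUM_BYTE_CLASSES and EOI_CLASS are hypothetical names for illustration and are not part of this crate.

```rust
/// Maps a byte to its equivalence class for the example partition of
/// `^[a-z]+$` shown above. The three arms cover all 256 byte values.
fn class_of(byte: u8) -> u8 {
    match byte {
        0x00..=b'`' => 0,
        b'a'..=b'z' => 1,
        b'{'..=0xFF => 2,
    }
}

/// Number of byte-based equivalence classes in this example.
const NUM_BYTE_CLASSES: usize = 3;

/// The sentinel class is the maximum byte class (2) plus 1, which is
/// the same as the number of byte-based classes.
const EOI_CLASS: usize = NUM_BYTE_CLASSES;
```

Under these assumptions, the value passed to Unit::eoi for this alphabet would be 3.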
Implementations§
impl Unit
pub fn u8(byte: u8) -> Unit
Create a new haystack unit from a byte value.
All possible byte values are legal. However, when creating a haystack unit for a specific DFA, one should be careful to only construct units that are in that DFA’s alphabet. Namely, one way to compact a DFA’s in-memory representation is to collapse its transitions to a set of equivalence classes instead of a set of all possible byte values. If a DFA uses equivalence classes instead of byte values, then the byte given here should be the equivalence class.
pub fn eoi(num_byte_equiv_classes: usize) -> Unit
Create a new “end of input” haystack unit.
The value given is the sentinel value used by this unit to represent the “end of input.” The value should be the total number of equivalence classes in the corresponding alphabet. Its maximum value is 256, which occurs when every byte is its own equivalence class.
§Panics
This panics when num_byte_equiv_classes is greater than 256.
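The documented contract can be modeled in a few lines. This is a minimal sketch, not the crate's source: checked_eoi_value is a hypothetical helper, and the choice of u16 as the return type is an assumption based on the maximum of 256 not fitting in a u8.

```rust
/// Hypothetical helper mirroring the documented contract of `Unit::eoi`:
/// accept at most 256 byte-based equivalence classes, panic otherwise.
fn checked_eoi_value(num_byte_equiv_classes: usize) -> u16 {
    assert!(
        num_byte_equiv_classes <= 256,
        "at most 256 byte-based equivalence classes are possible, got {}",
        num_byte_equiv_classes,
    );
    // The cast is lossless: the value is at most 256 after the assertion.
    num_byte_equiv_classes as u16
}
```

The upper bound of 256 is reached when every byte is its own class, in which case the byte classes occupy 0 through 255 and the sentinel takes the next value.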
pub fn as_u8(self) -> Option<u8>
If this unit is not an “end of input” sentinel, then this returns its underlying byte value. Otherwise, it returns None.
pub fn as_eoi(self) -> Option<u16>
If this unit is an “end of input” sentinel, then return the underlying sentinel value that was given to Unit::eoi. Otherwise return None.
pub fn as_usize(self) -> usize
Return this unit as a usize, regardless of whether it is a byte value or an “end of input” sentinel. In the latter case, the underlying sentinel value given to Unit::eoi is returned.
pub fn is_byte(self, byte: u8) -> bool
Returns true if and only if this unit is a byte value equivalent to the byte given. This always returns false when this is an “end of input” sentinel.
pub fn is_word_byte(self) -> bool
Returns true when this unit corresponds to an ASCII word byte.
This always returns false when this unit represents an “end of input” sentinel.
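The accessor behavior described above can be summarized with a self-contained model. ModelUnit is a hypothetical type written for this illustration; it is not the crate's Unit, whose fields are private. The sketch also assumes that "ASCII word byte" means one of [0-9A-Za-z_].

```rust
/// Hypothetical model of a haystack unit: a byte or an end-of-input
/// sentinel carrying the value that would be given to `Unit::eoi`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ModelUnit {
    Byte(u8),
    Eoi(u16),
}

impl ModelUnit {
    /// The byte value, or `None` for the sentinel.
    fn as_u8(self) -> Option<u8> {
        match self {
            ModelUnit::Byte(b) => Some(b),
            ModelUnit::Eoi(_) => None,
        }
    }

    /// The sentinel value, or `None` for a byte.
    fn as_eoi(self) -> Option<u16> {
        match self {
            ModelUnit::Byte(_) => None,
            ModelUnit::Eoi(n) => Some(n),
        }
    }

    /// The underlying value as a `usize`, whichever variant this is.
    fn as_usize(self) -> usize {
        match self {
            ModelUnit::Byte(b) => usize::from(b),
            ModelUnit::Eoi(n) => usize::from(n),
        }
    }

    /// True only for a byte unit equal to `byte`; never for the sentinel.
    fn is_byte(self, byte: u8) -> bool {
        self.as_u8() == Some(byte)
    }

    /// Assumption: a word byte is an ASCII letter, digit or underscore.
    fn is_word_byte(self) -> bool {
        self.as_u8()
            .map_or(false, |b| b.is_ascii_alphanumeric() || b == b'_')
    }
}
```

Because as_usize covers both variants, a value like this can index a transition table row that has one slot per byte class plus one slot for the sentinel.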