aho_corasick::automaton

Trait Automaton

Source
pub unsafe trait Automaton: Sealed {
Show 26 methods // Required methods fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>; fn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID; fn is_special(&self, sid: StateID) -> bool; fn is_dead(&self, sid: StateID) -> bool; fn is_match(&self, sid: StateID) -> bool; fn is_start(&self, sid: StateID) -> bool; fn match_kind(&self) -> MatchKind; fn match_len(&self, sid: StateID) -> usize; fn match_pattern(&self, sid: StateID, index: usize) -> PatternID; fn patterns_len(&self) -> usize; fn pattern_len(&self, pid: PatternID) -> usize; fn min_pattern_len(&self) -> usize; fn max_pattern_len(&self) -> usize; fn memory_usage(&self) -> usize; fn prefilter(&self) -> Option<&Prefilter>; // Provided methods fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError> { ... } fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError> { ... } fn try_find_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result<FindIter<'a, 'h, Self>, MatchError> where Self: Sized { ... } fn try_find_overlapping_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError> where Self: Sized { ... } fn try_replace_all<B>( &self, haystack: &str, replace_with: &[B], ) -> Result<String, MatchError> where Self: Sized, B: AsRef<str> { ... } fn try_replace_all_bytes<B>( &self, haystack: &[u8], replace_with: &[B], ) -> Result<Vec<u8>, MatchError> where Self: Sized, B: AsRef<[u8]> { ... } fn try_replace_all_with<F>( &self, haystack: &str, dst: &mut String, replace_with: F, ) -> Result<(), MatchError> where Self: Sized, F: FnMut(&Match, &str, &mut String) -> bool { ... } fn try_replace_all_with_bytes<F>( &self, haystack: &[u8], dst: &mut Vec<u8>, replace_with: F, ) -> Result<(), MatchError> where Self: Sized, F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool { ... } fn try_stream_find_iter<'a, R: Read>( &'a self, rdr: R, ) -> Result<StreamFindIter<'a, Self, R>, MatchError> where Self: Sized { ... } fn try_stream_replace_all<R, W, B>( &self, rdr: R, wtr: W, replace_with: &[B], ) -> Result<()> where Self: Sized, R: Read, W: Write, B: AsRef<[u8]> { ... } fn try_stream_replace_all_with<R, W, F>( &self, rdr: R, wtr: W, replace_with: F, ) -> Result<()> where Self: Sized, R: Read, W: Write, F: FnMut(&Match, &[u8], &mut W) -> Result<()> { ... }
}
Expand description

A trait that abstracts over Aho-Corasick automata.

This trait primarily exists for niche use cases such as:

  • Using an NFA or DFA directly, bypassing the top-level AhoCorasick searcher. Currently, these include noncontiguous::NFA, contiguous::NFA and dfa::DFA.
  • Implementing your own custom search routine by walking the automaton yourself. This might be useful for implementing search on non-contiguous strings or streams.

For most use cases, it is not expected that users will need to use or even know about this trait. Indeed, the top level AhoCorasick searcher does not expose any details about this trait, nor does it implement it itself.

Note that this trait defines a number of default methods, such as Automaton::try_find and Automaton::try_find_iter, which implement higher level search routines in terms of the lower level automata API.

§Sealed

Currently, this trait is sealed. That means users of this crate can write generic routines over this trait but cannot implement it themselves. This restriction may be lifted in the future, but sealing the trait permits adding new required methods in a backwards compatible fashion.

§Special states

This trait encodes a notion of “special” states in an automaton. Namely, a state is treated as special if it is a dead, match or start state:

  • A dead state is a state that cannot be left once entered. All transitions on a dead state lead back to itself. The dead state is meant to be treated as a sentinel indicating that the search should stop and return a match if one has been found, and nothing otherwise.
  • A match state is a state that indicates one or more patterns have matched. Depending on the MatchKind of the automaton, a search may stop once a match is seen, or it may continue looking for matches until it enters a dead state or sees the end of the haystack.
  • A start state is a state that a search begins in. It is useful to know when a search enters a start state because it may mean that a prefilter can be used to skip ahead and quickly look for candidate matches. Unlike dead and match states, it is never necessary to explicitly handle start states for correctness. Indeed, in this crate, implementations of Automaton will only treat start states as “special” when a prefilter is enabled and active. Otherwise, treating it as special has no purpose and winds up slowing down the overall search because it results in ping-ponging between the main state transition and the “special” state logic.

Since checking whether a state is special by doing three different checks would be too expensive inside a fast search loop, the Automaton::is_special method is provided for quickly checking whether the state is special. The Automaton::is_dead, Automaton::is_match and Automaton::is_start predicates can then be used to determine which kind of special state it is.

§Panics

Most of the APIs on this trait should panic or give incorrect results if invalid inputs are given to it. For example, Automaton::next_state has unspecified behavior if the state ID given to it is not a valid state ID for the underlying automaton. Valid state IDs can only be retrieved in one of two ways: calling Automaton::start_state or calling Automaton::next_state with a valid state ID.

§Safety

This trait is not safe to implement so that code may rely on the correctness of implementations of this trait to avoid undefined behavior. The primary correctness guarantees are:

  • Automaton::start_state always returns a valid state ID or an error or panics.
  • Automaton::next_state, when given a valid state ID, always returns a valid state ID for all values of anchored and byte, or otherwise panics.

In general, the rest of the methods on Automaton need to uphold their contracts as well. For example, Automaton::is_dead should only returns true if the given state ID is actually a dead state.

Note that currently this crate does not rely on the safety property defined here to avoid undefined behavior. Instead, this was done to make it possible to do in the future.

§Example

This example shows how one might implement a basic but correct search routine. We keep things simple by not using prefilters or worrying about anchored searches, but do make sure our search is correct for all possible MatchKind semantics. (The comments in the code below note the parts that are needed to support certain MatchKind semantics.)

use aho_corasick::{
    automaton::Automaton,
    nfa::noncontiguous::NFA,
    Anchored, Match, MatchError, MatchKind,
};

// Run an unanchored search for 'aut' in 'haystack'. Return the first match
// seen according to the automaton's match semantics. This returns an error
// if the given automaton does not support unanchored searches.
fn find<A: Automaton>(
    aut: A,
    haystack: &[u8],
) -> Result<Option<Match>, MatchError> {
    let mut sid = aut.start_state(Anchored::No)?;
    let mut at = 0;
    let mut mat = None;
    let get_match = |sid, at| {
        let pid = aut.match_pattern(sid, 0);
        let len = aut.pattern_len(pid);
        Match::new(pid, (at - len)..at)
    };
    // Start states can be match states!
    if aut.is_match(sid) {
        mat = Some(get_match(sid, at));
        // Standard semantics require matches to be reported as soon as
        // they're seen. Otherwise, we continue until we see a dead state
        // or the end of the haystack.
        if matches!(aut.match_kind(), MatchKind::Standard) {
            return Ok(mat);
        }
    }
    while at < haystack.len() {
        sid = aut.next_state(Anchored::No, sid, haystack[at]);
        if aut.is_special(sid) {
            if aut.is_dead(sid) {
                return Ok(mat);
            } else if aut.is_match(sid) {
                mat = Some(get_match(sid, at + 1));
                // As above, standard semantics require that we return
                // immediately once a match is found.
                if matches!(aut.match_kind(), MatchKind::Standard) {
                    return Ok(mat);
                }
            }
        }
        at += 1;
    }
    Ok(mat)
}

// Show that it works for standard searches.
let nfa = NFA::new(&["samwise", "sam"]).unwrap();
assert_eq!(Some(Match::must(1, 0..3)), find(&nfa, b"samwise")?);

// But also works when using leftmost-first. Notice how the match result
// has changed!
let nfa = NFA::builder()
    .match_kind(MatchKind::LeftmostFirst)
    .build(&["samwise", "sam"])
    .unwrap();
assert_eq!(Some(Match::must(0, 0..7)), find(&nfa, b"samwise")?);

Required Methods§

Source

fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>

Returns the starting state for the given anchor mode.

Upon success, the state ID returned is guaranteed to be valid for this automaton.

§Errors

This returns an error when the given search configuration is not supported by the underlying automaton. For example, if the underlying automaton only supports unanchored searches but the given configuration was set to an anchored search, then this must return an error.

Source

fn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID

Performs a state transition from sid for byte and returns the next state.

anchored should be Anchored::Yes when executing an anchored search and Anchored::No otherwise. For some implementations of Automaton, it is required to know whether the search is anchored or not in order to avoid following failure transitions. Other implementations may ignore anchored altogether and depend on Automaton::start_state returning a state that walks a different path through the automaton depending on whether the search is anchored or not.

§Panics

This routine may panic or return incorrect results when the given state ID is invalid. A state ID is valid if and only if:

  1. It came from a call to Automaton::start_state, or
  2. It came from a previous call to Automaton::next_state with a valid state ID.

Implementations must treat all possible values of byte as valid.

Implementations may panic on unsupported values of anchored, but are not required to do so.

Source

fn is_special(&self, sid: StateID) -> bool

Returns true if the given ID represents a “special” state. A special state is a dead, match or start state.

Note that implementations may choose to return false when the given ID corresponds to a start state. Namely, it always correct to treat start states as non-special. Implementations must return true for states that are dead or contain matches.

This has unspecified behavior when given an invalid state ID.

Source

fn is_dead(&self, sid: StateID) -> bool

Returns true if the given ID represents a dead state.

A dead state is a type of “sink” in a finite state machine. It corresponds to a state whose transitions all loop back to itself. That is, once entered, it can never be left. In practice, it serves as a sentinel indicating that the search should terminate.

This has unspecified behavior when given an invalid state ID.

Source

fn is_match(&self, sid: StateID) -> bool

Returns true if the given ID represents a match state.

A match state is always associated with one or more pattern IDs that matched at the position in the haystack when the match state was entered. When a match state is entered, the match semantics dictate whether it should be returned immediately (for MatchKind::Standard) or if the search should continue (for MatchKind::LeftmostFirst and MatchKind::LeftmostLongest) until a dead state is seen or the end of the haystack has been reached.

This has unspecified behavior when given an invalid state ID.

Source

fn is_start(&self, sid: StateID) -> bool

Returns true if the given ID represents a start state.

While it is never incorrect to ignore start states during a search (except for the start of the search of course), knowing whether one has entered a start state can be useful for certain classes of performance optimizations. For example, if one is in a start state, it may be legal to try to skip ahead and look for match candidates more quickly than would otherwise be accomplished by walking the automaton.

Implementations of Automaton in this crate “unspecialize” start states when a prefilter is not active or enabled. In this case, it is possible for Automaton::is_special(sid) to return false while Automaton::is_start(sid) returns true.

This has unspecified behavior when given an invalid state ID.

Source

fn match_kind(&self) -> MatchKind

Returns the match semantics that this automaton was built with.

Source

fn match_len(&self, sid: StateID) -> usize

Returns the total number of matches for the given state ID.

This has unspecified behavior if the given ID does not refer to a match state.

Source

fn match_pattern(&self, sid: StateID, index: usize) -> PatternID

Returns the pattern ID for the match state given by sid at the index given.

Typically, index is only ever greater than 0 when implementing an overlapping search. Otherwise, it’s likely that your search only cares about reporting the first pattern ID in a match state.

This has unspecified behavior if the given ID does not refer to a match state, or if the index is greater than or equal to the total number of matches in this match state.

Source

fn patterns_len(&self) -> usize

Returns the total number of patterns compiled into this automaton.

Source

fn pattern_len(&self, pid: PatternID) -> usize

Returns the length of the pattern for the given ID.

This has unspecified behavior when given an invalid pattern ID. A pattern ID is valid if and only if it is less than Automaton::patterns_len.

Source

fn min_pattern_len(&self) -> usize

Returns the length, in bytes, of the shortest pattern in this automaton.

Source

fn max_pattern_len(&self) -> usize

Returns the length, in bytes, of the longest pattern in this automaton.

Source

fn memory_usage(&self) -> usize

Returns the heap memory usage, in bytes, used by this automaton.

Source

fn prefilter(&self) -> Option<&Prefilter>

Returns a prefilter, if available, that can be used to accelerate searches for this automaton.

The typical way this is used is when the start state is entered during a search. When that happens, one can use a prefilter to skip ahead and look for candidate matches without having to walk the automaton on the bytes between candidates.

Typically a prefilter is only available when there are a small (<100) number of patterns built into the automaton.

Provided Methods§

Source

fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>

Executes a non-overlapping search with this automaton using the given configuration.

See AhoCorasick::try_find for more documentation and examples.

Source

fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError>

Executes a overlapping search with this automaton using the given configuration.

See AhoCorasick::try_find_overlapping for more documentation and examples.

Source

fn try_find_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result<FindIter<'a, 'h, Self>, MatchError>
where Self: Sized,

Returns an iterator of non-overlapping matches with this automaton using the given configuration.

See AhoCorasick::try_find_iter for more documentation and examples.

Source

fn try_find_overlapping_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>
where Self: Sized,

Returns an iterator of overlapping matches with this automaton using the given configuration.

See AhoCorasick::try_find_overlapping_iter for more documentation and examples.

Source

fn try_replace_all<B>( &self, haystack: &str, replace_with: &[B], ) -> Result<String, MatchError>
where Self: Sized, B: AsRef<str>,

Replaces all non-overlapping matches in haystack with strings from replace_with depending on the pattern that matched. The replace_with slice must have length equal to Automaton::patterns_len.

See AhoCorasick::try_replace_all for more documentation and examples.

Source

fn try_replace_all_bytes<B>( &self, haystack: &[u8], replace_with: &[B], ) -> Result<Vec<u8>, MatchError>
where Self: Sized, B: AsRef<[u8]>,

Replaces all non-overlapping matches in haystack with strings from replace_with depending on the pattern that matched. The replace_with slice must have length equal to Automaton::patterns_len.

See AhoCorasick::try_replace_all_bytes for more documentation and examples.

Source

fn try_replace_all_with<F>( &self, haystack: &str, dst: &mut String, replace_with: F, ) -> Result<(), MatchError>
where Self: Sized, F: FnMut(&Match, &str, &mut String) -> bool,

Replaces all non-overlapping matches in haystack by calling the replace_with closure given.

See AhoCorasick::try_replace_all_with for more documentation and examples.

Source

fn try_replace_all_with_bytes<F>( &self, haystack: &[u8], dst: &mut Vec<u8>, replace_with: F, ) -> Result<(), MatchError>
where Self: Sized, F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool,

Replaces all non-overlapping matches in haystack by calling the replace_with closure given.

See AhoCorasick::try_replace_all_with_bytes for more documentation and examples.

Source

fn try_stream_find_iter<'a, R: Read>( &'a self, rdr: R, ) -> Result<StreamFindIter<'a, Self, R>, MatchError>
where Self: Sized,

Returns an iterator of non-overlapping matches with this automaton from the stream given.

See AhoCorasick::try_stream_find_iter for more documentation and examples.

Source

fn try_stream_replace_all<R, W, B>( &self, rdr: R, wtr: W, replace_with: &[B], ) -> Result<()>
where Self: Sized, R: Read, W: Write, B: AsRef<[u8]>,

Replaces all non-overlapping matches in rdr with strings from replace_with depending on the pattern that matched, and writes the result to wtr. The replace_with slice must have length equal to Automaton::patterns_len.

See AhoCorasick::try_stream_replace_all for more documentation and examples.

Source

fn try_stream_replace_all_with<R, W, F>( &self, rdr: R, wtr: W, replace_with: F, ) -> Result<()>
where Self: Sized, R: Read, W: Write, F: FnMut(&Match, &[u8], &mut W) -> Result<()>,

Replaces all non-overlapping matches in rdr by calling the replace_with closure given and writing the result to wtr.

See AhoCorasick::try_stream_replace_all_with for more documentation and examples.

Implementations on Foreign Types§

Source§

impl<'a, A: Automaton + ?Sized> Automaton for &'a A

Source§

fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>

Source§

fn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID

Source§

fn is_special(&self, sid: StateID) -> bool

Source§

fn is_dead(&self, sid: StateID) -> bool

Source§

fn is_match(&self, sid: StateID) -> bool

Source§

fn is_start(&self, sid: StateID) -> bool

Source§

fn match_kind(&self) -> MatchKind

Source§

fn match_len(&self, sid: StateID) -> usize

Source§

fn match_pattern(&self, sid: StateID, index: usize) -> PatternID

Source§

fn patterns_len(&self) -> usize

Source§

fn pattern_len(&self, pid: PatternID) -> usize

Source§

fn min_pattern_len(&self) -> usize

Source§

fn max_pattern_len(&self) -> usize

Source§

fn memory_usage(&self) -> usize

Source§

fn prefilter(&self) -> Option<&Prefilter>

Implementors§

Source§

impl Automaton for DFA

Source§

impl Automaton for aho_corasick::nfa::contiguous::NFA

Source§

impl Automaton for aho_corasick::nfa::noncontiguous::NFA