regex_automata::util::prefilter

Struct Prefilter

Source
pub struct Prefilter { /* private fields */ }
Expand description

A prefilter for accelerating regex searches.

If you already have your literals that you want to search with, then the vanilla Prefilter::new constructor is for you. But if you have an Hir value from the regex-syntax crate, then Prefilter::from_hir_prefix might be more convenient. Namely, it uses the regex-syntax::hir::literal module to extract literal prefixes for you, optimize them and then select and build a prefilter matcher.

A prefilter must have zero false negatives. However, by its very nature, it may produce false positives. That is, a prefilter will never skip over a position in the haystack that corresponds to a match of the original regex pattern, but it may produce a match for a position in the haystack that does not correspond to a match of the original regex pattern. If you use either the Prefilter::from_hir_prefix or Prefilter::from_hirs_prefix constructors, then this guarantee is upheld for you automatically. This guarantee is not preserved if you use Prefilter::new though, since it is up to the caller to provide correct literal strings with respect to the original regex pattern.

§Cloning

It is an API guarantee that cloning a prefilter is cheap. That is, cloning it will not duplicate whatever heap memory is used to represent the underlying matcher.

§Example

This example shows how to attach a Prefilter to the PikeVM in order to accelerate searches.

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    util::prefilter::Prefilter,
    Match, MatchKind,
};

let pre = Prefilter::new(MatchKind::LeftmostFirst, &["Bruce "])
    .expect("a prefilter");
let re = PikeVM::builder()
    .configure(PikeVM::config().prefilter(Some(pre)))
    .build(r"Bruce \w+")?;
let mut cache = re.create_cache();
assert_eq!(
    Some(Match::must(0, 6..23)),
    re.find(&mut cache, "Hello Bruce Springsteen!"),
);

But note that if you get your prefilter incorrect, it could lead to an incorrect result!

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    util::prefilter::Prefilter,
    Match, MatchKind,
};

// This prefilter is wrong!
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["Patti "])
    .expect("a prefilter");
let re = PikeVM::builder()
    .configure(PikeVM::config().prefilter(Some(pre)))
    .build(r"Bruce \w+")?;
let mut cache = re.create_cache();
// We find no match even though the regex does match.
assert_eq!(
    None,
    re.find(&mut cache, "Hello Bruce Springsteen!"),
);

Implementations§

Source§

impl Prefilter

Source

pub fn new<B: AsRef<[u8]>>(kind: MatchKind, needles: &[B]) -> Option<Prefilter>

Create a new prefilter from a sequence of needles and a corresponding match semantics.

This may return None for a variety of reasons, for example, if a suitable prefilter could not be constructed. That might occur if they are unavailable (e.g., the perf-literal-substring and perf-literal-multisubstring features aren’t enabled), or it might occur because of heuristics or other artifacts of how the prefilter works.

Note that if you have an Hir expression, it may be more convenient to use Prefilter::from_hir_prefix. It will automatically handle the task of extracting prefix literals for you.

§Example

This example shows how match semantics can impact the matching algorithm used by the prefilter. For this reason, it is important to ensure that the match semantics given here are consistent with the match semantics intended for the regular expression that the literals were extracted from.

use regex_automata::{
    util::{prefilter::Prefilter, syntax},
    MatchKind, Span,
};

let hay = "Hello samwise";

// With leftmost-first, we find 'samwise' here because it comes
// before 'sam' in the sequence we give it..
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["samwise", "sam"])
    .expect("a prefilter");
assert_eq!(
    Some(Span::from(6..13)),
    pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
// Still with leftmost-first but with the literals reverse, now 'sam'
// will match instead!
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["sam", "samwise"])
    .expect("a prefilter");
assert_eq!(
    Some(Span::from(6..9)),
    pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
Source

pub fn from_hir_prefix(kind: MatchKind, hir: &Hir) -> Option<Prefilter>

This attempts to extract prefixes from the given Hir expression for the given match semantics, and if possible, builds a prefilter for them.

§Example

This example shows how to build a prefilter directly from an Hir expression, and use to find an occurrence of a prefix from the regex pattern.

use regex_automata::{
    util::{prefilter::Prefilter, syntax},
    MatchKind, Span,
};

let hir = syntax::parse(r"(Bruce|Patti) \w+")?;
let pre = Prefilter::from_hir_prefix(MatchKind::LeftmostFirst, &hir)
    .expect("a prefilter");
let hay = "Hello Patti Scialfa!";
assert_eq!(
    Some(Span::from(6..12)),
    pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
Source

pub fn from_hirs_prefix<H: Borrow<Hir>>( kind: MatchKind, hirs: &[H], ) -> Option<Prefilter>

This attempts to extract prefixes from the given Hir expressions for the given match semantics, and if possible, builds a prefilter for them.

Note that as of now, prefilters throw away information about which pattern each literal comes from. In other words, when a prefilter finds a match, there’s no way to know which pattern (or patterns) it came from. Therefore, in order to confirm a match, you’ll have to check all of the patterns by running the full regex engine.

§Example

This example shows how to build a prefilter directly from multiple Hir expressions expression, and use it to find an occurrence of a prefix from the regex patterns.

use regex_automata::{
    util::{prefilter::Prefilter, syntax},
    MatchKind, Span,
};

let hirs = syntax::parse_many(&[
    r"(Bruce|Patti) \w+",
    r"Mrs?\. Doubtfire",
])?;
let pre = Prefilter::from_hirs_prefix(MatchKind::LeftmostFirst, &hirs)
    .expect("a prefilter");
let hay = "Hello Mrs. Doubtfire";
assert_eq!(
    Some(Span::from(6..20)),
    pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
Source

pub fn find(&self, haystack: &[u8], span: Span) -> Option<Span>

Run this prefilter on haystack[span.start..end] and return a matching span if one exists.

The span returned is guaranteed to have a start position greater than or equal to the one given, and an end position less than or equal to the one given.

§Example

This example shows how to build a prefilter directly from an Hir expression, and use it to find an occurrence of a prefix from the regex pattern.

use regex_automata::{
    util::{prefilter::Prefilter, syntax},
    MatchKind, Span,
};

let hir = syntax::parse(r"Bruce \w+")?;
let pre = Prefilter::from_hir_prefix(MatchKind::LeftmostFirst, &hir)
    .expect("a prefilter");
let hay = "Hello Bruce Springsteen!";
assert_eq!(
    Some(Span::from(6..12)),
    pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
Source

pub fn prefix(&self, haystack: &[u8], span: Span) -> Option<Span>

Returns the span of a prefix of haystack[span.start..span.end] if the prefilter matches.

The span returned is guaranteed to have a start position equivalent to the one given, and an end position less than or equal to the one given.

§Example

This example shows how to build a prefilter directly from an Hir expression, and use it to find an occurrence of a prefix from the regex pattern that begins at the start of a haystack only.

use regex_automata::{
    util::{prefilter::Prefilter, syntax},
    MatchKind, Span,
};

let hir = syntax::parse(r"Bruce \w+")?;
let pre = Prefilter::from_hir_prefix(MatchKind::LeftmostFirst, &hir)
    .expect("a prefilter");
let hay = "Hello Bruce Springsteen!";
// Nothing is found here because 'Bruce' does
// not occur at the beginning of our search.
assert_eq!(
    None,
    pre.prefix(hay.as_bytes(), Span::from(0..hay.len())),
);
// But if we change where we start the search
// to begin where 'Bruce ' begins, then a
// match will be found.
assert_eq!(
    Some(Span::from(6..12)),
    pre.prefix(hay.as_bytes(), Span::from(6..hay.len())),
);
Source

pub fn memory_usage(&self) -> usize

Returns the heap memory, in bytes, used by the underlying prefilter.

Source

pub fn max_needle_len(&self) -> usize

Return the length of the longest needle in this Prefilter

Source

pub fn is_fast(&self) -> bool

Implementations might return true here if they believe themselves to be “fast.” The concept of “fast” is deliberately left vague, but in practice this usually corresponds to whether it’s believed that SIMD will be used.

Why do we care about this? Well, some prefilter tricks tend to come with their own bits of overhead, and so might only make sense if we know that a scan will be much faster than the regex engine itself. Otherwise, the trick may not be worth doing. Whether something is “much” faster than the regex engine generally boils down to whether SIMD is used. (But not always. Even a SIMD matcher with a high false positive rate can become quite slow.)

Even if this returns true, it is still possible for the prefilter to be “slow.” Remember, prefilters are just heuristics. We can’t really know a prefilter will be fast without actually trying the prefilter. (Which of course we cannot afford to do.)

Trait Implementations§

Source§

impl Clone for Prefilter

Source§

fn clone(&self) -> Prefilter

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for Prefilter

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dst: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.