ac_locate_bytes() searches each element of a character vector with a compiled automaton and returns the byte offsets of every match, as reported by the Rust aho-corasick crate. Byte offsets are 0-based and byte_end is exclusive, so a match covers bytes byte_start through byte_end - 1.

Usage

ac_locate_bytes(ac, doc, overlapping = FALSE, na = c("omit", "keep", "error"))

Arguments

ac

An <ac_automaton> object created by ac_build().

doc

A character vector of documents to search.

overlapping

Whether to report overlapping matches. Defaults to FALSE. TRUE is only supported when ac was built with match_kind = "standard".

na

How to handle NA documents. "omit" drops missing documents (default); "keep" returns one row with missing result columns for each missing document; "error" fails.

Value

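A data frame with one row per match and four columns: doc_id, pattern_id, byte_start, and byte_end. As the example output shows, doc_id and pattern_id are 1-based positions in doc and in the pattern vector, while byte_start and byte_end are 0-based byte offsets.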

Examples

ac <- ac_build(c("hello", "world"))
doc <- c("hello world", "nothing", "world hello")
ac_locate_bytes(ac, doc)
#>   doc_id pattern_id byte_start byte_end
#> 1      1          1          0        5
#> 2      1          2          6       11
#> 3      3          2          0        5
#> 4      3          1          6       11
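
Because the offsets count bytes from 0 and exclude the end, they do not line up with R's 1-based, character-oriented substr(). The sketch below shows one way to recover the matched text from a byte_start/byte_end pair using base R only. slice_bytes() is a hypothetical helper, not part of this package, and it assumes the document is UTF-8 encoded.

```r
# Hypothetical helper (not part of the package): return the bytes in
# [byte_start, byte_end) of a single string, given 0-based,
# end-exclusive offsets. Assumes `x` is UTF-8 encoded.
slice_bytes <- function(x, byte_start, byte_end) {
  bytes <- charToRaw(x)
  # Shift to R's 1-based indexing; seq_len() keeps empty matches safe.
  idx <- byte_start + seq_len(byte_end - byte_start)
  out <- rawToChar(bytes[idx])
  Encoding(out) <- "UTF-8"
  out
}

# Offsets taken from the first document in the example above.
slice_bytes("hello world", 6, 11)
#> [1] "world"

# With a multibyte character, byte and character positions diverge:
# "é" occupies two bytes, so "world" starts at byte 7 (0-based).
x <- "h\u00e9llo world"
slice_bytes(x, 7, 12)
#> [1] "world"
substr(x, 7 + 1, 12)  # character-based indexing is off by one here
#> [1] "orld"
```

Working on the raw bytes avoids any dependence on how many characters precede the match, which is the situation where byte offsets and character positions disagree.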