ac_locate() searches a character vector with a compiled automaton and returns one list element per document. Character offsets are 1-based and inclusive, so they can be used directly with substr().

Usage

ac_locate(ac, doc, overlapping = FALSE, na = c("keep", "empty", "error"))

Arguments

ac

An <ac_automaton> object created by ac_build().

doc

A character vector of documents to search.

overlapping

If TRUE, report overlapping matches. Defaults to FALSE. Overlapping search is only supported when ac was built with match_kind = "standard".

na

How to handle NA documents. "keep" (the default) returns one row in which pattern_id, start, and end are all NA; "empty" treats an NA document as having no matches; "error" stops with an error.

Value

A list with the same length as doc. Each element is a data frame with one row per match and three columns:

  • pattern_id: Index of the matched pattern in ac_patterns(ac).

  • start: 1-based index of the first character in each match.

  • end: 1-based index of the last character in each match.
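Because start and end are 1-based and inclusive, the matched text can be recovered with base R alone. The sketch below does not call ac_locate(); it assumes a hand-built data frame with the three columns described above, using the values shown for "hello world" in the Examples output.

```r
# Hand-built stand-in for one element of the list returned by ac_locate();
# the values are copied from the "hello world" rows in the Examples output.
doc <- "hello world"
hits <- data.frame(
  pattern_id = c(1L, 2L),
  start = c(1L, 7L),
  end = c(5L, 11L)
)

# substring() recycles `doc` across every start/end pair, whereas substr()
# returns only as many results as there are input strings.
matched <- substring(doc, hits$start, hits$end)
print(matched)
#> [1] "hello" "world"
```

To use substr() instead, repeat the document once per match, for example substr(rep(doc, nrow(hits)), hits$start, hits$end).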

Examples

if (
  requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("tibble", quietly = TRUE) &&
    requireNamespace("tidyr", quietly = TRUE)
) {
  ac <- ac_build(c("hello", "world"))
  tibble::tibble(doc = c("hello world", "nothing", "world")) |>
    dplyr::mutate(hits = ac_locate(ac, doc)) |>
    tidyr::unnest(hits)
}
#> # A tibble: 3 × 4
#>   doc         pattern_id start   end
#>   <chr>            <int> <int> <int>
#> 1 hello world          1     1     5
#> 2 hello world          2     7    11
#> 3 world                2     1     5
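The same flattening can be sketched in base R when dplyr, tibble, and tidyr are not available. The list below is a hand-built stand-in for the result of ac_locate(ac, docs), not real output; in particular, representing the unmatched document as a zero-row data frame is an assumption, chosen to be consistent with "nothing" dropping out of the unnested result above.

```r
docs <- c("hello world", "nothing", "world")

# Hypothetical stand-in for ac_locate(ac, docs): one data frame per document.
# The zero-row element for "nothing" is an assumption about the no-match case.
hits <- list(
  data.frame(pattern_id = c(1L, 2L), start = c(1L, 7L), end = c(5L, 11L)),
  data.frame(pattern_id = integer(), start = integer(), end = integer()),
  data.frame(pattern_id = 2L, start = 1L, end = 5L)
)

# Number of matches per document; used to repeat each document once per match.
n_hits <- vapply(hits, nrow, integer(1))

# Stack the per-document data frames and attach the originating document.
flat <- data.frame(
  doc = rep(docs, times = n_hits),
  do.call(rbind, hits),
  stringsAsFactors = FALSE
)
print(flat)
```

Documents with zero matches contribute no rows here, just as they do with tidyr::unnest() in its default configuration.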