Locate pattern matches in strings

ac_locate() searches each element of a character vector with a compiled automaton and returns a list with one data frame of match locations per document. Offsets count characters, are 1-based, and are inclusive at both ends, so they can be passed directly to substr().
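As a small illustration of the offset convention, the sketch below pulls matched text out of a document with substr(). The offsets are hardcoded copies of the values shown for "hello world" in the Examples output, not computed by ac_locate(), so the snippet runs in base R without the package.

```r
# Offsets copied by hand from the Examples output for "hello world";
# this data frame is a stand-in, not an actual ac_locate() result.
doc <- "hello world"
hits <- data.frame(
  pattern_id = c(1L, 2L),
  start = c(1L, 7L),
  end = c(5L, 11L)
)

# substr() returns one value per element of its first argument,
# so repeat the document once per match before extracting.
matched <- substr(rep(doc, nrow(hits)), hits$start, hits$end)
matched
#> [1] "hello" "world"
```

Because both ends are inclusive, no +1 or -1 adjustment is needed between the start and end columns and the arguments of substr().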

Usage

ac_locate(ac, doc, ..., overlapping = FALSE, na = c("keep", "empty", "error"))

Arguments

ac: An <ac_automaton> object created by ac_build().
doc: A character vector of documents to search.
...: Must be empty. This is used to require optional arguments to be supplied by name.
overlapping: If TRUE, also report matches that overlap one another. Defaults to FALSE. Only supported when ac was built with match_kind = "standard".
na: How to handle NA documents. "keep" (the default) returns one row with missing pattern_id, start, and end values; "empty" treats missing documents as having no matches; "error" fails.

Value

A list with the same length as doc. Each element is a data frame with one row per match and three columns:

pattern_id: Index of the matched pattern in ac_patterns(ac).
start: 1-based index of the first character in each match.
end: 1-based index of the last character in each match.
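When the tidyverse packages are not available, a result of this shape can be flattened with base R. The sketch below uses a hand-built stand-in for the return value, with values copied from the Examples output. It assumes that a document without matches yields a zero-row data frame with the same three columns; the doc_id column and the tagged and flat names are introduced here for illustration only.

```r
# Hand-built stand-in for ac_locate(ac, c("hello world", "nothing", "world")).
# Assumption: a document with no matches is a zero-row data frame.
hits <- list(
  data.frame(pattern_id = c(1L, 2L), start = c(1L, 7L), end = c(5L, 11L)),
  data.frame(pattern_id = integer(0), start = integer(0), end = integer(0)),
  data.frame(pattern_id = 2L, start = 1L, end = 5L)
)

# Tag every data frame with the index of its document (hypothetical doc_id
# column), then stack them into one data frame with one row per match.
tagged <- Map(
  function(h, i) data.frame(doc_id = rep(i, nrow(h)), h),
  hits,
  seq_along(hits)
)
flat <- do.call(rbind, tagged)
```

Documents without matches contribute no rows to flat, so doc_id is the only link back to the position of each document in doc.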

Examples

if (
  requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("tibble", quietly = TRUE) &&
    requireNamespace("tidyr", quietly = TRUE)
) {
  ac <- ac_build(c("hello", "world"))
  tibble::tibble(doc = c("hello world", "nothing", "world")) |>
    dplyr::mutate(hits = ac_locate(ac, doc)) |>
    tidyr::unnest(hits)
}
#> # A tibble: 3 × 4
#>   doc         pattern_id start   end
#>   <chr>            <int> <int> <int>
#> 1 hello world          1     1     5
#> 2 hello world          2     7    11
#> 3 world                2     1     5
