Fast multi-pattern string matching in R with the Aho-Corasick algorithm.
ahocorasick, powered by the Rust aho-corasick crate, allows you to search for many substrings in one or more strings with linear complexity in R. It builds a reusable automaton once and then uses it for detection, counting, locating, and extraction and replacement.
Aho-Corasick is a multiple-pattern string matching algorithm designed to search for many keywords in a text in a single pass. After preprocessing the pattern set into an automaton, it finds all matches in O(text_length + number_of_matches), instead of the naive O(text_length × number_of_patterns) approach. This makes it especially suitable for high-throughput tasks such as dictionary matching, keyword extraction, rule-based text filtering, and large-scale log or document analysis.
Installation
Install the released version from CRAN:
install.packages("ahocorasick")Install the released version from R-universe/R-multiverse:
install.packages("ahocorasick", repos = "https://yousa-mirage.r-universe.dev")
# or
install.packages("ahocorasick", repos = "https://community.r-multiverse.org")Install the development version from source. You need Cargo and rustc on PATH.
# install.packages("pak")
pak::pak("Yousa-Mirage/r-ahocorasick")Quick Start
Start with a character vector of patterns you want to search for, then compile it into an <ac_automaton> object by calling ac_build() that can be reused across many documents.
library(ahocorasick)
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)You can set ascii_case_insensitive = TRUE to make the search case-insensitive for ASCII characters (a-z and A-Z) only.
Extracting Matches
ac_extract() returns one data frame per document, with both the matched text and the pattern value that produced each match.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_extract(ac, doc)
hits[[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#> 3 hello helloUse ac_extract_df() when you want one data frame for all documents.
ac_extract_df(ac, doc)
#> doc_id matches patterns
#> 1 1 hello hello
#> 2 1 world world
#> 3 1 hello hello
#> 4 3 fish fishSet overlapping = TRUE to return overlapping matches. This is only supported when the automaton was built with the default match_kind = "standard".
ac_overlap <- ac_build(c("aba", "bab"))
ac_extract_df(ac_overlap, "ababa", overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 aba aba
#> 2 1 bab bab
#> 3 1 aba abaLocating Matches
ac_locate() returns one data frame per document. Offsets are 1-based character positions, inclusive on both sides, so they can be used with substr().
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_locate(ac, doc)
hits[[1]]
#> pattern_id start end
#> 1 1 18 22
#> 2 2 24 28
#> 3 1 31 35Use ac_locate_df() when you want one data frame for all documents.
ac_locate_df(ac, doc)
#> doc_id pattern_id start end
#> 1 1 1 18 22
#> 2 1 2 24 28
#> 3 1 1 31 35
#> 4 3 3 1 4Use ac_locate_bytes() when byte offsets are more useful than R character positions. Byte offsets are 0-based and byte_end is end-exclusive.
Detecting And Counting
Use ac_detect() when you only need to know whether a document contains at least one match, and ac_count() when you need the number of matches.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
ac_detect(ac, doc)
#> [1] TRUE FALSE TRUE
ac_count(ac, doc)
#> [1] 3 0 1These functions are usually the cheapest way to answer yes/no or frequency questions without materializing the matched strings or offsets.
Replacing Matches
Use ac_replace() to replace every non-overlapping match with the replacement for the matched pattern.
patterns <- c("fox", "brown", "quick")
doc <- "The quick brown fox."
ac <- ac_build(patterns)
ac_replace(ac, doc, c("sloth", "grey", "slow"))
#> [1] "The slow grey sloth."Use one replacement string to replace all patterns with the same value.
ac_replace(ac, doc, "")
#> [1] "The ."Replacement uses the automaton’s match semantics. If you want Polars replace_many(..., leftmost = TRUE)-style priority, build the automaton with match_kind = "leftmost_first".
ac_leftmost <- ac_build(
c("append", "appendage", "app"),
match_kind = "leftmost_first"
)
ac_replace(ac_leftmost, "append the app to the appendage", c("x", "y", "z"))
#> [1] "x the z to the xage"Searching Files
The ac_*_file() family applies the same automaton to one or more files. Pass a vector of file paths and each function returns results aligned with those files.
ac_files <- ac_build(c("hello", "world"))
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("hello world", paths[[1]])
writeLines("fish and chips", paths[[2]])
ac_detect_file(ac_files, paths)
#> [1] TRUE FALSE
ac_count_file(ac_files, paths)
#> [1] 2 0
ac_extract_file(ac_files, paths)
#> [[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#>
#> [[2]]
#> [1] matches patterns
#> <0 rows> (or 0-length row.names)ac_detect_file(), ac_count_file(), ac_extract_file(), and ac_replace_file() support stream = TRUE when the automaton was built with the default match_kind = "standard". This is useful for large files when you want to avoid reading the whole file into memory at once.
ac_locate_file() does not support stream = TRUE. It returns R character offsets instead of raw byte offsets, and converting streamed byte positions back into character positions would require another pass over the file. Keeping file location search non-streaming makes the behavior simpler and easier to reason about.
Using With Tidyverse
These search functions work naturally inside dplyr::mutate(). Scalar outputs like ac_detect(), ac_count(), and ac_replace() become ordinary columns, while ac_extract() and ac_locate() can be stored as list-columns.
patterns <- c("hello", "world", "fish")
ac <- ac_build(patterns)
docs <- tibble::tibble(
doc = c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
)
docs |>
dplyr::mutate(
detected = ac_detect(ac, doc),
n_matches = ac_count(ac, doc),
replaced = ac_replace(ac, doc, "[match]"),
extracted = ac_extract(ac, doc)
) |>
tidyr::unnest(extracted)
#> # A tibble: 4 × 6
#> doc detected n_matches replaced matches patterns
#> <chr> <lgl> <int> <chr> <chr> <chr>
#> 1 this is my first hello world. he… TRUE 3 this is… hello hello
#> 2 this is my first hello world. he… TRUE 3 this is… world world
#> 3 this is my first hello world. he… TRUE 3 this is… hello hello
#> 4 fish and chips TRUE 1 [match]… fish fishMissing Values
Search functions let you choose how NA documents behave.
doc_na <- c("hello", NA_character_, "world")
ac_na <- ac_build(c("hello", "world"))
ac_detect(ac_na, doc_na, na = "false")
#> [1] TRUE FALSE TRUE
ac_count(ac_na, doc_na, na = "zero")
#> [1] 1 0 1
ac_replace(ac_na, doc_na, "[match]", na = "empty")
#> [1] "[match]" "" "[match]"For list-column workflows, ac_locate(..., na = "empty") and ac_extract(..., na = "empty") treat missing documents as no matches.
Matching Semantics
ac_build() exposes three match kinds from the Rust library.
standard (the default)
The default, match_kind = "standard", returns the match that finishes first. It is also the only mode that supports overlapping search.
patterns <- c("content", "disco", "disc", "discontent", "winter")
haystack <- "This is the winter of my discontent"
ac <- ac_build(patterns)
ac_extract_df(ac, haystack)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
ac_extract_df(ac, haystack, overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
#> 3 1 disco disco
#> 4 1 discontent discontent
#> 5 1 content content
leftmost_first
match_kind = "leftmost_first" returns the leftmost non-overlapping match. If several patterns start at the same position, the earlier pattern wins.
ac <- ac_build(
c("disco", "disc"),
match_kind = "leftmost_first"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 disco disco
leftmost_longest
match_kind = "leftmost_longest" returns the leftmost non-overlapping match. If several patterns start at the same position, the longest pattern wins.
ac <- ac_build(
c("disco", "disc", "discontent"),
match_kind = "leftmost_longest"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 discontent discontentPerformance Options
implementation controls the underlying automaton implementation:
-
"auto"lets the Rust crate choose a heuristic default. -
"noncontiguous_nfa"is fast to build and uses moderate memory. -
"contiguous_nfa"uses memory efficiently and is usually faster to search than a noncontiguous NFA. -
"dfa"can be fastest to search, but may be slow to build and can use much more memory.
Benchmark
See the benchmark article for results on English, Chinese, and mixed-text workloads. In the current benchmark, ahocorasick is fastest for detect and count across the tested cases. The ac_extract_df() is also the fastest for extract while preserving a tidy long data-frame result.
As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!
Related Work
Packages that developed to handle multiple string matching in R are less than in Python. These are what I found:
-
AhoCorasickTrie: Implemented in C++. It only supports 128 ASCII characters and currently does not support UTF-8/Unicode (such as CJK and emojis). -
Biostrings: It is a special tool for bioinformatics, and its object system isXString/DNAString/XStringSet. It is not suitable as a drop-in multi-mode retrieval tool for general R string vectors. -
r-polars: Polars’ string expressions also use the Aho-Corasick algorithm to match (based on the same crate as this package). If your data is already in a DataFrame, I recommend you use Polars for matching strings directly.
It’s great to see that Rust is providing more and more modern, safe, high-performance open source tools for R (and also for Python) :)
Acknowledgments
Inspired by the Python package ahocorasick_rs. Thanks for the excellent Rust aho-corasick crate.
MIT © 2026 Yousa Mirage