Count contiguous n-grams from a segmented character vector or from each element of a list of segmented character vectors.
This function is a drop-in replacement for jiebaR::get_tuple(), which
is deprecated in jiebaRS. See Details for more information.
Usage
count_ngrams(
x,
...,
n = 2,
sep = " ",
sort = TRUE,
format = c("data.frame", "vector")
)
Arguments
- x
A character vector of tokens or a list of character vectors.
- ...
Must be empty. This enforces that optional arguments such as
n, sep, sort, and format are supplied with explicit names.
- n
A positive integer or integer vector giving the n-gram sizes to count. The default is
2. If n is an integer vector of length > 1, n-grams of all specified sizes are counted.
- sep
Separator inserted between tokens when constructing the n-gram label. The default is
" ", a single space.
- sort
Whether to sort results by descending frequency. The default is
TRUE. If FALSE, results keep first-appearance order within each requested n.
- format
Output format.
"data.frame" returns a data frame with term, n, and count columns. "vector" returns a named integer vector using the n-gram terms as names.
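The argument semantics above can be sketched in base R. This is a minimal illustration only, not the package's implementation: sketch_count_ngrams() is a hypothetical helper, it covers only the data frame output, and it assumes that sorting is applied within each requested n (the documentation does not say whether sorting is global or per size).

```r
# Hypothetical helper for illustration; NOT the implementation used by count_ngrams().
sketch_count_ngrams <- function(x, ..., n = 2, sep = " ", sort = TRUE) {
  # Anything that lands in `...` is an unnamed extra argument, so reject it.
  if (...length() > 0) {
    stop("`...` must be empty; supply n, sep, and sort by name.")
  }
  # Treat a bare character vector as a list holding one token vector.
  docs <- if (is.list(x)) x else list(x)

  per_size <- lapply(n, function(k) {
    # Build contiguous k-grams inside each token vector, never across vectors.
    terms <- unlist(lapply(docs, function(tokens) {
      if (length(tokens) < k) return(character(0))
      starts <- seq_len(length(tokens) - k + 1)
      vapply(
        starts,
        function(i) paste(tokens[i:(i + k - 1)], collapse = sep),
        character(1)
      )
    }))
    if (length(terms) == 0) {
      return(data.frame(term = character(0), n = integer(0), count = integer(0)))
    }
    # Fixing the factor levels to unique(terms) keeps first-appearance order.
    counts <- table(factor(terms, levels = unique(terms)))
    out <- data.frame(
      term = names(counts),
      n = as.integer(k),
      count = as.integer(counts),
      stringsAsFactors = FALSE
    )
    # Assumption: descending frequency within each size; ties keep their order.
    if (sort) out <- out[order(-out$count), , drop = FALSE]
    out
  })

  res <- do.call(rbind, per_size)
  rownames(res) <- NULL
  res
}

# Named optional arguments work; a positional one such as
# sketch_count_ngrams(c("a", "b"), 2) is captured by `...` and raises an error.
sketch_count_ngrams(c("a", "b", "b", "b", "a"), n = 1, sort = FALSE)
sketch_count_ngrams(list(c("a", "b", "c"), c("a", "b")), n = 2)
```

Placing `...` directly after x is what forces every later argument to be named: R matches positional values to `...` before it reaches n, sep, or sort.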
Details
The original jiebaR::get_tuple() interface has several design problems:
- Its n-gram extraction behavior does not match the most obvious reading of the argument name: size = n counts all contiguous n-grams of every size in 2:n, not just the exact size n.
- Its documentation says it accepts list input, but the original exported implementation does not reliably support lists.
- It concatenates tokens without a separator, which makes tuple boundaries ambiguous.
count_ngrams() addresses these issues with a more explicit and more
flexible set of arguments. In addition, this function is about 1.3x to
2.0x faster than jiebaR::get_tuple().
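The separator problem is easy to reproduce with base R alone. The snippet below is a small sketch using made-up tokens; it does not call jiebaR::get_tuple() and only shows why labels built without a separator cannot be mapped back to their tokens.

```r
# Two different token pairs (illustrative values only).
pair_a <- c("ab", "c")
pair_b <- c("a", "bc")

# Without a separator both pairs collapse to the same label, "abc".
no_sep_a <- paste(pair_a, collapse = "")
no_sep_b <- paste(pair_b, collapse = "")
no_sep_a == no_sep_b      # TRUE

# With a single-space separator the token boundary stays visible.
with_sep_a <- paste(pair_a, collapse = " ")
with_sep_b <- paste(pair_b, collapse = " ")
with_sep_a == with_sep_b  # FALSE
```

This is the reason sep defaults to a single space. Because n accepts an integer vector, a range of sizes such as n = 2:4 has to be requested explicitly, while a single value requests exactly that size.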
Examples
count_ngrams(c("\u6211", "\u7231", "R"), n = 2)
#> term n count
#> 1 我 爱 2 1
#> 2 爱 R 2 1
count_ngrams(c("\u6211", "\u7231", "R"), n = 1:2, format = "data.frame")
#> term n count
#> 1 我 1 1
#> 2 爱 1 1
#> 3 R 1 1
#> 4 我 爱 2 1
#> 5 爱 R 2 1
count_ngrams(c("a", "b", "b", "b", "a"), n = 1, sort = FALSE)
#> term n count
#> 1 a 1 2
#> 2 b 1 3
count_ngrams(list(c("a", "b", "c"), c("a", "b")), n = 2)
#> term n count
#> 1 a b 2 2
#> 2 b c 2 1