Count n-grams from segmented text

Count contiguous n-grams from a segmented character vector or from each element of a list of segmented character vectors.

This function is a drop-in replacement for jiebaR::get_tuple(), which is deprecated in jiebaRS. See Details for more information.

Usage

count_ngrams(
  x,
  ...,
  n = 2,
  sep = " ",
  sort = TRUE,
  format = c("data.frame", "vector")
)

Arguments

x: A character vector of tokens or a list of character vectors.
...: Must be empty. This enforces that optional arguments such as n, sep, sort, and format are supplied with explicit names.
n: A positive integer or integer vector giving the n-gram sizes to count. The default is 2. If n is an integer vector of length > 1, n-grams of every specified size are counted.
sep: Separator inserted between tokens when constructing the n-gram label. The default is " ", a single space.
sort: Whether to sort results by descending frequency. The default is TRUE. If FALSE, results keep first-appearance order within each requested n.
format: Output format. "data.frame" returns a data frame with term, n, and count columns. "vector" returns a named integer vector using the n-gram terms as names.

Value

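N-gram counts in the requested format: a data frame with term, n, and count columns, or a named integer vector whose names are the n-gram terms.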

Details

The original jiebaR::get_tuple() interface has several design problems:

Its n-gram extraction behavior does not match the most obvious reading of the argument name: size = n counts all contiguous n-grams of sizes 2 through n, not only those of exactly size n.
Its documentation says it accepts list input, but the original exported implementation does not reliably support lists.
It concatenates tokens without a separator, which makes tuple boundaries ambiguous.
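
The boundary ambiguity in the last point can be reproduced with base R alone. The following is a minimal sketch using made-up tokens; it calls neither jiebaR::get_tuple() nor count_ngrams(), and the variable names are purely illustrative.

```r
# Two different token pairs (made up for illustration)
pair1 <- c("ab", "c")
pair2 <- c("a", "bc")

# Without a separator, both pairs collapse to the same label "abc"
no_sep1 <- paste(pair1, collapse = "")
no_sep2 <- paste(pair2, collapse = "")
identical(no_sep1, no_sep2)
#> [1] TRUE

# With a separator, the token boundary survives in the label
with_sep1 <- paste(pair1, collapse = " ")
with_sep2 <- paste(pair2, collapse = " ")
identical(with_sep1, with_sep2)
#> [1] FALSE
```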

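count_ngrams() addresses these issues: it counts exactly the sizes given in n, accepts list input, and joins tokens with an explicit separator. Optional arguments must be supplied by name. In addition, this function is about 1.3x to 2.0x faster than jiebaR::get_tuple().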

Examples

count_ngrams(c("\u6211", "\u7231", "R"), n = 2)
#>    term n count
#> 1 我 爱 2     1
#> 2  爱 R 2     1
count_ngrams(c("\u6211", "\u7231", "R"), n = 1:2, format = "data.frame")
#>    term n count
#> 1    我 1     1
#> 2    爱 1     1
#> 3     R 1     1
#> 4 我 爱 2     1
#> 5  爱 R 2     1
count_ngrams(c("a", "b", "b", "b", "a"), n = 1, sort = FALSE)
#>   term n count
#> 1    a 1     2
#> 2    b 1     3
count_ngrams(list(c("a", "b", "c"), c("a", "b")), n = 2)
#>   term n count
#> 1  a b 2     2
#> 2  b c 2     1
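
For intuition about what "contiguous n-grams" means, the counting can be approximated in base R. This is only a sketch under stated assumptions: sketch_ngrams() is a hypothetical helper written for this illustration, it handles a single token vector and a single n, and it is not how count_ngrams() is implemented.

```r
# Hypothetical helper (not part of the package): build contiguous n-grams
# from one token vector and count them with base R.
sketch_ngrams <- function(tokens, n = 2, sep = " ") {
  len <- length(tokens)
  # Fewer tokens than n means there are no n-grams to count
  if (len < n) return(integer(0))
  # One n-gram starts at each position 1, 2, ..., len - n + 1
  starts <- seq_len(len - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
    character(1)
  )
  # table() counts each distinct n-gram; sort by descending frequency
  sort(table(grams), decreasing = TRUE)
}

res <- sketch_ngrams(c("a", "b", "a", "b", "c"), n = 2)
res
```

Here the bigram "a b" occurs twice, while "b a" and "b c" each occur once.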
