Filter segmentation results — filter

Remove selected words from a segmented character vector or from each element of a list of segmented character vectors.

Usage

filter_segment(input, filter_words, keep_na = TRUE)

Arguments

input: A character vector or a list of character vectors.
filter_words: A character vector of words to remove.
keep_na: Logical. Whether to keep NA values in the returned result. The default, TRUE, matches the behavior of jiebaR::filter_segment().

Value

An object of the same type as input (a character vector, or a list with one character vector per input element), with matching words removed.

Details

This is a modern reimplementation of jiebaR::filter_segment() with the same core filtering behavior under the default settings.
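
The filtering behavior described here can be illustrated with a few lines of base R. This is only a minimal sketch, not the package's actual implementation: filter_sketch() and its inner helper drop_one() are hypothetical names introduced for illustration, and the sketch assumes exact whole-word matching and that filter_words itself contains no NA.

```r
# Hypothetical helper for illustration; not the package's implementation.
filter_sketch <- function(input, filter_words, keep_na = TRUE) {
  drop_one <- function(x) {
    # NA %in% filter_words is FALSE (assuming filter_words has no NA),
    # so NA elements are kept at this step.
    keep <- !(x %in% filter_words)
    # Optionally drop NA elements as well.
    if (!keep_na) keep <- keep & !is.na(x)
    x[keep]
  }
  # Apply to each element of a list, or directly to a character vector.
  if (is.list(input)) lapply(input, drop_one) else drop_one(input)
}

filter_sketch(c("a", NA, "b", "a"), "b")
filter_sketch(c("a", NA, "b", "a"), "b", keep_na = FALSE)
```

With keep_na = TRUE the NA element stays in place; with keep_na = FALSE it is dropped along with the filtered words.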

In the reproducible benchmark, this version is about 110x to 140x faster than jiebaR::filter_segment() on the tested workloads.

Examples

filter_segment(c("abc", "def", " ", "."), c("abc"))
#> [1] "def" " "   "."  
filter_segment(c("a", NA, "b", "a"), c("b"), keep_na = FALSE)
#> [1] "a" "a"
input <- list(
  c("\u6211", "\u662f", "\u6d4b\u8bd5"),
  c("\u6d4b\u8bd5", "\u6587\u672c", "\u6211")
)
filter_segment(input, "\u6211")
#> [[1]]
#> [1] "是"   "测试"
#> 
#> [[2]]
#> [1] "测试" "文本"
#>