Segment text with a jieba worker

Segment one or more strings with a jieba_worker created by worker().

Usage

segment(
  code,
  jiebar,
  ...,
  mod = NULL,
  batch = c("list", "data.frame", "flatten")
)

Arguments

code: A character vector to segment.
jiebar: A jieba_worker object.
...: Must be empty. This enforces that optional arguments such as mod and batch are supplied with explicit names.
mod: Deprecated. Compatibility argument retained from jiebaR. It no longer has any effect; if supplied, jiebaRS warns and ignores it.
batch: Batch aggregation mode for multi-string input. Must be one of "list", "data.frame", or "flatten". The default is "list".

Value

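For a single input string, a character vector of segmented tokens. For multiple input strings, the tokens in the aggregation form selected by batch: a list of character vectors, a data frame, or one flattened character vector.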

Details

For a single input string, segment() always returns a character vector of segmented tokens.

In the current release benchmarks on the bundled Fortress Besieged and Dream of the Red Chamber texts, jiebaRS::segment() is about 1.7x to 1.9x faster than jiebaR::segment() when each novel is segmented as one long string. When the input is many short strings segmented in parallel, jiebaRS::segment() reaches about 7x to 12x speedup over jiebaR.

For very long texts, split the input into roughly 32 to 128 chunks before segmentation; this range is recommended for good throughput.
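The chunking advice above can be sketched in base R. The helper chunk_text() below is hypothetical and not part of jiebaRS; it assumes the text contains line breaks and groups whole lines into chunks of roughly equal character count, so that no chunk boundary falls in the middle of a line. The segment() call is guarded so that the sketch still runs when jiebaRS is not installed.

```r
# Hypothetical helper (not part of jiebaRS): group the lines of a long text
# into at most `n_chunks` chunks of roughly equal character count.
# Empty lines are dropped, so blank-line spacing is not preserved.
chunk_text <- function(text, n_chunks = 64L) {
  parts <- strsplit(text, "\n", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  if (length(parts) == 0L) return(character(0))
  n_chunks <- min(n_chunks, length(parts))
  sizes <- nchar(parts)
  # Character offset at which each line starts within the whole text.
  starts <- cumsum(sizes) - sizes
  # Map each line to a chunk index according to its start offset.
  group <- pmin(n_chunks, floor(starts / (sum(sizes) / n_chunks)) + 1)
  unname(vapply(split(parts, group), paste, character(1), collapse = "\n"))
}

# Stand-in for a long document: 200 short lines.
long_text <- paste(rep("\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5", 200), collapse = "\n")
chunks <- chunk_text(long_text, n_chunks = 64L)

if (requireNamespace("jiebaRS", quietly = TRUE)) {
  seg <- jiebaRS::worker()
  # batch = "flatten" concatenates the per-chunk tokens into one vector.
  tokens <- jiebaRS::segment(chunks, seg, batch = "flatten")
}
```

Splitting on line breaks is only one possible choice. Any boundary that does not cut through a word, such as sentence-ending punctuation, should serve the same purpose.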

For multiple input strings, the argument batch controls how the per-string token vectors are aggregated:

"list": one character vector per input string.
"data.frame": a data frame with doc_id and word columns.
"flatten": all token vectors concatenated into one character vector.

When batch is omitted, jiebaRS returns list output for multi-string input.
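To make the relationship between the three forms concrete, the following base R sketch starts from per-string token vectors like those returned with batch = "list" and rebuilds the other two shapes by hand. It is an illustration only, not how jiebaRS implements aggregation; in particular, the integer type of doc_id is an assumption.

```r
# Per-string token vectors, as in the "list" form (tokens taken from the
# Examples section of this page).
tokens <- list(
  c("\u5357\u4eac\u5e02", "\u957f\u6c5f\u5927\u6865"),
  c("\u8fd9\u662f", "\u4e00\u4e2a", "\u6d4b\u8bd5")
)

# "data.frame" shape: one row per token, with doc_id giving the position
# of the input string the token came from (integer type is an assumption).
as_df <- data.frame(
  doc_id = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# "flatten" shape: all token vectors concatenated in input order.
flat <- unlist(tokens)
```

The "flatten" form drops the link between a token and its source string, so "list" or "data.frame" is the safer choice when that link is needed later.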

The mod argument from jiebaR::segment() is retained only as a deprecated compatibility placeholder. In jiebaRS, segmentation behavior should be controlled by the worker type itself (for example, worker(type = "mix") or worker(type = "query")), not by mutating behavior at call time. When mod is supplied, jiebaRS warns and ignores it.
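A minimal sketch of that migration pattern, using only the worker types named above: instead of passing mod at call time, create one worker per segmentation mode and choose the worker. The variable names are illustrative, and the block is guarded so that it is skipped when jiebaRS is not installed.

```r
text <- "\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865"

if (requireNamespace("jiebaRS", quietly = TRUE)) {
  # One worker per segmentation mode, created once and reused.
  mix_seg <- jiebaRS::worker(type = "mix")
  query_seg <- jiebaRS::worker(type = "query")

  # The mode is selected by choosing the worker, not by a call-time argument.
  print(jiebaRS::segment(text, mix_seg))
  print(jiebaRS::segment(text, query_seg))
}
```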

Examples

seg <- worker()
text1 <- "\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865"
text2 <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"
segment(text1, seg)
#> [1] "南京市"   "长江大桥"
segment(c(text1, text2), seg, batch = "list")
#> [[1]]
#> [1] "南京市"   "长江大桥"
#> 
#> [[2]]
#> [1] "这是" "一个" "测试"
#> 
segment(c(text1, text2), seg, batch = "data.frame")
#>   doc_id     word
#> 1      1   南京市
#> 2      1 长江大桥
#> 3      2     这是
#> 4      2     一个
#> 5      2     测试