Segment one or more strings with a jieba_worker created by worker().
Usage
segment(
code,
jiebar,
...,
mod = NULL,
batch = c("list", "data.frame", "flatten")
)
Arguments
- code
A character vector to segment.
- jiebar
A jieba_worker object.
- ...
Must be empty. This enforces that optional arguments such as
mod and batch are supplied with explicit names.
- mod
Deprecated. Compatibility argument retained from
jiebaR. This argument no longer has any effect.
- batch
Batch aggregation mode for multi-string input. Must be one of
"list", "data.frame", or "flatten". The default is "list".
Details
For a single input string, segment() always returns a character vector of
segmented tokens.
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, jiebaRS::segment() is about 1.7x to
1.9x faster than jiebaR::segment() when each novel is segmented as one
long string. When the input consists of many short strings segmented in parallel,
jiebaRS::segment() is about 7x to 12x faster than jiebaR::segment().
For very long texts, splitting the input into roughly 32 to 128 chunks before segmentation is recommended for good throughput.
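The chunking advice above can be sketched in base R. The helper chunk_text() below is hypothetical and not part of jiebaRS. It assumes that cutting only after sentence-ending punctuation or newlines is acceptable for the text at hand, so that no word is split across two chunks. Whether the resulting tokens match those from segmenting the whole string in one call is not guaranteed by this page.

```r
# Hypothetical helper (not part of jiebaRS): split one long string into at
# most `n_chunks` pieces of roughly equal length, cutting only after
# sentence-ending punctuation (。！？) or newlines.
chunk_text <- function(text, n_chunks = 64L) {
  stopifnot(is.character(text), length(text) == 1L, n_chunks >= 1L)

  # Each match is a run of ordinary characters plus its trailing terminators,
  # so pasting all matches back together reproduces the original text.
  pattern <- "[^\u3002\uff01\uff1f\n]+[\u3002\uff01\uff1f\n]*|[\u3002\uff01\uff1f\n]+"
  sentences <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  if (length(sentences) == 0L) return(character(0))

  # Assign each sentence to a chunk based on its cumulative character offset.
  sizes <- nchar(sentences)
  group <- ceiling(cumsum(sizes) / sum(sizes) * n_chunks)

  unname(vapply(split(sentences, group), paste, character(1), collapse = ""))
}

# Usage sketch (requires jiebaRS; `novel_text` is a placeholder name):
# seg    <- worker()
# chunks <- chunk_text(novel_text, n_chunks = 64L)
# tokens <- segment(chunks, seg, batch = "flatten")
```

A text with few sentence boundaries yields fewer chunks than requested, because the helper never cuts inside a sentence.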
For multiple input strings, the argument batch controls how the
per-string token vectors are aggregated:
- "list": one character vector per input string.
- "data.frame": a data frame with doc_id and word columns.
- "flatten": all token vectors concatenated into one character vector.
When batch is omitted, jiebaRS returns list output for multi-string
input.
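As a rough illustration of how the three modes relate, the base R sketch below starts from token vectors shaped like "list" output and derives the other two shapes by hand. It is not the package's implementation. It assumes that doc_id is the position of each string in the input, as the Examples output suggests, and the column types of the real "data.frame" output may differ.

```r
# Token vectors shaped like batch = "list" output
# (values copied from the Examples section).
tokens <- list(
  c("\u5357\u4eac\u5e02", "\u957f\u6c5f\u5927\u6865"),
  c("\u8fd9\u662f", "\u4e00\u4e2a", "\u6d4b\u8bd5")
)

# Shape of batch = "flatten": all token vectors concatenated in input order.
flat <- unlist(tokens, use.names = FALSE)

# Shape of batch = "data.frame": one row per token, with doc_id assumed to be
# the index of the source string.
df <- data.frame(
  doc_id = rep(seq_along(tokens), lengths(tokens)),
  word = flat,
  stringsAsFactors = FALSE
)

print(flat)
print(df)
```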
The mod argument from jiebaR::segment() is retained only as a deprecated
compatibility placeholder. In jiebaRS, segmentation behavior should be
controlled by the worker type itself (for example, worker(type = "mix") or
worker(type = "query")), not by mutating behavior at call time. When mod
is supplied, jiebaRS warns and ignores it.
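A minimal sketch of the recommended pattern follows: the segmentation behavior is chosen once, when the worker is created, and segment() is then called without mod. It assumes jiebaRS is installed and exports worker() and segment() as described on this page. The requireNamespace() guard is added here only so the snippet can be sourced without the package, and no output is shown because it depends on the installed dictionary.

```r
# Assumption: jiebaRS exports worker() and segment() as documented above.
has_jiebaRS <- requireNamespace("jiebaRS", quietly = TRUE)

if (has_jiebaRS) {
  text <- "\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865"

  # Pick the behavior when creating the worker ...
  mix_seg <- jiebaRS::worker(type = "mix")
  query_seg <- jiebaRS::worker(type = "query")

  # ... then call segment() the same way for both workers, without `mod`.
  print(jiebaRS::segment(text, mix_seg))
  print(jiebaRS::segment(text, query_seg))
}
```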
Examples
seg <- worker()
text1 <- "\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865"
text2 <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"
segment(text1, seg)
#> [1] "南京市" "长江大桥"
segment(c(text1, text2), seg, batch = "list")
#> [[1]]
#> [1] "南京市" "长江大桥"
#>
#> [[2]]
#> [1] "这是" "一个" "测试"
#>
segment(c(text1, text2), seg, batch = "data.frame")
#>   doc_id     word
#> 1      1   南京市
#> 2      1 长江大桥
#> 3      2     这是
#> 4      2     一个
#> 5      2     测试