Segment a batch of strings — segment

Convenience wrapper around segment() for multi-string input. When batch is omitted, segment_batch() returns a list with one character vector per input string.

Usage

segment_batch(texts, jiebar, ..., batch = c("list", "data.frame", "flatten"))

Arguments

texts: A character vector of strings to segment.
jiebar: A jieba_worker object.
...: Must be empty. This enforces that optional arguments such as batch are supplied with explicit names.
batch: Batch aggregation mode. Must be one of "list", "data.frame", or "flatten". The default is "list".

Value

Segmented tokens in the requested aggregation form.

Details

segment_batch() is a convenience wrapper around segment() for explicit batch processing. It always treats texts as multi-string input. The returned object depends on batch:

"list": one character vector per input string.
"data.frame": a data frame with doc_id and word columns.
"flatten": one concatenated character vector.
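To make the three shapes concrete, the following base R sketch rebuilds the "data.frame" and "flatten" forms from a "list"-style result. It is an illustration only, not the package's implementation. The token values are copied from the Examples section, and treating doc_id as the 1-based position of the input string is an assumption that this page does not confirm.

```r
# Tokens in the shape returned by batch = "list" (values from the Examples).
tokens <- list(
  c("\u5357\u4eac\u5e02", "\u957f\u6c5f\u5927\u6865"),
  c("\u8fd9\u662f", "\u4e00\u4e2a", "\u6d4b\u8bd5")
)

# "flatten"-style shape: one concatenated character vector.
flat <- unlist(tokens, use.names = FALSE)

# "data.frame"-style shape: one row per token.
# ASSUMPTION: doc_id is the 1-based index of the source string.
df <- data.frame(
  doc_id = rep(seq_along(tokens), times = lengths(tokens)),
  word = flat,
  stringsAsFactors = FALSE
)
```

The long format keeps the link between each token and its source string, which the flattened vector discards.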

In the current release's benchmarks on the bundled texts of Fortress Besieged and Dream of the Red Chamber, batch segmentation is about 7x to 12x faster than the comparable jiebaR workflow on many-string inputs. For very long texts, splitting the text into roughly 32 to 128 chunks before calling segment_batch() is recommended for good throughput.
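One possible way to prepare a very long text is sketched below in base R. The helper chunk_paragraphs() is hypothetical and not part of this package. It assumes the text contains line breaks, and it groups whole paragraphs into at most n_chunks pieces of roughly equal character count so that no paragraph is cut in the middle.

```r
# Hypothetical helper (not provided by the package): split one long string
# into at most n_chunks pieces along paragraph (line break) boundaries.
chunk_paragraphs <- function(text, n_chunks = 64L) {
  stopifnot(is.character(text), length(text) == 1L, n_chunks >= 1L)

  # Split on line breaks and drop empty paragraphs.
  paras <- strsplit(text, "\n", fixed = TRUE)[[1]]
  paras <- paras[nzchar(paras)]
  if (length(paras) == 0L) return(character(0))

  # Never ask for more chunks than there are paragraphs.
  n_chunks <- min(n_chunks, length(paras))

  # Assign each paragraph to a chunk based on where its midpoint falls
  # in the cumulative character count, which balances chunk sizes.
  sizes <- nchar(paras)
  mid <- cumsum(sizes) - sizes / 2
  group <- ceiling(mid / sum(sizes) * n_chunks)

  # Re-join the paragraphs of each group into a single string.
  unname(vapply(split(paras, group), paste, character(1), collapse = "\n"))
}

# Usage with the documented API (requires the package and a worker):
# chunks <- chunk_paragraphs(long_text, n_chunks = 64L)
# words  <- segment_batch(chunks, seg, batch = "flatten")
```

If the text has fewer paragraphs than n_chunks, or paragraph lengths are very uneven, the helper returns fewer chunks than requested.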

Examples

seg <- worker()
texts <- c("\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865", "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5")
segment_batch(texts, seg)
#> [[1]]
#> [1] "南京市"   "长江大桥"
#> 
#> [[2]]
#> [1] "这是" "一个" "测试"
#> 
segment_batch(texts, seg, batch = "flatten")
#> [1] "南京市"   "长江大桥" "这是"     "一个"     "测试"