Convenience wrapper around tagging() for multi-string input. When batch
is not supplied, tagging_batch() always returns list output.
Arguments
- texts
A non-empty character vector to tag.
- jiebar
A jieba_worker object created with worker(type = "tag").
- ...
Must be empty. This enforces that optional arguments such as format and batch are supplied with explicit names.
- format
Output format for each single tagged result. Must be one of "vector", "data.frame", or "legacy".
- batch
Aggregation mode. Must be one of "list" or "flatten".
Details
tagging_batch() is a convenience wrapper for explicit multi-string input.
The returned object depends on both format and batch:
- batch = "list": returns one single-string tagging result per input string.
- batch = "flatten": concatenates all results into one. The shape is decided by format: "vector" and "legacy" produce a named character vector, while "data.frame" produces a combined data frame with a doc_id column.
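The two aggregation modes can be pictured with plain base R. The sketch below does not call the package and is not its implementation. It builds a hand-written stand-in for the list output (values copied from the first example in Examples) and shows what the flattened shapes would look like. Only the doc_id column is documented, so the word and tag column names, and the use of an integer doc_id, are assumptions made for illustration.

```r
# Stand-in for batch = "list" output: one named character vector per input
# string, with words as names and tags as values.
per_text <- list(
  setNames(c("v", "m", "vn"), c("这是", "一个", "测试")),
  setNames(c("d", "v", "m"), c("再", "来", "一次"))
)

# Flattened vector shape: the per-text vectors concatenated in input order.
flat_vector <- do.call(c, per_text)

# Flattened data frame shape. `doc_id` records which input each row came from;
# `word` and `tag` are assumed column names, not documented ones.
flat_df <- do.call(rbind, lapply(seq_along(per_text), function(i) {
  data.frame(
    doc_id = i,
    word = names(per_text[[i]]),
    tag = unname(per_text[[i]])
  )
}))

print(flat_vector)
print(flat_df)
```

Judging by the second example in Examples, format = "legacy" swaps the roles, so that tags become the names and words become the values of the flattened vector.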
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, batch tagging is about 2x to 5x faster
than the comparable jiebaR workflow on many-string inputs. For very long
texts, the best throughput was usually reached by splitting the text into
about 32 to 128 chunks. Much finer splitting was still faster than no
splitting, but it fell short of that optimum.
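One way to act on the chunking guidance is to split a long text at sentence boundaries before tagging. The helper below is a minimal base R sketch. The function name split_into_chunks, its n_chunks argument, and the choice of sentence-ending punctuation are all hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): split one long string into
# at most `n_chunks` pieces, cutting only after sentence-ending punctuation.
split_into_chunks <- function(text, n_chunks = 64L) {
  # The lookbehind splits after Chinese or ASCII sentence enders and keeps
  # the punctuation attached to the sentence it closes.
  sentences <- strsplit(text, "(?<=[\u3002\uff01\uff1f.!?])", perl = TRUE)[[1]]
  sentences <- sentences[nzchar(sentences)]
  if (length(sentences) == 0L) return(character(0))

  # Never ask for more chunks than there are sentences.
  n_chunks <- min(n_chunks, length(sentences))

  # Assign consecutive sentences to chunk ids 1..n_chunks as evenly as possible.
  group <- ceiling(seq_along(sentences) * n_chunks / length(sentences))

  # Glue each group's sentences back together into one chunk.
  unname(vapply(split(sentences, group), paste, character(1), collapse = ""))
}

chunks <- split_into_chunks("A. B! C? D. E.", n_chunks = 2L)
print(chunks)
```

The resulting character vector could then be passed as texts, for example tagging_batch(chunks, tagger, batch = "flatten"). Whether chunked tagging gives exactly the same tags as tagging the whole text in one call is not stated here, so it is worth checking on your own data.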
Examples
tagger <- worker(type = "tag")
texts <- c("\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5", "\u518d\u6765\u4e00\u6b21")
tagging_batch(texts, tagger)
#> [[1]]
#> 这是 一个 测试
#> "v" "m" "vn"
#>
#> [[2]]
#> 再 来 一次
#> "d" "v" "m"
#>
tagging_batch(texts, tagger, format = "legacy", batch = "flatten")
#> v m vn d v m
#> "这是" "一个" "测试" "再" "来" "一次"