Tag one or more strings with a jieba_worker created by worker().
Arguments
- code
A non-empty character vector to tag.
- jiebar
A jieba_worker object created with worker(type = "tag").
- ...
Must be empty. This enforces that optional arguments such as format and batch are supplied with explicit names.
- format
Output format for a single tagged string. Must be one of "vector", "data.frame", or "legacy".
- batch
Aggregation mode for multi-string input. Must be one of "list" or "flatten".
Details
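The requirement that ... be empty can be illustrated with a small base R sketch. tag_sketch() is a hypothetical stand-in and not the jiebaRS implementation; its default values are assumptions made only for illustration. It shows one common way an empty ... forces format and batch to be passed by name.

```r
# Hypothetical stand-in, not the jiebaRS source. Defaults are assumed.
tag_sketch <- function(code, jiebar, ..., format = "vector", batch = NULL) {
  # Anything passed positionally after `jiebar` lands in `...`.
  if (...length() > 0L) {
    stop("`...` must be empty; pass `format` and `batch` by name.")
  }
  format <- match.arg(format, c("vector", "data.frame", "legacy"))
  if (!is.null(batch)) {
    batch <- match.arg(batch, c("list", "flatten"))
  }
  list(format = format, batch = batch)
}

# Named optional arguments are accepted.
ok <- tag_sketch("text", NULL, format = "data.frame", batch = "flatten")

# A positional third argument is captured by `...` and rejected.
err <- tryCatch(
  tag_sketch("text", NULL, "data.frame"),
  error = function(e) conditionMessage(e)
)
```

In this sketch, a call such as tag_sketch("text", NULL, "data.frame") fails early instead of silently treating "data.frame" as format.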
format controls the shape of each single-string tagging result:
- "vector": a named character vector with token names and tag values.
- "data.frame": a data frame with term and tag columns.
- "legacy": the old jiebaR layout with token values and tag names.
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, jiebaRS::tagging() is about 1.6x to
1.8x faster than jiebaR::tagging() when each novel is tagged as one long
string. When the same content is split into many strings and processed in
batch, jiebaRS::tagging() is about 2x to 5x faster than jiebaR.
For very long texts, splitting before tagging is usually faster than sending one huge string. In the same release benchmarks, the best results appeared around 32 to 128 chunks, while much finer splitting still helped but was no longer optimal.
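The splitting advice above can be sketched in base R. split_into_chunks() is a hypothetical helper, not part of jiebaRS, and it assumes the long text is already available as a character vector of lines or paragraphs, so that chunk boundaries never fall inside a sentence. The default of 64 chunks is just one value from the 32 to 128 range mentioned above.

```r
# Hypothetical helper, not part of jiebaRS: group a character vector of
# lines into at most `n_chunks` contiguous pieces, one string per piece.
split_into_chunks <- function(lines, n_chunks = 64L) {
  stopifnot(is.character(lines), length(lines) > 0L, n_chunks >= 1L)
  n_chunks <- min(n_chunks, length(lines))
  # Assign each line to a contiguous group numbered 1..n_chunks.
  group <- ceiling(seq_along(lines) * n_chunks / length(lines))
  pieces <- split(lines, group)
  unname(vapply(pieces, paste, character(1), collapse = "\n"))
}

# Stand-in input; a real use would read the lines of a long text.
lines <- sprintf("line %d", 1:10)
chunks <- split_into_chunks(lines, n_chunks = 4L)

# With jiebaRS attached, the chunks could then be tagged in one batch call:
# tagger <- worker(type = "tag")
# tagged <- tagging(chunks, tagger, batch = "flatten")
```

Joining the chunks back with newlines reproduces the original text, so no content is lost by splitting. The best chunk count for a given text is likely to depend on its length and is worth measuring.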
When code contains multiple strings, batch controls how the per-string
results are aggregated:
- "list": one single-string result per input string.
- "flatten": concatenate all results into one. The shape is decided by format: "vector" / "legacy" produce a named character vector, while "data.frame" produces a combined data frame with a doc_id column.
When batch is omitted, jiebaRS returns "vector" for single-string
input and "list" for multi-string input.
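The shapes described in this section can be mimicked in base R with hand-written data. The sketch below does not call jiebaRS; the tokens and tags are copied from the Examples, and details such as the integer type of doc_id are assumptions rather than documented behavior.

```r
# Hand-written stand-in data for two input strings; not produced by a tagger.
terms <- list(
  c("\u8fd9\u662f", "\u4e00\u4e2a", "\u6d4b\u8bd5"),
  c("\u518d", "\u6765", "\u4e00\u6b21")
)
tags <- list(c("v", "m", "vn"), c("d", "v", "m"))

# format: the three single-string shapes, shown for the first string.
as_vector <- setNames(tags[[1]], terms[[1]])  # names = tokens, values = tags
as_df <- data.frame(term = terms[[1]], tag = tags[[1]],
                    stringsAsFactors = FALSE)
as_legacy <- setNames(terms[[1]], tags[[1]])  # names = tags, values = tokens

# batch = "list": one result per input string.
as_list <- Map(function(term, tag) setNames(tag, term), terms, tags)

# batch = "flatten" with format = "vector": one concatenated named vector.
flat_vector <- unlist(as_list)

# batch = "flatten" with format = "data.frame": rows are stacked and
# doc_id records which input string each row came from.
flat_df <- do.call(rbind, lapply(seq_along(terms), function(i) {
  data.frame(doc_id = i, term = terms[[i]], tag = tags[[i]],
             stringsAsFactors = FALSE)
}))
```

Note that "vector" and "legacy" differ only in which side holds the tokens, so code migrating from jiebaR may need to swap names and values.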
Examples
tagger <- worker(type = "tag")
text1 <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"
text2 <- "\u518d\u6765\u4e00\u6b21"
tagging(text1, tagger)
#> 这是 一个 测试
#> "v" "m" "vn"
tagging(c(text1, text2), tagger)
#> [[1]]
#> 这是 一个 测试
#> "v" "m" "vn"
#>
#> [[2]]
#> 再 来 一次
#> "d" "v" "m"
#>
tagging(c(text1, text2), tagger, format = "data.frame", batch = "flatten")
#> doc_id term tag
#> 1 1 这是 v
#> 2 1 一个 m
#> 3 1 测试 vn
#> 4 2 再 d
#> 5 2 来 v
#> 6 2 一次 m