Tag text with a jiebaRS worker

Tag one or more strings with a jieba_worker created by worker().

Usage

tagging(
  code,
  jiebar,
  ...,
  format = c("vector", "data.frame", "legacy"),
  batch = c("list", "flatten")
)

Arguments

code: A non-empty character vector to tag.
jiebar: A jieba_worker object created with worker(type = "tag").
...: Must be empty. This enforces that optional arguments such as format and batch are supplied with explicit names.
format: Output format for a single tagged string. Must be one of "vector", "data.frame", or "legacy".
batch: Aggregation mode for multi-string input. Must be one of "list" or "flatten".

Value

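The tagging results. format sets the shape of each single-string result, and batch sets how results are combined when code contains more than one string.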

Details

format controls the shape of each single-string tagging result:

"vector": a named character vector whose names are the tokens and whose values are the tags.
"data.frame": a data frame with term and tag columns.
"legacy": the old jiebaR layout, a named character vector whose values are the tokens and whose names are the tags.

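As a rough illustration of the three layouts, the sketch below tags the same string three times and changes only format. It assumes jiebaRS is installed with its default dictionaries; printed output is omitted because the exact tokens and tags depend on the dictionary in use.

```r
library(jiebaRS)

tagger <- worker(type = "tag")
text <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"

# Names are tokens, values are tags (the default layout).
as_vector <- tagging(text, tagger, format = "vector")

# One row per token, with `term` and `tag` columns.
as_df <- tagging(text, tagger, format = "data.frame")

# Old jiebaR layout: values are tokens, names are tags.
as_legacy <- tagging(text, tagger, format = "legacy")

# Inspect the structure of each result.
str(as_vector)
str(as_df)
str(as_legacy)
```
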
In the current release benchmarks on the bundled Fortress Besieged and Dream of the Red Chamber texts, jiebaRS::tagging() is about 1.6x to 1.8x faster than jiebaR::tagging() when each novel is tagged as one long string. When the same content is split into many strings and processed in batch, jiebaRS::tagging() is about 2x to 5x faster than jiebaR.

For very long texts, splitting before tagging is usually faster than sending one huge string. In the same release benchmarks, the best results came at roughly 32 to 128 chunks; much finer splitting was still faster than a single string but was no longer optimal.
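
One way to apply this advice is sketched below. split_into_chunks() is a hypothetical helper written for this example, not part of jiebaRS; it cuts only after sentence-ending punctuation or newlines so that no word is split across chunks. The default of 64 chunks is simply a value inside the range reported above, and the best setting for other texts may differ.

```r
# Hypothetical helper (not part of jiebaRS): split one long string into at
# most `n_chunks` pieces of similar size, cutting only after sentence-ending
# punctuation or a newline.
split_into_chunks <- function(text, n_chunks = 64L) {
  enders <- "\u3002\uff01\uff1f!?\n"
  # Each match is a run of text plus its trailing terminators, or the final
  # unterminated remainder, so the pieces concatenate back to `text`.
  pattern <- sprintf("[^%s]*[%s]+|[^%s]+$", enders, enders, enders)
  sentences <- regmatches(text, gregexpr(pattern, text))[[1]]

  # Assign each sentence to a chunk according to where it starts in the text.
  sizes <- nchar(sentences)
  starts <- cumsum(sizes) - sizes
  group <- floor(starts / sum(sizes) * n_chunks) + 1

  unname(vapply(split(sentences, group), paste0, character(1), collapse = ""))
}

long_text <- paste0(
  "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5\u3002",
  "\u518d\u6765\u4e00\u6b21\u3002"
)
chunks <- split_into_chunks(long_text, n_chunks = 2L)

# Tag the chunks in one batch call; skipped when jiebaRS is not installed.
if (requireNamespace("jiebaRS", quietly = TRUE)) {
  tagger <- jiebaRS::worker(type = "tag")
  tagged <- jiebaRS::tagging(chunks, tagger,
                             format = "data.frame", batch = "flatten")
}
```

With batch = "flatten" and format = "data.frame", the doc_id column then identifies the chunk, not the original document, so it can be dropped or remapped as needed.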

When code contains multiple strings, batch controls how the per-string results are aggregated:

"list": one single-string result per input string.
"flatten": concatenate all results into one. The shape is decided by format: "vector"/"legacy" produce a named character vector, while "data.frame" produces a combined data frame with a doc_id column.

When batch is omitted, single-string input returns the single result directly (a named character vector under the default format = "vector"), and multi-string input is treated as batch = "list".
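
The batch modes can be compared side by side with a sketch like the following. It assumes jiebaRS is installed; the final split() step is plain base R, included only to show one possible use of the doc_id column.

```r
library(jiebaRS)

tagger <- worker(type = "tag")
docs <- c(
  "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5",
  "\u518d\u6765\u4e00\u6b21"
)

# Default for multi-string input: a list with one result per input string.
per_doc <- tagging(docs, tagger)

# One named character vector covering all inputs.
flat_vec <- tagging(docs, tagger, batch = "flatten")

# One data frame; doc_id records which input each row came from.
flat_df <- tagging(docs, tagger, format = "data.frame", batch = "flatten")

# Recover per-document data frames from the flattened result.
by_doc <- split(flat_df, flat_df$doc_id)
```
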

Examples

tagger <- worker(type = "tag")
text1 <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"
text2 <- "\u518d\u6765\u4e00\u6b21"
tagging(text1, tagger)
#> 这是 一个 测试 
#>  "v"  "m" "vn" 
tagging(c(text1, text2), tagger)
#> [[1]]
#> 这是 一个 测试 
#>  "v"  "m" "vn" 
#> 
#> [[2]]
#>   再   来 一次 
#>  "d"  "v"  "m" 
#> 
tagging(c(text1, text2), tagger, format = "data.frame", batch = "flatten")
#>   doc_id term tag
#> 1      1 这是   v
#> 2      1 一个   m
#> 3      1 测试  vn
#> 4      2   再   d
#> 5      2   来   v
#> 6      2 一次   m