Tag a batch of strings — tagging

Convenience wrapper around tagging() for multi-string input. When batch is not supplied, it defaults to "list", so tagging_batch() returns a list with one result per input string.

Usage

tagging_batch(
  texts,
  jiebar,
  ...,
  format = c("vector", "data.frame", "legacy"),
  batch = c("list", "flatten")
)

Arguments

texts: A non-empty character vector to tag.
jiebar: A jieba_worker object created with worker(type = "tag").
...: Must be empty. This enforces that optional arguments such as format and batch are supplied with explicit names.
format: Output format for each single tagged result. Must be one of "vector", "data.frame", or "legacy". Defaults to "vector".
batch: Aggregation mode. Must be one of "list" or "flatten". Defaults to "list".
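
Requiring ... to be empty is a common R idiom for forcing optional arguments to be named. The sketch below shows one way such a check can work in base R. demo_batch() is a hypothetical stand-in with the same argument layout; it is not the package's actual implementation, and the error message is made up for illustration.

```r
# Hypothetical stand-in that mimics the argument layout of tagging_batch().
demo_batch <- function(texts, jiebar, ...,
                       format = c("vector", "data.frame", "legacy"),
                       batch = c("list", "flatten")) {
  # Anything captured by `...` was passed by position or under an unknown name.
  if (...length() > 0L) {
    stop("`...` must be empty; supply `format` and `batch` by name.")
  }
  # match.arg() picks the first choice when the argument is not supplied.
  format <- match.arg(format)
  batch <- match.arg(batch)
  c(format = format, batch = batch)
}

# Named optional argument: accepted.
ok <- demo_batch("a", NULL, batch = "flatten")

# A positional third argument lands in `...` and is rejected.
err <- tryCatch(
  demo_batch("a", NULL, "flatten"),
  error = function(e) conditionMessage(e)
)
```

With this layout, a call such as tagging_batch(texts, tagger, "legacy") would be expected to fail rather than silently bind "legacy" to format.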

Value

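With batch = "list", a list with one tagging result per element of texts. With batch = "flatten", a single combined object. In both cases the shape of the result is determined by format.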

Details

tagging_batch() is a convenience wrapper for explicit multi-string input. The returned object depends on both format and batch:

batch = "list": returns one single-string tagging result per input string.
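batch = "flatten": concatenates all results into a single object, in input order. The shape is decided by format: "vector" and "legacy" produce one named character vector, while "data.frame" produces a combined data frame with a doc_id column. As the Examples show, "vector" uses the words as names and the tags as values, and "legacy" uses the tags as names and the words as values.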
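
To make the two aggregation modes concrete, the base R sketch below rebuilds by hand the batch = "list" result shown in the Examples for format = "vector", then combines it in the way batch = "flatten" is described above. It only illustrates the output shapes and is not the package's implementation. The word and tag column names and the integer doc_id values are assumptions, since only the existence of a doc_id column is documented here.

```r
# Hand-built copy of the batch = "list", format = "vector" result from the
# Examples: names are words, values are tags.
per_string <- list(
  setNames(c("v", "m", "vn"), c("\u8fd9\u662f", "\u4e00\u4e2a", "\u6d4b\u8bd5")),
  setNames(c("d", "v", "m"), c("\u518d", "\u6765", "\u4e00\u6b21"))
)

# Flattened vector shape: one named character vector, input order kept.
flat_vector <- do.call(c, per_string)

# Flattened data frame shape: one row per token plus a doc_id column.
# Column names `word`/`tag` and integer doc_id values are assumptions.
flat_df <- do.call(rbind, lapply(seq_along(per_string), function(i) {
  data.frame(
    doc_id = i,
    word = names(per_string[[i]]),
    tag = unname(per_string[[i]]),
    stringsAsFactors = FALSE
  )
}))
```

In the flattened vector the boundary between input strings is lost, which is why a doc_id column is useful when the origin of each token matters.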

In the current release benchmarks on the bundled Fortress Besieged and Dream of the Red Chamber texts, batch tagging is about 2x to 5x faster than the comparable jiebaR workflow on many-string inputs. For very long texts, the best throughput was usually reached by splitting the text into about 32 to 128 chunks. Much finer splitting was still faster than no splitting, but it was no longer optimal.
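
The chunking advice can be tried with a small helper. The sketch below splits one long string into roughly equal runs of characters. split_into_chunks() is a hypothetical helper, not part of the package, and cutting at fixed character offsets is a simplifying assumption: a cut can land inside a word and change the tags near the boundary, so splitting at sentence or paragraph breaks is likely the safer choice for real text.

```r
# Hypothetical helper: split one string into at most `n_chunks` pieces of
# roughly equal length, measured in characters (not bytes).
split_into_chunks <- function(text, n_chunks = 64L) {
  stopifnot(is.character(text), length(text) == 1L, nzchar(text), n_chunks >= 1L)
  n_chars <- nchar(text, type = "chars")
  # Never ask for more chunks than there are characters.
  n_chunks <- min(as.integer(n_chunks), n_chars)
  # Round the chunk size up so the pieces cover the whole string.
  size <- ceiling(n_chars / n_chunks)
  starts <- seq.int(1L, n_chars, by = size)
  substring(text, starts, pmin(starts + size - 1L, n_chars))
}

# A synthetic long text built by repeating the first example string.
long_text <- paste(rep("\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5", 1000), collapse = "")
chunks <- split_into_chunks(long_text, n_chunks = 64L)

# The chunks could then be tagged in one call, for example:
# tagging_batch(chunks, tagger, batch = "flatten")
```

The chunk count is a tuning knob. The 32 to 128 range comes from the benchmarks described above and may not carry over to other texts or machines.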

Examples

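# Create a tagging worker and two short input strings.
tagger <- worker(type = "tag")
texts <- c("\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5", "\u518d\u6765\u4e00\u6b21")
# Defaults (format = "vector", batch = "list"): one named vector per string.
tagging_batch(texts, tagger)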
#> [[1]]
#> 这是 一个 测试 
#>  "v"  "m" "vn" 
#> 
#> [[2]]
#>   再   来 一次 
#>  "d"  "v"  "m" 
#> 
# Flatten into one vector; with "legacy", names are tags and values are words.
tagging_batch(texts, tagger, format = "legacy", batch = "flatten")
#>      v      m     vn      d      v      m 
#> "这是" "一个" "测试"   "再"   "来" "一次"