Initializes a jiebaRS worker. See Details for how each worker type and argument maps onto the jieba-rs backend.

Usage

worker(
  type = c("mix", "mp", "hmm", "full", "query", "tag", "keywords", "textrank"),
  stop_word = NULL,
  stop_word_file = NULL,
  hmm = TRUE,
  topn = 5L,
  idf = NULL,
  dict = NULL,
  user = NULL,
  symbol = FALSE,
  bylines = FALSE
)

Arguments

type

Worker type. Supported values are "mix", "mp", "hmm", "full", "query", "tag", "keywords", and "textrank". Default is "mix".

stop_word

Optional character vector of stop words supplied directly.

stop_word_file

Optional file path containing one stop word per line.

hmm

Logical scalar or character scalar. If logical, controls whether to enable HMM fallback for unknown terms. If character, must be a path to a custom HMM model file compatible with jieba-rs's hmm.model format, and HMM fallback is enabled with that model. Default is TRUE.

topn

Integer. The number of terms returned by keywords and textrank workers. Default is 5.

idf

Optional character scalar. A path to a custom IDF dictionary file for keywords workers. Each line should be word idf_value. When NULL, the embedded default IDF dictionary is used. Ignored by non-keyword workers. Default is NULL.

dict

Optional character scalar. A path to a custom main dictionary file that replaces the embedded dictionary. Each line should be word [freq] [tag] (whitespace-separated; freq defaults to 0, tag defaults to empty). When NULL, the embedded dictionary is used. Default is NULL.

user

Optional character scalar. A path to a user dictionary file whose entries are appended to the main dictionary. Same line format as dict: word [freq] [tag]. Default is NULL.

symbol

Logical. Whether symbol-like tokens are kept in the segmentation result. Default is FALSE.

bylines

Deprecated compatibility argument retained from jiebaR. jiebaRS no longer uses this value; control batch aggregation directly in specific functions.

Value

A jieba_worker S3 object.
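
As a minimal sketch of worker construction, the snippet below writes a small IDF file in the documented word idf_value format and passes it to a keywords worker. It assumes the package is installed under the name jiebaRS; the IDF words and values are made up for illustration, and the jiebaRS calls are skipped when the package is not available.

```r
# Write a tiny IDF file: one "word idf_value" pair per line.
# The entries are illustrative, not taken from any real IDF table.
idf_path <- tempfile(fileext = ".txt")
idf_lines <- c(
  "\u6570\u636e 5.2",               # "data"
  "\u673a\u5668\u5b66\u4e60 7.9"    # "machine learning"
)
writeLines(enc2utf8(idf_lines), idf_path, useBytes = TRUE)

# Assumption: the package namespace is "jiebaRS".
if (requireNamespace("jiebaRS", quietly = TRUE)) {
  # Default worker: type resolves to "mix".
  seg <- jiebaRS::worker()

  # Keywords worker returning up to 10 terms, scored with the custom IDF file.
  kw <- jiebaRS::worker(type = "keywords", topn = 10L, idf = idf_path)

  # The Value section describes the result as a jieba_worker S3 object.
  print(inherits(seg, "jieba_worker"))
}
```

Because idf is ignored by non-keyword workers, passing it to the default worker would have no effect.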

Details

The qmax argument is not supported. Although jiebaR documented qmax for query workers, the value was never actually passed to the underlying segmentation call. Similarly, the jieba-rs backend implements search-mode segmentation without a configurable query threshold. To avoid user confusion, jiebaRS omits the qmax argument entirely rather than retaining a no-op parameter.

jieba-rs does not expose dedicated public implementations for mp or hmm workers. jiebaRS therefore maps mp to cut(..., false) and hmm to cut(..., true). This is a compatibility approximation rather than a byte-for-byte reimplementation of jiebaR, and jiebaRS warns once per R session when either type is requested.

tag workers use jieba-rs tagging on top of the default mixed segmentation path, which is the closest public behavior to jiebaR.

stop_word and stop_word_file can be supplied together. The two sources are merged into a single stop-word list, which is then normalized.
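
The following sketch supplies stop words from both sources at once. The union() call is only an illustration of what the merged list could contain; how jiebaRS deduplicates or normalizes entries internally is not specified here. The package name jiebaRS is an assumption, and the worker call is skipped when it is not installed.

```r
# Stop words supplied directly as a character vector.
stop_inline <- c("\u7684", "\u4e86")   # two common function words ("de", "le")

# Stop words supplied through a file, one word per line.
stop_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(c("\u662f", "\u7684")), stop_path, useBytes = TRUE)

# Illustration only: a base-R preview of the combined list.
# This is not jiebaRS's own merge or normalization code.
stop_from_file <- readLines(stop_path, encoding = "UTF-8")
merged_preview <- union(stop_inline, stop_from_file)

# Assumption: the package namespace is "jiebaRS".
if (requireNamespace("jiebaRS", quietly = TRUE)) {
  w <- jiebaRS::worker(stop_word = stop_inline, stop_word_file = stop_path)
}
```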

In jiebaRS, hmm accepts either a logical scalar or a file path. A logical value controls whether the underlying jieba-rs segmentation/tagging pipeline may fall back to HMM for unknown terms. A character scalar is interpreted as a path to a custom HMM model file and enables HMM fallback with that model. The flag affects mix and query workers directly, tag workers through the underlying mixed tagging path, and keywords workers through TF-IDF keyword extraction. mp, hmm, and full workers ignore it because their jieba-rs backends do not consult this runtime switch.
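
To make the two accepted forms of hmm concrete, the sketch below defines hmm_mode(), a hypothetical helper that classifies a value according to the rules described above. It is not part of jiebaRS, and the package's own validation may differ. The file name my_hmm.model is a placeholder, and the package name jiebaRS is an assumption.

```r
# Hypothetical helper: classify a value as the hmm argument is documented.
hmm_mode <- function(hmm) {
  if (is.logical(hmm) && length(hmm) == 1L && !is.na(hmm)) {
    if (hmm) "enabled" else "disabled"
  } else if (is.character(hmm) && length(hmm) == 1L) {
    "enabled with custom model"
  } else {
    stop("hmm must be a logical scalar or a single file path")
  }
}

modes <- c(
  hmm_mode(TRUE),
  hmm_mode(FALSE),
  hmm_mode("my_hmm.model")   # placeholder path, never opened by this helper
)

# Assumption: the package namespace is "jiebaRS".
if (requireNamespace("jiebaRS", quietly = TRUE)) {
  # Mixed segmentation without HMM fallback for unknown terms.
  w_no_hmm <- jiebaRS::worker(type = "mix", hmm = FALSE)

  # Custom model: only attempted if the placeholder file actually exists.
  custom_model <- "my_hmm.model"
  if (file.exists(custom_model)) {
    w_custom <- jiebaRS::worker(type = "mix", hmm = custom_model)
  }
}
```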

dict and user load dictionary files at worker creation time. dict replaces the embedded main dictionary entirely; user appends entries to whichever main dictionary is in place (the embedded default or a custom dict). Both files use the same line format: word [freq] [tag], whitespace-separated, one entry per line. freq is an integer word frequency and tag is a part-of-speech tag string. When omitted, freq defaults to 0 and tag to the empty string; this applies to user files as well.
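
The line format can be illustrated with a small user dictionary that exercises all three shapes of an entry. parse_dict_line() below is a hypothetical base-R reader written only to show the documented defaults; it is not the parser jiebaRS or jieba-rs uses, and it assumes a second field, when present, is always freq. The entries and the package name jiebaRS are assumptions for illustration.

```r
# Three entries: "word freq tag", "word freq", and "word" alone.
user_path <- tempfile(fileext = ".txt")
user_lines <- c(
  "\u4e91\u8ba1\u7b97 10 n",          # "cloud computing", freq 10, tag n
  "\u673a\u5668\u5b66\u4e60 5",       # "machine learning", freq 5, no tag
  "\u6df1\u5ea6\u5b66\u4e60"          # "deep learning", no freq, no tag
)
writeLines(enc2utf8(user_lines), user_path, useBytes = TRUE)

# Hypothetical reader applying the documented defaults (freq 0, empty tag).
parse_dict_line <- function(line) {
  fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  list(
    word = fields[1],
    freq = if (length(fields) >= 2L) as.integer(fields[2]) else 0L,
    tag  = if (length(fields) >= 3L) fields[3] else ""
  )
}

parsed <- lapply(readLines(user_path, encoding = "UTF-8"), parse_dict_line)

# Assumption: the package namespace is "jiebaRS".
if (requireNamespace("jiebaRS", quietly = TRUE)) {
  # Entries are appended to the embedded main dictionary because dict is NULL.
  w <- jiebaRS::worker(user = user_path)
}
```

Passing the same file as dict instead of user would replace the embedded dictionary with these three entries rather than extend it.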