This function initializes a jiebaRS worker of the requested type. See Details for more information.
Usage
worker(
type = c("mix", "mp", "hmm", "full", "query", "tag", "keywords", "textrank"),
stop_word = NULL,
stop_word_file = NULL,
hmm = TRUE,
topn = 5L,
idf = NULL,
dict = NULL,
user = NULL,
symbol = FALSE,
bylines = FALSE
)
Arguments
- type
Worker type. Supported values are "mix", "mp", "hmm", "full", "query", "tag", "keywords", and "textrank". Default is "mix".
- stop_word
Optional character vector of stop words supplied directly.
- stop_word_file
Optional file path containing one stop word per line.
- hmm
Logical scalar or character scalar. If logical, controls whether to enable HMM fallback for unknown terms. If character, must be a path to a custom HMM model file compatible with jieba-rs's hmm.model format, and HMM fallback is enabled with that model. Default is TRUE.
- topn
Integer. The number of terms returned by keywords and textrank workers. Default is 5.
- idf
Optional character scalar. A path to a custom IDF dictionary file for keywords workers. Each line should be word idf_value. When NULL, the embedded default IDF dictionary is used. Ignored by non-keyword workers. Default is NULL.
- dict
Optional character scalar. A path to a custom main dictionary file that replaces the embedded dictionary. Each line should be word [freq] [tag] (whitespace-separated; freq defaults to 0, tag defaults to empty). When NULL, the embedded dictionary is used. Default is NULL.
- user
Optional character scalar. A path to a user dictionary file whose entries are appended to the main dictionary. Same line format as dict: word [freq] [tag]. Default is NULL.
- symbol
Logical. Whether to keep symbol-like tokens in the sentence. Default is FALSE.
- bylines
Deprecated compatibility argument retained from jiebaR. jiebaRS no longer uses this value; control batch aggregation directly in specific functions.
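
To make the idf file format concrete, here is a minimal sketch that writes a small IDF file, checks that every line has the word idf_value shape, and passes the path to a keywords worker. The entries and values are made up, a single space is assumed as the separator, and the worker is only created when jiebaRS is installed.

```r
# Made-up IDF entries, one "word idf_value" pair per line.
# "\u5317\u4eac" is "Beijing", "\u5927\u5b66" is "university".
idf_lines <- c("\u5317\u4eac 5.2", "\u5927\u5b66 4.8", "jiebaRS 11.0")
idf_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(idf_lines), idf_path, useBytes = TRUE)

# Sanity-check the file before handing it to a worker: exactly two
# whitespace-separated fields per line, the second one numeric.
fields <- strsplit(readLines(idf_path, encoding = "UTF-8"), "[[:space:]]+")
idf_ok <- all(lengths(fields) == 2L) &&
  !anyNA(suppressWarnings(as.numeric(vapply(fields, `[`, "", 2L))))

# Build a keywords worker with the custom IDF file, if jiebaRS is installed.
if (idf_ok && requireNamespace("jiebaRS", quietly = TRUE)) {
  kw <- jiebaRS::worker(type = "keywords", idf = idf_path, topn = 3L)
}
```

The check is only a convenience for catching malformed lines early; it is not part of jiebaRS.
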
Details
The qmax argument is not supported. Although jiebaR documented
qmax for query workers, the value was never actually passed to the
underlying segmentation call. Similarly, the jieba-rs backend implements
search-mode segmentation without a configurable query threshold. To avoid
user confusion, jiebaRS omits the qmax argument entirely rather than
retaining a no-op parameter.
jieba-rs does not expose dedicated public implementations for mp or
hmm workers. jiebaRS therefore maps mp to cut(..., false) and hmm
to cut(..., true). This is a compatibility approximation rather than a
byte-for-byte reimplementation of jiebaR, and jiebaRS warns once per R
session when either type is requested.
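
The mapping and the once-per-session warning can be pictured with the sketch below. It is not the jiebaRS source: the names .compat_state and compat_hmm_flag are hypothetical, the warning text is invented, and it assumes a single shared warning for both types rather than one per type.

```r
# Hypothetical package-level state: remembers whether the notice was shown.
.compat_state <- new.env(parent = emptyenv())

# Warn the first time an approximated worker type is requested, then stay quiet.
# Returns the HMM flag that the text says is passed to cut() for that type.
compat_hmm_flag <- function(type) {
  stopifnot(type %in% c("mp", "hmm"))
  if (!isTRUE(.compat_state$warned)) {
    warning("'", type, "' is approximated via cut(); results may differ from jiebaR",
            call. = FALSE)
    .compat_state$warned <- TRUE
  }
  identical(type, "hmm")  # mp -> cut(..., false), hmm -> cut(..., true)
}
```

Keeping the flag in an environment rather than a global variable is a common R idiom for session-scoped state that survives across function calls.
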
tag workers use jieba-rs tagging on top of the default mixed
segmentation path, which is the closest public behavior to jiebaR.
stop_word and stop_word_file can both be supplied at once; the two
sources are merged and then normalized.
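
A rough sketch of that merge in base R follows. The helper merge_stop_words is hypothetical, and the normalization steps shown (trimming whitespace, dropping empty entries, removing duplicates) are assumptions, since the exact rules jiebaRS applies are not spelled out here.

```r
# Hypothetical helper: merge directly supplied stop words with those in a file.
merge_stop_words <- function(stop_word = NULL, stop_word_file = NULL) {
  from_file <- if (is.null(stop_word_file)) {
    character()
  } else {
    readLines(stop_word_file, encoding = "UTF-8", warn = FALSE)
  }
  merged <- trimws(c(stop_word, from_file))  # assumed: strip surrounding whitespace
  unique(merged[nzchar(merged)])             # assumed: drop empties and duplicates
}

# One stop word per line, including a padded entry and a blank line.
sw_path <- tempfile(fileext = ".txt")
writeLines(c("the", " of ", "", "and"), sw_path)

merge_stop_words(stop_word = c("and", "a"), stop_word_file = sw_path)
# returns c("and", "a", "the", "of")
```
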
In jiebaRS, hmm accepts either a logical scalar or a file path. A
logical value controls whether the underlying jieba-rs
segmentation/tagging pipeline may fall back to HMM for unknown terms. A
character scalar is interpreted as a path to a custom HMM model file and
enables HMM fallback with that model. The flag affects mix and query
workers directly, tag workers through the underlying mixed tagging path,
and keywords workers through TF-IDF keyword extraction. mp, hmm, and
full workers ignore the flag because their jieba-rs backends never
consult it.
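
The two accepted forms of hmm can be summarized in a short sketch. resolve_hmm and hmm_flag_used are hypothetical names, the error messages are invented, and the file-existence check is an assumption about reasonable validation rather than documented behavior.

```r
# Hypothetical helper: interpret the hmm argument as described above.
resolve_hmm <- function(hmm = TRUE) {
  if (is.logical(hmm) && length(hmm) == 1L && !is.na(hmm)) {
    return(list(enabled = hmm, model = NULL))   # embedded model, on or off
  }
  if (is.character(hmm) && length(hmm) == 1L && !is.na(hmm)) {
    if (!file.exists(hmm)) stop("HMM model file not found: ", hmm)
    return(list(enabled = TRUE, model = hmm))   # custom model implies fallback
  }
  stop("hmm must be a logical scalar or a path to an HMM model file")
}

# Worker types for which the text says the runtime flag has an effect.
# "textrank" is not mentioned in the text, so it is deliberately left as NA.
hmm_flag_used <- c(mix = TRUE, query = TRUE, tag = TRUE, keywords = TRUE,
                   mp = FALSE, hmm = FALSE, full = FALSE, textrank = NA)
```
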
dict and user load dictionary files at worker creation time. dict
replaces the embedded main dictionary entirely; user appends entries
to whatever main dictionary is in place (default or custom dict). Both
files use the same line format: word [freq] [tag], whitespace-separated,
one entry per line. freq is an integer word frequency (default 0 if
omitted); tag is a part-of-speech tag string (default empty if omitted).
For user files, a word with no freq is assigned frequency 0.
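
As an illustration of the line format, the sketch below writes a small user dictionary and parses it with the documented defaults. parse_dict_line is a hypothetical helper, not the parser jiebaRS uses, the entries are made up, and it assumes a tag only ever appears after a freq. The worker is only created when jiebaRS is installed.

```r
# Hypothetical parser for one dictionary line in the "word [freq] [tag]" format.
parse_dict_line <- function(line) {
  parts <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  list(
    word = parts[1],
    freq = if (length(parts) >= 2L) as.integer(parts[2]) else 0L,  # default 0
    tag  = if (length(parts) >= 3L) parts[3] else ""               # default empty
  )
}

# Made-up user dictionary: full entry, word with frequency only, bare word.
user_lines <- c("\u4e91\u8ba1\u7b97 100 n", "jiebaRS 5", "\u5206\u8bcd")
user_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(user_lines), user_path, useBytes = TRUE)

entries <- lapply(readLines(user_path, encoding = "UTF-8"), parse_dict_line)

# The file is read at worker creation time, so it must exist before this call.
if (requireNamespace("jiebaRS", quietly = TRUE)) {
  w <- jiebaRS::worker(user = user_path)
}
```

Because the files are loaded when the worker is created, later edits to user_path would presumably require creating a new worker to take effect.
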