`i6_core.text.label.sentencepiece.train`

class i6_core.text.label.sentencepiece.train.SentencePieceType(value)

Enumeration of the SentencePiece model types that can be trained.

BPE = 'bpe'

CHAR = 'char'

UNIGRAM = 'unigram'

WORD = 'word'
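As a minimal sketch of how such an enumeration behaves, the block below mirrors the four documented members with Python's standard `enum` module. The class here is a local stand-in, not the one imported from i6_core, and the variable names are illustrative only.

```python
from enum import Enum


class SentencePieceType(Enum):
    """Stand-in mirroring the documented members; not imported from i6_core."""

    BPE = "bpe"
    CHAR = "char"
    UNIGRAM = "unigram"
    WORD = "word"


# Look up a member by its string value, e.g. when the type comes from a config.
model_type = SentencePieceType("unigram")

# The string value is the form a trainer option would typically expect.
model_type_value = model_type.value

# Members iterate in definition order.
member_names = [member.name for member in SentencePieceType]
```

Because the constructor accepts a value, an unknown string such as `"wordpiece"` would raise a `ValueError` rather than silently selecting a model type.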

class i6_core.text.label.sentencepiece.train.TrainSentencePieceJob(*args, **kwargs)

Train a SentencePiece model to be used with RETURNN.

See also https://returnn.readthedocs.io/en/latest/api/datasets.util.vocabulary.html#returnn.datasets.util.vocabulary.SentencePieces

Parameters:

training_text (tk.Path) – training text, either plain or gzip-compressed
vocab_size (int) – target vocabulary size for the created model
model_type (SentencePieceType) – which SentencePiece model type to use; use UNIGRAM for a “typical” SentencePiece model
character_coverage (float) – the official SentencePiece default is 0.9995, but that value caused the least-used character to be dropped entirely
additional_options (dict|None) – additional trainer options, see https://github.com/google/sentencepiece/blob/master/doc/options.md
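The parameters above correspond closely to options of the SentencePiece trainer. The following is a minimal sketch of how they might be collected into one options dictionary. The helper `build_trainer_options`, the `model_prefix` value, and the file name `train.txt` are hypothetical and not part of i6_core; the enum is a local stand-in so the sketch runs without i6_core installed. The option names `input`, `model_prefix`, `vocab_size`, `model_type`, `character_coverage` and `split_digits` are taken from the SentencePiece options list.

```python
from enum import Enum
from typing import Any, Dict, Optional


class SentencePieceType(Enum):
    # Stand-in for the documented enum so the sketch runs without i6_core.
    BPE = "bpe"
    CHAR = "char"
    UNIGRAM = "unigram"
    WORD = "word"


def build_trainer_options(
    training_text: str,
    vocab_size: int,
    model_type: SentencePieceType,
    character_coverage: float,
    additional_options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect the documented parameters as trainer options."""
    options: Dict[str, Any] = {
        "input": training_text,
        "model_prefix": "spm_out",  # assumed output prefix, for illustration only
        "vocab_size": vocab_size,
        "model_type": model_type.value,  # the trainer expects the string value
        "character_coverage": character_coverage,
    }
    # Extra options are merged last so they can extend the base set.
    options.update(additional_options or {})
    return options


options = build_trainer_options(
    "train.txt",
    vocab_size=5000,
    model_type=SentencePieceType.BPE,
    character_coverage=1.0,  # full coverage keeps even the rarest character
    additional_options={"split_digits": True},
)
```

With the `sentencepiece` package installed, a dictionary like this could be passed on as `sentencepiece.SentencePieceTrainer.train(**options)`; whether the job assembles its call in this way is an assumption.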

run()

tasks()

Returns: yields Tasks
Return type: list[sisyphus.task.Task]