i6_core.text.label.sentencepiece.train
- class i6_core.text.label.sentencepiece.train.SentencePieceType(value)
An enumeration.
- BPE = 'bpe'
- CHAR = 'char'
- UNIGRAM = 'unigram'
- WORD = 'word'
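The `(value)` in the class signature is the standard Python `Enum` constructor, so a member can be looked up from its string value. A minimal sketch follows. To stay self-contained it uses a local stand-in enum that mirrors the members listed above instead of importing i6_core.

```python
from enum import Enum


class SentencePieceType(Enum):
    """Local stand-in mirroring the documented members (not the i6_core class itself)."""

    BPE = "bpe"
    CHAR = "char"
    UNIGRAM = "unigram"
    WORD = "word"


# Look up a member by its value, e.g. when the model type comes from a config string.
model_type = SentencePieceType("unigram")
print(model_type.name, model_type.value)
```

An unknown string such as `SentencePieceType("wordpiece")` raises `ValueError`, which makes typos in a configuration fail early.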
- class i6_core.text.label.sentencepiece.train.TrainSentencePieceJob(*args, **kwargs)
Train a SentencePiece model for use with RETURNN.
- Parameters:
training_text (tk.Path) – raw text or gzipped text
vocab_size (int) – target vocabulary size for the created model
model_type (SentencePieceType) – which SentencePiece model type to use; use UNIGRAM for a “typical” SPM
character_coverage (float) – fraction of characters covered by the model; the official default is 0.9995, but this caused the least-used character to be dropped entirely
additional_options (dict|None) – additional trainer options, see https://github.com/google/sentencepiece/blob/master/doc/options.md
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
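To illustrate how the parameters above relate to SentencePiece trainer options, here is a hedged sketch that assembles a keyword dictionary of the kind accepted by `sentencepiece.SentencePieceTrainer.train(**kwargs)`. It is not the job's actual implementation. The helper name `build_trainer_kwargs`, the `"spm_out"` model prefix, and the default `character_coverage=1.0` are assumptions made for this example.

```python
def build_trainer_kwargs(
    training_text,
    vocab_size,
    model_type="unigram",
    character_coverage=1.0,  # assumed default for this sketch: keep every character
    additional_options=None,
):
    """Hypothetical helper: map the documented job parameters onto trainer options."""
    kwargs = {
        "input": training_text,
        "model_prefix": "spm_out",  # assumed output prefix, chosen for illustration
        "vocab_size": vocab_size,
        "model_type": model_type,  # one of 'bpe', 'char', 'unigram', 'word'
        "character_coverage": character_coverage,
    }
    # Extra trainer options are merged last, so they can override the entries above.
    kwargs.update(additional_options or {})
    return kwargs


example = build_trainer_kwargs(
    "corpus.txt",
    vocab_size=5000,
    additional_options={"user_defined_symbols": "<blank>"},
)
print(example)
```

With the `sentencepiece` package installed, the resulting dictionary could be passed as `sentencepiece.SentencePieceTrainer.train(**example)`. Merging `additional_options` last is a design choice of this sketch, not a documented behaviour of the job.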