i6_core.text.label.sentencepiece.train

class i6_core.text.label.sentencepiece.train.SentencePieceType(value)

Enumeration of the SentencePiece model types that can be trained.

BPE = 'bpe'
CHAR = 'char'
UNIGRAM = 'unigram'
WORD = 'word'

class i6_core.text.label.sentencepiece.train.TrainSentencePieceJob(*args, **kwargs)

Train a SentencePiece model to be used with RETURNN.

See also https://returnn.readthedocs.io/en/latest/api/datasets.util.vocabulary.html#returnn.datasets.util.vocabulary.SentencePieces

Parameters:
  • training_text (tk.Path) – training text file, either raw or gzipped

  • vocab_size (int) – target vocabulary size for the created model

  • model_type (SentencePieceType) – which SentencePiece model type to train; use "UNIGRAM" for a "typical" SPM

  • character_coverage (float) – fraction of characters in the training text that the model must cover; the official SentencePiece default is 0.9995, but with that value the least-used character was dropped entirely

  • additional_options (dict|None) – additional trainer options, see https://github.com/google/sentencepiece/blob/master/doc/options.md
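
The note on character_coverage can be illustrated with a small, self-contained sketch. It is a simplified model of frequency-based character selection, not SentencePiece's actual implementation; the function name covered_characters and the sample text are made up for illustration.

```python
from collections import Counter


def covered_characters(text, character_coverage):
    """Return the characters kept under a given coverage threshold.

    Simplified illustration: characters are taken in order of decreasing
    frequency until the accumulated share of all character occurrences
    reaches `character_coverage`. Everything after that point is dropped.
    """
    counts = Counter(text)
    total = sum(counts.values())
    kept = set()
    accumulated = 0
    for char, count in counts.most_common():
        # Stop as soon as the already kept characters reach the threshold.
        if accumulated / total >= character_coverage:
            break
        kept.add(char)
        accumulated += count
    return kept


# Hypothetical corpus: "z" occurs only once in 10000 characters.
sample = "a" * 5000 + "b" * 4999 + "z"

print(sorted(covered_characters(sample, 0.9995)))  # ['a', 'b']
print(sorted(covered_characters(sample, 1.0)))     # ['a', 'b', 'z']
```

In this toy corpus the two frequent characters already account for 99.99% of all occurrences, so a threshold of 0.9995 is met before the rare character is considered. A threshold of 1.0 keeps every character.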

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]
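
A minimal sketch of how TrainSentencePieceJob might be instantiated, using only the parameters listed above. It assumes that i6_core and Sisyphus are installed; the helper name make_spm_job, the concrete option values, and the choice of split_digits as an additional trainer option are illustrative assumptions, not recommended settings.

```python
# Illustrative settings only; the values are assumptions, not recommendations.
SPM_ARGS = {
    "vocab_size": 5000,
    # 1.0 keeps every character, avoiding the dropped-character issue noted above.
    "character_coverage": 1.0,
    # Passed on to the SentencePiece trainer; see the options documentation.
    "additional_options": {"split_digits": True},
}


def make_spm_job(training_text):
    """Create the training job for `training_text` (a tk.Path to raw or gzipped text).

    Hypothetical helper: the import is done lazily so that this module can be
    loaded even where i6_core is not available.
    """
    from i6_core.text.label.sentencepiece.train import (
        SentencePieceType,
        TrainSentencePieceJob,
    )

    return TrainSentencePieceJob(
        training_text=training_text,
        model_type=SentencePieceType.UNIGRAM,
        **SPM_ARGS,
    )
```

Inside a Sisyphus setup, make_spm_job would be called with a tk.Path pointing at the training text; the path itself depends on the local setup.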