i6_core.text.label.subword_nmt.train

class i6_core.text.label.subword_nmt.train.ReturnnTrainBpeJob(*args, **kwargs)

Create BPE codes and vocab files compatible with RETURNN BytePairEncoding.

This job can be used to produce BPE codes compatible to legacy (non-sisyphus) RETURNN setups.

Outputs:
  • bpe_codes: the codes file to apply BPE to any text

  • bpe_vocab: the index vocab in the form of {“<token>”: <index>, …} that can be used e.g. for RETURNN

    Will contain <s> and </s> pointing to index 0 and the unk_label pointing to index 1 (see the illustrative example after this list).

  • bpe_dummy_count_vocab: a text file containing all words, to be used with the ApplyBPEToTextJob

    DOES NOT INCLUDE COUNTS; each count is simply set to -1. This is used to avoid invalid merges when converting text to the BPE form.

  • vocab_size: variable containing the number of indices
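
For illustration only, the bpe_vocab file holds a Python-dict-like mapping as described above; the actual tokens and indices depend on the corpus, bpe_size, and unk_label (here assumed to be "<unk>"):

    {
    "<s>": 0,
    "</s>": 0,
    "<unk>": 1,
    "the": 2,
    "and": 3,
    ...
    }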

Parameters:
  • text_file – corpus text file, .gz compressed or uncompressed

  • bpe_size (int) – number of BPE merge operations

  • unk_label (str) – unknown label

  • subword_nmt_repo (Path|None) – subword-nmt repository path; see also CloneGitRepositoryJob

run()

tasks()

Returns:
    yields Task objects

Return type:
    list[sisyphus.task.Task]
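
A minimal usage sketch for a Sisyphus config is shown below. The corpus path, the BPE size, and the registered output names are placeholders; the module path of CloneGitRepositoryJob and the out_-prefixed job output attributes are assumptions that should be checked against the installed i6_core version:

    from sisyphus import tk

    from i6_core.text.label.subword_nmt.train import ReturnnTrainBpeJob
    from i6_core.tools.git import CloneGitRepositoryJob  # assumed module path

    # fresh checkout of the subword-nmt scripts used to learn the codes
    subword_nmt_repo = CloneGitRepositoryJob(
        "https://github.com/rsennrich/subword-nmt"
    ).out_repository  # assumed output attribute

    bpe_job = ReturnnTrainBpeJob(
        text_file=tk.Path("/path/to/corpus.txt.gz"),  # placeholder corpus
        bpe_size=10000,  # placeholder number of BPE merge operations
        unk_label="<unk>",
        subword_nmt_repo=subword_nmt_repo,
    )

    # register codes and vocab, e.g. for a RETURNN BytePairEncoding setup
    tk.register_output("bpe/bpe.codes", bpe_job.out_bpe_codes)  # assumed attribute
    tk.register_output("bpe/bpe.vocab", bpe_job.out_bpe_vocab)  # assumed attribute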

class i6_core.text.label.subword_nmt.train.TrainBPEModelJob(*args, **kwargs)

Create a BPE codes file using the official subword-nmt repository, installed either from pip or from https://github.com/rsennrich/subword-nmt.

This job is deprecated. To create BPE codes compatible with legacy (non-sisyphus) RETURNN setups, e.g. using language models from Kazuki, please use the ReturnnTrainBpeJob.

Otherwise, please consider using the sentencepiece implementation.

Parameters:
  • text_corpus (Path) – corpus text file to learn the BPE codes from

  • symbols (int) – number of BPE merge operations / symbols to learn

  • min_frequency (int) – minimum pair frequency required for a merge

  • dict_input (bool) – if set, the input is interpreted as a dictionary of word-count pairs

  • total_symbols (bool) – if set, the number of characters is subtracted from the requested symbols, so that symbols approximates the total vocabulary size

  • subword_nmt_repo (Optional[Path]) – subword-nmt repository path; see also CloneGitRepositoryJob

run()

tasks()

Returns:
    yields Task objects

Return type:
    list[sisyphus.task.Task]
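
For completeness, a hedged instantiation sketch of this deprecated job follows; all values are placeholders, and the assumption that subword_nmt_repo=None falls back to a pip-installed subword-nmt mirrors the class description above:

    from sisyphus import tk

    from i6_core.text.label.subword_nmt.train import TrainBPEModelJob

    # deprecated: prefer ReturnnTrainBpeJob (or sentencepiece) for new pipelines
    bpe_model_job = TrainBPEModelJob(
        text_corpus=tk.Path("/path/to/corpus.txt.gz"),  # placeholder corpus
        symbols=10000,        # placeholder number of BPE symbols
        min_frequency=2,      # placeholder minimum pair frequency
        dict_input=False,
        total_symbols=False,
        subword_nmt_repo=None,  # assumption: pip-installed subword-nmt is used
    )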