i6_core.bpe.train

This is the old location of the BPE jobs, kept for backwards compatibility. New setups that use the subword-nmt based BPE should use i6_core.label.bpe; other setups should switch to the sentencepiece implementation.

class i6_core.bpe.train.ReturnnTrainBpeJob(*args, **kwargs)

Create BPE codes and vocab files compatible with RETURNN BytePairEncoding. Repository:

Parameters:
  • text_file – corpus text file, .gz compressed or uncompressed

  • bpe_size (int) – number of BPE merge operations

  • unk_label (str) – unknown label

  • subword_nmt_repo (Path|None) – path to the subword-nmt repository. See also CloneGitRepositoryJob.
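
To make the meaning of bpe_size (the number of merge operations) concrete, the following is a minimal sketch of BPE merge learning in plain Python. It is not the code used by ReturnnTrainBpeJob or subword-nmt. The function name learn_bpe, the min_frequency argument, the end-of-word marker and the lexicographic tie-breaking are assumptions made for illustration only.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges, min_frequency=1):
    """Toy BPE learner (hypothetical helper): returns the ordered list of merged pairs."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word count.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption
        # made here for determinism, real implementations may differ).
        best, freq = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges


counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(counts, num_merges=3))
# prints [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each iteration adds exactly one merge rule, so a larger bpe_size yields longer subword units and a larger vocabulary.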

class i6_core.bpe.train.TrainBPEModelJob(*args, **kwargs)

Create a BPE codes file using the official subword-nmt implementation, either installed via pip or taken from the repository at https://github.com/rsennrich/subword-nmt

Parameters:
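  • text_corpus (Path) – text corpus to train the BPE codes on

  • symbols (int) – number of BPE symbols (merge operations) to create, as in the --symbols option of subword-nmt

  • min_frequency (int) – stop when no symbol pair reaches this frequency, as in the --min-frequency option of subword-nmt

  • dict_input (bool) – if set, interpret the input as a dictionary with one word-count pair per line, as in the --dict-input option of subword-nmt

  • total_symbols (bool) – if set, subtract the number of characters from symbols, so that symbols approximates the total number of symbols, as in the --total-symbols option of subword-nmt

  • subword_nmt_repo (Optional[Path]) – path to a subword-nmt repository checkout, as an alternative to the pip installation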
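
The parameter names above mirror options of the learn-bpe tool in subword-nmt. As a rough sketch of how such parameters could translate into a command line, the hypothetical helper below builds an argument list without running it. The function name build_learn_bpe_command, the out_codes argument and the exact mapping are assumptions for illustration and are not taken from the TrainBPEModelJob implementation.

```python
import os
import sys


def build_learn_bpe_command(
    text_corpus,
    out_codes,
    symbols,
    min_frequency,
    dict_input=False,
    total_symbols=False,
    subword_nmt_repo=None,
):
    """Hypothetical helper: assemble a subword-nmt learn-bpe call as an argv list."""
    if subword_nmt_repo is not None:
        # Assumed layout of a repository checkout: subword_nmt/learn_bpe.py
        script = os.path.join(subword_nmt_repo, "subword_nmt", "learn_bpe.py")
        cmd = [sys.executable, script]
    else:
        # Console entry point provided by the pip package
        cmd = ["subword-nmt", "learn-bpe"]
    cmd += [
        "--input", str(text_corpus),
        "--output", str(out_codes),
        "--symbols", str(symbols),
        "--min-frequency", str(min_frequency),
    ]
    # Boolean parameters become plain flags that are only added when set.
    if dict_input:
        cmd.append("--dict-input")
    if total_symbols:
        cmd.append("--total-symbols")
    return cmd


print(build_learn_bpe_command("corpus.txt", "bpe.codes", symbols=1000, min_frequency=2))
```

Returning a list rather than a shell string keeps the sketch safe to pass to subprocess.run without shell quoting.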