i6_core.bpe.train
This is the old location of the BPE jobs, kept for backwards compatibility. New setups that use the subword-nmt based BPE should use i6_core.label.bpe; other setups should switch to the sentencepiece implementation.
- class i6_core.bpe.train.ReturnnTrainBpeJob(*args, **kwargs)
Create BPE codes and vocab files compatible with RETURNN BytePairEncoding. Repository:
- Parameters:
text_file – corpus text file, .gz compressed or uncompressed
bpe_size (int) – number of BPE merge operations
unk_label (str) – unknown label
subword_nmt_repo (Path|None) – path to the subword-nmt repository; see also CloneGitRepositoryJob
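To make the meaning of bpe_size (the number of merge operations) concrete, the following is a minimal toy sketch of BPE merge learning. It is not the subword-nmt or i6_core implementation; the function name learn_bpe_merges, the end-of-word marker "</w>", and the tie-breaking rule are assumptions chosen for illustration only.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE learner (illustrative only): return the learned merges in order.

    `num_merges` plays the role that bpe_size plays for the job above.
    """
    # Each word becomes a tuple of symbols plus an assumed end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption made here only to keep the result deterministic).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2))
```

A larger number of merges yields longer subword units and a larger vocabulary; the loop stops early once every word has collapsed into a single symbol.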
- class i6_core.bpe.train.TrainBPEModelJob(*args, **kwargs)
Create a BPE codes file using the official subword-nmt repo, either installed from pip or cloned from https://github.com/rsennrich/subword-nmt
- Parameters:
text_corpus (Path) –
symbols (int) –
min_frequency (int) –
dict_input (bool) –
total_symbols (bool) –
subword_nmt_repo (Optional[Path]) –
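The parameters above are listed without descriptions. As a rough orientation, the sketch below shows how such values could plausibly map onto the subword-nmt learn-bpe command line. The helper build_learn_bpe_command, its output_codes argument, and the default for min_frequency are hypothetical and not part of i6_core; the job's real command construction may differ. Only the command-line flags themselves come from subword-nmt.

```python
from typing import List, Optional


def build_learn_bpe_command(
    text_corpus: str,
    output_codes: str,
    symbols: int,
    min_frequency: int = 2,  # illustrative default, not taken from i6_core
    dict_input: bool = False,
    total_symbols: bool = False,
    subword_nmt_repo: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble a subword-nmt learn-bpe call as an argv list."""
    if subword_nmt_repo is not None:
        # Assumption: run the script from a cloned checkout of the repository.
        cmd = ["python3", f"{subword_nmt_repo}/subword_nmt/learn_bpe.py"]
    else:
        # Assumption: use the console script provided by the pip package.
        cmd = ["subword-nmt", "learn-bpe"]
    cmd += [
        "--input", text_corpus,
        "--output", output_codes,
        "--symbols", str(symbols),
        "--min-frequency", str(min_frequency),
    ]
    if dict_input:
        # Treat the input as a dictionary of words with their frequencies.
        cmd.append("--dict-input")
    if total_symbols:
        # Subtract the number of characters from the number of merges.
        cmd.append("--total-symbols")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_learn_bpe_command("corpus.txt", "bpe.codes", symbols=1000)))
```

Returning the command as a list rather than a single string keeps it safe to pass to subprocess.run without shell quoting.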