i6_core.text.label.subword_nmt.train
- class i6_core.text.label.subword_nmt.train.ReturnnTrainBpeJob(*args, **kwargs)
Create BPE codes and vocab files compatible with the RETURNN BytePairEncoding. Repository:
This job can be used to produce BPE codes compatible with legacy (non-sisyphus) RETURNN setups.
- Outputs:
  - bpe_codes: the codes file used to apply BPE to any text
  - bpe_vocab: the index vocab in the form of {“<token>”: <index>, …} that can be used e.g. for RETURNN. It contains <s> and </s> pointing to index 0 and the unk_label pointing to index 1.
  - bpe_dummy_count_vocab: a text file containing all words, to be used with the ApplyBPEToTextJob. It does not include real counts; every count is set to -1. This avoids invalid merges when converting text to the BPE form.
  - vocab_size: variable containing the number of indices
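The bpe_vocab index layout described above can be sketched in plain Python. The helper name and the subword tokens below are hypothetical; only the index scheme (<s> and </s> at index 0, the unk_label at index 1, subwords from index 2) follows the description:

```python
def build_index_vocab(subwords, unk_label="<unk>"):
    # <s> and </s> share index 0, the unknown label gets index 1,
    # and the BPE subwords are numbered from index 2 onwards.
    vocab = {"<s>": 0, "</s>": 0, unk_label: 1}
    for i, sub in enumerate(subwords, start=2):
        vocab[sub] = i
    return vocab

# Toy subword list, made up for illustration ("@@" marks a non-final subword).
vocab = build_index_vocab(["the", "qu@@", "ick"])
num_indices = max(vocab.values()) + 1  # corresponds to the vocab_size output
print(vocab)         # {'<s>': 0, '</s>': 0, '<unk>': 1, 'the': 2, 'qu@@': 3, 'ick': 4}
print(num_indices)   # 5
```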
- Parameters:
text_file – corpus text file, .gz compressed or uncompressed
bpe_size (int) – number of BPE merge operations
unk_label (str) – unknown label
subword_nmt_repo (Path|None) – subword-nmt repository path; see also CloneGitRepositoryJob
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
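Since bpe_size above counts BPE merge operations, a minimal pure-Python sketch of a single merge step may help. This illustrates the general BPE algorithm only, not the subword-nmt implementation, and the toy corpus is made up:

```python
from collections import Counter

def most_frequent_pair(corpus):
    # corpus maps a word (as a tuple of symbols) to its count.
    pairs = Counter()
    for word, count in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += count
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # One BPE merge operation: join every adjacent occurrence of `pair`
    # into a single new symbol, in every word of the corpus.
    merged = {}
    for word, count in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + count
    return merged

# Toy word counts, made up for illustration.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 3}
pair = most_frequent_pair(corpus)   # ('l', 'o'), seen 10 times
corpus = merge_pair(corpus, pair)   # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2, ('lo', 'g'): 3}
```

bpe_size would simply repeat this step, recording each chosen pair as one line of the resulting codes file.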
- class i6_core.text.label.subword_nmt.train.TrainBPEModelJob(*args, **kwargs)
Create a BPE codes file using the official subword-nmt repo, either installed from pip or cloned from https://github.com/rsennrich/subword-nmt
This job is deprecated. To create BPE codes that are compatible with legacy (non-sisyphus) RETURNN setups, e.g. using language models from Kazuki, please use the ReturnnTrainBpeJob.
Otherwise, please consider using the sentencepiece implementation.
- Parameters:
text_corpus (Path) –
symbols (int) –
min_frequency (int) –
dict_input (bool) –
total_symbols (bool) –
subword_nmt_repo (Optional[Path]) –
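For the dict_input option, subword-nmt interprets the training text as a word-count dictionary with one “word count” pair per line (my reading of subword-nmt's --dict-input flag; the toy text below is made up):

```python
from collections import Counter

# Build a word-count dictionary of the form subword-nmt accepts when
# dict_input is enabled: one "word count" pair per line. The example
# sentence is made up for illustration.
text = "the quick brown fox jumps over the lazy dog the fox"
counts = Counter(text.split())
dict_lines = [f"{word} {count}" for word, count in counts.items()]
print("\n".join(dict_lines))  # first line: "the 3"
```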
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]