i6_core.text.label.subword_nmt.train

class i6_core.text.label.subword_nmt.train.ReturnnTrainBpeJob(*args, **kwargs)

Create BPE codes and vocab files compatible with RETURNN BytePairEncoding.

This job can be used to produce BPE codes compatible to legacy (non-sisyphus) RETURNN setups.

Outputs:
  • bpe_codes: the codes file to apply BPE to any text

  • bpe_vocab: the index vocab in the form of {“<token>”: <index>, …} that can be used e.g. for RETURNN

    Will contain <s> and </s> pointing to index 0 and the unk_label pointing to index 1 (see the illustrative example after this list).

  • bpe_dummy_count_vocab: a text file containing all words, to be used with the ApplyBPEToTextJob

    DOES NOT INCLUDE COUNTS; each count is simply set to -1. This is used to avoid invalid merges when converting text to the BPE form.

  • vocab_size: variable containing the number of indices
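
For illustration only, the bpe_vocab file holds a Python-dict-like mapping as described above; the actual tokens and indices depend on the corpus, bpe_size, and unk_label (here assumed to be "<unk>"):

    {
    "<s>": 0,
    "</s>": 0,
    "<unk>": 1,
    "the": 2,
    "and": 3,
    ...
    }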

Parameters:
  • text_file – corpus text file, .gz compressed or uncompressed

  • bpe_size (int) – number of BPE merge operations

  • unk_label (str) – unknown label

  • subword_nmt_repo (Path|None) – subword-nmt repository path; see also CloneGitRepositoryJob

run()

tasks()

Returns:
    yields Task objects

Return type:
    list[sisyphus.task.Task]
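
A minimal usage sketch for a Sisyphus config is shown below. The corpus path, the BPE size, and the registered output names are placeholders; the module path of CloneGitRepositoryJob and the out_-prefixed job output attributes are assumptions that should be checked against the installed i6_core version:

    from sisyphus import tk

    from i6_core.text.label.subword_nmt.train import ReturnnTrainBpeJob
    from i6_core.tools.git import CloneGitRepositoryJob  # assumed module path

    # fresh checkout of the subword-nmt scripts used to learn the codes
    subword_nmt_repo = CloneGitRepositoryJob(
        "https://github.com/rsennrich/subword-nmt"
    ).out_repository  # assumed output attribute

    bpe_job = ReturnnTrainBpeJob(
        text_file=tk.Path("/path/to/corpus.txt.gz"),  # placeholder corpus
        bpe_size=10000,  # placeholder number of BPE merge operations
        unk_label="<unk>",
        subword_nmt_repo=subword_nmt_repo,
    )

    # register codes and vocab, e.g. for a RETURNN BytePairEncoding setup
    tk.register_output("bpe/bpe.codes", bpe_job.out_bpe_codes)  # assumed attribute
    tk.register_output("bpe/bpe.vocab", bpe_job.out_bpe_vocab)  # assumed attribute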

class i6_core.text.label.subword_nmt.train.TrainBPEModelJob(*args, **kwargs)

Create a BPE codes file using the official subword-nmt repository, installed either from pip or from https://github.com/rsennrich/subword-nmt.

This job is deprecated. To create BPE codes compatible with legacy (non-sisyphus) RETURNN setups, e.g. using language models from Kazuki, please use the ReturnnTrainBpeJob.

Otherwise, please consider using the sentencepiece implementation.

Parameters:
  • text_corpus (Path) – corpus text file to learn the BPE codes from

  • symbols (int) – number of BPE merge operations / symbols to learn

  • min_frequency (int) – minimum pair frequency required for a merge

  • dict_input (bool) – if set, the input is interpreted as a dictionary of word-count pairs

  • total_symbols (bool) – if set, the number of characters is subtracted from the requested symbols, so that symbols approximates the total vocabulary size

  • subword_nmt_repo (Optional[Path]) – subword-nmt repository path; see also CloneGitRepositoryJob

run()

tasks()

Returns:
    yields Task objects

Return type:
    list[sisyphus.task.Task]
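
For completeness, a hedged instantiation sketch of this deprecated job follows; all values are placeholders, and the assumption that subword_nmt_repo=None falls back to a pip-installed subword-nmt mirrors the class description above:

    from sisyphus import tk

    from i6_core.text.label.subword_nmt.train import TrainBPEModelJob

    # deprecated: prefer ReturnnTrainBpeJob (or sentencepiece) for new pipelines
    bpe_model_job = TrainBPEModelJob(
        text_corpus=tk.Path("/path/to/corpus.txt.gz"),  # placeholder corpus
        symbols=10000,        # placeholder number of BPE symbols
        min_frequency=2,      # placeholder minimum pair frequency
        dict_input=False,
        total_symbols=False,
        subword_nmt_repo=None,  # assumption: pip-installed subword-nmt is used
    )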