i6_core.lm.vocabulary
- class i6_core.lm.vocabulary.LmIndexVocabulary(vocab: sisyphus.job_path.Path, vocab_size: sisyphus.job_path.Variable, unknown_token: Union[sisyphus.job_path.Variable, str])
- unknown_token: Union[Variable, str]
- vocab: Path
- vocab_size: Variable
- class i6_core.lm.vocabulary.LmIndexVocabularyFromLexiconJob(*args, **kwargs)
Computes a <word>: <index> vocabulary file from a bliss lexicon for word-level LM training.
Sentence begin and sentence end both map to index 0, and unknown maps to index 1. These are taken directly from the lexicon via the "special" marking:
<lemma special="sentence-begin"> -> index 0
<lemma special="sentence-end"> -> index 0
<lemma special="unknown"> -> index 1
If <synt> tokens are provided in a lemma, they are used instead of <orth>.
CAUTION: Be aware of https://github.com/rwth-i6/returnn/issues/1245 when using RETURNN's LmDataset.
- Parameters:
bliss_lexicon – lexicon whose lemmas are used to define the indices
count_ordering_text – optional text that can be used to define the index order based on the lemma count
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
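The index assignment described for LmIndexVocabularyFromLexiconJob can be illustrated with a small standalone sketch. This is not the i6_core implementation: the helper names (`lemma_words`, `build_index_vocab`) and the example lexicon are hypothetical, and it assumes that <synt> tokens are stored as <tok> children, that regular lemmas receive consecutive indices starting at 2 in lexicon order, and that no count-based ordering is applied.

```python
import xml.etree.ElementTree as ET

# Fixed indices for the special lemmas, as stated in the job description.
SPECIAL_INDICES = {"sentence-begin": 0, "sentence-end": 0, "unknown": 1}

# Hypothetical toy lexicon, for illustration only.
EXAMPLE_LEXICON = """<lexicon>
  <lemma special="sentence-begin"><orth>[SENTENCE-BEGIN]</orth><synt><tok>&lt;s&gt;</tok></synt></lemma>
  <lemma special="sentence-end"><orth>[SENTENCE-END]</orth><synt><tok>&lt;/s&gt;</tok></synt></lemma>
  <lemma special="unknown"><orth>[UNKNOWN]</orth><synt><tok>&lt;UNK&gt;</tok></synt></lemma>
  <lemma><orth>hello</orth></lemma>
  <lemma><orth>world</orth></lemma>
</lexicon>"""


def lemma_words(lemma):
    """Return the <synt> tokens of a lemma if present, otherwise its <orth> entries."""
    # Assumption: synt tokens are <tok> elements nested inside <synt>.
    synt_tokens = [tok.text.strip() for tok in lemma.findall("./synt/tok") if tok.text]
    if synt_tokens:
        return synt_tokens
    return [orth.text.strip() for orth in lemma.findall("orth") if orth.text]


def build_index_vocab(lexicon_xml):
    """Map every word to an index: 0 for sentence begin/end, 1 for unknown, 2+ otherwise."""
    root = ET.fromstring(lexicon_xml)
    vocab = {}
    next_index = 2  # assumption: regular words start right after the reserved indices
    for lemma in root.iter("lemma"):
        special = lemma.get("special")
        for word in lemma_words(lemma):
            if word in vocab:
                continue  # keep the first index assigned to a word
            if special in SPECIAL_INDICES:
                vocab[word] = SPECIAL_INDICES[special]
            else:
                vocab[word] = next_index
                next_index += 1
    return vocab


if __name__ == "__main__":
    # Write the result in the "<word>: <index>" style mentioned above.
    for word, index in build_index_vocab(EXAMPLE_LEXICON).items():
        print(f"{word}: {index}")
```

Because sentence begin and sentence end share index 0, the mapping from index back to word is not unique, which is worth keeping in mind together with the RETURNN issue linked above.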
- class i6_core.lm.vocabulary.VocabularyFromLmJob(*args, **kwargs)
Extracts the vocabulary from an existing LM. Currently only ARPA files are supported as input.
- Parameters:
lm_file (Path) – path to the LM ARPA file
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
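As a rough illustration of what extracting a vocabulary from an ARPA file involves, the sketch below collects the words listed in the `\1-grams:` section of an ARPA-format LM. It is not the code of VocabularyFromLmJob: the function name `vocab_from_arpa_lines` and the toy LM are hypothetical, and the sketch only assumes the standard ARPA layout in which each unigram line has the form `<log-prob> <word> [<backoff>]`.

```python
# Hypothetical toy LM in ARPA format, for illustration only.
EXAMPLE_ARPA = r"""\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.0 <s> -0.5
-1.0 </s>
-1.2 hello -0.3
-1.5 world

\2-grams:
-0.3 <s> hello
-0.4 hello world

\end\
"""


def vocab_from_arpa_lines(lines):
    """Return the words of the unigram section, in file order."""
    vocab = []
    in_unigrams = False
    for raw_line in lines:
        line = raw_line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams or not line:
            continue  # skip the header and blank separator lines
        if line.startswith("\\"):
            break  # next n-gram section or the \end\ marker
        fields = line.split()
        # fields[0] is the log probability, fields[1] the word itself.
        vocab.append(fields[1])
    return vocab


if __name__ == "__main__":
    for word in vocab_from_arpa_lines(EXAMPLE_ARPA.splitlines()):
        print(word)
```

For a real LM the same function could be fed an open file handle instead of the in-memory example, since it only iterates over lines.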