i6_core.lm.vocabulary

class i6_core.lm.vocabulary.LmIndexVocabulary(vocab: sisyphus.job_path.Path, vocab_size: sisyphus.job_path.Variable, unknown_token: Union[sisyphus.job_path.Variable, str])
unknown_token: Union[Variable, str]
vocab: Path
vocab_size: Variable
class i6_core.lm.vocabulary.LmIndexVocabularyFromLexiconJob(*args, **kwargs)

Computes a <word>: <index> vocabulary file from a bliss lexicon for word-level LM training.

Sentence begin/end will point to index 0, unknown to index 1. Both are taken directly from the lexicon via the “special” marking:

  • <lemma special="sentence-begin"> -> index 0

  • <lemma special="sentence-end"> -> index 0

  • <lemma special="unknown"> -> index 1

If <synt> tokens are provided in a lemma, they are used instead of <orth>.
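
The index assignment described above can be sketched in plain Python. This is an illustration under stated assumptions, not the job's actual implementation: the helper name index_vocab_from_lexicon, the sample lexicon, the skipping of lemmas that yield no token, and the rule that regular words receive consecutive indices starting at 2 in lexicon order are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample lexicon; real bliss lexica contain more elements.
LEXICON = """<lexicon>
  <lemma special="sentence-begin"><orth>[SENTENCE-BEGIN]</orth><synt><tok>&lt;s&gt;</tok></synt></lemma>
  <lemma special="sentence-end"><orth>[SENTENCE-END]</orth><synt><tok>&lt;/s&gt;</tok></synt></lemma>
  <lemma special="unknown"><orth>[UNKNOWN]</orth><synt><tok>&lt;UNK&gt;</tok></synt></lemma>
  <lemma><orth>hello</orth></lemma>
  <lemma><orth>world</orth></lemma>
</lexicon>"""

# Indices fixed by the "special" attribute, as listed in the description.
SPECIAL_INDEX = {"sentence-begin": 0, "sentence-end": 0, "unknown": 1}


def index_vocab_from_lexicon(lexicon_xml: str) -> dict:
    """Map lemma tokens to indices (hypothetical helper)."""
    vocab = {}
    next_index = 2  # assumption: regular words start after the special indices
    for lemma in ET.fromstring(lexicon_xml).iter("lemma"):
        # Prefer <synt> tokens when present, otherwise fall back to <orth>.
        synt = lemma.find("synt")
        if synt is not None:
            tokens = [tok.text.strip() for tok in synt.findall("tok") if tok.text]
        else:
            tokens = [orth.text.strip() for orth in lemma.findall("orth") if orth.text]
        special = lemma.get("special")
        for token in tokens:
            if token in vocab:
                continue  # keep the first index seen for a token
            if special in SPECIAL_INDEX:
                vocab[token] = SPECIAL_INDEX[special]
            else:
                vocab[token] = next_index
                next_index += 1
    return vocab


vocab = index_vocab_from_lexicon(LEXICON)
for word, index in vocab.items():
    print(f"{word}: {index}")
```

Note that sentence begin and sentence end share index 0 in this sketch, exactly as the mapping above states.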

CAUTION: Be aware of https://github.com/rwth-i6/returnn/issues/1245 when using RETURNN’s LmDataset.

Parameters:
  • bliss_lexicon – use the lemmas from the lexicon to define the indices

  • count_ordering_text – optional text that can be used to define the index order based on the lemma count
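
As a rough illustration of count-based ordering, the sketch below ranks words by how often they occur in a text and assigns indices in that order. The helper name order_by_count, the starting index of 2, the descending order, and the alphabetical tie-break are assumptions made for this example; the job's actual ordering rules may differ.

```python
from collections import Counter


def order_by_count(words, text_lines, first_index=2):
    """Assign indices to words by descending frequency in text_lines (hypothetical helper)."""
    counts = Counter()
    for line in text_lines:
        counts.update(line.split())
    # Assumption: most frequent word first, ties broken alphabetically.
    ranked = sorted(words, key=lambda w: (-counts[w], w))
    return {word: first_index + i for i, word in enumerate(ranked)}


ordered = order_by_count(
    words=["hello", "world", "again"],
    text_lines=["world hello world", "world again again"],
)
print(ordered)
```

Words that never occur in the text have a count of zero and therefore end up at the end of the ranking in this sketch.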

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.lm.vocabulary.VocabularyFromLmJob(*args, **kwargs)

Extract the vocabulary from an existing LM. Currently supports only ARPA files for input.

Parameters:

lm_file (Path) – path to the LM ARPA file
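
To illustrate what extracting a vocabulary from an ARPA file involves, the following sketch collects the words listed in the \1-grams: section. It is a minimal example, not the job's implementation: the helper name vocabulary_from_arpa and the sample data are hypothetical, and the format of the file the job actually writes is not covered here.

```python
def vocabulary_from_arpa(lines):
    """Return the words of the \\1-grams: section in file order (hypothetical helper)."""
    vocab = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams or not line:
            continue
        if line.startswith("\\"):
            break  # next section (e.g. \2-grams:) or \end\ reached
        # Unigram entry: <log10 prob> <word> [<backoff weight>]
        fields = line.split()
        vocab.append(fields[1])
    return vocab


# Tiny hand-written ARPA content, for illustration only.
ARPA_LINES = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\t<s>\t-0.3",
    "-1.2\t</s>",
    "-0.8\thello\t-0.2",
    "-0.9\tworld\t-0.1",
    "",
    "\\2-grams:",
    "-0.5\thello world",
    "",
    "\\end\\",
]

words = vocabulary_from_arpa(ARPA_LINES)
print(words)
```

Only the unigram section is read because every word known to the LM appears there once; higher-order sections repeat the same words in context.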

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]