i6_core.lexicon.conversion

class i6_core.lexicon.conversion.FilterLexiconByWordListJob(*args, **kwargs)

Filter lemmata to a given word list. Warning: the case_sensitive parameter does the opposite of what its name suggests. This behavior is kept for backwards compatibility.

Parameters:
  • bliss_lexicon (tk.Path) – lexicon file to be handled

  • word_list (tk.Path) – filter lexicon by this word list

  • case_sensitive (bool) – filter lemmata case-sensitively. Warning: this parameter does the opposite.

  • check_synt_tok (bool) – also keep lemmata whose syntactic token matches word_list
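
The filtering described above can be sketched in plain Python. This is only an illustration, not the job's implementation: the Lemma class and filter_lemmata function are hypothetical stand-ins, and the sketch assumes that "does the opposite" means case_sensitive=True lower-cases both sides before comparing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lemma:
    """Hypothetical stand-in for a bliss lexicon lemma."""

    orth: List[str]
    synt: Optional[List[str]] = None


def filter_lemmata(lemmata, words, case_sensitive=False, check_synt_tok=False):
    """Keep lemmata whose orth (or optionally synt token) is in `words`."""
    # Assumption: the flag is inverted, so True means "ignore case".
    if case_sensitive:
        normalize = str.lower
    else:
        normalize = str
    wanted = {normalize(w) for w in words}

    kept = []
    for lemma in lemmata:
        candidates = list(lemma.orth)
        if check_synt_tok and lemma.synt:
            candidates.extend(lemma.synt)
        if any(normalize(c) in wanted for c in candidates):
            kept.append(lemma)
    return kept
```

Under these assumptions, a lemma with orth "Hello" is kept for the word list ["hello"] only when case_sensitive=True, which is the counter-intuitive behavior the warning refers to.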

run()
tasks()
Returns:

yields Task’s

Return type:

list[sisyphus.task.Task]

class i6_core.lexicon.conversion.GraphemicLexiconFromWordListJob(*args, **kwargs)
default_transforms = {'+': 'PLUS', '.': 'DOT', '{': 'LBR', '}': 'RBR'}
run()
tasks()
Returns:

yields Task’s

Return type:

list[sisyphus.task.Task]
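
GraphemicLexiconFromWordListJob has no description here, so the following is only a guess at how default_transforms might be used: it assumes each character of a word becomes one pronunciation unit and that characters listed in the table are replaced by their symbolic names. The function name graphemic_pronunciation is hypothetical.

```python
# Copied from the class attribute documented above.
DEFAULT_TRANSFORMS = {"+": "PLUS", ".": "DOT", "{": "LBR", "}": "RBR"}


def graphemic_pronunciation(word, transforms=None):
    """Spell out `word` character by character, renaming special characters.

    Assumption: one grapheme per character, joined by single spaces.
    """
    table = DEFAULT_TRANSFORMS if transforms is None else transforms
    return " ".join(table.get(ch, ch) for ch in word)
```

For example, under these assumptions the word "a.b+" would map to the unit sequence "a DOT b PLUS".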

class i6_core.lexicon.conversion.LexiconFromTextFileJob(*args, **kwargs)

Create a bliss lexicon from a regular text file, where each line contains <WORD> <PHONEME1> <PHONEME2> …, separated by tabs or spaces. The lemmata are added in the order they appear in the text file; the phonemes are sorted alphabetically. Phoneme variants of the same word need to appear next to each other.

WARNING: No special lemmas or phonemes are added, so do not use this lexicon with RASR directly!

As the splitting is taken from RASR and not fully tested, it might not work in all cases, so do not use this job on new lexica without checking the output manually.

Parameters:
  • text_file (Path) –

  • compressed – save as .xml.gz
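
The expected input format can be illustrated with a small stand-alone parser. This is a simplified sketch, not the job's code: it uses Python's whitespace splitting instead of the RASR-derived splitting mentioned above, and the name parse_text_lexicon is hypothetical.

```python
def parse_text_lexicon(lines):
    """Parse '<WORD> <PHONEME1> <PHONEME2> ...' lines.

    Returns (entries, phonemes): entries is a list of (word, [pronunciations])
    in file order, phonemes is the alphabetically sorted phoneme inventory.
    """
    entries = []
    phonemes = set()
    for raw in lines:
        parts = raw.split()  # splits on tabs and spaces alike
        if not parts:
            continue  # skip empty lines
        word, phones = parts[0], parts[1:]
        phonemes.update(phones)
        pronunciation = " ".join(phones)
        # Variants are only grouped when they directly follow each other,
        # which is why the input must keep them adjacent.
        if entries and entries[-1][0] == word:
            entries[-1][1].append(pronunciation)
        else:
            entries.append((word, [pronunciation]))
    return entries, sorted(phonemes)
```

If two variants of a word are separated by another word, this sketch would produce two separate entries for it, which shows why the adjacency requirement matters.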

run()
tasks()
Returns:

yields Task’s

Return type:

list[sisyphus.task.Task]

class i6_core.lexicon.conversion.LexiconToWordListJob(*args, **kwargs)
run()
tasks()
Returns:

yields Task’s

Return type:

list[sisyphus.task.Task]

class i6_core.lexicon.conversion.LexiconUniqueOrthJob(*args, **kwargs)

Merge lemmata with the same orthography.

Parameters:
  • bliss_lexicon (tk.Path) – lexicon file to be handled

  • merge_multi_orths_lemmata (bool) –

    if True, lemmata containing multiple orths are also merged, based on their primary orth. Otherwise they are ignored.

    Merging strategy:

    • orth/phon/eval
      all orth/phon/eval elements are merged together

    • synt
      the synt element is only copied to the target lemma when
      1. the target lemma does not already have one, and

      2. any of the remaining to-be-merged lemmata has a synt element.

      Here, "having a synt" means that synt is not None.

      This can lead to INFORMATION LOSS if the to-be-merged lemmata contain several different synt token sequences.
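
The merging strategy can be sketched as follows. The Lemma class and merge_lemmata function are hypothetical stand-ins for illustration, and dropping duplicate orth/phon/eval entries is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Lemma:
    """Hypothetical stand-in for a bliss lexicon lemma."""

    orth: List[str] = field(default_factory=list)
    phon: List[str] = field(default_factory=list)
    synt: Optional[List[str]] = None
    evals: List[List[str]] = field(default_factory=list)


def _extend_unique(target, items):
    # Append items that are not present yet, preserving order.
    for item in items:
        if item not in target:
            target.append(item)


def merge_lemmata(target, others):
    """Merge `others` into `target` following the strategy described above."""
    for other in others:
        _extend_unique(target.orth, other.orth)
        _extend_unique(target.phon, other.phon)
        _extend_unique(target.evals, other.evals)
        # synt is copied only if the target has none yet; any later,
        # different synt sequence is silently dropped (information loss).
        if target.synt is None and other.synt is not None:
            target.synt = list(other.synt)
    return target
```

Merging three lemmata where the second has synt ["color"] and the third has synt ["colour"] keeps only ["color"] in this sketch, which is the kind of information loss the note warns about.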

run()
tasks()
Returns:

yields Task’s

Return type:

list[sisyphus.task.Task]

class i6_core.lexicon.conversion.SpellingConversionJob(*args, **kwargs)

Spelling conversion for lexicon.

Parameters:
  • bliss_lexicon (Path) – input lexicon whose lemmata all have a unique PRIMARY orth. To meet this requirement, apply LexiconUniqueOrthJob first.

  • orth_mapping_file (str) –

    orthography mapping file: *.json, *.json.gz, *.txt, or *.gz

    in case of a plain text file, one can adjust mapping_file_delimiter; a line starting with “#” is a comment line

  • mapping_file_delimiter (str) – delimiter between source and target orths in the mapping file; relevant only if the mapping is provided as a plain text file

  • mapping_rules (Optional[List[Tuple[str, str, str]]]) –

    a list of mapping rules, each represented by 3 strings (source orth-substring, target orth-substring, pos), where pos should be one of [“leading”, “trailing”, “any”]

    e.g. the rule (“zation”, “sation”, “trailing”) converts orths ending in -zation to orths ending in -sation. Set this ONLY when the rules are clearly defined and cannot generate any kind of ambiguity.

  • invert_mapping (bool) – invert the input orth mapping. NOTE: this also affects the pairs which are inferred from mapping_rules
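
The mapping inputs described above can be illustrated with a small sketch. None of these helpers are part of i6_core; parse_mapping_lines, apply_rule, and invert_rule are hypothetical names. Replacing every occurrence for pos "any" is an assumption, as is modelling inversion as swapping source and target.

```python
def parse_mapping_lines(lines, delimiter):
    """Read 'source<delimiter>target' pairs, skipping '#' comment lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split(delimiter, 1)
        mapping[source] = target
    return mapping


def apply_rule(orth, rule):
    """Apply one (source, target, pos) rule; return None if it does not match."""
    source, target, pos = rule
    if pos == "leading":
        if orth.startswith(source):
            return target + orth[len(source):]
    elif pos == "trailing":
        if orth.endswith(source):
            return orth[: len(orth) - len(source)] + target
    elif pos == "any":
        if source in orth:
            return orth.replace(source, target)  # assumption: all occurrences
    else:
        raise ValueError(f"unknown pos: {pos!r}")
    return None


def invert_rule(rule):
    """Swap source and target, as an inverted mapping would (assumption)."""
    source, target, pos = rule
    return (target, source, pos)
```

With the rule from the example, apply_rule("organization", ("zation", "sation", "trailing")) yields "organisation", and the inverted rule maps it back.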

run()
tasks()
Returns:

yields Task’s

Return type:

list[sisyphus.task.Task]