i6_core.lexicon.conversion
- class i6_core.lexicon.conversion.FilterLexiconByWordListJob(*args, **kwargs)
Filter lemmata to a given word list. Warning: the case_sensitive parameter does the opposite of what its name suggests. This behavior is kept for backwards compatibility.
- Parameters:
bliss_lexicon (tk.Path) – lexicon file to be handled
word_list (tk.Path) – filter lexicon by this word list
case_sensitive (bool) – filter lemmata case-sensitively. Warning: this parameter does the opposite of what its name suggests.
check_synt_tok (bool) – also keep lemmata whose syntactic token matches an entry in word_list
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
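The filtering described for FilterLexiconByWordListJob can be illustrated with a minimal standalone sketch. It is not the job's implementation: the helper filter_lexicon and its fold_case flag are hypothetical, the XML layout (lemma, orth, synt/tok elements) is an assumption about the bliss format, and the sketch deliberately avoids the name case_sensitive because the job's parameter is documented as inverted.

```python
import xml.etree.ElementTree as ET

# Tiny example lexicon; the element layout is an assumption about the bliss format.
LEXICON_XML = """<lexicon>
  <lemma><orth>Hello</orth><phon>HH AH L OW</phon></lemma>
  <lemma><orth>world</orth><phon>W ER L D</phon></lemma>
  <lemma><orth>extra</orth><synt><tok>hello</tok></synt><phon>EH K S T R AH</phon></lemma>
</lexicon>"""


def filter_lexicon(root, words, fold_case=False, check_synt_tok=False):
    """Remove every <lemma> that matches no entry of `words` (hypothetical helper)."""
    normalize = str.lower if fold_case else (lambda s: s)
    wanted = {normalize(w) for w in words}
    for lemma in list(root.findall("lemma")):
        # Candidates are all orthographies, plus the synt tokens if requested.
        candidates = [o.text or "" for o in lemma.findall("orth")]
        if check_synt_tok:
            candidates += [t.text or "" for t in lemma.findall("synt/tok")]
        if not any(normalize(c.strip()) in wanted for c in candidates):
            root.remove(lemma)
    return root


def kept_orths(words, **kwargs):
    """Return the primary orth of each lemma that survives the filter."""
    root = filter_lexicon(ET.fromstring(LEXICON_XML), words, **kwargs)
    return [lemma.findtext("orth") for lemma in root.findall("lemma")]


print(kept_orths(["hello"]))                                        # []
print(kept_orths(["hello"], fold_case=True))                        # ['Hello']
print(kept_orths(["hello"], fold_case=True, check_synt_tok=True))   # ['Hello', 'extra']
```

When using the real job, check on a small lexicon which value of case_sensitive gives the matching behavior you need, since the documentation states that the flag is inverted.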
- class i6_core.lexicon.conversion.GraphemicLexiconFromWordListJob(*args, **kwargs)
- default_transforms = {'+': 'PLUS', '.': 'DOT', '{': 'LBR', '}': 'RBR'}
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
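The page gives no description for GraphemicLexiconFromWordListJob beyond default_transforms. The sketch below shows one plausible reading of that mapping: a word is spelled character by character, and characters listed in the mapping are replaced by a symbolic name. The helper names are hypothetical, and the assumption that the job applies the mapping this way is not confirmed by this page.

```python
# Copy of the documented class attribute.
DEFAULT_TRANSFORMS = {"+": "PLUS", ".": "DOT", "{": "LBR", "}": "RBR"}


def graphemic_pronunciation(word, transforms=DEFAULT_TRANSFORMS):
    """Spell `word` one character at a time, renaming characters found in `transforms`."""
    return " ".join(transforms.get(ch, ch) for ch in word if not ch.isspace())


def grapheme_inventory(words, transforms=DEFAULT_TRANSFORMS):
    """Collect the sorted set of (transformed) characters used by all words."""
    chars = {ch for word in words for ch in word if not ch.isspace()}
    return sorted(transforms.get(ch, ch) for ch in chars)


print(graphemic_pronunciation("c++"))       # c PLUS PLUS
print(graphemic_pronunciation("a.b"))       # a DOT b
print(grapheme_inventory(["a.b", "c++"]))   # ['DOT', 'PLUS', 'a', 'b', 'c']
```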
- class i6_core.lexicon.conversion.LexiconFromTextFileJob(*args, **kwargs)
Create a bliss lexicon from a regular text file in which each line contains <WORD> <PHONEME1> <PHONEME2> …, separated by tabs or spaces. Lemmata are added in the order in which they appear in the text file; the phonemes are sorted alphabetically. Phoneme variants of the same word need to appear next to each other.
WARNING: No special lemmas or phonemes are added, so do not use this lexicon with RASR directly!
As the splitting is taken from RASR and not fully tested, it might not work in all cases, so do not use this job on new lexica without manually checking the output.
- Parameters:
text_file (Path) –
compressed – save as .xml.gz
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
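The input format of LexiconFromTextFileJob can be sketched with a small parser. This is a simplified illustration, not the job's code: parse_lexicon_lines is a hypothetical helper, it uses plain whitespace splitting instead of the RASR-derived splitting mentioned above, and it only groups consecutive lines of the same word, which is one way to see why variants must appear next to each other.

```python
from itertools import groupby


def parse_lexicon_lines(lines):
    """Parse '<WORD> <PHONEME1> <PHONEME2> ...' lines (hypothetical helper).

    Returns (lemmata, phonemes): lemmata is a list of (word, pronunciations)
    in file order, phonemes is the alphabetically sorted phoneme inventory.
    """
    entries = [line.split() for line in lines if line.strip()]
    lemmata = []
    # groupby only merges *consecutive* entries with the same word.
    for word, group in groupby(entries, key=lambda fields: fields[0]):
        pronunciations = []
        for fields in group:
            pron = " ".join(fields[1:])
            if pron not in pronunciations:
                pronunciations.append(pron)
        lemmata.append((word, pronunciations))
    phonemes = sorted({p for fields in entries for p in fields[1:]})
    return lemmata, phonemes


lines = ["read\tR IY D", "read R EH D", "cat K AE T"]
lemmata, phonemes = parse_lexicon_lines(lines)
print(lemmata)    # [('read', ['R IY D', 'R EH D']), ('cat', ['K AE T'])]
print(phonemes)   # ['AE', 'D', 'EH', 'IY', 'K', 'R', 'T']
```

In this sketch, a second "read" line placed after "cat" would produce a separate lemma instead of a second pronunciation.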
- class i6_core.lexicon.conversion.LexiconToWordListJob(*args, **kwargs)
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
- class i6_core.lexicon.conversion.LexiconUniqueOrthJob(*args, **kwargs)
Merge lemmata with the same orthography.
- Parameters:
bliss_lexicon (tk.Path) – lexicon file to be handled
merge_multi_orths_lemmata (bool) –
if True, lemmata containing multiple orths are also merged, based on their primary orth. Otherwise they are ignored.
Merging strategy:
- orth/phon/eval: all orth/phon/eval elements are merged together.
- synt: the synt element is only copied to the target lemma when (1) the target lemma does not already have one and (2) any of the remaining to-be-merged lemmata has a synt element. Here, "having a synt" means synt is not None. This can lead to INFORMATION LOSS if the to-be-merged lemmata contain several different synt token sequences.
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
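The merging strategy of LexiconUniqueOrthJob can be sketched on plain dictionaries. The dictionary representation and the helper merge_lemmata are assumptions made for illustration; the job itself operates on a bliss lexicon file. The example also shows the information loss mentioned in the parameter description: only the first synt found is kept.

```python
def merge_lemmata(lemmata, merge_multi_orths_lemmata=False):
    """Merge dict-based lemmata that share the same primary orth (hypothetical helper)."""
    result = []
    by_primary_orth = {}
    for lemma in lemmata:
        entry = {
            "orth": list(lemma["orth"]),
            "phon": list(lemma.get("phon", [])),
            "eval": list(lemma.get("eval", [])),
            "synt": lemma.get("synt"),
        }
        # Lemmata with several orths are passed through unless merging is enabled.
        if len(entry["orth"]) > 1 and not merge_multi_orths_lemmata:
            result.append(entry)
            continue
        target = by_primary_orth.get(entry["orth"][0])
        if target is None:
            by_primary_orth[entry["orth"][0]] = entry
            result.append(entry)
            continue
        # orth/phon/eval: merge all elements, skipping duplicates.
        for field in ("orth", "phon", "eval"):
            for item in entry[field]:
                if item not in target[field]:
                    target[field].append(item)
        # synt: copy only if the target has none yet; later synts are dropped.
        if target["synt"] is None and entry["synt"] is not None:
            target["synt"] = entry["synt"]
    return result


example = [
    {"orth": ["colour"], "phon": ["K AH L ER"], "synt": None},
    {"orth": ["colour"], "phon": ["K AH L AH"], "synt": ["color"]},
    {"orth": ["colour"], "phon": ["K AH L ER"], "synt": ["colour"]},
]
merged = merge_lemmata(example)
print(merged[0]["phon"])   # ['K AH L ER', 'K AH L AH']
print(merged[0]["synt"])   # ['color']  (the third lemma's synt is lost)
```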
- class i6_core.lexicon.conversion.SpellingConversionJob(*args, **kwargs)
Spelling conversion for lexicon.
- Parameters:
bliss_lexicon (Path) – input lexicon whose lemmata all have a unique PRIMARY orth; to meet this requirement, apply LexiconUniqueOrthJob first
orth_mapping_file (str) –
orthography mapping file: *.json, *.json.gz, *.txt or *.gz. For a plain text file, the delimiter can be adjusted via mapping_file_delimiter, and a line starting with “#” is a comment line.
mapping_file_delimiter (str) – delimiter between source and target orths in the mapping file; relevant only if the mapping is provided as a plain text file
mapping_rules (Optional[List[Tuple[str, str, str]]]) –
a list of mapping rules, each represented by 3 strings (source orth-substring, target orth-substring, pos), where pos should be one of [“leading”, “trailing”, “any”]. For example, the rule (“zation”, “sation”, “trailing”) converts orths ending in -zation to orths ending in -sation. Set this ONLY for clearly defined rules that cannot generate any kind of ambiguity.
invert_mapping (bool) – invert the input orth mapping. NOTE: this also affects the pairs which are inferred from mapping_rules
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
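The two mapping sources of SpellingConversionJob, a plain text mapping file and mapping_rules, can be illustrated with a minimal sketch. Both helpers are hypothetical and only mirror the parameter descriptions above; the default delimiter of a single space is an assumption, and the sketch does not reproduce how the job applies the mapping to a lexicon.

```python
def parse_mapping_lines(lines, delimiter=" ", invert=False):
    """Read 'source<delimiter>target' pairs, skipping blank and '#' comment lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = (part.strip() for part in line.split(delimiter, 1))
        if invert:
            source, target = target, source
        mapping[source] = target
    return mapping


def apply_rule(orth, rule, invert=False):
    """Apply one (source, target, pos) rule; pos is 'leading', 'trailing' or 'any'."""
    source, target, pos = rule
    if invert:
        source, target = target, source
    if pos == "leading":
        return target + orth[len(source):] if orth.startswith(source) else orth
    if pos == "trailing":
        return orth[: len(orth) - len(source)] + target if orth.endswith(source) else orth
    if pos == "any":
        return orth.replace(source, target)
    raise ValueError(f"unknown pos: {pos!r}")


rule = ("zation", "sation", "trailing")
print(apply_rule("organization", rule))               # organisation
print(apply_rule("organisation", rule, invert=True))  # organization
print(parse_mapping_lines(["# British to American", "colour color"]))
# {'colour': 'color'}
```

A rule such as ("z", "s", "any") would rewrite every occurrence of the substring, which illustrates why rules should only be used when they cannot introduce ambiguities.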