i6_core.corpus.convert
- class i6_core.corpus.convert.CorpusReplaceOrthFromReferenceCorpus(*args, **kwargs)
Copies the orth tag from one corpus to another through matching segment names.
- Parameters:
bliss_corpus – Corpus in which the orth tag is to be replaced
reference_bliss_corpus – Corpus from which the orth tag replacement is taken
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
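The matching step can be pictured with plain dictionaries. The sketch below is not the job's implementation: it assumes each corpus has been reduced to a mapping from full segment name to orth. The helper name `replace_orth_from_reference`, the example segment names, and the choice to raise on a missing segment are illustrative assumptions.

```python
from typing import Dict


def replace_orth_from_reference(corpus: Dict[str, str], reference: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `corpus` with each orth taken from `reference`.

    Hypothetical helper: both arguments map a full segment name to its orth.
    """
    # Assumption: a segment without a counterpart in the reference is an error.
    missing = [name for name in corpus if name not in reference]
    if missing:
        raise KeyError(f"segments missing from reference corpus: {missing}")
    # Only segments of the target corpus are kept; extra reference segments are ignored.
    return {name: reference[name] for name in corpus}


corpus = {"corpus/rec1/seg1": "helo wrld", "corpus/rec1/seg2": "fo bar"}
reference = {
    "corpus/rec1/seg1": "hello world",
    "corpus/rec1/seg2": "foo bar",
    "corpus/rec2/seg1": "not part of the target corpus",
}
updated = replace_orth_from_reference(corpus, reference)
```

Keying on the full segment name means the two corpora may list their segments in a different order, and the reference may contain segments the target corpus lacks.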
- class i6_core.corpus.convert.CorpusReplaceOrthFromTxtJob(*args, **kwargs)
Merge raw text back into a bliss corpus
- Parameters:
bliss_corpus (Path) – Bliss corpus
text_file (Path) – a raw or gzipped text file
segment_file (Path|None) – only replace the segments as specified in the segment file
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
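One way to picture the merge is as a line-by-line assignment. The sketch below rests on assumptions not stated above: the text file holds exactly one line per segment to be replaced, in corpus order, and a mismatch in counts is treated as an error. The helper `merge_text` and the `(full_name, orth)` tuples are hypothetical stand-ins for the Bliss corpus structures.

```python
from typing import Iterable, List, Optional, Set, Tuple

Segment = Tuple[str, str]  # (full segment name, orth)


def merge_text(segments: List[Segment], lines: Iterable[str], whitelist: Optional[Set[str]] = None) -> List[Segment]:
    """Replace the orth of the selected segments with the given text lines."""
    # Indices of the segments whose orth is to be replaced.
    targets = [
        index
        for index, (name, _) in enumerate(segments)
        if whitelist is None or name in whitelist
    ]
    stripped = [line.strip() for line in lines]
    # Assumption: the number of lines must equal the number of selected segments.
    if len(targets) != len(stripped):
        raise ValueError(f"{len(stripped)} text lines for {len(targets)} segments")
    merged = list(segments)
    for index, line in zip(targets, stripped):
        merged[index] = (merged[index][0], line)
    return merged


segments = [
    ("corpus/rec1/seg1", "old one"),
    ("corpus/rec1/seg2", "old two"),
    ("corpus/rec2/seg1", "old three"),
]
merged = merge_text(
    segments,
    ["new one\n", "new three\n"],
    whitelist={"corpus/rec1/seg1", "corpus/rec2/seg1"},
)
```

Segments outside the whitelist keep their original orth, which mirrors the description of `segment_file` as restricting which segments are replaced.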
- class i6_core.corpus.convert.CorpusToStmJob(*args, **kwargs)
Convert a Bliss corpus into a .stm file
- Parameters:
bliss_corpus – Path to Bliss corpus
exclude_non_speech – whether non-speech tokens should be removed
non_speech_tokens – list of non-speech tokens
remove_punctuation – whether punctuation should be removed
punctuation_tokens – list or string of punctuation tokens
fix_whitespace – whether whitespace should be fixed. Be aware that corpus loading already fixes whitespace.
name – new corpus name
tag_mapping – each tag is described by a 3-string tuple (“short name”, “long name”, “description”), paired with a Dict[int, tk.Path], e.g. the out_single_segment_files of a FilterSegments*Job
- classmethod replace_recursive(orthography, token)
Recursion is required to find repeated tokens; string.replace alone is not sufficient. Some other solution might also work.
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
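The note on `replace_recursive` can be illustrated with a small stand-alone sketch. It assumes that tokens are matched with a space on either side, which the text above does not state; the function names are hypothetical and this is not the class method itself. With that padding, a single `str.replace` pass misses the second of two adjacent tokens, because the space between them is consumed by the first match.

```python
def remove_token_single_pass(orthography: str, token: str) -> str:
    """Remove space-padded occurrences of `token` with one str.replace call."""
    return orthography.replace(f" {token} ", " ")


def remove_token_repeated(orthography: str, token: str) -> str:
    """Repeat the replacement until no space-padded occurrence is left."""
    padded = f" {token} "
    if padded not in orthography:
        return orthography
    # Each pass shortens the string, so the recursion terminates.
    return remove_token_repeated(orthography.replace(padded, " "), token)


text = " a [noise] [noise] b "
single = remove_token_single_pass(text, "[noise]")  # " a [noise] b "
repeated = remove_token_repeated(text, "[noise]")   # " a b "
```

A `while` loop over the same condition would behave identically, which is in line with the remark that another solution might also work.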
- class i6_core.corpus.convert.CorpusToTextDictJob(*args, **kwargs)
Extract the text from a Bliss corpus to fit a “{key: text}” structure (e.g. for RETURNN)
- Parameters:
bliss_corpus (Path) – bliss corpus file
segment_file (Path|None) – a segment file as optional whitelist
invert_match (bool) – use segment file as blacklist (needs to contain full segment names then)
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
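The whitelist and blacklist behaviour of `segment_file` and `invert_match` can be sketched as follows. This is an illustration under assumptions, not the job's code: the key is taken to be the full segment name, the helper `corpus_to_text_dict` is hypothetical, and serializing the result as a Python dict literal with `pprint` is only one plausible way to store a "{key: text}" structure.

```python
import ast
import pprint
from typing import Dict, Iterable, Optional, Set, Tuple


def corpus_to_text_dict(
    segments: Iterable[Tuple[str, str]],
    segment_names: Optional[Set[str]] = None,
    invert_match: bool = False,
) -> Dict[str, str]:
    """Build {full segment name: orth}, optionally filtered by a segment list."""
    result = {}
    for name, orth in segments:
        if segment_names is not None:
            listed = name in segment_names
            # Whitelist: skip unlisted segments. Blacklist: skip listed segments.
            if listed == invert_match:
                continue
        result[name] = orth
    return result


segments = [
    ("corpus/rec1/seg1", "hello world"),
    ("corpus/rec1/seg2", "foo bar"),
    ("corpus/rec2/seg1", "good bye"),
]
listed = {"corpus/rec1/seg2"}

whitelisted = corpus_to_text_dict(segments, listed)
blacklisted = corpus_to_text_dict(segments, listed, invert_match=True)

# A dict literal written this way can be read back without executing code.
serialized = pprint.pformat(blacklisted)
restored = ast.literal_eval(serialized)
```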
- class i6_core.corpus.convert.CorpusToTxtJob(*args, **kwargs)
Extract orth from a Bliss corpus and store as raw txt or gzipped txt
- Parameters:
bliss_corpus (Path) – Bliss corpus
segment_file (Path) – segment file
gzip (bool) – gzip the output text file
- run()
- tasks()
- Returns:
yields Tasks
- Return type:
list[sisyphus.task.Task]
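The effect of the `gzip` flag can be sketched with the standard library alone. The example assumes one orth per line and UTF-8 encoding, neither of which is stated above; `write_orth_lines` and the file names are hypothetical.

```python
import gzip
import os
import tempfile
from typing import Iterable


def write_orth_lines(orths: Iterable[str], path: str, use_gzip: bool = False) -> None:
    """Write one orth per line, either as raw text or gzip-compressed."""
    # gzip.open in text mode accepts the same mode and encoding arguments as open.
    opener = gzip.open if use_gzip else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for orth in orths:
            handle.write(orth + "\n")


orths = ["hello world", "foo bar"]
with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "corpus.txt")
    gz_path = os.path.join(tmp_dir, "corpus.txt.gz")
    write_orth_lines(orths, raw_path)
    write_orth_lines(orths, gz_path, use_gzip=True)
    # Read both files back to confirm they hold the same lines.
    with open(raw_path, "rt", encoding="utf-8") as handle:
        raw_lines = handle.read().splitlines()
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        gz_lines = handle.read().splitlines()
```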