`i6_core.corpus.convert`

class i6_core.corpus.convert.CorpusReplaceOrthFromReferenceCorpus(*args, **kwargs)

Copies the orth tag from one corpus to another through matching segment names.

Parameters:

bliss_corpus – Corpus in which the orth tag is to be replaced
reference_bliss_corpus – Corpus from which the orth tag replacement is taken
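The matching step can be pictured with plain dictionaries. This is a minimal sketch, not the job's actual implementation: `replace_orth_by_name` is a hypothetical helper, and segments are modelled as a mapping from full segment name to orth string instead of Bliss corpus objects. It also assumes that segments without a match in the reference keep their original orth, which the description above does not state.

```python
from typing import Dict


def replace_orth_by_name(corpus: Dict[str, str], reference: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `corpus` whose orth values come from `reference` where names match."""
    # Assumption: segments missing from the reference keep their original orth.
    return {name: reference.get(name, orth) for name, orth in corpus.items()}


# Hypothetical segment names and orths, for illustration only.
corpus = {"corpus/rec1/seg1": "helo world", "corpus/rec1/seg2": "good morning"}
reference = {"corpus/rec1/seg1": "hello world"}
updated = replace_orth_by_name(corpus, reference)
```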

tasks()

class i6_core.corpus.convert.CorpusReplaceOrthFromTxtJob(*args, **kwargs)

Merge raw text back into a Bliss corpus.

Parameters:

bliss_corpus (Path) – Bliss corpus
text_file (Path) – a raw or gzipped text file
segment_file (Path|None) – only replace the segments as specified in the segment file
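The merge can be sketched as assigning one text line to each selected segment. This is an illustration under assumptions, not the job's code: `merge_text` is a hypothetical helper, segments are modelled as an ordered mapping from segment name to orth, and the lines of the text file are assumed to appear in the same order as the segments they replace.

```python
from typing import Dict, List, Optional, Set


def merge_text(orths: Dict[str, str], lines: List[str], only: Optional[Set[str]] = None) -> Dict[str, str]:
    """Assign one text line to each selected segment, following corpus order."""
    # With a segment list, only the listed segments receive a new orth.
    selected = [name for name in orths if only is None or name in only]
    if len(selected) != len(lines):
        raise ValueError("number of text lines does not match number of selected segments")
    merged = dict(orths)
    for name, line in zip(selected, lines):
        merged[name] = line.strip()
    return merged


# Hypothetical data: three segments, two of them listed for replacement.
orths = {"c/r/s1": "old one", "c/r/s2": "old two", "c/r/s3": "old three"}
merged = merge_text(orths, ["new one\n", "new three\n"], only={"c/r/s1", "c/r/s3"})
```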

tasks()

class i6_core.corpus.convert.CorpusToStmJob(*args, **kwargs)

Convert a Bliss corpus into a .stm file.

Parameters:

bliss_corpus – Path to Bliss corpus
exclude_non_speech – whether non-speech tokens should be removed
non_speech_tokens – list of non-speech tokens
remove_punctuation – whether punctuation should be removed
punctuation_tokens – list or string of punctuation tokens
fix_whitespace – whether whitespace should be fixed. Be aware that corpus loading already fixes whitespace.
name – new corpus name
tag_mapping – for each tag, a 3-string tuple (“short name”, “long name”, “description”) together with a Dict[int, tk.Path], e.g. the out_single_segment_files of a FilterSegments*Job

classmethod replace_recursive(orthography, token)

Recursion is required to find repeated tokens; string.replace alone is not sufficient. Some other solution might also work.
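Why a single replace pass can miss repeated tokens is easiest to see in a small example. The sketch below is not the library's implementation; it assumes tokens are matched together with their surrounding spaces, which the description does not state. Under that assumption, two adjacent tokens share one space, so the first match consumes the space that the second match would need.

```python
def replace_recursive(orthography: str, token: str) -> str:
    """Remove every space-delimited occurrence of `token`, repeating until none is left."""
    padded = f" {token} "  # assumption: tokens are matched with surrounding spaces
    if padded not in orthography:
        return orthography
    # Each pass shortens the string, so the recursion terminates.
    return replace_recursive(orthography.replace(padded, " "), token)


text = " a [noise] [noise] b "
single_pass = text.replace(" [noise] ", " ")    # " a [noise] b " (one token survives)
recursive = replace_recursive(text, "[noise]")  # " a b "
```

An iterative loop that repeats the replacement until the string stops changing would behave the same way.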

tasks()

class i6_core.corpus.convert.CorpusToTextDictJob(*args, **kwargs)

Extract the text from a Bliss corpus to fit a “{key: text}” structure (e.g. for RETURNN).

Parameters:

bliss_corpus (Path) – bliss corpus file
segment_file (Path|None) – a segment file as optional whitelist
invert_match (bool) – use segment file as blacklist (needs to contain full segment names then)
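The whitelist and blacklist behaviour of these parameters can be sketched with a plain dictionary. `corpus_to_text_dict` is a hypothetical helper that works on a mapping from full segment name to text. It is not the job's implementation and only illustrates how `invert_match` flips the meaning of the segment list.

```python
from typing import Dict, Optional, Set


def corpus_to_text_dict(
    orths: Dict[str, str],
    segments: Optional[Set[str]] = None,
    invert_match: bool = False,
) -> Dict[str, str]:
    """Build a {key: text} dict, keeping or dropping the listed segments."""
    if segments is None:
        return dict(orths)
    # Whitelist: keep listed names. Blacklist (invert_match=True): keep all others.
    return {name: text for name, text in orths.items() if (name in segments) != invert_match}


# Hypothetical segment names and texts.
orths = {"c/r/s1": "hello", "c/r/s2": "world", "c/r/s3": "again"}
whitelisted = corpus_to_text_dict(orths, {"c/r/s1"})
blacklisted = corpus_to_text_dict(orths, {"c/r/s1"}, invert_match=True)
```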

tasks()

class i6_core.corpus.convert.CorpusToTxtJob(*args, **kwargs)

Extract orth from a Bliss corpus and store it as raw or gzipped text.

Parameters:

tasks()
