i6_core.corpus.convert

class i6_core.corpus.convert.CorpusReplaceOrthFromReferenceCorpus(*args, **kwargs)

Copies the orth tag from one corpus to another through matching segment names.

Parameters:
  • bliss_corpus – Corpus in which the orth tag is to be replaced

  • reference_bliss_corpus – Corpus from which the orth tag replacement is taken
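The matching step can be pictured with plain dictionaries. This is only a minimal sketch, not the job's implementation: `replace_orth_by_name` is a hypothetical helper, the corpora are stand-in dicts mapping full segment names to orth strings, and keeping the original orth when the reference has no matching segment is an assumption.

```python
def replace_orth_by_name(target, reference):
    """Return a copy of `target` whose orth values come from `reference`.

    Both arguments are hypothetical stand-ins for a Bliss corpus:
    dicts mapping a full segment name to its orth string.
    Assumption: segments missing from `reference` keep their original orth.
    """
    return {name: reference.get(name, orth) for name, orth in target.items()}


target = {"corpus/rec1/seg1": "helo wrld", "corpus/rec1/seg2": "unchanged"}
reference = {"corpus/rec1/seg1": "hello world"}
updated = replace_orth_by_name(target, reference)
```

Here `seg1` takes its orth from the reference, while `seg2` has no match and is left as it was.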

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.convert.CorpusReplaceOrthFromTxtJob(*args, **kwargs)

Merges raw text back into a Bliss corpus

Parameters:
  • bliss_corpus (Path) – Bliss corpus

  • text_file (Path) – a raw or gzipped text file

  • segment_file (Path|None) – only replace the segments as specified in the segment file
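A rough sketch of how text lines might be merged back when a segment file restricts the replacement. The helper `merge_text` and the list-of-tuples corpus are hypothetical, and it assumes the text file holds exactly one line per replaced segment, in corpus order; the actual job may behave differently.

```python
def merge_text(segments, lines, whitelist=None):
    """Replace the orth of selected segments with consecutive text lines.

    segments:  list of (full_segment_name, orth) tuples (stand-in for a corpus)
    lines:     iterable of text lines, one per segment that gets replaced
    whitelist: optional set of segment names; all other segments are kept as-is
    """
    lines = iter(lines)
    merged = []
    for name, orth in segments:
        if whitelist is not None and name not in whitelist:
            merged.append((name, orth))  # not selected, keep the original orth
        else:
            merged.append((name, next(lines).strip()))
    return merged


segments = [("corpus/rec1/seg1", "old one"), ("corpus/rec1/seg2", "old two")]
merged = merge_text(segments, ["new two\n"], whitelist={"corpus/rec1/seg2"})
```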

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.convert.CorpusToStmJob(*args, **kwargs)

Convert a Bliss corpus into a .stm file

Parameters:
  • bliss_corpus – Path to Bliss corpus

  • exclude_non_speech – whether non-speech tokens should be removed

  • non_speech_tokens – list of non-speech tokens

  • remove_punctuation – whether punctuation should be removed

  • punctuation_tokens – list or string of punctuation tokens

  • fix_whitespace – whether whitespace should be fixed. Note that corpus loading already fixes whitespace.

  • name – new corpus name

  • tag_mapping – each tag is described by a tuple of three strings (“short name”, “long name”, “description”); the associated Dict[int, tk.Path] is, e.g., the out_single_segment_files of one of the FilterSegments*Job classes
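For orientation, one segment could be rendered as an STM line roughly as follows. This sketch assumes the common NIST STM field order (file, channel, speaker, start, end, label, transcript). The helper `stm_line`, the default channel `"1"`, the label `"<d0>"`, and the two-decimal time format are illustrative assumptions, not the job's actual output format.

```python
def stm_line(recording, speaker, start, end, orth, channel="1", label="<d0>"):
    """Format one segment as a single STM line (hypothetical helper).

    Field order follows the usual NIST STM layout; the default channel,
    the label, and the time precision are assumptions made for this sketch.
    """
    return f"{recording} {channel} {speaker} {start:.2f} {end:.2f} {label} {orth}"


line = stm_line("rec1", "spk1", 0.0, 2.5, "hello world")
```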

classmethod replace_recursive(orthography, token)

Recursion is required to find repeated tokens; string.replace alone is not sufficient. Some other solution might also work.
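The reason a single replacement pass can miss repeated tokens is that adjacent occurrences share a separating space, so their matches overlap. The sketch below illustrates this with a standalone, hypothetical function `replace_repeated`. It assumes tokens are matched with surrounding spaces and is not necessarily how the classmethod itself is written.

```python
def replace_repeated(orthography, token):
    """Remove every space-delimited occurrence of `token`, including repeats.

    Assumption: a token is matched as " token " and replaced by one space.
    Each replacement shortens the string, so the recursion terminates.
    """
    padded = f" {token} "
    if padded not in orthography:
        return orthography
    return replace_repeated(orthography.replace(padded, " "), token)


text = " a [noise] [noise] b "
# One pass leaves the second token behind: its leading space was already
# consumed by the first match.
single_pass = text.replace(" [noise] ", " ")
cleaned = replace_repeated(text, "[noise]")
```

A loop that repeats the replacement until the string stops changing would be an equivalent non-recursive alternative.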

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.convert.CorpusToTextDictJob(*args, **kwargs)

Extracts the text from a Bliss corpus into a “{key: text}” structure (e.g. for RETURNN)

Parameters:
  • bliss_corpus (Path) – bliss corpus file

  • segment_file (Path|None) – a segment file as optional whitelist

  • invert_match (bool) – use segment file as blacklist (needs to contain full segment names then)
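The whitelist and blacklist behaviour described by the parameters above can be sketched with a small, hypothetical helper. `to_text_dict` and its list-of-tuples input are stand-ins for illustration only; the job itself works on Bliss corpus and segment files.

```python
def to_text_dict(segments, segment_names=None, invert_match=False):
    """Build a {full_segment_name: text} dict from (name, orth) tuples.

    segment_names: optional set of full segment names
    invert_match:  False treats the set as a whitelist, True as a blacklist
    """
    result = {}
    for name, orth in segments:
        if segment_names is not None:
            listed = name in segment_names
            # whitelist: skip unlisted names; blacklist: skip listed names
            if listed == invert_match:
                continue
        result[name] = orth
    return result


segments = [("corpus/rec1/seg1", "hello"), ("corpus/rec1/seg2", "world")]
kept = to_text_dict(segments, {"corpus/rec1/seg1"})
dropped = to_text_dict(segments, {"corpus/rec1/seg1"}, invert_match=True)
```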

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.convert.CorpusToTxtJob(*args, **kwargs)

Extract orth from a Bliss corpus and store as raw txt or gzipped txt

Parameters:
  • bliss_corpus (Path) – Bliss corpus

  • segment_file (Path) – segment file

  • gzip (bool) – gzip the output text file
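Writing the extracted orth lines as either raw or gzipped text can be done with the standard library alone. The following is a minimal sketch under that assumption; `write_text` and its arguments are hypothetical names, not part of the job's interface.

```python
import gzip


def write_text(lines, path, use_gzip=False):
    """Write one orth per line, optionally gzip-compressed (hypothetical helper)."""
    opener = gzip.open if use_gzip else open
    # Both openers accept text mode "wt" together with an encoding.
    with opener(path, "wt", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

A gzipped file written this way can be read back with `gzip.open(path, "rt", encoding="utf-8")`.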

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]