i6_core.corpus.transform

class i6_core.corpus.transform.AddCacheToCorpusJob(*args, **kwargs)

Adds a cache manager call to all audio paths in a corpus file.

Parameters:
  • bliss_corpus (Path) – bliss corpus file path
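As a rough illustration of what wrapping audio paths in a cache manager call can look like, the sketch below rewrites the `audio` attribute of each recording in a small corpus XML string. The function name `add_cache_call`, the backtick-quoted `cf` template, and the example corpus are assumptions made for this sketch; they are not taken from the job's implementation.

```python
import xml.etree.ElementTree as ET


def add_cache_call(corpus_xml: str, template: str = "`cf {}`") -> str:
    """Wrap every recording's audio path in a cache manager call.

    Hypothetical helper: the template is an assumed placeholder, not the
    job's actual cache manager syntax.
    """
    root = ET.fromstring(corpus_xml)
    # iter() also reaches recordings nested inside subcorpora.
    for recording in root.iter("recording"):
        audio = recording.get("audio")
        if audio is not None:
            recording.set("audio", template.format(audio))
    return ET.tostring(root, encoding="unicode")


example = '<corpus name="demo"><recording name="rec1" audio="/data/rec1.wav"/></corpus>'
cached = add_cache_call(example)
```

Only the audio paths change; recording names and everything else in the corpus are left untouched.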

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.transform.ApplyLexiconToCorpusJob(*args, **kwargs)

Use a bliss lexicon to convert all words in a bliss corpus into their phoneme representation.

Currently only supports picking the first phoneme.

Parameters:
  • bliss_corpus (Path) – path to a bliss corpus xml

  • bliss_lexicon (Path) – path to a bliss lexicon file

  • word_separation_orth (str|None) – a default word separation lemma orth. The corresponding phoneme (or phonemes in some special cases) is inserted between each pair of words. Usually it makes sense to use something like “[SILENCE]” or “[space]”.

  • strategy (LexiconStrategy) – strategy to determine which representation is selected
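The conversion described above can be sketched with a plain dictionary standing in for the lexicon. This is a minimal illustration, assuming a hypothetical `apply_lexicon` helper and a lexicon that maps each orth to a list of pronunciations; it only mirrors the "pick the first pronunciation" behaviour and the optional separator, and is not the job's actual code.

```python
def apply_lexicon(orth, lexicon, word_separation_orth=None):
    """Replace each word by its first pronunciation (hypothetical helper)."""
    # Pick the first pronunciation of every word in the transcription.
    phoneme_words = [lexicon[word][0] for word in orth.split()]
    if word_separation_orth is None:
        return " ".join(phoneme_words)
    # Look up the separator in the lexicon and place it between words.
    separator = lexicon[word_separation_orth][0]
    return f" {separator} ".join(phoneme_words)


# Toy lexicon for illustration only.
lexicon = {
    "hello": ["HH AH L OW", "HH EH L OW"],
    "world": ["W ER L D"],
    "[SILENCE]": ["[SILENCE]"],
}
plain = apply_lexicon("hello world", lexicon)
separated = apply_lexicon("hello world", lexicon, "[SILENCE]")
```

With the toy lexicon, `separated` places the separator phoneme between the two words, while `plain` simply joins the phoneme sequences.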

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.transform.CompressCorpusJob(*args, **kwargs)

Compresses a corpus by concatenating audio files and using a compression codec. Currently does not support corpora with subcorpora; audio files need to be .wav.

Parameters:
  • bliss_corpus (Path) – path to an XML corpus file with wave recordings

  • format (str) – supported file formats, currently limited to mp3

  • bitrate (str) – bitrate as a string, e.g. ‘32k’ or ‘192k’; can also be an integer, e.g. 192000

  • max_num_splits (int) – maximum number of resulting audio files
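To make the interplay of max_num_splits and bitrate more concrete, here is a minimal sketch that partitions a list of .wav files into at most max_num_splits groups and builds an ffmpeg command line for one group. The helpers `split_recordings` and `build_ffmpeg_command`, and the choice of ffmpeg's concat demuxer, are assumptions for illustration; the job's own run_ffmpeg may work differently.

```python
import math


def split_recordings(paths, max_num_splits):
    """Partition paths into at most max_num_splits contiguous groups."""
    if not paths:
        return []
    group_size = math.ceil(len(paths) / max_num_splits)
    return [paths[i:i + group_size] for i in range(0, len(paths), group_size)]


def build_ffmpeg_command(list_file, output_path, bitrate):
    """Build (but do not run) an ffmpeg call that concatenates and re-encodes.

    Assumes list_file is a concat-demuxer file listing the group's .wav files.
    """
    return [
        "ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file,
        "-b:a", str(bitrate),  # works for '32k' as well as 192000
        output_path,
    ]


wav_files = [f"rec{i}.wav" for i in range(5)]
groups = split_recordings(wav_files, max_num_splits=2)
command = build_ffmpeg_command("group0.txt", "group0.mp3", bitrate=192000)
```

Converting the bitrate with `str()` lets the same code path accept both the string and the integer form mentioned in the parameter description.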

add_duration_to_recordings(c)

Opens each recording, extracts its duration, and adds the duration to the recording object. Note: this is a lengthy operation, but so far there has been no alternative.

Parameters:
  • c (corpus.Corpus)

info()

Reads the log.run file to extract the current status of the compression job.

run()
run_ffmpeg(ffmpeg_inputs, output_path)
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.transform.MergeCorporaJob(*args, **kwargs)

Merges Bliss corpus files into a single file, either as subcorpora or flat.

Parameters:
  • bliss_corpora (Iterable[Path]) – any iterable of bliss corpora file paths to merge

  • name (str) – name of the new corpus (subcorpora will keep the original names)

  • merge_strategy (MergeStrategy) – how the corpora should be merged, e.g. as subcorpora or flat
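The difference between merging as subcorpora and merging flat can be sketched on in-memory XML trees. The enum below is a local stand-in that copies the values listed for MergeStrategy; the `merge_corpora` helper and the `subcorpus` element layout are assumptions for this sketch, and CONCATENATE is left out because its behaviour is not described here.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class MergeStrategy(Enum):
    """Local stand-in mirroring the documented enum values."""
    SUBCORPORA = 0
    FLAT = 1
    CONCATENATE = 2


def merge_corpora(corpora, name, strategy):
    """Merge parsed corpus roots into one new corpus (hypothetical helper)."""
    merged = ET.Element("corpus", {"name": name})
    for corpus in corpora:
        if strategy is MergeStrategy.SUBCORPORA:
            # Keep the original corpus name as the subcorpus name.
            sub = ET.SubElement(merged, "subcorpus", {"name": corpus.get("name", "")})
            sub.extend(list(corpus))
        elif strategy is MergeStrategy.FLAT:
            # Attach all children directly to the new corpus root.
            merged.extend(list(corpus))
        else:
            raise NotImplementedError("strategy not covered by this sketch")
    return merged


a = ET.fromstring('<corpus name="a"><recording name="r1"/></corpus>')
b = ET.fromstring('<corpus name="b"><recording name="r2"/></corpus>')
nested = merge_corpora([a, b], "merged", MergeStrategy.SUBCORPORA)
flat = merge_corpora([a, b], "merged", MergeStrategy.FLAT)
```

In the nested result the source corpora survive as named subcorpora, while the flat result holds both recordings directly under the new root.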

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.transform.MergeCorpusSegmentsAndAudioJob(*args, **kwargs)

This job merges segments and audio files based on a rasr cluster map and a list of cluster_names. The cluster map should map segments to something like cluster.XXX where XXX is a natural number (starting with 1). The lines in the cluster_names file will be used as names for the recordings in the new corpus.

The job outputs a new corpus file + the corresponding audio files.
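The mapping from cluster labels to recording names can be illustrated as follows. This sketch assumes the cluster map has already been parsed into a dictionary from segment name to a label of the form cluster.XXX, and that the cluster_names file has been read into a list of lines; the helper name and the example data are hypothetical.

```python
from collections import defaultdict


def group_segments_by_cluster(cluster_map, cluster_names):
    """Group segment names under the recording name of their cluster."""
    grouped = defaultdict(list)
    for segment, cluster in cluster_map.items():
        # "cluster.3" -> 3; cluster numbering starts at 1.
        index = int(cluster.rsplit(".", 1)[1])
        grouped[cluster_names[index - 1]].append(segment)
    return dict(grouped)


# Toy data for illustration only.
cluster_map = {
    "corpus/rec1/seg1": "cluster.1",
    "corpus/rec1/seg2": "cluster.2",
    "corpus/rec2/seg1": "cluster.1",
}
cluster_names = ["speaker_a", "speaker_b"]
recordings = group_segments_by_cluster(cluster_map, cluster_names)
```

Because cluster numbers start at 1, the n-th line of the names file corresponds to cluster.n, hence the `index - 1` lookup.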

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.transform.MergeStrategy(value)

An enumeration.

CONCATENATE = 2
FLAT = 1
SUBCORPORA = 0

class i6_core.corpus.transform.ReplaceTranscriptionFromCtmJob(*args, **kwargs)

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]

class i6_core.corpus.transform.ShiftCorpusSegmentStartJob(*args, **kwargs)

Shifts the segment start times of a corpus to change the FFT window offset.

Parameters:
  • bliss_corpus (Path) – path to a bliss corpus file

  • corpus_name (str) – name of the new corpus

  • shift (int) – shift in seconds
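A minimal sketch of such a shift, assuming segments carry `start` and `end` attributes given in seconds and that only the start is moved. The helper `shift_segment_starts` and the example corpus are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET


def shift_segment_starts(corpus_xml, corpus_name, shift):
    """Rename the corpus and add `shift` seconds to every segment start."""
    root = ET.fromstring(corpus_xml)
    root.set("name", corpus_name)
    for segment in root.iter("segment"):
        start = float(segment.get("start", "0"))
        # Only the start moves; the end time is left as it was.
        segment.set("start", str(start + shift))
    return ET.tostring(root, encoding="unicode")


example = (
    '<corpus name="demo"><recording name="rec1" audio="rec1.wav">'
    '<segment name="seg1" start="0.5" end="4.0"/></recording></corpus>'
)
shifted = ET.fromstring(shift_segment_starts(example, "demo_shifted", 1))
```

Whether a shifted start should be checked against the segment end is left open here, since the description above does not say.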

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]