i6_core.datasets.switchboard
Switchboard is conversational telephone speech with 8 kHz audio files. The training data consists of 300 hours. Reference: https://catalog.ldc.upenn.edu/LDC97S62
number of recordings: 4876
number of segments: 249624
number of speakers: 2260
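As a rough sanity check on the figures above, the averages they imply can be computed directly. This is only back-of-the-envelope arithmetic; it assumes that the 300 hours refer to the total duration of all segments, which the description does not state explicitly.

```python
# Corpus statistics quoted in the module description.
total_hours = 300        # assumption: total duration of all segments
num_recordings = 4876
num_segments = 249624

# Average length of one segment in seconds.
avg_segment_seconds = total_hours * 3600 / num_segments
# Average number of segments per recording.
avg_segments_per_recording = num_segments / num_recordings

print(f"average segment length: {avg_segment_seconds:.2f} s")
print(f"average segments per recording: {avg_segments_per_recording:.1f}")
```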
- class i6_core.datasets.switchboard.CreateFisherTranscriptionsJob(*args, **kwargs)
Creates the compressed text data based on the Fisher transcriptions, which can be used for LM training.
Part 1: https://catalog.ldc.upenn.edu/LDC2004T19
Part 2: https://catalog.ldc.upenn.edu/LDC2005T19
- Parameters:
fisher_transcriptions1_folder – path to unpacked LDC2004T19.tgz, usually named fe_03_p1_tran
fisher_transcriptions2_folder – path to unpacked LDC2005T19.tgz, usually named fe_03_p2_tran
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateHub5e00CorpusJob(*args, **kwargs)
Creates the Switchboard hub5e_00 corpus based on LDC2002S09.
No speaker information attached.
- Parameters:
wav_audio_folder – output of SwitchboardSphereToWaveJob called on extracted LDC2002S09.tgz
hub5_transcriptions – extracted LDC2002T43.tgz named “2000_hub5_eng_eval_tr”
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateHub5e01CorpusJob(*args, **kwargs)
Creates the Switchboard hub5e_01 corpus based on LDC2002S13.
This corpus provides no GLM file; the one from Hub5e00 should be used.
No speaker information attached.
- Parameters:
wav_audio_folder – output of SwitchboardSphereToWaveJob called on extracted LDC2002S13.tgz
hub5e01_folder – extracted LDC2002S13 named “hub5e_01”
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateLDCSwitchboardSpeakerListJob(*args, **kwargs)
This creates the speaker list according to the conversation and speaker table from the LDC documentation: https://catalog.ldc.upenn.edu/docs/LDC97S62
- The resulting file contains 520 speakers in the format of:
<speaker_id> <gender> <recording>
- Parameters:
caller_tab_file – caller_tab.csv from the Switchboard LDC documentation
conv_tab_file – conv_tab.csv from the Switchboard LDC documentation
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateRT03sCTSCorpusJob(*args, **kwargs)
Create the RT03 test set corpus, specifically the “CTS” subset of LDC2007S10
No speaker information attached
- Parameters:
wav_audio_folder – output of SwitchboardSphereToWaveJob called on extracted LDC2007S10.tgz
rt03_folder – extracted LDC2007S10.tgz
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateSwitchboardBlissCorpusJob(*args, **kwargs)
Creates the Switchboard Bliss corpus XML.
segment name format: sw2001B-ms98-a-<folder-name>
- Parameters:
audio_dir (tk.Path) – path for audio data
trans_dir (tk.Path) – path to the transcription data; see DownloadSwitchboardTranscriptionAndDictJob
speakers_list_file (tk.Path) – path to a speakers list text file with one entry per line in the format speaker_id gender recording<channel>, e.g. 1005 F 2452A; see CreateSwitchboardSpeakersListJob
skip_empty_ldc_file (bool) – in the original corpus the sequence 2167B is mostly empty, so exclude it from training (recommended, GMM will fail otherwise)
lowercase (bool) – lowercase the transcriptions of the corpus (recommended)
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
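The speakers list format described for speakers_list_file (speaker_id gender recording<channel>, one entry per line) is simple enough to parse by hand. The following is a minimal sketch for inspecting such a file; the names SpeakerEntry and parse_speakers_list are hypothetical helpers for illustration and are not part of i6_core.

```python
from typing import Dict, NamedTuple


class SpeakerEntry(NamedTuple):
    speaker_id: str
    gender: str
    recording: str  # recording id including the channel suffix, e.g. "2452A"


def parse_speakers_list(text: str) -> Dict[str, SpeakerEntry]:
    """Map each recording (with channel) to its speaker entry."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        entry = SpeakerEntry(*fields)
        entries[entry.recording] = entry
    return entries


# The example line is the one given in the parameter description.
speakers = parse_speakers_list("1005 F 2452A\n")
print(speakers["2452A"].speaker_id, speakers["2452A"].gender)
```

Keying the result by recording makes it easy to look up the speaker for a given channel of a conversation.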
- class i6_core.datasets.switchboard.CreateSwitchboardLexiconTextFileJob(*args, **kwargs)
This job creates a preprocessed SWB dictionary text file that is consistent with the training corpus. The input is the raw dictionary text file found in the transcription directory downloaded by DownloadSwitchboardTranscriptionAndDictJob. The resulting dictionary text file is then passed as an argument to LexiconFromTextFileJob in order to create the Bliss XML lexicon.
- Parameters:
raw_dict_file (tk.Path) – path containing the raw dictionary text file
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateSwitchboardSpeakersListJob(*args, **kwargs)
- Given some speaker statistics, this job creates a text file with the following on each line:
speaker_id gender recording
- Parameters:
speakers_stats_file (tk.Path) – speakers stats text file
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.CreateSwitchboardSpokenFormBlissCorpusJob(*args, **kwargs)
Creates a special spoken-form version of Switchboard-1, used e.g. for BPE or SentencePiece based models. It includes:
- lowercasing everything
- conversion of numbers to written form (using a given conversion table)
- conversion of some short forms into spoken forms (also using the table)
- making special tokens uppercase again
- Parameters:
switchboard_bliss_corpus – out_corpus of CreateSwitchboardBlissCorpusJob
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
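The processing steps listed for CreateSwitchboardSpokenFormBlissCorpusJob can be illustrated on a single transcription line. This is only a sketch of the idea, not the job's implementation: the table entries and the special tokens below are hypothetical stand-ins, since the actual conversion table is not reproduced in this documentation.

```python
# Hypothetical stand-in for the conversion table (numbers and short forms).
CONVERSION_TABLE = {"2": "two", "mr.": "mister"}
# Hypothetical special tokens that should end up uppercase again.
SPECIAL_TOKENS = {"[noise]": "[NOISE]", "[laughter]": "[LAUGHTER]"}


def to_spoken_form(text: str) -> str:
    # Step 1: lowercase everything.
    words = text.lower().split()
    # Steps 2 and 3: map numbers and short forms through the table.
    words = [CONVERSION_TABLE.get(word, word) for word in words]
    # Step 4: restore the uppercase spelling of special tokens.
    words = [SPECIAL_TOKENS.get(word, word) for word in words]
    return " ".join(words)


spoken = to_spoken_form("Mr. Smith has 2 dogs [NOISE]")
print(spoken)  # mister smith has two dogs [NOISE]
```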
- class i6_core.datasets.switchboard.DownloadSwitchboardSpeakersStatsJob(*args, **kwargs)
Note that this does not contain the speaker info for all recordings. We later assume that each recording has a unique speaker, and a unique id is used for those recordings with unknown speaker info.
- Parameters:
url (str) –
target_filename (str|None) – explicit output filename, if None tries to detect the filename from the url
checksum (str|None) – A sha256 checksum to verify the file
- classmethod hash(parsed_args)
- Parameters:
parsed_args (dict[str]) –
- Returns:
hash for job given the arguments
- Return type:
str
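The checksum parameter of DownloadSwitchboardSpeakersStatsJob expects a sha256 hex digest. A minimal sketch of how such a digest can be computed and compared for a downloaded file, using only the standard library, is shown below; sha256_of_file and verify_checksum are hypothetical helper names, and the job's own verification code may differ.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file's sha256 digest matches the expected one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file instead of a real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name
try:
    expected = hashlib.sha256(b"abc").hexdigest()
    checksum_ok = verify_checksum(demo_path, expected)
    checksum_bad = verify_checksum(demo_path, "0" * 64)
finally:
    os.remove(demo_path)

print(checksum_ok, checksum_bad)
```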
- class i6_core.datasets.switchboard.DownloadSwitchboardTranscriptionAndDictJob(*args, **kwargs)
Downloads the Switchboard training transcriptions and dictionary (or lexicon)
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
- class i6_core.datasets.switchboard.SwitchboardSphereToWaveJob(*args, **kwargs)
Takes an audio folder from one of the Switchboard LDC folders and converts dual-channel .sph files with mu-law encoding to single-channel .wav files with s16le encoding.
- Parameters:
sph_audio_folder –
- run()
- tasks()
- Returns:
yields Task’s
- Return type:
list[sisyphus.task.Task]
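To make the conversion performed by SwitchboardSphereToWaveJob more concrete, the sketch below decodes G.711 mu-law bytes to signed 16-bit samples, splits two interleaved channels, and writes one channel as a mono WAV file. It is an illustration under stated assumptions, not the job's implementation: it works on a raw audio payload that is assumed to be interleaved sample by sample, it skips SPHERE header parsing entirely, and all function names are hypothetical.

```python
import io
import struct
import wave


def mulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(interleaved: bytes):
    """Split interleaved two-channel mu-law bytes into (channel A, channel B)."""
    return interleaved[0::2], interleaved[1::2]


def write_mono_wav(mulaw_bytes: bytes, fileobj, sample_rate: int = 8000) -> None:
    """Write mu-law data as a single-channel, 16-bit little-endian WAV."""
    samples = [mulaw_to_pcm16(b) for b in mulaw_bytes]
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample, i.e. s16le
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


# Tiny synthetic payload: two frames, channel A silent, channel B at the extremes.
payload = bytes([0xFF, 0x00, 0xFF, 0x80])
channel_a, channel_b = split_channels(payload)
decoded_b = [mulaw_to_pcm16(b) for b in channel_b]

buffer = io.BytesIO()
write_mono_wav(channel_b, buffer)
with wave.open(io.BytesIO(buffer.getvalue()), "rb") as wav:
    wav_params = (wav.getnchannels(), wav.getsampwidth(),
                  wav.getframerate(), wav.getnframes())
print(decoded_b, wav_params)
```

In practice a dedicated audio tool is the more robust choice for real .sph files, since those carry a header and may use compression that this sketch does not handle.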