i6_core.datasets.huggingface

https://huggingface.co/docs/datasets/

class i6_core.datasets.huggingface.DownloadAndPrepareHuggingFaceDatasetJob(*args, **kwargs)

https://huggingface.co/docs/datasets/
https://huggingface.co/datasets

pip install datasets

Basically wraps datasets.load_dataset(...).save_to_disk(out_dir).
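The wrapped call chain can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: the function name download_and_prepare and the injectable load_dataset argument are hypothetical, and the real job may pass further options to datasets.load_dataset.

```python
from typing import Any, Callable, Optional


def download_and_prepare(
    path: str,
    name: Optional[str] = None,
    *,
    out_dir: str,
    data_files: Any = None,
    revision: Optional[str] = None,
    load_dataset: Optional[Callable[..., Any]] = None,
) -> None:
    """Hypothetical helper: load a dataset and save it to out_dir."""
    if load_dataset is None:
        # Lazy import, so this sketch can be imported without `datasets` installed.
        from datasets import load_dataset
    dataset = load_dataset(path, name, data_files=data_files, revision=revision)
    # Both Dataset and DatasetDict provide save_to_disk.
    dataset.save_to_disk(out_dir)
```

Calling download_and_prepare("librispeech_asr", "clean", out_dir="librispeech_clean") would actually download the dataset, so it needs network access and sufficient disk space. A directory written this way can be read back with datasets.load_from_disk.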

Example for Librispeech:

DownloadAndPrepareHuggingFaceDatasetJob("librispeech_asr", "clean")
# https://github.com/huggingface/datasets/issues/4179

Parameters:
  • path – Path or name of the dataset, passed to datasets.load_dataset

  • name – Name of the dataset configuration, passed to datasets.load_dataset

  • data_files – Path(s) to the source data file(s), passed to datasets.load_dataset

  • revision – Version of the dataset script, passed to datasets.load_dataset

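  • time_rqmt (float) – time requirement of the job (presumably in hours, as is usual for Sisyphus rqmt values)

  • mem_rqmt (float) – memory requirement of the job (presumably in GB, as is usual for Sisyphus rqmt values)

  • cpu_rqmt (int) – number of CPU cores requested for the job

  • mini_task (bool) – whether the job should be run as a mini_task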
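
As an illustration of the parameters above, the job could be created in a Sisyphus config along these lines. The helper name make_librispeech_job and the concrete resource values are made up for this example, and the units (hours for time_rqmt, GB for mem_rqmt) are an assumption based on the usual Sisyphus convention.

```python
def make_librispeech_job():
    """Hypothetical helper that creates the job for LibriSpeech "clean"."""
    # Imported here so that this sketch only needs i6_core when it is called.
    from i6_core.datasets.huggingface import DownloadAndPrepareHuggingFaceDatasetJob

    return DownloadAndPrepareHuggingFaceDatasetJob(
        "librispeech_asr",  # path: name of the dataset
        "clean",  # name: dataset configuration
        time_rqmt=4,  # assumed unit: hours (example value)
        mem_rqmt=8,  # assumed unit: GB (example value)
        cpu_rqmt=2,  # number of CPU cores (example value)
        mini_task=False,  # do not run as a mini task
    )
```

The helper is meant to be called from within a Sisyphus setup, where i6_core is importable; data_files and revision are left at their defaults here.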

classmethod hash(kwargs)
Parameters:

kwargs (dict[str]) – the arguments of the job

Returns:

hash for job given the arguments

Return type:

str

run()
tasks()
Returns:

yields Tasks

Return type:

list[sisyphus.task.Task]