i6_core.datasets.tf_datasets
This module adds jobs for TF datasets, as documented here: https://www.tensorflow.org/datasets
- class i6_core.datasets.tf_datasets.DownloadAndPrepareTfDatasetJob(*args, **kwargs)
This job downloads and prepares a TF dataset. The processed files are stored in a data_dir folder, from where the dataset can be loaded again (see https://www.tensorflow.org/datasets/overview#load_a_dataset).
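Loading the prepared data back can be sketched as follows, using the standard `tfds.load` API; the dataset name, split, and path are illustrative, and `tensorflow-datasets` must be installed:

```python
import tensorflow_datasets as tfds

# Load a previously prepared dataset by pointing data_dir at the folder
# produced by the job (path is illustrative).
ds = tfds.load(
    "librispeech",
    split="train_clean100",
    data_dir="/path/to/data_dir",
)
```

Because the data is already prepared in data_dir, this call does not trigger a new download.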
Install the dependencies:
pip install tensorflow-datasets
It further needs some extra dependencies, for example for ‘librispeech’:
pip install apache_beam
pip install pydub  # additionally requires ffmpeg to be installed on the system
See here for some more: https://github.com/tensorflow/datasets/blob/master/setup.py
Also maybe:
pip install datasets # for Huggingface community datasets
- Parameters:
dataset_name – Name of the dataset in the official TF catalog or community catalog. Available datasets can be found here: https://www.tensorflow.org/datasets/overview https://www.tensorflow.org/datasets/catalog/overview https://www.tensorflow.org/datasets/community_catalog/huggingface
max_simultaneous_downloads – maximum number of simultaneous downloads for tfds.load; some datasets do not work with the internal defaults, so use e.g. 1 in the case of ‘librispeech’ (https://github.com/tensorflow/datasets/issues/3885)
max_workers – maximum number of workers for the download extractor and Apache Beam; the default (CPU core count) can cause high memory load, so reduce this to a number smaller than the number of cores (https://github.com/tensorflow/datasets/issues/3887)
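Putting the parameters above together, a minimal Sisyphus config fragment might look like the following sketch; it assumes i6_core and sisyphus are set up, and the worker count of 4 is only an example:

```python
# Hedged sketch of a Sisyphus config fragment using this job.
from i6_core.datasets.tf_datasets import DownloadAndPrepareTfDatasetJob

job = DownloadAndPrepareTfDatasetJob(
    dataset_name="librispeech",
    max_simultaneous_downloads=1,  # librispeech needs a low value (issue #3885)
    max_workers=4,  # keep below the core count to limit memory load (issue #3887)
)
```

The job's output folder then serves as the data_dir argument for tfds.load; check the job's source for the exact output attribute name.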
- classmethod hash(kwargs)
- Parameters:
parsed_args (dict[str]) –
- Returns:
hash for job given the arguments
- Return type:
str
- run()
- tasks()
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]