i6_core.datasets.tf_datasets
This module adds jobs for TF datasets, as documented here: https://www.tensorflow.org/datasets
- class i6_core.datasets.tf_datasets.DownloadAndPrepareTfDatasetJob(*args, **kwargs)
This job downloads and prepares a TF dataset. The processed files are stored in a data_dir folder, from where the dataset can be loaded again (see https://www.tensorflow.org/datasets/overview#load_a_dataset).
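Loading the prepared data back can be sketched as follows, using the standard `tfds.load` API; the dataset name, split, and path are illustrative, and `tensorflow-datasets` must be installed:

```python
import tensorflow_datasets as tfds

# Load a previously prepared dataset by pointing data_dir at the folder
# produced by the job (path is illustrative).
ds = tfds.load(
    "librispeech",
    split="train_clean100",
    data_dir="/path/to/data_dir",
)
```

Because the data is already prepared in data_dir, this call does not trigger a new download.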
Install the dependencies:
pip install tensorflow-datasets
It further needs some extra dependencies, for example for ‘librispeech’:
pip install apache_beam
pip install pydub  # additionally requires ffmpeg to be installed on the system
See here for some more: https://github.com/tensorflow/datasets/blob/master/setup.py
Also maybe:
pip install datasets # for Huggingface community datasets
- Parameters:
dataset_name – Name of the dataset in the official TF catalog or community catalog. Available datasets can be found here: https://www.tensorflow.org/datasets/overview https://www.tensorflow.org/datasets/catalog/overview https://www.tensorflow.org/datasets/community_catalog/huggingface
max_simultaneous_downloads – maximum number of simultaneous downloads for tfds.load; some datasets do not work with the internal defaults, so use e.g. 1 in the case of ‘librispeech’ (https://github.com/tensorflow/datasets/issues/3885)
max_workers – maximum number of workers for the download extractor and Apache Beam; the default (CPU core count) can cause high memory load, so reduce this to a number smaller than the number of cores (https://github.com/tensorflow/datasets/issues/3887)
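Putting the parameters above together, a minimal Sisyphus config fragment might look like the following sketch; it assumes i6_core and sisyphus are set up, and the worker count of 4 is only an example:

```python
# Hedged sketch of a Sisyphus config fragment using this job.
from i6_core.datasets.tf_datasets import DownloadAndPrepareTfDatasetJob

job = DownloadAndPrepareTfDatasetJob(
    dataset_name="librispeech",
    max_simultaneous_downloads=1,  # librispeech needs a low value (issue #3885)
    max_workers=4,  # keep below the core count to limit memory load (issue #3887)
)
```

The job's output folder then serves as the data_dir argument for tfds.load; check the job's source for the exact output attribute name.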
- classmethod hash(kwargs)
- Parameters:
parsed_args (dict[str]) –
- Returns:
hash for job given the arguments
- Return type:
str
- run()
- tasks()
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]