i6_core.returnn.training

class i6_core.returnn.training.AverageTFCheckpointsJob(*args, **kwargs)

Compute the average of multiple specified TensorFlow checkpoints using the tf_avg_checkpoints script from RETURNN

Parameters:
  • model_dir – model dir from ReturnnTrainingJob

  • epochs – manually specified epochs or out_epoch from GetBestEpochJob

  • returnn_python_exe – file path to the executable for running returnn (python binary or .sh)

  • returnn_root – file path to the RETURNN repository root folder

run()
tasks()
Returns:

yields Task objects

Return type:

list[sisyphus.task.Task]
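Conceptually, the averaging computes the element-wise mean of every variable across the given checkpoints. A minimal pure-Python sketch of the idea (the job itself delegates to RETURNN's tf_avg_checkpoints script on real TensorFlow checkpoints; the dict-of-lists representation here is only an illustration):

```python
# Sketch of checkpoint averaging: each "checkpoint" maps variable names
# to weight lists; the result is the element-wise mean per variable.
# This only illustrates the idea; the job calls RETURNN's
# tf_avg_checkpoints script on real TensorFlow checkpoints.

def average_checkpoints(checkpoints):
    """checkpoints: list of dicts {var_name: [float, ...]}"""
    names = checkpoints[0].keys()
    averaged = {}
    for name in names:
        stacked = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*stacked)]
    return averaged

# Example: two toy "checkpoints" with a single variable
ckpt_a = {"layer/kernel": [0.0, 2.0]}
ckpt_b = {"layer/kernel": [2.0, 4.0]}
print(average_checkpoints([ckpt_a, ckpt_b]))  # {'layer/kernel': [1.0, 3.0]}
```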

class i6_core.returnn.training.Checkpoint(index_path)

Checkpoint object which holds the (TensorFlow) index file path as tk.Path, and will return the checkpoint path as the common prefix of the .index/.meta/.data[…] files

A Checkpoint object should be assigned directly to a RasrConfig entry (do not call .ckpt_path) so that the hash will resolve correctly

Parameters:

index_path (Path) –

property ckpt_path
exists()
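The common-prefix relationship between the index file and the checkpoint path can be illustrated with plain string handling (a sketch; the real class wraps tk.Path objects, and the file names below are hypothetical):

```python
# Sketch of how ckpt_path relates to the index file path: the checkpoint
# prefix shared by the .index/.meta/.data files is simply the index path
# with the ".index" suffix removed. The real Checkpoint wraps tk.Path.

def ckpt_path_from_index(index_path: str) -> str:
    assert index_path.endswith(".index")
    return index_path[: -len(".index")]

print(ckpt_path_from_index("out/models/epoch.050.index"))
# out/models/epoch.050  (matches epoch.050.meta, epoch.050.data-00000-of-00001)
```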
class i6_core.returnn.training.GetBestEpochJob(*args, **kwargs)

Provided a RETURNN model directory and a score key, finds the best epoch. Sorting is ascending (lower is better), so to access the model with the highest value use negative index values (e.g. -1 for the model with the highest score, error, or loss)

Parameters:
  • model_dir – model_dir output from a ReturnnTrainingJob

  • learning_rates – learning_rates output from a ReturnnTrainingJob

  • key – a key from the learning rate file that is used to sort the models, e.g. “dev_score_output/output_prob”

  • index – index of the sorted list to access, 0 for the lowest, -1 for the highest score/error/loss

run()
tasks()
Returns:

yields Task objects

Return type:

list[sisyphus.task.Task]
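The selection logic amounts to sorting the epochs ascending by the value stored under key and picking the requested index. A hypothetical sketch over an already-parsed learning-rates structure (the nested-dict shape here is a stand-in, not the actual file format):

```python
# Sketch of best-epoch selection: sort epochs ascending by the value
# under the given key (lower is better) and pick the requested index.
# The scores dict is a stand-in for the parsed learning-rates file.

def get_best_epoch(scores, key, index=0):
    """scores: {epoch: {key: value, ...}} -> epoch number"""
    ranked = sorted(scores, key=lambda epoch: scores[epoch][key])
    return ranked[index]

scores = {
    10: {"dev_score_output/output_prob": 0.31},
    20: {"dev_score_output/output_prob": 0.25},
    30: {"dev_score_output/output_prob": 0.28},
}
print(get_best_epoch(scores, "dev_score_output/output_prob", index=0))   # 20 (lowest score)
print(get_best_epoch(scores, "dev_score_output/output_prob", index=-1))  # 10 (highest score)
```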

class i6_core.returnn.training.GetBestPtCheckpointJob(*args, **kwargs)

Analogous to GetBestTFCheckpointJob, but for PyTorch checkpoints.

Parameters:
  • model_dir (Path) – model_dir output from a ReturnnTrainingJob

  • learning_rates (Path) – learning_rates output from a ReturnnTrainingJob

  • key (str) – a key from the learning rate file that is used to sort the models e.g. “dev_score_output/output_prob”

  • index (int) – index of the sorted list to access, 0 for the lowest, -1 for the highest score

run()
class i6_core.returnn.training.GetBestTFCheckpointJob(*args, **kwargs)

Returns the best checkpoint given a training model dir and a learning-rates file. The best checkpoint will be HARD-linked if possible, so that no space is wasted, but the model is also not deleted in case the training folder is removed.

Parameters:
  • model_dir (Path) – model_dir output from a ReturnnTrainingJob

  • learning_rates (Path) – learning_rates output from a ReturnnTrainingJob

  • key (str) – a key from the learning rate file that is used to sort the models e.g. “dev_score_output/output_prob”

  • index (int) – index of the sorted list to access, 0 for the lowest, -1 for the highest score

run()
tasks()
Returns:

yields Task objects

Return type:

list[sisyphus.task.Task]
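The hard-link-or-copy behavior can be sketched with the standard library (a simplification of the intent described above; the file names are hypothetical):

```python
import os
import shutil
import tempfile

# Sketch of "hard-link if possible": a hard link shares the data with the
# original file (no extra space), and the data survives even if the
# original directory entry is removed. Fall back to copying when linking
# is not possible (e.g. across filesystems).

def link_or_copy(src: str, dst: str) -> None:
    try:
        os.link(src, dst)
    except OSError:
        shutil.copyfile(src, dst)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "epoch.050.index")
    dst = os.path.join(tmp, "best.index")
    with open(src, "w") as f:
        f.write("checkpoint data")
    link_or_copy(src, dst)
    os.remove(src)  # the linked copy still holds the data
    with open(dst) as f:
        print(f.read())  # checkpoint data
```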

class i6_core.returnn.training.PtCheckpoint(path: Path)

Checkpoint object pointing to a PyTorch checkpoint .pt file

Parameters:

path – .pt file

exists()
class i6_core.returnn.training.ReturnnModel(returnn_config_file, model, epoch)

Defines a RETURNN model as config, checkpoint meta file and epoch

This is deprecated, use Checkpoint instead.

Parameters:
  • returnn_config_file (Path) – Path to a returnn config file

  • model (Path) – Path to a RETURNN checkpoint (only the .meta file for TensorFlow)

  • epoch (int) –

class i6_core.returnn.training.ReturnnTrainingFromFileJob(*args, **kwargs)

The job allows directly executing RETURNN config files. The config file has to contain the lines ext_model = config.value("ext_model", None) and model = ext_model to correctly set the model path.

If the learning-rate file should be available, additionally add ext_learning_rate_file = config.value("ext_learning_rate_file", None) and learning_rate_file = ext_learning_rate_file.

Other externally controllable parameters may be defined in the same way and can be set by providing the value in parameter_dict. The "ext_" prefix is a naming convention only, but should be used for all external parameters to clearly mark them instead of simply overwriting a normal parameter.

Also make sure that task = "train" is set.
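Putting these requirements together, a minimal config skeleton might look like this (a sketch; config.value is RETURNN's accessor for externally provided values, and ext_num_epochs is a hypothetical extra parameter shown only to illustrate the pattern):

```python
# Sketch of a RETURNN config prepared for ReturnnTrainingFromFileJob.
# The "ext_" values are injected by the job via parameter_dict;
# config.value() is RETURNN's accessor for such external values.

task = "train"

ext_model = config.value("ext_model", None)
model = ext_model  # model path controlled by the job

ext_learning_rate_file = config.value("ext_learning_rate_file", None)
learning_rate_file = ext_learning_rate_file

# any other externally controllable parameter follows the same pattern;
# "ext_num_epochs" is a hypothetical example, not a required name:
ext_num_epochs = int(config.value("ext_num_epochs", 100))
num_epochs = ext_num_epochs
```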

Parameters:
  • returnn_config_file (tk.Path|str) – a returnn training config file

  • parameter_dict (dict) – provide external parameters to the rnn.py call

  • time_rqmt (int|str) –

  • mem_rqmt (int|str) –

  • returnn_python_exe (Optional[Path]) – file path to the executable for running returnn (python binary or .sh)

  • returnn_root (Optional[Path]) – file path to the RETURNN repository root folder

create_files()
get_parameter_list()
classmethod hash(kwargs)
Parameters:

parsed_args (dict[str]) –

Returns:

hash for job given the arguments

Return type:

str

path_available(path)

Returns True if the given path is already available

Parameters:

path – path to check

Returns:

run()
tasks()
Returns:

yields Task objects

Return type:

list[sisyphus.task.Task]

class i6_core.returnn.training.ReturnnTrainingJob(*args, **kwargs)

Train a RETURNN model using the rnn.py entry point.

Only returnn_config, returnn_python_exe and returnn_root influence the hash.

The outputs provided are:

  • out_returnn_config_file: the finalized RETURNN config that is used for the rnn.py call

  • out_learning_rates: the file containing the learning rates and training scores (e.g. use to select the best checkpoint or generate plots)

  • out_model_dir: the model directory, which can be used in succeeding jobs to select certain models or do combinations

    note that the model dir is DIRECTLY AVAILABLE when the job starts running, so jobs that do not have other conditions need to implement an “update” method to check whether the required checkpoints already exist

  • out_checkpoints: a dictionary containing all created checkpoints. Note that when using the automatic checkpoint cleaning of RETURNN, not all checkpoints are actually available.
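How save_interval and the keep behavior interact with the available checkpoints can be sketched as follows (a simplified model for illustration, not the job's actual implementation):

```python
# Sketch of which epoch checkpoints end up available: checkpoints are
# written every save_interval epochs (plus the final epoch), and with
# automatic cleanup only the epochs in keep_epochs survive. This is a
# simplified model of the behavior, not the actual implementation.

def available_epochs(num_epochs, save_interval, keep_epochs=None):
    saved = set(range(save_interval, num_epochs + 1, save_interval))
    saved.add(num_epochs)  # the final epoch is always saved
    if keep_epochs is not None:
        saved &= set(keep_epochs)
    return sorted(saved)

print(available_epochs(num_epochs=10, save_interval=2))
# [2, 4, 6, 8, 10]
print(available_epochs(num_epochs=10, save_interval=2, keep_epochs={4, 10}))
# [4, 10]
```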

Parameters:
  • returnn_config

  • log_verbosity – RETURNN log verbosity from 1 (least verbose) to 5 (most verbose)

  • device – “cpu” or “gpu”

  • num_epochs – number of epochs to run, will also set num_epochs in the config file. Note that this value is NOT HASHED, so that this number can be increased to continue the training.

  • save_interval – save a checkpoint each n-th epoch

  • keep_epochs – specify which checkpoints are kept; use None for the RETURNN default. This will also limit the available output checkpoints to those defined. If you want to specify the keep behavior without this limitation, provide cleanup_old_models/keep in the post-config and use None here.

  • time_rqmt

  • mem_rqmt

  • cpu_rqmt

  • horovod_num_processes – If used without multi_node_slots, then single node, otherwise multi node.

  • multi_node_slots – multi-node multi-GPU training. See Sisyphus rqmt documentation. Currently only with Horovod, and horovod_num_processes should be set as well, usually to the same value. See https://returnn.readthedocs.io/en/latest/advanced/multi_gpu.html.

  • returnn_python_exe – file path to the executable for running returnn (python binary or .sh)

  • returnn_root – file path to the RETURNN repository root folder

check_blacklisted_parameters(returnn_config)

Check for parameters that should not be set in the config directly

Parameters:

returnn_config (ReturnnConfig) –

Returns:

create_files()
classmethod create_returnn_config(returnn_config, log_verbosity, device, num_epochs, save_interval, keep_epochs, horovod_num_processes, **kwargs)
classmethod hash(kwargs)
Parameters:

parsed_args (dict[str]) –

Returns:

hash for job given the arguments

Return type:

str

info()

Returns information about the currently running job to be displayed on the web interface and in the manager view

Returns:

string to be displayed, or None if not available

Return type:

str

path_available(path)

Returns True if the given path is already available

Parameters:

path – path to check

Returns:

plot()
run()
tasks()
Returns:

yields Task objects

Return type:

list[sisyphus.task.Task]