i6_core.returnn.training
- class i6_core.returnn.training.AverageTFCheckpointsJob(*args, **kwargs)¶
Computes the average of multiple specified TensorFlow checkpoints using the tf_avg_checkpoints script from RETURNN
- Parameters:
model_dir – model dir from ReturnnTrainingJob
epochs – manually specified epochs or out_epoch from GetBestEpochJob
returnn_python_exe – file path to the executable for running returnn (python binary or .sh)
returnn_root – file path to the RETURNN repository root folder
- run()¶
- tasks()¶
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]
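The averaging this job performs can be illustrated with a minimal, self-contained sketch (plain Python dicts instead of TensorFlow checkpoint files, and not the actual tf_avg_checkpoints script): each variable in the averaged checkpoint is the element-wise mean of that variable across all input checkpoints.

```python
# Sketch of checkpoint averaging: each output variable is the element-wise
# mean of the corresponding variable across all input checkpoints.
# Plain Python stand-in for what the tf_avg_checkpoints script does.

def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping variable name -> list of floats."""
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }

ckpt_a = {"layer/W": [1.0, 2.0], "layer/b": [0.0]}
ckpt_b = {"layer/W": [3.0, 4.0], "layer/b": [2.0]}
print(average_checkpoints([ckpt_a, ckpt_b]))
# {'layer/W': [2.0, 3.0], 'layer/b': [1.0]}
```

Averaging the last few checkpoints of a training run is a common cheap ensemble trick; the epochs argument (or out_epoch from GetBestEpochJob) selects which checkpoints enter the mean.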
- class i6_core.returnn.training.Checkpoint(index_path)¶
Checkpoint object which holds the (TensorFlow) index file path as tk.Path, and will return the checkpoint path as the common prefix of the .index/.meta/.data[…] files
A checkpoint object should be assigned directly to a RasrConfig entry (do not call .ckpt_path) so that the hash will resolve correctly
- Parameters:
index_path (Path) –
- property ckpt_path¶
- exists()¶
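The common-prefix behavior can be sketched with a simplified stand-in class (assuming here that the prefix is simply the index path with its .index suffix stripped; this is an illustration, not the actual implementation):

```python
import os

class CheckpointSketch:
    """Simplified stand-in for Checkpoint: holds the .index file path and
    derives the checkpoint prefix shared by the .index/.meta/.data* files."""

    def __init__(self, index_path):
        self.index_path = index_path

    @property
    def ckpt_path(self):
        # the checkpoint prefix is the index path without its ".index" suffix
        return self.index_path[: -len(".index")]

    def exists(self):
        return os.path.exists(self.index_path)

ckpt = CheckpointSketch("/work/models/epoch.050.index")
print(ckpt.ckpt_path)  # /work/models/epoch.050
```

TensorFlow consumers take this prefix and locate the .index/.meta/.data files themselves, which is why the object, not .ckpt_path, should be passed around in the Sisyphus graph.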
- class i6_core.returnn.training.GetBestEpochJob(*args, **kwargs)¶
Given a RETURNN model directory and a score key, finds the best epoch. Sorting is lower=better, so to access the model with the highest value use negative index values (e.g. -1 for the model with the highest score, error or “loss”)
- Parameters:
model_dir – model_dir output from a RETURNNTrainingJob
learning_rates – learning_rates output from a RETURNNTrainingJob
key – a key from the learning rate file that is used to sort the models, e.g. “dev_score_output/output_prob”
index – index of the sorted list to access, 0 for the lowest, -1 for the highest score/error/loss
- run()¶
- tasks()¶
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]
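The selection logic can be sketched self-containedly: rank epochs by the chosen key from the learning-rates data (lower = better) and pick the requested index. The data layout below only approximates the contents of a RETURNN learning-rates file; the key name is the example from the parameter description.

```python
# Sketch of best-epoch selection: sort epochs by a score key from the
# learning-rates data (lower = better) and pick the requested index.

def get_best_epoch(epoch_data, key, index=0):
    """epoch_data: dict epoch -> {"error": {key: value, ...}}"""
    ranked = sorted(epoch_data, key=lambda ep: epoch_data[ep]["error"][key])
    return ranked[index]

epoch_data = {
    10: {"error": {"dev_score_output/output_prob": 0.42}},
    20: {"error": {"dev_score_output/output_prob": 0.31}},
    30: {"error": {"dev_score_output/output_prob": 0.35}},
}
print(get_best_epoch(epoch_data, "dev_score_output/output_prob", index=0))   # 20
print(get_best_epoch(epoch_data, "dev_score_output/output_prob", index=-1))  # 10
```

index=0 selects the epoch with the lowest dev score (epoch 20 here), while index=-1 selects the epoch with the highest (epoch 10).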
- class i6_core.returnn.training.GetBestPtCheckpointJob(*args, **kwargs)¶
Analogous to GetBestTFCheckpointJob, but for PyTorch checkpoints.
- Parameters:
model_dir (Path) – model_dir output from a ReturnnTrainingJob
learning_rates (Path) – learning_rates output from a ReturnnTrainingJob
key (str) – a key from the learning rate file that is used to sort the models e.g. “dev_score_output/output_prob”
index (int) – index of the sorted list to access, 0 for the lowest, -1 for the highest score
- run()¶
- class i6_core.returnn.training.GetBestTFCheckpointJob(*args, **kwargs)¶
Returns the best checkpoint given a training model dir and a learning-rates file. The best checkpoint will be HARD-linked if possible, so that no space is wasted and the model is not deleted in case the training folder is removed.
- Parameters:
model_dir (Path) – model_dir output from a RETURNNTrainingJob
learning_rates (Path) – learning_rates output from a RETURNNTrainingJob
key (str) – a key from the learning rate file that is used to sort the models e.g. “dev_score_output/output_prob”
index (int) – index of the sorted list to access, 0 for the lowest, -1 for the highest score
- run()¶
- tasks()¶
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]
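The hard-link-with-fallback behavior can be sketched as follows (a simplified illustration, not the job's actual code): a hard link costs no extra space and keeps the file's data alive even after the original directory entry is removed, while the copy fallback covers cases such as linking across filesystems.

```python
import os
import shutil
import tempfile

def link_or_copy(src, dst):
    """Hard-link src to dst if possible (no extra space, and the data
    survives deletion of src); fall back to copying, e.g. when src and
    dst are on different filesystems."""
    try:
        os.link(src, dst)
    except OSError:
        shutil.copyfile(src, dst)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "epoch.050.index")
    dst = os.path.join(tmp, "best.index")
    with open(src, "w") as f:
        f.write("checkpoint data")
    link_or_copy(src, dst)
    os.remove(src)  # the hard link keeps the data available
    with open(dst) as f:
        print(f.read())  # checkpoint data
```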
- class i6_core.returnn.training.PtCheckpoint(path: Path)¶
Checkpoint object pointing to a PyTorch checkpoint .pt file
- Parameters:
path – .pt file
- exists()¶
- class i6_core.returnn.training.ReturnnModel(returnn_config_file, model, epoch)¶
Defines a RETURNN model as config, checkpoint meta file and epoch
This is deprecated, use Checkpoint instead.
- Parameters:
returnn_config_file (Path) – Path to a returnn config file
model (Path) – Path to a RETURNN checkpoint (only the .meta for Tensorflow)
epoch (int) –
- class i6_core.returnn.training.ReturnnTrainingFromFileJob(*args, **kwargs)¶
This Job allows direct execution of RETURNN config files. The config files must contain the lines ext_model = config.value(“ext_model”, None) and model = ext_model to correctly set the model path.
If the learning rate file should be available, additionally add ext_learning_rate_file = config.value(“ext_learning_rate_file”, None) and learning_rate_file = ext_learning_rate_file.
Other externally controllable parameters may also be defined in the same way, and can be set by providing the parameter value in the parameter_dict. The “ext_” prefix is a naming convention only, but should be used for all external parameters to mark them clearly instead of simply overwriting a normal parameter.
Also make sure that task=”train” is set.
- Parameters:
returnn_config_file (tk.Path|str) – a returnn training config file
parameter_dict (dict) – provide external parameters to the rnn.py call
time_rqmt (int|str) –
mem_rqmt (int|str) –
returnn_python_exe (Optional[Path]) – file path to the executable for running returnn (python binary or .sh)
returnn_root (Optional[Path]) – file path to the RETURNN repository root folder
- create_files()¶
- get_parameter_list()¶
- classmethod hash(kwargs)¶
- Parameters:
parsed_args (dict[str]) –
- Returns:
hash for job given the arguments
- Return type:
str
- path_available(path)¶
Returns True if the given path is already available
- Parameters:
path – path to check
- Returns:
- run()¶
- tasks()¶
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]
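The ext_ pattern described above can be sketched self-containedly. The ConfigSketch class below is a hypothetical stand-in for RETURNN's config object (only its value() fallback behavior is modeled), and the model path is a made-up example of what parameter_dict might supply:

```python
class ConfigSketch:
    """Stand-in for RETURNN's config object: value() returns an externally
    provided parameter if present, otherwise the given default."""

    def __init__(self, external_params):
        self._params = external_params

    def value(self, name, default):
        return self._params.get(name, default)

# parameters as they would arrive via the job's parameter_dict
config = ConfigSketch({"ext_model": "/work/training/output/models/epoch"})

# the lines the job expects inside the config file:
ext_model = config.value("ext_model", None)
model = ext_model
ext_learning_rate_file = config.value("ext_learning_rate_file", None)
learning_rate_file = ext_learning_rate_file

print(model)               # /work/training/output/models/epoch
print(learning_rate_file)  # None (not supplied externally)
```

Because config.value falls back to the default, the same config file still works when run outside the job without any external parameters.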
- class i6_core.returnn.training.ReturnnTrainingJob(*args, **kwargs)¶
Train a RETURNN model using the rnn.py entry point.
Only returnn_config, returnn_python_exe and returnn_root influence the hash.
The outputs provided are:
- out_returnn_config_file: the finalized RETURNN config which is used for the rnn.py call
- out_learning_rates: the file containing the learning rates and training scores (e.g. used to select the best checkpoint or generate plots)
- out_model_dir: the model directory, which can be used in succeeding jobs to select certain models or do combinations. Note that the model dir is DIRECTLY AVAILABLE when the job starts running, so jobs that do not have other conditions need to implement an “update” method to check whether the required checkpoints already exist.
- out_checkpoints: a dictionary containing all created checkpoints. Note that when using the automatic checkpoint cleaning function of RETURNN, not all checkpoints are actually available.
- Parameters:
returnn_config –
log_verbosity – RETURNN log verbosity from 1 (least verbose) to 5 (most verbose)
device – “cpu” or “gpu”
num_epochs – number of epochs to run, will also set num_epochs in the config file. Note that this value is NOT HASHED, so that this number can be increased to continue the training.
save_interval – save a checkpoint each n-th epoch
keep_epochs – specify which checkpoints are kept, use None for the RETURNN default. This will also limit the available output checkpoints to those defined. If you want to specify the keep behavior without this limitation, provide cleanup_old_models/keep in the post-config and use None here.
time_rqmt –
mem_rqmt –
cpu_rqmt –
horovod_num_processes – If used without multi_node_slots, then single node, otherwise multi node.
multi_node_slots – multi-node multi-GPU training. See Sisyphus rqmt documentation. Currently only with Horovod, and horovod_num_processes should be set as well, usually to the same value. See https://returnn.readthedocs.io/en/latest/advanced/multi_gpu.html.
returnn_python_exe – file path to the executable for running returnn (python binary or .sh)
returnn_root – file path to the RETURNN repository root folder
- check_blacklisted_parameters(returnn_config)¶
Check for parameters that should not be set in the config directly
- Parameters:
returnn_config (ReturnnConfig) –
- Returns:
- create_files()¶
- classmethod create_returnn_config(returnn_config, log_verbosity, device, num_epochs, save_interval, keep_epochs, horovod_num_processes, **kwargs)¶
- classmethod hash(kwargs)¶
- Parameters:
parsed_args (dict[str]) –
- Returns:
hash for job given the arguments
- Return type:
str
- info()¶
Returns information about the currently running job to be displayed on the web interface and the manager view
- Returns:
string to be displayed or None if not available
- Return type:
str
- path_available(path)¶
Returns True if the given path is already available
- Parameters:
path – path to check
- Returns:
- plot()¶
- run()¶
- tasks()¶
- Returns:
yields Task objects
- Return type:
list[sisyphus.task.Task]
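How save_interval and keep_epochs interact to determine which out_checkpoints actually exist can be sketched as follows. This is a simplified model of the behavior described in the parameter list (assuming the final epoch is always saved), not the actual job code:

```python
def available_checkpoints(num_epochs, save_interval, keep_epochs=None):
    """Epochs for which a checkpoint is expected: every save_interval-th
    epoch plus the final one, optionally restricted to keep_epochs."""
    saved = set(range(save_interval, num_epochs + 1, save_interval))
    saved.add(num_epochs)  # assume the final epoch is always saved
    if keep_epochs is not None:
        saved &= set(keep_epochs)  # keep_epochs limits the output checkpoints
    return sorted(saved)

print(available_checkpoints(num_epochs=10, save_interval=4))
# [4, 8, 10]
print(available_checkpoints(num_epochs=10, save_interval=2, keep_epochs=[6, 10]))
# [6, 10]
```

This also illustrates the caveat from the outputs list: with automatic checkpoint cleaning (or a restrictive keep_epochs), not every entry one might expect in out_checkpoints is backed by a file on disk.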