
Training

This guide covers how to train models using the PyTorch Lightning framework with Hydra configuration management.

Basic Training

Quick Start

To start training with default configuration:

uv run python -m {{cookiecutter.project_slug}}.scripts.train

This will use the default configuration defined in your project's config files.

Training with Experiment Configs

Predefined experiment configs are recommended for reproducible training setups. Define one, then select it on the command line with experiment=<name>.

Experiment configs let you override parameters from the main config.
For example, you can use them to version-control the best hyperparameters for each combination of model and dataset.

Example experiment config:

# @package _global_

# to execute this experiment run:
# uv run python -m {{cookiecutter.project_slug}}.scripts.train experiment=example

defaults:
  - override /data: mnist.yaml
  - override /model: mnist.yaml
  - override /callbacks: default.yaml
  - override /trainer: default.yaml

# all parameters below will be merged with parameters from default configurations set above
# this allows you to overwrite only specified parameters

tags: ["mnist", "simple_dense_net"]

seed: 12345

trainer:
  min_epochs: 10
  max_epochs: 10
  gradient_clip_val: 0.5

model:
  optimizer:
    lr: 0.002
  net:
    lin1_size: 128
    lin2_size: 256
    lin3_size: 64

data:
  batch_size: 64

logger:
  wandb:
    tags: ${tags}
    group: "mnist"
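The merge behaviour described in the config comments can be sketched with plain dictionaries. This is only an illustration of the idea, not Hydra's implementation; the `defaults` values below are hypothetical stand-ins for whatever your default configs contain, and `deep_merge` is a helper written for this example.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` replace or extend `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply take the override value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults, standing in for the composed default configuration.
defaults = {
    "trainer": {"min_epochs": 1, "max_epochs": 100, "accelerator": "auto"},
    "data": {"batch_size": 128, "num_workers": 0},
}
# Values taken from the example experiment config.
experiment = {
    "trainer": {"min_epochs": 10, "max_epochs": 10, "gradient_clip_val": 0.5},
    "data": {"batch_size": 64},
}

config = deep_merge(defaults, experiment)
print(config["trainer"])
print(config["data"])
```

Only the keys named in the experiment change; everything else keeps its default, which is why an experiment file can stay short.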

Training with Custom Parameters

You can override any configuration parameter directly from the command line:

# Change learning rate and batch size
uv run python -m {{cookiecutter.project_slug}}.scripts.train model.optimizer.lr=0.001 data.batch_size=64

# Train for specific number of epochs
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.max_epochs=50

# Use different model architecture
uv run python -m {{cookiecutter.project_slug}}.scripts.train model=mnist
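To make the dotted override syntax concrete, here is a rough standard-library sketch of how an override such as model.optimizer.lr=0.001 addresses a nested key. It is a simplification written for this guide: `parse_value` and `apply_override` are hypothetical helpers, and Hydra's real override grammar is richer. In particular, Hydra treats config-group selections like model=mnist differently from value overrides, and it normally requires a + prefix to add keys that do not exist yet.

```python
import ast


def parse_value(text: str):
    """Best-effort conversion of an override value to bool, number, or string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. "16-mixed"): keep it as a string.
        return text


def apply_override(config: dict, override: str) -> None:
    """Apply one `a.b.c=value` override to a nested dict in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Simplification: missing intermediate sections are created silently.
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw_value)


config = {"model": {"optimizer": {"lr": 0.002}}, "data": {"batch_size": 128}}
for item in ["model.optimizer.lr=0.001", "data.batch_size=64", "trainer.max_epochs=50"]:
    apply_override(config, item)
print(config)
```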

Hardware Configuration

GPU Training

PyTorch Lightning automatically detects and uses available GPUs:

# Single GPU training (automatic)
uv run python -m {{cookiecutter.project_slug}}.scripts.train

# Specify number of GPUs
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.devices=2

# Multi-GPU training with specific strategy
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.devices=4 trainer.strategy=ddp

CPU Training

Force CPU training:

uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.accelerator=cpu

Mixed Precision Training

Enable automatic mixed precision for faster training:

uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.precision=16-mixed
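One reason mixed precision needs care is that very small gradients underflow to zero in float16; automatic mixed precision counters this by scaling the loss before the backward pass and unscaling afterwards. The sketch below shows the effect using only the standard library. The scale factor of 1024 is an arbitrary illustrative value, not the one Lightning or PyTorch uses.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_gradient = 1e-8
loss_scale = 1024.0  # hypothetical scale factor, chosen for illustration

# Stored directly in float16, the gradient underflows to zero.
print(to_half(tiny_gradient))

# Scaled first, it stays representable and can be unscaled at higher precision.
scaled = to_half(tiny_gradient * loss_scale)
print(scaled)
print(scaled / loss_scale)
```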

Advanced Training Options

Resume Training

Resume from a checkpoint:

uv run python -m {{cookiecutter.project_slug}}.scripts.train ckpt_path="/path/to/checkpoint.ckpt"

Training with Different Experiments

Use predefined experiment configurations:

# Use specific experiment config
uv run python -m {{cookiecutter.project_slug}}.scripts.train experiment=mnist_experiment

# Override experiment parameters
uv run python -m {{cookiecutter.project_slug}}.scripts.train experiment=mnist_experiment model.optimizer.lr=0.01

Logging and Monitoring

TensorBoard (Default)

TensorBoard logs are automatically saved. View them with:

tensorboard --logdir logs/

Weights & Biases

If configured, enable W&B logging:

uv run python -m {{cookiecutter.project_slug}}.scripts.train logger=wandb

Multiple Loggers

Use multiple loggers simultaneously:

uv run python -m {{cookiecutter.project_slug}}.scripts.train logger=many_loggers

Training Strategies

Single Machine Training

# Standard single GPU
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.devices=1

# Single machine, multiple GPUs
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer=ddp trainer.devices=4

Distributed Training

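For multi-node training (here 2 nodes with 4 GPUs each), set the following environment variables on every node before launching:

export MASTER_PORT=1234               # free port on the master node (node 0)
export MASTER_ADDR="node-0-address"   # replace with the address of node 0; same value on every node
export NUM_NODES=2
export WORLD_SIZE=8                   # total number of processes: 2 nodes x 4 GPUs
export NODE_RANK=0                    # 0 on node 0, 1 on node 1

# Node 0 (master), launched with NODE_RANK=0
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.devices=4 trainer.num_nodes=$NUM_NODES

# Node 1, launched with NODE_RANK=1
uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.devices=4 trainer.num_nodes=$NUM_NODES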
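The launch command is the same on every node; what differs is the node rank, which determines the global rank of each process. A small sketch of that arithmetic for the 2-node, 4-GPU layout, assuming one process per GPU (`global_ranks` is a helper written for this example):

```python
def global_ranks(node_rank: int, devices_per_node: int) -> list[int]:
    """Global ranks of the processes started on one node."""
    first = node_rank * devices_per_node
    return list(range(first, first + devices_per_node))


num_nodes = 2
devices_per_node = 4

# One process per GPU, so the total process count is nodes x devices.
world_size = num_nodes * devices_per_node
print("world size:", world_size)

for node_rank in range(num_nodes):
    print(f"node {node_rank}: global ranks {global_ranks(node_rank, devices_per_node)}")
```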

Model Checkpointing

Automatic Checkpointing

The framework automatically saves checkpoints based on validation metrics:

# Save top 3 models based on validation accuracy
uv run python -m {{cookiecutter.project_slug}}.scripts.train callbacks.model_checkpoint.save_top_k=3 \
                   callbacks.model_checkpoint.monitor="val/acc"
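What save_top_k=3 means in practice can be sketched as a ranking over monitored scores. The checkpoint names and accuracies below are made up, and `top_k_checkpoints` is an illustrative helper rather than the checkpoint callback's actual logic, which also handles saving and deleting files during training.

```python
import heapq


def top_k_checkpoints(scores: dict[str, float], k: int, mode: str = "max") -> list[str]:
    """Return the k checkpoint names with the best monitored score, best first."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = heapq.nlargest if mode == "max" else heapq.nsmallest
    return pick(k, scores, key=scores.get)


# Hypothetical validation accuracy recorded for each epoch's checkpoint.
val_acc = {
    "epoch_0.ckpt": 0.81,
    "epoch_1.ckpt": 0.88,
    "epoch_2.ckpt": 0.86,
    "epoch_3.ckpt": 0.91,
    "epoch_4.ckpt": 0.90,
}

kept = top_k_checkpoints(val_acc, k=3, mode="max")
print(kept)
```

For an accuracy-like metric the mode should be "max"; for a loss it should be "min".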

Manual Checkpointing

Save checkpoints at regular intervals:

uv run python -m {{cookiecutter.project_slug}}.scripts.train callbacks.model_checkpoint.every_n_epochs=10

Training Monitoring

Progress Bars

Customize training progress display:

# Switch to a different progress bar
uv run python -m {{cookiecutter.project_slug}}.scripts.train callbacks.progress_bar=rich

Early Stopping

Enable early stopping to prevent overfitting:

uv run python -m {{cookiecutter.project_slug}}.scripts.train callbacks.early_stopping.monitor="val/loss" \
                   callbacks.early_stopping.patience=10 \
                   callbacks.early_stopping.mode="min"
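The interaction between patience and the monitored metric is easy to misread, so here is a minimal sketch of the patience rule for mode="min". The loss values are invented and `stopping_epoch` is a hypothetical helper; the real callback has more options than shown here.

```python
from typing import Optional


def stopping_epoch(losses: list[float], patience: int, min_delta: float = 0.0) -> Optional[int]:
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # New best value: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
val_loss = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.64]

print(stopping_epoch(val_loss, patience=3))
print(stopping_epoch(val_loss, patience=10))
```

With patience=3 the run stops before reaching the later, better value at the last epoch, which is the trade-off a small patience makes.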

Common Training Workflows

Development Training

Quick training for development and debugging:

# Limit the amount of training data
uv run python -m {{cookiecutter.project_slug}}.scripts.train debug=limit

Production Training

Full training with all features enabled:

uv run python -m {{cookiecutter.project_slug}}.scripts.train \
  trainer.max_epochs=100 \
  trainer=ddp \
  trainer.devices=4 \
  trainer.precision=16-mixed \
  callbacks.model_checkpoint.save_top_k=5 \
  callbacks.early_stopping.patience=15

Hyperparameter Tuning

Use Hydra's multirun feature for hyperparameter sweeps:

# Grid search over learning rates
uv run python -m {{cookiecutter.project_slug}}.scripts.train -m model.optimizer.lr=0.001,0.01,0.1

# Hyperparameter search with Optuna
uv run python -m {{cookiecutter.project_slug}}.scripts.train -m hparams_search=optuna experiment=example
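With the basic multirun sweeper, comma-separated values are expanded into one job per combination, so sweeping several parameters at once multiplies the number of runs. A standard-library sketch of that expansion follows; `expand_sweep` is written for this example, and the batch-size sweep is an added illustration.

```python
from itertools import product


def expand_sweep(overrides: list[str]) -> list[list[str]]:
    """Expand comma-separated override values into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, _, values = item.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    # Cartesian product over all value lists: one combination per job.
    return [[f"{k}={v}" for k, v in zip(keys, combo)] for combo in product(*choices)]


jobs = expand_sweep(["model.optimizer.lr=0.001,0.01,0.1", "data.batch_size=32,64"])
for number, job in enumerate(jobs):
    print(number, " ".join(job))
```

Three learning rates and two batch sizes already produce six runs, so it is worth estimating the job count before launching a grid.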

Debugging Training

Debug Mode

Enable debug mode for detailed logging:

uv run python -m {{cookiecutter.project_slug}}.scripts.train debug=default

Profiling

Profile your training code:

uv run python -m {{cookiecutter.project_slug}}.scripts.train debug=profiler trainer.profiler=simple

# Advanced profiling
uv run python -m {{cookiecutter.project_slug}}.scripts.train debug=profiler trainer.profiler=advanced

Detect Anomalies

Enable anomaly detection:

uv run python -m {{cookiecutter.project_slug}}.scripts.train trainer.detect_anomaly=true

Troubleshooting

Out of Memory (OOM)

  • Reduce batch size: data.batch_size=16
  • Use mixed precision: trainer.precision=16-mixed

Slow Training

  • Increase number of workers: data.num_workers=8
  • Pin host memory to speed up CPU-to-GPU transfers: data.pin_memory=true
  • Compile the model: model.compile=true

Unstable Training

  • Reduce learning rate: model.optimizer.lr=0.0001
  • Add gradient clipping: trainer.gradient_clip_val=0.5
  • Tune the learning rate scheduler, for example its step size: model.scheduler.step_size=30
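The last two stability options can be illustrated numerically. The sketch below assumes norm-based gradient clipping and a step-decay scheduler with a decay factor of 0.1; both are assumptions about how your model config is set up, and the helpers are simplified stand-ins for what PyTorch does internally.

```python
import math


def clip_by_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def step_lr(base_lr: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Learning rate under step decay: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# A gradient with L2 norm 5.0 is scaled down to norm 0.5; its direction is unchanged.
clipped = clip_by_norm([3.0, 4.0], max_norm=0.5)
print(clipped)

# With step_size=30 the learning rate drops at epochs 30, 60, and so on.
print([step_lr(0.0001, epoch, step_size=30) for epoch in (0, 29, 30, 60)])
```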