Advanced Optimizers

Advanced Optimizers: Getting the Most out of Your Models

Table of Contents

Why Use Advanced Optimizers?
AdamW
RAdam
NAdam
Lookahead
Practical Tips & Benchmarks
Hands‑on Exercises

Why Use Advanced Optimizers?

Standard stochastic gradient descent (SGD) with momentum works well for many problems, but recent research shows that optimizers like AdamW, RAdam, and Lookahead often converge faster and achieve better generalization on deep networks.

Key benefits include:

Adaptive learning rates per parameter
Weight‑decay decoupling (AdamW)
Rectified variance for stable early training (RAdam)
Combination of fast and slow weights (Lookahead)

AdamW – Decoupled Weight Decay

AdamW separates L2 regularization from the gradient update, leading to better convergence for transformer‑style architectures.

import torch
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=3e-4,
                              weight_decay=0.01)

RAdam – Rectified Adam

RAdam automatically rectifies the variance of the adaptive learning rate, making it robust during the early epochs where Adam may be unstable.

from torch_optimizer import RAdam
optimizer = RAdam(model.parameters(),
                  lr=1e-3,
                  weight_decay=5e-4)

NAdam – Nesterov‑accelerated Adam

NAdam merges Nesterov momentum with Adam’s adaptive updates, yielding smoother trajectories.

optimizer = torch.optim.NAdam(model.parameters(),
                               lr=2e-4,
                               weight_decay=0)

Lookahead – Slow & Fast Weights

Lookahead maintains two sets of weights: a fast learner (e.g., Adam) and a slow learner that periodically “looks ahead”.

from torch_optimizer import Lookahead, AdamW
base_opt = AdamW(model.parameters(), lr=1e-3)
optimizer = Lookahead(base_opt, k=5, alpha=0.5)

Practical Tips & Benchmarks

Below is a quick benchmark on CIFAR‑10 using a ResNet‑18 backbone.

Optimizer	Final Test Accuracy	Epochs to 90%
SGD + Momentum	92.1%	78
AdamW	93.4%	62
RAdam	93.1%	65
Lookahead (AdamW)	93.6%	58

Hands‑on Exercises

Replace the optimizer in train.py with RAdam and observe the training curve.
Combine Lookahead with AdamW and tune k and alpha.
Implement a custom learning‑rate schedule that warms up for 5 epochs before decaying.