Yogi Optimizer
Unlike Adam, which uses a multiplicative update that can lead to rapid changes in the learning rate, Yogi uses an additive update based on the sign of the difference between the current squared gradient and the previous second-moment estimate.
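The difference between the two updates can be sketched on a single scalar. This is a minimal illustration, not a library implementation: the function names, the starting estimate of 1.0, and the constant small gradient of 0.01 are all assumptions chosen to make the effect visible.

```python
def adam_second_moment(v_prev, grad, b2=0.999):
    """Adam: move v by a fixed fraction of its distance to grad**2."""
    g2 = grad * grad
    return v_prev - (1.0 - b2) * (v_prev - g2)


def yogi_second_moment(v_prev, grad, b2=0.999):
    """Yogi: move v by (1 - b2) * grad**2, in the direction given by a sign."""
    g2 = grad * grad
    diff = v_prev - g2
    sign = (diff > 0) - (diff < 0)  # -1, 0, or +1
    return v_prev - (1.0 - b2) * sign * g2


# Hypothetical scenario: a large estimate followed by many small gradients.
v_adam = v_yogi = 1.0
for _ in range(1000):
    v_adam = adam_second_moment(v_adam, 0.01)
    v_yogi = yogi_second_moment(v_yogi, 0.01)

# Adam's estimate has decayed a long way toward grad**2, while Yogi's
# has barely moved, so Yogi's effective learning rate grows more slowly.
print(v_adam, v_yogi)
```

In this scenario the size of Adam's step depends on how far the estimate is from the squared gradient, whereas the size of Yogi's step depends only on the squared gradient itself, which is why small gradients change Yogi's estimate slowly.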
Yogi modifies how the "second moment" (the moving average of squared gradients) is updated. In Adam, this update is multiplicative: the estimate moves by a fraction of its distance from the current squared gradient, so it can change too quickly and "forget" past gradients in sparse settings. Yogi replaces this with an additive update whose direction is set by the sign of the difference between the current squared gradient and the previous estimate.

🚀 Key Improvements over Adam
Stays effective with very little hyperparameter tuning compared to other adaptive methods. In TensorFlow Federated, Yogi is available as tff.learning.optimizers.build_yogi.
This is where Yogi modifies the equation.
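Assuming the equation in question is the second-moment update, the two rules can be written side by side. The symbols are assumptions for illustration: v_t is the second-moment estimate, g_t the current gradient, and \beta_2 the decay rate of the second moment.

```latex
\begin{align}
\text{Adam:}\quad v_t &= v_{t-1} - (1-\beta_2)\,\bigl(v_{t-1} - g_t^2\bigr) \\
\text{Yogi:}\quad v_t &= v_{t-1} - (1-\beta_2)\,\operatorname{sign}\bigl(v_{t-1} - g_t^2\bigr)\, g_t^2
\end{align}
```

Adam's rule here is the usual moving average v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, rearranged so that the only difference from Yogi is the term multiplying (1-\beta_2).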
In JAX, Yogi is available through Optax:

    import optax

    optimizer = optax.yogi(learning_rate=0.01, b1=0.9, b2=0.999, eps=1e-3)
PyTorch does not include Yogi in its core library, but it is available via torch_optimizer or can be implemented in a few lines.
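To give a sense of what "a few lines" means, here is a framework-free sketch of one Yogi step in plain Python, used instead of PyTorch so that no particular optimizer API is assumed. The function name `yogi_step`, the toy objective (x - 3)^2, and the initial second-moment value of 1e-6 are illustrative assumptions. Bias correction is left out for brevity, and library implementations may differ in such details.

```python
import math


def yogi_step(param, grad, m, v, lr=0.01, b1=0.9, b2=0.999, eps=1e-3):
    """One Yogi update for a scalar parameter (no bias correction)."""
    g2 = grad * grad
    # First moment: the same exponential moving average as Adam.
    m = b1 * m + (1.0 - b1) * grad
    # Second moment: additive step whose direction is sign(v - g^2).
    sign = (v > g2) - (v < g2)
    v = v - (1.0 - b2) * sign * g2
    # Parameter update, scaled by the square root of the second moment.
    param = param - lr * m / (math.sqrt(v) + eps)
    return param, m, v


# Toy problem (an assumption for illustration): minimize f(x) = (x - 3)^2.
x, m, v = 0.0, 0.0, 1e-6
for _ in range(2000):
    grad = 2.0 * (x - 3.0)
    x, m, v = yogi_step(x, grad, m, v)

print(x)  # should end up close to the minimizer at 3.0
```

A PyTorch version would apply the same three updates tensor-wise inside an optimizer's step method, keeping `m` and `v` as per-parameter state.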
While Adam is highly effective for many deep learning tasks, it can struggle with convergence in certain convex and nonconvex landscapes. Specifically, Adam's second-moment estimate, which tracks the squared gradients, can "forget" past values too quickly if updates are sparse or gradients have high variance. This can make the effective learning rate blow up, causing the model to diverge or oscillate.

How Yogi Optimizes Performance
The Yogi optimizer is an adaptive gradient algorithm designed to address the non-convergence and stability issues found in the popular Adam optimizer. Developed by Zaheer et al. (2018), it is particularly effective for training large-scale deep learning models in vision and natural language processing.

💡 Core Concept
