Article Content:-
Abstract
Optimization lies at the core of machine learning and deep learning models. This paper summarizes the principal categories of optimization algorithms used in AI and deep learning, describes the mathematical concepts underpinning them in non-technical language, and relates those concepts to training recipes and engineering considerations. We discuss first-order stochastic algorithms (SGD and its variants), adaptive algorithms (AdaGrad, RMSProp, Adam)(4)(5), momentum and acceleration, second-order concepts and approximations (Newton, quasi-Newton, natural gradient, K-FAC), and recent issues such as learning-rate schedules, batch normalization behavior, sharp vs. flat minima, generalization, and the training of large models. Pseudocode, best-practice hyperparameters, and practical pitfalls are provided so that the paper serves both as a conceptual primer and as a practical guide.
References:-
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics.
Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2).
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization (AdaGrad).
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization.
Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization (AdamW).
Keskar, N. S., et al. (2017). On large-batch training for deep learning: Generalization gap and sharp minima.
Zhang, Y., et al. (2020). A brief survey of adaptive optimization methods in deep learning (survey paper).
Foret, Pierre; Kleiner, Ariel; Mobahi, Hossein; Neyshabur, Behnam. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv:2010.01412 (2020).
Kwon, Jungmin; Kim, Jeongseop; Park, Hyunseo; Choi, In Kwon. ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks. ICML 2021.
Du, Jiawei; Yan, Hanshu; Feng, Jiashi; Zhou, Joey Tianyi; Zhen, Liangli; Goh, Rick Siow Mong; Tan, Vincent Y. F. Efficient Sharpness-aware Minimization for Improved Training of Neural Networks. arXiv:2110.03141 (2021).
Yue, Yun; Jiang, Jiadi; Ye, Zhiling; Gao, Ning; Liu, Yongchao; Zhang, Ke. Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term (WSAM). KDD 2023.
Yu, Runsheng; Zhang, Youzhi; Kwok, James. Improving Sharpness-Aware Minimization by Lookahead. ICML 2024.
Zhou, Zhanpeng; Wang, Mingze; Mao, Yuchen; Li, Bingrui; Yan, Junchi. Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late In Training. ICLR 2025.
Martens, James; Grosse, Roger. Optimizing Neural Networks with Kronecker-factored Approximate Curvature (K-FAC). arXiv:1503.05671 (2015).
Luk, Kevin; Grosse, Roger. A Coordinate-Free Construction of Scalable Natural Gradient. arXiv:1808.10340 (2018).
Bae, Juhan; Zhang, Guodong; Grosse, Roger. Eigenvalue Corrected Noisy Natural Gradient. arXiv:1811.12565 (2018).
Surianarayanan, Chellammal; Lawrence, John Jeyasekaran; Chelliah, Pethuru Raj; Prakash, Edmond; Hewage, Chaminda. A Survey on Optimization Techniques for Edge Artificial Intelligence (AI). Sensors, 2023.
Lee, Yu-Ang; Yi, Guan-Ting; Liu, Mei-Yi; Lu, Jui-Chao; Yang, Guan-Bo; Chen, Yun-Nung. Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions. arXiv:2506.08234 (2025).