A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

Last update: Dec 31, 2022

MADGRAD Optimization Method

pip install madgrad

Try it out! MADGRAD is a best-of-both-worlds optimizer: it has the generalization performance of SGD and converges at least as fast as Adam, often faster. A drop-in torch.optim implementation, madgrad.MADGRAD, is provided, along with a FairSeq-wrapped instance. For FairSeq, import madgrad anywhere in your project files and pass the --optimizer madgrad command line option, together with --weight-decay, --momentum, and optionally --madgrad_eps.

The madgrad.py file containing the optimizer can be dropped directly into any PyTorch project if you don't want to install via pip. If you are using FairSeq, you also need the accompanying fairseq_madgrad.py file.

Documentation is available at https://madgrad.readthedocs.io/en/latest/.

Things to note:

You may need to use a lower weight decay than you are accustomed to, often 0.
You should do a full learning rate sweep, as the optimal learning rate will differ from that of SGD or Adam. The best LR values we found were 2.5e-4 for a 152-layer PreActResNet on CIFAR10, 0.001 for ResNet-50 on ImageNet, 0.025 for IWSLT14 using transformer_iwslt_de_en, and 0.005 for RoBERTa training on BookWiki using BERT_BASE. On NLP models, gradient clipping also helped.
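One possible way to organize such a sweep is sketched below. The helper names (log_spaced, sweep, toy_train_and_eval) and the grid range of 1e-5 to 1e-1 are assumptions for illustration; the range is simply chosen to bracket the values reported above. The toy function stands in for a real training run and should be replaced by one that trains your model with MADGRAD at the given learning rate and returns a validation loss.

```python
import math


def log_spaced(low, high, num):
    """Return `num` learning rates spaced evenly on a log scale from low to high."""
    if num < 2:
        return [low]
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]


def sweep(train_and_eval, learning_rates):
    """Run one trial per learning rate and return (best_lr, results).

    `train_and_eval` is any callable mapping a learning rate to a metric
    where lower is better (for example, validation loss).
    """
    results = {lr: train_and_eval(lr) for lr in learning_rates}
    best_lr = min(results, key=results.get)
    return best_lr, results


def toy_train_and_eval(lr):
    """Placeholder objective, NOT a real training run.

    It is a made-up curve whose minimum sits at lr = 1e-3, used only to
    show how the sweep helpers fit together.
    """
    return (math.log10(lr) + 3.0) ** 2


# Nine points, half a decade apart, from 1e-5 to 1e-1 (assumed range).
grid = log_spaced(1e-5, 1e-1, 9)
best_lr, results = sweep(toy_train_and_eval, grid)
print(f"best lr on the toy objective: {best_lr:.1e}")
```

A log-spaced grid is used because useful learning rates typically span several orders of magnitude; once a promising region is found, a finer grid around it can narrow the choice.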

Tech Report

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.

@misc{defazio2021adaptivity,
      title={Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization}, 
      author={Aaron Defazio and Samy Jelassi},
      year={2021},
      eprint={2101.11075},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

MADGRAD is licensed under the MIT License.

Owner

Meta Research
