Benchmark for Tuning Accuracy and Efficiency

Overview

This benchmark collects our efforts to train different tasks with Colossal-AI and reach SOTA results. We care about both validation accuracy and training speed, and we prefer larger batch sizes so that more GPU devices can be put to use. For example, we trained a vision transformer with batch size 512 on CIFAR10 and 4096 on ImageNet1k, batch sizes that are rarely used in existing work. Some results obtained on 8x A100 are shown below.

Task	Model	Training Time	Top-1 Accuracy
CIFAR10	ViT-Lite-7/4	~ 16 min	~ 90.5%
ImageNet1k	ViT-S/16	~ 16.5 h	~ 74.5%

The train.py script in each task runs training with a configuration script from configs/, one for each parallelism. Supported parallelisms are data parallel only (file name ends with vanilla), 1D (ends with 1d), 2D (ends with 2d), 2.5D (ends with 2p5d), and 3D (ends with 3d).
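The naming convention above can be sketched as a small lookup. This is only an illustration: the helper name `tensor_mode_from_config` and the example file names are hypothetical, and the mode strings on the right-hand side are an assumption about what each config assigns to TENSOR_PARALLEL_MODE, so check them against the actual files in configs/.

```python
from pathlib import Path
from typing import Optional

# File-name suffix -> tensor-parallel mode.
# The suffixes are taken from the text above; the mode strings are an
# assumption about the values used in the configuration scripts.
SUFFIX_TO_MODE = {
    "vanilla": None,  # data parallel only
    "1d": "1d",
    "2d": "2d",
    "2p5d": "2.5d",
    "3d": "3d",
}


def tensor_mode_from_config(path: str) -> Optional[str]:
    """Infer the tensor-parallel mode from a config file name (hypothetical helper)."""
    stem = Path(path).stem
    for suffix, mode in SUFFIX_TO_MODE.items():
        if stem.endswith(suffix):
            return mode
    raise ValueError(f"unrecognized parallelism suffix in {path!r}")


if __name__ == "__main__":
    # Example file names are made up for illustration.
    for name in ("configs/example_vanilla.py", "configs/example_2p5d.py"):
        print(name, "->", tensor_mode_from_config(name))
```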

Each configuration script includes the following elements, taking the ImageNet1k task as an example:

from colossalai.amp import AMP_TYPE

TOTAL_BATCH_SIZE = 4096
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3

NUM_EPOCHS = 300
WARMUP_EPOCHS = 32

# data parallel only
TENSOR_PARALLEL_SIZE = 1
TENSOR_PARALLEL_MODE = None

# parallelism setting
parallel = dict(
    pipeline=1,
    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
)

fp16 = dict(mode=AMP_TYPE.TORCH)  # amp setting

gradient_accumulation = 2 # accumulate 2 steps for gradient update

BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation # actual batch size for dataloader

clip_grad_norm = 1.0 # clip gradient with norm 1.0

Upper-case variables are read by train.py itself, while lower-case variables are what Colossal-AI uses to initialize training.
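To make the batch-size arithmetic in the configuration concrete, here is a minimal sketch. The function `batch_sizes` is hypothetical and not part of train.py. It assumes that the data-parallel size equals the world size divided by the product of the tensor and pipeline sizes, and that BATCH_SIZE is split evenly across data-parallel replicas; whether train.py actually divides it this way is an assumption to verify against the script.

```python
def batch_sizes(total_batch_size: int,
                gradient_accumulation: int,
                world_size: int,
                tensor_size: int = 1,
                pipeline_size: int = 1) -> dict:
    """Derive per-step and per-replica batch sizes (illustrative helper)."""
    model_parallel_size = tensor_size * pipeline_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by tensor_size * pipeline_size")
    data_parallel_size = world_size // model_parallel_size

    if total_batch_size % gradient_accumulation != 0:
        raise ValueError("total_batch_size must be divisible by gradient_accumulation")
    # Mirrors BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation in the config.
    step_batch_size = total_batch_size // gradient_accumulation

    if step_batch_size % data_parallel_size != 0:
        raise ValueError("step batch size must be divisible by the data-parallel size")
    # Assumption: each data-parallel replica receives an equal share.
    per_replica_batch_size = step_batch_size // data_parallel_size

    return {
        "data_parallel_size": data_parallel_size,
        "step_batch_size": step_batch_size,
        "per_replica_batch_size": per_replica_batch_size,
    }


if __name__ == "__main__":
    # ImageNet1k values from the config above, on 8 GPUs with data parallel only.
    print(batch_sizes(total_batch_size=4096, gradient_accumulation=2, world_size=8))
```

With the ImageNet1k values on 8 GPUs, each optimizer update still covers 4096 samples, but every forward/backward pass handles 2048 samples in total.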

Usage

To start training, use the following command to run each worker:

$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
                                        --rank=RANK \
                                        --local_rank=LOCAL_RANK \
                                        --host=MASTER_IP_ADDRESS \
                                        --port=MASTER_PORT \
                                        --config=CONFIG_FILE
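Because the command above must be run once per worker, it can help to generate the per-worker command lines programmatically. The sketch below is a hypothetical helper, not part of the benchmark. It only covers the single-node case, where the assumption is that each worker's local rank equals its global rank, and it reuses only the flags shown above. The DATA environment variable still has to be set when the commands are executed.

```python
import shlex
from typing import List


def build_worker_commands(world_size: int,
                          host: str,
                          port: int,
                          config: str) -> List[List[str]]:
    """Build one train.py command per worker for a single node (illustrative)."""
    commands = []
    for rank in range(world_size):
        commands.append([
            "python", "train.py",
            f"--world_size={world_size}",
            f"--rank={rank}",
            # Single-node assumption: local rank is the same as global rank.
            f"--local_rank={rank}",
            f"--host={host}",
            f"--port={port}",
            f"--config={config}",
        ])
    return commands


if __name__ == "__main__":
    # The host, port, and config path below are placeholders.
    for cmd in build_worker_commands(2, "127.0.0.1", 29500, "configs/example_vanilla.py"):
        print(shlex.join(cmd))
```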

We also recommend starting training with torchrun:

$ DATA=/path/to/dataset torchrun --nproc_per_node=NUM_GPUS_PER_NODE \
                                 --nnodes=NUM_NODES \
                                 --node_rank=NODE_RANK \
                                 --master_addr=MASTER_IP_ADDRESS \
                                 --master_port=MASTER_PORT \
                                 train.py --config=CONFIG_FILE

ColossalAI-Benchmark - Performance benchmarking with ColossalAI


Owner

HPC-AI Tech
