Model ARCHitecture experiments
Research conducted under Prof. Kurt Keutzer at Berkeley Artificial Intelligence Research (BAIR).
Example setup
git clone https://github.com/bri25yu/march
cd march
conda env create --file environment.yml
conda activate march
deepspeed run.py
Experimental Setup
All of the following experiments use a constant data budget, a constant number of model parameters, and constant compute unless noted otherwise. The data budget is determined by the number of steps taken and the number of tokens per step, giving the total number of tokens seen over training. The number of model parameters is the count of trainable parameters in a model prior to training. Compute is approximated by the wall-clock time of the run. All experiments are run on a single node with 8 NVIDIA A5000 GPUs.
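For reference, a trainable-parameter count of this kind can be taken in PyTorch before training starts; a minimal sketch, where model stands in for any torch.nn.Module rather than a specific march class:

from torch import nn

def count_trainable_parameters(model: nn.Module) -> int:
    # Sum the element counts of every parameter that requires gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)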
We train models for 1000 steps, enough for the models to start learning and for their behavior and performance to become distinguishable from other models. Every step, the model sees 1M tokens, so every experiment sees 1000 steps * 1M tokens per step = 1B tokens. We use the Wikipedia dataset.
The baseline model has 220M parameters to match T5-Base, and by default every subsequent model matches this budget. Specifically, the baseline is an encoder-decoder architecture with absolute position embeddings, 12 layers each in the encoder and decoder (24 layers total), a model dimension of 768, a query-key-value dimension of 64 (equivalent to 12 attention heads), and a feedforward dimension of 768 * 4 = 3072.
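As an illustration (not the actual march model class), a comparable set of hyperparameters could be written down with Hugging Face's T5Config; note that T5 itself uses relative position biases rather than the absolute position embeddings described above:

from transformers import T5Config

# Roughly T5-Base-sized baseline (~220M parameters).
baseline_config = T5Config(
    num_layers=12,           # encoder layers
    num_decoder_layers=12,   # decoder layers (24 layers total)
    d_model=768,             # model dimension
    d_kv=64,                 # query-key-value dimension per head
    num_heads=12,            # 768 / 64 = 12 attention heads
    d_ff=3072,               # feedforward dimension = 4 * 768
)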
The models are optimized with AdamW, keeping 90% of the old gradient in the gradient exponential moving average (EMA) and 95% of the old Hessian approximation in the Hessian-approximation EMA (equivalently, 10% new gradient and 5% new Hessian approximation). We use a constant learning rate schedule with a learning rate of 1e-4.
The models are trained in BF16.
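In PyTorch terms, those EMA fractions correspond to AdamW betas of (0.9, 0.95); a minimal sketch of one training step under these settings, where model and batch are hypothetical stand-ins for the actual march training loop:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

# Constant learning rate schedule: no scheduler is attached.
# BF16 training via autocast on the forward/backward pass.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss  # batch: hypothetical tokenized seq2seq inputs
loss.backward()
optimizer.step()
optimizer.zero_grad()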
We follow the scaling law fitting approach of Kaplan et al., 2020 (https://arxiv.org/pdf/2001.08361.pdf).
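The fit itself is a power law, which is linear in log-log space; a minimal NumPy sketch, where the compute budgets and losses are hypothetical placeholder measurements, not results from this repository:

import numpy as np

# Hypothetical per-run measurements: compute budget and final loss.
compute_budgets = np.array([1e-3, 3e-3, 1e-2, 3e-2])
losses = np.array([5.1, 4.6, 4.2, 3.9])

# Fit L = a * C^(-alpha) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(compute_budgets), np.log(losses), 1)
alpha, a = -slope, np.exp(intercept)
print(f"L(C) ~= {a:.2f} * C^(-{alpha:.3f})")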
Results
Our re-implementation is comparable to the T5 baseline
We compare our re-implementation with the original implementation of Raffel et al., Oct 2019.
Gated Linear Units are better
This is a successful replication of Shazeer, Feb 2020.
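For context, a gated linear unit feedforward block replaces the single up-projection with the elementwise product of a gated branch and a linear branch; a minimal GeGLU sketch in PyTorch, shown as an illustration of the technique rather than the march implementation:

import torch
from torch import nn

class GEGLUFeedForward(nn.Module):
    # Feedforward block with a GELU-gated linear unit (GeGLU).
    def __init__(self, d_model: int = 768, d_ff: int = 3072) -> None:
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # linear projection
        self.wo = nn.Linear(d_ff, d_model, bias=False)     # down projection

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        gated = nn.functional.gelu(self.wi_0(hidden_states)) * self.wi_1(hidden_states)
        return self.wo(gated)

To keep the parameter budget fixed, the feedforward dimension is typically scaled down (the gated block carries three weight matrices instead of two), as in Shazeer, Feb 2020.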
More model dimension and fewer layers is better
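Holding the roughly 220M-parameter budget fixed while trading layers for model dimension can be sanity-checked with a back-of-the-envelope count; a sketch under simplifying assumptions (shared embeddings with a hypothetical 32K vocabulary, num_heads * d_kv = d_model, feedforward dimension of 4 * d_model, layer norms ignored):

def approx_encoder_decoder_params(d_model: int, num_layers_each: int, vocab_size: int = 32128) -> int:
    d_ff = 4 * d_model
    attention = 4 * d_model * d_model            # Q, K, V, O projections
    feedforward = 2 * d_model * d_ff             # up- and down-projection
    encoder_layer = attention + feedforward
    decoder_layer = 2 * attention + feedforward  # self-attention + cross-attention
    embeddings = vocab_size * d_model
    return num_layers_each * (encoder_layer + decoder_layer) + embeddings

print(approx_encoder_decoder_params(768, 12))  # baseline: ~223M parameters
print(approx_encoder_decoder_params(1024, 6))  # hypothetical wider, shallower variant: ~209M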
Working in a branch
First, modify the name parameter in the environment.yml file.
git checkout /my/branch/path
conda env create --file environment.yml --prefix /path/to/new/conda
conda activate /path/to/new/conda