Notes by Sarah Chieng | Reference Paper | Supplementary slides

Special thanks to Harinath Kamepalli for his help making these notes. If you have any questions, comments, or clarifications, reach out to me on X: @SarahChieng

This paper introduces weight streaming, a training execution flow that separates parameter storage from primary compute. This enables the efficient and scalable training of large models.

In weight streaming, model parameters are stored externally rather than in the compute units' memory. This decoupling allows cluster size to scale independently of model size.
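As a rough illustration of the idea, here is a minimal sketch assuming a hypothetical `ExternalWeightStore` that holds all parameters off the compute units and streams one layer's weights at a time. The class, function names, and layer shapes are illustrative assumptions, not Cerebras's API.

```python
import numpy as np

class ExternalWeightStore:
    """Hypothetical external parameter store (illustrative, not Cerebras's API)."""

    def __init__(self, layer_shapes, rng):
        # Parameters live in external memory, not on the compute fabric.
        self.weights = [rng.standard_normal(shape) * 0.02 for shape in layer_shapes]

    def stream_layer(self, i):
        # Stream only layer i's weights to the compute units when they are needed.
        return self.weights[i]

def forward_with_weight_streaming(x, store, num_layers):
    # The compute units hold one layer's weights at a time, so cluster size
    # can scale independently of total model size.
    for i in range(num_layers):
        w = store.stream_layer(i)   # fetch this layer's weights
        x = np.maximum(x @ w, 0.0)  # run the layer (ReLU MLP as a stand-in)
    return x

rng = np.random.default_rng(0)
shapes = [(16, 32), (32, 32), (32, 8)]
store = ExternalWeightStore(shapes, rng)
out = forward_with_weight_streaming(rng.standard_normal((4, 16)), store, len(shapes))
print(out.shape)  # (4, 8)
```

The point of the sketch is only the decoupling: the compute side never needs enough memory for the full parameter set, so the model can grow without changing the cluster, and the cluster can grow without changing the model.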

Cerebras’s key advantage is its ability to scale, and how quickly it can do so.

#1 Overview of NLP Model Training


For optional context, this section gives a high-level overview of model training.

Model training involves processing data in batches, with the following steps performed for each batch:

Step #1: Forward Propagation

  1. Feed the training data through each model layer sequentially.
  2. At each layer, compute activations, which represent the output of neurons after applying an activation function.
  3. The final layer produces the model's prediction.
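The toy NumPy sketch below walks through these three sub-steps for a small MLP. The ReLU and softmax choices are illustrative assumptions, not anything specified by the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(batch, layers):
    """layers: list of (weight, bias) pairs for a small MLP."""
    activations = []
    x = batch
    # Steps 1-2: feed the batch through each hidden layer and apply the activation function.
    for w, b in layers[:-1]:
        x = relu(x @ w + b)
        activations.append(x)
    # Step 3: the final layer produces the model's prediction (class probabilities here).
    w, b = layers[-1]
    prediction = softmax(x @ w + b)
    return activations, prediction

rng = np.random.default_rng(1)
layers = [(rng.standard_normal((16, 32)), np.zeros(32)),
          (rng.standard_normal((32, 10)), np.zeros(10))]
acts, pred = forward(rng.standard_normal((4, 16)), layers)
print(pred.shape)  # (4, 10) -- one probability distribution per example
```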

Step #2: Compute Loss

  1. Compare the model’s predictions to the true labels using a loss function, which measures prediction error.
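As an illustration, cross-entropy is one common loss function for classification; the small NumPy helper below (our example, not the paper's) averages the prediction error over a batch.

```python
import numpy as np

def cross_entropy(predictions, labels, eps=1e-9):
    # Average negative log-likelihood of the true class over the batch;
    # lower loss means the predictions are closer to the true labels.
    return -np.mean(np.sum(labels * np.log(predictions + eps), axis=-1))

# Example: two samples, three classes, one-hot true labels.
preds = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
print(cross_entropy(preds, labels))  # ~0.29
```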