Notes by Sarah Chieng | Reference Paper | Supplementary slides
Special thanks to Harinath Kamepalli for his help making these notes. If you have any questions, comments, or clarifications, reach out to me on X: @SarahChieng
This paper introduces weight streaming, a training execution flow that separates parameter storage from primary compute. This enables the efficient and scalable training of large models.
In weight streaming, model parameters are stored externally rather than in the compute units' memory. This decoupling allows cluster size to scale independently of model size.
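To make the decoupling concrete, here is a minimal sketch of the idea in Python. It assumes a plain dict (`param_store`) standing in for the external parameter storage and a hypothetical `stream_layer` helper; these names are illustrative, not the paper's actual API. The point is that only one layer's weights occupy compute memory at a time, so compute memory does not grow with model size.

```python
import numpy as np

rng = np.random.default_rng(0)

# External parameter store: weights are NOT resident in compute memory.
# (A dict stands in for the external storage service here.)
param_store = {
    f"layer_{i}": rng.standard_normal((64, 64)) * 0.02
    for i in range(4)
}

def stream_layer(name):
    """Fetch one layer's weights from external storage into compute memory."""
    return param_store[name]

def forward(x):
    # Only one layer's weights are held locally at any time, so the
    # compute unit's memory footprint is independent of model depth.
    for name in sorted(param_store):
        w = stream_layer(name)        # weights streamed in
        x = np.maximum(x @ w, 0.0)    # compute on activations (ReLU MLP layer)
        del w                         # weights discarded after use
    return x

activations = forward(rng.standard_normal((8, 64)))
print(activations.shape)  # (8, 64)
```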
Cerebras’s key advantage is its ability to scale, and how quickly it can do so.
For optional context, this section gives a high-level overview of model training.
Model training involves processing data in batches, with the following steps performed for each batch: