Concept of Data Parallelism#

  • Authors: Jinwon Kim

Data parallelism is a widely used technique for training deep learning models in parallel. The training data is distributed across multiple processing units, such as GPUs, each of which holds a full copy of the model parameters. The data is divided into subsets, each unit independently computes gradients on its subset, and the gradients are then aggregated (typically with an all-reduce) to update the model parameters. This approach parallelizes the training process efficiently and can accelerate training on large datasets.
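As a concrete illustration, here is a minimal data-parallel training sketch using PyTorch's `DistributedDataParallel`. The model, dataset, and hyperparameters are placeholders chosen for this example, and OSLO's own wrappers may differ from this plain PyTorch setup.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Launched with one process per GPU, e.g. `torchrun --nproc_per_node=<num_gpus> train.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Each process holds a full copy of the model parameters.
    model = torch.nn.Linear(128, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])

    # DistributedSampler assigns each process a disjoint subset of the data.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            # backward() triggers an all-reduce that averages gradients across processes.
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```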

OSLO supports the Zero Redundancy Optimizer (ZeRO) to easily scale deep learning models.
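To illustrate the idea behind ZeRO-style optimizer-state sharding, the sketch below uses PyTorch's built-in `ZeroRedundancyOptimizer` (ZeRO stage 1). This is an assumption-based example for illustration only; OSLO's own ZeRO API and configuration may differ.

```python
# Illustrative ZeRO-style optimizer-state sharding with PyTorch's
# ZeroRedundancyOptimizer (ZeRO stage 1). Not OSLO's API; for illustration only.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(128, 10).cuda(rank), device_ids=[rank])

# Each process stores only its shard of the optimizer states (e.g. Adam moments),
# reducing per-GPU memory compared to a fully replicated optimizer.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3,
)

x = torch.randn(32, 128).cuda(rank)
y = torch.randint(0, 10, (32,)).cuda(rank)
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()  # updates are computed on local shards and broadcast to all ranks
dist.destroy_process_group()
```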

Optimizer-Level Parallelism#

References#