Tutorial

Multi-GPU Training: Setup Guide for Beginners

13/03/2026
14 min read


When a single GPU is not enough -- either because your model does not fit in memory or training takes too long -- you need multi-GPU training. This guide walks you through everything from basic concepts to a working multi-GPU setup in the cloud.

Why Multi-GPU Training?

There are two main reasons to use multiple GPUs:

Speed: Distribute the data across N GPUs to train up to N times faster

Memory: Split the model across GPUs when it does not fit on one

Types of Multi-GPU Parallelism

Data Parallelism (Most Common)

Each GPU gets a copy of the model and processes a different batch of data. Gradients are averaged across GPUs.

  • Best for: Models that fit on one GPU, but training is too slow
  • Scaling: Near-linear speedup up to 8 GPUs
  • Tools: PyTorch DDP, Hugging Face Accelerate
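
To make "gradients are averaged across GPUs" concrete, here is roughly what happens after each backward pass. This is a simplified sketch; real DDP overlaps this communication with the backward computation in buckets.

```python
import torch.distributed as dist

def average_gradients(model, world_size):
    # After loss.backward(), every GPU holds gradients for its own mini-batch.
    # Summing them across ranks and dividing by the number of GPUs yields the
    # same update as one large batch on a single GPU would.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```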
Model Parallelism (Tensor Parallelism)

The model itself is split across GPUs. Each GPU holds part of the model.

  • Best for: Models too large for one GPU (70B+ parameters)
  • Scaling: Communication overhead limits efficiency
  • Tools: Megatron-LM, DeepSpeed
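
To illustrate the idea, here is a hand-rolled sketch of a column-parallel linear layer, where each GPU stores only a slice of the weight matrix. This is a conceptual example only; Megatron-LM and DeepSpeed shard every layer this way and fuse the communication.

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, weight_shard, world_size):
    # weight_shard: this rank's [in_features, out_features // world_size] slice
    local_out = x @ weight_shard                          # partial output columns
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out)                  # collect slices from all GPUs
    return torch.cat(gathered, dim=-1)                    # full [batch, out_features] output
```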
Pipeline Parallelism

Different layers of the model run on different GPUs in a pipeline.

  • Best for: Very deep models with many layers
  • Tools: DeepSpeed, GPipe
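
A minimal sketch of the layer-placement idea, with the first half of a hypothetical model on GPU 0 and the second half on GPU 1. Without micro-batching this is only a naive model split; GPipe and DeepSpeed add the actual pipelining by slicing each batch into micro-batches so both GPUs stay busy.

```python
import torch.nn as nn

class TwoStageModel(nn.Module):
    # Hypothetical two-GPU split: stage1 lives on cuda:0, stage2 on cuda:1.
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))  # activations hop between GPUs
```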
Step-by-Step: Multi-GPU with PyTorch DDP

1. Set Up Your Cloud Instance

On RunPod, select a multi-GPU pod:

  • 2x RTX 4090 ($0.88/hr) for small models
  • 2x A100 80GB ($3.78/hr) for large models
  • 4x H100 ($9.96/hr) for 70B+ models
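
Once the pod is running, it is worth confirming that PyTorch actually sees every GPU before you start a training job:

```python
import torch

# Quick sanity check: the count should match the number of GPUs you rented.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```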
2. Basic DDP Training Script

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def setup(rank, world_size):
    # Single-node rendezvous: every process connects to localhost
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def train(rank, world_size):
    setup(rank, world_size)

    # Replace YourModel / YourDataset with your own classes
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    dataset = YourDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for batch in dataloader:
            batch = batch.to(rank)
            loss = model(batch)   # assumes the model returns a scalar loss
            loss.backward()       # DDP averages gradients across GPUs here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

3. Launch the Training Script

Because the script above spawns one worker per GPU itself (via `mp.spawn`), you can start it directly with `python train.py`. If you prefer torchrun, PyTorch's standard launcher, drop the `mp.spawn` call, read the rank from the `LOCAL_RANK` environment variable that torchrun sets, and launch with:

```bash
torchrun --nproc_per_node=2 train.py
```

The Easy Way: Hugging Face Accelerate

For Hugging Face models, Accelerate makes multi-GPU trivial:

```bash
pip install accelerate
accelerate config   # answer questions about your setup
accelerate launch train.py
```

In your training script, minimal changes are needed:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate wraps the model in DDP, shards the dataloader across GPUs,
# and moves everything to the right device for you.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
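
Checkpointing needs one more small change: only the main process should write to disk, and the DDP wrapper that `prepare` added has to be unwrapped first. A minimal sketch:

```python
# Save once, from the main process; the other ranks wait at the barrier.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    accelerator.save(unwrapped_model.state_dict(), "checkpoint.pt")
```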

DeepSpeed for Large Models

When your model does not fit on a single GPU, DeepSpeed's ZeRO stages help:

  • ZeRO Stage 1: Splits optimizer states across GPUs
  • ZeRO Stage 2: Splits optimizer states + gradients
  • ZeRO Stage 3: Splits optimizer states + gradients + parameters
For example, a ZeRO Stage 2 config with the optimizer states offloaded to CPU:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "allgather_partitions": true,
    "reduce_scatter": true
  },
  "fp16": { "enabled": true },
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 8
}
```
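
One way to wire this in (a sketch, assuming the config above is saved as ds_config.json and that `model` and `dataset` are your own): hand them to `deepspeed.initialize`, train with the returned engine, and start the script with the DeepSpeed launcher, for example `deepspeed --num_gpus=2 train.py`.

```python
import deepspeed

# `model` and `dataset` are assumed to be defined elsewhere;
# "ds_config.json" is the example config shown above.
model_engine, optimizer, dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config="ds_config.json",
)

for batch in dataloader:
    loss = model_engine(batch)      # forward pass
    model_engine.backward(loss)     # DeepSpeed handles loss scaling and sharded grads
    model_engine.step()             # optimizer step + zero_grad
```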

Cloud Provider Multi-GPU Options

| Provider | Max GPUs per Node | NVLink | Price (4x H100) |
|----------|-------------------|--------|-----------------|
| RunPod | 8x H100 | Yes | $9.96/hr |
| Lambda Labs | 8x H100 | Yes | $9.96/hr |
| Vast.ai | 8x A100 | Varies | ~$7.60/hr |
| AWS (p5.48xlarge) | 8x H100 | Yes | $31.12/hr |

Common Pitfalls

Not scaling the learning rate: Multiply the LR by the number of GPUs (the linear scaling rule)

Forgetting DistributedSampler: Data must be properly sharded across ranks

Saving checkpoints on all ranks: Only save from rank 0 (see the sketch below)

Not setting NCCL environment variables: Set `NCCL_P2P_LEVEL=NVL` when your GPUs are connected with NVLink
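
For the checkpoint pitfall, a minimal DDP-style sketch of saving from rank 0 only:

```python
import torch
import torch.distributed as dist

# Write the checkpoint from rank 0 only; model.module unwraps the DDP wrapper.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pt")
dist.barrier()  # keep all ranks in sync before training continues
```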

The Bottom Line

Multi-GPU training is essential for serious AI work. Start with **PyTorch DDP or Hugging Face Accelerate** for models that fit on one GPU but need speed. Use **DeepSpeed ZeRO** when your model exceeds single-GPU memory. Cloud providers like RunPod and Lambda make multi-GPU setups accessible to everyone.

Find multi-GPU cloud instances →


Marina Costa

Cloud Infrastructure Lead

Managed GPU clusters at three different cloud providers before joining BestGPUCloud. I know firsthand why provider X charges 30% more -- and whether it's worth it.

Cloud Infrastructure · Kubernetes · Multi-cloud · Cost Management
