Multi-GPU Training: Setup Guide for Beginners
When a single GPU is not enough -- either because your model does not fit in memory or training takes too long -- you need multi-GPU training. This guide walks you through everything from basic concepts to a working multi-GPU setup in the cloud.
Why Multi-GPU Training?
There are two main reasons to use multiple GPUs:
Speed:: Distribute batches across GPUs for near-linear speedups (communication overhead means you rarely get a perfect N-times improvement)
Memory:: Split the model across GPUs when it does not fit on one
Types of Multi-GPU Parallelism
Data Parallelism (Most Common)
Each GPU gets a copy of the model and processes a different batch of data. Gradients are averaged across GPUs.
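Conceptually, the averaging step is an all-reduce over each parameter's gradient. The sketch below is illustration only -- DDP performs this for you, bucketed and overlapped with the backward pass:

```python
import torch.distributed as dist

def average_gradients(model):
    """Conceptual sketch of what DDP does automatically.
    Shown only to illustrate the idea, not how to implement it yourself."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum across ranks
            param.grad /= world_size                           # then average
```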
Model Parallelism (Tensor Parallelism)
The model itself is split across GPUs. Each GPU holds part of the model.
Pipeline Parallelism
Different layers of the model run on different GPUs in a pipeline.
Step-by-Step: Multi-GPU with PyTorch DDP
1. Set Up Your Cloud Instance
On RunPod, select a pod type with multiple GPUs when deploying.
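Once the pod is running, it is worth a quick sanity check that all GPUs are actually visible (assuming a standard CUDA/PyTorch image):

```bash
nvidia-smi  # should list every GPU in the pod
python -c "import torch; print(torch.cuda.device_count())"  # should match
```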
2. Basic DDP Training Script
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    # Required by init_process_group when launching via mp.spawn
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # NCCL is the standard backend for GPU-to-GPU communication
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)

    # YourModel and YourDataset are placeholders for your own code
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    dataset = YourDataset()
    # DistributedSampler gives each rank a disjoint shard of the data
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffles shards each epoch
        for batch in dataloader:
            batch = batch.to(rank)
            loss = model(batch)  # assumes the model returns a scalar loss
            loss.backward()      # DDP averages gradients across ranks here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
```
3. Launch with torchrun
The script above spawns its own worker processes, so it runs with plain `python train.py`. PyTorch's recommended launcher, torchrun, spawns one process per GPU for you instead:
```bash
torchrun --nproc_per_node=2 train.py
```
To use it, drop the `mp.spawn` block and read the rank from the environment variables torchrun sets, as in the sketch below.
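A minimal torchrun-compatible setup (a sketch; torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK to each worker process):

```python
import os

import torch
import torch.distributed as dist

def setup_for_torchrun():
    # RANK and WORLD_SIZE come from torchrun's environment, so no arguments needed
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank
```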
The Easy Way: Hugging Face Accelerate
For Hugging Face models -- and plain PyTorch training loops in general -- Accelerate makes multi-GPU training almost trivial:
```bash
pip install accelerate
accelerate config # Answer questions about your setup
accelerate launch train.py
```
In your training script, minimal changes are needed:
```python
from accelerate import Accelerator

accelerator = Accelerator()

# prepare() wraps the model for DDP, shards the dataloader across
# processes, and moves everything to the correct device
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
DeepSpeed for Large Models
When your model does not fit on a single GPU, DeepSpeed's ZeRO stages progressively shard training state across GPUs: stage 1 shards optimizer states, stage 2 adds gradients, and stage 3 also shards the model parameters themselves. A stage 2 config with CPU offload looks like this:
```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "allgather_partitions": true,
    "reduce_scatter": true
  },
  "fp16": { "enabled": true },
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 8
}
```
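Launching depends on your training script; for a script that accepts a config path the way Hugging Face Trainer's `--deepspeed` flag does, it looks like this (the filename `ds_config.json` is just an example):

```bash
deepspeed --num_gpus=2 train.py --deepspeed ds_config.json
```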
Cloud Provider Multi-GPU Options
| Provider | Max GPUs per Node | NVLink | Price (4x H100) |
|----------|-------------------|--------|------------------|
| RunPod | 8x H100 | Yes | $9.96/hr |
| Lambda Labs | 8x H100 | Yes | $9.96/hr |
| Vast.ai | 8x A100 | Varies | ~$7.60/hr |
| AWS (p5.48xlarge) | 8x H100 | Yes | $31.12/hr |
Common Pitfalls
Not scaling learning rate:: With N GPUs your effective batch size is N times larger; a common heuristic is to scale the LR up accordingly (often linearly, paired with warmup)
Forgetting DistributedSampler:: Without it, every GPU trains on the same data instead of a disjoint shard
Saving checkpoints on all ranks:: Only save from rank 0, or ranks will race to write the same file (see the snippet below)
Not setting NCCL environment variables:: For NVLink-connected GPUs, set `NCCL_P2P_LEVEL=NVL` to restrict peer-to-peer traffic to NVLink pairs
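For the checkpoint pitfall, the usual pattern looks like this (a sketch; `checkpoint.pt` is a placeholder path):

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, path="checkpoint.pt"):
    # Only rank 0 writes; model.module unwraps the DDP wrapper
    if dist.get_rank() == 0:
        torch.save(model.module.state_dict(), path)
    dist.barrier()  # other ranks wait until the file exists
```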
The Bottom Line
Multi-GPU training is essential for serious AI work. Start with **PyTorch DDP or Hugging Face Accelerate** for models that fit on one GPU but need speed. Use **DeepSpeed ZeRO** when your model exceeds single-GPU memory. Cloud providers like RunPod and Lambda make multi-GPU setups accessible to everyone.
Marina Costa
Cloud Infrastructure Lead
Managed GPU clusters at three different cloud providers before joining BestGPUCloud. I know firsthand why provider X charges 30% more — and whether it's worth it.