PyTorch Distributed Training on Cloud GPUs: Complete Guide
Why Distributed Training?
When a single GPU is too slow or your model doesn't fit in one GPU's VRAM, distributed training is the answer. PyTorch's DistributedDataParallel (DDP) is the standard for 2026 — more efficient than DataParallel and battle-tested across thousands of production training runs.
Setting Up DDP: The Minimal Example
```python
# train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment,
    # so init_process_group can pick them up automatically
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

def train(epochs):
    setup()
    local_rank = int(os.environ["LOCAL_RANK"])

    model = YourModel().to(local_rank)   # your model class
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler gives each rank a disjoint shard of the dataset
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # important: reshuffles the shards each epoch
        for batch in dataloader:
            # training step
            pass

    cleanup()
```
Launching with torchrun
```bash
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# With custom arguments
torchrun --standalone --nproc_per_node=4 train.py --epochs 10 --batch_size 16 --lr 1e-4
```
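The custom flags in the second command have to be parsed inside `train.py` itself, since `torchrun` only consumes its own options. A minimal sketch with `argparse` (the flag names mirror the launch command above; the defaults are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Everything after "train.py" on the torchrun command line reaches argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--lr", type=float, default=1e-4)
    return parser.parse_args(argv)

args = parse_args(["--epochs", "10", "--batch_size", "16", "--lr", "1e-4"])
```

Every rank receives the same argument list, so all processes see identical hyperparameters.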
Multi-Node Configuration on RunPod
1. Launch two RunPod instances of the same GPU type
2. Note the **private IP** of each pod (shown in pod details); node 0's private IP is the `--master_addr` that every rank will use
3. On **node 0** (master):
```bash
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" --master_port=12355 train.py
```
4. On **node 1**:
```bash
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.0.0.1" --master_port=12355 train.py
```
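With this layout, torchrun assigns each process a global rank of `node_rank * nproc_per_node + local_rank`, numbering node by node. A quick sketch of the mapping (the function name is illustrative):

```python
def global_rank(node_rank, nproc_per_node, local_rank):
    # torchrun numbers ranks node by node: node 0 gets 0..3, node 1 gets 4..7
    return node_rank * nproc_per_node + local_rank

# 2 nodes x 4 GPUs per node -> world size 8
ranks_node0 = [global_rank(0, 4, g) for g in range(4)]  # [0, 1, 2, 3]
ranks_node1 = [global_rank(1, 4, g) for g in range(4)]  # [4, 5, 6, 7]
```

Rank 0 lives on node 0, which is why the master address must point at that node.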
Gradient Checkpointing
Reduces memory by ~60% at the cost of ~20% more compute time:
```python
from torch.utils.checkpoint import checkpoint_sequential

# For sequential models: recompute activations segment by segment during backward
output = checkpoint_sequential(model.layers, segments=4, input=x, use_reentrant=False)

# For HuggingFace models
model.gradient_checkpointing_enable()
```
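A back-of-envelope model of where a saving of that order comes from, under the simplifying assumption that every layer holds one equally sized activation (the function name is illustrative):

```python
def activations_stored(n_layers, segments=None):
    # Without checkpointing, every layer's activation stays live for backward.
    if segments is None:
        return n_layers
    # With checkpoint_sequential, only segment-boundary activations are kept;
    # backward recomputes one segment (n_layers // segments layers) at a time.
    return segments + n_layers // segments

saved = 1 - activations_stored(32, segments=4) / activations_stored(32)
# 12 of 32 activations live at peak -> roughly a 60% reduction
```

The extra compute comes from re-running each segment's forward pass during backward, which is where the ~20% overhead figure originates.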
Mixed Precision with torch.cuda.amp
```python
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/nan
    scaler.update()
```
Mixed precision typically gives a **1.5–2x speedup** with little to no accuracy loss. On A100/H100, prefer BF16 (`autocast(dtype=torch.bfloat16)`): it has FP32's dynamic range, so the `GradScaler` is unnecessary.
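Why FP16 needs a `GradScaler` at all: tiny gradient values underflow to zero in half precision. This can be demonstrated with nothing but the standard library, since `struct`'s `'e'` format is IEEE-754 half precision:

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                     # a plausibly tiny gradient
tiny = to_fp16(grad)            # underflows to 0.0 in FP16
scaled = to_fp16(grad * 2**16)  # scaled up, the value survives
```

Scaling the loss (and hence all gradients) by a large factor before `backward()` keeps small gradients representable; `scaler.step` divides the factor back out before the optimizer update.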
Debugging Distributed Training
**Hang on `init_process_group`?** Check that `--master_addr`/`--master_port` are reachable from every node (firewall and security-group rules) and that all ranks were launched with the same `--nnodes` and `--nproc_per_node`; a single missing rank stalls the whole group.
**Loss diverges vs single GPU?** The effective batch size is the per-GPU batch size times the world size, so the learning rate usually needs rescaling; also confirm every rank uses a `DistributedSampler` so ranks don't train on duplicate data.
**NCCL errors?** Turn on NCCL logging, and disable InfiniBand transport if the fabric doesn't have it:
```bash
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1 # if InfiniBand not available
```
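The divergence case above usually traces back to batch-size arithmetic. A sketch of the linear learning-rate scaling heuristic (function names are illustrative, and the rule is a starting point to tune from, not a guarantee):

```python
def effective_batch_size(per_gpu_batch, world_size):
    # DDP averages gradients across ranks, so one optimizer step
    # effectively sees per_gpu_batch * world_size samples
    return per_gpu_batch * world_size

def linear_scaled_lr(base_lr, base_batch, per_gpu_batch, world_size):
    # Linear scaling rule: grow the LR in proportion to the batch size
    return base_lr * effective_batch_size(per_gpu_batch, world_size) / base_batch

# e.g. LR tuned at batch 32 on 1 GPU, now running 8 GPUs at batch 32 each:
lr = linear_scaled_lr(1e-4, base_batch=32, per_gpu_batch=32, world_size=8)
```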
Cost Optimization Tips
- Mixed precision's 1.5–2x speedup translates directly into fewer billed GPU-hours for the same training run.
- Gradient checkpointing can let a model fit on a cheaper, smaller-VRAM GPU instead of forcing a scale-out.
- Stay on a single node as long as possible: multi-node adds inter-node communication overhead on top of a second instance's cost.
Conclusion
DDP on cloud GPUs is production-ready and straightforward with `torchrun`. Start with single-node multi-GPU, add gradient checkpointing and mixed precision, and only scale to multi-node when necessary.