PyTorch Distributed Training on Cloud GPUs: Complete Guide
Why Distributed Training?
When a single GPU is too slow or your model doesn't fit in one GPU's VRAM, distributed training is the answer. PyTorch's DistributedDataParallel (DDP) is the standard for 2026 — more efficient than DataParallel and battle-tested across thousands of production training runs.
Setting Up DDP: The Minimal Example
```python
# train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment,
    # so init_process_group can pick them up automatically
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

def train(epochs):
    setup()
    local_rank = int(os.environ["LOCAL_RANK"])

    model = YourModel().to(local_rank)   # your model class
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler gives each rank a disjoint shard of the dataset
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # important: reshuffles the shards each epoch
        for batch in dataloader:
            # training step
            pass

    cleanup()
```
Launching with torchrun
```bash
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# With custom arguments
torchrun --standalone --nproc_per_node=4 train.py --epochs 10 --batch_size 16 --lr 1e-4
```
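The custom flags in the second command have to be parsed inside `train.py` itself, since `torchrun` only consumes its own options. A minimal sketch with `argparse` (the flag names mirror the launch command above; the defaults are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Everything after "train.py" on the torchrun command line reaches argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--lr", type=float, default=1e-4)
    return parser.parse_args(argv)

args = parse_args(["--epochs", "10", "--batch_size", "16", "--lr", "1e-4"])
```

Every rank receives the same argument list, so all processes see identical hyperparameters.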
Multi-Node Configuration on RunPod
1. Launch two RunPod instances of the same GPU type
2. Note the **private IP** of each pod (shown in pod details); node 0's private IP is the `--master_addr` that every rank will use
3. On **node 0** (master):
```bash
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" --master_port=12355 train.py
```
4. On **node 1**:
```bash
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.0.0.1" --master_port=12355 train.py
```
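With this layout, torchrun assigns each process a global rank of `node_rank * nproc_per_node + local_rank`, numbering node by node. A quick sketch of the mapping (the function name is illustrative):

```python
def global_rank(node_rank, nproc_per_node, local_rank):
    # torchrun numbers ranks node by node: node 0 gets 0..3, node 1 gets 4..7
    return node_rank * nproc_per_node + local_rank

# 2 nodes x 4 GPUs per node -> world size 8
ranks_node0 = [global_rank(0, 4, g) for g in range(4)]  # [0, 1, 2, 3]
ranks_node1 = [global_rank(1, 4, g) for g in range(4)]  # [4, 5, 6, 7]
```

Rank 0 lives on node 0, which is why the master address must point at that node.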
Gradient Checkpointing
Reduces memory by ~60% at the cost of ~20% more compute time:
```python
from torch.utils.checkpoint import checkpoint_sequential

# For sequential models: recompute activations segment by segment during backward
output = checkpoint_sequential(model.layers, segments=4, input=x, use_reentrant=False)

# For HuggingFace models
model.gradient_checkpointing_enable()
```
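A back-of-envelope model of where a saving of that order comes from, under the simplifying assumption that every layer holds one equally sized activation (the function name is illustrative):

```python
def activations_stored(n_layers, segments=None):
    # Without checkpointing, every layer's activation stays live for backward.
    if segments is None:
        return n_layers
    # With checkpoint_sequential, only segment-boundary activations are kept;
    # backward recomputes one segment (n_layers // segments layers) at a time.
    return segments + n_layers // segments

saved = 1 - activations_stored(32, segments=4) / activations_stored(32)
# 12 of 32 activations live at peak -> roughly a 60% reduction
```

The extra compute comes from re-running each segment's forward pass during backward, which is where the ~20% overhead figure originates.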
Mixed Precision with torch.cuda.amp
```python
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/nan
    scaler.update()
```
Mixed precision typically gives a **1.5–2x speedup** with little to no accuracy loss. On A100/H100, prefer BF16 (`autocast(dtype=torch.bfloat16)`): it has FP32's dynamic range, so the `GradScaler` is unnecessary.
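Why FP16 needs a `GradScaler` at all: tiny gradient values underflow to zero in half precision. This can be demonstrated with nothing but the standard library, since `struct`'s `'e'` format is IEEE-754 half precision:

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                     # a plausibly tiny gradient
tiny = to_fp16(grad)            # underflows to 0.0 in FP16
scaled = to_fp16(grad * 2**16)  # scaled up, the value survives
```

Scaling the loss (and hence all gradients) by a large factor before `backward()` keeps small gradients representable; `scaler.step` divides the factor back out before the optimizer update.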
Debugging Distributed Training
**Hang on `init_process_group`?** Check that `--master_addr`/`--master_port` are reachable from every node (firewall and security-group rules) and that all ranks were launched with the same `--nnodes` and `--nproc_per_node`; a single missing rank stalls the whole group.
**Loss diverges vs single GPU?** The effective batch size is the per-GPU batch size times the world size, so the learning rate usually needs rescaling; also confirm every rank uses a `DistributedSampler` so ranks don't train on duplicate data.
**NCCL errors?** Turn on NCCL logging, and disable InfiniBand transport if the fabric doesn't have it:
```bash
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1 # if InfiniBand not available
```
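The divergence case above usually traces back to batch-size arithmetic. A sketch of the linear learning-rate scaling heuristic (function names are illustrative, and the rule is a starting point to tune from, not a guarantee):

```python
def effective_batch_size(per_gpu_batch, world_size):
    # DDP averages gradients across ranks, so one optimizer step
    # effectively sees per_gpu_batch * world_size samples
    return per_gpu_batch * world_size

def linear_scaled_lr(base_lr, base_batch, per_gpu_batch, world_size):
    # Linear scaling rule: grow the LR in proportion to the batch size
    return base_lr * effective_batch_size(per_gpu_batch, world_size) / base_batch

# e.g. LR tuned at batch 32 on 1 GPU, now running 8 GPUs at batch 32 each:
lr = linear_scaled_lr(1e-4, base_batch=32, per_gpu_batch=32, world_size=8)
```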
Cost Optimization Tips
- Mixed precision's 1.5–2x speedup translates directly into fewer billed GPU-hours for the same training run.
- Gradient checkpointing can let a model fit on a cheaper, smaller-VRAM GPU instead of forcing a scale-out.
- Stay on a single node as long as possible: multi-node adds inter-node communication overhead on top of a second instance's cost.
Conclusion
DDP on cloud GPUs is production-ready and straightforward with `torchrun`. Start with single-node multi-GPU, add gradient checkpointing and mixed precision, and only scale to multi-node when necessary.