Tutorial

PyTorch Distributed Training on Cloud GPUs: Complete Guide

10.03.2026
11 min read

Why Distributed Training?

When a single GPU is too slow or your model doesn't fit in one GPU's VRAM, distributed training is the answer. PyTorch's DistributedDataParallel (DDP) is the standard in 2026: unlike the older DataParallel, it runs one process per GPU and overlaps gradient communication with the backward pass, and it is battle-tested across thousands of production training runs.

Setting Up DDP: The Minimal Example

```python
# train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # important: reshuffles data across ranks each epoch
        for batch in dataloader:
            # training step
            pass

    cleanup()
```
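The `train(rank, world_size)` signature above needs those values from somewhere. When you launch with `torchrun`, each process receives them as environment variables; a minimal sketch of reading them (the `dist_env` helper is a hypothetical convenience, not a PyTorch API):

```python
import os

def dist_env():
    """Read the process layout that torchrun exports as environment variables.

    Falls back to single-process defaults when run outside torchrun.
    """
    rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

if __name__ == "__main__":
    rank, local_rank, world_size = dist_env()
    print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

With `torchrun`, `LOCAL_RANK` is the value to pass to `torch.cuda.set_device` and `device_ids`, since it indexes the GPUs visible on the current node.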

Launching with torchrun

```bash
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# With custom arguments
torchrun --standalone --nproc_per_node=4 train.py --epochs 10 --batch_size 16 --lr 1e-4
```
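`torchrun` forwards everything after the script name to the script itself, so the custom flags above can be consumed with ordinary argparse (the flag names here simply mirror the example command):

```python
import argparse

def parse_args(argv=None):
    # torchrun passes everything after train.py straight through to the script
    p = argparse.ArgumentParser()
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--lr", type=float, default=1e-4)
    return p.parse_args(argv)
```

In the training script, call `parse_args()` with no arguments to read `sys.argv`.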

Multi-Node Configuration on RunPod

1. Launch two RunPod instances of the same GPU type

2. Note the **private IP** of each pod (shown in pod details)

3. On **node 0** (master):

```bash

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" --master_port=12355 train.py

```

4. On **node 1**:

```bash

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.0.0.1" --master_port=12355 train.py

```

Gradient Checkpointing

Reduces memory by ~60% at the cost of ~20% more compute time:

```python
from torch.utils.checkpoint import checkpoint_sequential

# For sequential models
output = checkpoint_sequential(model.layers, segments=4, input=x)

# For HuggingFace models
model.gradient_checkpointing_enable()
```
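A self-contained toy run of `checkpoint_sequential` (the layer count and sizes are arbitrary, chosen small enough for CPU): intermediate activations inside each of the 4 segments are discarded in the forward pass and recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# 8 layers split into 4 checkpointed segments of 2 layers each
layers = torch.nn.Sequential(*[torch.nn.Linear(32, 32) for _ in range(8)])
x = torch.randn(4, 32, requires_grad=True)

out = checkpoint_sequential(layers, segments=4, input=x, use_reentrant=False)
out.sum().backward()  # gradients flow through the recomputed segments
```

Fewer segments save more memory but recompute longer stretches; `segments` is the knob behind the ~60% / ~20% trade-off quoted above.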

Mixed Precision with torch.cuda.amp

```python
scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    with torch.cuda.amp.autocast():
        output = model(batch)
        loss = criterion(output, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Mixed precision typically gives **1.5–2x speedup** with no accuracy loss when using BF16 on A100/H100.
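With BF16 you can also drop the GradScaler entirely: BF16 keeps FP32's exponent range, so gradients don't underflow the way they can in FP16. A minimal sketch using the device-agnostic `torch.autocast` spelling (model and tensor sizes are arbitrary; the CPU fallback is only so the snippet runs anywhere):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)

# BF16 autocast: no GradScaler, backward runs on the raw loss
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)  # matmul runs in bfloat16 inside the region

loss = out.float().sum()
loss.backward()
```

Parameters and gradients stay in FP32; only the autocast-eligible ops inside the region run in BF16.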

Debugging Distributed Training

**Hang on init_process_group?**

  • Check firewall: port 12355 must be open between nodes
  • Verify all nodes can reach master_addr

**Loss diverges vs single GPU?**

  • The effective batch size grows with world size, so scale the learning rate: `lr = base_lr * world_size`
  • Or use gradient averaging: `loss = loss / world_size`

**NCCL errors?**

```bash
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # if InfiniBand is not available
```
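Before re-running a hung job, it can save time to confirm the rendezvous port is actually reachable from each worker node; a small stdlib probe (the `master_reachable` helper is just an illustration, not part of PyTorch):

```python
import socket

def master_reachable(addr, port, timeout=3.0):
    """TCP probe: can this node open a connection to master_addr:master_port?"""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from each worker with the same `--master_addr` and `--master_port` you pass to torchrun; a `False` on any node points at a firewall or routing problem rather than a PyTorch one.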

Cost Optimization Tips

  • Profile before scaling: use `torch.profiler` to find out whether you are compute-bound or I/O-bound
  • Gradient accumulation first: simulate large batches without multi-GPU
  • Use `find_unused_parameters=False` in DDP if all parameters are used; it skips the unused-parameter search and saves communication overhead
  • Pin memory in DataLoader: `DataLoader(dataset, pin_memory=True, num_workers=4)`
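The gradient-accumulation tip can be sketched in a few lines; here with a toy model and synthetic batches (all sizes are arbitrary):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = accum_steps * micro-batch size

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # divide so accumulated grads match one big batch's mean
    if (step + 1) % accum_steps == 0:
        opt.step()       # one optimizer step per accum_steps micro-batches
        opt.zero_grad()
```

Under DDP, wrap the non-final micro-batches in `model.no_sync()` so gradients are only all-reduced on the step that actually calls `opt.step()`.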
Conclusion

DDP on cloud GPUs is production-ready and straightforward with `torchrun`. Start with single-node multi-GPU, add gradient checkpointing and mixed precision, and only scale to multi-node when necessary.

