Tutorial

PyTorch Distributed Training on Cloud GPUs: Complete Guide

10.03.2026
11 min read

Why Distributed Training?

When a single GPU is too slow or your model doesn't fit in one GPU's VRAM, distributed training is the answer. PyTorch's DistributedDataParallel (DDP) is the standard in 2026: unlike the older DataParallel, it runs one process per GPU and overlaps gradient communication with the backward pass, and it is battle-tested across thousands of production training runs.

Setting Up DDP: The Minimal Example

```python
# train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # important: reshuffles data across ranks each epoch
        for batch in dataloader:
            # training step
            pass

    cleanup()
```
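The `train(rank, world_size)` signature above needs those values from somewhere. When you launch with `torchrun`, each process receives them as environment variables; a minimal sketch of reading them (the `dist_env` helper is a hypothetical convenience, not a PyTorch API):

```python
import os

def dist_env():
    """Read the process layout that torchrun exports as environment variables.

    Falls back to single-process defaults when run outside torchrun.
    """
    rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

if __name__ == "__main__":
    rank, local_rank, world_size = dist_env()
    print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

With `torchrun`, `LOCAL_RANK` is the value to pass to `torch.cuda.set_device` and `device_ids`, since it indexes the GPUs visible on the current node.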

Launching with torchrun

```bash
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# With custom arguments
torchrun --standalone --nproc_per_node=4 train.py --epochs 10 --batch_size 16 --lr 1e-4
```
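`torchrun` forwards everything after the script name to the script itself, so the custom flags above can be consumed with ordinary argparse (the flag names here simply mirror the example command):

```python
import argparse

def parse_args(argv=None):
    # torchrun passes everything after train.py straight through to the script
    p = argparse.ArgumentParser()
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--lr", type=float, default=1e-4)
    return p.parse_args(argv)
```

In the training script, call `parse_args()` with no arguments to read `sys.argv`.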

Multi-Node Configuration on RunPod

1. Launch two RunPod instances of the same GPU type

2. Note the **private IP** of each pod (shown in pod details)

3. On **node 0** (master):

```bash

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" --master_port=12355 train.py

```

4. On **node 1**:

```bash

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.0.0.1" --master_port=12355 train.py

```

Gradient Checkpointing

Reduces memory by ~60% at the cost of ~20% more compute time:

```python
from torch.utils.checkpoint import checkpoint_sequential

# For sequential models
output = checkpoint_sequential(model.layers, segments=4, input=x)

# For HuggingFace models
model.gradient_checkpointing_enable()
```
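A self-contained toy run of `checkpoint_sequential` (the layer count and sizes are arbitrary, chosen small enough for CPU): intermediate activations inside each of the 4 segments are discarded in the forward pass and recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# 8 layers split into 4 checkpointed segments of 2 layers each
layers = torch.nn.Sequential(*[torch.nn.Linear(32, 32) for _ in range(8)])
x = torch.randn(4, 32, requires_grad=True)

out = checkpoint_sequential(layers, segments=4, input=x, use_reentrant=False)
out.sum().backward()  # gradients flow through the recomputed segments
```

Fewer segments save more memory but recompute longer stretches; `segments` is the knob behind the ~60% / ~20% trade-off quoted above.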

Mixed Precision with torch.cuda.amp

```python
scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    with torch.cuda.amp.autocast():
        output = model(batch)
        loss = criterion(output, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Mixed precision typically gives **1.5–2x speedup** with no accuracy loss when using BF16 on A100/H100.
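With BF16 you can also drop the GradScaler entirely: BF16 keeps FP32's exponent range, so gradients don't underflow the way they can in FP16. A minimal sketch using the device-agnostic `torch.autocast` spelling (model and tensor sizes are arbitrary; the CPU fallback is only so the snippet runs anywhere):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)

# BF16 autocast: no GradScaler, backward runs on the raw loss
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)  # matmul runs in bfloat16 inside the region

loss = out.float().sum()
loss.backward()
```

Parameters and gradients stay in FP32; only the autocast-eligible ops inside the region run in BF16.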

Debugging Distributed Training

**Hang on init_process_group?**

  • Check firewall: port 12355 must be open between nodes
  • Verify all nodes can reach master_addr

**Loss diverges vs single GPU?**

  • The effective batch size grows with world size, so scale the learning rate: `lr = base_lr * world_size`
  • Or use gradient averaging: `loss = loss / world_size`

**NCCL errors?**

```bash
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # if InfiniBand is not available
```
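Before re-running a hung job, it can save time to confirm the rendezvous port is actually reachable from each worker node; a small stdlib probe (the `master_reachable` helper is just an illustration, not part of PyTorch):

```python
import socket

def master_reachable(addr, port, timeout=3.0):
    """TCP probe: can this node open a connection to master_addr:master_port?"""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from each worker with the same `--master_addr` and `--master_port` you pass to torchrun; a `False` on any node points at a firewall or routing problem rather than a PyTorch one.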

Cost Optimization Tips

  • Profile before scaling: use `torch.profiler` to find out whether you are compute-bound or I/O-bound
  • Gradient accumulation first: simulate large batches without multi-GPU
  • Use `find_unused_parameters=False` in DDP if all parameters are used; it skips the unused-parameter search and saves communication overhead
  • Pin memory in DataLoader: `DataLoader(dataset, pin_memory=True, num_workers=4)`
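The gradient-accumulation tip can be sketched in a few lines; here with a toy model and synthetic batches (all sizes are arbitrary):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = accum_steps * micro-batch size

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # divide so accumulated grads match one big batch's mean
    if (step + 1) % accum_steps == 0:
        opt.step()       # one optimizer step per accum_steps micro-batches
        opt.zero_grad()
```

Under DDP, wrap the non-final micro-batches in `model.no_sync()` so gradients are only all-reduced on the step that actually calls `opt.step()`.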
Conclusion

DDP on cloud GPUs is production-ready and straightforward with `torchrun`. Start with single-node multi-GPU, add gradient checkpointing and mixed precision, and only scale to multi-node when necessary.

