Tutorial

PyTorch Distributed Training on Cloud GPUs: Complete Guide

10/3/2026
11 min read

Why Distributed Training?

When a single GPU is too slow or your model doesn't fit in one GPU's VRAM, distributed training is the answer. PyTorch's DistributedDataParallel (DDP) is the standard for 2026 — more efficient than DataParallel and battle-tested across thousands of production training runs.

Setting Up DDP: The Minimal Example

```python
# train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # YourModel, dataset, and epochs are placeholders defined elsewhere
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Important for shuffling!
        for batch in dataloader:
            # training step
            pass

    cleanup()
```
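
The snippet above defines train() but never calls it. One common way to drive it on a single machine, without torchrun, is torch.multiprocessing.spawn, which passes each process its rank as the first argument. A minimal sketch, assuming one process per visible GPU and the same port 12355 used later in this guide:

```python
import os
import torch.multiprocessing as mp

if __name__ == "__main__":
    # init_process_group with explicit rank/world_size still needs a rendezvous
    # address, so set one for the local run (values here are assumptions).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")

    world_size = torch.cuda.device_count()  # one process per visible GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```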

Launching with torchrun

```bash
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# With custom arguments
torchrun --standalone --nproc_per_node=4 train.py --epochs 10 --batch_size 16 --lr 1e-4
```
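
torchrun spawns the worker processes itself and exposes RANK, LOCAL_RANK, and WORLD_SIZE as environment variables, so a torchrun-launched script typically reads those instead of being handed a rank. A minimal sketch of the entry point for the single-node case (wiring these into train() this way is an assumption, not something torchrun dictates):

```python
import os

if __name__ == "__main__":
    # torchrun sets these for every process it launches
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this machine
    world_size = int(os.environ["WORLD_SIZE"])  # total processes across all nodes
    train(local_rank, world_size)               # on a single node, LOCAL_RANK == RANK
```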

Multi-Node Configuration on RunPod

1. Launch two RunPod instances of the same GPU type

2. Note the **private IP** of each pod (shown in pod details)

3. On **node 0** (master):

```bash
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" --master_port=12355 train.py
```

4. On **node 1**:

```bash
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.0.0.1" --master_port=12355 train.py
```
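
If node 1 cannot reach the master address, both commands will hang silently at startup. Once torchrun is running on node 0, a quick reachability check from node 1 can rule that out; a minimal sketch using only the standard library:

```python
import socket

# Run on node 1 after starting torchrun on node 0.
# Raises an exception if the rendezvous port is blocked or unreachable.
socket.create_connection(("10.0.0.1", 12355), timeout=5).close()
print("master port reachable")
```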

Gradient Checkpointing

Reduces memory by ~60% at the cost of ~20% more compute time:

```python
from torch.utils.checkpoint import checkpoint_sequential

# For sequential models
output = checkpoint_sequential(model.layers, segments=4, input=x)

# For HuggingFace models
model.gradient_checkpointing_enable()
```
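
If your model is not a plain sequential stack, you can get the same effect by wrapping individual blocks with torch.utils.checkpoint.checkpoint inside the forward pass. A minimal sketch, assuming a hypothetical model built from a list of sub-modules:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedModel(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)  # hypothetical list of sub-modules

    def forward(self, x):
        for block in self.blocks:
            # Activations inside each block are dropped after the forward pass
            # and recomputed during backward, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```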

Mixed Precision with torch.cuda.amp

```python
scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    with torch.cuda.amp.autocast():
        output = model(batch)
        loss = criterion(output, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Mixed precision typically gives **1.5–2x speedup** with no accuracy loss when using BF16 on A100/H100.
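
On A100/H100 a common variant is to autocast straight to BF16; because BF16 keeps FP32's exponent range, the GradScaler can usually be dropped. A minimal sketch of that variant, reusing the placeholder names from the loop above:

```python
for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(batch)
        loss = criterion(output, targets)
    loss.backward()   # no GradScaler: BF16 does not underflow the way FP16 does
    optimizer.step()
```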

Debugging Distributed Training

**Hang on init_process_group?**

  • Check firewall: port 12355 must be open between nodes
  • Verify all nodes can reach master_addr
**Loss diverges vs single GPU?**

  • Learning rate scales with world size: `lr = base_lr * world_size`
  • Or use gradient averaging: `loss = loss / world_size`

**NCCL errors?**

```bash
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # if InfiniBand not available
```

Cost Optimization Tips

  • **Profile before scaling**: Use `torch.profiler` to find out whether you are compute-bound or I/O-bound
  • **Gradient accumulation first**: Simulate large batches without multi-GPU (see the sketch after this list)
  • **Use `find_unused_parameters=False`** in DDP if all parameters are used; it saves communication overhead
  • **Pin memory in DataLoader**: `DataLoader(dataset, pin_memory=True, num_workers=4)`
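
A minimal gradient-accumulation sketch, reusing the placeholder names from earlier; accum_steps is a hypothetical setting you would tune:

```python
accum_steps = 4  # effective batch size = batch_size * accum_steps
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    output = model(batch)
    loss = criterion(output, targets) / accum_steps  # average over the virtual batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
# With DDP, wrapping the non-final micro-steps in model.no_sync() also skips
# the redundant gradient all-reduces.
```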
The Bottom Line

DDP on cloud GPUs is production-ready and straightforward with `torchrun`. Start with single-node multi-GPU, add gradient checkpointing and mixed precision, and only scale to multi-node when necessary.

Daniel Santos
Founder & ML Engineer

Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.

GPU Cloud · LLM Training · Cost Optimization · MLOps
