Tutorial

Multi-GPU Training: Setup Guide for Beginners

13/03/2026
14 min read


When a single GPU is not enough -- either because your model does not fit in memory or training takes too long -- you need multi-GPU training. This guide walks you through everything from basic concepts to a working multi-GPU setup in the cloud.

Why Multi-GPU Training?

There are two main reasons to use multiple GPUs:

Speed: Distribute the data across N GPUs to train up to N times faster

Memory: Split the model across GPUs when it does not fit on one

Types of Multi-GPU Parallelism

Data Parallelism (Most Common)

Each GPU gets a copy of the model and processes a different batch of data. Gradients are averaged across GPUs.

  • Best for: Models that fit on one GPU, but training is too slow
  • Scaling: Near-linear speedup up to 8 GPUs
  • Tools: PyTorch DDP, Hugging Face Accelerate
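
To make "gradients are averaged across GPUs" concrete, here is roughly what happens after each backward pass. This is a simplified sketch; real DDP overlaps this communication with the backward computation in buckets.

```python
import torch.distributed as dist

def average_gradients(model, world_size):
    # After loss.backward(), every GPU holds gradients for its own mini-batch.
    # Summing them across ranks and dividing by the number of GPUs yields the
    # same update as one large batch on a single GPU would.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```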
Model Parallelism (Tensor Parallelism)

The model itself is split across GPUs. Each GPU holds part of the model.

  • Best for: Models too large for one GPU (70B+ parameters)
  • Scaling: Communication overhead limits efficiency
  • Tools: Megatron-LM, DeepSpeed
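
To illustrate the idea, here is a hand-rolled sketch of a column-parallel linear layer, where each GPU stores only a slice of the weight matrix. This is a conceptual example only; Megatron-LM and DeepSpeed shard every layer this way and fuse the communication.

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, weight_shard, world_size):
    # weight_shard: this rank's [in_features, out_features // world_size] slice
    local_out = x @ weight_shard                          # partial output columns
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out)                  # collect slices from all GPUs
    return torch.cat(gathered, dim=-1)                    # full [batch, out_features] output
```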
Pipeline Parallelism

Different layers of the model run on different GPUs in a pipeline.

  • Best for: Very deep models with many layers
  • Tools: DeepSpeed, GPipe
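
A minimal sketch of the layer-placement idea, with the first half of a hypothetical model on GPU 0 and the second half on GPU 1. Without micro-batching this is only a naive model split; GPipe and DeepSpeed add the actual pipelining by slicing each batch into micro-batches so both GPUs stay busy.

```python
import torch.nn as nn

class TwoStageModel(nn.Module):
    # Hypothetical two-GPU split: stage1 lives on cuda:0, stage2 on cuda:1.
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))  # activations hop between GPUs
```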
Step-by-Step: Multi-GPU with PyTorch DDP

1. Set Up Your Cloud Instance

On RunPod, select a multi-GPU pod:

  • 2x RTX 4090 ($0.88/hr) for small models
  • 2x A100 80GB ($3.78/hr) for large models
  • 4x H100 ($9.96/hr) for 70B+ models
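
Once the pod is running, it is worth confirming that PyTorch actually sees every GPU before you start a training job:

```python
import torch

# Quick sanity check: the count should match the number of GPUs you rented.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```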
2. Basic DDP Training Script

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def setup(rank, world_size):
    # Single-node rendezvous: every process connects to localhost
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def train(rank, world_size):
    setup(rank, world_size)

    # Replace YourModel / YourDataset with your own classes
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    dataset = YourDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for batch in dataloader:
            batch = batch.to(rank)
            loss = model(batch)   # assumes the model returns a scalar loss
            loss.backward()       # DDP averages gradients across GPUs here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

3. Launch the Training Script

Because the script above spawns one worker per GPU itself (via `mp.spawn`), you can start it directly with `python train.py`. If you prefer torchrun, PyTorch's standard launcher, drop the `mp.spawn` call, read the rank from the `LOCAL_RANK` environment variable that torchrun sets, and launch with:

```bash
torchrun --nproc_per_node=2 train.py
```

The Easy Way: Hugging Face Accelerate

For Hugging Face models, Accelerate makes multi-GPU trivial:

```bash
pip install accelerate
accelerate config   # answer questions about your setup
accelerate launch train.py
```

In your training script, minimal changes are needed:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate wraps the model in DDP, shards the dataloader across GPUs,
# and moves everything to the right device for you.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
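
Checkpointing needs one more small change: only the main process should write to disk, and the DDP wrapper that `prepare` added has to be unwrapped first. A minimal sketch:

```python
# Save once, from the main process; the other ranks wait at the barrier.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    accelerator.save(unwrapped_model.state_dict(), "checkpoint.pt")
```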

DeepSpeed for Large Models

When your model does not fit on a single GPU, DeepSpeed's ZeRO stages help:

  • ZeRO Stage 1: Splits optimizer states across GPUs
  • ZeRO Stage 2: Splits optimizer states + gradients
  • ZeRO Stage 3: Splits optimizer states + gradients + parameters
For example, a ZeRO Stage 2 config with the optimizer states offloaded to CPU:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "allgather_partitions": true,
    "reduce_scatter": true
  },
  "fp16": { "enabled": true },
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 8
}
```
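
One way to wire this in (a sketch, assuming the config above is saved as ds_config.json and that `model` and `dataset` are your own): hand them to `deepspeed.initialize`, train with the returned engine, and start the script with the DeepSpeed launcher, for example `deepspeed --num_gpus=2 train.py`.

```python
import deepspeed

# `model` and `dataset` are assumed to be defined elsewhere;
# "ds_config.json" is the example config shown above.
model_engine, optimizer, dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config="ds_config.json",
)

for batch in dataloader:
    loss = model_engine(batch)      # forward pass
    model_engine.backward(loss)     # DeepSpeed handles loss scaling and sharded grads
    model_engine.step()             # optimizer step + zero_grad
```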

Cloud Provider Multi-GPU Options

| Provider | Max GPUs per Node | NVLink | Price (4x H100) |
|----------|-------------------|--------|-----------------|
| RunPod | 8x H100 | Yes | $9.96/hr |
| Lambda Labs | 8x H100 | Yes | $9.96/hr |
| Vast.ai | 8x A100 | Varies | ~$7.60/hr |
| AWS (p5.48xlarge) | 8x H100 | Yes | $31.12/hr |

Common Pitfalls

Not scaling the learning rate: Multiply the LR by the number of GPUs (the linear scaling rule)

Forgetting DistributedSampler: Data must be properly sharded across ranks

Saving checkpoints on all ranks: Only save from rank 0 (see the sketch below)

Not setting NCCL environment variables: Set `NCCL_P2P_LEVEL=NVL` when your GPUs are connected with NVLink
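
For the checkpoint pitfall, a minimal DDP-style sketch of saving from rank 0 only:

```python
import torch
import torch.distributed as dist

# Write the checkpoint from rank 0 only; model.module unwraps the DDP wrapper.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pt")
dist.barrier()  # keep all ranks in sync before training continues
```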

The Bottom Line

Multi-GPU training is essential for serious AI work. Start with **PyTorch DDP or Hugging Face Accelerate** for models that fit on one GPU but need speed. Use **DeepSpeed ZeRO** when your model exceeds single-GPU memory. Cloud providers like RunPod and Lambda make multi-GPU setups accessible to everyone.

Find multi-GPU cloud instances →


Marina Costa

Cloud Infrastructure Lead

Managed GPU clusters at three different cloud providers before joining BestGPUCloud. I know firsthand why provider X charges 30% more -- and whether it's worth it.

Cloud Infrastructure · Kubernetes · Multi-cloud · Cost Management
