
How to Save 60% on GPU Cloud with Spot Instances

07/03/2026

If you are spending more than $100/month on GPU cloud, you are likely overpaying. Spot instances (also called interruptible or preemptible instances) offer the exact same GPU hardware at 40-60% less than on-demand prices. Here is everything you need to know to start saving immediately.

What Are Spot Instances?

Spot instances are unused GPU capacity that cloud providers sell at a discount. The trade-off is that your instance can be interrupted (taken away) with short notice when demand increases. In practice, interruption rates are much lower than you might expect.

Spot vs On-Demand Pricing (March 2026)

| GPU | On-Demand | Spot Price | Savings |
|-----|-----------|------------|---------|
| H100 80GB | $2.49/hr | $1.49/hr | **40%** |
| A100 80GB | $1.89/hr | $0.89/hr | **53%** |
| A100 40GB | $1.29/hr | $0.59/hr | **54%** |
| RTX 4090 | $0.44/hr | $0.19/hr | **57%** |
| RTX 4080 | $0.34/hr | $0.14/hr | **59%** |
| RTX 3090 | $0.29/hr | $0.12/hr | **59%** |
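
The Savings column is the spot discount expressed as a share of the on-demand price, rounded to a whole percent. A minimal sketch that reproduces it from the prices in the table (the `PRICES` dictionary and the `savings_percent` helper are illustrative names, not part of any provider's tooling):

```python
# Hourly prices in USD copied from the table: (on-demand, spot).
PRICES = {
    "H100 80GB": (2.49, 1.49),
    "A100 80GB": (1.89, 0.89),
    "A100 40GB": (1.29, 0.59),
    "RTX 4090": (0.44, 0.19),
    "RTX 4080": (0.34, 0.14),
    "RTX 3090": (0.29, 0.12),
}


def savings_percent(on_demand: float, spot: float) -> int:
    """Return the spot discount as a whole-number percentage of on-demand."""
    return round((on_demand - spot) / on_demand * 100)


for gpu, (on_demand, spot) in PRICES.items():
    print(f"{gpu}: {savings_percent(on_demand, spot)}%")
```

Swapping in current prices from your own provider gives a quick way to check whether a quoted discount still holds.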

Where to Find Spot Instances

1. RunPod Spot (Community Cloud)

  • Savings: 30-50% off on-demand
  • Interruption Notice: 5 seconds
  • Best For: Training jobs with checkpointing
  • How: Select "Spot" when creating a pod

2. Vast.ai Interruptible

  • Savings: 40-60% off on-demand
  • Interruption Notice: Varies by host
  • Best For: Batch processing, experiments
  • How: Filter by "interruptible" in search

3. Lambda Labs Spot

  • Savings: 25-40% off on-demand
  • Interruption Notice: 2 minutes
  • Best For: Enterprise workloads needing more notice
  • How: Available via API

4. AWS Spot Instances (p4d, p5)

  • Savings: 60-70% off on-demand
  • Interruption Notice: 2 minutes
  • Best For: Large-scale distributed training
  • How: Use Spot Instance requests in EC2
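
Whatever the notice period, the training process has to notice the interruption and react to it. One possible sketch, assuming the provider delivers the notice to your process as a `SIGTERM` signal (some providers expose it through a metadata endpoint instead, which would need polling). The `InterruptionFlag` class and `train_loop` function are hypothetical names used only for illustration:

```python
import signal


class InterruptionFlag:
    """Records that a termination signal has arrived."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Keep the handler tiny: just set a flag for the training loop to read.
        self.requested = True


def train_loop(total_steps, run_step, save_checkpoint, flag):
    """Run steps in order; on interruption, checkpoint once and stop.

    Returns the number of steps that completed.
    """
    for step in range(total_steps):
        if flag.requested:
            save_checkpoint(step)
            return step
        run_step(step)
    return total_steps


flag = InterruptionFlag()
# Assumption: the interruption notice arrives as SIGTERM.
signal.signal(signal.SIGTERM, flag.handle)
```

With a notice as short as five seconds, a large checkpoint may not finish writing in time, so a handler like this complements periodic checkpoints; it does not replace them.
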

How to Make Spot Instances Reliable

The key to using spot instances successfully is building fault tolerance into your workflow.

1. Checkpoint Your Training

Save model checkpoints every 15-30 minutes:

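```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=500,        # Save every 500 steps; tune so this lands every 15-30 minutes
    save_total_limit=3,    # Keep only the 3 most recent checkpoints
)

# Resuming is requested when training starts, not in TrainingArguments:
# trainer.train(resume_from_checkpoint=True) loads the latest checkpoint in output_dir.
```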
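
The 15-30 minute guidance is in wall-clock time, while `save_steps` counts optimizer steps, so the two need converting. A small sketch of that conversion, assuming you have measured the average seconds per step for your own job (the helper name `save_steps_for_interval` is illustrative):

```python
def save_steps_for_interval(seconds_per_step: float, interval_minutes: float) -> int:
    """Number of training steps that fit in one checkpoint interval (at least 1)."""
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    return max(1, int(interval_minutes * 60 / seconds_per_step))


# Example: a job averaging 2 seconds per step, checkpointing every 15 minutes.
print(save_steps_for_interval(seconds_per_step=2.0, interval_minutes=15))
```

The result can be passed as `save_steps`. Re-measure after changing batch size or GPU type, since step time changes with both.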

2. Use Persistent Storage

Always save checkpoints to persistent storage, not the instance disk:

  • RunPod: Use Network Volumes ($0.10/GB/month)
  • Vast.ai: Use external storage (S3, GCS)
  • AWS: Use EBS or S3
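
If the persistent storage is mounted as a directory (a network volume, for example), an interruption in the middle of a copy can leave a half-written checkpoint behind. A minimal sketch of one way to guard against that: copy under a temporary name, then rename. The `publish_checkpoint` helper and the `.partial` suffix are assumptions for illustration; object stores such as S3 need a different upload path.

```python
import os
import shutil
from pathlib import Path


def publish_checkpoint(local_dir: str, persistent_root: str) -> Path:
    """Copy a finished checkpoint directory to persistent storage.

    The copy goes to a temporary name first and is renamed at the end, so an
    interrupted copy never appears under the final checkpoint name.
    """
    src = Path(local_dir)
    root = Path(persistent_root)
    root.mkdir(parents=True, exist_ok=True)

    final = root / src.name
    tmp = root / (src.name + ".partial")

    if tmp.exists():
        shutil.rmtree(tmp)  # leftover from an earlier interrupted copy
    shutil.copytree(src, tmp)

    if final.exists():
        shutil.rmtree(final)  # replace an older copy of the same checkpoint
    os.replace(tmp, final)
    return final
```

A resume script can then ignore any directory ending in `.partial` and trust everything else.
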

3. Auto-Resume Scripts

Create a script that automatically resumes from the latest checkpoint:

```bash
#!/bin/bash

# Newest checkpoint directory by modification time (empty if none exist yet)
LATEST_CHECKPOINT=$(ls -td checkpoints/*/ 2>/dev/null | head -1)

if [ -n "$LATEST_CHECKPOINT" ]; then
    echo "Resuming from $LATEST_CHECKPOINT"
    python train.py --resume_from "$LATEST_CHECKPOINT"
else
    echo "Starting fresh training"
    python train.py
fi
```
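
The shell script picks the newest entry by modification time, which can mislead if files are copied or touched after the fact. A hedged alternative in Python selects by step number instead, assuming checkpoints are directories named `checkpoint-<step>` (the default naming used by the Hugging Face `Trainer`). The `latest_checkpoint` function is an illustrative helper, not a library API:

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint directories are named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(root: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None

    best: Optional[Path] = None
    best_step = -1
    for child in root_path.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best, best_step = child, int(match.group(1))
    return best
```

Comparing step numbers as integers avoids the trap where `checkpoint-90` sorts after `checkpoint-1000` as a string.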

4. Use Spot Instance Managers

Tools that automatically re-provision spot instances when interrupted:

  • SkyPilot: Open-source spot orchestrator
  • RunPod Serverless: Auto-scales across spot capacity
  • AWS Spot Fleet: Automatically replaces interrupted instances
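
At their core, these managers run a relaunch loop: start the job, and if it is interrupted, wait and try again. The sketch below shows only that idea and does not reflect how any of the listed tools is implemented. `launch` is a hypothetical callable that returns `True` when the job finished and `False` when it was interrupted:

```python
import time
from typing import Callable


def run_until_complete(
    launch: Callable[[], bool],
    max_attempts: int = 5,
    base_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Relaunch an interrupted job with exponential backoff between attempts.

    Returns True as soon as one attempt finishes, False if every attempt
    was interrupted.
    """
    for attempt in range(max_attempts):
        if launch():
            return True
        if attempt < max_attempts - 1:
            # Wait 30s, 60s, 120s, ... so capacity has time to return.
            sleep(base_delay_s * 2 ** attempt)
    return False
```

Combined with checkpointing and a resume step, each relaunch continues from the last saved state, so an interruption costs only the work done since that checkpoint.
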

Real Savings Calculator

Example 1: Fine-Tuning a 7B Model (48 hours on A100 80GB)

  • On-Demand: $1.89 x 48 = $90.72
  • Spot: $0.89 x 48 = $42.72
  • You Save: $48.00 (53%)

Example 2: Monthly Research Budget (160 hours on RTX 4090)

  • On-Demand: $0.44 x 160 = $70.40/mo
  • Spot: $0.19 x 160 = $30.40/mo
  • You Save: $40.00/mo (57%)

Example 3: Startup Running Training 24/7 (H100, monthly)

  • On-Demand: $2.49 x 720 = $1,792.80/mo
  • Spot: $1.49 x 720 = $1,072.80/mo
  • You Save: $720.00/mo (40%)
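
These examples assume no work is lost to interruptions. A rough sketch of how to fold that in, assuming each interruption loses half a checkpoint interval on average and ignoring the time spent waiting for a new instance. The interruption count below is a made-up input for illustration, not measured data:

```python
def job_cost(
    hourly_rate: float,
    compute_hours: float,
    interruptions: int = 0,
    checkpoint_interval_min: float = 0.0,
) -> float:
    """Cost of a job in USD, counting work redone after each interruption.

    Assumption: each interruption loses half a checkpoint interval on average.
    """
    redo_hours = interruptions * (checkpoint_interval_min / 60) / 2
    return hourly_rate * (compute_hours + redo_hours)


# Example 1 (48 hours on A100 80GB), with a hypothetical 6 interruptions
# and a checkpoint every 30 minutes.
on_demand = job_cost(1.89, 48)
spot = job_cost(0.89, 48, interruptions=6, checkpoint_interval_min=30)
print(f"On-demand: ${on_demand:.2f}  Spot with rework: ${spot:.2f}")
```

Under these assumptions the redone work adds only 1.5 billed hours, which is why frequent checkpoints keep most of the discount intact.
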

When NOT to Use Spot Instances

  • Production inference: Downtime means lost revenue
  • Real-time services: Cannot tolerate interruptions
  • Tight deadlines: Risk of losing progress near deadline
  • Stateful workloads: Databases, long-running servers

Best Practices Summary

  • Always checkpoint: Every 15-30 minutes minimum
  • Use persistent storage: Never rely on instance disk alone
  • Start with off-peak hours: Lower interruption rates at night/weekends
  • Mix spot and on-demand: Use spot for training, on-demand for inference
  • Monitor prices: Spot prices fluctuate; set alerts with BestGPUCloud
  • Have a fallback plan: Know your on-demand cost if spot is unavailable

The Bottom Line

Spot instances are one of the most effective ways to cut GPU cloud costs for work that can tolerate interruptions. With checkpointing and fault tolerance in place, the prices above work out to savings of roughly 40-60% on training jobs. Start by comparing spot prices across providers on BestGPUCloud.

Daniel Santos

Founder & ML Engineer

Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.

GPU Cloud, LLM Training, Cost Optimization, MLOps
