How to Save 60% on GPU Cloud with Spot Instances
If you are spending more than $100/month on GPU cloud, you are likely overpaying. Spot instances (also called interruptible or preemptible instances) offer the exact same GPU hardware at 40-60% less than on-demand prices. Here is everything you need to know to start saving immediately.
What Are Spot Instances?
Spot instances are unused GPU capacity that cloud providers sell at a discount. The trade-off is that your instance can be interrupted (taken away) with short notice when demand increases. In practice, interruption rates are much lower than you might expect.
Spot vs On-Demand Pricing (March 2026)
| GPU | On-Demand | Spot Price | Savings |
|-----|----------|-----------|---------|
| H100 80GB | $2.49/hr | $1.49/hr | **40%** |
| A100 80GB | $1.89/hr | $0.89/hr | **53%** |
| A100 40GB | $1.29/hr | $0.59/hr | **54%** |
| RTX 4090 | $0.44/hr | $0.19/hr | **57%** |
| RTX 4080 | $0.34/hr | $0.14/hr | **59%** |
| RTX 3090 | $0.29/hr | $0.12/hr | **59%** |
Where to Find Spot Instances
1. RunPod Spot (Community Cloud)
2. Vast.ai Interruptible
3. Lambda Labs Spot
4. AWS Spot Instances (p4d, p5)
How to Make Spot Instances Reliable
The key to using spot instances successfully is building fault tolerance into your workflow.
1. Checkpoint Your Training
Save model checkpoints every 15-30 minutes:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=500,       # save a checkpoint every 500 steps
    save_total_limit=3,   # keep only the last 3 checkpoints
)

# Note: resume_from_checkpoint is an argument to Trainer.train(),
# not to TrainingArguments:
#   trainer = Trainer(model=model, args=training_args, ...)
#   trainer.train(resume_from_checkpoint=True)  # resumes from the latest checkpoint
```
2. Use Persistent Storage
Always save checkpoints to persistent storage (a network volume or object storage such as S3), never only to the instance's local disk, which usually disappears along with the instance when it is reclaimed.
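As a minimal sketch (the paths are hypothetical — point `dst` at wherever your provider mounts persistent storage, e.g. `/workspace` on RunPod), you can mirror each new checkpoint off the instance disk as soon as it is written:

```python
import shutil
from pathlib import Path

def sync_checkpoints(src: str, dst: str) -> list[str]:
    """Copy any checkpoint directories under src that are not yet in dst.

    Returns the names of the checkpoints that were copied.
    """
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for ckpt in sorted(src_dir.glob("checkpoint-*")):
        target = dst_dir / ckpt.name
        if not target.exists():
            shutil.copytree(ckpt, target)
            copied.append(ckpt.name)
    return copied

# Example: mirror local checkpoints to a mounted persistent volume
# sync_checkpoints("./checkpoints", "/workspace/checkpoints")
```

Run it after each save (or from a cron job); already-copied checkpoints are skipped, so repeated runs are cheap.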
3. Auto-Resume Scripts
Create a script that automatically resumes from the latest checkpoint:
```bash
#!/bin/bash
# Pick the most recently modified checkpoint, if any exist
LATEST_CHECKPOINT=$(ls -t checkpoints/ 2>/dev/null | head -n 1)

if [ -n "$LATEST_CHECKPOINT" ]; then
    echo "Resuming from $LATEST_CHECKPOINT"
    python train.py --resume_from "checkpoints/$LATEST_CHECKPOINT"
else
    echo "Starting fresh training"
    python train.py
fi
```
4. Use Spot Instance Managers
Tools such as SkyPilot can watch your job and automatically re-provision a fresh spot instance (resuming from your latest checkpoint) when the current one is reclaimed.
Real Savings Calculator
Example 1: Fine-Tuning a 7B Model (48 hours on A100 80GB)
- On-demand: 48 × $1.89 = $90.72
- Spot: 48 × $0.89 = $42.72 (**saves $48.00**)

Example 2: Monthly Research Budget (160 hours on RTX 4090)
- On-demand: 160 × $0.44 = $70.40
- Spot: 160 × $0.19 = $30.40 (**saves $40.00**)

Example 3: Startup Running Training 24/7 (H100, ~720 hours/month)
- On-demand: 720 × $2.49 = $1,792.80
- Spot: 720 × $1.49 = $1,072.80 (**saves $720.00**)
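The arithmetic above can be wrapped in a small helper for your own workloads (hourly rates taken from the comparison table; substitute current prices):

```python
def spot_savings(hours: float, on_demand: float, spot: float) -> dict:
    """Compute total cost at on-demand vs. spot hourly rates and the savings."""
    od_cost = hours * on_demand
    spot_cost = hours * spot
    return {
        "on_demand": round(od_cost, 2),
        "spot": round(spot_cost, 2),
        "saved": round(od_cost - spot_cost, 2),
        "saved_pct": round(100 * (1 - spot / on_demand), 1),
    }

# 48 hours of fine-tuning on an A100 80GB ($1.89 on-demand, $0.89 spot)
print(spot_savings(48, 1.89, 0.89))
```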
When NOT to Use Spot Instances
- Production inference with uptime or latency requirements: an interruption means downtime for your users.
- Jobs that cannot checkpoint: if losing progress means starting over, the savings can evaporate.
- Hard deadlines: if spot capacity dries up, you may end up waiting or paying on-demand rates anyway.
Best Practices Summary
- **Always checkpoint:** every 15-30 minutes minimum
- **Use persistent storage:** never rely on the instance disk alone
- **Start with off-peak hours:** interruption rates are lower at night and on weekends
- **Mix spot and on-demand:** use spot for training, on-demand for inference
- **Monitor prices:** spot prices fluctuate; set alerts with BestGPUCloud
- **Have a fallback plan:** know your on-demand cost if spot is unavailable
The Bottom Line
Spot instances are the single best way to reduce your GPU cloud costs. With proper checkpointing and fault tolerance, you can save 40-60% on every training job. Start by comparing spot prices across providers on BestGPUCloud.
Daniel Santos
Founder & ML Engineer
Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.