How to Save 60% on GPU Cloud with Spot Instances
If you are spending more than $100/month on GPU cloud, you are likely overpaying. Spot instances (also called interruptible or preemptible instances) offer the exact same GPU hardware at 40-60% less than on-demand prices. Here is everything you need to know to start saving immediately.
What Are Spot Instances?
Spot instances are unused GPU capacity that cloud providers sell at a discount. The trade-off is that your instance can be interrupted (taken away) with short notice when demand increases. In practice, interruption rates are much lower than you might expect.
Spot vs On-Demand Pricing (March 2026)
| GPU | On-Demand | Spot Price | Savings |
|-----|----------|-----------|---------|
| H100 80GB | $2.49/hr | $1.49/hr | **40%** |
| A100 80GB | $1.89/hr | $0.89/hr | **53%** |
| A100 40GB | $1.29/hr | $0.59/hr | **54%** |
| RTX 4090 | $0.44/hr | $0.19/hr | **57%** |
| RTX 4080 | $0.34/hr | $0.14/hr | **59%** |
| RTX 3090 | $0.29/hr | $0.12/hr | **59%** |
Where to Find Spot Instances
1. RunPod Spot (Community Cloud)
2. Vast.ai Interruptible
3. Lambda Labs Spot
4. AWS Spot Instances (p4d, p5)
How to Make Spot Instances Reliable
The key to using spot instances successfully is building fault tolerance into your workflow.
1. Checkpoint Your Training
Save model checkpoints every 15-30 minutes:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=500,       # save a checkpoint every 500 steps
    save_total_limit=3,   # keep only the last 3 checkpoints
)

# Note: resume_from_checkpoint is an argument to Trainer.train(),
# not to TrainingArguments:
#   trainer = Trainer(model=model, args=training_args, ...)
#   trainer.train(resume_from_checkpoint=True)  # resumes from the latest checkpoint
```
2. Use Persistent Storage
Always save checkpoints to persistent storage (a network volume or object storage such as S3), never only to the instance's local disk, which usually disappears along with the instance when it is reclaimed.
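As a minimal sketch (the paths are hypothetical — point `dst` at wherever your provider mounts persistent storage, e.g. `/workspace` on RunPod), you can mirror each new checkpoint off the instance disk as soon as it is written:

```python
import shutil
from pathlib import Path

def sync_checkpoints(src: str, dst: str) -> list[str]:
    """Copy any checkpoint directories under src that are not yet in dst.

    Returns the names of the checkpoints that were copied.
    """
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for ckpt in sorted(src_dir.glob("checkpoint-*")):
        target = dst_dir / ckpt.name
        if not target.exists():
            shutil.copytree(ckpt, target)
            copied.append(ckpt.name)
    return copied

# Example: mirror local checkpoints to a mounted persistent volume
# sync_checkpoints("./checkpoints", "/workspace/checkpoints")
```

Run it after each save (or from a cron job); already-copied checkpoints are skipped, so repeated runs are cheap.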
3. Auto-Resume Scripts
Create a script that automatically resumes from the latest checkpoint:
```bash
#!/bin/bash
# Pick the most recently modified checkpoint, if any exist
LATEST_CHECKPOINT=$(ls -t checkpoints/ 2>/dev/null | head -n 1)

if [ -n "$LATEST_CHECKPOINT" ]; then
    echo "Resuming from $LATEST_CHECKPOINT"
    python train.py --resume_from "checkpoints/$LATEST_CHECKPOINT"
else
    echo "Starting fresh training"
    python train.py
fi
```
4. Use Spot Instance Managers
Tools such as SkyPilot can watch your job and automatically re-provision a fresh spot instance (resuming from your latest checkpoint) when the current one is reclaimed.
Real Savings Calculator
Example 1: Fine-Tuning a 7B Model (48 hours on A100 80GB)
- On-demand: 48 × $1.89 = $90.72
- Spot: 48 × $0.89 = $42.72 (**saves $48.00**)

Example 2: Monthly Research Budget (160 hours on RTX 4090)
- On-demand: 160 × $0.44 = $70.40
- Spot: 160 × $0.19 = $30.40 (**saves $40.00**)

Example 3: Startup Running Training 24/7 (H100, ~720 hours/month)
- On-demand: 720 × $2.49 = $1,792.80
- Spot: 720 × $1.49 = $1,072.80 (**saves $720.00**)
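The arithmetic above can be wrapped in a small helper for your own workloads (hourly rates taken from the comparison table; substitute current prices):

```python
def spot_savings(hours: float, on_demand: float, spot: float) -> dict:
    """Compute total cost at on-demand vs. spot hourly rates and the savings."""
    od_cost = hours * on_demand
    spot_cost = hours * spot
    return {
        "on_demand": round(od_cost, 2),
        "spot": round(spot_cost, 2),
        "saved": round(od_cost - spot_cost, 2),
        "saved_pct": round(100 * (1 - spot / on_demand), 1),
    }

# 48 hours of fine-tuning on an A100 80GB ($1.89 on-demand, $0.89 spot)
print(spot_savings(48, 1.89, 0.89))
```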
When NOT to Use Spot Instances
- Production inference with uptime or latency requirements: an interruption means downtime for your users.
- Jobs that cannot checkpoint: if losing progress means starting over, the savings can evaporate.
- Hard deadlines: if spot capacity dries up, you may end up waiting or paying on-demand rates anyway.
Best Practices Summary
- **Always checkpoint:** every 15-30 minutes minimum
- **Use persistent storage:** never rely on the instance disk alone
- **Start with off-peak hours:** interruption rates are lower at night and on weekends
- **Mix spot and on-demand:** use spot for training, on-demand for inference
- **Monitor prices:** spot prices fluctuate; set alerts with BestGPUCloud
- **Have a fallback plan:** know your on-demand cost if spot is unavailable
The Bottom Line
Spot instances are the single best way to reduce your GPU cloud costs. With proper checkpointing and fault tolerance, you can save 40-60% on every training job. Start by comparing spot prices across providers on BestGPUCloud.
Daniel Santos
Founder & ML Engineer
Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.