Guide

Understanding GPU Memory: VRAM Guide for AI

08.03.2026 · 14 min read


GPU memory (VRAM) is the most common bottleneck in AI workloads. Understanding how VRAM works, how much you need, and how to optimize usage is essential for every ML practitioner. This guide covers everything from basics to advanced memory management techniques.

What is VRAM?

VRAM (Video Random Access Memory) is the dedicated memory on a GPU. Unlike system RAM, VRAM sits on the GPU package (HBM) or directly beside the GPU on the board (GDDR), providing the massive bandwidth needed for parallel computation.

VRAM Types in Modern GPUs

| Type | Bandwidth | GPUs |
|------|-----------|------|
| HBM3e | 4,800 GB/s | H200 |
| HBM3 | 3,350 GB/s | H100 (SXM) |
| HBM2e | 2,039 GB/s | A100 80GB |
| GDDR6X | 717-1,008 GB/s | RTX 4080, RTX 4090 |
| GDDR6 | ~300 GB/s | L4 |

**Key insight:** HBM (High Bandwidth Memory) is 2-3x faster than GDDR6X. This is why A100 and H100 excel at memory-bandwidth-bound workloads like LLM inference.

How Much VRAM Do You Need?

For Training (Full Fine-Tuning, FP16)

A model requires approximately **2 bytes per parameter** in FP16. But training needs additional memory for gradients, optimizer states, and activations:

```
Total VRAM = Model weights + Gradients + Optimizer states + Activations

Model weights (FP16):   2 bytes x parameters
Gradients (FP16):       2 bytes x parameters
Optimizer (Adam, FP32): 8 bytes x parameters
Activations:            varies (often 2-4x model size)

Rule of thumb: ~16-20 bytes per parameter for full training
```

| Model | Parameters | Training VRAM (FP16 + Adam) |
|-------|------------|-----------------------------|
| GPT-2 | 1.5B | ~24 GB |
| LLaMA 3 8B | 8B | ~128 GB |
| LLaMA 2 13B | 13B | ~208 GB |
| LLaMA 3 70B | 70B | ~1,120 GB |
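To sanity-check the table, the rule of thumb can be turned into a tiny helper. This is only a rough sketch using ~16 bytes per parameter and decimal gigabytes; real usage also depends on sequence length, batch size, and implementation details:

```python
def full_finetune_vram_gb(params_billions: float,
                          bytes_per_param: float = 16.0) -> float:
    """Rough VRAM estimate for full FP16 fine-tuning with Adam.

    bytes_per_param = weights (2) + gradients (2) + Adam states (8)
                      + activation allowance (~4)  ->  ~16 bytes/param
    """
    return params_billions * 1e9 * bytes_per_param / 1e9  # decimal GB

for name, size_b in [("GPT-2", 1.5), ("LLaMA 3 8B", 8), ("LLaMA 3 70B", 70)]:
    print(f"{name}: ~{full_finetune_vram_gb(size_b):,.0f} GB")
# GPT-2: ~24 GB, LLaMA 3 8B: ~128 GB, LLaMA 3 70B: ~1,120 GB
```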

For QLoRA Fine-Tuning

QLoRA dramatically reduces memory by using 4-bit quantization plus low-rank adapters:

```
Total VRAM = Quantized model (4-bit) + LoRA adapters + Small optimizer

Quantized weights: ~0.5 bytes x parameters
LoRA adapters:     typically 0.1-1% of model parameters
```

| Model | Parameters | QLoRA VRAM |
|-------|------------|------------|
| LLaMA 3 8B | 8B | ~6 GB |
| LLaMA 2 13B | 13B | ~10 GB |
| LLaMA 3 70B | 70B | ~40 GB |
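The same back-of-the-envelope arithmetic applies to QLoRA. The sketch below is only an approximation: the 0.2% adapter fraction and the 2 GB overhead for activations and CUDA context are assumptions, not measured values:

```python
def qlora_vram_gb(params_billions: float,
                  lora_fraction: float = 0.002,    # assumed ~0.2% adapter params
                  overhead_gb: float = 2.0) -> float:  # assumed activations/context
    """Rough QLoRA VRAM estimate.

    - 4-bit (NF4) base weights: ~0.5 bytes per parameter
    - LoRA adapters plus their gradients/optimizer states: ~16 bytes each
    """
    base_gb = params_billions * 0.5                  # 0.5 bytes/param, decimal GB
    adapters_gb = params_billions * lora_fraction * 16
    return base_gb + adapters_gb + overhead_gb

print(f"{qlora_vram_gb(8):.0f} GB")   # ~6 GB for an 8B model
print(f"{qlora_vram_gb(70):.0f} GB")  # ~39 GB for a 70B model
```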

For Inference

Inference only needs model weights plus KV-cache for context:

| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|-------|-----------|-----------|-----------|
| LLaMA 3 8B | 16 GB | 8 GB | 4 GB |
| LLaMA 2 13B | 26 GB | 13 GB | 7 GB |
| LLaMA 3 70B | 140 GB | 70 GB | 35 GB |

**Plus KV-cache:** Add ~2-8 GB depending on context length and batch size.
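The KV-cache grows linearly with context length and batch size. As a concrete example, the sketch below plugs in LLaMA 3 8B's architecture (32 layers, 8 key-value heads with grouped-query attention, head dimension 128) and an FP16 cache:

```python
def kv_cache_gb(seq_len: int, batch_size: int,
                n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV-cache size: 2 tensors (K and V) per layer, stored for every token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * seq_len * batch_size / 1e9

print(f"{kv_cache_gb(8192, 1):.1f} GB")  # ~1.1 GB: 8K context, batch size 1
print(f"{kv_cache_gb(8192, 8):.1f} GB")  # ~8.6 GB: 8K context, batch size 8
```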

GPU VRAM Comparison (2026)

| GPU | VRAM | Best For |
|-----|------|----------|
| RTX 4080 | 16 GB | Small model inference, light training |
| RTX 4090 | 24 GB | QLoRA up to 13B, inference up to 30B (quantized) |
| L40S | 48 GB | QLoRA up to 70B, inference up to 70B (quantized) |
| A100 40GB | 40 GB | Training up to 7B, inference up to 34B |
| A100 80GB | 80 GB | Training up to 13B, inference up to 70B |
| H100 80GB | 80 GB | Training up to 13B (fastest), inference at scale |
| H200 141GB | 141 GB | Training 34B+, largest models |

VRAM Optimization Techniques

1. Gradient Checkpointing

Trade compute for memory by recomputing activations during backward pass:

```python
# Recompute activations during the backward pass instead of storing them
model.gradient_checkpointing_enable()

# Reduces activation memory by ~60-80%
# Increases training time by ~20%
```

2. Mixed Precision (FP16/BF16)

Cut memory in half by using 16-bit instead of 32-bit:

```python
from transformers import TrainingArguments

# Run the forward/backward pass in bfloat16 mixed precision
args = TrainingArguments(output_dir="out", bf16=True)
```

3. Gradient Accumulation

Keep the per-step batch (and its VRAM footprint) small while maintaining a large effective batch:

```python
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,   # small per-step VRAM footprint
    gradient_accumulation_steps=16,  # effective batch size = 2 x 16 = 32
)
```

4. Quantization (QLoRA)

Compress model weights to 4-bit:

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
```

5. CPU Offloading (DeepSpeed)

Move optimizer states (and, with ZeRO stage 3, parameters) to CPU RAM when GPU VRAM runs out:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}
```
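If you train with the Hugging Face Trainer, the config can be passed by file path. A minimal sketch, assuming the JSON above is saved as ds_zero3.json (the filename is just an example):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,
    deepspeed="ds_zero3.json",  # path to the ZeRO-3 offload config shown above
)
```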

Common VRAM Errors and Fixes

"CUDA out of memory"

1. Reduce batch size

2. Enable gradient checkpointing

3. Use mixed precision (BF16)

4. Switch to QLoRA

5. Use a GPU with more VRAM

"RuntimeError: cuDNN error"

Often an out-of-memory error in disguise, frequently due to VRAM fragmentation. Try:

1. Set `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512`

2. Restart training from the latest checkpoint
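The allocator option has to be in the environment before the CUDA context is created (or exported in the shell before launching). A minimal sketch that sets it and prints current usage:

```python
import os

# Must be set before CUDA is initialized; equivalently, in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

# How much VRAM is actually allocated vs. reserved by the caching allocator
print(f"allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
```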

The Bottom Line

VRAM is the key constraint for AI workloads. Understanding your VRAM requirements lets you choose the right GPU and avoid expensive mistakes. Use the tables in this guide to estimate your needs, and apply optimization techniques like QLoRA, gradient checkpointing, and mixed precision to stretch your VRAM further.

Find the right GPU for your VRAM needs →


Lucas Ferreira

Senior AI Engineer

Ex-NVIDIA, spent 3 years benchmarking data center GPUs. Now helps teams pick the right hardware for their ML workloads. Ran inference benchmarks on every GPU generation since Volta.

GPU Benchmarks · Inference Optimization · CUDA · Hardware
