Guide

Understanding GPU Memory: VRAM Guide for AI

08.03.2026 · 14 min read


GPU memory (VRAM) is the most common bottleneck in AI workloads. Understanding how VRAM works, how much you need, and how to optimize usage is essential for every ML practitioner. This guide covers everything from basics to advanced memory management techniques.

What is VRAM?

VRAM (Video Random Access Memory) is the dedicated memory on a GPU. Unlike system RAM, VRAM sits on the GPU package (HBM) or directly beside the GPU on the board (GDDR), providing the massive bandwidth needed for parallel computation.

VRAM Types in Modern GPUs

| Type | Bandwidth | GPUs |
|------|-----------|------|
| HBM3e | 4,800 GB/s | H200 |
| HBM3 | 3,350 GB/s | H100 (SXM) |
| HBM2e | 2,039 GB/s | A100 80GB |
| GDDR6X | 717-1,008 GB/s | RTX 4080, RTX 4090 |
| GDDR6 | ~300 GB/s | L4 |

**Key insight:** HBM (High Bandwidth Memory) is 2-3x faster than GDDR6X. This is why A100 and H100 excel at memory-bandwidth-bound workloads like LLM inference.

How Much VRAM Do You Need?

For Training (Full Fine-Tuning, FP16)

A model requires approximately **2 bytes per parameter** in FP16. But training needs additional memory for gradients, optimizer states, and activations:

```
Total VRAM = Model weights + Gradients + Optimizer states + Activations

Model weights (FP16):   2 bytes x parameters
Gradients (FP16):       2 bytes x parameters
Optimizer (Adam, FP32): 8 bytes x parameters
Activations:            varies (often 2-4x model size)

Rule of thumb: ~16-20 bytes per parameter for full training
```

| Model | Parameters | Training VRAM (FP16 + Adam) |
|-------|------------|-----------------------------|
| GPT-2 | 1.5B | ~24 GB |
| LLaMA 3 8B | 8B | ~128 GB |
| LLaMA 2 13B | 13B | ~208 GB |
| LLaMA 3 70B | 70B | ~1,120 GB |
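To sanity-check the table, the rule of thumb can be turned into a tiny helper. This is only a rough sketch using ~16 bytes per parameter and decimal gigabytes; real usage also depends on sequence length, batch size, and implementation details:

```python
def full_finetune_vram_gb(params_billions: float,
                          bytes_per_param: float = 16.0) -> float:
    """Rough VRAM estimate for full FP16 fine-tuning with Adam.

    bytes_per_param = weights (2) + gradients (2) + Adam states (8)
                      + activation allowance (~4)  ->  ~16 bytes/param
    """
    return params_billions * 1e9 * bytes_per_param / 1e9  # decimal GB

for name, size_b in [("GPT-2", 1.5), ("LLaMA 3 8B", 8), ("LLaMA 3 70B", 70)]:
    print(f"{name}: ~{full_finetune_vram_gb(size_b):,.0f} GB")
# GPT-2: ~24 GB, LLaMA 3 8B: ~128 GB, LLaMA 3 70B: ~1,120 GB
```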

For QLoRA Fine-Tuning

QLoRA dramatically reduces memory by using 4-bit quantization plus low-rank adapters:

```
Total VRAM = Quantized model (4-bit) + LoRA adapters + Small optimizer

Quantized weights: ~0.5 bytes x parameters
LoRA adapters:     typically 0.1-1% of model parameters
```

| Model | Parameters | QLoRA VRAM |
|-------|------------|------------|
| LLaMA 3 8B | 8B | ~6 GB |
| LLaMA 2 13B | 13B | ~10 GB |
| LLaMA 3 70B | 70B | ~40 GB |
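The same back-of-the-envelope arithmetic applies to QLoRA. The sketch below is only an approximation: the 0.2% adapter fraction and the 2 GB overhead for activations and CUDA context are assumptions, not measured values:

```python
def qlora_vram_gb(params_billions: float,
                  lora_fraction: float = 0.002,    # assumed ~0.2% adapter params
                  overhead_gb: float = 2.0) -> float:  # assumed activations/context
    """Rough QLoRA VRAM estimate.

    - 4-bit (NF4) base weights: ~0.5 bytes per parameter
    - LoRA adapters plus their gradients/optimizer states: ~16 bytes each
    """
    base_gb = params_billions * 0.5                  # 0.5 bytes/param, decimal GB
    adapters_gb = params_billions * lora_fraction * 16
    return base_gb + adapters_gb + overhead_gb

print(f"{qlora_vram_gb(8):.0f} GB")   # ~6 GB for an 8B model
print(f"{qlora_vram_gb(70):.0f} GB")  # ~39 GB for a 70B model
```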

For Inference

Inference only needs model weights plus KV-cache for context:

| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|-------|-----------|-----------|-----------|
| LLaMA 3 8B | 16 GB | 8 GB | 4 GB |
| LLaMA 2 13B | 26 GB | 13 GB | 7 GB |
| LLaMA 3 70B | 140 GB | 70 GB | 35 GB |

**Plus KV-cache:** Add ~2-8 GB depending on context length and batch size.
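The KV-cache grows linearly with context length and batch size. As a concrete example, the sketch below plugs in LLaMA 3 8B's architecture (32 layers, 8 key-value heads with grouped-query attention, head dimension 128) and an FP16 cache:

```python
def kv_cache_gb(seq_len: int, batch_size: int,
                n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV-cache size: 2 tensors (K and V) per layer, stored for every token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * seq_len * batch_size / 1e9

print(f"{kv_cache_gb(8192, 1):.1f} GB")  # ~1.1 GB: 8K context, batch size 1
print(f"{kv_cache_gb(8192, 8):.1f} GB")  # ~8.6 GB: 8K context, batch size 8
```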

GPU VRAM Comparison (2026)

| GPU | VRAM | Best For |
|-----|------|----------|
| RTX 4080 | 16 GB | Small model inference, light training |
| RTX 4090 | 24 GB | QLoRA up to 13B, inference up to 30B (quantized) |
| L40S | 48 GB | QLoRA up to 70B, inference up to 70B (quantized) |
| A100 40GB | 40 GB | Training up to 7B, inference up to 34B |
| A100 80GB | 80 GB | Training up to 13B, inference up to 70B |
| H100 80GB | 80 GB | Training up to 13B (fastest), inference at scale |
| H200 141GB | 141 GB | Training 34B+, largest models |

VRAM Optimization Techniques

1. Gradient Checkpointing

Trade compute for memory by recomputing activations during backward pass:

```python
# Recompute activations during the backward pass instead of storing them
model.gradient_checkpointing_enable()

# Reduces activation memory by ~60-80%
# Increases training time by ~20%
```

2. Mixed Precision (FP16/BF16)

Cut memory in half by using 16-bit instead of 32-bit:

```python
from transformers import TrainingArguments

# Run the forward/backward pass in bfloat16 mixed precision
args = TrainingArguments(output_dir="out", bf16=True)
```

3. Gradient Accumulation

Keep the per-step batch (and its VRAM footprint) small while maintaining a large effective batch:

```python
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,   # small per-step VRAM footprint
    gradient_accumulation_steps=16,  # effective batch size = 2 x 16 = 32
)
```

4. Quantization (QLoRA)

Compress model weights to 4-bit:

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
```

5. CPU Offloading (DeepSpeed)

Move optimizer states (and, with ZeRO stage 3, parameters) to CPU RAM when GPU VRAM runs out:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}
```
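If you train with the Hugging Face Trainer, the config can be passed by file path. A minimal sketch, assuming the JSON above is saved as ds_zero3.json (the filename is just an example):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,
    deepspeed="ds_zero3.json",  # path to the ZeRO-3 offload config shown above
)
```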

Common VRAM Errors and Fixes

"CUDA out of memory"

1. Reduce batch size

2. Enable gradient checkpointing

3. Use mixed precision (BF16)

4. Switch to QLoRA

5. Use a GPU with more VRAM

"RuntimeError: cuDNN error"

Often an out-of-memory error in disguise, frequently due to VRAM fragmentation. Try:

1. Set `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512`

2. Restart training from the latest checkpoint
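The allocator option has to be in the environment before the CUDA context is created (or exported in the shell before launching). A minimal sketch that sets it and prints current usage:

```python
import os

# Must be set before CUDA is initialized; equivalently, in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

# How much VRAM is actually allocated vs. reserved by the caching allocator
print(f"allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
```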

The Bottom Line

VRAM is the key constraint for AI workloads. Understanding your VRAM requirements lets you choose the right GPU and avoid expensive mistakes. Use the tables in this guide to estimate your needs, and apply optimization techniques like QLoRA, gradient checkpointing, and mixed precision to stretch your VRAM further.

Find the right GPU for your VRAM needs →


Lucas Ferreira

Senior AI Engineer

Ex-NVIDIA, spent 3 years benchmarking data center GPUs. Now helps teams pick the right hardware for their ML workloads. Ran inference benchmarks on every GPU generation since Volta.

GPU Benchmarks · Inference Optimization · CUDA · Hardware
