LLM Inference Optimization: Get More Tokens Per Dollar
Why Inference Optimization Matters
Training a model is a one-time cost. Inference is forever. A production LLM serving thousands of users daily can cost 10–100x more over its lifetime than the original training run. Optimising inference is one of the highest-leverage engineering investments you can make.
This guide covers the main levers: serving framework, quantisation, batching, and KV cache management.
Serving Framework Comparison
vLLM
vLLM (from UC Berkeley) is the current gold standard for high-throughput LLM serving. Its key innovation is **PagedAttention** — a KV cache management system inspired by virtual memory paging that dramatically reduces memory fragmentation and enables much larger batch sizes.
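The paging idea can be sketched with a toy block allocator. This is purely illustrative (the class, block size, and bookkeeping are inventions for this article, not vLLM's actual implementation): instead of reserving one contiguous max-length region per sequence, the cache is carved into fixed-size blocks that are handed out on demand and returned the moment a sequence finishes.

```python
# Toy sketch of PagedAttention-style KV block allocation (not vLLM's real code).
# Memory is consumed only for tokens actually generated, so fragmentation from
# over-provisioned contiguous regions disappears.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks

    def append_token(self, seq_id: str, position: int) -> None:
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # need a fresh block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):  # a 40-token sequence occupies ceil(40/16) = 3 blocks
    cache.append_token("seq-a", pos)
print(len(cache.block_tables["seq-a"]))  # 3
print(len(cache.free_blocks))            # 5
```

Because blocks are freed as soon as a sequence completes, the reclaimed memory immediately admits new sequences into the batch, which is what enables the larger batch sizes mentioned above.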
TGI (Text Generation Inference)
HuggingFace's Text Generation Inference is a production-ready server with strong ecosystem integration. It supports continuous batching, Flash Attention, and tensor parallelism out of the box.
Ollama
Ollama prioritises ease of use over maximum throughput. It runs GGUF-quantised models efficiently on CPU+GPU hybrid setups.
Quantisation Strategies
Quantisation reduces model precision to shrink VRAM usage and speed up matrix multiplications.
| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |
**Recommendation:** Use AWQ or GPTQ for production; GGUF for CPU/hybrid deployments; FP8 on H100/H200 for maximum throughput.
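The table's savings follow from simple arithmetic on weight storage. A back-of-envelope sketch (the 4.5 bits-per-parameter figure is my approximation for 4-bit weights plus per-group quantisation metadata; note that total serving VRAM also includes KV cache and activations, which usually stay in 16-bit, so end-to-end savings are smaller than the raw 4x on weights):

```python
# Weight-memory estimate for a dense 70B-parameter model.
PARAMS = 70e9

def weight_gib(bits_per_param: float) -> float:
    """Bytes of weight storage, expressed in GiB."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)   # ~130 GiB: needs multiple GPUs
int4 = weight_gib(4.5)  # ~37 GiB: fits a single high-VRAM card
print(f"FP16 weights: {fp16:.0f} GiB, 4-bit weights: {int4:.0f} GiB")
```

This is why 4-bit quantisation is what makes a 70B model servable on a single consumer GPU at all.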
Batching Strategies
**Static batching** (naive): form a fixed batch and run it to completion; every sequence waits for the longest one to finish, and no new requests can join mid-flight. Catastrophic for throughput under variable-length workloads.
**Dynamic batching**: accumulate requests and process them together. Better, but requests of different lengths waste compute.
**Continuous batching** (vLLM/TGI): new requests are slotted into the batch as soon as a sequence finishes. Near-optimal GPU utilisation. This is what makes vLLM so efficient — never wait for the slowest sequence in a batch.
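The difference shows up even in a toy step-count model (the request lengths and batch capacity below are made up for illustration; each "step" decodes one token for every active sequence):

```python
# Toy comparison of static vs continuous batching. Not a real scheduler.
lengths = [2, 10, 2, 10]  # output lengths of four queued requests
CAP = 2                   # batch capacity

def static_steps(lengths):
    # Batches run to completion: each batch costs max(lengths in batch) steps.
    return sum(max(lengths[i:i + CAP]) for i in range(0, len(lengths), CAP))

def continuous_steps(lengths):
    # A slot freed by a finished sequence is refilled from the queue at once.
    queue, slots, steps = list(lengths), [], 0
    while queue or slots:
        while queue and len(slots) < CAP:  # refill free slots immediately
            slots.append(queue.pop(0))
        steps += 1                          # one decode step for the whole batch
        slots = [s - 1 for s in slots if s > 1]
    return steps

print(static_steps(lengths))      # 20 steps
print(continuous_steps(lengths))  # 14 steps
```

Even in this tiny example, continuous batching cuts decode steps by 30%; in production, where request lengths vary wildly, the gap is what the throughput numbers above reflect.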
KV Cache Optimization
The KV cache stores key/value tensors for attention, growing linearly with sequence length and batch size. Poor KV cache management is the top cause of out-of-memory errors and throughput degradation.
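The growth is easy to quantify. Per token, the cache stores one key and one value vector per layer, sized by the number of KV heads and the head dimension. Plugging in Llama 3 70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store (kv_heads * head_dim) values per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_seq_gib = per_tok * 8192 / 2**30  # a full 8192-token context
print(per_tok)      # 327680 bytes, ~320 KiB per token
print(per_seq_gib)  # 2.5 GiB per sequence
```

At 2.5 GiB per full-length sequence, a few dozen concurrent long-context requests can consume more VRAM than the quantised weights themselves, which is why the settings below matter.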
Key settings in vLLM:

- `--gpu-memory-utilization`: the fraction of VRAM vLLM pre-allocates for weights plus KV cache (defaults to 0.9; raise it cautiously to leave headroom for activations).
- `--max-model-len`: caps the per-sequence context length, bounding worst-case KV cache growth.
- `--enable-prefix-caching`: reuses cached KV blocks across requests that share a prompt prefix, such as a common system prompt.
- `--kv-cache-dtype fp8`: on recent vLLM versions and supported GPUs, stores the cache in FP8 to roughly halve its footprint.
Cost per 1M Tokens: A Comparison
Running Llama 3 70B AWQ on various hardware:
| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |
At the 70B model size, a single RTX 5090 with AWQ quantisation rivals an H100 on cost-per-token — with much lower hourly spend.
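The $/1M-token column is derived directly from the other two: dollars per hour divided by millions of tokens generated per hour. A quick sketch reproducing the table's figures:

```python
def dollars_per_million_tokens(price_per_hour: float,
                               tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / (tokens_per_hour / 1e6)

# Figures from the table above
for gpu, price, tps in [("RTX 4090", 0.50, 400), ("RTX 5090", 0.80, 750),
                        ("H100 x1", 2.60, 2200), ("H100 x2", 5.20, 4000)]:
    print(f"{gpu}: ${dollars_per_million_tokens(price, tps):.2f}/1M tokens")
```

Note the x2 H100 row: doubling hourly cost yields less than double the tokens/sec, so cost per token actually rises with tensor parallelism. Scale out only when a single GPU cannot hold the model or meet latency targets.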
Practical Quickstart with vLLM
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--quantization awq \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--enable-prefix-caching
```
This single command gets you an OpenAI-compatible API endpoint with continuous batching, AWQ quantisation, and prefix caching enabled.
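Because the endpoint speaks the OpenAI chat completions protocol, any OpenAI-style client can talk to it. A minimal standard-library sketch (the base URL and port assume the server's defaults; the helper names here are my own, not part of vLLM):

```python
import json
import urllib.request

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Payload for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def query(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server from the command above running:
# query("Explain PagedAttention in one sentence.") returns the completion text.
```

Swapping the `base_url` is all it takes to move existing OpenAI-client code onto your own endpoint.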