LLM Inference Optimization: Get More Tokens Per Dollar
Why Inference Optimization Matters
Training a model is a one-time cost. Inference is forever. A production LLM serving thousands of users daily can cost 10–100x more over its lifetime than the original training run. Optimising inference is one of the highest-leverage engineering investments you can make.
This guide covers the main levers: serving framework, quantisation, batching, and KV cache management.
Serving Framework Comparison
vLLM
vLLM (from UC Berkeley) is the current gold standard for high-throughput LLM serving. Its key innovation is **PagedAttention** — a KV cache management system inspired by virtual memory paging that dramatically reduces memory fragmentation and enables much larger batch sizes.
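The paging idea can be sketched with a toy block allocator. This is purely illustrative (the class, block size, and bookkeeping are inventions for this article, not vLLM's actual implementation): instead of reserving one contiguous max-length region per sequence, the cache is carved into fixed-size blocks that are handed out on demand and returned the moment a sequence finishes.

```python
# Toy sketch of PagedAttention-style KV block allocation (not vLLM's real code).
# Memory is consumed only for tokens actually generated, so fragmentation from
# over-provisioned contiguous regions disappears.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks

    def append_token(self, seq_id: str, position: int) -> None:
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # need a fresh block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):  # a 40-token sequence occupies ceil(40/16) = 3 blocks
    cache.append_token("seq-a", pos)
print(len(cache.block_tables["seq-a"]))  # 3
print(len(cache.free_blocks))            # 5
```

Because blocks are freed as soon as a sequence completes, the reclaimed memory immediately admits new sequences into the batch, which is what enables the larger batch sizes mentioned above.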
TGI (Text Generation Inference)
HuggingFace's Text Generation Inference is a production-ready server with strong ecosystem integration. It supports continuous batching, Flash Attention, and tensor parallelism out of the box.
Ollama
Ollama prioritises ease of use over maximum throughput. It runs GGUF-quantised models efficiently on CPU+GPU hybrid setups.
Quantisation Strategies
Quantisation reduces model precision to shrink VRAM usage and speed up matrix multiplications.
| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |
**Recommendation:** Use AWQ or GPTQ for production; GGUF for CPU/hybrid deployments; FP8 on H100/H200 for maximum throughput.
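The table's savings follow from simple arithmetic on weight storage. A back-of-envelope sketch (the 4.5 bits-per-parameter figure is my approximation for 4-bit weights plus per-group quantisation metadata; note that total serving VRAM also includes KV cache and activations, which usually stay in 16-bit, so end-to-end savings are smaller than the raw 4x on weights):

```python
# Weight-memory estimate for a dense 70B-parameter model.
PARAMS = 70e9

def weight_gib(bits_per_param: float) -> float:
    """Bytes of weight storage, expressed in GiB."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)   # ~130 GiB: needs multiple GPUs
int4 = weight_gib(4.5)  # ~37 GiB: fits a single high-VRAM card
print(f"FP16 weights: {fp16:.0f} GiB, 4-bit weights: {int4:.0f} GiB")
```

This is why 4-bit quantisation is what makes a 70B model servable on a single consumer GPU at all.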
Batching Strategies
**Static batching** (naive): form a fixed batch and run it to completion; every sequence waits for the longest one to finish, and no new requests can join mid-flight. Catastrophic for throughput under variable-length workloads.
**Dynamic batching**: accumulate requests and process them together. Better, but requests of different lengths waste compute.
**Continuous batching** (vLLM/TGI): new requests are slotted into the batch as soon as a sequence finishes. Near-optimal GPU utilisation. This is what makes vLLM so efficient — never wait for the slowest sequence in a batch.
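The difference shows up even in a toy step-count model (the request lengths and batch capacity below are made up for illustration; each "step" decodes one token for every active sequence):

```python
# Toy comparison of static vs continuous batching. Not a real scheduler.
lengths = [2, 10, 2, 10]  # output lengths of four queued requests
CAP = 2                   # batch capacity

def static_steps(lengths):
    # Batches run to completion: each batch costs max(lengths in batch) steps.
    return sum(max(lengths[i:i + CAP]) for i in range(0, len(lengths), CAP))

def continuous_steps(lengths):
    # A slot freed by a finished sequence is refilled from the queue at once.
    queue, slots, steps = list(lengths), [], 0
    while queue or slots:
        while queue and len(slots) < CAP:  # refill free slots immediately
            slots.append(queue.pop(0))
        steps += 1                          # one decode step for the whole batch
        slots = [s - 1 for s in slots if s > 1]
    return steps

print(static_steps(lengths))      # 20 steps
print(continuous_steps(lengths))  # 14 steps
```

Even in this tiny example, continuous batching cuts decode steps by 30%; in production, where request lengths vary wildly, the gap is what the throughput numbers above reflect.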
KV Cache Optimization
The KV cache stores key/value tensors for attention, growing linearly with sequence length and batch size. Poor KV cache management is the top cause of out-of-memory errors and throughput degradation.
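The growth is easy to quantify. Per token, the cache stores one key and one value vector per layer, sized by the number of KV heads and the head dimension. Plugging in Llama 3 70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store (kv_heads * head_dim) values per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_seq_gib = per_tok * 8192 / 2**30  # a full 8192-token context
print(per_tok)      # 327680 bytes, ~320 KiB per token
print(per_seq_gib)  # 2.5 GiB per sequence
```

At 2.5 GiB per full-length sequence, a few dozen concurrent long-context requests can consume more VRAM than the quantised weights themselves, which is why the settings below matter.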
Key settings in vLLM:

- `--gpu-memory-utilization`: the fraction of VRAM vLLM pre-allocates for weights plus KV cache (defaults to 0.9; raise it cautiously to leave headroom for activations).
- `--max-model-len`: caps the per-sequence context length, bounding worst-case KV cache growth.
- `--enable-prefix-caching`: reuses cached KV blocks across requests that share a prompt prefix, such as a common system prompt.
- `--kv-cache-dtype fp8`: on recent vLLM versions and supported GPUs, stores the cache in FP8 to roughly halve its footprint.
Cost per 1M Tokens: A Comparison
Running Llama 3 70B AWQ on various hardware:
| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |
At the 70B model size, a single RTX 5090 with AWQ quantisation rivals an H100 on cost-per-token — with much lower hourly spend.
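The $/1M-token column is derived directly from the other two: dollars per hour divided by millions of tokens generated per hour. A quick sketch reproducing the table's figures:

```python
def dollars_per_million_tokens(price_per_hour: float,
                               tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / (tokens_per_hour / 1e6)

# Figures from the table above
for gpu, price, tps in [("RTX 4090", 0.50, 400), ("RTX 5090", 0.80, 750),
                        ("H100 x1", 2.60, 2200), ("H100 x2", 5.20, 4000)]:
    print(f"{gpu}: ${dollars_per_million_tokens(price, tps):.2f}/1M tokens")
```

Note the x2 H100 row: doubling hourly cost yields less than double the tokens/sec, so cost per token actually rises with tensor parallelism. Scale out only when a single GPU cannot hold the model or meet latency targets.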
Practical Quickstart with vLLM
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--quantization awq \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--enable-prefix-caching
```
This single command gets you an OpenAI-compatible API endpoint with continuous batching, AWQ quantisation, and prefix caching enabled.
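Because the endpoint speaks the OpenAI chat completions protocol, any OpenAI-style client can talk to it. A minimal standard-library sketch (the base URL and port assume the server's defaults; the helper names here are my own, not part of vLLM):

```python
import json
import urllib.request

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Payload for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }

def query(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server from the command above running:
# query("Explain PagedAttention in one sentence.") returns the completion text.
```

Swapping the `base_url` is all it takes to move existing OpenAI-client code onto your own endpoint.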