
LLM Inference Optimization: Get More Tokens Per Dollar

12.3.2026
9 min read


Why Inference Optimization Matters

Training a model is a one-time cost. Inference is forever. A production LLM serving thousands of users daily can cost 10–100x more over its lifetime than the original training run. Optimising inference is one of the highest-leverage engineering investments you can make.

This guide covers the main levers: serving framework, quantisation, batching, and KV cache management.

Serving Framework Comparison

vLLM

vLLM (from UC Berkeley) is the current gold standard for high-throughput LLM serving. Its key innovation is **PagedAttention** — a KV cache management system inspired by virtual memory paging that dramatically reduces memory fragmentation and enables much larger batch sizes.

  • Best for: High-throughput production inference, API servers
  • Throughput: Typically 2–4x higher than naive HuggingFace serving
  • VRAM efficiency: Excellent — near-zero KV cache waste
  • Supports: GPTQ, AWQ, FP8 quantisation natively
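The PagedAttention idea is easiest to see in miniature. Below is a toy allocator sketch (illustrative names only, not vLLM's internals): KV memory is handed out in fixed-size blocks as a sequence grows, instead of reserving a contiguous max-length slab up front, so finished sequences return whole blocks to the pool with no fragmentation.

```python
BLOCK_SIZE = 16  # tokens of KV stored per block (illustrative)

class PagedKVAllocator:
    """Toy paged KV allocator: memory grows block-by-block per sequence."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        # sequence finished: whole blocks go straight back to the pool
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4)
for _ in range(20):              # 20 tokens need ceil(20/16) = 2 blocks
    alloc.append_token("seq-a")
print(len(alloc.block_tables["seq-a"]))  # 2 — not a max-length reservation
```

Because allocation is per block rather than per maximum context length, the same VRAM holds far more concurrent sequences, which is what unlocks the larger batch sizes.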
TGI (Text Generation Inference)

HuggingFace's Text Generation Inference is a production-ready server with strong ecosystem integration. It supports continuous batching, Flash Attention, and tensor parallelism out of the box.

  • Best for: Teams already in the HuggingFace ecosystem
  • Throughput: Competitive with vLLM; sometimes better for streaming use cases
  • Docker-first: Easy deployment, strong Kubernetes support
Ollama

Ollama prioritises ease of use over maximum throughput. It runs GGUF-quantised models efficiently on CPU+GPU hybrid setups.

  • Best for: Local development, single-user inference, trying models quickly
  • Throughput: Lower than vLLM/TGI at scale
  • CPU fallback: Can run on CPU when GPU VRAM is insufficient
Quantisation Strategies

Quantisation reduces model precision to shrink VRAM usage and speed up matrix multiplications.

| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |

**Recommendation:** Use AWQ or GPTQ for production; GGUF for CPU/hybrid deployments; FP8 on H100/H200 for maximum throughput.
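To make the "minimal quality loss" claim concrete, here is a bare-bones sketch of symmetric 4-bit quantisation — a deliberately simple scheme, not GPTQ or AWQ, which add calibration-aware tricks on top of this idea:

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantisation: map floats to integers in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.53, 0.07, 0.91, -0.33]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)        # small integers, each storable in 4 bits
print(max_err)  # reconstruction error bounded by half a quantisation step
```

Each weight shrinks from 16 bits to 4 (roughly the ~50% total VRAM saving once scales and overhead are counted), and the worst-case error per weight is half the quantisation step — which is why perplexity barely moves for well-behaved weight distributions.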

Batching Strategies

**Static batching** (naive): process requests one at a time, or in fixed batches that must all finish before the next begins. Catastrophic for throughput.

**Dynamic batching**: accumulate requests and process them together. Better, but requests of different lengths waste compute — the whole batch waits for its longest sequence.

**Continuous batching** (vLLM/TGI): new requests are slotted into the batch as soon as a sequence finishes. Near-optimal GPU utilisation. This is what makes vLLM so efficient — it never waits for the slowest sequence in a batch.
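The gap between the two batching modes can be shown with a toy step-count simulation (illustrative only — no real scheduler works on whole-token "steps" like this):

```python
def decode_steps(lengths, max_batch, continuous):
    """Count decode steps to finish all sequences of the given lengths.

    continuous=False models dynamic batching: the batch is only refilled
    once every sequence in it has finished. continuous=True refills a
    freed slot immediately, as vLLM/TGI-style schedulers do.
    """
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        if continuous or not running:          # admit work when allowed
            while queue and len(running) < max_batch:
                running.append(queue.pop(0))
        steps += 1                             # one decode step for the batch
        running = [n - 1 for n in running if n > 1]
    return steps

jobs = [32, 4, 4, 4]                           # one long request, three short
print(decode_steps(jobs, max_batch=2, continuous=False))  # 36 steps
print(decode_steps(jobs, max_batch=2, continuous=True))   # 32 steps
```

With dynamic batching the long sequence strands a batch slot for 28 steps; continuous batching backfills that slot with the short requests, so the short jobs ride along for free.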

KV Cache Optimization

The KV cache stores key/value tensors for attention, growing linearly with sequence length and batch size. Poor KV cache management is the top cause of out-of-memory errors and throughput degradation.

Key settings in vLLM:

  • `--gpu-memory-utilization 0.90` — let vLLM use 90% of total VRAM; whatever the model weights don't need goes to the KV cache
  • `--max-model-len` — cap context length to free up cache space
  • `--enable-prefix-caching` — cache KV tensors for shared prefixes (a massive win for chatbots with long system prompts)
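The cache's linear growth is easy to estimate: per token it stores keys and values (×2) for every layer and KV head. A sketch using Llama 3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16):

```python
def kv_cache_bytes(seq_len, batch, layers, kv_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch

# Llama 3 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
per_token = kv_cache_bytes(1, 1, layers=80, kv_heads=8, head_dim=128)
print(per_token / 1024)    # 320.0 KiB of KV cache per token

full = kv_cache_bytes(8192, 16, layers=80, kv_heads=8, head_dim=128)
print(full / 2**30)        # 40.0 GiB for a batch of 16 at 8k context
```

Forty gigabytes of cache for a modest batch — before counting the weights — is why `--max-model-len` and prefix caching matter so much, and why GQA (8 KV heads instead of 64) was adopted in the first place.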
Cost per 1M Tokens: A Comparison

Running Llama 3 70B AWQ on various hardware:

| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |

At the 70B model size, a single RTX 5090 with AWQ quantisation rivals an H100 on cost-per-token — with much lower hourly spend.
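The $/1M-token column follows directly from the other two: hourly price divided by tokens generated per hour, scaled to a million. Reproducing the table's figures:

```python
def cost_per_million(price_per_hr, tokens_per_sec):
    """$/1M tokens = hourly price / (tokens per hour) * 1e6."""
    tokens_per_hr = tokens_per_sec * 3600
    return price_per_hr / tokens_per_hr * 1_000_000

for gpu, price, tps in [("RTX 4090", 0.50, 400),
                        ("RTX 5090", 0.80, 750),
                        ("H100 80GB", 2.60, 2200)]:
    print(f"{gpu}: ${cost_per_million(price, tps):.2f} per 1M tokens")
```

Run the same formula against any provider's price sheet and your own measured tokens/sec to compare hardware on the metric that actually matters.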

Practical Quickstart with vLLM

```bash
pip install vllm

# Note: --quantization awq expects an AWQ-quantised checkpoint;
# point --model at an AWQ build of the model, not the FP16 weights.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching
```

This single command gets you an OpenAI-compatible API endpoint with continuous batching, AWQ quantisation, and prefix caching enabled.
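Since the server speaks the OpenAI chat-completions protocol, any OpenAI-style client works against it. A minimal stdlib sketch (assuming vLLM's default port 8000 and no API key configured):

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-compatible /v1/chat/completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000",
                         "meta-llama/Meta-Llama-3-70B-Instruct",
                         "Explain PagedAttention in one sentence.")
# With the server running, send it and read the reply:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

The `model` field must match the name the server was launched with; everything else is standard chat-completions payload, so swapping in the official `openai` client later is a one-line change of base URL.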

