Tutorial

LLM Inference Optimization: Get More Tokens Per Dollar

12/3/2026
9 min read

Why Inference Optimization Matters

Training a model is a one-time cost. Inference is forever. A production LLM serving thousands of users daily can cost 10–100x more over its lifetime than the original training run. Optimising inference is one of the highest-leverage engineering investments you can make.

This guide covers the main levers: serving framework, quantisation, batching, and KV cache management.

Serving Framework Comparison

vLLM

vLLM (from UC Berkeley) is the current gold standard for high-throughput LLM serving. Its key innovation is **PagedAttention** — a KV cache management system inspired by virtual memory paging that dramatically reduces memory fragmentation and enables much larger batch sizes.

  • Best for: High-throughput production inference, API servers
  • Throughput: Typically 2–4x higher than naive HuggingFace serving
  • VRAM efficiency: Excellent — near-zero KV cache waste
  • Supports: GPTQ, AWQ, FP8 quantisation natively
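To see why paging helps, here is a toy block allocator, a simplified sketch and not vLLM's actual code: cache memory is carved into fixed-size blocks, each sequence holds a list of blocks instead of one contiguous region, and a finished sequence's blocks are instantly reusable by any other sequence, so nothing is lost to fragmentation.

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class PagedKVAllocator:
    """Toy paged KV-cache allocator (illustration only, not vLLM's code)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of block ids (the "page table")
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more generated token."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # last block is full: grab a fresh one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8)  # 8 * 16 = 128 tokens of capacity
for _ in range(40):                     # a 40-token sequence needs ceil(40/16) = 3 blocks
    alloc.append_token("seq-a")
print(len(alloc.tables["seq-a"]), len(alloc.free_blocks))  # 3 5
alloc.free_sequence("seq-a")
print(len(alloc.free_blocks))                              # 8
```

Contiguous allocation would have to reserve the worst-case context length per sequence up front; block-level allocation grows each sequence one block at a time, which is what lets vLLM pack far more sequences into the same VRAM.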
TGI (Text Generation Inference)

HuggingFace's Text Generation Inference is a production-ready server with strong ecosystem integration. It supports continuous batching, Flash Attention, and tensor parallelism out of the box.

  • Best for: Teams already in the HuggingFace ecosystem
  • Throughput: Competitive with vLLM; sometimes better for streaming use cases
  • Docker-first: Easy deployment, strong Kubernetes support
Ollama

Ollama prioritises ease of use over maximum throughput. It runs GGUF-quantised models efficiently on CPU+GPU hybrid setups.

  • Best for: Local development, single-user inference, trying models quickly
  • Throughput: Lower than vLLM/TGI at scale
  • CPU fallback: Can run on CPU when GPU VRAM is insufficient
Quantisation Strategies

Quantisation reduces model precision to shrink VRAM usage and speed up matrix multiplications.

| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |

**Recommendation:** Use AWQ or GPTQ for production; GGUF for CPU/hybrid deployments; FP8 on H100/H200 for maximum throughput.
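A quick back-of-envelope check on weight memory. This is a sketch: the group size is illustrative, and total VRAM also holds the KV cache and activations, which is why end-to-end savings are smaller than the raw 4x weight compression.

```python
def weight_gb(params_b, bits, group_size=128, scale_bits=16):
    """Approximate weight memory in GB for a quantised model.

    Group-wise quantisation (as in GPTQ/AWQ) stores one higher-precision
    scale per `group_size` weights; these are rough estimates, not exact
    on-disk sizes for any particular checkpoint.
    """
    per_weight = bits + scale_bits / group_size  # bits per weight incl. scales
    return params_b * 1e9 * per_weight / 8 / 1e9

fp16 = weight_gb(70, 16, scale_bits=0)  # baseline: no quantisation scales
int4 = weight_gb(70, 4)                 # 4-bit with per-group FP16 scales
print(f"FP16 weights: {fp16:.0f} GB")   # FP16 weights: 140 GB
print(f"4-bit weights: {int4:.0f} GB")  # 4-bit weights: 36 GB
```

The arithmetic makes the hardware implications concrete: a 70B model at FP16 cannot fit on any single GPU, while the 4-bit version fits on one 48 GB card with room left for KV cache.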

Batching Strategies

**Static batching** (naive): group requests into fixed batches and hold each batch on the GPU until every sequence in it finishes. Catastrophic for throughput when sequence lengths vary.

**Dynamic batching**: accumulate incoming requests and process them together. Better, but requests of different lengths still waste compute on padding.

**Continuous batching** (vLLM/TGI): new requests are slotted into the batch as soon as a sequence finishes. Near-optimal GPU utilisation. This is what makes vLLM so efficient — it never waits for the slowest sequence in a batch.
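The gap can be illustrated with a toy makespan calculation (idealised numbers, not a benchmark): count the decode steps the GPU spends under each policy for the same set of requests.

```python
import math

lengths = [32, 64, 120, 48, 96, 40, 128, 100]  # decode steps per request
slots = 4                                      # max concurrent sequences

# Static batching: fixed groups of `slots` requests; the whole batch
# occupies the GPU until its longest sequence finishes.
static_steps = sum(max(lengths[i:i + slots])
                   for i in range(0, len(lengths), slots))

# Continuous batching (idealised): a finished sequence's slot is refilled
# immediately, so the makespan is bounded below by the longest single
# sequence and by total work divided evenly across slots.
continuous_steps = max(max(lengths), math.ceil(sum(lengths) / slots))

print(static_steps, continuous_steps)  # 248 157
```

Even in this small idealised example continuous batching cuts wall-clock decode time by roughly 1.6x; the gap widens as length variance and request volume grow.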

KV Cache Optimization

The KV cache stores key/value tensors for attention, growing linearly with sequence length and batch size. Poor KV cache management is the top cause of out-of-memory errors and throughput degradation.

Key settings in vLLM:

  • `--gpu-memory-utilization 0.90` — let vLLM use 90% of total VRAM; whatever remains after the model weights are loaded becomes KV cache space
  • `--max-model-len` — cap context length to free up cache space
  • `--enable-prefix-caching` — cache KV tensors for shared system prompts (a massive win for chatbots with long system prompts)
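To size the cache, the per-token footprint follows directly from the model config. A sketch using Llama-3-70B-style numbers (80 layers, 8 KV heads under grouped-query attention, head dim 128, FP16 cache; treat these as assumptions when applying it to another model):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores one key and one value vector per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3-70B-style config: 80 layers, 8 KV heads (GQA), head dim 128, FP16.
per_tok = kv_bytes_per_token(80, 8, 128)
print(per_tok)  # 327680 bytes, ~0.31 MB per token

# A batch of 32 sequences at full 8192-token context:
batch, ctx = 32, 8192
print(f"{per_tok * batch * ctx / 1e9:.1f} GB")  # 85.9 GB
```

At that batch size the cache alone outgrows an H100's 80 GB before weights are even counted, which is exactly why `--max-model-len` and quantisation matter.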
Cost per 1M Tokens: A Comparison

Running Llama 3 70B AWQ on various hardware:

| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |

At the 70B model size, a single RTX 5090 with AWQ quantisation rivals an H100 on cost-per-token — with much lower hourly spend.
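The $/1M-token column is plain arithmetic, hourly rate divided by tokens generated per hour; reproducing a few rows of the table:

```python
def dollars_per_million_tokens(hourly_rate, tokens_per_sec):
    # Cost of one hour / tokens produced in that hour, scaled to 1M tokens.
    return hourly_rate / (tokens_per_sec * 3600) * 1e6

for gpu, rate, tps in [("RTX 4090", 0.50, 400),
                       ("RTX 5090", 0.80, 750),
                       ("H100 80GB", 2.60, 2200)]:
    print(f"{gpu}: ${dollars_per_million_tokens(rate, tps):.2f} / 1M tokens")
# RTX 4090: $0.35 / 1M tokens
# RTX 5090: $0.30 / 1M tokens
# H100 80GB: $0.33 / 1M tokens
```

The same two-line function lets you plug in your own measured throughput and a provider's current hourly rate to rank hardware for your workload.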

Practical Quickstart with vLLM

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching
```

This single command gets you an OpenAI-compatible API endpoint with continuous batching, AWQ quantisation, and prefix caching enabled.
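Because the server speaks the OpenAI chat-completions wire format, any HTTP client works. A sketch of the request body you would POST to `/v1/chat/completions` (message contents and sampling values here are placeholders):

```python
import json

# Request body for the OpenAI-compatible endpoint started above.
# Send it with urllib, requests, or the `openai` package pointed
# at the server's base URL instead of api.openai.com.
payload = {
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain continuous batching in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(len(body) > 0)  # True
```

Keeping the system prompt identical across requests is what lets `--enable-prefix-caching` reuse its KV tensors instead of recomputing them per request.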



Daniel Santos
Founder & ML Engineer

Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.

GPU Cloud · LLM Training · Cost Optimization · MLOps

