I recently spent time understanding the inference engine landscape because I needed to pick one for a model-serving workload and realized I had no mental model for the tradeoffs. Everything I found was either a benchmark post comparing numbers without context, or a getting-started tutorial that never explained why that particular engine was the right fit. So I built my own reference. Here is what I learned about the main options and when each one earns its place.
The problem all of them solve
Running a model naively (load weights, call forward pass) wastes GPU memory and gives poor throughput under concurrency. The KV cache, which stores intermediate attention state for every active request, grows with sequence length and fragments VRAM. Inference engines optimize memory management, batching, and compute scheduling so you get more out of the same hardware.
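To make the KV cache growth concrete, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions: the model shape (32 layers, 32 KV heads, head dimension 128, fp16 values) is hypothetical and not tied to any specific model, and the function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Each token stores one key vector and one value vector (the factor of 2)
    per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape, fp16 (2 bytes per value).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)

print(per_token / 2**10, "KiB per token")
print(total / 2**30, "GiB for 16 requests at 4096 tokens each")
```

With these assumed numbers the cache costs 512 KiB per token and 32 GiB for 16 requests at 4096 tokens each, which is why cache management, not just weight size, decides how many requests fit on a GPU.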
vLLM
Probably the most widely used open-source option for serving transformer LLMs right now. It loads HuggingFace model weights as-is and applies its optimizations at runtime, so no pre-compilation step is required.
- Key innovation: PagedAttention. Manages the KV cache like OS virtual memory, with fixed-size blocks allocated on demand. This removes external fragmentation and limits waste to the last, partially filled block of each sequence. The vLLM authors report under 4% KV cache waste, whereas earlier serving systems used only about 20-40% of their KV cache memory for actual token state.
- Also gives you: continuous batching (new requests join mid-generation), an OpenAI-compatible API server, tensor parallelism across GPUs, and quantization support.
- Pick when: you need to serve a HuggingFace model to multiple concurrent users with good throughput, or you want a turnkey HTTP server around your model without writing serving code.
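The block-on-demand idea behind PagedAttention can be sketched with a toy allocator. This is not vLLM's implementation or API: the class `PagedKVCache` and its methods are hypothetical, and it only tracks which blocks a sequence owns rather than storing any tensors.

```python
class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one token; take a new block only when the last is full."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("a")
for _ in range(5):
    cache.append_token("b")

blocks_a = len(cache.block_tables["a"])  # 40 tokens need 3 blocks of 16
blocks_b = len(cache.block_tables["b"])  # 5 tokens need 1 block
cache.free("a")
free_after = len(cache.free_blocks)      # 8 - 1 block still held by "b"
```

Because blocks do not have to be contiguous, a sequence never reserves space for its maximum possible length up front, and the only slack is the unused tail of its last block.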
TGI (Text Generation Inference)
HuggingFace's own serving engine. Same category as vLLM, offering continuous batching, tensor parallelism, and quantization. Tighter integration with the HF ecosystem.
- Difference from vLLM: simpler deployment if you are already deep in HuggingFace tooling, and better integration with model cards, tokenizers, and the Hub. Throughput tends to be slightly lower at extreme concurrency.
- Pick when: your workflow already lives in HuggingFace and you value simplicity over squeezing the last tokens-per-second at 100+ concurrent requests.
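Continuous batching, which both vLLM and TGI offer, is easier to appreciate with a toy simulation. The sketch below counts decode steps only and assumes every step costs the same; the function names and the request lengths are made up for illustration and do not reflect either engine's scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; nobody joins mid-batch."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decode step, finished requests leave and waiting ones take their slots."""
    waiting = deque(lengths)
    active = []  # remaining tokens to generate for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Drop requests that just produced their last token.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: output lengths in tokens, two long requests among short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this workload static batching needs 200 steps because each batch waits for its 100-token request, while continuous batching needs 105 because short requests vacate their slots and the second long request starts early.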
TensorRT-LLM
NVIDIA's ahead-of-time compiler. You feed it your model, it produces an optimized engine binary with fused CUDA kernels tuned for your specific GPU architecture. Think GCC versus the Python interpreter.
- Key tradeoff: a compilation step before serving (30+ minutes). The output binary only runs on the target GPU family. Change GPU type or model version and you recompile. Fastest raw inference on NVIDIA hardware because it exploits hardware-specific features like FP8 on Hopper.
- Pick when: you have a stable model that rarely changes, you are committed to NVIDIA GPUs, and you need absolute maximum tokens-per-second. Common in high-scale production where the compilation cost is amortized across millions of requests.
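The amortization argument is simple arithmetic. A rough sketch, with every number below being a hypothetical placeholder rather than a measurement, and `break_even_requests` a made-up helper name:

```python
def break_even_requests(compile_seconds, baseline_latency_s, compiled_latency_s):
    """Requests needed before the time saved per request repays the one-off build."""
    saved_per_request = baseline_latency_s - compiled_latency_s
    if saved_per_request <= 0:
        raise ValueError("compiled engine must be faster than the baseline")
    return compile_seconds / saved_per_request


# Hypothetical: a 30-minute build, 1.0 s per request before and 0.75 s after.
requests_needed = break_even_requests(30 * 60, 1.0, 0.75)
print(requests_needed)
```

Under these assumed numbers the build pays for itself after 7200 requests, which is trivial at production scale and never reached if the model changes every few days.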
llama.cpp / GGML
Runs models on CPUs and consumer hardware. Heavy quantization (4-bit, 5-bit, 8-bit), minimal dependencies, pure C/C++. Can offload layers to a GPU for hybrid execution.
- Difference from everything above: not designed for high-throughput serving. Designed for running models locally with minimal resources. Aggressive quantization trades some precision for accessibility.
- Quality note: for text generation, 4-bit quantization is often acceptable. For audio waveform generation or other outputs where small numerical errors show up directly in the result, quality degradation may be noticeable.
- Pick when: you are running on a MacBook, a gaming PC, or any environment without datacenter GPUs. Prototyping, local development, or workloads where you do not need a server.
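The precision tradeoff can be seen in a simplified round trip. This sketch uses plain symmetric block quantization with one scale per block; it is an assumption-laden illustration, not the actual on-disk formats llama.cpp uses, and the weights are made-up values.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization: one float scale per block plus small signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quants]


def max_round_trip_error(values, bits):
    scale, quants = quantize_block(values, bits)
    restored = dequantize_block(scale, quants)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical weights for one block.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
error_4bit = max_round_trip_error(weights, bits=4)
error_8bit = max_round_trip_error(weights, bits=8)
print(error_4bit, error_8bit)
```

The worst-case error per weight is about half the scale, so dropping from 8 bits to 4 bits makes the rounding grid roughly 18 times coarser (127 levels per side versus 7) for the same block.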
ONNX Runtime
Microsoft's hardware-agnostic engine. You export a model to ONNX format, and ONNX Runtime runs it on NVIDIA GPUs, AMD GPUs, Intel accelerators, CPUs, mobile, and edge devices through pluggable execution providers.
- Difference from others: not specialized for LLMs or autoregressive generation. More general-purpose, covering classical ML, vision models, and smaller networks. Jack of all trades.
- Pick when: hardware portability matters (multi-cloud, edge, AMD GPUs), or you are serving non-LLM models where the autoregressive optimizations of vLLM/TGI are irrelevant.
PyTorch native
No engine, no server, no optimization layer. Load the weights, call model.generate() (the Hugging Face transformers method, which runs on plain PyTorch), and let PyTorch handle it. PyTorch 2.x's torch.compile can JIT-compile parts of the model, but there is no KV cache paging, request scheduling, or batching infrastructure.
- Pick when: simplicity matters more than performance. Prototyping, batch jobs that process one item at a time, or low-traffic internal tools where GPU utilization efficiency is not the concern.
How I decide
After going through all of this, my mental model collapsed into a few questions I ask in order:
- What hardware am I targeting? CPU-only or consumer hardware points to llama.cpp. A need for hardware portability points to ONNX Runtime. NVIDIA GPUs mean I continue below.
- How many concurrent users? Single-request or low concurrency: PyTorch native or vLLM (for serving convenience). High concurrency (10+ simultaneous): vLLM or TGI.
- Latency or flexibility? If the model is stable and I need maximum speed, TensorRT-LLM. If I need to iterate quickly and swap model versions, vLLM or TGI.
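The three questions above can be written down as a small function. This is one possible reading, not a rule: the hardware labels, the parameter names, and applying the third question only once concurrency is high are all assumptions made for the sketch.

```python
def pick_engine(hardware, concurrent_users, model_is_stable, need_max_speed):
    """Return candidate engines by asking the three questions in order.

    hardware: "cpu", "consumer", "portable", or "nvidia" (illustrative labels).
    """
    # Question 1: what hardware am I targeting?
    if hardware in ("cpu", "consumer"):
        return ["llama.cpp"]
    if hardware == "portable":
        return ["ONNX Runtime"]

    # Question 2: how many concurrent users? (NVIDIA GPU from here on)
    if concurrent_users < 10:
        return ["PyTorch native", "vLLM"]

    # Question 3: raw speed on a stable model, or flexibility?
    if model_is_stable and need_max_speed:
        return ["TensorRT-LLM"]
    return ["vLLM", "TGI"]


print(pick_engine("consumer", 1, False, False))
print(pick_engine("nvidia", 2, False, False))
print(pick_engine("nvidia", 200, True, True))
print(pick_engine("nvidia", 200, False, True))
```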
This decision is not permanent. Most workloads start with PyTorch native or vLLM because the priority is validating the model works at all. The engine swap happens later, when the workload is stable and performance or cost becomes the constraint. Picking the "best" engine before knowing the workload shape is premature optimization in the most literal sense.