I recently spent time understanding the inference engine landscape because I needed to pick one for a model-serving workload and realized I had no mental model for the tradeoffs. Everything I found was either a benchmark post comparing numbers without context, or a getting-started tutorial that never explained why that particular engine was the right fit. So I built my own reference. Here is what I learned about the main options and when each one earns its place.

The problem all of them solve

Running a model naively (load weights, call forward pass) wastes GPU memory and gives poor throughput under concurrency. The KV cache, which stores the attention keys and values for every token of every active request, grows linearly with sequence length and, when allocated naively, fragments VRAM. Inference engines optimize memory management, batching, and compute scheduling so you get more out of the same hardware.
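To make the memory pressure concrete, here is a back-of-the-envelope sketch of KV cache size. The model dimensions are assumptions chosen to resemble a 7B-class transformer with standard multi-head attention and fp16 cache entries. They are not taken from any specific engine, and models that use grouped-query attention would have fewer KV heads.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that a single token occupies across all layers."""
    # Factor of 2: every layer stores one key and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(seq_len: int, batch_size: int, **dims: int) -> int:
    """Total KV cache for batch_size requests of seq_len tokens each."""
    return seq_len * batch_size * kv_cache_bytes_per_token(**dims)


# Hypothetical 7B-class configuration (assumed values, fp16 cache).
dims = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)

print(f"{kv_cache_bytes_per_token(**dims) / 2**20:.2f} MiB per token")
print(f"{kv_cache_bytes(4096, 1, **dims) / 2**30:.2f} GiB for one 4096-token request")
print(f"{kv_cache_bytes(4096, 16, **dims) / 2**30:.2f} GiB for 16 such requests")
```

Under these assumptions the cache costs 0.5 MiB per token, so a single 4096-token request needs 2 GiB and sixteen concurrent ones need 32 GiB, which is more than the fp16 weights of the model itself.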

vLLM

One of the most widely used open-source options for serving transformers right now. Loads HuggingFace model weights as-is and applies its optimizations at runtime, with no pre-compilation step required.

TGI (Text Generation Inference)

HuggingFace's own serving engine. Same category as vLLM, offering continuous batching, tensor parallelism, and quantization. Tighter integration with the HF ecosystem.
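Continuous batching is the feature that matters most for throughput under concurrency, so a toy model of it may help. This is a simplified simulation, not how vLLM or TGI actually schedule work: it assumes each request needs a known number of decode steps, ignores prefill cost and memory limits, and uses hypothetical function names.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests finish early but their slots sit idle until the
        # longest request in the batch is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        active = [left - 1 for left in active if left > 1]
    return steps


# Assumed workload: tokens to generate per request, two slots on the GPU.
lengths = [2, 8, 3, 8, 2, 3]
print("static:    ", static_batching_steps(lengths, 2))
print("continuous:", continuous_batching_steps(lengths, 2))
```

For this workload the static scheduler takes 19 steps and the continuous one takes 13, which is the minimum possible for 26 tokens on two slots. The gap grows as output lengths become more uneven.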

TensorRT-LLM

NVIDIA's ahead-of-time compiler. You feed it your model, it produces an optimized engine binary with fused CUDA kernels tuned for your specific GPU architecture. Think GCC versus the Python interpreter.

llama.cpp / GGML

Runs models on CPUs and consumer hardware. Heavy quantization (4-bit, 5-bit, 8-bit), minimal dependencies, pure C/C++. Can offload layers to a GPU for hybrid execution.
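The reason quantization makes consumer hardware viable is mostly arithmetic. The sketch below estimates weight memory from parameter count and bits per weight. The 7B parameter count is an assumed example, and real quantized files come out somewhat larger because block formats also store scales and keep some tensors at higher precision.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate memory for model weights, ignoring per-block metadata."""
    return n_params * bits_per_weight / 8


n_params = 7_000_000_000  # assumed 7B-parameter model

for bits in (16, 8, 5, 4):
    gb = weight_bytes(n_params, bits) / 1e9
    print(f"{bits:>2}-bit: about {gb:.1f} GB")
```

Going from 16-bit to 4-bit takes the weights from roughly 14 GB to roughly 3.5 GB, which is the difference between needing a datacenter GPU and fitting in laptop RAM.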

ONNX Runtime

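Microsoft's hardware-agnostic engine. The runtime does not do the conversion itself: you export the model to ONNX format first (for example with torch.onnx), and ONNX Runtime then executes it on NVIDIA GPUs, AMD GPUs, Intel accelerators, CPUs, mobile, and edge devices.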
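ONNX Runtime gets its portability from backends called execution providers, which you list in order of preference when creating a session. The helper below is a hypothetical, dependency-free sketch of that selection logic. With the library installed, the available list would come from onnxruntime.get_available_providers() and the result would be passed as the providers argument of onnxruntime.InferenceSession.

```python
def pick_providers(preferred, available):
    """Keep the preferred providers that this machine supports, in order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise ValueError("none of the preferred execution providers is available")
    return chosen


# Preference order: fastest backend first, CPU as the universal fallback.
preferred = [
    "CUDAExecutionProvider",
    "OpenVINOExecutionProvider",
    "CPUExecutionProvider",
]

# Assumed example of what a CPU-only machine would report.
available = ["CPUExecutionProvider"]

print(pick_providers(preferred, available))

# With onnxruntime installed, the same idea would look roughly like this
# ("model.onnx" is a placeholder path):
#   import onnxruntime as ort
#   providers = pick_providers(preferred, ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

The same exported model file runs unchanged on either machine. Only the provider list differs.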

PyTorch native

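No engine, no server, no optimization layer. Load the weights, call model.generate() (the HuggingFace Transformers method that sits on top of PyTorch), and let PyTorch handle it. PyTorch 2.x's torch.compile can JIT-optimize some operations, but there is no KV cache memory management, request scheduling, or batching infrastructure beyond what you write yourself.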

How I decide

After going through all of this, my mental model collapsed into a few questions I ask in order:

This decision is not permanent. Most workloads start with PyTorch native or vLLM because the priority is validating the model works at all. The engine swap happens later, when the workload is stable and performance or cost becomes the constraint. Picking the "best" engine before knowing the workload shape is premature optimization in the most literal sense.