Benchmarking LLM Serving Stacks: Realistic Loads and Production Patterns
- Mark Chomiczewski
- 27 April 2026
You've spent weeks fine-tuning a model, and it looks great in your notebook. But the moment you deploy it to a production environment with a hundred concurrent users, the system crawls to a halt. This gap between a successful demo and a stable production service is where most AI projects fail. To bridge it, you need to move beyond simple accuracy metrics and start benchmarking your LLM serving stack: stress-testing your infrastructure to see exactly where it breaks.
The Difference Between Load Testing and Performance Benchmarking
It's common to use these terms interchangeably, but in the world of inference, they solve two different problems. Think of performance benchmarking as checking how fast a car can go on a straight track, while load testing is checking if the car can handle a full load of passengers while driving up a mountain in a storm.
Performance Benchmarking is the process of measuring the raw efficiency of a model and its serving software. Here, you're looking for the maximum throughput and the lowest possible latency. You want to know: "If I have one request, how fast is the token generation?"
Load Testing is simulating high volumes of concurrent traffic to find the breaking point of the system. This identifies bottlenecks in autoscaling, network congestion, and memory overflows. It answers: "How many users can I support before the system crashes or the response time becomes unacceptable?"
If you only do performance benchmarking, you'll be blindsided by concurrency issues. If you only do load testing, you won't know if your model is running inefficiently. You need both to build a reliable production pattern.
Crucial Metrics That Actually Matter
Forget about "average latency." In LLM serving, averages lie. A few very long requests can skew your data, making the system look slower than it is for most users, or hiding the fact that 10% of your users are experiencing 30-second delays.
Instead, focus on these specific indicators:
- Time-to-First-Token (TTFT): This is the "perceived latency." It's the time from when a user hits enter to when the first character appears. In a chat interface, a high TTFT makes the app feel broken, even if the rest of the text streams in quickly.
- Tokens Per Second (TPS): The overall speed of generation. This tells you how much "work" the GPU is doing.
- Queries Per Second (QPS): The number of distinct requests the server handles. This is your primary scaling metric.
- P90 and P99 Latency: Instead of averages, look at the 90th and 99th percentiles. If your P99 TTFT is 5 seconds, it means 1% of your users are having a terrible experience. That's the number you need to optimize.
| Metric | What it Measures | Ideal Value | Impact on User |
|---|---|---|---|
| TTFT | Responsiveness | Low (e.g., < 0.8s) | Perceived speed of the app |
| TPS | Generation Speed | High | Reading comfort / UX |
| QPS | System Capacity | High | Ability to scale to more users |
| P99 Latency | Worst-case scenario | Low/Stable | Consistency of experience |
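To make these numbers concrete, here is a minimal sketch of turning per-request measurements into the metrics above. The `RequestResult` fields and the nearest-rank percentile helper are illustrative assumptions, not part of any benchmarking library.

```python
# Minimal sketch: summarizing per-request measurements into TTFT, TPS, QPS, and P99.
# Assumes you already collected (ttft_s, total_tokens, duration_s) per request;
# the field names and the `results` list are illustrative, not from any library.
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_s: float        # time from send to first streamed token
    total_tokens: int    # tokens generated for this request
    duration_s: float    # full generation time for this request

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for a benchmark report."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[idx]

def summarize(results: list[RequestResult], wall_clock_s: float) -> dict:
    ttfts = [r.ttft_s for r in results]
    tps = [r.total_tokens / r.duration_s for r in results if r.duration_s > 0]
    return {
        "p50_ttft_s": percentile(ttfts, 50),
        "p90_ttft_s": percentile(ttfts, 90),
        "p99_ttft_s": percentile(ttfts, 99),
        "mean_tps_per_request": sum(tps) / len(tps),
        "qps": len(results) / wall_clock_s,   # completed requests per second
    }
```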
Client-Side vs. Server-Side Benchmarking
Where you run your benchmark scripts changes the results drastically. If you run your tests on the same machine as your model server (localhost), you are performing server-side benchmarking. This is great for isolating hardware performance. You eliminate the "noise" of the internet, meaning you can tell if an NVIDIA H100 is actually performing better than an L40S without worrying about a slow WiFi connection.
However, your users aren't running the model on their own machines. They are hitting an API over the web. Client-side benchmarking, where the test runs from a separate machine, introduces real-world variables like DNS resolution, TLS handshakes, and network jitter.
The rule of thumb is simple: use server-side tests to optimize your hardware and software stack, but use client-side tests to set your Service Level Objectives (SLOs) for your customers.
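Here is a hedged sketch of a client-side TTFT probe, assuming an OpenAI-compatible streaming endpoint; the URL, model name, and prompt are placeholders for your own deployment.

```python
# Minimal client-side TTFT probe, assuming an OpenAI-compatible streaming endpoint.
# The base URL, model name, and prompt are placeholders, not a real deployment.
import time
import requests

def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    payload = {"model": model, "prompt": prompt, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    with requests.post(f"{base_url}/v1/completions", json=payload,
                       stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty SSE chunk ~ first token on the wire
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")

# Run this from a machine that sits where your users sit, not on the GPU host,
# so DNS, TLS, and network jitter are included in the number you report as an SLO.
print(measure_ttft("http://your-endpoint:8000", "your-model",
                   "Explain KV caching in one line."))
```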
Implementing Realistic Production Patterns
Synthetic loads (sending the same "Hello, how are you?" request 1,000 times) are useless because modern LLM servers are very good at caching identical prompts: the cache absorbs most of the work, and your numbers look far better than real traffic ever will. To get a realistic picture, you need a diverse dataset that mirrors your actual traffic mix.
A realistic production benchmark should include:
- Diverse Context Windows: Mix short prompts with massive 32k token documents. This tests how the KV Cache handles memory pressure and causes evictions.
- Randomized Input: Append random questions to each context to force the model to actually compute rather than serve from the cache (see the sketch after this list).
- Continuous Batching: Use stacks that support continuous batching, like vLLM or SGLang. This allows the server to insert new requests into the batch as soon as another request finishes, rather than waiting for the whole batch to complete.
- Warm-up Periods: Always ignore the first 30 to 60 seconds of your data. The "cold start" effect, where the GPU is initializing and the cache is empty, will spike your latency and ruin your averages.
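As referenced above, here is a minimal sketch of a cache-unfriendly workload builder with a warm-up cutoff. The sample prompts, the roughly 30% long-context ratio, and the 60-second warm-up are assumptions you would tune to your own traffic.

```python
# Sketch of a cache-unfriendly workload builder (assumptions: toy prompts,
# a ~30% long-context ratio, and a 60-second warm-up you discard in analysis).
import random
import time

SHORT_PROMPTS = ["Summarize our refund policy.", "Translate 'hello' to French."]
# Stand-ins for real long documents; swap in your own 8k-32k token corpus.
LONG_DOCUMENTS = ["background clause " * 8000, "meeting transcript line " * 16000]
QUESTIONS = ["What are the key risks?", "List the action items.", "Who signs off?"]

def build_workload(n: int) -> list[str]:
    prompts = []
    for _ in range(n):
        ctx = random.choice(LONG_DOCUMENTS if random.random() < 0.3 else SHORT_PROMPTS)
        # The random nonce defeats exact-prompt caching without changing the task.
        prompts.append(f"{ctx}\n\nQ-{random.randint(0, 10**6)}: {random.choice(QUESTIONS)}")
    return prompts

WARMUP_SECONDS = 60
run_start = time.time()
workload = build_workload(1_000)
# Send `workload` with your load generator, tag each response with time.time(),
# and drop every sample recorded before run_start + WARMUP_SECONDS.
```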
For those deploying massive models, consider the trade-off between Tensor Parallelism (TP) and Pipeline Parallelism (PP). For instance, running a Qwen3-32B model across multiple GPUs with a TP2:PP1 configuration often yields a better balance of tokens-per-second-per-dollar than a single-GPU setup that's constantly swapping memory to disk.
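If you serve with vLLM, a TP2:PP1 layout can be expressed roughly as in the sketch below using its offline Python API; the model identifier and the availability of these arguments depend on your vLLM version, so treat this as a starting point rather than a recipe.

```python
# Sketch only: expressing a TP2:PP1 layout with vLLM's Python API.
# Argument availability varies by release; check your vLLM version's docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",        # model id assumed; use your own checkpoint path
    tensor_parallel_size=2,        # TP2: shard each layer across two GPUs
    pipeline_parallel_size=1,      # PP1: no pipeline stages
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```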
The Iteration Loop: Speeding Up the Feedback Cycle
Benchmarking is not a one-and-done task; it's a loop. You change a parameter in your serving stack, you run the test, you analyze the P99s, and you tweak again. If this process takes an hour per iteration, you'll stop doing it. To keep the momentum, focus on Developer Experience (DevEx).
Automate the start-measure-analyze cycle. A simple shell script that launches the server, runs a Locust or Apache JMeter load test, and pipes the result into a CSV is worth more than the most expensive monitoring dashboard. Also, implement weight caching so you aren't downloading a 100GB model from a registry every time you restart your server.
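A minimal sketch of that automation, assuming a vLLM server, a local `locustfile.py`, and a sweep over one scheduler parameter; the commands, values, and file names are placeholders for your own stack.

```python
# Sketch of the start-measure-analyze loop. The server command, locustfile,
# readiness wait, and sweep values are placeholders for your own setup;
# Locust's --headless/--csv flags handle the result export.
import subprocess
import time

def run_iteration(max_batched_tokens: int) -> None:
    server = subprocess.Popen([
        "vllm", "serve", "Qwen/Qwen3-32B",
        "--max-num-batched-tokens", str(max_batched_tokens),
    ])
    try:
        time.sleep(120)  # crude readiness wait; poll the health endpoint in a real harness
        subprocess.run([
            "locust", "-f", "locustfile.py", "--headless",
            "-u", "50", "-r", "10", "--run-time", "5m",
            "--csv", f"results_mbt_{max_batched_tokens}",
        ], check=True)
    finally:
        server.terminate()
        server.wait()

for value in (2048, 4096, 8192):   # whichever parameter you are sweeping this week
    run_iteration(value)
```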
Common Pitfalls to Avoid
Many teams fall into the trap of ignoring infrastructure overhead. NVIDIA's GenAI-Perf research shows that in low-concurrency scenarios, things like prompt generation and response storage can account for up to 33% of the total time. If you don't isolate the model's actual inference time from the API wrapper's overhead, you'll spend weeks optimizing the wrong thing.
Another mistake is benchmarking only with smooth, constant traffic. Production traffic is bursty. If your benchmark shows a steady 5 requests per second, but your real-world traffic spikes to 50 for ten seconds every hour, your queue will build up and your TTFT will skyrocket for everyone. You must test "burstiness" to see how your scheduler handles a sudden queue of pending requests.
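One way to test burstiness is to generate arrival timestamps instead of a constant rate. The sketch below assumes the 5-baseline/50-spike scenario described above and Poisson inter-arrival gaps; the numbers are illustrative.

```python
# Sketch of a bursty arrival schedule: a 5 req/s baseline with a 50 req/s,
# ten-second spike once per hour, matching the scenario described above.
import random

def arrival_times(duration_s: float) -> list[float]:
    times, t = [], 0.0
    while t < duration_s:
        in_spike = (t % 3600) < 10            # first 10 s of every hour
        rate = 50.0 if in_spike else 5.0      # requests per second
        t += random.expovariate(rate)         # Poisson inter-arrival gap
        times.append(t)
    return times

# Feed these timestamps to your load generator instead of a constant rate, then
# watch how long the queue (and the P99 TTFT) takes to drain after each spike.
schedule = arrival_times(2 * 3600)
```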
Why is Time-to-First-Token (TTFT) more important than total latency?
TTFT determines the perceived responsiveness of an application. Because LLMs stream their output, the user only cares about how long they have to wait before the first word appears. Once the streaming starts, the human brain is generally okay with the generation speed as long as it's faster than the reading speed.
What is the best tool for simulating LLM load?
It depends on your needs. For high-level API stress testing, Locust and Gatling are excellent. For deep-dive hardware and model performance, NVIDIA's GenAI-Perf or specialized tools like TensorMesh provide more granular insights into token-level metrics and GPU utilization.
How do I handle "cold start" spikes in my benchmarks?
The industry standard is to implement a "warm-up" phase. Send a series of representative requests to the server for 30-60 seconds to initialize the GPU kernels and populate the cache before you start recording any metrics for your final report.
What is the impact of continuous batching on throughput?
Continuous batching significantly increases throughput by eliminating the idle time spent waiting for the longest request in a batch to finish. Instead of processing in fixed blocks, the server can eject finished requests and insert new ones immediately, maximizing GPU utilization.
Should I prioritize TP or PP for a 30B+ parameter model?
Tensor Parallelism (TP) is generally better for reducing latency as it splits the workload across GPUs for a single request. Pipeline Parallelism (PP) is better for fitting massive models that exceed the memory of a few GPUs. For a 32B model on L40S GPUs, a TP2:PP1 split often provides the best balance of cost and speed.