How to Choose Batch Sizes to Minimize Cost per Token in LLM Serving


You are paying too much for your Large Language Model (LLM) inference. If you are sending requests one by one to your GPU or API provider, you are leaving money on the table. The difference between a profitable AI product and a cash-burning experiment often comes down to one technical decision: how you group those requests.

Batching is not just a performance tweak; it is an economic lever. By grouping multiple user queries into a single computation block, you maximize the work your hardware does per second. This directly lowers the cost per token. In 2026, with the LLM market projected to hit $36.1 billion, efficiency isn't optional; it's survival. Let’s look at exactly how to pick the right batch size for your specific workload without sacrificing latency.

The Economics of Batching: Why Size Matters

At its core, LLM serving is about keeping the GPU busy. GPUs are expensive parallel processors. When you send a single request, the GPU spends most of its time waiting for data rather than computing. It’s like filling a massive industrial oven with a single cookie. You pay for the electricity to heat the whole chamber, but you only get one result.

When you batch requests, you fill that oven. According to research from Koombea AI, proper batching can reduce API overhead costs by up to 90%. Organizations like First American and Scribd have reported 50% cost reductions on massive document processing workloads simply by optimizing this parameter. The math is simple: a decoding step takes roughly the same time whether it serves one request or fifty, because the step is dominated by reading model weights from memory. Fifty requests therefore yield close to fifty times the tokens per second for the same GPU-hour, and the cost per token drops accordingly until memory or compute saturates.
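To make that arithmetic concrete, here is a minimal sketch of the cost-per-token calculation. The GPU price and the single-request throughput are hypothetical figures chosen for illustration, and the linear scaling of throughput with batch size is an idealized assumption that real hardware only approximates at small batch sizes.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical figures, for illustration only.
GPU_COST_PER_HOUR = 2.00          # dollars per GPU-hour (assumed)
SINGLE_STREAM_TOKENS_PER_S = 40   # throughput with one request at a time (assumed)

for batch_size in (1, 8, 32):
    # Idealized assumption: throughput scales linearly with batch size.
    throughput = SINGLE_STREAM_TOKENS_PER_S * batch_size
    cost = cost_per_million_tokens(GPU_COST_PER_HOUR, throughput)
    print(f"batch={batch_size:>2}  ${cost:.2f} per million tokens")
# batch= 1  $13.89 per million tokens
# batch= 8  $1.74 per million tokens
# batch=32  $0.43 per million tokens
```

Swap in your own measured throughput at each batch size to see where the curve actually flattens on your hardware.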

However, there is a catch. Larger batches mean longer wait times for individual users. This is the fundamental trade-off in LLM serving: throughput versus latency. Your goal is to find the "sweet spot" where you maximize GPU utilization without pushing response times beyond what your users will tolerate.

Static vs. Dynamic vs. Continuous Batching

Not all batching strategies are created equal. Choosing the wrong type can negate the cost savings you’re trying to achieve. Here is how the three main approaches compare:

Comparison of Batching Strategies for LLM Serving

| Strategy | Best For | Latency Impact | Cost Efficiency |
|---|---|---|---|
| Static Batching | Predictable workloads (e.g., nightly report generation) | High (waits for full batch) | Moderate |
| Dynamic Batching | Variable traffic (e.g., chatbots during peak hours) | Medium (adjusts based on queue) | High |
| Continuous Batching | Real-time interactive apps requiring high throughput | Low (inserts new requests as others finish) | Very High (up to 24x throughput gain) |

Static batching is the simplest to implement but the least efficient for real-time apps because it waits for a fixed number of requests before starting computation, and then holds every slot until the longest sequence in the batch finishes. Dynamic batching improves this by adjusting the batch size based on current load. However, continuous batching is the gold standard for modern LLM serving. Tools like vLLM and TensorRT-LLM use this technique to insert new sequences into the batch as soon as other sequences complete generation. This keeps the GPU fully utilized throughout the entire generation process, not just at the start. Benchmarks show continuous batching can achieve up to 24x higher throughput compared to standard implementations.
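The difference is easiest to see in a toy simulation. The sketch below counts decode steps for a queue of requests with mixed output lengths under both strategies. It is a simplified model, not how vLLM or TensorRT-LLM are implemented: it assumes every decode step costs the same regardless of how many slots are filled, and it ignores prompt processing entirely. The function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A freed slot is refilled from the queue before the next decode step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Fill any free slots with waiting requests.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every active sequence.
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Two long generations mixed in with six short ones (output tokens per request).
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 105
```

In the static case, the short requests finish after 5 steps but their slots sit idle for the other 95. In the continuous case, the second long request starts as soon as a slot frees up, so the two long generations overlap almost completely.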


Optimal Batch Sizes by Task Type

There is no universal "best" batch size. The optimal number depends heavily on what your model is doing. A classification task is computationally lighter than open-ended text generation. Here are the recommended ranges based on industry benchmarks from 2024-2025:

  • Text Generation (Open-ended): 10-50 requests per batch. These tasks require more memory for the Key-Value (KV) cache as tokens are generated sequentially.
  • Classification & Sentiment Analysis: 100-500 requests per batch. Since these tasks output a single label or short string, they consume less memory and benefit from large parallel processing.
  • Simple Q&A Systems: 50-200 requests per batch. These fall in the middle, balancing context length with output brevity.

For example, a fintech engineering team reduced their support ticket classification costs by 58% by moving from individual API calls to a batch size of 35. They noted that it took three weeks of tuning to find this specific number, which shows that the ranges above are starting points and that empirical testing is crucial. If you are using a smaller model like Mistral 7B, you can push these numbers higher. If you are running a behemoth like Llama 2 70B, you may need to stay lower to avoid running out of memory.

Hardware Constraints and Memory Limits

Your choice of batch size is ultimately capped by your GPU’s VRAM (Video RAM). Every active request in a batch consumes memory for the KV cache, which stores the attention keys and values for every token the request has processed so far. As the batch size or the context length grows, so does the memory footprint.

Research published in early 2025 indicates that consumer-grade GPUs often offer superior memory bandwidth per dollar compared to enterprise chips like the A100 or H100 for smaller models. However, for large models, you are likely bound by total capacity. A common limit seen in benchmarks is a batch size of 64 for larger models before hitting out-of-memory errors. Beyond this point, throughput plateaus while latency spikes dangerously.
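You can estimate the memory ceiling before running a single benchmark. The sketch below uses the standard KV cache sizing formula (keys plus values, per layer, per attention head) to bound the batch size. The model shape is an illustrative one, roughly what a 7B-parameter model without grouped-query attention looks like, and the 10 GiB of free VRAM is an assumed figure for whatever is left after loading weights. Treat the result as a worst-case bound: it assumes every request grows to the full context length, whereas engines with paged KV caches allocate memory as tokens are actually produced.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(free_vram_bytes, tokens_per_request, bytes_per_token):
    """Largest number of concurrent requests whose caches fit in free VRAM."""
    return free_vram_bytes // (tokens_per_request * bytes_per_token)


# Illustrative model shape (assumed), stored in 16-bit precision.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Assumed: 10 GiB left for the cache, 2,048 tokens of context per request.
print(max_batch_size(10 * GIB, tokens_per_request=2048, bytes_per_token=per_token))  # 10
```

Halving `bytes_per_value` (for example, with an 8-bit cache) or shortening the context doubles the number of requests that fit, which is why quantization and output limits appear in the strategies below.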

To manage this, consider these hardware-aware strategies:

  1. Quantization: Using 4-bit or 8-bit quantized models reduces memory usage significantly, allowing you to increase batch sizes without upgrading hardware.
  2. Model Cascading: Route 90% of simple queries to a small, cheap model (like Mistral 7B) and only send complex tasks to premium models. Combined with batching, this can cut costs by up to 87%.
  3. Early Stopping: Configure your system to halt token generation once a satisfactory completion is reached. This can reduce output tokens by 20-40%, freeing up memory for larger batches.
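The savings from model cascading come down to a weighted average, which is worth computing for your own traffic mix. Here is a minimal sketch; the per-token prices are hypothetical placeholders, not quotes from any provider, and the calculation ignores the cost of the routing step itself.

```python
def blended_cost(small_share, small_price, large_price):
    """Average price per million tokens when small_share of traffic uses the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * small_price + (1.0 - small_share) * large_price


# Hypothetical prices in dollars per million tokens.
SMALL_MODEL_PRICE = 0.50
LARGE_MODEL_PRICE = 10.00

cost = blended_cost(0.90, SMALL_MODEL_PRICE, LARGE_MODEL_PRICE)
savings = 1.0 - cost / LARGE_MODEL_PRICE
print(f"blended: ${cost:.2f} per million tokens, savings: {savings:.1%}")
# blended: $1.45 per million tokens, savings: 85.5%
```

The result is very sensitive to the routing share: if only 70% of queries can safely go to the small model, the same prices give a much smaller saving, so measure how often the small model's answers are actually good enough.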

Practical Implementation Steps

Implementing effective batching requires a structured approach. Do not guess. Measure.

Step 1: Baseline Your Current Costs. Calculate your current cost per token and average latency. If you are using OpenAI’s API, note that their Batch API offers significant discounts (up to 50% off standard rates) for non-real-time jobs. If you are self-hosting, track your GPU utilization metrics.

Step 2: Identify Your Latency Budget. How slow can your app be? For a chatbot, users might tolerate a 500ms delay. For a background document processor, a 5-second delay is fine. This budget dictates your maximum viable batch size.

Step 3: Test Incrementally. Start with a small batch size (e.g., 4) and double it (8, 16, 32) while monitoring two metrics: throughput (tokens per second) and p95 latency. Stop increasing the batch size when latency exceeds your budget or when throughput gains diminish (diminishing returns usually hit around batch size 64).
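The doubling procedure in Step 3 can be automated with a small search loop. This is a sketch under assumptions: `measure` is a hypothetical callback that you supply, which runs a load test at a given batch size and returns throughput and p95 latency; the 10% minimum-gain threshold is an arbitrary default, and `fake_measure` exists only to make the example runnable.

```python
from typing import Callable, Tuple


def find_batch_size(
    measure: Callable[[int], Tuple[float, float]],
    latency_budget_ms: float,
    start: int = 4,
    max_batch: int = 256,
    min_gain: float = 0.10,
) -> int:
    """Double the batch size until p95 latency exceeds the budget or
    throughput improves by less than min_gain. Returns the last size
    that passed, or 0 if even the starting size misses the budget."""
    best = 0
    prev_throughput = None
    batch = start
    while batch <= max_batch:
        throughput, p95_ms = measure(batch)
        if p95_ms > latency_budget_ms:
            break  # too slow for users
        if prev_throughput is not None and throughput < prev_throughput * (1 + min_gain):
            break  # diminishing returns
        best = batch
        prev_throughput = throughput
        batch *= 2
    return best


def fake_measure(batch: int) -> Tuple[float, float]:
    """Stand-in for a real load test: throughput saturates at batch 32."""
    throughput = 100.0 * min(batch, 32)   # tokens per second
    p95_ms = 100.0 + 10.0 * batch         # latency grows with batch size
    return throughput, p95_ms


print(find_batch_size(fake_measure, latency_budget_ms=500))   # 32
print(find_batch_size(fake_measure, latency_budget_ms=200))   # 8
```

In a real test, run each batch size long enough to collect a stable p95, and repeat the sweep under realistic prompt lengths, since both numbers shift with the workload.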

Step 4: Deploy Continuous Batching. If your infrastructure allows, switch from static to continuous batching using engines like vLLM. The engine handles the complexity of inserting new requests as old ones finish, so instead of picking a fixed batch size you set a cap on concurrent sequences and let the scheduler fill it.
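As a starting point, here is a minimal sketch of offline generation with vLLM's Python API, where `max_num_seqs` caps how many sequences the scheduler runs concurrently. The model name and the values of 64 and 0.90 are example assumptions to replace with your own, and the helper names `engine_kwargs` and `generate` are made up for this illustration. Running `generate` requires vLLM installed and a suitable GPU; check the vLLM documentation for the options available in your version.

```python
def engine_kwargs(max_concurrent_seqs: int = 64) -> dict:
    """Engine settings for this sketch; all values are examples to tune."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # example model (assumed)
        "max_num_seqs": max_concurrent_seqs,            # cap on concurrent sequences
        "gpu_memory_utilization": 0.90,                 # share of VRAM vLLM may use
    }


def generate(prompts, max_tokens: int = 128):
    """Submit all prompts at once and let the engine schedule them."""
    # Imported lazily so the settings above can be inspected without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    params = SamplingParams(max_tokens=max_tokens, temperature=0.0)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]


if __name__ == "__main__":
    print(engine_kwargs(max_concurrent_seqs=32))
```

The batch size found in Step 3 is a reasonable first value for the concurrency cap, but re-run your latency measurements after switching, because continuous batching changes how queueing delay behaves.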

Common Pitfalls to Avoid

Many teams make the mistake of maximizing batch size regardless of consequences. This leads to "tail latency" issues, where a few users experience extremely long wait times because they got stuck behind a heavy batch. Always monitor p95 and p99 latency, not just averages.
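To see why averages mislead, compare the mean with the tail percentiles on a sample where a minority of requests got stuck. The sketch below uses the nearest-rank percentile definition; the latency values are invented for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 90 fast requests and 10 stuck behind a heavy batch.
latencies_ms = [120] * 90 + [2500] * 10

print(sum(latencies_ms) / len(latencies_ms))  # 358.0 (mean looks tolerable)
print(percentile(latencies_ms, 50))           # 120
print(percentile(latencies_ms, 95))           # 2500
print(percentile(latencies_ms, 99))           # 2500
```

The mean suggests a 358 ms experience that no user actually had: nine in ten saw 120 ms, and one in ten waited 2.5 seconds.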

Another pitfall is ignoring input sequence length variability. A batch containing ten short prompts and one very long prompt will be constrained by the long prompt’s processing time, and in padded batches the short prompts are stretched to the long prompt's length, which wastes compute. Consider grouping requests of similar length into the same batch, or routing long-context requests into their own queue, to maintain consistent performance.
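One simple way to handle length variability is to bucket requests by prompt length before batching. This sketch assumes a padded, static or dynamic batching setup; the bucket boundaries are arbitrary example values, and the helper names are made up for illustration.

```python
from bisect import bisect_left


def bucket_by_length(prompt_lengths, boundaries=(128, 512, 2048)):
    """Group request indices by prompt token count.
    Bucket i holds prompts with length <= boundaries[i];
    the final bucket holds anything longer than the last boundary."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for idx, length in enumerate(prompt_lengths):
        buckets[bisect_left(boundaries, length)].append(idx)
    return buckets


def padding_waste(lengths):
    """Fraction of a padded batch that consists of padding tokens."""
    if not lengths:
        return 0.0
    padded_total = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_total


# Ten short prompts and one very long prompt (token counts).
lengths = [20] * 10 + [4000]

print(round(padding_waste(lengths), 3))   # 0.905 when batched together
buckets = bucket_by_length(lengths)
print([len(b) for b in buckets])          # [10, 0, 0, 1]
print(padding_waste([lengths[i] for i in buckets[0]]))  # 0.0 within the short bucket
```

Batched together, about 90% of the padded batch is filler. Split by bucket, the short prompts run as one tight batch and the long prompt no longer holds them up.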

Finally, do not neglect the cost of streaming. Streaming responses typically cost 20-40% more than batch processing due to the overhead of maintaining constant connections. If your application doesn’t strictly require real-time, token-by-token display, disable streaming to save costs.

What is the ideal batch size for LLM text generation?

For general text generation, a batch size of 10-50 is typically optimal. This range balances GPU utilization with acceptable latency. Larger batches may cause significant delays for individual users, while smaller batches underutilize the GPU.

How does continuous batching differ from static batching?

Static batching waits for a fixed number of requests before processing them all together. Continuous batching dynamically inserts new requests into the pipeline as soon as previous requests finish generating tokens. This results in much higher GPU utilization and better throughput, especially for variable-length outputs.

Can batching reduce my OpenAI API costs?

Yes. OpenAI offers a dedicated Batch API that provides a discount of up to 50% on standard pricing for non-real-time jobs. Additionally, even if you don't use their specific API, batching your requests to third-party providers can reduce overhead fees associated with individual API calls.

Why does my GPU run out of memory with large batch sizes?

Each request in a batch requires memory for the Key-Value (KV) cache, which stores the attention keys and values for every token in that request's context. The total memory required grows linearly with both batch size and sequence length. If the combined memory needs exceed your GPU's VRAM, you will encounter out-of-memory errors. Quantization or using smaller models can help mitigate this.

Is streaming more expensive than batch processing?

Yes, streaming responses typically cost 20-40% more than batch processing. This is due to the computational overhead of maintaining persistent connections and sending data incrementally. If real-time display is not critical, disabling streaming can significantly lower costs.