Why AI infrastructure scales on tokens, not requests
Most technology leaders have spent decades thinking about scale through a familiar lens. More users create more requests. More requests require more servers. More servers increase infrastructure cost. That model worked well for web applications, APIs, databases, and even cloud native architectures. AI has changed that equation.
In the world of Large Language Models, the most important unit of scale is not the request. It is the token.
A thousand users can generate vastly different infrastructure demands depending on how many tokens are processed. Two requests may appear identical from an API perspective, yet one can consume ten times more compute, memory, and cost than the other.
This shift is giving rise to a new discipline that many practitioners are beginning to call Tokenomics.
Just as FinOps brought financial accountability to cloud consumption, Tokenomics brings visibility, governance, and optimization to AI consumption.
To understand why Tokenomics matters, we first need to understand what tokens actually are and why they have become the fundamental unit of AI economics.
1. Tokens and tokenization
Most people assume AI systems scale based on the number of requests they receive. In reality, Large Language Models scale based on tokens. Every interaction with an LLM consumes tokens.
What is a token?
A token is the smallest unit of text processed by a Large Language Model. A token is not a word, character or byte.
Instead, a token represents a statistically-learned fragment of text that the model understands and processes. Depending on the language and structure of the content, a token can represent a complete word, part of a word, punctuation, numbers, or even fragments of code.
What are input tokens?
Input tokens include:
- System prompts
- User queries
- Retrieved context from RAG systems
- Conversation history
What are output tokens?
Output tokens represent the generated response produced by the model
This distinction is important because AI providers charge primarily based on token consumption rather than API request volume.
A request containing 500 tokens costs dramatically less than a request containing 10,000 tokens, even though both appear as a single API call.
How tokenization works
Before a model can understand human language, text must be converted into tokens. Modern LLMs use tokenization algorithms such as:
- Byte Pair Encoding (BPE)
- SentencePiece
- WordPiece
These algorithms break human readable text into machine processable units. Consider the following examples:
- “Cloud optimization” typically becomes 2 to 3 tokens
- “FinOps” is usually processed as a single token
- “Multi cloud governance” may expand into 3 to 5 tokens
- “cost_allocation_rule” often breaks into 4 to 6 tokens because of the underscore separated structure
- “AI transformation” may consume more tokens than expected because emojis require additional encoding before the model can process them
The reality of tokens
In English-language workloads:
- 1 token is approximately 3 to 4 characters
- 100 words typically generate around 130 to 150 tokens
However, token counts increase rapidly when working with:
- JSON
- Source code
- Infrastructure logs
- Tables
- Markdown
- Non English languages
- Emojis
This is particularly relevant for enterprise AI systems where structured data frequently dominates prompts.
Requests measure traffic. Tokens measure compute, memory, latency, and cost.
A typical enterprise AI request often looks something like this:
- System instructions: ~800 tokens
- Retrieved enterprise context: ~2,500 tokens
- User query: ~300 tokens
- AI generated response: ~700 tokens
More than 4K Tokens Per Request.
At first glance this may not seem significant.
However, imagine:
- 15 requests per second
- Millions of active users
- Continuous multi turn conversations
The resulting token volume becomes enormous. This is why request counts alone provide only a partial view of AI consumption.
2. Why enterprise AI requests grow so fast
When teams begin building AI applications, they often focus on two visible components.
- User input
- Model output
What they fail to recognize is that production AI systems contain several hidden layers that contribute far more tokens than users themselves.
In many enterprise deployments, these hidden components become the dominant source of infrastructure consumption.
The three largest contributors are:
- System prompts
- Retrieval pipelines
- Multimodal inputs
Together they create a phenomenon that can be described as token inflation.
System prompts: repeated every request
System prompts establish:
- Behavioral instructions
- Safety guardrails
- Formatting requirements
- Enterprise policies
- Tool usage constraints
Enterprise prompts are rarely small. Many organizations maintain prompts between 1,000 and 2,500 tokens. The critical detail is that these prompts are attached to every request.
Example
The scale of this problem becomes apparent when viewed through an enterprise lens. Imagine an AI assistant that carries a system prompt of approximately 1,800 tokens containing instructions, policies, formatting rules, and guardrails. If that assistant processes 800,000 requests in a month, those instructions alone account for roughly 1.44 billion tokens.
What makes this particularly striking is that these tokens are consumed before any meaningful user interaction takes place. They are not generated by questions, conversations, or business tasks. They are simply the repeated cost of sending the same instructions over and over again with every request.
This is why system prompts often represent one of the largest and least visible contributors to enterprise AI costs. In other words, Your AI assistant may be repeating the same instructions millions of times.
RAG and the context inflation problem
Retrieval Augmented Generation improves response quality by bringing relevant enterprise knowledge into the prompt at runtime. This additional context allows the model to answer questions using company specific information rather than relying solely on its pre-trained knowledge.
The challenge is that accuracy often comes at the cost of significantly higher token consumption.
Imagine a query that retrieves five relevant documents from an enterprise knowledge base. If each document contributes roughly 900 tokens of content, the retrieval layer alone adds approximately 4,500 tokens to the request. When combined with a system prompt containing around 1,800 tokens of instructions and governance policies, the request already exceeds 6,000 tokens before the user has typed more than a simple question.
In other words, the majority of the tokens being processed may originate from the surrounding context rather than the user interaction itself. As retrieval systems become more sophisticated and pull in larger volumes of supporting information, context inflation can quickly become one of the largest drivers of AI infrastructure cost.
Multimodal AI and token explosion
The next wave of enterprise AI is multimodal.
Models increasingly process:
- Images
- PDFs
- Screenshots
- Diagrams
- Video frames
While users see a single image, the model internally converts visual information into representations that consume significant computational resources.
A high resolution image can contribute between 2,000 and 5,000 token equivalents.
Example
A typical multimodal enterprise AI request might include:
- Image processing: ~3,500 token equivalents
- Retrieved enterprise context (RAG): ~4,500 tokens
- System instructions and guardrails: ~1,800 tokens
- User input: ~600 tokens
Before the model generates a single word of output, the request has already consumed approximately 10,400 tokens.
3. Long contexts and hidden compute costs
Modern LLMs continue to push context window limits.
Today it is common to see models supporting:
- 8K tokens
- 32K tokens
- 128K tokens and beyond
This advancement creates exciting possibilities. Users can provide larger documents, longer conversations, and richer enterprise context. However, larger context windows come with significant infrastructure consequences.
Attention scaling and compute complexity
Transformer architectures rely on self attention mechanisms. These mechanisms calculate relationships between tokens throughout the sequence. As token count grows, the amount of work required by the model grows dramatically.
Longer contexts frequently result in:
- Higher inference latency
- Lower hardware efficiency
- Reduced throughput
- Increased tail latency
Larger context windows are not free capacity. They are a compute trade off.
KV cache and throughput trade offs
During generation, transformer models maintain Key Value cache structures containing previously processed tokens.
As prompts grow larger, these caches consume more GPU memory.
This directly impacts:
- Available memory
- Batch efficiency
- Concurrent request capacity
- Throughput
As prompt sizes increase:
- KV memory expands
- Batch sizes shrink
- Throughput decreases
- Cost per response rises
All while using the same hardware.
Token variance and autoscaling challenges
Average token counts rarely reveal the true problem.
The real challenge is variance.
Example distribution
- P50 requests typically consume around 3,000 tokens.
- P95 requests can grow to approximately 12,000 tokens.
- P99 requests may exceed 20,000 tokens.
Most infrastructure platforms scale based on:
- Request count
- CPU utilization
- GPU utilization
Yet token spikes introduce hidden pressure through:
- Larger KV caches
- Reduced batching
- Slower inference
- Latency spikes
By the time infrastructure metrics react, user experience has already degraded.
Final takeaway
In AI infrastructure, token growth quietly reduces hardware efficiency long before systems visibly fail.
4. Token discipline: preventing AI infrastructure waste
Many organizations initially assume AI costs are primarily determined by model selection. In practice, architectural decisions often have a far greater impact. Small inefficiencies repeated millions of times can create substantial waste.
The result is:
- Higher latency
- Lower throughput
- Increased GPU expenditure
- Reduced platform stability
This is where Tokenomics becomes operational rather than theoretical.
Common production anti-patterns
Raw API and JSON injection
Teams frequently inject entire API responses directly into prompts.
- A raw JSON payload can easily consume between 4,000 and 8,000 tokens
- A summarized or selectively extracted version of the same information may require only 800 to 1,500 tokens
Optimization through summarization or selective extraction can reduce token volume by 60 to 80 percent.
Full conversation replay
Many chat systems resend the entire conversation with every request.
- A short chat session typically consumes between 2,000 and 4,000 tokens
- A long multi-turn conversation can grow to 12,000 to 18,000 tokens or more
Better alternatives include:
- Rolling summaries
- Memory compression
- Context pruning
Overloaded system prompts
Prompts often accumulate duplicate instructions over time. Reducing:
- 1,600 token prompts
- To 900 token prompts
Can save hundreds of millions of tokens each month in large deployments.
Unbounded output length
Organizations frequently allow responses to generate thousands of unnecessary tokens.
A 2,000 token answer may provide little additional value compared to a concise 500 token response.
The result is increased cost and increased latency.
Engineering strategies for token efficiency
- Prompt Compression helps reduce recurring token overhead by eliminating redundant instructions and simplifying system prompts
- Retrieval Summarization reduces the size of retrieved context, ensuring that only the most relevant information is included in the prompt
- Structured Data Minification removes unnecessary formatting, verbose field names, and redundant JSON structures, lowering token consumption
- Image Downscaling reduces the token equivalent footprint of multimodal inputs while maintaining sufficient quality for the task at hand
- Output Token Limits help control generation costs and prevent unnecessarily long responses that add little incremental value
- Token Percentile Monitoring provides visibility into P50, P95, and P99 token consumption patterns, enabling teams to identify inefficiencies before they become infrastructure bottlenecks
These optimizations may appear small individually. Combined across billions of tokens, they create meaningful infrastructure savings.
Token visibility equals infrastructure visibility
Mature AI platforms monitor token behavior with the same rigor that cloud teams monitor CPU, memory, and storage.
Common metrics include:
- P50 token usage
- P95 token usage
- P99 token usage
- Input versus output ratios
- Context growth trends
- Prompt inflation over time
Why?
Because token growth directly influences:
- Cost
- GPU utilization
- Throughput
- Autoscaling effectiveness
- User latency
Without token visibility, organizations are effectively operating blind.
Conclusion: why Tokenomics is becoming a foundational discipline
The cloud era introduced a new realization: infrastructure costs were no longer fixed, meaning every engineering decision had a financial consequence. That realization led to the rise of FinOps.
The AI era is creating a similar shift. Every prompt, every retrieved document, every image, every context window, and every generated response carries a measurable token cost.
Organizations that focus solely on request volume will miss the true drivers of AI economics.
The future belongs to teams that understand how tokens flow through their systems, how they influence performance, and how they impact cost at scale. Tokenomics is not simply about reducing expenses, it is about understanding the fundamental unit that powers modern AI. These systems do not scale on request volume alone, they scale on token discipline.