Inference costs are the expenses incurred when running trained AI models to generate predictions, completions, or other outputs. Unlike training costs (which are one-time or periodic), inference costs are ongoing and scale directly with usage. For most organizations using AI, inference costs represent the largest and fastest-growing component of their AI budget.
Understanding Inference
In machine learning, "inference" refers to the process of using a trained model to generate outputs from new inputs. When you send a prompt to GPT-4o and receive a response, that's inference. When an AI agent makes 20 LLM calls to complete a task, each call is an inference operation.
Inference is fundamentally different from training: training is a large, infrequent investment in creating the model, while inference is a recurring cost paid on every request.
For most organizations, training costs are zero (they use pre-trained models via API), making inference costs effectively their entire AI spend.
Components of Inference Costs
1. Compute Costs
The GPU (or TPU) time required to run the model is the dominant cost. The cost per token depends on factors such as model size, context length, batching efficiency, and hardware utilization.
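As a rough illustration of how these factors translate into dollars, per-token compute cost can be estimated from a GPU's hourly rental price and its serving throughput. All figures below are assumptions for the sketch, not vendor quotes:

```python
# Rough estimate of self-hosted compute cost per token.
# All numbers are illustrative assumptions, not vendor quotes.

def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Cost to generate 1M tokens on a GPU billed hourly."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Assumed: an H100 rented at $4/hour, sustaining ~3,000 tokens/s
# across batched requests.
print(round(cost_per_million_tokens(4.0, 3000), 2))  # ≈ $0.37 per 1M tokens
```

Note how sensitive the result is to throughput: halving batching efficiency doubles the per-token cost, which is why utilization dominates self-hosted economics.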
2. API Markup
When using hosted APIs (OpenAI, Anthropic, Google), the per-token price includes the provider's underlying compute plus a margin that covers their infrastructure, model development, and support.
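At the API level, cost accounting is simple token arithmetic. The sketch below computes the cost of one call and of a multi-call agent task; the prices and token counts are assumptions chosen to match the Sonnet-like figures used later in this article:

```python
# Cost of a single API call at per-token prices.
# Prices and token counts are illustrative assumptions.

def api_call_cost(input_tokens: int, output_tokens: int,
                  input_price: float = 3.00,    # $ per 1M input tokens
                  output_price: float = 15.00   # $ per 1M output tokens
                  ) -> float:
    return (input_tokens * input_price +
            output_tokens * output_price) / 1_000_000

# An agent task making 20 calls, averaging 2,000 input and
# 500 output tokens per call (assumed):
per_call = api_call_cost(2000, 500)
print(f"per call: ${per_call:.4f}, per task: ${20 * per_call:.2f}")
```

The asymmetry matters: output tokens typically cost around 5x more than input tokens, so verbose completions dominate the bill even when prompts are long.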
3. Networking and Latency Costs
For self-hosted inference, data-transfer (egress) charges and the engineering effort needed to keep latency acceptable across regions add to the bill.
4. Infrastructure Overhead
Supporting infrastructure for inference includes load balancers, monitoring and logging, caching layers, and orchestration, all of which add overhead beyond raw GPU time.
API Inference vs Self-Hosted Inference
API Inference (OpenAI, Anthropic, Google)
Pros: No infrastructure management, instant scaling, latest models, pay-per-token
Cons: Higher per-token cost, less control, vendor lock-in, rate limits
Best for: Most organizations, especially < $50K/month in AI spend
Self-Hosted Inference
Pros: Lower per-token cost at scale, full control, data privacy, no rate limits
Cons: Significant infrastructure expertise needed, GPU procurement challenges, model updates are manual
Best for: Organizations with > $100K/month in AI spend, strict data requirements, or specialized models
Cost Comparison
For a typical workload using a ~70B parameter model:
| Metric | API (Anthropic Sonnet) | Self-Hosted (Llama 70B on H100) |
|---|---|---|
| Input cost/1M tokens | $3.00 | ~$0.30-0.50 |
| Output cost/1M tokens | $15.00 | ~$0.80-1.50 |
| Minimum monthly cost | $0 (pay-per-use) | ~$15,000 (GPU rental) |
| Break-even | — | ~$20K-30K/month API spend |
Self-hosting becomes cost-effective at significant scale, but the break-even point is higher than most teams expect due to infrastructure overhead.
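A rough break-even check using the table's figures makes this concrete. The traffic mix and the midpoint self-hosted rates below are assumptions:

```python
# Break-even sketch between API and self-hosted inference,
# using the illustrative numbers from the table above.

def monthly_api_cost(input_m: float, output_m: float) -> float:
    """input_m / output_m: millions of tokens per month."""
    return input_m * 3.00 + output_m * 15.00

def monthly_selfhost_cost(input_m: float, output_m: float) -> float:
    fixed = 15_000.0  # assumed GPU rental floor from the table
    # Midpoints of the table's per-token ranges (assumed):
    return fixed + input_m * 0.40 + output_m * 1.15

# Assumed traffic mix: 4 input tokens per output token.
for output_m in (500, 1000, 1500):
    input_m = 4 * output_m
    api = monthly_api_cost(input_m, output_m)
    sh = monthly_selfhost_cost(input_m, output_m)
    print(f"API ${api:,.0f} vs self-hosted ${sh:,.0f}")
```

Under these assumptions, self-hosting only wins once API spend clears roughly the $15K fixed floor plus overhead, consistent with the ~$20K-30K break-even in the table.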
Inference Cost Trends
Prices Are Falling Rapidly
Inference costs have been declining roughly 10x every 18 months, driven by better hardware, more efficient model architectures, techniques like quantization and distillation, and competition among providers.
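That rate compounds quickly. A short projection, taking the 10x/18-month figure at face value and a $15/1M starting price as an assumption:

```python
# Project per-token price under a 10x decline every 18 months.

def projected_price(price_today: float, months: float) -> float:
    return price_today * 10 ** (-months / 18)

# Starting from an assumed $15 per 1M output tokens:
for months in (0, 18, 36):
    print(months, round(projected_price(15.0, months), 3))
# → 15.0 today, 1.5 after 18 months, 0.15 after 36 months
```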
But Usage Is Growing Faster
Despite falling per-token costs, total inference spending is increasing because usage is growing even faster: more applications adopt LLMs, agentic workflows multiply the calls per task, and longer contexts multiply the tokens per call.
The Inference Cost Paradox
Lower costs → more usage → higher total spend. This "Jevons Paradox" of AI means that organizations can't rely on falling prices alone — they need active cost management.
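A toy model makes the paradox explicit: if token volume grows faster than prices fall, total spend still rises. The 20x/18-month volume growth rate below is an assumption for illustration:

```python
# Toy Jevons-style dynamic: price falls 10x per 18 months,
# but token volume grows 20x per 18 months (assumed),
# so total spend keeps climbing.

def total_spend(month: float,
                price0: float = 15.0,    # $ per 1M tokens today
                volume0: float = 100.0   # 1M-token units/month today
                ) -> float:
    price = price0 * 10 ** (-month / 18)
    volume = volume0 * 20 ** (month / 18)
    return price * volume

for m in (0, 18, 36):
    print(m, round(total_spend(m)))
# Spend doubles every 18 months even as unit price drops 10x.
```

Under these assumed rates, spend doubles each 18 months, which is why cost management has to target usage and efficiency, not just wait for cheaper tokens.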