Glossary

Inference Costs

The computational expenses incurred when running trained AI models to generate predictions or outputs, including GPU compute, token processing, and API charges.

Inference costs are the expenses incurred when running trained AI models to generate predictions, completions, or other outputs. Unlike training costs (which are one-time or periodic), inference costs are ongoing and scale directly with usage. For most organizations using AI, inference costs represent the largest and fastest-growing component of their AI budget.

Understanding Inference

In machine learning, "inference" refers to the process of using a trained model to generate outputs from new inputs. When you send a prompt to GPT-4o and receive a response, that's inference. When an AI agent makes 20 LLM calls to complete a task, each call is an inference operation.

Inference is fundamentally different from training:

  • Training: Teaching the model by processing massive datasets (one-time, very expensive)
  • Inference: Using the trained model to generate outputs (ongoing, scales with usage)

For most organizations, training costs are zero (they use pre-trained models via API), making inference costs their entire AI spend.

Components of Inference Costs

1. Compute Costs

The GPU (or TPU) time required to run the model:

  • Prefill: Processing input tokens through the model (parallel, relatively fast)
  • Decode: Generating output tokens one at a time (sequential, the bottleneck)
  • Memory: Storing the model weights and KV-cache in GPU memory

The cost per token depends on:

  • Model size (7B parameters vs 70B vs 175B+)
  • GPU type (A100, H100, H200)
  • Batch efficiency (processing multiple requests together)
  • Quantization level (full precision vs INT8 vs INT4)
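To make these factors concrete, here's a rough back-of-envelope sketch in Python of how model size and quantization level translate into GPU memory for weights alone. The function name and figures are illustrative; KV-cache and activations add more memory on top:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate GPU memory needed just to hold model weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# Rough weight footprints at common precisions:
for params in (7, 70, 175):
    for precision, nbytes in (("FP16", 2), ("INT8", 1), ("INT4", 0.5)):
        print(f"{params}B @ {precision}: ~{weight_memory_gb(params, nbytes):.0f} GB")
```

A 7B model at FP16 needs roughly 14 GB for weights, while INT4 quantization cuts that to ~3.5 GB, which is why quantization directly lowers the GPU class (and cost) required to serve a model.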

2. API Markup

When using hosted APIs (OpenAI, Anthropic, Google), inference costs include:

  • Raw compute costs
  • Provider margin (estimated 50-80% for frontier models)
  • Infrastructure overhead (networking, storage, redundancy)
  • Support and reliability guarantees

3. Networking and Latency Costs

For self-hosted inference:

  • Load balancers and API gateways
  • GPU cluster networking
  • Data transfer between regions
  • CDN for edge deployment

4. Infrastructure Overhead

Supporting infrastructure for inference:

  • Model serving frameworks (vLLM, TGI, Triton)
  • Request queuing and batching
  • Auto-scaling and GPU provisioning
  • Monitoring and logging

API Inference vs Self-Hosted Inference

API Inference (OpenAI, Anthropic, Google)

Pros: No infrastructure management, instant scaling, latest models, pay-per-token pricing

Cons: Higher per-token cost, less control, vendor lock-in, rate limits

Best for: Most organizations, especially those under $50K/month in AI spend

Self-Hosted Inference

Pros: Lower per-token cost at scale, full control, data privacy, no rate limits

Cons: Significant infrastructure expertise needed, GPU procurement challenges, manual model updates

Best for: Organizations with over $100K/month in AI spend, strict data requirements, or specialized models

Cost Comparison

For a typical workload using a ~70B parameter model:

| Metric | API (Anthropic Sonnet) | Self-Hosted (Llama 70B on H100) |
|---|---|---|
| Input cost / 1M tokens | $3.00 | ~$0.30-0.50 |
| Output cost / 1M tokens | $15.00 | ~$0.80-1.50 |
| Minimum monthly cost | $0 (pay-per-use) | ~$15,000 (GPU rental) |
| Break-even | | ~$20K-30K/month API spend |

Self-hosting becomes cost-effective at significant scale, but the break-even point is higher than most teams expect due to infrastructure overhead.
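The break-even dynamic can be sketched with a small Python comparison. The per-token figures are mid-range assumptions taken from the table above, and both helpers are hypothetical simplifications that ignore engineering and operations headcount:

```python
def monthly_api_cost(input_tokens_m: float, output_tokens_m: float,
                     in_price: float = 3.00, out_price: float = 15.00) -> float:
    """API cost per month; prices are $ per 1M tokens (Sonnet-class defaults)."""
    return input_tokens_m * in_price + output_tokens_m * out_price

def monthly_self_hosted_cost(input_tokens_m: float, output_tokens_m: float,
                             in_price: float = 0.40, out_price: float = 1.15,
                             fixed: float = 15_000) -> float:
    """Self-hosted cost: per-token compute plus a fixed GPU rental/ops floor."""
    return fixed + input_tokens_m * in_price + output_tokens_m * out_price

# Example: 5,000M input + 1,000M output tokens per month
api = monthly_api_cost(5_000, 1_000)             # $30,000
hosted = monthly_self_hosted_cost(5_000, 1_000)  # $15,000 + $2,000 + $1,150 = $18,150
```

At low volume the fixed $15K floor dominates and the API wins; at this workload the lines have already crossed, which is consistent with the ~$20K-30K/month break-even above.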

Inference Cost Trends

Prices Are Falling Rapidly

Inference costs have been declining ~10x every 18 months due to:

  • More efficient model architectures (Mixture of Experts, Mamba)
  • Better hardware (H100 → H200 → B200)
  • Improved inference engines (FlashAttention, PagedAttention, speculative decoding)
  • Competition driving down API margins

But Usage Is Growing Faster

Despite falling per-token costs, total inference spending is increasing because:

  • AI agents consume 10-100x more tokens than chatbots
  • More features are being AI-powered
  • Models are being used for tasks previously done by humans
  • AI applications are reaching larger user bases

The Inference Cost Paradox

Lower costs → more usage → higher total spend. This "Jevons paradox" of AI means that organizations can't rely on falling prices alone; they need active cost management.

Optimizing Inference Costs

Model-Level Optimizations

  • Quantization: Reduce model precision (FP16 → INT8 → INT4) for 2-4x cost reduction
  • Distillation: Train smaller models that mimic larger ones
  • Mixture of Experts: Only activate relevant model parameters per token

Serving-Level Optimizations

  • Dynamic batching: Process multiple requests together for GPU efficiency
  • Speculative decoding: Use a small model to draft responses, large model to verify
  • KV-cache management: Efficient memory allocation for cached context
  • Prompt caching: Reuse computed prompt representations
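The idea behind prompt caching can be illustrated with a toy sketch. A real serving engine caches KV tensors on the GPU for repeated prompt prefixes; here the cached "work" is stubbed out and the `PromptCache` class is purely illustrative:

```python
import hashlib

class PromptCache:
    """Toy prompt cache: reuse precomputed work for repeated prompt prefixes."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prefix: str) -> str:
        # Hash the prefix so the key size is independent of prompt length.
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get_or_compute(self, prefix: str, compute):
        key = self._key(prefix)
        if key in self._store:
            self.hits += 1          # prefill work skipped entirely
        else:
            self.misses += 1
            self._store[key] = compute(prefix)  # expensive prefill happens once
        return self._store[key]

cache = PromptCache()
system_prompt = "You are a helpful assistant..."
for _ in range(3):
    cache.get_or_compute(system_prompt, lambda p: f"stubbed kv-state ({len(p)} chars)")
# After 3 calls with the same prefix: 1 miss, 2 hits
```

Since system prompts and few-shot examples are identical across requests, this pattern is why providers can discount cached input tokens heavily.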

Application-Level Optimizations

  • Model routing: Use the cheapest model that meets quality requirements
  • Token budgets: Prevent wasteful token consumption
  • Response caching: Cache and reuse identical responses
  • Prompt optimization: Reduce token counts without losing quality
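As a sketch of model routing, the snippet below picks the cheapest model whose capability tier meets a task's requirement. The model names, tiers, and price points are hypothetical:

```python
# Hypothetical catalog: price is $ per 1M output tokens, tier is a capability rank.
MODELS = [
    {"name": "small",  "tier": 1, "price": 0.60},
    {"name": "medium", "tier": 2, "price": 3.00},
    {"name": "large",  "tier": 3, "price": 15.00},
]

def route(required_tier: int) -> str:
    """Return the cheapest model whose capability tier meets the requirement."""
    eligible = [m for m in MODELS if m["tier"] >= required_tier]
    return min(eligible, key=lambda m: m["price"])["name"]

route(1)  # -> "small": simple extraction goes to the cheap model
route(3)  # -> "large": complex reasoning justifies the frontier price
```

The hard part in practice is classifying each request's required tier; the routing itself is just a cost-minimizing lookup like this.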

🦞 How ClawHQ Helps

ClawHQ tracks your inference costs in real time across all providers and models. See cost breakdowns by input vs output tokens, cached vs non-cached, and batch vs real-time. Identify the most expensive inference operations, track cost-per-token trends over time, and get optimization recommendations. Whether you use APIs or self-hosted models, ClawHQ gives you complete visibility into your inference spending.

Frequently Asked Questions

What are inference costs in AI?

Inference costs are the expenses of running trained AI models to generate outputs. When you send a prompt to an LLM and get a response, that's inference. Costs include compute (GPU time), API charges (provider markup), and infrastructure. For most organizations, inference is their entire AI cost.

How are inference costs calculated?

For API users, inference costs = input tokens × input price + output tokens × output price. For self-hosted inference, costs = GPU rental/ownership + infrastructure + operations. API pricing is simpler but more expensive per token; self-hosting has lower per-token costs but higher fixed costs.
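The API formula can be written directly as a small, hypothetical helper (prices in dollars per 1M tokens):

```python
def api_inference_cost(input_tokens: int, output_tokens: int,
                       input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost = input tokens x input price + output tokens x output price."""
    return ((input_tokens / 1e6) * input_price_per_m
            + (output_tokens / 1e6) * output_price_per_m)

# 2M input + 0.5M output tokens at $3 / $15 per 1M tokens:
api_inference_cost(2_000_000, 500_000, 3.00, 15.00)  # -> 13.5
```

Note that output tokens at these prices cost 5x more than input tokens, so verbose responses dominate the bill even when prompts are long.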

Are inference costs going down?

Per-token costs are falling ~10x every 18 months due to better hardware, efficient architectures, and competition. However, total inference spending is rising because AI agents and new features consume far more tokens. Active cost management is still essential.

When should I consider self-hosted inference?

Consider self-hosting when API costs exceed $20,000-30,000/month, you have strict data privacy requirements, or you need models not available via API. The break-even point is higher than expected due to infrastructure complexity. Most organizations under $50K/month are better served by APIs.

Take Control of Your AI Costs

Take control of your AI agent fleet. Monitor, manage, and optimize — all from one command center.

Start Free Trial →