The Metrics That Matter
Not all metrics are created equal. Some are essential signals of agent health. Others are noise that creates dashboard clutter and alert fatigue. Here's how to tell the difference.
Tier 1: Critical Metrics (Monitor Always)
1. Availability (Uptime Percentage)
The most fundamental metric: is the agent available to accept and process tasks?
- Target: 99.5%+ for production agents
- Calculation: (Total time - Downtime) / Total time × 100
- Alert on: Any downtime exceeding 2 minutes
In ClawHQ, availability is tracked automatically and displayed as a percentage with a visual timeline showing outage events.
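If you want to sanity-check the number yourself, here's a minimal sketch of the calculation above, assuming outages are recorded as (start, end) timestamp pairs. The function name and data shape are illustrative, not ClawHQ's API:

```python
from datetime import datetime, timedelta

def availability_pct(window_start: datetime, window_end: datetime,
                     outages: list[tuple[datetime, datetime]]) -> float:
    """Uptime percentage over a window, given (start, end) outage intervals."""
    total = (window_end - window_start).total_seconds()
    downtime = sum(
        (min(end, window_end) - max(start, window_start)).total_seconds()
        for start, end in outages
        if end > window_start and start < window_end  # count only overlap with the window
    )
    return (total - downtime) / total * 100

# A 30-day window with one 45-minute outage comes out just under 99.9%.
start = datetime(2025, 1, 1)
end = start + timedelta(days=30)
outage = (datetime(2025, 1, 10, 3, 0), datetime(2025, 1, 10, 3, 45))
print(round(availability_pct(start, end, [outage]), 3))  # 99.896
```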
2. Task Success Rate
What percentage of tasks complete successfully?
- Target: Varies by task complexity (see FAQ above)
- Calculation: Successful tasks / Total tasks × 100
- Alert on: Success rate dropping more than 10 percentage points below its baseline
Track this per task type, not just per agent. An agent might have 90% overall success but 100% on simple tasks and 50% on complex ones — that 50% is the real signal.
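Here's a rough sketch of that per-type breakdown, assuming each task record carries a type and a success flag (field names are illustrative):

```python
from collections import defaultdict

def success_rates_by_type(tasks: list[dict]) -> dict[str, float]:
    """Success rate per task type; each task is {"type": str, "success": bool}."""
    totals, wins = defaultdict(int), defaultdict(int)
    for t in tasks:
        totals[t["type"]] += 1
        wins[t["type"]] += t["success"]
    return {k: wins[k] / totals[k] * 100 for k in totals}

# 90% overall, but the split tells the real story.
tasks = ([{"type": "simple", "success": True}] * 80
         + [{"type": "complex", "success": True}] * 10
         + [{"type": "complex", "success": False}] * 10)
print(success_rates_by_type(tasks))  # {'simple': 100.0, 'complex': 50.0}
```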
3. Response Latency (P50, P95, P99)
How long do tasks take? Median (P50) tells you the typical case. P95 and P99 tell you about the worst cases.
- Why P95/P99 matter: An agent with a 500ms median but a 30-second P99 has a hidden problem. One in 100 tasks is extremely slow, and if you only look at the median or the average, you'll miss it.
- Alert on: P95 exceeding 3x baseline, P99 exceeding 5x baseline
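To see why percentiles beat the mean here, here's a small sketch using a nearest-rank percentile on made-up latencies with a slow tail:

```python
import math

def pctl(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(samples)
    return ranked[math.ceil(p / 100 * len(ranked)) - 1]

# Mostly-fast latencies with a slow tail: the mean understates the problem.
latencies_ms = [500.0] * 940 + [2_000.0] * 49 + [30_000.0] * 11
print(round(sum(latencies_ms) / len(latencies_ms)))  # 898 ms mean -- looks fine
print(pctl(latencies_ms, 50), pctl(latencies_ms, 95), pctl(latencies_ms, 99))
# 500.0 2000.0 30000.0 -- the tail is visible at P99
```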
4. Error Rate by Category
Don't just track total errors — categorize them:
- Transient errors: Network timeouts, rate limits — usually self-resolving
- Agent errors: Invalid reasoning, tool misuse — indicates agent configuration issues
- External errors: API failures, data unavailability — external dependency problems
- Cost errors: Budget exceeded, token limit hit — resource constraints
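In practice this means mapping error codes to buckets and counting each bucket separately. Here's a sketch with made-up error labels (the codes below are placeholders, not a real taxonomy; substitute whatever your agent runtime emits):

```python
from collections import Counter

# Hypothetical error codes grouped into the categories above.
CATEGORIES = {
    "transient": {"timeout", "rate_limited", "connection_reset"},
    "agent": {"invalid_tool_args", "schema_violation", "reasoning_loop"},
    "external": {"upstream_5xx", "data_unavailable"},
    "cost": {"budget_exceeded", "token_limit"},
}

def categorize(error_code: str) -> str:
    for name, codes in CATEGORIES.items():
        if error_code in codes:
            return name
    return "unknown"

errors = ["timeout", "timeout", "invalid_tool_args", "budget_exceeded", "upstream_5xx"]
print(Counter(categorize(e) for e in errors))
# Counter({'transient': 2, 'agent': 1, 'cost': 1, 'external': 1})
```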
5. Token Consumption Rate
How many tokens is the agent consuming per unit of time and per task?
- Track separately: Input tokens (prompt + context) and output tokens (responses)
- Alert on: Sudden spikes in per-task token usage (could indicate a reasoning loop or prompt issue)
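A simple way to catch those spikes is to compare each task's token count against a recent baseline. Here's a sketch using a z-score-style threshold; the window size and threshold are arbitrary starting points, not recommendations:

```python
from statistics import mean, stdev

def token_spike(history: list[int], latest: int, sigmas: float = 3.0) -> bool:
    """Flag a per-task token count that sits well above the recent baseline."""
    if len(history) < 10:  # too little data to call anything a spike
        return False
    mu, sd = mean(history), stdev(history)
    return latest > mu + sigmas * max(sd, 1.0)

baseline = [1200, 1350, 1100, 1280, 1190, 1420, 1300, 1250, 1330, 1270]
print(token_spike(baseline, 1400))  # False -- within normal variation
print(token_spike(baseline, 9500))  # True  -- likely a reasoning loop or prompt regression
```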
Tier 2: Important Metrics (Monitor Daily)
6. Cost Per Task
The dollar cost of each completed task, including LLM tokens, tool API calls, and compute.
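A back-of-the-envelope version of that calculation, with purely illustrative prices (substitute your actual model and tool rates):

```python
def cost_per_task(input_tokens: int, output_tokens: int, tool_calls: int,
                  price_in_per_1k: float, price_out_per_1k: float,
                  tool_call_cost: float, compute_cost: float = 0.0) -> float:
    """Dollar cost of one task: LLM tokens + tool API calls + any compute overhead."""
    llm = input_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k
    return llm + tool_calls * tool_call_cost + compute_cost

# Example rates only -- not tied to any specific provider.
print(round(cost_per_task(12_000, 1_500, 3,
                          price_in_per_1k=0.003, price_out_per_1k=0.015,
                          tool_call_cost=0.002), 4))  # 0.0645
```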
7. Queue Depth
How many tasks are waiting to be processed? A growing queue means you need more capacity or your agents are slowing down.
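A crude way to turn that into an alert is to compare recent queue-depth samples against the start of the window. The threshold below is an arbitrary example:

```python
def queue_growing(depth_samples: list[int], min_growth: int = 20) -> bool:
    """Trend check: is the queue consistently deeper now than at the start of the window?"""
    if len(depth_samples) < 2:
        return False
    half = len(depth_samples) // 2
    older, recent = depth_samples[:half], depth_samples[half:]
    return (sum(recent) / len(recent)) - (sum(older) / len(older)) >= min_growth

print(queue_growing([5, 8, 6, 9, 40, 55, 70, 90]))     # True  -- backlog building up
print(queue_growing([12, 9, 15, 11, 13, 10, 14, 12]))  # False -- steady state
```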
8. Memory/Context Usage
How much of the agent's context window is being used? Agents running near context limits may start dropping important information.
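A quick check along those lines, assuming a 128k-token window purely for illustration:

```python
def context_usage(used_tokens: int, context_window: int,
                  warn_at: float = 0.8) -> tuple[float, bool]:
    """Fraction of the context window in use, and whether it crosses a warning threshold."""
    frac = used_tokens / context_window
    return frac, frac >= warn_at

print(context_usage(110_000, 128_000))  # (0.859375, True) -- getting close to the limit
```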
9. Skill Invocation Patterns
Which skills are being used most? Are any skills failing frequently? This helps you identify skill-specific issues and optimization opportunities.
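Here's a minimal sketch of that rollup, assuming each invocation record carries a skill name and an ok flag (the field and skill names are illustrative):

```python
from collections import Counter

def skill_stats(invocations: list[dict]) -> dict[str, dict]:
    """Per-skill call counts and failure rates; each record is {"skill": str, "ok": bool}."""
    calls, failures = Counter(), Counter()
    for inv in invocations:
        calls[inv["skill"]] += 1
        failures[inv["skill"]] += not inv["ok"]
    return {s: {"calls": calls[s], "failure_rate": failures[s] / calls[s]} for s in calls}

log = ([{"skill": "web_search", "ok": True}] * 40
       + [{"skill": "send_email", "ok": True}] * 8
       + [{"skill": "send_email", "ok": False}] * 2)
print(skill_stats(log))
# {'web_search': {'calls': 40, 'failure_rate': 0.0}, 'send_email': {'calls': 10, 'failure_rate': 0.2}}
```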
Tier 3: Informational Metrics (Review Weekly)
10. Task Type Distribution
What kinds of tasks are your agents handling? Useful for capacity planning and understanding workload patterns.
11. Time-of-Day Patterns
When are your agents busiest? This informs scheduling and scaling decisions.
12. Model Performance Comparison
If you use multiple models, compare their performance on similar tasks. This data drives model selection optimization.
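One way to keep that comparison like-for-like is to group results by both model and task type. A sketch with hypothetical field and model names:

```python
from collections import defaultdict

def compare_models(results: list[dict]) -> dict[tuple[str, str], float]:
    """Success rate keyed by (model, task_type), so models are compared on similar work."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r["model"], r["task_type"])
        totals[key] += 1
        wins[key] += r["success"]
    return {k: round(wins[k] / totals[k] * 100, 1) for k in totals}

results = ([{"model": "model-a", "task_type": "summarize", "success": True}] * 47
           + [{"model": "model-a", "task_type": "summarize", "success": False}] * 3
           + [{"model": "model-b", "task_type": "summarize", "success": True}] * 42
           + [{"model": "model-b", "task_type": "summarize", "success": False}] * 8)
print(compare_models(results))
# {('model-a', 'summarize'): 94.0, ('model-b', 'summarize'): 84.0}
```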
Metrics to Ignore (Usually)
Some metrics create noise without value:
- Raw request count: Without context about success/failure, raw counts aren't useful
- Average latency: Averages hide outliers. Use percentiles instead.
- CPU/memory utilization: Unless you're self-hosting, the LLM provider handles compute. Your agent process itself uses minimal resources.
Building Your Dashboard
A good monitoring dashboard shows the Tier 1 metrics at a glance, with drill-down capability for Tier 2 and Tier 3. ClawHQ provides this out of the box — a fleet overview with the critical metrics, and detailed views for each agent.
For more on setting up monitoring, see our complete monitoring guide.
Ready to manage your agent fleet? Start managing your fleet for free →