The Metrics That Matter
Not all metrics are created equal. Some are essential signals of agent health. Others are noise that creates dashboard clutter and alert fatigue. Here's how to tell the difference.
Tier 1: Critical Metrics (Monitor Always)
1. Availability (Uptime Percentage)
The most fundamental metric: is the agent available to accept and process tasks?
- Target: 99.5%+ for production agents
- Calculation: (Total time - Downtime) / Total time × 100
- Alert on: Any downtime exceeding 2 minutes
In ClawHQ, availability is tracked automatically and displayed as a percentage with a visual timeline showing outage events.
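If you want to sanity-check the number yourself, here's a minimal sketch of the calculation above, assuming outages are recorded as (start, end) timestamp pairs. The function name and data shape are illustrative, not ClawHQ's API:

```python
from datetime import datetime, timedelta

def availability_pct(window_start: datetime, window_end: datetime,
                     outages: list[tuple[datetime, datetime]]) -> float:
    """Uptime percentage over a window, given (start, end) outage intervals."""
    total = (window_end - window_start).total_seconds()
    downtime = sum(
        (min(end, window_end) - max(start, window_start)).total_seconds()
        for start, end in outages
        if end > window_start and start < window_end  # count only overlap with the window
    )
    return (total - downtime) / total * 100

# A 30-day window with one 45-minute outage comes out just under 99.9%.
start = datetime(2025, 1, 1)
end = start + timedelta(days=30)
outage = (datetime(2025, 1, 10, 3, 0), datetime(2025, 1, 10, 3, 45))
print(round(availability_pct(start, end, [outage]), 3))  # 99.896
```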
2. Task Success Rate
What percentage of tasks complete successfully?
- Target: Varies by task complexity (see FAQ above)
- Calculation: Successful tasks / Total tasks × 100
- Alert on: Success rate dropping more than 10 percentage points below its baseline
Track this per task type, not just per agent. An agent might have 90% overall success but 100% on simple tasks and 50% on complex ones — that 50% is the real signal.
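Here's a rough sketch of that per-type breakdown, assuming each task record carries a type and a success flag (field names are illustrative):

```python
from collections import defaultdict

def success_rates_by_type(tasks: list[dict]) -> dict[str, float]:
    """Success rate per task type; each task is {"type": str, "success": bool}."""
    totals, wins = defaultdict(int), defaultdict(int)
    for t in tasks:
        totals[t["type"]] += 1
        wins[t["type"]] += t["success"]
    return {k: wins[k] / totals[k] * 100 for k in totals}

# 90% overall, but the split tells the real story.
tasks = ([{"type": "simple", "success": True}] * 80
         + [{"type": "complex", "success": True}] * 10
         + [{"type": "complex", "success": False}] * 10)
print(success_rates_by_type(tasks))  # {'simple': 100.0, 'complex': 50.0}
```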
3. Response Latency (P50, P95, P99)
How long do tasks take? Median (P50) tells you the typical case. P95 and P99 tell you about the worst cases.
- Why P95/P99 matter: An agent with a 500ms median but a 30-second P99 has a hidden problem. One in 100 tasks is extremely slow, and if you only look at the median or the average, you'll miss it.
- Alert on: P95 exceeding 3x baseline, P99 exceeding 5x baseline
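To see why percentiles beat the mean here, here's a small sketch using a nearest-rank percentile on made-up latencies with a slow tail:

```python
import math

def pctl(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(samples)
    return ranked[math.ceil(p / 100 * len(ranked)) - 1]

# Mostly-fast latencies with a slow tail: the mean understates the problem.
latencies_ms = [500.0] * 940 + [2_000.0] * 49 + [30_000.0] * 11
print(round(sum(latencies_ms) / len(latencies_ms)))  # 898 ms mean -- looks fine
print(pctl(latencies_ms, 50), pctl(latencies_ms, 95), pctl(latencies_ms, 99))
# 500.0 2000.0 30000.0 -- the tail is visible at P99
```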
4. Error Rate by Category
Don't just track total errors — categorize them:
- Transient errors: Network timeouts, rate limits — usually self-resolving
- Agent errors: Invalid reasoning, tool misuse — indicates agent configuration issues
- External errors: API failures, data unavailability — external dependency problems
- Cost errors: Budget exceeded, token limit hit — resource constraints
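In practice this means mapping error codes to buckets and counting each bucket separately. Here's a sketch with made-up error labels (the codes below are placeholders, not a real taxonomy; substitute whatever your agent runtime emits):

```python
from collections import Counter

# Hypothetical error codes grouped into the categories above.
CATEGORIES = {
    "transient": {"timeout", "rate_limited", "connection_reset"},
    "agent": {"invalid_tool_args", "schema_violation", "reasoning_loop"},
    "external": {"upstream_5xx", "data_unavailable"},
    "cost": {"budget_exceeded", "token_limit"},
}

def categorize(error_code: str) -> str:
    for name, codes in CATEGORIES.items():
        if error_code in codes:
            return name
    return "unknown"

errors = ["timeout", "timeout", "invalid_tool_args", "budget_exceeded", "upstream_5xx"]
print(Counter(categorize(e) for e in errors))
# Counter({'transient': 2, 'agent': 1, 'cost': 1, 'external': 1})
```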
5. Token Consumption Rate
How many tokens is the agent consuming per unit of time and per task?
- Track separately: Input tokens (prompt + context) and output tokens (responses)
- Alert on: Sudden spikes in per-task token usage (could indicate a reasoning loop or prompt issue)
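A simple way to catch those spikes is to compare each task's token count against a recent baseline. Here's a sketch using a z-score-style threshold; the window size and threshold are arbitrary starting points, not recommendations:

```python
from statistics import mean, stdev

def token_spike(history: list[int], latest: int, sigmas: float = 3.0) -> bool:
    """Flag a per-task token count that sits well above the recent baseline."""
    if len(history) < 10:  # too little data to call anything a spike
        return False
    mu, sd = mean(history), stdev(history)
    return latest > mu + sigmas * max(sd, 1.0)

baseline = [1200, 1350, 1100, 1280, 1190, 1420, 1300, 1250, 1330, 1270]
print(token_spike(baseline, 1400))  # False -- within normal variation
print(token_spike(baseline, 9500))  # True  -- likely a reasoning loop or prompt regression
```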
Tier 2: Important Metrics (Monitor Daily)
6. Cost Per Task
The dollar cost of each completed task, including LLM tokens, tool API calls, and compute.
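A back-of-the-envelope version of that calculation, with purely illustrative prices (substitute your actual model and tool rates):

```python
def cost_per_task(input_tokens: int, output_tokens: int, tool_calls: int,
                  price_in_per_1k: float, price_out_per_1k: float,
                  tool_call_cost: float, compute_cost: float = 0.0) -> float:
    """Dollar cost of one task: LLM tokens + tool API calls + any compute overhead."""
    llm = input_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k
    return llm + tool_calls * tool_call_cost + compute_cost

# Example rates only -- not tied to any specific provider.
print(round(cost_per_task(12_000, 1_500, 3,
                          price_in_per_1k=0.003, price_out_per_1k=0.015,
                          tool_call_cost=0.002), 4))  # 0.0645
```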
7. Queue Depth
How many tasks are waiting to be processed? A growing queue means you need more capacity or your agents are slowing down.
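A crude way to turn that into an alert is to compare recent queue-depth samples against the start of the window. The threshold below is an arbitrary example:

```python
def queue_growing(depth_samples: list[int], min_growth: int = 20) -> bool:
    """Trend check: is the queue consistently deeper now than at the start of the window?"""
    if len(depth_samples) < 2:
        return False
    half = len(depth_samples) // 2
    older, recent = depth_samples[:half], depth_samples[half:]
    return (sum(recent) / len(recent)) - (sum(older) / len(older)) >= min_growth

print(queue_growing([5, 8, 6, 9, 40, 55, 70, 90]))     # True  -- backlog building up
print(queue_growing([12, 9, 15, 11, 13, 10, 14, 12]))  # False -- steady state
```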
8. Memory/Context Usage
How much of the agent's context window is being used? Agents running near context limits may start dropping important information.
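A quick check along those lines, assuming a 128k-token window purely for illustration:

```python
def context_usage(used_tokens: int, context_window: int,
                  warn_at: float = 0.8) -> tuple[float, bool]:
    """Fraction of the context window in use, and whether it crosses a warning threshold."""
    frac = used_tokens / context_window
    return frac, frac >= warn_at

print(context_usage(110_000, 128_000))  # (0.859375, True) -- getting close to the limit
```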
9. Skill Invocation Patterns
Which skills are being used most? Are any skills failing frequently? This helps you identify skill-specific issues and optimization opportunities.
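Here's a minimal sketch of that rollup, assuming each invocation record carries a skill name and an ok flag (the field and skill names are illustrative):

```python
from collections import Counter

def skill_stats(invocations: list[dict]) -> dict[str, dict]:
    """Per-skill call counts and failure rates; each record is {"skill": str, "ok": bool}."""
    calls, failures = Counter(), Counter()
    for inv in invocations:
        calls[inv["skill"]] += 1
        failures[inv["skill"]] += not inv["ok"]
    return {s: {"calls": calls[s], "failure_rate": failures[s] / calls[s]} for s in calls}

log = ([{"skill": "web_search", "ok": True}] * 40
       + [{"skill": "send_email", "ok": True}] * 8
       + [{"skill": "send_email", "ok": False}] * 2)
print(skill_stats(log))
# {'web_search': {'calls': 40, 'failure_rate': 0.0}, 'send_email': {'calls': 10, 'failure_rate': 0.2}}
```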
Tier 3: Informational Metrics (Review Weekly)
10. Task Type Distribution
What kinds of tasks are your agents handling? Useful for capacity planning and understanding workload patterns.
11. Time-of-Day Patterns
When are your agents busiest? This informs scheduling and scaling decisions.
12. Model Performance Comparison
If you use multiple models, compare their performance on similar tasks. This data drives model selection optimization.
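One way to keep that comparison like-for-like is to group results by both model and task type. A sketch with hypothetical field and model names:

```python
from collections import defaultdict

def compare_models(results: list[dict]) -> dict[tuple[str, str], float]:
    """Success rate keyed by (model, task_type), so models are compared on similar work."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r["model"], r["task_type"])
        totals[key] += 1
        wins[key] += r["success"]
    return {k: round(wins[k] / totals[k] * 100, 1) for k in totals}

results = ([{"model": "model-a", "task_type": "summarize", "success": True}] * 47
           + [{"model": "model-a", "task_type": "summarize", "success": False}] * 3
           + [{"model": "model-b", "task_type": "summarize", "success": True}] * 42
           + [{"model": "model-b", "task_type": "summarize", "success": False}] * 8)
print(compare_models(results))
# {('model-a', 'summarize'): 94.0, ('model-b', 'summarize'): 84.0}
```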
Metrics to Ignore (Usually)
Some metrics create noise without value:
- Raw request count: Without context about success/failure, raw counts aren't useful
- Average latency: Averages hide outliers. Use percentiles instead.
- CPU/memory utilization: Unless you're self-hosting, the LLM provider handles compute. Your agent process itself uses minimal resources.
Building Your Dashboard
A good monitoring dashboard shows the Tier 1 metrics at a glance, with drill-down capability for Tier 2 and Tier 3. ClawHQ provides this out of the box — a fleet overview with the critical metrics, and detailed views for each agent.
For more on setting up monitoring, see our complete monitoring guide.
Ready to manage your agent fleet? Start managing your fleet for free →