AI adoption often starts with a sentence that sounds reasonable in a planning meeting:
“We should be using AI here.”
Sometimes that instinct is right. A model really can reduce manual review, speed up research, route support tickets, summarize messy information, or help teams ship better software.
But intuition is a weak operating system for AI programs.
The hard part is not launching the first pilot. The hard part is knowing whether the system is actually improving the business, whether the improvement is worth the cost, and whether new failure modes are quietly building up around it.
That is the difference between using AI and using AI correctly.
TL;DR
- AI adoption without measurement turns into activity theater: pilots, dashboards, policies, and demos that do not prove value.
- The first measurement mistake is usually skipping the pre-AI baseline. If you do not know the old cycle time, error rate, cost, or satisfaction score, you cannot prove the new one is better.
- Useful AI metrics usually fall into three layers: business outcomes, workflow quality, and model/system behavior.
- The Rumsfeld matrix is a helpful lens: known knowns can be instrumented, known unknowns can be monitored, unknown knowns can be surfaced through domain experts, and unknown unknowns require blast-radius control.
- The practical path is simple but not easy: define the outcome, establish baselines, run controlled experiments, evaluate quality continuously, and revisit the metric when reality teaches you something new.
What You Will Learn Here
- Why “everyone else is adopting AI” is not a strategy
- How to apply knowns and unknowns to AI measurement
- Which metrics matter before, during, and after deployment
- How to connect AI behavior to business value without fooling yourself
- How to design AI systems that can survive surprises
- An 8-step roadmap for moving from intuition to measurable AI ROI
The Intuition Trap
AI adoption has never been easier to justify emotionally.
There are better models, better tooling, better developer workflows, and better executive narratives. The pressure to do something is real. In McKinsey’s 2024 global survey, 72% of respondents said their organizations use AI in at least one business function. BCG, meanwhile, reported in October 2024 that 74% of companies struggle to achieve and scale value from AI.
That gap is the real story.
The problem is not that teams are uninterested in AI. The problem is that many teams still measure adoption instead of value.
Common weak signals include:
- number of pilots launched
- number of AI policies drafted
- number of models deployed
- number of employees with access to tools
- number of prompts or conversations created
Those metrics are not useless. They can tell you whether a program exists. They cannot tell you whether it works.
A support chatbot with high usage can still frustrate customers. A code assistant can create more review burden than speed. A document summarizer can save time while introducing subtle compliance risk. A sales copilot can generate more messages without improving conversion.
So the first shift is philosophical:
AI success is not adoption. AI success is measured improvement in a real workflow.
Start with the Baseline
Measurement starts before the model shows up.
If you want AI to improve a process, capture the current state first:
| Workflow Question | Baseline Metric |
|---|---|
| How long does this take today? | Cycle time, time-to-resolution, review time |
| How often does it fail? | Error rate, escalation rate, rework rate |
| What does it cost? | Cost per task, cost per case, labor hours |
| How good is the output? | Rubric score, QA score, customer satisfaction |
| How much human supervision is required? | Review minutes, override rate, approval rate |
Without this baseline, teams end up making claims like “AI made us faster” without knowing faster than what.
That is especially risky because AI changes the shape of work. It may reduce drafting time while increasing verification time. It may lower first-response latency while raising escalation volume. It may let one team do more while making another team absorb more exceptions.
The useful question is not:
Did the model produce something?
It is:
Did the workflow become better after all costs, reviews, failures, and second-order effects are included?
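A minimal sketch of that comparison, assuming you can export per-task records with minutes spent drafting, reviewing, and reworking. The `TaskRecord` fields and every number below are hypothetical, chosen only to show how a large drafting gain can shrink once the rest of the workflow is counted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    """One completed task; all fields are hypothetical and measured in minutes."""
    draft_min: float
    review_min: float
    rework_min: float

    @property
    def total_min(self) -> float:
        return self.draft_min + self.review_min + self.rework_min


def mean_total_minutes(records: list[TaskRecord]) -> float:
    """Average end-to-end handling time, not just the drafting step."""
    return mean(r.total_min for r in records)


# Illustrative numbers only: AI cuts drafting time but adds review and rework.
baseline = [TaskRecord(20, 5, 2), TaskRecord(24, 6, 0)]
ai_assisted = [TaskRecord(4, 14, 6), TaskRecord(6, 12, 8)]

draft_change = mean(r.draft_min for r in ai_assisted) - mean(r.draft_min for r in baseline)
total_change = mean_total_minutes(ai_assisted) - mean_total_minutes(baseline)

# prints: drafting change: -17.0 min, end-to-end change: -3.5 min
print(f"drafting change: {draft_change:+.1f} min, end-to-end change: {total_change:+.1f} min")
```

In this made-up sample the drafting step looks 17 minutes faster, but the workflow as a whole improves by only 3.5 minutes. Without the baseline records, only the first number would be visible.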
The Rumsfeld Matrix for AI Work
Donald Rumsfeld’s “known knowns / known unknowns / unknown unknowns” framing is useful because AI systems do not fail only in ways you can list ahead of time.
Adding a fourth quadrant, unknown knowns, gives a matrix that every AI initiative sits inside:
| | What We Know | What We Do Not Know |
|---|---|---|
| Aware | Known Knowns: accuracy, latency, cost, throughput | Known Unknowns: drift patterns, long-term user behavior, edge-case performance, explainability gaps |
| Unaware | Unknown Knowns: tacit expert rules, hidden data assets, overlooked patterns, existing signals | Unknown Unknowns: emergent behaviors, second-order incentives, valid-looking bad outputs, unexpected dependencies |
Known Knowns
These are the comfortable metrics.
You can instrument accuracy, latency, cost, throughput, uptime, task completion, and schema validity. They belong on dashboards. They can be tracked over time. They can be used for release gates.
For example:
- p95 response latency under 2 seconds
- hallucination rate below 2% on a labeled dataset
- cost per resolved ticket below a target threshold
- JSON schema validity above 99.5%
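Thresholds like these can be turned into a release gate. The sketch below is one possible shape, not a prescribed one: the metric names are hypothetical, the limits mirror the examples above, and the cost gate is left out because its target depends on your own baseline.

```python
# Hypothetical metric names; limits mirror the example thresholds above.
GATES = {
    "p95_latency_s": ("max", 2.0),        # must stay under 2 seconds
    "hallucination_rate": ("max", 0.02),  # must stay below 2%
    "schema_validity": ("min", 0.995),    # must stay above 99.5%
}


def failed_gates(metrics: dict[str, float]) -> list[str]:
    """Return the names of gates the candidate release does not meet."""
    failures = []
    for name, (kind, limit) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)  # a missing measurement counts as a failure
        elif kind == "max" and value >= limit:
            failures.append(name)
        elif kind == "min" and value <= limit:
            failures.append(name)
    return failures


# Illustrative candidate: fast and well-formed, but too many hallucinations.
candidate = {"p95_latency_s": 1.4, "hallucination_rate": 0.031, "schema_validity": 0.998}
print(failed_gates(candidate))  # ['hallucination_rate']
```

Treating a missing metric as a failure is a deliberate choice here: a gate that passes silently when nothing was measured gives false comfort.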
Known Unknowns
These are risks you can name but cannot fully predict.
You know drift can happen. You know users may change behavior once the tool exists. You know edge cases will appear in production. You know the model may perform differently on a new customer segment.
Known unknowns call for monitoring, experiments, and periodic review.
For example:
- monitor quality by customer segment
- compare AI-assisted and human-only workflows
- run regression evals after prompt or model changes
- sample production outputs for human review
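As a rough illustration of the first item, here is a sketch that breaks human review results down by customer segment. The segment names and the sample are invented; the point is only that an acceptable overall number can hide a weak segment.

```python
from collections import defaultdict


def pass_rate_by_segment(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of sampled outputs that a human reviewer accepted, per segment."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for segment, accepted in reviews:
        totals[segment] += 1
        passed[segment] += int(accepted)
    return {seg: passed[seg] / totals[seg] for seg in totals}


# Hypothetical review sample: (customer segment, reviewer accepted the output?)
sample = [
    ("smb", True), ("smb", True), ("smb", True), ("smb", False),
    ("enterprise", True), ("enterprise", False),
]

rates = pass_rate_by_segment(sample)
overall = sum(accepted for _, accepted in sample) / len(sample)
print(f"overall: {overall:.2f}", rates)
```

With real data the per-segment samples need to be large enough to mean something; six rows are only enough to show the mechanics.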
Unknown Knowns
These are the things the organization knows but has not converted into system knowledge.
A support lead may know which refund cases always need escalation. A data analyst may know a field is unreliable after a migration. A compliance person may know which phrasing creates risk. A senior engineer may know which tool failures are harmless and which ones are dangerous.
AI programs fail here when they treat the dataset as the whole truth.
The fix is not just more data. It is structured discovery:
- interview domain experts before building
- run pre-mortems with operators
- review historical incidents and edge cases
- turn tacit rules into rubrics, tests, and guardrails
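One way to make the last item concrete is to write each expert rule down as a small, named check that runs regardless of what the model says. Everything in this sketch is hypothetical, including the `Ticket` fields and both rules; real rules would come out of the interviews and incident reviews.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Ticket:
    category: str
    refund_amount: float
    prior_refunds: int


@dataclass(frozen=True)
class ExpertRule:
    name: str
    source: str  # who supplied the rule, so it can be revisited with them later
    must_escalate: Callable[[Ticket], bool]


# Both rules are invented for illustration.
RULES = [
    ExpertRule("large_refund", "support lead",
               lambda t: t.category == "refund" and t.refund_amount > 500),
    ExpertRule("repeat_refund", "support lead",
               lambda t: t.category == "refund" and t.prior_refunds >= 2),
]


def escalation_reasons(ticket: Ticket) -> list[str]:
    """Names of expert rules that require a human, whatever the model proposed."""
    return [rule.name for rule in RULES if rule.must_escalate(ticket)]


print(escalation_reasons(Ticket("refund", 800.0, 0)))  # ['large_refund']
```

Recording the source next to each rule keeps the knowledge traceable: when a rule starts firing too often, there is someone to ask whether it still applies.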
Unknown Unknowns
These are the surprises.
An agent generates valid-looking SQL that damages a downstream workflow. A recommendation system changes user behavior in a way that breaks the original metric. A summarizer reduces reading time but causes reviewers to miss context. A tool-calling agent learns a path that technically works but bypasses an important human checkpoint.
You cannot write a complete checklist for unknown unknowns.
You can design for impact control:
- feature flags
- canary releases
- human approval on high-risk actions
- rate limits and scoped permissions
- rollback paths
- anomaly detection
- incident reviews that question assumptions, not just bugs
The point is not to predict every failure. The point is to make surprise survivable.
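Several of those controls can live in one small gate in front of every agent action. The following is a sketch under stated assumptions: the action names and the policy are made up, and a real system would load them from configuration rather than hard-code them.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical action names and policy; anything unlisted is denied by default.
ALLOWED_ACTIONS = {"read_ticket", "draft_reply", "issue_refund"}
HIGH_RISK_ACTIONS = {"issue_refund"}


def gate_action(action: str, agent_enabled: bool, approved_by_human: bool = False) -> Decision:
    """Decide whether an agent-proposed action may run."""
    if not agent_enabled:  # the feature flag doubles as a kill switch
        return Decision.DENY
    if action not in ALLOWED_ACTIONS:  # scoped permissions
        return Decision.DENY
    if action in HIGH_RISK_ACTIONS and not approved_by_human:
        return Decision.NEEDS_APPROVAL  # human checkpoint on high-risk actions
    return Decision.ALLOW


print(gate_action("issue_refund", agent_enabled=True))  # Decision.NEEDS_APPROVAL
```

Denying by default matters more than the specific lists. An action nobody anticipated gets blocked and noticed instead of executed.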
The Three Metric Layers That Matter
AI teams often mix together metrics that answer different questions. A cleaner approach is to separate them into three layers.
1. Business Outcome Metrics
These answer: Did the business get better?
Examples:
- revenue per account
- conversion rate
- customer retention
- support cost per resolved case
- time-to-resolution
- employee hours saved after review time is included
- risk incidents avoided
This is the layer executives usually care about. It is also the layer most likely to be polluted by wishful thinking if the workflow baseline is missing.
2. Workflow Quality Metrics
These answer: Did the process improve?
Examples:
- task completion rate
- first-pass acceptance rate
- rework rate
- human override rate
- escalation accuracy
- approval latency
- reviewer burden
- user satisfaction after AI interaction
This layer is where product managers, operations leads, and engineering teams can usually have the most practical conversations.
For an AI support agent, the business metric might be “support cost per resolved case.” The workflow metrics might be “correct routing rate,” “cases solved without escalation,” and “average review minutes per response.”
3. Model and System Metrics
These answer: Did the AI system behave correctly?
Examples:
- eval score
- hallucination rate
- retrieval precision
- tool-call accuracy
- schema validity
- latency
- inference cost
- refusal correctness
- safety policy compliance
This is the layer most AI teams start with because it is closest to the implementation. It matters, but it is not enough.
A model can score well on an eval and still fail to create business value. A workflow can create value even if the model is imperfect, as long as the surrounding system catches the right failures.
The job is to connect all three layers.
Model behavior -> workflow quality -> business outcome
Tool-call accuracy -> correct ticket routing -> lower time-to-resolution
Retrieval quality -> fewer bad answers -> higher customer satisfaction
Summary quality -> faster review -> lower cost per case
Impact Chaining: The Antidote to Vague ROI
One practical technique is impact chaining.
Instead of saying “AI will improve support,” map the causal path:
| AI Capability | Workflow Change | Business Outcome |
|---|---|---|
| Classify inbound tickets | Faster routing to the right queue | Lower time-to-resolution |
| Draft support replies | Lower writing time per case | Lower cost per resolved case |
| Retrieve policy snippets | Fewer incorrect answers | Higher CSAT, lower compliance risk |
| Summarize case history | Faster human review | Higher agent throughput |
Each link needs a metric.
If the AI drafts replies, measure drafting time. But also measure review time, edit distance, approval rate, error rate, and customer satisfaction. If review time doubles, the apparent productivity gain may disappear.
This is where many AI ROI stories fall apart. They measure the step that got faster and ignore the steps that got heavier.
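To keep every link honest, the chain itself can be written down as data with a before and after value per link. This sketch uses invented numbers for the "draft support replies" row; the `Link` type and its fields are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    metric: str
    before: float
    after: float
    lower_is_better: bool = True

    @property
    def improved(self) -> bool:
        if self.lower_is_better:
            return self.after < self.before
        return self.after > self.before


# Illustrative values only, not measurements.
chain = [
    Link("AI drafts reply", "drafting minutes per case", 10.0, 2.0),
    Link("Agent reviews draft", "review minutes per case", 3.0, 6.0),
    Link("Customer outcome", "CSAT (1-5)", 4.2, 4.2, lower_is_better=False),
]

# Links that did not move in the right direction deserve a closer look.
not_improved = [link.name for link in chain if not link.improved]
print(not_improved)  # ['Agent reviews draft', 'Customer outcome']
```

Here the drafting link improved sharply, review time doubled, and the customer outcome did not move. A story that reports only the first link would overstate the result.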
Controlled Experiments Beat Hope
The cleanest way to know whether AI is helping is to compare workflows.
For a real rollout, that may look like:
- Team A uses the current process; Team B uses the AI-assisted process.
- A random sample of tickets gets AI triage; the rest use normal triage.
- One model/prompt variant handles a test cohort; another handles a control cohort.
- Human reviewers grade outputs without knowing which system produced them.
Measure the full workflow:
- completion rate
- time-to-completion
- error rate
- rework rate
- escalation rate
- cost per completed task
- human quality score
- user satisfaction
This is less glamorous than a demo, but it is much harder to fool.
It also makes model selection more honest. The best model is not always the one with the highest benchmark score. It is the one that improves the target workflow within the cost, latency, quality, and risk constraints that matter.
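A small sketch of the mechanics, using only the standard library. It assumes two things that are not in the text above: tickets are split between arms by hashing a ticket ID (a simple stand-in for proper randomization), and the arms are compared with a two-sided permutation test on mean time-to-resolution. The sample values are illustrative.

```python
import hashlib
import random


def assign_arm(ticket_id: str, experiment: str = "triage-v1") -> str:
    """Stable pseudo-random assignment: the same ticket always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{ticket_id}".encode()).digest()
    return "ai" if digest[0] % 2 == 0 else "control"


def permutation_p_value(control, treatment, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    n_c, n_t = len(control), len(treatment)
    observed = abs(sum(treatment) / n_t - sum(control) / n_c)
    pooled = list(control) + list(treatment)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel tasks at random
        diff = abs(sum(pooled[n_c:]) / n_t - sum(pooled[:n_c]) / n_c)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_perm + 1)


# Illustrative time-to-resolution samples in minutes, not real measurements.
control_minutes = [42, 38, 51, 47, 40, 45]
ai_minutes = [31, 29, 36, 33, 30, 35]
p = permutation_p_value(control_minutes, ai_minutes)
print(f"p-value: {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise. Whether the difference is worth the cost is still a judgment about the workflow and the guardrail metrics.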
Evals Are the Bridge Between Product and Engineering
For AI applications, evals are not just technical tests. They are the contract between product intent and system behavior.
A good eval suite turns vague expectations into inspectable examples:
- “The support agent should ask for missing account details before taking action.”
- “The coding agent should not modify files outside the requested scope.”
- “The research assistant should distinguish direct evidence from inference.”
- “The sales assistant should not invent pricing terms.”
Those expectations become datasets, rubrics, deterministic checks, and judge prompts.
For agent systems, I usually want a mix of:
- deterministic tests for schemas, permissions, and tool boundaries
- dataset evals for repeated examples
- rubric-based LLM judges for semantic quality
- trace-aware grading for tool choice, retrieval, and intermediate decisions
- human calibration on sampled outputs
This matters because production AI quality is rarely one-dimensional. A response can be polite but wrong. Correct but unsupported. Fast but risky. Cheap but incomplete. Helpful in one customer segment and harmful in another.
Evals give the team a shared language for those tradeoffs.
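As an example of the deterministic end of that mix, the expectation "the coding agent should not modify files outside the requested scope" can be checked without any model in the loop. This is a minimal sketch: the run names and file paths are hypothetical, and in practice the list of changed files would come from the agent's trace or a diff.

```python
from pathlib import PurePosixPath


def out_of_scope_files(changed: list[str], allowed_dir: str) -> list[str]:
    """Return changed paths that are not inside allowed_dir."""
    root = PurePosixPath(allowed_dir)
    flagged = []
    for path in changed:
        p = PurePosixPath(path)
        # Treat ".." segments as out of scope rather than trying to resolve them.
        if ".." in p.parts or root not in p.parents:
            flagged.append(path)
    return flagged


# Hypothetical agent runs: task name -> files the agent changed.
runs = {
    "rename-helper": ["src/billing/invoice.py"],
    "fix-typo": ["src/billing/README.md", "src/auth/login.py"],
}

results = {task: not out_of_scope_files(files, "src/billing") for task, files in runs.items()}
pass_rate = sum(results.values()) / len(results)
print(results, pass_rate)
```

Checks like this are cheap and exact, which is why they pair well with rubric-based judges: the judge handles semantic quality, and the deterministic test holds the boundaries that must never move.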
Double-Loop Learning: Fix the Metric, Not Just the Model
Single-loop learning asks:
Are we hitting the target?
Double-loop learning asks:
Is this still the right target?
AI teams need double-loop learning because AI systems change the environment around themselves. Once a tool exists, users adapt. Teams route more work through it. Managers reset expectations. Edge cases that were rare may become common because the AI makes a new workflow possible.
That means the original metric can become stale.
For example:
- You optimized for response speed, then discovered trust dropped because users wanted more explanation.
- You optimized for automation rate, then discovered the hardest cases were being automated too aggressively.
- You optimized for low escalation, then discovered the system was hiding uncertainty instead of surfacing it.
- You optimized for token cost, then discovered shorter outputs increased human review time.
The right response is not always “improve the model.”
Sometimes it is:
- change the metric
- split the metric by risk tier
- add a human approval step
- lower automation on certain cases
- rewrite the rubric
- retire the use case
That last option matters. A mature AI program should be able to kill a bad AI use case without treating it as a failure of ambition.
An 8-Step Roadmap for AI Measurement Maturity
Here is the practical sequence I would use with most teams.
1. Name the Workflow
Do not start with “AI strategy.” Start with a workflow.
Bad:
- “Use AI in customer success.”
Better:
- “Reduce average time-to-resolution for tier-1 billing tickets without increasing incorrect refunds.”
2. Establish the Baseline
Capture the current process before AI changes it:
- cycle time
- cost
- error rate
- rework
- escalation
- quality score
- user satisfaction
3. Define the Outcome Metric
Pick the business result that matters.
Examples:
- reduce cost per resolved case by 20%
- increase first-contact resolution by 10%
- reduce analyst research time by 30%
- improve lead qualification accuracy by 15%
4. Define Workflow Guardrails
Add metrics that prevent local optimization from creating broader harm.
Examples:
- no increase in complaint rate
- no increase in compliance escalations
- human override rate below a target threshold
- quality rubric score above release threshold
5. Build the Eval Set
Create examples that represent reality:
- common cases
- high-risk cases
- historical failures
- edge cases from domain experts
- adversarial or ambiguous inputs
This becomes the regression suite for prompts, models, tools, and policies.
6. Run Controlled Experiments
Compare AI-assisted workflows against the current process.
Do not only measure model output. Measure the full path from task intake to accepted result.
7. Monitor Production Behavior
After launch, track:
- quality drift
- cost drift
- latency drift
- user behavior changes
- human override patterns
- incidents and near misses
Sample production outputs for human review. Feed the interesting failures back into the eval set.
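The sampling and drift checks can start very simply. The sketch below assumes a fixed review sampling rate and a relative tolerance on the mean of a metric with a positive baseline, such as cost or latency. Both are simplifications chosen for illustration, and the function names are hypothetical.

```python
import random
from statistics import mean


def sample_for_review(output_ids: list[str], rate: float, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of production outputs for human review."""
    if not output_ids:
        return []
    k = min(len(output_ids), max(1, round(len(output_ids) * rate)))
    return random.Random(seed).sample(output_ids, k)


def drifted(baseline: list[float], recent: list[float], tolerance: float) -> bool:
    """Flag drift when the recent mean moves more than `tolerance` (relative) from baseline.

    Assumes the baseline mean is positive, as with cost or latency.
    """
    base = mean(baseline)
    return abs(mean(recent) - base) / base > tolerance


# Illustrative values: cost per task in dollars, last month versus this week.
print(drifted([1.0, 1.0], [1.3, 1.3], tolerance=0.2))  # True
print(len(sample_for_review([f"out-{i}" for i in range(100)], rate=0.05)))  # 5
```

A mean-based check misses changes in the tail, so latency in particular may deserve a percentile comparison instead. The reviewed samples that fail are the ones worth adding to the eval set.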
8. Review the Metric Itself
Every few weeks, ask:
- Is this metric still measuring the real goal?
- Are users gaming it?
- Did the workflow change?
- Are new risks appearing?
- Should some cases be automated less?
That last review is what separates measurement from dashboard decoration.
The Real Test
Knowing you are using AI correctly does not mean predicting every outcome.
It means your system can answer three questions honestly:
- What was the workflow like before AI?
- What changed after AI, including costs and review burden?
- What are we learning from failures that we did not know to expect?
The organizations that do this well will not be the ones with the most demos. They will be the ones with the tightest feedback loops between product intent, system behavior, human judgment, and business value.
That is the move:
From intuition to metrics.
From demos to evidence.
From “we are using AI” to “we know where AI is working, where it is not, and what we are changing next.”
Sources
- The state of AI in early 2024 - McKinsey, May 2024
- AI Adoption in 2024: 74% of Companies Struggle to Achieve and Scale Value - Boston Consulting Group, October 24, 2024
- Gartner Survey Finds Generative AI is Now the Most Frequently Deployed AI Solution in Organizations - Gartner, May 7, 2024
- AI ROI: How to measure the true value of AI - CIO, featuring Adrian Dunkley of StarApple AI
- Linking AI Adoption to Productivity: 5 Proven Metrics and a Worklytics ROI Framework - Worklytics
- The MITRE AI Maturity Model and Organizational Assessment Tool Guide - MITRE