From Intuition to Metrics: How to Measure AI Success

AI adoption often starts with a sentence that sounds reasonable in a planning meeting:

“We should be using AI here.”

Sometimes that instinct is right. A model really can reduce manual review, speed up research, route support tickets, summarize messy information, or help teams ship better software.

But intuition is a weak operating system for AI programs.

The hard part is not launching the first pilot. The hard part is knowing whether the system is actually improving the business, whether the improvement is worth the cost, and whether new failure modes are quietly building up around it.

That is the difference between using AI and using AI correctly.

TL;DR

AI adoption without measurement turns into activity theater: pilots, dashboards, policies, and demos that do not prove value.
The first measurement mistake is usually skipping the pre-AI baseline. If you do not know the old cycle time, error rate, cost, or satisfaction score, you cannot prove the new one is better.
Useful AI metrics usually fall into three layers: business outcomes, workflow quality, and model/system behavior.
The Rumsfeld matrix is a helpful lens: known knowns can be instrumented, known unknowns can be monitored, unknown knowns can be surfaced through domain experts, and unknown unknowns require blast-radius control.
The practical path is simple but not easy: define the outcome, establish baselines, run controlled experiments, evaluate quality continuously, and revisit the metric when reality teaches you something new.

What You Will Learn Here

Why “everyone else is adopting AI” is not a strategy
How to apply knowns and unknowns to AI measurement
Which metrics matter before, during, and after deployment
How to connect AI behavior to business value without fooling yourself
How to design AI systems that can survive surprises
An 8-step roadmap for moving from intuition to measurable AI ROI

The Intuition Trap

AI adoption has never been easier to justify emotionally.

There are better models, better tooling, better developer workflows, and better executive narratives. The pressure to do something is real. In McKinsey’s 2024 global survey, adoption jumped: 72% of respondents said their organizations had adopted AI in at least one business function. BCG, meanwhile, reported that 74% of companies struggle to achieve and scale value from AI.

That gap is the real story.

The problem is not that teams are uninterested in AI. The problem is that many teams still measure adoption instead of value.

Common weak signals include:

number of pilots launched
number of AI policies drafted
number of models deployed
number of employees with access to tools
number of prompts or conversations created

Those metrics are not useless. They can tell you whether a program exists. They cannot tell you whether it works.

A support chatbot with high usage can still frustrate customers. A code assistant can add more review burden than the coding time it saves. A document summarizer can save time while introducing subtle compliance risk. A sales copilot can generate more messages without improving conversion.

So the first shift is philosophical:

AI success is not adoption. AI success is measured improvement in a real workflow.

Start with the Baseline

Measurement starts before the model shows up.

If you want AI to improve a process, capture the current state first:

Workflow Question	Baseline Metric
How long does this take today?	Cycle time, time-to-resolution, review time
How often does it fail?	Error rate, escalation rate, rework rate
What does it cost?	Cost per task, cost per case, labor hours
How good is the output?	Rubric score, QA score, customer satisfaction
How much human supervision is required?	Review minutes, override rate, approval rate

Without this baseline, teams end up making claims like “AI made us faster” without knowing faster than what.
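
As a minimal sketch of capturing that baseline, assume you can export pre-AI task records with fields like the ones below. The field names and numbers are hypothetical and only illustrate how the metrics in the table reduce to a few aggregates.

```python
from statistics import median

# Hypothetical pre-AI task records; field names and values are illustrative.
pre_ai_tasks = [
    {"minutes": 18, "had_error": False, "escalated": False, "cost": 9.00},
    {"minutes": 25, "had_error": True, "escalated": True, "cost": 14.50},
    {"minutes": 12, "had_error": False, "escalated": False, "cost": 6.00},
    {"minutes": 30, "had_error": False, "escalated": True, "cost": 15.00},
]


def summarize_baseline(records):
    """Aggregate per-task records into the baseline metrics to compare against."""
    if not records:
        raise ValueError("need at least one record to establish a baseline")
    n = len(records)
    return {
        "median_cycle_min": median(r["minutes"] for r in records),
        "error_rate": sum(r["had_error"] for r in records) / n,
        "escalation_rate": sum(r["escalated"] for r in records) / n,
        "cost_per_task": sum(r["cost"] for r in records) / n,
    }


baseline = summarize_baseline(pre_ai_tasks)
print(baseline)
```

Running the same function on post-AI records gives a like-for-like comparison, which is the whole point of recording the numbers first.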

That is especially risky because AI changes the shape of work. It may reduce drafting time while increasing verification time. It may lower first-response latency while raising escalation volume. It may let one team do more while making another team absorb more exceptions.

The useful question is not:

Did the model produce something?

It is:

Did the workflow become better after all costs, reviews, failures, and second-order effects are included?

The Rumsfeld Matrix for AI Work

Donald Rumsfeld’s “known knowns / known unknowns / unknown unknowns” framing is useful because AI systems do not only fail in ways you can list ahead of time. Adding the quadrant he left out, unknown knowns, completes the matrix.

Every AI initiative carries items in all four categories:

                         What We Know          What We Do Not Know

Aware                    Known Knowns          Known Unknowns
                         Accuracy              Drift patterns
                         Latency               Long-term user behavior
                         Cost                  Edge-case performance
                         Throughput            Explainability gaps

Unaware                  Unknown Knowns        Unknown Unknowns
                         Tacit expert rules    Emergent behaviors
                         Hidden data assets    Second-order incentives
                         Overlooked patterns   Valid-looking bad outputs
                         Existing signals      Unexpected dependencies

Known Knowns

These are the comfortable metrics.

You can instrument accuracy, latency, cost, throughput, uptime, task completion, and schema validity. They belong on dashboards. They can be tracked over time. They can be used for release gates.

For example:

p95 response latency under 2 seconds
hallucination rate below 2% on a labeled dataset
cost per resolved ticket below a target threshold
JSON schema validity above 99.5%
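
The example thresholds above can be wired into a release gate roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the hallucination labels are assumed to come from a human-labeled eval set, and "schema validity" is approximated by checking that the output parses as a JSON object with the required keys, which is a stand-in for full schema validation.

```python
import json
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def is_valid_output(text, required_keys):
    """True if text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(required_keys).issubset(obj)


def release_gate(latencies_s, hallucination_labels, outputs, required_keys):
    """Return each gate's pass/fail result; release only if all are True."""
    valid = sum(is_valid_output(o, required_keys) for o in outputs)
    return {
        "p95_latency_under_2s": p95(latencies_s) < 2.0,
        "hallucination_rate_below_2pct":
            sum(hallucination_labels) / len(hallucination_labels) < 0.02,
        "schema_validity_above_99_5pct": valid / len(outputs) > 0.995,
    }


# Illustrative inputs only.
results = release_gate(
    latencies_s=[0.8, 1.1, 1.4, 0.9, 2.6],
    hallucination_labels=[False] * 99 + [True],
    outputs=['{"queue": "billing", "priority": 2}', "not json"],
    required_keys={"queue", "priority"},
)
print(results)
print("release" if all(results.values()) else "hold")
```

Returning one named result per gate, not a single boolean, makes it clear which threshold blocked a release.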

Known Unknowns

These are risks you can name but cannot fully predict.

You know drift can happen. You know users may change behavior once the tool exists. You know edge cases will appear in production. You know the model may perform differently on a new customer segment.

Known unknowns call for monitoring, experiments, and periodic review.

For example:

monitor quality by customer segment
compare AI-assisted and human-only workflows
run regression evals after prompt or model changes
sample production outputs for human review
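
Monitoring quality by segment can start very small. The sketch below assumes sampled production outputs have already been human-reviewed and reduced to hypothetical (segment, passed) pairs; the segment names and the 10-point gap are illustrative choices, not recommendations.

```python
from collections import defaultdict


def pass_rate_by_segment(reviews):
    """reviews: iterable of (segment, passed) pairs from sampled human review."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [passed, total]
    for segment, passed in reviews:
        counts[segment][0] += int(passed)
        counts[segment][1] += 1
    return {seg: p / t for seg, (p, t) in counts.items()}


def lagging_segments(reviews, max_gap=0.10):
    """Segments whose pass rate trails the overall rate by more than max_gap."""
    reviews = list(reviews)
    overall = sum(int(passed) for _, passed in reviews) / len(reviews)
    rates = pass_rate_by_segment(reviews)
    return sorted(seg for seg, rate in rates.items() if overall - rate > max_gap)


# Illustrative review sample: the aggregate looks fine, one segment does not.
sampled = (
    [("smb", True)] * 18 + [("smb", False)] * 2
    + [("enterprise", True)] * 9 + [("enterprise", False)] * 1
    + [("new_region", True)] * 6 + [("new_region", False)] * 4
)
print(pass_rate_by_segment(sampled))
print(lagging_segments(sampled))
```

With small samples per segment the rates are noisy, so a flagged segment is a prompt for more review, not a verdict.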

Unknown Knowns

These are the things the organization knows but has not converted into system knowledge.

A support lead may know which refund cases always need escalation. A data analyst may know a field is unreliable after a migration. A compliance person may know which phrasing creates risk. A senior engineer may know which tool failures are harmless and which ones are dangerous.

AI programs fail here when they treat the dataset as the whole truth.

The fix is not just more data. It is structured discovery:

interview domain experts before building
run pre-mortems with operators
review historical incidents and edge cases
turn tacit rules into rubrics, tests, and guardrails

Unknown Unknowns

These are the surprises.

An agent generates valid-looking SQL that damages a downstream workflow. A recommendation system changes user behavior in a way that breaks the original metric. A summarizer reduces reading time but causes reviewers to miss context. A tool-calling agent learns a path that technically works but bypasses an important human checkpoint.

You cannot write a complete checklist for unknown unknowns.

You can, however, design for blast-radius control:

feature flags
canary releases
human approval on high-risk actions
rate limits and scoped permissions
rollback paths
anomaly detection
incident reviews that question assumptions, not just bugs

The point is not to predict every failure. The point is to make surprise survivable.
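
Several of the controls listed above (a feature flag, scoped permissions, human approval on high-risk actions) can sit in one small authorization step in front of an agent's tool calls. This is a sketch only: the action names, the `authorize` function, and the `Decision` type are hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass

# Hypothetical action names; a real agent would have its own tool list.
ALLOWED_ACTIONS = {"lookup_order", "draft_reply", "issue_refund"}
HIGH_RISK_ACTIONS = {"issue_refund"}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize(action, approved_by=None, feature_enabled=True):
    """Decide whether an agent-proposed action may run."""
    if not feature_enabled:
        # Feature flag doubles as the rollback path: switch it off, nothing runs.
        return Decision(False, "feature flag is off")
    if action not in ALLOWED_ACTIONS:
        return Decision(False, "action is outside the agent's scoped permissions")
    if action in HIGH_RISK_ACTIONS and approved_by is None:
        return Decision(False, "high-risk action needs human approval")
    return Decision(True, "ok")


print(authorize("draft_reply"))
print(authorize("issue_refund"))
print(authorize("issue_refund", approved_by="support_lead"))
print(authorize("delete_customer"))
```

The default is denial: anything not explicitly allowed is rejected, so a surprising new action fails closed.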

The Three Metric Layers That Matter

AI teams often mix together metrics that answer different questions. A cleaner model is to separate them into three layers.

1. Business Outcome Metrics

These answer: Did the business get better?

Examples:

revenue per account
conversion rate
customer retention
support cost per resolved case
time-to-resolution
employee hours saved after review time is included
risk incidents avoided

This is the layer executives usually care about. It is also the layer most likely to be polluted by wishful thinking if the workflow baseline is missing.

2. Workflow Quality Metrics

These answer: Did the process improve?

Examples:

task completion rate
first-pass acceptance rate
rework rate
human override rate
escalation accuracy
approval latency
reviewer burden
user satisfaction after AI interaction

This layer is where product managers, operations leads, and engineering teams can usually have the most practical conversations.

For an AI support agent, the business metric might be “support cost per resolved case.” The workflow metrics might be “correct routing rate,” “cases solved without escalation,” and “average review minutes per response.”

3. Model and System Metrics

These answer: Did the AI system behave correctly?

Examples:

eval score
hallucination rate
retrieval precision
tool-call accuracy
schema validity
latency
inference cost
refusal correctness
safety policy compliance

This is the layer most AI teams start with because it is closest to the implementation. It matters, but it is not enough.

A model can score well on an eval and still fail to create business value. A workflow can create value even if the model is imperfect, as long as the surrounding system catches the right failures.

The job is to connect all three layers.

Model behavior -> workflow quality -> business outcome

Tool-call accuracy -> correct ticket routing -> lower time-to-resolution
Retrieval quality -> fewer bad answers -> higher customer satisfaction
Summary quality -> faster review -> lower cost per case

Impact Chaining: The Antidote to Vague ROI

One practical technique is impact chaining.

Instead of saying “AI will improve support,” map the causal path:

AI Capability	Workflow Change	Business Outcome
Classify inbound tickets	Faster routing to the right queue	Lower time-to-resolution
Draft support replies	Lower writing time per case	Lower cost per resolved case
Retrieve policy snippets	Fewer incorrect answers	Higher CSAT, lower compliance risk
Summarize case history	Faster human review	Higher agent throughput

Each link needs a metric.

If the AI drafts replies, measure drafting time. But also measure review time, edit distance, approval rate, error rate, and customer satisfaction. If review time doubles, the apparent productivity gain may disappear.

This is where many AI ROI stories fall apart. They measure the step that got faster and ignore the steps that got heavier.
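
A small worked example shows how this happens. The numbers are hypothetical: drafting gets much faster, review time doubles, and rework grows.

```python
from dataclasses import dataclass


@dataclass
class StepMinutes:
    """Average minutes per case for each step (hypothetical numbers)."""
    drafting: float
    review: float
    rework: float

    def total(self):
        return self.drafting + self.review + self.rework


def relative_change(before, after):
    """Relative change from before to after; negative means it went down."""
    return (after - before) / before


before = StepMinutes(drafting=12.0, review=3.0, rework=1.0)
after = StepMinutes(drafting=4.0, review=6.0, rework=3.0)

print(f"drafting: {relative_change(before.drafting, after.drafting):+.0%}")
print(f"whole workflow: {relative_change(before.total(), after.total()):+.0%}")
```

In this made-up case drafting time falls by about two thirds, but the whole workflow improves by less than a fifth. A report that quoted only the drafting number would overstate the gain.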

Controlled Experiments Beat Hope

The cleanest way to know whether AI is helping is to compare workflows.

For a real rollout, that may look like:

Team A uses the current process; Team B uses the AI-assisted process.
A random sample of tickets gets AI triage; the rest use normal triage.
One model/prompt variant handles a test cohort; another handles a control cohort.
Human reviewers grade outputs without knowing which system produced them.

Measure the full workflow:

completion rate
time-to-completion
error rate
rework rate
escalation rate
cost per completed task
human quality score
user satisfaction

This is less glamorous than a demo, but it is much harder to fool.
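
One way to sketch the comparison is random assignment followed by a permutation test on the difference in means. The permutation test is a technique chosen for this illustration, not something the rollout patterns above prescribe, and the function names and timing data are hypothetical.

```python
import random


def assign_arm(ticket_ids, seed=0):
    """Randomly assign each ticket to 'control' or 'ai' (seeded for repeatability)."""
    rng = random.Random(seed)
    return {tid: rng.choice(("control", "ai")) for tid in ticket_ids}


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(control, treated, n_iter=5000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(control) + list(treated)
    k = len(control)
    hits = 0
    for _ in range(n_iter):
        # Shuffle the labels: if the arms do not differ, any split is as likely.
        rng.shuffle(pooled)
        if abs(mean(pooled[k:]) - mean(pooled[:k])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_iter + 1)


# Hypothetical time-to-completion in minutes for each arm.
control_minutes = [30, 32, 31, 29, 33, 30, 31, 32]
ai_minutes = [22, 21, 23, 20, 22, 24, 21, 23]

print(assign_arm(["T-1", "T-2", "T-3", "T-4"]))
print("mean difference:", mean(control_minutes) - mean(ai_minutes))
print("p-value:", permutation_p_value(control_minutes, ai_minutes))
```

A small p-value only says the difference is unlikely to be noise. Whether the difference is worth the cost is still a business judgment, and the same comparison should be repeated for error rate, rework, and the other workflow metrics.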

It also makes model selection more honest. The best model is not always the one with the highest benchmark score. It is the one that improves the target workflow within the cost, latency, quality, and risk constraints that matter.

Evals Are the Bridge Between Product and Engineering

For AI applications, evals are not just technical tests. They are the contract between product intent and system behavior.

A good eval suite turns vague expectations into inspectable examples:

“The support agent should ask for missing account details before taking action.”
“The coding agent should not modify files outside the requested scope.”
“The research assistant should distinguish direct evidence from inference.”
“The sales assistant should not invent pricing terms.”

Those expectations become datasets, rubrics, deterministic checks, and judge prompts.

For agent systems, I usually want a mix of:

deterministic tests for schemas, permissions, and tool boundaries
dataset evals for repeated examples
rubric-based LLM judges for semantic quality
trace-aware grading for tool choice, retrieval, and intermediate decisions
human calibration on sampled outputs

This matters because production AI quality is rarely one-dimensional. A response can be polite but wrong. Correct but unsupported. Fast but risky. Cheap but incomplete. Helpful in one customer segment and harmful in another.

Evals give the team a shared language for those tradeoffs.
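
To make that concrete, here is a sketch of two deterministic checks and a tiny suite runner. It assumes a hypothetical trace format (a list of step names where actions are prefixed with `action:`) and hypothetical file paths; a real suite would add dataset evals, LLM judges, and human calibration alongside these.

```python
from pathlib import PurePosixPath


def files_outside_scope(changed_files, allowed_dir):
    """Deterministic check: list changed files that fall outside allowed_dir."""
    allowed = PurePosixPath(allowed_dir)
    outside = []
    for name in changed_files:
        path = PurePosixPath(name)
        # Treat any '..' segment as out of scope instead of trying to resolve it.
        if ".." in path.parts or allowed not in path.parents:
            outside.append(name)
    return outside


def asks_before_acting(trace):
    """Trace check: no 'action:*' step may precede 'confirm_account'."""
    confirmed = False
    for step in trace:
        if step == "confirm_account":
            confirmed = True
        elif step.startswith("action:") and not confirmed:
            return False
    return True


def run_suite(cases):
    """cases: list of (name, check, args) tuples; returns (pass_rate, failures)."""
    failures = [name for name, check, args in cases if not check(*args)]
    return 1 - len(failures) / len(cases), failures


def in_scope(files):
    return not files_outside_scope(files, "src/billing")


# Hypothetical recorded agent runs, one per case.
cases = [
    ("stays in scope", in_scope, (["src/billing/invoice.py"],)),
    ("touches auth module", in_scope,
     (["src/billing/invoice.py", "src/auth/tokens.py"],)),
    ("confirms account first", asks_before_acting,
     (["confirm_account", "action:issue_refund"],)),
]
print(run_suite(cases))
```

Each case name reads like one of the product expectations above, which is what lets product and engineering argue about the same artifact.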

Double-Loop Learning: Fix the Metric, Not Just the Model

Single-loop learning asks:

Are we hitting the target?

Double-loop learning asks:

Is this still the right target?

AI teams need double-loop learning because AI systems change the environment around themselves. Once a tool exists, users adapt. Teams route more work through it. Managers reset expectations. Edge cases that were rare may become common because the AI makes a new workflow possible.

That means the original metric can become stale.

For example:

You optimized for response speed, then discovered trust dropped because users wanted more explanation.
You optimized for automation rate, then discovered the hardest cases were being automated too aggressively.
You optimized for low escalation, then discovered the system was hiding uncertainty instead of surfacing it.
You optimized for token cost, then discovered shorter outputs increased human review time.

The right response is not always “improve the model.”

Sometimes it is:

change the metric
split the metric by risk tier
add a human approval step
lower automation on certain cases
rewrite the rubric
retire the use case

That last option matters. A mature AI program should be able to kill a bad AI use case without treating it as a failure of ambition.

An 8-Step Roadmap for AI Measurement Maturity

Here is the practical sequence I would use with most teams.

1. Name the Workflow

Do not start with “AI strategy.” Start with a workflow.

Bad:

“Use AI in customer success.”

Better:

“Reduce average time-to-resolution for tier-1 billing tickets without increasing incorrect refunds.”

2. Establish the Baseline

Capture the current process before AI changes it:

cycle time
cost
error rate
rework
escalation
quality score
user satisfaction

3. Define the Outcome Metric

Pick the business result that matters.

Examples:

reduce cost per resolved case by 20%
increase first-contact resolution by 10%
reduce analyst research time by 30%
improve lead qualification accuracy by 15%

4. Define Workflow Guardrails

Add metrics that prevent local optimization from creating broader harm.

Examples:

no increase in complaint rate
no increase in compliance escalations
human override rate below a target threshold
quality rubric score above release threshold
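
The outcome metric and the guardrails can be evaluated together so that a win on one cannot hide a loss on another. In this sketch the metric names, the dictionary layout, and every threshold are hypothetical placeholders for whatever your team agrees on.

```python
def evaluate_rollout(baseline, candidate, *, target_cost_reduction=0.20,
                     max_override_rate=0.15, min_rubric_score=4.0):
    """Check the outcome metric and every guardrail; thresholds are hypothetical."""
    cost_reduction = 1 - candidate["cost_per_case"] / baseline["cost_per_case"]
    checks = {
        "outcome: cost per case down by target":
            cost_reduction >= target_cost_reduction,
        "guardrail: complaint rate not up":
            candidate["complaint_rate"] <= baseline["complaint_rate"],
        "guardrail: compliance escalations not up":
            candidate["compliance_escalations"] <= baseline["compliance_escalations"],
        "guardrail: override rate under threshold":
            candidate["override_rate"] < max_override_rate,
        "guardrail: rubric score above threshold":
            candidate["rubric_score"] > min_rubric_score,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return {"ship": not failed, "failed": failed}


# Illustrative numbers: the cost target is met, but complaints went up.
baseline = {"cost_per_case": 10.0, "complaint_rate": 0.02,
            "compliance_escalations": 3}
candidate = {"cost_per_case": 7.5, "complaint_rate": 0.03,
             "compliance_escalations": 3, "override_rate": 0.08,
             "rubric_score": 4.4}
print(evaluate_rollout(baseline, candidate))
```

Here the 25% cost reduction would look like a success on its own; the complaint-rate guardrail is what blocks the rollout.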

5. Build the Eval Set

Create examples that represent reality:

common cases
high-risk cases
historical failures
edge cases from domain experts
adversarial or ambiguous inputs

This becomes the regression suite for prompts, models, tools, and policies.

6. Run Controlled Experiments

Compare AI-assisted workflows against the current process.

Do not only measure model output. Measure the full path from task intake to accepted result.

7. Monitor Production Behavior

After launch, track:

quality drift
cost drift
latency drift
user behavior changes
human override patterns
incidents and near misses

Sample production outputs for human review. Feed the interesting failures back into the eval set.
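
A very simple drift check compares a rolling mean against the pre-launch baseline. The sketch below assumes a single numeric quality score per reviewed output and a nonzero baseline; the class name, window size, and tolerance are hypothetical, and a production system would likely use something more robust than a fixed relative tolerance.

```python
from collections import deque


class DriftMonitor:
    """Flag when the rolling mean of a metric moves too far from its baseline."""

    def __init__(self, baseline, tolerance, window=20):
        self.baseline = baseline            # assumed nonzero
        self.tolerance = tolerance          # allowed relative deviation, e.g. 0.05
        self.values = deque(maxlen=window)  # keeps only the most recent values

    def add(self, value):
        """Record one observation; return True if the window has drifted."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet
        rolling = sum(self.values) / len(self.values)
        return abs(rolling - self.baseline) / abs(self.baseline) > self.tolerance


# Illustrative quality scores: stable at first, then degrading.
monitor = DriftMonitor(baseline=0.90, tolerance=0.05, window=5)
scores = [0.91, 0.89, 0.90, 0.88, 0.92, 0.80, 0.78, 0.79, 0.77, 0.80]
flags = [monitor.add(s) for s in scores]
print(flags)
```

The same class works for cost or latency by feeding it a different metric; the outputs that trigger a flag are good candidates for the eval set.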

8. Review the Metric Itself

Every few weeks, ask:

Is this metric still measuring the real goal?
Are users gaming it?
Did the workflow change?
Are new risks appearing?
Should some cases be automated less?

That last review is what separates measurement from dashboard decoration.

The Real Test

Knowing you are using AI correctly does not mean predicting every outcome.

It means your system can answer three questions honestly:

What was the workflow like before AI?
What changed after AI, including costs and review burden?
What are we learning from failures that we did not know to expect?

The organizations that do this well will not be the ones with the most demos. They will be the ones with the tightest feedback loops between product intent, system behavior, human judgment, and business value.

That is the move:

From intuition to metrics.

From demos to evidence.

From “we are using AI” to “we know where AI is working, where it is not, and what we are changing next.”

Sources

The state of AI in early 2024 - McKinsey, May 2024
AI Adoption in 2024: 74% of Companies Struggle to Achieve and Scale Value - Boston Consulting Group, October 24, 2024
Gartner Survey Finds Generative AI is Now the Most Frequently Deployed AI Solution in Organizations - Gartner, May 7, 2024
AI ROI: How to measure the true value of AI - CIO, featuring Adrian Dunkley of StarApple AI
Linking AI Adoption to Productivity: 5 Proven Metrics and a Worklytics ROI Framework - Worklytics
The MITRE AI Maturity Model and Organizational Assessment Tool Guide - MITRE

Luis Mori Guerra
