AI tools love to report numbers, but the metric names can hide what’s actually happening: what counts as “right,” what the model is unsure about, and what your users will feel.
This guide translates the most common evaluation metrics and terms into plain English, with quick mental checks you can do while reviewing results on the web in Firefox.
One helpful framing: every metric is a shortcut. Your job is to pick the shortcut that matches the cost of being wrong.
Start with the simplest question: “What is a correct answer?”
Before accuracy, precision, or anything else, lock down what “correct” means for your task.
For some AI tasks, correctness is binary (spam vs not spam). For others, it’s subjective (summary quality) or depends on thresholds (is a photo “blurry enough” to reject?).
- Label definition: what each class means (and what it doesn’t).
- Edge cases: examples that are hard to label; write down the rule you’ll apply.
- Ground truth source: who labeled it, with what expertise, and how disagreements were handled.
- Unit of evaluation: per message, per user, per session, per document?
If you’re reviewing outputs in a web tool, keep a short “label rules” note open in another tab—otherwise the metric debates never end.
Accuracy: useful, but easiest to fool
Accuracy is “how often the model is right overall.”
It’s most meaningful when your classes are balanced and the cost of different errors is similar.
Where it goes wrong: if 95% of items are “not spam,” a model that always predicts “not spam” gets 95% accuracy while being useless.
Quick Firefox check: if you can filter a results table by label, look for class imbalance. If one label dominates, treat accuracy as a weak headline number.
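To see why imbalance fools accuracy, here’s a minimal sketch with made-up labels (the 95/5 split and the “always predict the majority” model are hypothetical, purely for illustration):

```python
# Hypothetical dataset: 95% "not spam", 5% "spam".
truth = ["not spam"] * 95 + ["spam"] * 5

# A useless "model" that never flags anything.
predicted = ["not spam"] * 100

# Fraction of items where prediction matches truth.
accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
print(accuracy)  # 0.95, despite catching zero spam
```

High headline number, zero value for the task you actually care about.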
Precision and recall: “How careful?” vs “How thorough?”
These two matter most for “find the needles” problems like fraud, safety issues, policy violations, or important alerts.
- Precision (careful): when the model flags something, how often is it truly that thing?
- Recall (thorough): of all true cases, how many did the model actually catch?
Typical trade-off: raising recall often lowers precision (you catch more, but you also accuse more).
Plain-English way to pick: decide which mistake hurts more.
- False positives are costly (e.g., blocking good users): prioritize precision.
- False negatives are costly (e.g., missing unsafe content): prioritize recall.
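Both definitions reduce to simple ratios over true positives (tp), false positives (fp), and false negatives (fn). A small sketch with invented counts:

```python
def precision_recall(tp, fp, fn):
    """Precision: of everything flagged, how much was truly positive?
    Recall: of all true positives, how many were caught?"""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical: the model flags 50 items, 40 correctly and 10 as false
# alarms, while another 20 real cases slip through unflagged.
p, r = precision_recall(tp=40, fp=10, fn=20)
print(p, r)  # 0.8 precision, ~0.667 recall
```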
F1 score: one number that hides the trade-off
F1 is a combined score that balances precision and recall: their harmonic mean, 2 × (precision × recall) / (precision + recall).
It’s handy for tracking changes over time, but it can hide what you actually care about. Two models can have the same F1 while one is “too strict” and the other is “too lenient.”
If someone shows only F1, ask to see precision and recall separately at the operating threshold you’ll use in production.
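A quick illustration of how F1 can mask opposite behaviors, using made-up precision/recall numbers for two hypothetical models:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

strict = f1(0.9, 0.5)   # "too strict": flags little, but accurately
lenient = f1(0.5, 0.9)  # "too lenient": catches a lot, with false alarms
print(round(strict, 3), round(lenient, 3))  # both ~0.643
```

Identical F1, very different user experience. That is why you ask for the two numbers separately.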
Confusion matrix: the most honest table in the room
A confusion matrix is a grid that shows where predictions go right and wrong across classes.
It answers practical questions like: “Are we confusing ‘refund request’ with ‘chargeback’?” or “Do we misclassify new users more than returning users?”
What to look for:
- One dominant off-diagonal cell: a specific, repeated confusion (often a labeling rule or missing feature).
- Rare classes collapsing: small classes getting predicted as the most common class.
- Asymmetry: class A mistaken for B much more than B mistaken for A (often a sign that one label is defined too broadly).
When reviewing in Firefox, a simple workflow is: open the matrix screenshot/report, then open a tab with example errors for the top 1–2 confusion pairs.
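Finding the top confusion pair programmatically is a one-liner over (true, predicted) counts. The support-ticket labels and counts below are invented for illustration:

```python
from collections import Counter

# Hypothetical (true label, predicted label) pairs from a ticket classifier.
pairs = [
    ("refund", "refund"), ("refund", "chargeback"), ("refund", "chargeback"),
    ("chargeback", "chargeback"), ("chargeback", "refund"),
    ("other", "other"), ("other", "refund"),
]
matrix = Counter(pairs)  # each cell of the confusion matrix

# The largest off-diagonal cell is the most common confusion.
worst = max((cell for cell in matrix if cell[0] != cell[1]), key=matrix.get)
print(worst, matrix[worst])  # ('refund', 'chargeback') 2
```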
ROC-AUC and PR-AUC: ranking quality, not your final decision
AUC metrics often confuse people because they can look great while the model still behaves poorly at your real-world threshold.
- ROC-AUC measures how well the model ranks positives above negatives across all possible thresholds.
- PR-AUC (precision-recall AUC) is usually more informative when positives are rare (fraud, abuse, defects).
Two practical reminders:
- AUC is about ordering, not whether your chosen cutoff gives acceptable precision/recall.
- If the positive class is rare, PR-AUC tends to match your intuition better than ROC-AUC.
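The “ordering” point can be made concrete: ROC-AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count half). A small sketch with made-up scores:

```python
def roc_auc(scores, labels):
    """ROC-AUC via the pairwise-ranking interpretation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented scores: positives mostly, but not always, outrank negatives.
print(roc_auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0]))  # 5/6 ~= 0.833
```

Notice the function never mentions a cutoff: it judges ordering only, which is exactly why a good AUC can coexist with a bad operating threshold.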
If you’re choosing a threshold, ask for a curve or table that shows precision and recall at several cutoffs—not just one AUC number.
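Such a table is easy to produce from scored examples. The scores and labels here are invented, just to show the shape of the output:

```python
# Hypothetical (model score, true label) pairs, label 1 = positive.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.40, 1), (0.30, 0), (0.20, 0)]
total_positives = sum(y for _, y in scored)

for cutoff in (0.5, 0.7, 0.9):
    flagged = [(s, y) for s, y in scored if s >= cutoff]
    tp = sum(y for _, y in flagged)
    precision = tp / len(flagged)
    recall = tp / total_positives
    print(f"cutoff={cutoff}: precision={precision:.2f} recall={recall:.2f}")
```

Reading a few rows like this tells you far more about your deployment than one AUC number.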
Confidence, probability, and calibration: “How sure” is not always true
Many models output a score that looks like a probability (e.g., 0.93). People call it confidence.
But a score of 0.93 doesn’t automatically mean “93% chance this is correct.” That only holds when the model is calibrated.
Plain-English calibration test: among all items the model scores around 0.80, are about 80% actually correct?
Why you should care:
- If you use confidence to auto-approve/auto-reject, miscalibration quietly creates bad automation.
- Calibration can drift when data changes (new topics, new user behavior, new spam styles).
In web reviews, watch for “high-confidence wrong” examples. A few of those are often more alarming than lots of low-confidence noise.
Takeaway: pick metrics based on the cost of mistakes
Accuracy is fine for balanced problems, but most real systems need you to name the painful error first.
Use precision/recall (plus a confusion matrix) to see what’s actually breaking, use AUC to judge ranking potential, and treat “confidence” as a claim that needs calibration evidence.
If you only remember one thing: ask, “Which wrong answer is more expensive?” and choose metrics that measure that mistake directly.