The most common quality monitoring mistake is deploying an AI agent without a baseline. Without a baseline, you can't detect drift — you have no reference point for what "normal" looks like. Every measurement is just a number, not a signal.
Establishing a quality baseline is a prerequisite for SPC, and it's straightforward if you approach it systematically. Here's the exact process.
Step 1: Define your quality metrics
Before you can measure a baseline, you need to decide what you're measuring. For most AI agents, start with these five:
- Task accuracy: Percentage of responses that correctly complete the assigned task, scored against a reference answer. Requires a golden test set.
- Hallucination rate: Percentage of responses containing unverifiable or fabricated claims. Can be measured by LLM-as-judge, embedding similarity to retrieved context, or human review.
- Tone score: A numeric measure of how closely the agent's tone matches your target persona. Typically 0–1, measured by an LLM judge against a tone rubric.
- Response length: Mean token count per response. Length drift is often a leading indicator for other quality changes.
- Grounding rate: For RAG agents — the proportion of factual claims in the response that are traceable to retrieved context.
Don't try to measure everything at once. Pick two or three that are most critical for your agent's use case and establish those baselines first. You can add metrics later as your monitoring practice matures.
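To make this concrete, each daily evaluation run ultimately reduces every metric to a single number. Below is a minimal sketch of that reduction, assuming a hypothetical per-example result schema (the field names `is_correct`, `flagged_hallucination`, `tone_score`, and `token_count` are illustrative, not from any particular evaluation framework):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ExampleResult:
    is_correct: bool              # scored against the golden reference answer
    flagged_hallucination: bool   # e.g. an LLM-as-judge verdict
    tone_score: float             # 0-1, judged against a tone rubric
    token_count: int

def daily_metrics(results: list[ExampleResult]) -> dict[str, float]:
    # Collapse one evaluation run into one value per metric.
    return {
        "task_accuracy": mean(r.is_correct for r in results),
        "hallucination_rate": mean(r.flagged_hallucination for r in results),
        "tone_score": mean(r.tone_score for r in results),
        "response_length": mean(r.token_count for r in results),
    }
```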
Step 2: Build a golden test set
A golden test set is a fixed collection of inputs with known-good reference outputs, used to evaluate your agent at regular intervals. This is the foundation of quantitative quality monitoring.
For a meaningful SPC baseline, your golden test set should:
- Contain at least 50–100 examples (more for high-stakes agents)
- Cover the full distribution of real user intents — not just easy cases
- Include challenging edge cases that stress-test the agent's reasoning
- Have reference answers that are unambiguously correct
- Be stable — the same test set is used every evaluation run
The test set doesn't need to be perfect. It needs to be consistent. The same imperfect test set run daily gives you a reliable relative signal even if the absolute numbers aren't ideal.
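One simple way to keep the set stable is to store it as one JSON object per line and commit it to version control. A minimal sketch, assuming hypothetical field names `id`, `input`, and `reference_answer`:

```python
import json
from pathlib import Path

def load_golden_set(path: str) -> list[dict]:
    # One JSON object per line; the file is never edited during a baseline period.
    examples = []
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue
            ex = json.loads(line)
            assert {"id", "input", "reference_answer"} <= ex.keys()
            examples.append(ex)
    return examples

# Example line in golden_set.jsonl:
# {"id": "billing-017", "input": "How do I update my card?",
#  "reference_answer": "Go to Settings > Billing > Payment method.", "tags": ["billing"]}
```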
Step 3: Run your baseline period
Once your test set is ready, run your agent against it daily for at least 14–21 days before setting final control limits. This baseline period should use your production configuration — the actual model, prompts, and context that will serve users.
Important: Your baseline period should represent stable, intentional operation. If you make prompt changes or model updates during the baseline period, restart it. Control limits computed from a mixed baseline will be misleadingly wide and will reduce your detection sensitivity.
During the baseline period, record each daily evaluation as a data point. By the end of 21 days, you'll have 21 observations per metric — enough to compute a robust mean and standard deviation.
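The record-keeping itself can be very simple. Here is a minimal sketch that appends each day's metric values to a long-format CSV; the file name and column layout are illustrative, not a required schema:

```python
import csv
from datetime import date
from pathlib import Path

def record_baseline_point(metrics: dict[str, float], path: str = "baseline_log.csv") -> None:
    # Append one row per metric for today's evaluation run.
    file = Path(path)
    write_header = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["date", "metric", "value"])
        for name, value in metrics.items():
            writer.writerow([date.today().isoformat(), name, value])
```

Called once per day (for example, `record_baseline_point(daily_metrics(results))` with the sketch from Step 1), this leaves you with a file from which the baseline statistics can be recomputed at any time.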
Step 4: Compute your control limits
From your baseline data, compute for each metric:
- Center line (CL): The mean of all baseline observations
- Upper Control Limit (UCL): CL + 3σ
- Lower Control Limit (LCL): CL − 3σ (or 0 if the metric can't go negative)
- Zone A: ±2σ to ±3σ from CL
- Zone B: ±1σ to ±2σ from CL
- Zone C: Within ±1σ of CL
These limits are descriptive — they describe the range of variation your agent naturally exhibits when healthy. They are not specifications or targets. A metric within control limits is not necessarily "good" — it's just statistically consistent with baseline.
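A minimal sketch of the computation, using the plain sample standard deviation (classic SPC charts often estimate sigma from the moving range instead, so treat this as the simplest variant):

```python
import statistics

def control_limits(observations: list[float], floor_at_zero: bool = True) -> dict:
    cl = statistics.mean(observations)
    sigma = statistics.stdev(observations)   # sample standard deviation of the baseline
    lcl = cl - 3 * sigma
    if floor_at_zero:
        lcl = max(lcl, 0.0)                  # rates and scores can't go negative
    return {
        "CL": cl,
        "UCL": cl + 3 * sigma,
        "LCL": lcl,
        # Zone boundaries above the center line; the zones below CL mirror them.
        "zone_A": (cl + 2 * sigma, cl + 3 * sigma),
        "zone_B": (cl + 1 * sigma, cl + 2 * sigma),
        "zone_C": (cl, cl + 1 * sigma),
    }
```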
Step 5: Validate the baseline before relying on it
Before going live with SPC monitoring, do a quick sanity check on your baseline:
- Are there any obvious outliers in the baseline period that should be excluded? (E.g., a day when the evaluation infrastructure was misconfigured)
- Does the distribution look approximately normal? Heavily skewed metrics may need transformation.
- Are the control limits wide enough to avoid constant false positives? If UCL − CL is not comfortably larger than the typical day-to-day swing between runs, your limits are too tight.
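These checks are easy to automate against the baseline log. A minimal sketch, assuming the daily values are in a plain list with nonzero variation; the thresholds (|z| > 3 for outliers, |skewness| > 1 for heavy skew) are common rules of thumb, not part of SPC itself:

```python
import statistics

def baseline_sanity_check(observations: list[float]) -> dict:
    cl = statistics.mean(observations)
    sigma = statistics.stdev(observations)
    z = [(x - cl) / sigma for x in observations]
    moving_ranges = [abs(b - a) for a, b in zip(observations, observations[1:])]
    return {
        # Candidate days to exclude (and document) before recomputing limits.
        "outliers": [x for x, zi in zip(observations, z) if abs(zi) > 3],
        # Heavily skewed metrics may need a transformation before charting.
        "heavily_skewed": abs(statistics.mean(v ** 3 for v in z)) > 1,
        # Ratio of limit half-width (3 sigma) to the typical day-to-day swing;
        # small values mean healthy days will regularly brush the limits.
        "limit_width_vs_daily_swing": (3 * sigma) / statistics.mean(moving_ranges),
    }
```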
When to recalibrate
Recalibrate your baseline after any intentional quality improvement: model upgrade, prompt revision, index rebuild. Run a fresh 7–14 day period after the change to establish new limits. Document what changed and when — this creates an audit trail that's invaluable for root cause analysis when a rule fires later.
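That audit trail doesn't need tooling; a single append-only file is enough. A minimal sketch, with illustrative field names rather than any standard schema:

```python
import json
from datetime import date

def log_recalibration(change: str, new_limits: dict, path: str = "baseline_history.jsonl") -> None:
    # One JSON record per recalibration: what changed, when, and the limits now in force.
    record = {
        "date": date.today().isoformat(),
        "change": change,                 # e.g. "upgraded model", "revised system prompt"
        "new_limits": new_limits,         # e.g. the output of control_limits() above
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```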
A quality baseline is not a one-time artifact. It's a living reference that evolves as your agent intentionally improves. The key is that evolution is intentional — you update it when you mean to, not when drift has silently moved your process mean.