How to set quality baselines for an AI agent before production

The most common quality monitoring mistake is deploying an AI agent without a baseline. Without a baseline, you can't detect drift — you have no reference point for what "normal" looks like. Every measurement is just a number, not a signal.

Establishing a quality baseline is a prerequisite for SPC, and it's straightforward if you approach it systematically. Here's the exact process.

Step 1: Define your quality metrics

Before you can measure a baseline, you need to decide what you're measuring. For most AI agents, a handful of core quality metrics is enough to start.

Don't try to measure everything at once. Pick two or three that are most critical for your agent's use case and establish those baselines first. You can add metrics later as your monitoring practice matures.
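To make this concrete, a metric can be as simple as a function that scores one agent output against a reference. The two metrics below (exact match and keyword coverage) are illustrative examples, not a prescribed set:

```python
# Illustrative sketch: each quality metric is a scoring function mapping an
# (agent output, reference output) pair to a score in [0, 1].
# The metric names and implementations below are examples, not a standard.

def exact_match(output: str, reference: str) -> float:
    """1.0 if the agent output matches the reference exactly, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def keyword_coverage(output: str, reference: str) -> float:
    """Fraction of reference keywords that appear in the output."""
    keywords = set(reference.lower().split())
    if not keywords:
        return 1.0
    found = sum(1 for k in keywords if k in output.lower())
    return found / len(keywords)

# Registry of the two or three metrics you chose to baseline first.
METRICS = {"exact_match": exact_match, "keyword_coverage": keyword_coverage}
```

Keeping each metric as an independent function makes it easy to add more later without disturbing the baselines you've already established.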

Step 2: Build a golden test set

A golden test set is a fixed collection of inputs with known-good reference outputs, used to evaluate your agent at regular intervals. This is the foundation of quantitative quality monitoring.

For a meaningful SPC baseline, your golden test set should be representative of real production inputs, held fixed from run to run, and large enough that one unusual case can't swing the aggregate score.

The test set doesn't need to be perfect. It needs to be consistent. The same imperfect test set run daily gives you a reliable relative signal even if the absolute numbers aren't ideal.
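One simple way to keep the test set fixed is to store it as a versioned JSON Lines file and load the same file every run. This is a minimal sketch; the file layout and field names (`id`, `input`, `reference`) are assumptions, not a required schema:

```python
import json

# Illustrative sketch: a golden test set stored as JSON Lines, one case per
# line. Field names ("id", "input", "reference") are assumptions.
GOLDEN_CASES = [
    {"id": "case-001", "input": "Reset my password", "reference": "Send reset link"},
    {"id": "case-002", "input": "Cancel my order", "reference": "Confirm cancellation"},
]

def save_golden_set(path: str, cases: list[dict]) -> None:
    """Write the test set once; commit the file so every run sees the same cases."""
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")

def load_golden_set(path: str) -> list[dict]:
    """Loading the same fixed file each run keeps the relative signal consistent."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

Checking the file into version control also gives you a record of exactly which cases were in force during any given baseline period.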

Step 3: Run your baseline period

Once your test set is ready, run your agent against it daily for at least 14 days (21 is better) before setting final control limits. This baseline period should use your production configuration: the actual model, prompts, and context that will serve users.

Important: Your baseline period should represent stable, intentional operation. If you make prompt changes or model updates during the baseline period, restart it. Control limits computed from a mixed baseline will be misleadingly wide and will reduce your detection sensitivity.

During the baseline period, record each daily evaluation as a data point. By the end of 21 days, you'll have 21 observations per metric — enough to compute a robust mean and standard deviation.
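A daily evaluation run can be sketched as a small loop: score every golden case with every metric, then record the per-metric mean as that day's data point. `run_agent` below is a stand-in for your actual agent call, and the metric is an illustrative example:

```python
import statistics

# Illustrative stand-in metric; substitute the metrics you chose in Step 1.
def exact_match(output: str, reference: str) -> float:
    return 1.0 if output.strip() == reference.strip() else 0.0

def evaluate_day(cases, metrics, run_agent):
    """Score every golden case with every metric and return one data point
    per metric: {metric_name: mean score across the test set}."""
    scores = {name: [] for name in metrics}
    for case in cases:
        output = run_agent(case["input"])  # your production agent call
        for name, fn in metrics.items():
            scores[name].append(fn(output, case["reference"]))
    return {name: statistics.mean(vals) for name, vals in scores.items()}
```

Appending each day's result to a list (or a table keyed by date) builds the 21-point baseline series the control limits are computed from.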

Step 4: Compute your control limits

From your baseline data, compute three numbers for each metric: the mean, the sample standard deviation, and control limits placed (conventionally) three standard deviations on either side of the mean — UCL = mean + 3σ, LCL = mean − 3σ.

These limits are descriptive — they describe the range of variation your agent naturally exhibits when healthy. They are not specifications or targets. A metric within control limits is not necessarily "good" — it's just statistically consistent with baseline.
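The computation itself is a few lines. This sketch assumes the conventional 3-sigma limits; the function and key names are illustrative:

```python
import statistics

def control_limits(baseline: list[float]) -> dict[str, float]:
    """Compute descriptive 3-sigma control limits from a baseline series
    for one metric. Requires at least two baseline observations."""
    mean = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    return {
        "mean": mean,
        "ucl": mean + 3 * sigma,  # upper control limit
        "lcl": mean - 3 * sigma,  # lower control limit
    }
```

Note that for a bounded score (say, accuracy in [0, 1]) the computed LCL or UCL can fall outside the feasible range; it's common to clamp the limit to the boundary in that case.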

Step 5: Validate the baseline before relying on it

Before going live with SPC monitoring, do a quick sanity check on your baseline: plot the daily points and confirm there is no visible trend or step change, and check that no baseline observation already falls outside the computed limits. If either check fails, the process wasn't stable during the baseline period, and limits computed from it shouldn't be trusted.
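Those sanity checks can be automated. This sketch flags two common problems — a baseline point already outside 3σ, and a long monotone run that suggests a trend. The 7-point run threshold is a common SPC convention, not a mandated rule:

```python
import statistics

def longest_monotone_run(series: list[float]) -> int:
    """Length (in points) of the longest strictly rising or falling stretch."""
    best = run = 1
    direction = 0
    for prev, cur in zip(series, series[1:]):
        d = (cur > prev) - (cur < prev)  # +1 rising, -1 falling, 0 flat
        if d != 0 and d == direction:
            run += 1
        else:
            run = 2 if d != 0 else 1
            direction = d
        best = max(best, run)
    return best

def baseline_looks_stable(baseline: list[float]) -> bool:
    """Two basic checks before trusting control limits from this baseline."""
    mean = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    # Check 1: no baseline point should already sit outside +/- 3 sigma.
    if any(abs(x - mean) > 3 * sigma for x in baseline):
        return False
    # Check 2: 7+ consecutive rises or falls suggests a trend, not stability.
    return longest_monotone_run(baseline) < 7
```

A baseline that fails either check should be investigated and rerun rather than patched.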

When to recalibrate

Recalibrate your baseline after any intentional quality improvement: model upgrade, prompt revision, index rebuild. Run a fresh 7–14 day period after the change to establish new limits. Document what changed and when — this creates an audit trail that's invaluable for root cause analysis when a rule fires later.
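The audit trail itself can be as lightweight as an append-only list of dated baseline records. The structure and field names below are illustrative, not a required format:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative sketch: an append-only audit trail of baseline versions, so a
# rule that fires later can be traced to what changed and when.
@dataclass
class BaselineRecord:
    effective_from: date
    reason: str   # e.g. "model upgrade", "prompt revision", "index rebuild"
    limits: dict  # {metric: {"mean": ..., "ucl": ..., "lcl": ...}}

history: list[BaselineRecord] = []

def recalibrate(effective_from: date, reason: str, limits: dict) -> None:
    """Record a new baseline; never overwrite or delete old records."""
    history.append(BaselineRecord(effective_from, reason, limits))
```

When a control rule fires, the most recent record tells you which limits were in force and why they were last reset.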

A quality baseline is not a one-time artifact. It's a living reference that evolves as your agent intentionally improves. The key is that evolution is intentional — you update it when you mean to, not when drift has silently moved your process mean.


Agent SPC sets your baseline automatically.

Connect your agent and we compute your control limits from the first 14 days of production data.

Get early access