Statistical process control for AI: adapting manufacturing quality methods for LLMs

In 1924, Walter Shewhart was working at Bell Telephone Laboratories on a problem that sounds remarkably familiar: how do you know when a production process is drifting before it starts producing defects? His answer — the control chart — became the foundation of statistical process control and has been used in manufacturing, aviation, and pharmaceuticals for 100 years.

The same mathematics applies to LLM outputs. Here's the full technical treatment.

The core idea: distinguishing signal from noise

Every production process has natural variation. A machine tool doesn't produce parts with exactly identical dimensions — there's a distribution of outcomes centered around the process mean. This is called common cause variation: random, inherent to the process, and expected.

When the process is disturbed — a tool wears, a material batch changes, an operator does something differently — a different kind of variation appears: special cause variation. This is the signal. The challenge is distinguishing special cause variation from the background noise of common cause variation.

Shewhart's insight: set control limits at ±3σ from the process mean. If the process is stable and normally distributed, only 0.27% of measurements will exceed these limits by chance. A point beyond 3σ is extremely unlikely to be common cause variation — it's a signal.
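
That 0.27% figure can be sanity-checked directly: it is the two-sided tail probability beyond ±3σ for a standard normal, computable from the error function with nothing but the standard library (a quick sketch, not part of any SPC library):

```python
import math

def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    # Phi(k) via the error function: Phi(k) = 0.5 * (1 + erf(k / sqrt(2)))
    phi = 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

print(f"{two_sided_tail(3):.4%}")  # ~0.27%
```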

Applying this to LLM output quality

Consider an AI customer support agent whose hallucination rate is evaluated daily against a golden test set. Over 30 days of stable operation, the mean hallucination rate is 0.8% with a standard deviation of 0.15%. This gives an upper control limit of 0.8% + 3 × 0.15% = 1.25%.

Any daily hallucination rate above 1.25% is a special cause signal — something has changed. But we can do better than waiting for a 3σ breach.
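
The limits for these numbers fall out directly (a minimal sketch; the mean and σ are the ones from the example above):

```python
mean = 0.008    # 0.8% baseline hallucination rate
sigma = 0.0015  # 0.15% standard deviation

ucl = mean + 3 * sigma            # upper control limit
lcl = max(mean - 3 * sigma, 0.0)  # rates cannot go below zero

print(f"UCL = {ucl:.2%}, LCL = {lcl:.2%}")  # UCL = 1.25%, LCL = 0.35%
```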

The Western Electric Rules

The Western Electric Company formalized four rules for interpreting control charts in their 1956 Statistical Quality Control Handbook. These rules detect patterns that indicate special cause variation before any point exceeds a 3σ limit:

Rule 1: One point beyond Zone A (±3σ). The classic Shewhart rule — any single point beyond the control limits.

Rule 2: Two of three consecutive points in Zone A or beyond (beyond ±2σ on the same side). This fires before a Rule 1 violation, with a false positive rate of roughly 0.4%.

Rule 3: Four of five consecutive points in Zone B or beyond (beyond ±1σ on the same side). Detects a sustained shift in the process mean.

Rule 4: Eight consecutive points on one side of the center line. The "run rule" — no point needs to be near a control limit. Under a stable, symmetric process, a run of eight on one side has probability about 2 × (1/2)⁸ ≈ 0.8%, so it is a reliable signal on its own.
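
The four rules can be sketched as zone checks over a window of recent measurements (a minimal illustration, not a production implementation; rule numbering follows the list above):

```python
def we_rules(points, mean, sigma):
    """Return the Western Electric rules violated by the most recent points.

    `points` is a list of measurements, oldest first.
    """
    z = [(p - mean) / sigma for p in points]
    violations = []

    # Rule 1: one point beyond +/- 3 sigma
    if abs(z[-1]) > 3:
        violations.append(1)

    # Rule 2: two of the last three points beyond 2 sigma, same side
    last3 = z[-3:]
    if len(last3) == 3 and (sum(1 for v in last3 if v > 2) >= 2
                            or sum(1 for v in last3 if v < -2) >= 2):
        violations.append(2)

    # Rule 3: four of the last five points beyond 1 sigma, same side
    last5 = z[-5:]
    if len(last5) == 5 and (sum(1 for v in last5 if v > 1) >= 4
                            or sum(1 for v in last5 if v < -1) >= 4):
        violations.append(3)

    # Rule 4: eight consecutive points on one side of the center line
    last8 = z[-8:]
    if len(last8) == 8 and (all(v > 0 for v in last8)
                            or all(v < 0 for v in last8)):
        violations.append(4)

    return violations
```

Feeding in each day's measurement and alerting on a non-empty result gives a basic control-chart monitor.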

The Nelson Rules extend this further

Nelson (1984) added four more rules that are particularly relevant for AI quality monitoring:

Rule 5: Six consecutive points steadily increasing or decreasing. This detects a systematic trend — like accuracy slowly degrading as your RAG index gets staler.

Rule 6: Fifteen consecutive points within Zone C (within ±1σ). Counterintuitively, too little variation is also a signal — it often indicates measurement error or a stratified process.

Rule 7: Fourteen consecutive points alternating up and down. This pattern suggests interaction between two systematic sources of variation — e.g., A/B prompt variants being served alternately.

Rule 8: Eight consecutive points on both sides of the center line with none in Zone C. This indicates the process has bimodal behavior — two distinct populations are being mixed.
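
Rule 5, the trend rule, is the most directly useful for drift detection and is simple to sketch (a minimal illustration; the other Nelson rules follow the same windowed-check pattern as the Western Electric rules):

```python
def nelson_trend(points, run_length=6):
    """Nelson Rule 5: `run_length` consecutive points strictly
    increasing or strictly decreasing, oldest first."""
    window = points[-run_length:]
    if len(window) < run_length:
        return False
    diffs = [b - a for a, b in zip(window, window[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
```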

Why these rules work for LLMs

LLM output quality metrics behave like manufacturing process measurements in the ways that matter: they have a stable mean under normal conditions, they vary within a predictable range, and they exhibit systematic patterns when something changes.

The key analogy: your agent's daily accuracy score is like a daily sample of parts off a production line. The distribution of that score, under stable conditions, defines your process capability. SPC rules detect when the distribution has shifted.

The differences between manufacturing and AI mostly make SPC easier to apply. Unlike physical measurements, LLM quality scores can be evaluated at arbitrary frequency against a fixed test set, eliminating many sources of measurement error. The process is also more likely to exhibit gradual drift (model updates, index staleness) than sudden shocks, which makes run and trend rules particularly effective.

Practical considerations for AI quality monitoring

A few implementation notes for teams applying SPC to LLM quality:

Sample size. SPC works best with a stable baseline of at least 20–30 observations before setting control limits. For daily evaluation, this means roughly 3–4 weeks of pre-production or early-production data before the limits are meaningful.
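
Establishing the baseline from that window is straightforward (a simplified sketch; classical individuals charts estimate σ from the average moving range rather than the raw standard deviation, and robust teams may prefer median/MAD):

```python
import statistics

def baseline(observations, min_n=20):
    """Center line and sigma estimate from a stable baseline window."""
    if len(observations) < min_n:
        raise ValueError(f"need at least {min_n} observations, got {len(observations)}")
    mean = statistics.fmean(observations)
    sigma = statistics.stdev(observations)  # sample standard deviation
    return mean, sigma
```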

Metric choice matters. Not all metrics are equally suitable for SPC. Accuracy against a fixed test set is ideal: deterministic, stable, and comparable across time. User ratings are noisier and require larger samples. Latency is often bimodal (fast vs. slow calls) and is better tracked with a separate control chart for each mode.

Recalibrating baselines. When you intentionally improve your agent — update the model, rebuild the index, revise the prompt — you should recalibrate the baseline. Run a fresh baseline period after any intentional quality change, then re-establish control limits from the new data.

False positive rate. With all eight rules (the Western Electric four plus the Nelson extensions) applied simultaneously, the false positive rate is considerably higher than with Rule 1 alone. For production alerts, consider applying Rules 1, 2, 4, and 5 as the high-priority tier, and Rules 3, 6, 7, and 8 as informational signals for weekly review.
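
One way to encode such a tiering is a simple severity map over rule numbers (a sketch; the split mirrors the suggestion above, with rules 1–8 covering both rule sets):

```python
HIGH_PRIORITY = {1, 2, 4, 5}   # page someone
INFORMATIONAL = {3, 6, 7, 8}   # batch into the weekly review

def triage(violated_rules):
    """Split fired rule numbers into alerting tiers."""
    return {
        "alert": sorted(set(violated_rules) & HIGH_PRIORITY),
        "review": sorted(set(violated_rules) & INFORMATIONAL),
    }

print(triage([2, 6]))  # {'alert': [2], 'review': [6]}
```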

Statistical process control isn't magic — it's applied probability theory. The reason it's been the standard in quality-critical industries for 100 years is that it works: it reliably distinguishes the signal of a degrading process from the noise of normal variation. The mathematics transfers directly to LLM quality monitoring.

