
Friday, April 24, 2026

"Automated Detection of Performance Regressions Using Statistical Process Control Techniques"

Exploring ICPE’12 — A Precedent I Didn’t Expect

I recently came across an interesting paper from ICPE 2012 where my earlier work was cited. It’s always a bit surreal to see your ideas show up in academic research years later—especially in a context that closely aligns with what you’ve been working on.

The Paper

Automated detection of performance regressions using statistical process control techniques

Thanh H.D. Nguyen, Bram Adams, Zhen Ming Jiang, Ahmed E. Hassan
Published by ACM, April 2012

What caught my attention was their discussion of using control charts to detect performance regressions—an approach very close to what I explored back in 2005.

The Connection

In the paper, the authors reference my work:

Trubin et al. [18] proposed the use of control charts for in-field monitoring of software systems where performance counters fluctuate according to input load. Control charts can automatically learn when deviations exceed control limits and alert operators.

They go on to build upon this idea, applying control charts not just to live systems, but to performance regression testing.

Key Idea: Control Charts for Regression Detection

The core concept is elegant:

  • Use historical baseline runs (previous software versions) to establish control limits
  • Compare new test runs against those limits
  • Measure a violation ratio—how often metrics fall outside expected bounds
  • A higher ratio indicates a higher probability of regression

This aligns closely with the fundamental principle I worked on: detecting anomalies not by fixed thresholds, but by statistically learned behavior.
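The procedure above can be sketched in a few lines. This is a minimal illustration of the general control-chart idea, not the paper's actual implementation; the function names, the three-sigma limits, and the sample numbers are all my own.

```python
# Sketch of control-chart-based regression detection: baseline runs
# define control limits, and the violation ratio of a new run signals
# a possible regression. Names and data are illustrative only.
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Center line +/- k sigma, learned from baseline measurements."""
    m, s = mean(baseline), stdev(baseline)
    return m - k * s, m + k * s

def violation_ratio(new_run, lcl, ucl):
    """Fraction of new measurements falling outside the control limits."""
    violations = sum(1 for x in new_run if x < lcl or x > ucl)
    return violations / len(new_run)

# Baseline: response times (ms) from a previous version's test run.
baseline = [101, 99, 102, 98, 100, 103, 97, 101, 100, 99]
lcl, ucl = control_limits(baseline)

# New run with an upward drift: most points exceed the upper limit,
# so the violation ratio is high and a regression is likely.
new_run = [100, 108, 110, 109, 111, 107, 112, 106, 110, 109]
ratio = violation_ratio(new_run, lcl, ucl)
```

Note that the limits are learned from history rather than set by hand, which is exactly the contrast with fixed thresholds.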

The Real Challenge

The authors correctly highlight a critical difficulty:

We want to detect deviations in the system (the process), not deviations caused by input variability (the load).

This is the central problem in performance analysis—and one that still trips up many modern monitoring systems.

They also point out two assumptions required for traditional control charts:

  1. Stable (non-varying) input
  2. Normally distributed output

In real-world systems, both assumptions are often violated.
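The normality assumption, at least, is cheap to sanity-check before trusting three-sigma limits. Below is a rough stdlib-only sketch using sample skewness; the thresholds and data are illustrative, and a proper analysis would use a formal normality test.

```python
# Rough check of the "normally distributed output" assumption:
# strongly skewed data (e.g. latency with a heavy right tail) is a
# red flag for classical control limits. Illustrative sketch only.
from statistics import mean, stdev

def sample_skewness(xs):
    """Adjusted Fisher-Pearson skewness; near 0 for symmetric data."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    g1 = sum(((x - m) / s) ** 3 for x in xs) / n
    return (n * (n - 1)) ** 0.5 / (n - 2) * g1

# One slow outlier gives the distribution a heavy right tail:
latencies = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
skew = sample_skewness(latencies)  # strongly positive
```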

Their Solution: Preprocessing

To address this, they introduce preprocessing steps:

  • Scaling – normalizing data to reduce input-driven variance
  • Filtering – cleaning noise before applying control charts

This is a practical adaptation, though it also highlights the limitations of applying classical statistical techniques directly to complex software systems.
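To make the scaling step concrete, here is a minimal sketch of the idea as I read it: divide a performance counter by a load measure so that input-driven variance drops out before the control chart is applied. The function name and numbers are hypothetical, not taken from the paper.

```python
# Illustrative scaling step: normalize a counter by a load measure so
# that input-driven variance is reduced before charting. If the scaled
# series stays flat, the process itself has not changed.

def scale_by_load(counter, load):
    """Per-sample ratio, e.g. CPU seconds per request served."""
    return [c / l for c, l in zip(counter, load)]

# Raw CPU counter rises and falls with request volume...
cpu  = [20.0, 40.0, 60.0, 30.0]
reqs = [100.0, 200.0, 300.0, 150.0]

# ...but the scaled series is constant: only the load varied.
scaled = scale_by_load(cpu, reqs)  # [0.2, 0.2, 0.2, 0.2]
```

This is precisely the deviation-in-the-process versus deviation-in-the-load distinction quoted above: the raw counter would trip a naive chart, while the scaled one would not.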

Looking Back

For reference, the cited work is:

[18] I. Trubin. Capturing workload pathology by statistical exception detection system.
Computer Measurement Group (CMG), 2005.

It’s interesting to see how the idea of statistical exception detection—especially under variable workloads—continues to evolve and reappear in different forms.

Final Thoughts

What I find most encouraging is that the core idea still holds:

Performance anomalies should be detected relative to expected behavior, not absolute thresholds.

Whether you call it control charts, anomaly detection, or change point analysis—the principle remains the same.

And it’s a good reminder: sometimes ideas don’t just age… they propagate.



