Saturday, April 25, 2026

From AIOps to Agentic Systems: Why Monitoring Is Not Enough (and Never Was)

For years, the industry has been obsessed with observability.

Dashboards. Alerts. Correlations.
Then came AIOps — promising intelligence on top.

But let’s be honest:

Most AIOps tools today are still just better dashboards.

They detect problems.
Sometimes they explain them.
But very rarely do they fix anything.

For mainframe environments, this gap is even more important. IBM Z systems still run many of the enterprise’s most critical transaction workloads, where CPU, memory, I/O, service classes, batch windows, and subsystem behavior interact in complex ways. AI on the mainframe is not only about adding assistants or anomaly detection. The real opportunity is to combine trusted workload models, mainframe operational context, and governed automation so the platform can recommend — and eventually execute — safe actions before service levels are at risk.

The Missing Step: Action

Across my patent family (with Capital One and two co-authors):

US10437697 (2016)
US11243863 (2019)
US12007869 (2021)

there is a deliberate progression:


[Workload] → [Model] → [Insight] → [Action]

Most systems today stop here:


[Workload] → [Model] → [Insight]   ❌

The real value starts here:


[Workload] → [Model] → [Insight] → [Action]   ✅

Step 1 — Modeling the System (US10437697)

The first patent introduced a core idea:

Model how business activity (transactions) drives system resources (CPU, memory, I/O).

Not thresholds.
Not heuristics.
But statistical relationships.


Transactions ───► CPU / Memory / I/O
           (modeled mathematically)

This was already a shift from traditional monitoring.
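To make the idea concrete, here is a minimal sketch of fitting a statistical relationship between transaction volume and CPU. It is not the patented method: it assumes a simple straight-line relationship, and the sample numbers and the names `fit_linear_model` and `predict_cpu` are hypothetical.

```python
import numpy as np

# Hypothetical hourly samples: transactions per hour and CPU utilization (%).
transactions = np.array([1000, 2000, 3000, 4000, 5000], dtype=float)
cpu_percent = np.array([12.0, 22.0, 32.0, 42.0, 52.0])

def fit_linear_model(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept

slope, intercept = fit_linear_model(transactions, cpu_percent)

def predict_cpu(tx):
    """Estimate CPU% for a given transaction volume using the fitted line."""
    return slope * tx + intercept

print(f"CPU% ≈ {slope:.4f} * transactions + {intercept:.2f}")
print(f"Predicted CPU at 6000 tx/h: {predict_cpu(6000):.1f}%")
```

The point of the sketch is the shift in question: instead of asking "is CPU above 80%?", you ask "is CPU what this transaction volume should produce?".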

Step 2 — Adding Context (US11243863)

The second patent introduced interaction types:

Different workloads behave differently — so model them separately.


Mobile ─┐
Web     ├──► Separate models ───► Better decisions
ATM     ┘

This aligns with what the industry now calls:

service-level observability
topology-aware analysis
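A rough illustration of "model them separately": the sketch below fits one line per interaction type instead of one line for all traffic. The sample data, the straight-line assumption, and the helper name `fit_per_type` are hypothetical and are not taken from the patent.

```python
import numpy as np

# Hypothetical samples: (interaction_type, transactions, cpu_percent).
samples = [
    ("mobile", 1000, 6.0), ("mobile", 2000, 11.0), ("mobile", 3000, 16.0),
    ("web", 1000, 12.0), ("web", 2000, 22.0), ("web", 3000, 32.0),
    ("atm", 1000, 4.0), ("atm", 2000, 7.0), ("atm", 3000, 10.0),
]

def fit_per_type(rows):
    """Group samples by interaction type and fit a separate line for each group."""
    grouped = {}
    for kind, tx, cpu in rows:
        grouped.setdefault(kind, []).append((tx, cpu))
    models = {}
    for kind, pairs in grouped.items():
        x, y = np.array(pairs, dtype=float).T
        slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
        models[kind] = (slope, intercept)
    return models

models = fit_per_type(samples)
for kind, (slope, intercept) in models.items():
    print(f"{kind}: CPU% per transaction = {slope:.4f}, baseline = {intercept:.2f}")
```

In this made-up data a web transaction costs roughly twice as much CPU as a mobile one, which a single pooled model would blur.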

Step 3 — Acting on the Model (US12007869)

This is the key leap.

The latest patent moves beyond analysis:

Use the models to automatically reconfigure the system.


Before:
Workload ───► Overloaded Node

After:
Workload ───► Optimal Node
          (automatically reassigned)

Or more formally:


[Model] → Decision → Remap workloads → Optimize system

This is no longer monitoring.

This is autonomous control.
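As a toy version of the [Model] → Decision → Remap step, the sketch below takes predicted load per workload and places each workload on the node with the most headroom. The greedy heuristic, the node names, the capacity figures, and the function `remap` are all assumptions for illustration; the patent's actual remapping logic is not reproduced here.

```python
def remap(predicted_load, node_capacity):
    """Greedy placement: largest predicted load first, onto the node with most headroom."""
    headroom = dict(node_capacity)
    placement = {}
    ordered = sorted(predicted_load.items(), key=lambda kv: kv[1], reverse=True)
    for workload, load in ordered:
        node = max(headroom, key=headroom.get)
        if headroom[node] < load:
            raise ValueError(f"no node can absorb {workload} ({load}%)")
        placement[workload] = node
        headroom[node] -= load
    return placement

# Hypothetical inputs: CPU% predicted by the workload models, usable CPU% per node.
predicted_load = {"mobile": 40.0, "web": 55.0, "atm": 20.0}
node_capacity = {"node-a": 80.0, "node-b": 80.0}

placement = remap(predicted_load, node_capacity)
print(placement)
```

A production system would wrap a decision like this in governance (approval gates, rollback, rate limits) before letting it execute on its own.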

Why This Matters Now (Agentic AI)

Everyone is talking about:

AI agents
autonomous systems
self-healing infrastructure

But here’s the uncomfortable truth:

You can’t have agentic systems without reliable system models.

LLMs don’t understand system dynamics.
They generate text — not operational decisions.

What you need is:


Statistical Models (US10437697)
+ Context Segmentation (US11243863)
+ Autonomous Action (US12007869)

Which leads to:


→ Agentic AIOps

The Real Gap in AIOps Today

Platforms like:

Datadog
Dynatrace
New Relic

are very good at:

✔ Detecting anomalies
✔ Explaining root causes

But still weak at:

❌ Acting autonomously
❌ Continuously optimizing systems

My Take (Provocative Version)

AIOps without action is just observability with better marketing.

The real transition is:


Monitoring → AIOps → Autonomous Systems → Agentic AI Ops

And the key step is exactly what US12007869 enables:

Systems that don’t just understand —
but act based on that understanding.

Final Thought

If your system still depends on humans to:

interpret alerts
decide what to do
execute changes

Then it’s not AIOps.

It’s just monitoring — with extra steps.

______________

Reference:

My CMG presentation about the subject: https://cmg.org/wp-content/plugins/s2member-files/proceedings/2017/362_Trubin.pdf

___________________________________________

Disclaimer: this post is written with ChatGPT's help.

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics from St. Petersburg Technical University (Russia) and then worked for 12 years as a professor teaching CAD/CAM and Robotics. He published 30+ papers and made several conference presentations in the fields of Robotics and Artificial Intelligence. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He gave other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.org). He worked 2 years as the Capacity team lead for IBM, 3 years for SunTrust Bank, and then 3 years at IBM as Sr. IT Architect. He now works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.

One of my most recent patents with Capital One (US12007869, 2021) is about Autonomous / AI ops (AIOps)

The following patent family:

Patent	Level	What it protects
2016 (10437697)	Foundation	Build + validate statistical models
2019 (11243863)	Structured	Segment system into interaction types
2021 (12007869)	Adaptive	Dynamically reconfigure system using models

This progression covers:

✔ Observability / APM tools

Modeling + correlation (Patent 1)

✔ Capacity planning systems

Segmented workload modeling (Patent 2)

✔ Autonomous / AI ops (AIOps)

Self-optimizing infrastructure (Patent 3)

👉 The family effectively moves toward:

self-driving infrastructure based on statistical modeling

Those patents map very directly to modern AIOps, especially the parts around business/workload demand → resource utilization → model scoring → automated load-balancing/remapping.

Core patent family vs AIOps platform features

Patent concept	Plain-English meaning	Modern AIOps equivalent
Interaction / transaction volume by type	Business workload demand, e.g. mobile banking, ATM, web traffic	Service traffic, request rate, user actions, business events
Device/resource utilization	CPU, memory, disk, network usage	Infrastructure + APM telemetry
Statistical / regression / multivariate models	Model relationship between workload and resource consumption	ML baselines, anomaly models, predictive analytics
Diagnostic scoring: R², RMSE, strength	Decide which models are reliable	Confidence/scoring of anomalies, correlations, RCA evidence
Filtering weak models	Keep only useful models	Noise reduction / alert suppression
Forecasts	Predict future demand/resource pressure	Bottleneck prediction, capacity forecasting
Remapping devices to interaction types	Use model output to change workload placement	Automated remediation, scaling, routing, load balancing

The strongest overlap is not generic “anomaly detection.” It is business-demand-aware resource modeling that can drive infrastructure decisions.
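The "diagnostic scoring" and "filtering weak models" rows can be sketched as follows: fit a model per resource, compute R² and RMSE, and keep only the models that clear a threshold. The data, the threshold `MIN_R2`, and the function `score_model` are hypothetical; the patents' actual scoring criteria may differ.

```python
import numpy as np

def score_model(x, y):
    """Fit y = slope * x + intercept and report R-squared and RMSE for the fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residual = y - (slope * x + intercept)
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    return {"slope": float(slope), "intercept": float(intercept), "r2": r2, "rmse": rmse}

tx = np.array([1000, 2000, 3000, 4000, 5000], dtype=float)
candidates = {
    "cpu": np.array([12.0, 21.5, 32.5, 41.0, 52.0]),      # tracks transactions closely
    "disk_io": np.array([30.0, 10.0, 45.0, 12.0, 33.0]),  # mostly unrelated to transactions
}

MIN_R2 = 0.8  # hypothetical cut-off for a "strong enough" model
scores = {name: score_model(tx, y) for name, y in candidates.items()}
reliable = {name: s for name, s in scores.items() if s["r2"] >= MIN_R2}

for name, s in scores.items():
    status = "keep" if name in reliable else "drop"
    print(f"{name}: R2={s['r2']:.3f} RMSE={s['rmse']:.2f} -> {status}")
```

Only models that survive this filter should be allowed to drive forecasts or remapping decisions; a weak model acting autonomously is worse than no model.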

Igor Trubin

Friday, April 24, 2026

"Automated Detection of Performance Regressions Using Statistical Process Control Techniques"

Exploring ICPE’12 — A Precedent I Didn’t Expect

I recently came across an interesting paper from ICPE 2012 where my earlier work was cited. It’s always a bit surreal to see your ideas show up in academic research years later—especially in a context that closely aligns with what you’ve been working on.

The Paper

Automated detection of performance regressions using statistical process control techniques

Thanh H.D. Nguyen, Bram Adams, Zhen Ming Jiang, Ahmed E. Hassan
Published by ACM, April 2012

What caught my attention was their discussion of using control charts to detect performance regressions—an approach very close to what I explored back in 2005.

The Connection

In the paper, the authors reference my work:

Trubin et al. [18] proposed the use of control charts for in-field monitoring of software systems where performance counters fluctuate according to input load. Control charts can automatically learn when deviations exceed control limits and alert operators.

They go on to build upon this idea, applying control charts not just to live systems, but to performance regression testing.

Key Idea: Control Charts for Regression Detection

The core concept is elegant:

Use historical baseline runs (previous software versions) to establish control limits
Compare new test runs against those limits
Measure a violation ratio—how often metrics fall outside expected bounds
A higher ratio indicates a higher probability of regression

This aligns closely with the fundamental principle I worked on: detecting anomalies not by fixed thresholds, but by statistically learned behavior.
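The four steps above can be sketched in a few lines. This version assumes classical mean ± 3 sigma limits, which may not match the limits the paper's authors chose; the sample values and the function names `control_limits` and `violation_ratio` are hypothetical.

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits learned from baseline runs."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread

def violation_ratio(baseline, new_run, sigmas=3.0):
    """Fraction of points in the new run that fall outside the baseline limits."""
    lower, upper = control_limits(baseline, sigmas)
    violations = sum(1 for value in new_run if value < lower or value > upper)
    return violations / len(new_run)

# Hypothetical response-time samples (ms) from previous versions and two new builds.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
same_build = [101, 99, 100, 102]
slow_build = [100, 112, 115, 118]

print("limits:", control_limits(baseline))
print("same build ratio:", violation_ratio(baseline, same_build))
print("slow build ratio:", violation_ratio(baseline, slow_build))
```

A ratio near zero suggests the new build behaves like the baseline; a high ratio flags it for a closer look.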

The Real Challenge

The authors correctly highlight a critical difficulty:

We want to detect deviations in the system (the process), not deviations caused by input variability (the load).

This is the central problem in performance analysis—and one that still trips up many modern monitoring systems.

They also point out two assumptions required for traditional control charts:

Stable (non-varying) input
Normally distributed output

In real-world systems, both assumptions are often violated.

Their Solution: Preprocessing

To address this, they introduce preprocessing steps:

Scaling – normalizing data to reduce input-driven variance
Filtering – cleaning noise before applying control charts

This is a practical adaptation, though it also highlights the limitations of applying classical statistical techniques directly to complex software systems.
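For intuition only, here is one simple way the two preprocessing ideas could look. These are stand-ins, not the paper's procedures: scaling is assumed to mean dividing the metric by the load, filtering is assumed to be a small rolling median, and the data and the names `scale_by_load` and `median_filter` are hypothetical.

```python
import statistics

def scale_by_load(metric, load):
    """Express the metric per unit of load, so load swings do not look like regressions."""
    return [m / l for m, l in zip(metric, load)]

def median_filter(values, window=3):
    """Replace each point with the median of its neighborhood to damp isolated spikes."""
    half = window // 2
    return [
        statistics.median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

# Hypothetical intervals: request counts and CPU%, with one noisy spike in the middle.
load = [100, 200, 400, 200, 100]
cpu = [10.0, 20.0, 120.0, 20.0, 10.0]

per_request = scale_by_load(cpu, load)
smoothed = median_filter(per_request)

print("CPU% per request:", per_request)
print("after filtering: ", smoothed)
```

After both steps, what remains is closer to the behavior of the system itself rather than the shape of the input load, which is what a control chart needs to see.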

Looking Back

For reference, the cited work is:

[18] I. Trubin. Capturing workload pathology by statistical exception detection system.
Computer Measurement Group (CMG), 2005.

It’s interesting to see how the idea of statistical exception detection—especially under variable workloads—continues to evolve and reappear in different forms.

Final Thoughts

What I find most encouraging is that the core idea still holds:

Performance anomalies should be detected relative to expected behavior, not absolute thresholds.

Whether you call it control charts, anomaly detection, or change point analysis—the principle remains the same.

And it’s a good reminder: sometimes ideas don’t just age… they propagate.

Igor Trubin

Monday, April 20, 2026

#ICPE2026 workshop presentation "Detecting past and future change points in performance data for education and practice"

See announcement and abstract HERE

Igor Trubin

System Management by Exception
