Saturday, April 25, 2026

From AIOps to Agentic Systems: Why Monitoring Is Not Enough (and Never Was)

For years, the industry has been obsessed with observability.

Dashboards. Alerts. Correlations.
Then came AIOps — promising intelligence on top.

But let’s be honest:

Most AIOps tools today are still just better dashboards.

They detect problems.
Sometimes they explain them.
But very rarely do they fix anything.


The Missing Step: Action

Across my patent family (with Capital One and two co-authors):

  • US10437697 (2016)
  • US11243863 (2019)
  • US12007869 (2021)

there is a deliberate progression:

[Workload] → [Model] → [Insight] → [Action]

Most systems today stop here:

[Workload] → [Model] → [Insight] ❌

The real value starts here:

[Workload] → [Model] → [Insight] → [Action] ✅

Step 1 — Modeling the System (US10437697)

The first patent introduced a core idea:

Model how business activity (transactions) drives system resources (CPU, memory, I/O).

Not thresholds.
Not heuristics.
But statistical relationships.

Transactions ───► CPU / Memory / I/O
(modeled mathematically)

This was already a shift from traditional monitoring.
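To make the "statistical relationships" idea concrete, here is a minimal sketch: a one-variable least-squares fit of CPU utilization against transaction volume. The sample numbers and the names `fit_line` and `predict_cpu` are illustrative assumptions, not taken from the patent, which covers far more than a single linear fit.

```python
# Hypothetical sample: transactions per minute vs. CPU utilization (%).
transactions = [100, 200, 300, 400, 500]
cpu_percent = [12.0, 21.0, 33.0, 41.0, 52.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(transactions, cpu_percent)


def predict_cpu(tx):
    """Expected CPU (%) for a given transaction volume, per the fitted line."""
    return intercept + slope * tx


print(f"CPU ~ {intercept:.2f} + {slope:.3f} * transactions")
print(f"Expected CPU at 600 tx/min: {predict_cpu(600):.1f}%")
```

The point of the sketch is the direction of the dependency: the business workload is the input and the resource metric is the output, so "expected CPU" moves with demand instead of sitting at a fixed threshold.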


Step 2 — Adding Context (US11243863)

The second patent introduced interaction types:

Different workloads behave differently — so model them separately.

Mobile ─┐
Web    ─┼──► Separate models ───► Better decisions
ATM    ─┘

This aligns with what the industry now calls:

  • service-level observability
  • topology-aware analysis
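Modeling each interaction type separately can be sketched as one small regression per type. The observations, the type names, and the helper `fit_line` below are hypothetical, and a simple linear fit stands in for whatever model family a real system would use.

```python
from collections import defaultdict

# Hypothetical observations: (interaction_type, transactions, cpu_percent).
observations = [
    ("mobile", 100, 6.0), ("mobile", 200, 11.0), ("mobile", 300, 16.0),
    ("web", 100, 12.0), ("web", 200, 22.0), ("web", 300, 32.0),
    ("atm", 100, 3.0), ("atm", 200, 5.0), ("atm", 300, 7.0),
]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Group the samples by interaction type, then fit one model per group.
grouped = defaultdict(lambda: ([], []))
for kind, tx, cpu in observations:
    grouped[kind][0].append(tx)
    grouped[kind][1].append(cpu)

models = {kind: fit_line(xs, ys) for kind, (xs, ys) in grouped.items()}

# The interaction type with the steepest slope costs the most CPU per transaction.
costliest = max(models, key=lambda kind: models[kind][0])

for kind, (slope, intercept) in models.items():
    print(f"{kind}: CPU ~ {intercept:.2f} + {slope:.3f} * transactions")
print("Most CPU per transaction:", costliest)
```

With this made-up data, a single pooled model would blur three quite different slopes into one; the per-type models keep them apart.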

Step 3 — Acting on the Model (US12007869)

This is the key leap.

The latest patent moves beyond analysis:

Use the models to automatically reconfigure the system.

Before:
Workload ───► Overloaded Node

After:
Workload ───► Optimal Node
(automatically reassigned)

Or more formally:

[Model] → Decision → Remap workloads → Optimize system

This is no longer monitoring.

This is autonomous control.
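As a rough illustration of "model output drives placement", the sketch below takes predicted demand per workload and assigns each workload to the node with the lowest projected utilization. The greedy rule, the function `remap`, and the node and workload names are all assumptions made for illustration; this is not the method claimed in the patent.

```python
def remap(predicted_demand, node_capacity):
    """Greedy placement: largest workloads first, each onto the node
    whose projected utilization (load / capacity) would be lowest."""
    load = {node: 0.0 for node in node_capacity}
    placement = {}
    by_size = sorted(predicted_demand.items(), key=lambda kv: kv[1], reverse=True)
    for workload, demand in by_size:
        target = min(
            node_capacity,
            key=lambda node: (load[node] + demand) / node_capacity[node],
        )
        load[target] += demand
        placement[workload] = target
    return placement, load


# Hypothetical model output: predicted CPU demand per interaction type.
predicted_demand = {"web": 60.0, "mobile": 30.0, "atm": 10.0}
node_capacity = {"node-a": 100.0, "node-b": 100.0}

placement, load = remap(predicted_demand, node_capacity)
print(placement)
print(load)
```

In a real system the demand figures would come from the workload models and the placement step would have to respect constraints this sketch ignores (affinity, migration cost, failure domains).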


Why This Matters Now (Agentic AI)

Everyone is talking about:

  • AI agents
  • autonomous systems
  • self-healing infrastructure

But here’s the uncomfortable truth:

You can’t have agentic systems without reliable system models.

LLMs don’t understand system dynamics.
They generate text — not operational decisions.

What you need is:

Statistical Models (US10437697)
+ Context Segmentation (US11243863)
+ Autonomous Action (US12007869)

Which leads to:

→ Agentic AIOps

The Real Gap in AIOps Today

Platforms like:

  • Datadog
  • Dynatrace
  • New Relic

are very good at:

✔ Detecting anomalies
✔ Explaining root causes

But still weak at:

❌ Acting autonomously
❌ Continuously optimizing systems


My Take (Provocative Version)

AIOps without action is just observability with better marketing.

The real transition is:

Monitoring → AIOps → Autonomous Systems → Agentic AI Ops

And the key step is exactly what US12007869 enables:

Systems that don’t just understand —
but act based on that understanding.


Final Thought

If your system still depends on humans to:

  • interpret alerts
  • decide what to do
  • execute changes

Then it’s not AIOps.

It’s just monitoring — with extra steps.

______________

Reference:

My CMG presentation about the subject: https://cmg.org/wp-content/plugins/s2member-files/proceedings/2017/362_Trubin.pdf



___________________________________________

Disclaimer: this post was written with ChatGPT's help.

One of my most recent patents with Capital One (US12007869, 2021) is about autonomous / AI ops (AIOps).

The following patent family:
Patent             Level         What it protects
2016 (10437697)    Foundation    Build + validate statistical models
2019 (11243863)    Structured    Segment system into interaction types
2021 (12007869)    Adaptive      Dynamically reconfigure system using models

This progression covers:

✔ Observability / APM tools 

  • Modeling + correlation (Patent 1)

✔ Capacity planning systems

  • Segmented workload modeling (Patent 2)

✔ Autonomous / AI ops (AIOps)

  • Self-optimizing infrastructure (Patent 3)

👉 The family effectively moves toward:

self-driving infrastructure based on statistical modeling

Those patents map very directly to modern AIOps, especially the parts around business/workload demand → resource utilization → model scoring → automated load-balancing/remapping.

Core patent family vs AIOps platform features

Patent concept | Plain-English meaning | Modern AIOps equivalent
Interaction / transaction volume by type | Business workload demand, e.g. mobile banking, ATM, web traffic | Service traffic, request rate, user actions, business events
Device/resource utilization | CPU, memory, disk, network usage | Infrastructure + APM telemetry
Statistical / regression / multivariate models | Model relationship between workload and resource consumption | ML baselines, anomaly models, predictive analytics
Diagnostic scoring: R², RMSE, strength | Decide which models are reliable | Confidence/scoring of anomalies, correlations, RCA evidence
Filtering weak models | Keep only useful models | Noise reduction / alert suppression
Forecasts | Predict future demand/resource pressure | Bottleneck prediction, capacity forecasting
Remapping devices to interaction types | Use model output to change workload placement | Automated remediation, scaling, routing, load balancing

The strongest overlap is not generic “anomaly detection.” It is business-demand-aware resource modeling that can drive infrastructure decisions.
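The "diagnostic scoring" and "filtering weak models" rows can be sketched in a few lines: score each candidate model with R² and RMSE, then keep only the ones that clear a cutoff. The candidate names, the sample values, and the `MIN_R2` threshold are hypothetical; the patents do not prescribe these particular numbers.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the units of the modeled metric."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical candidates: model name -> (observed values, model predictions).
candidates = {
    "web -> cpu": ([10, 20, 30, 40], [11, 19, 31, 39]),
    "atm -> disk": ([10, 20, 30, 40], [25, 24, 26, 25]),
}

MIN_R2 = 0.8  # assumed cutoff; tune per environment

scores = {
    name: (r_squared(actual, predicted), rmse(actual, predicted))
    for name, (actual, predicted) in candidates.items()
}
kept = [name for name, (r2, _) in scores.items() if r2 >= MIN_R2]

for name, (r2, err) in scores.items():
    print(f"{name}: R2={r2:.3f} RMSE={err:.2f}")
print("Models kept:", kept)
```

Only models that survive this filter should be allowed to feed forecasts or remapping decisions; a weak model acting automatically is worse than no model.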



Friday, April 24, 2026

"Automated Detection of Performance Regressions Using Statistical Process Control Techniques"

Exploring ICPE’12 — A Precedent I Didn’t Expect

I recently came across an interesting paper from ICPE 2012 where my earlier work was cited. It’s always a bit surreal to see your ideas show up in academic research years later—especially in a context that closely aligns with what you’ve been working on.

The Paper

Automated detection of performance regressions using statistical process control techniques

Thanh H.D. Nguyen, Bram Adams, Zhen Ming Jiang, Ahmed E. Hassan
Published by ACM, April 2012

What caught my attention was their discussion of using control charts to detect performance regressions—an approach very close to what I explored back in 2005.

The Connection

In the paper, the authors reference my work:

Trubin et al. [18] proposed the use of control charts for in-field monitoring of software systems where performance counters fluctuate according to input load. Control charts can automatically learn when deviations exceed control limits and alert operators.

They go on to build upon this idea, applying control charts not just to live systems, but to performance regression testing.

Key Idea: Control Charts for Regression Detection

The core concept is elegant:

  • Use historical baseline runs (previous software versions) to establish control limits
  • Compare new test runs against those limits
  • Measure a violation ratio—how often metrics fall outside expected bounds
  • A higher ratio indicates a higher probability of regression

This aligns closely with the fundamental principle I worked on: detecting anomalies not by fixed thresholds, but by statistically learned behavior.
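The steps above can be sketched as follows. This uses classical mean ± 3 sigma limits as an assumption; the paper may derive its control limits differently, and the sample values and function names are made up for illustration.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from baseline runs (mean +/- k * stdev)."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def violation_ratio(samples, lower, upper):
    """Fraction of samples that fall outside the control limits."""
    violations = sum(1 for s in samples if s < lower or s > upper)
    return violations / len(samples)


# Hypothetical response times (ms): previous version vs. new build.
baseline = [50, 52, 48, 51, 49, 50, 53, 47]
new_run = [51, 49, 70, 72, 50, 68, 52, 71]

lower, upper = control_limits(baseline)
ratio = violation_ratio(new_run, lower, upper)

print(f"Control limits: [{lower:.1f}, {upper:.1f}]")
print(f"Violation ratio: {ratio:.2f}")
```

A ratio near zero means the new run behaves like the baseline; the closer it gets to one, the stronger the evidence of a regression.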

The Real Challenge

The authors correctly highlight a critical difficulty:

We want to detect deviations in the system (the process), not deviations caused by input variability (the load).

This is the central problem in performance analysis—and one that still trips up many modern monitoring systems.

They also point out two assumptions required for traditional control charts:

  1. Stable (non-varying) input
  2. Normally distributed output

In real-world systems, both assumptions are often violated.

Their Solution: Preprocessing

To address this, they introduce preprocessing steps:

  • Scaling – normalizing data to reduce input-driven variance
  • Filtering – cleaning noise before applying control charts

This is a practical adaptation, though it also highlights the limitations of applying classical statistical techniques directly to complex software systems.
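One simple way to picture scaling is to divide a raw counter by the load that produced it, so the control chart sees cost per request instead of raw utilization. This is an illustrative assumption, not the paper's exact preprocessing; the data and the function `scale_by_load` are hypothetical.

```python
def scale_by_load(counter, load):
    """Convert a raw counter into cost per unit of load, skipping idle intervals."""
    return [c / l for c, l in zip(counter, load) if l > 0]


# Hypothetical test runs with the same varying load (requests per interval).
load = [100, 200, 400, 200]
cpu_old = [10.0, 20.0, 40.0, 20.0]  # previous version
cpu_new = [15.0, 30.0, 60.0, 30.0]  # new build

old_cost = scale_by_load(cpu_old, load)
new_cost = scale_by_load(cpu_new, load)

# Raw CPU swings widely with load; per-request cost stays flat within a run.
raw_swing = max(cpu_old) - min(cpu_old)
scaled_swing = max(old_cost) - min(old_cost)

print("Raw CPU swing (old):", raw_swing)
print("Per-request cost swing (old):", scaled_swing)
print("Per-request cost, old vs new:", old_cost[0], new_cost[0])
```

After scaling, the load-driven swing disappears and the remaining difference between the two runs is the change in the system itself, which is what the control chart is supposed to catch.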

Looking Back

For reference, the cited work is:

[18] I. Trubin. Capturing workload pathology by statistical exception detection system.
Computer Measurement Group (CMG), 2005.

It’s interesting to see how the idea of statistical exception detection—especially under variable workloads—continues to evolve and reappear in different forms.

Final Thoughts

What I find most encouraging is that the core idea still holds:

Performance anomalies should be detected relative to expected behavior, not absolute thresholds.

Whether you call it control charts, anomaly detection, or change point analysis—the principle remains the same.

And it’s a good reminder: sometimes ideas don’t just age… they propagate.




Monday, April 20, 2026

#ICPE2026 workshop presentation "Detecting past and future change points in performance data for education and practice"

See the announcement and abstract HERE