Saturday, February 28, 2009

Real-Time Statistical Exception Detection

Does it make sense to apply statistical filtering to real-time computer performance data? I have not tried it, because I believe that analyzing the last day's data against a historical baseline (based on dynamic statistical thresholds) is enough to get a good alert about an upcoming issue, while a classical alerting system (based on constant thresholds, for instance Patrol or SiteScope) captures severe incidents when something is completely dying.

But I see that some companies do it, using at least the following three products available on the market:

1. Integrien Alive™ (http://www.integrien.com/)
2. Netuitive (http://netuitive.com/)
3. ProactiveNet (now BMC) (http://documents.bmc.com/products/documents/49/13/84913/84913.pdf)

(Plus Firescope, http://www.firescope.com/default.htm, and Managed Objects, http://managedobjects.com/, do something similar.)

I recently had a discussion with Integrien salespeople, as they gave a live presentation of the Alive product for the company I work for now.
I was impressed; it looks like it works well. Most interesting for me is the difference between SEDS (my approach) and their technology.

Apparently both approaches use dynamic statistical thresholds to issue an alert.

But I think they do it using some patented, complex statistical algorithms that should work well even if the sample data is not normally distributed. It is based on research by Dr. Mazda A. Marvasti; I am aware of this research because some of his thoughts were published by CMG (in MeasureIT) a couple of years ago. That publication contains a very good critique of SPC (Statistical Process Control) concepts applied to IT data: SPC works perfectly if the data is normally distributed, and not so perfectly if it is not. The first attempt to improve SPC was MASF, which regroups the analyzed data so that after regrouping the data might be closer to normal. SEDS is based on MASF and, for instance, looks at history in a different dimension: it does not compare hours within the same day (when calculating standard deviations) but groups hours by weekday, and it calculates statistics across weeks, not days.
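To make the weekday/hour grouping more concrete, here is a rough Python sketch. It is not the actual SEDS code: the function names `build_baseline` and `find_exceptions`, the multiplier `k=3.0`, and the synthetic data are all assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baseline(history, k=3.0):
    """Group hourly samples by (weekday, hour) and return control limits.

    history: iterable of (datetime, value) pairs covering several weeks.
    Returns {(weekday, hour): (lower, upper)} with limits at mean +/- k * stdev.
    k is an assumed tuning parameter, not a value taken from SEDS.
    """
    groups = defaultdict(list)
    for ts, value in history:
        groups[(ts.weekday(), ts.hour)].append(value)

    limits = {}
    for slot, values in groups.items():
        if len(values) < 2:  # stdev needs at least two weeks of data
            continue
        m, s = mean(values), stdev(values)
        limits[slot] = (m - k * s, m + k * s)
    return limits


def find_exceptions(samples, limits):
    """Return the samples that fall outside the limits of their slot."""
    found = []
    for ts, value in samples:
        slot = (ts.weekday(), ts.hour)
        if slot not in limits:
            continue
        lower, upper = limits[slot]
        if value < lower or value > upper:
            found.append((ts, value))
    return found


# Synthetic data (hypothetical): four weeks of hourly history,
# then one new day to check against the baseline.
start = datetime(2009, 2, 2)  # a Monday
history = [
    (start + timedelta(weeks=w, days=d, hours=h), 30 + h + (1 if w % 2 else -1))
    for w in range(4) for d in range(7) for h in range(24)
]
new_day = [
    (start + timedelta(weeks=4, hours=h), 90 if h == 14 else 30 + h)
    for h in range(24)
]

limits = build_baseline(history)
exceptions = find_exceptions(new_day, limits)
print(exceptions)  # only the 14:00 sample is outside its weekday/hour limits
```

Each Monday 14:00 value is compared only with earlier Monday 14:00 values, so a normal daily peak is not flagged just because it is higher than the night hours of the same day.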

(You can find more details in my last paper. Links to some papers related to this subject, including my own, can be found in this blog.)

BTW, in response to his publication I did a special analysis to see how far from normal the data used by SEDS is, and some results of this research were published in one of my papers. My opinion is that some data is close to normal and some is indeed not so close; it depends on the metrics, subsystems, and environment (prod/non-prod) and on how the data is grouped.

The key is what type of threshold a SEDS-like product uses to establish a baseline. It could be a very simple static one, or one based on standard deviations, or a more complex threshold such as a combination of static ones (empirical, based on expert experience) and simple statistical ones (based on standard deviations). SEDS uses that combination and has several tuning parameters for capturing only meaningful exceptions. I believe this approach is valuable (and cheap) for practical usage, and several successful implementations of SEDS prove that.
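One possible way to combine a static (expert) threshold with a simple statistical one is sketched below. This is only an assumption about how such a combination could look, not the actual SEDS rule: the function name `is_exception` and the parameters `k`, `static_floor`, and `static_ceiling` are hypothetical.

```python
from statistics import mean, stdev


def is_exception(value, reference, k=3.0, static_floor=20.0, static_ceiling=95.0):
    """Decide whether `value` is an exception against `reference` history.

    Assumed combination of rules (illustrative only):
      - at or above static_ceiling: always an exception (expert rule)
      - below static_floor: never an exception, too low to matter (expert rule)
      - otherwise: an exception if above mean + k * stdev of the reference data
    """
    if value >= static_ceiling:
        return True
    if value < static_floor:
        return False
    if len(reference) < 2:  # not enough history for a standard deviation
        return False
    upper = mean(reference) + k * stdev(reference)
    return value > upper


# Hypothetical CPU utilization values (percent) for the same weekday/hour slot.
quiet_server = [5, 6, 5, 6]
busy_server = [40, 42, 41, 43]

low_but_unusual = is_exception(12, quiet_server)   # statistically odd, but below the floor
clear_jump = is_exception(60, busy_server)         # well above the statistical limit
within_limits = is_exception(44, busy_server)      # inside mean + 3 * stdev
near_saturation = is_exception(96, [94, 96, 95, 97])  # caught by the static ceiling
print(low_but_unusual, clear_jump, within_limits, near_saturation)
```

The static floor suppresses exceptions that are statistically unusual but practically harmless, and the static ceiling still catches a server that is always close to saturation and therefore never looks unusual to the statistical test.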

But for a more accurate analysis of data, especially data that is far from a normal distribution, other more advanced statistical techniques could be applied, and it looks like this product implements them. For me it is just another (more sophisticated) threshold calculation for baselining. Anyway, I am continuing to improve my approach and will be thinking about what they and others do in this area.

Another interesting observation I got from the Integrien tool live presentation:
The rate at which dynamic thresholds are exceeded is so large that they have to put in an additional (static???) threshold, on the assumption that some number of exceptions is kind of normal and is just noise that should be ignored. That means a smart alert is issued only if the number of exceptions is bigger than that threshold. I did not get how this threshold is set or calculated, but it is very high: HUNDREDS (!!!) of exceptions per interval. I believe the reason is that they apply the “Anomaly” detector to data that is too granular. As I stated in my last paper, a better result can be reached by doing the statistical analysis after some summarization (SEDS mostly does it after averaging the data to hourly values).
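The effect of granularity can be shown with a small Python sketch. The data is synthetic, and for simplicity a single fixed upper limit stands in for the dynamic threshold; the names `hourly_averages`, `count_exceedances`, and `UPPER_LIMIT` are hypothetical and are not taken from SEDS or from the Integrien product.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_averages(samples):
    """Average (datetime, value) samples into one value per clock hour."""
    groups = defaultdict(list)
    for ts, value in samples:
        groups[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: mean(values) for hour, values in sorted(groups.items())}


def count_exceedances(values, upper_limit):
    """Count how many values are above the upper limit."""
    return sum(1 for v in values if v > upper_limit)


# Synthetic one-minute samples for three hours: a short spike every 10 minutes.
start = datetime(2009, 2, 28, 9, 0)
minute_samples = [
    (start + timedelta(minutes=m), 90.0 if m % 10 == 0 else 50.0)
    for m in range(180)
]
UPPER_LIMIT = 80.0  # simplified stand-in for a dynamic threshold

raw_count = count_exceedances((v for _, v in minute_samples), UPPER_LIMIT)
hourly = hourly_averages(minute_samples)
hourly_count = count_exceedances(hourly.values(), UPPER_LIMIT)
print(raw_count, hourly_count)
```

On the one-minute data every short spike counts as an exception, while the hourly averages stay under the limit, so far fewer (here, no) exceptions are raised for what is really just noise.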

BTW, SEDS uses an original meta-metric to detect only meaningful exceptions (EV, or Exception Value; see my last paper), which allows SEDS to keep its false positive rate very low.