
Saturday, February 28, 2009

Real-Time Statistical Exception Detection

Does it make sense to apply statistical filtering to real-time computer performance data? I have not tried it, as I believe that analyzing the last day's data against a historical baseline (based on dynamic statistical thresholds) is enough to give a good alert about an upcoming issue, while at the same time a classical alerting system (based on constant thresholds, for instance Patrol or SiteScope) captures severe incidents if something is completely dying.

But I see some companies do that, using at least the following three products available on the market:

1. Integrien Alive™ (http://www.integrien.com/ )
2. Netuitive (http://netuitive.com/ )
3. ProactiveNet (now BMC) (http://documents.bmc.com/products/documents/49/13/84913/84913.pdf )

In addition, FireScope (http://www.firescope.com/default.htm ) and Managed Objects (http://managedobjects.com/ ) do something similar.

I recently had a discussion with Integrien sales people when they gave a live presentation of the Alive product for the company I work for now.
I was impressed; it looks like it works well. Most interesting for me is the difference between SEDS (my approach) and their technology.

Apparently both approaches use dynamic statistical thresholds to issue an alert.

But I think they do that using some patented, complex statistical algorithms that should work well even if the sample data is not normally distributed. It is based on research that Dr. Mazda A. Marvasti did, and I am aware of this research because some of his thoughts were published in CMG's MeasureIT a couple of years ago. That publication contains a very good critique of SPC (Statistical Process Control) concepts applied to IT data: SPC works perfectly if the data is normally distributed, and not so well if it is not. The first attempt to improve SPC was MASF, which regroups the analyzed data; after regrouping, the data may be closer to normal. SEDS is based on MASF and, for instance, it looks at history in a different dimension: instead of comparing hours within the same day (calculating standard deviations), it groups hours by weekday, and it calculates statistics across weeks rather than days.

(You can find more details in my latest paper. Links to some papers related to this subject, including mine, can be found on this blog.)
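To illustrate the MASF-style regrouping I described (grouping hours by weekday and calculating statistics across weeks), here is a minimal Python sketch. The synthetic data, the four-week window, and all names are my own illustration, not SEDS internals:

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev

def masf_baseline(samples):
    """Group hourly (timestamp, value) samples by (weekday, hour)
    and compute mean and standard deviation across weeks."""
    groups = defaultdict(list)
    for ts, value in samples:
        groups[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), stdev(vals))
            for key, vals in groups.items() if len(vals) >= 2}

# Four weeks of synthetic hourly data, starting on a Monday.
start = datetime(2009, 2, 2)
samples = [(start + timedelta(hours=h), 40.0 + (h % 24))
           for h in range(24 * 7 * 4)]
baseline = masf_baseline(samples)
avg, sd = baseline[(0, 9)]  # Mondays at 09:00, statistics across 4 weeks
```

Each (weekday, hour) cell then gets its own mean and standard deviation, so a Monday 09:00 reading is compared only with other Monday 09:00 readings from previous weeks.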

BTW, in response to his publication I did a special analysis to see how far from normal the data used by SEDS is, and some results of that research have been published in one of my papers. My opinion is that some data is close to normal and some indeed is not, and it depends on the metric, the subsystem, the environment (prod/non-prod), and how the data is grouped.

The key is what type of threshold a SEDS-like product uses to establish a baseline. It could be very simple (a static one, or one based on standard deviations), or it could be a more complex threshold, such as a combination of a static one (based on expert experience, i.e. empirical) and a simple statistical one (based on standard deviations). SEDS uses that combination, and it has several tuning parameters to tune it to capture meaningful exceptions. I believe this approach is valuable (and cheap) for practical usage, and several successful implementations of SEDS prove that.
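A combined threshold of the kind I mean (an empirical static floor plus a mean-plus-k-sigma statistical limit) can be sketched like this; the floor of 80 and k = 3 are made-up illustrative values, not SEDS's actual tuning parameters:

```python
def dynamic_upper_limit(group_mean, group_stdev, static_floor=80.0, k=3.0):
    """Upper control limit: the statistical limit (mean + k * stdev),
    but never below an empirically chosen static floor, so quiet
    metrics do not alert on tiny fluctuations."""
    return max(static_floor, group_mean + k * group_stdev)

print(dynamic_upper_limit(50.0, 2.0))   # quiet metric: the floor wins (80.0)
print(dynamic_upper_limit(70.0, 10.0))  # noisy metric: the statistical limit wins (100.0)
```

The static floor keeps false alerts down on stable metrics, while the statistical part adapts to metrics with naturally wide variation.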

But for more accurate analysis of the data, especially if it is far from a normal distribution, other, more advanced statistical techniques could be applied, and it looks like this product implements that. For me it is just another (more sophisticated) threshold calculation for baselining. Anyway, I continue improving my approach and will keep thinking about what they and others do in this area.

Another interesting observation I got from the Integrien tool's live presentation:
The rate at which dynamic thresholds are exceeded is so large that they have to add an additional (static?) threshold, on the assumption that some number of exceptions is normal and just noise that should be ignored. That means the smart alert is issued only if the number of exceptions is bigger than that threshold. I did not catch how this threshold is set or calculated, but it is very high: HUNDREDS (!) of exceptions per interval. I believe the reason for this is that they apply the anomaly detector to data that is too granular. As I stated in my latest paper, better results can be reached by applying the statistical analysis after some summarization (SEDS does that mostly by averaging the data to hourly values).
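What I mean by summarizing before the statistical step can be sketched as a simple hourly averaging pass (illustrative code only; SEDS's actual aggregation may differ):

```python
from collections import defaultdict
from datetime import datetime

def hourly_averages(samples):
    """Average raw (timestamp, value) samples into one point per hour,
    so a single noisy spike cannot generate hundreds of exceptions."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}

raw = [
    (datetime(2009, 2, 28, 10, 5), 30.0),
    (datetime(2009, 2, 28, 10, 20), 90.0),  # one noisy spike
    (datetime(2009, 2, 28, 10, 50), 30.0),
]
hourly = hourly_averages(raw)  # the 10:00 hour averages to 50.0; the spike is smoothed out
```

Comparing the hourly averages (rather than every raw sample) against the dynamic thresholds is what keeps the exception count per interval manageable.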

BTW, SEDS uses an original meta-metric to detect only meaningful exceptions (EV, or Exception Value; see my latest paper), which allows SEDS to keep its false positive rate very low.
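The idea behind an EV-like meta-metric can be sketched as follows. This is only my rough illustration of accumulating how far the data goes outside the dynamic limits; the precise EV definition is in the paper:

```python
def exception_value(values, upper_limits, lower_limits):
    """Accumulate how far hourly values fall outside their dynamic limits.
    Marginal excursions give a small EV (ignorable noise); large or
    sustained ones give a large EV worth an alert. Excursions below
    the lower limit contribute negatively."""
    ev = 0.0
    for v, hi, lo in zip(values, upper_limits, lower_limits):
        if v > hi:
            ev += v - hi
        elif v < lo:
            ev -= lo - v
    return ev

ev = exception_value([10.0, 12.0, 22.0], [15.0] * 3, [5.0] * 3)  # only 22 exceeds the limit
```

Alerting on the accumulated EV instead of on each individual threshold crossing is what filters out isolated, marginal exceptions.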

3 comments:

  1. Great post Igor! I'd love to set up a series of podcasts or guest author posts where we could dive into this area of predictive/proactive analytics and monitoring more. Would you be interested?

    Thanks!

    Doug
    BSM/ITSM Blog: http://dougmcclure.net

  2. Igor, I very much enjoyed reading your blog. I also just finished reading your paper on this topic as well. You make good points regarding the applicability of SEDS to capacity management, and I believe it can be a viable tool for looking at specific metrics (or groupings of metrics) for capacity-type problems. The problem becomes more complicated when dealing with complex applications and looking at near real-time data. The complexity of applications I have seen at large organizations precludes the ability to find relevant groupings of important metrics without having deep application expertise. Once you get to this granularity of data collection and this level of obscurity in application knowledge, it is not possible to find or even craft metrics that show a near-normal distribution. I have a paper out that shows how non-aggregated data, no matter what time-slice techniques are applied to it, does not conform to a normal distribution. With this limitation, SPC techniques cannot be appropriately applied to these types of data. Regardless, when looking at applications in a data-agnostic manner (without having any prescribed knowledge of the metrics), the only way to analyze the abnormalities is to look at them in aggregate form. This aggregation technique can then lead to problem identification without requiring application knowledge. The subject of aggregate anomaly analysis is a deeper subject for later discussions. Thanks for sharing your thoughts and experiences.

  3. I was pointed (from one of the LinkedIn discussions I created) to one more product with a similar approach (behavior-learning application monitoring) from an Indian start-up called Appnomic:
    http://www.networkworld.com/community/node/45584
