Tuesday, November 12, 2013

HP techreport: "Statistical Techniques for Online Anomaly Detection in Data Centers". My critique.

SOURCE: HPL-2011-8 Statistical Techniques for Online Anomaly Detection in Data Centers - Wang, Chengwei; Viswanathan, Krishnamurthy; Choudur, Lakshminarayan; Talwar, Vanish; Satterfield, Wade; Schwan, Karsten

The subject of the paper is excellent, and this blog is the right place to discuss it, since you can find numerous discussions here of tools and methods that solve essentially the same problem. Below is the introductory paragraph with the key assumptions of the paper, about which I have some doubts:

1. MASF uses a reference set as a baseline, from which the statistical thresholds (UCL, LCL) are calculated. The original suggestion was to keep that reference set static over time, so the baseline is always the same. While developing my SETDS methodology I modernized the approach: SETDS now mostly uses a baseline that slides from past to present, ending just where the most recent “actual data” starts (so the mean is actually a moving average!). It is still a MASF-like way to build thresholds, but they change over time, self-adjusting to pattern changes. I call that “dynamic thresholding”. BTW, after SETDS some other vendors implemented this approach, as you can see here: Baselining and dynamic thresholds features in Fluke and Tivoli tools

2. A few years ago I had an intensive discussion about the data “normality” assumption with the founder of the Alive (Integrien) tool (now part of VMware vCOPS): Real-Time Statistical Exception Detection. So vCOPS now has the ability to detect real-time anomalies by applying a non-parametric statistical approach. SETDS can also detect anomalies (my original term is statistical exceptions) in real time if applied to near-real-time data: Real-Time Control Charts for SEDS
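The sliding-baseline idea from the first point can be illustrated with a small sketch. This is not the SETDS implementation; it assumes MASF-style limits of mean ± k standard deviations, and the names `dynamic_thresholds`, `detect_exceptions`, `baseline_len` and `k` are made up for this example.

```python
from statistics import mean, stdev

def dynamic_thresholds(history, baseline_len, k=3.0):
    """Return (lcl, ucl) built from the last `baseline_len` points of `history`.

    The baseline slides: it always ends where the newest "actual" data starts.
    Assumption: limits are mean +/- k * sample standard deviation.
    """
    baseline = history[-baseline_len:]
    center = mean(baseline)   # moving average over the sliding baseline
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - k * spread, center + k * spread

def detect_exceptions(series, baseline_len, k=3.0):
    """Return (index, value) pairs that fall outside the limits
    computed from the baseline that immediately precedes each point."""
    found = []
    for i in range(baseline_len, len(series)):
        lcl, ucl = dynamic_thresholds(series[:i], baseline_len, k)
        if not lcl <= series[i] <= ucl:
            found.append((i, series[i]))
    return found

# Hypothetical CPU utilization samples with one spike
cpu = [50, 52, 49, 51, 50, 48, 52, 51, 90, 50, 51]
print(detect_exceptions(cpu, baseline_len=8))
```

Note that in this naive version a detected spike enters the next baselines and widens the limits for a while; whether to exclude exceptions from the baseline is a design choice that the sketch leaves open.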

The other part of the paper mentions the use of a multiple time dimension approach, which is not really new. I explored a similar one during my IT-Control chart development by treating the data as a data cube with at least two time dimensions, weeks and hours, and also comparing the historical baseline with the most recent data; see details in the most popular post of this blog: One Example of BIRT Data Cubes Usage for Performance Data Analysis
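To make the two time dimensions concrete, here is a minimal sketch that slices the data by hour of the week across several baseline weeks and then checks the most recent week against per-hour limits. It is only an illustration, not the BIRT data cube from that post; the tuple layout, the function names and the mean ± k sigma limits are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

def hourly_limits(history, k=3.0):
    """history: iterable of (week, hour_of_week, value) tuples.

    Returns {hour_of_week: (lcl, ucl)}, one pair of limits per hour,
    computed across all baseline weeks (assumed mean +/- k * stdev).
    """
    by_hour = defaultdict(list)
    for _week, hour, value in history:
        by_hour[hour].append(value)
    limits = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # stdev needs at least two baseline weeks
        m, s = mean(values), stdev(values)
        limits[hour] = (m - k * s, m + k * s)
    return limits

def weekly_exceptions(limits, recent):
    """recent: iterable of (hour_of_week, value) from the latest week."""
    return [(h, v) for h, v in recent
            if h in limits and not limits[h][0] <= v <= limits[h][1]]

# Hypothetical data: four baseline weeks, two hours of the week
history = [(w, 0, v) for w, v in enumerate([10, 12, 11, 13])]
history += [(w, 1, v) for w, v in enumerate([40, 42, 41, 39])]
print(weekly_exceptions(hourly_limits(history), [(0, 12), (1, 80)]))
```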

Section III of the paper describes the use of the “Tukey” method, which is definitely valid as a non-parametric way to calculate UCL and LCL (I should try that). I usually just use percentiles (e.g. UCL = 95th and LCL = 5th) if the data are apparently not normally distributed.
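Both non-parametric options can be sketched in a few lines. The Tukey version below uses the conventional fences Q1 − k·IQR and Q3 + k·IQR with k = 1.5, which may differ from the exact variant in the paper; the function names are hypothetical, and quartiles and percentiles come from Python's `statistics.quantiles` with its default method.

```python
from statistics import quantiles

def tukey_limits(data, k=1.5):
    """Return (lcl, ucl) as Tukey fences: Q1 - k*IQR and Q3 + k*IQR.

    Assumption: k = 1.5 is the conventional choice; a larger k
    gives wider, more conservative limits.
    """
    q1, _median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def percentile_limits(data, low=5, high=95):
    """Return (lcl, ucl) as the `low`-th and `high`-th percentiles."""
    cuts = quantiles(data, n=100)  # 99 cut points: cuts[p - 1] is percentile p
    return cuts[low - 1], cuts[high - 1]

# Hypothetical skewed sample (e.g. response times)
sample = [12, 13, 13, 14, 15, 15, 16, 18, 21, 30, 55]
print(tukey_limits(sample))
print(percentile_limits(sample))
```

Neither function assumes normality; the percentile version always flags a fixed share of the baseline as outside the limits, while the Tukey fences can flag nothing at all when the data have no outliers.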

Part B of Section III is about “windowing approaches”. It is interesting because it compares collections of data points and how well they fit a given distribution. It reminds me of another CMG paper that took a similar approach, calculating the entropy of different portions of the performance data. See my attempt to use an entropy-based approach to capture some anomalies here: Quantifying Imbalance in Computer Systems
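As a rough illustration of looking at windows rather than single points, the sketch below computes the Shannon entropy of each non-overlapping window after binning the values. It is neither the paper's windowing test nor the method from the CMG paper; the bin width, window size and function names are assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(values, bin_width):
    """Entropy in bits of the empirical distribution of `values`,
    after grouping them into bins of width `bin_width`."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    # sum of p * log2(1/p) over the occupied bins
    return sum((c / n) * math.log2(n / c) for c in bins.values())

def window_entropies(series, window, bin_width):
    """Entropy of each consecutive non-overlapping window of `series`."""
    return [shannon_entropy(series[i:i + window], bin_width)
            for i in range(0, len(series) - window + 1, window)]

# Hypothetical series: a flat window followed by a scattered one
series = [5, 5, 5, 5, 0, 1, 2, 3]
print(window_entropies(series, window=4, bin_width=1))
```

A window whose entropy differs sharply from that of the baseline windows is a candidate anomaly even if no single point in it crosses a threshold.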

Finally, the results of some tests are presented at the end of the paper. It is a really interesting comparison of different approaches. I am not sure they used MASF, and it would also be interesting to compare the results with SETDS… But in the “related work” part of the paper I unfortunately did not notice any recent, well-known and widely used implementations of anomaly detection techniques (except MASF), which are very well presented in this blog (including SEDS/SETDS).