System Management by Exception: HP techreport: "Statistical Techniques for Online Anomaly Detection in Data Centers". My critique.

Tuesday, November 12, 2013

SOURCE: HPL-2011-8 Statistical Techniques for Online Anomaly Detection in Data Centers - Wang, Chengwei; Viswanathan, Krishnamurthy; Choudur, Lakshminarayan; Talwar, Vanish; Satterfield, Wade; Schwan, Karsten

The subject of the paper is excellent, and this blog is a good place to discuss it, since you can find here numerous discussions of tools and methods that solve essentially the same problem. My doubts about the key assumptions in the paper's introductory paragraph are below:

1 MASF uses a reference set as a baseline, from which the statistical thresholds (UCL, LCL) are calculated. The original suggestion was to keep that reference set static (not changing) over time, so the baseline is always the same. While developing my SETDS methodology I modernized the approach, and SETDS now mostly uses a baseline that slides from past to present, ending just where the most recent “actual data” starts (so the mean is actually a moving average!). It is still a MASF-like way to build thresholds, but they change over time, self-adjusting to pattern changes. I call that “dynamic thresholding”. BTW, after SETDS some other vendors implemented this approach, as you can see here: Baselining and dynamic thresholds features in Fluke and Tivoli tools

2 A few years ago I had an intensive discussion about the data “normality” assumption with the founder of the Alive (Integrien) tool (now part of VMware vCOPS): Real-Time Statistical Exception Detection. So vCOPS now has the ability to detect real-time anomalies by applying a non-parametric statistical approach. SETDS can also detect anomalies (my original term is statistical exceptions) in real time if applied to near-real-time data: Real-Time Control Charts for SEDS
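The sliding baseline (“dynamic thresholding”) described above can be sketched roughly as follows. This is only an illustration, not the SETDS implementation: the function names, the baseline length of 6 points, and the limits of mean plus or minus 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev

def dynamic_thresholds(history, baseline_len=6, k=3.0):
    """Return (lcl, ucl) from the last `baseline_len` points of `history`.

    `history` must not include the most recent "actual" data, so the
    baseline ends exactly where the actual data begins.
    baseline_len and k are illustrative choices, not SETDS settings.
    """
    baseline = history[-baseline_len:]
    center = mean(baseline)   # moving average over the sliding window
    spread = stdev(baseline)  # sample standard deviation of the window
    return center - k * spread, center + k * spread

def find_exceptions(series, baseline_len=6, k=3.0):
    """Flag each point that falls outside limits built from the points before it."""
    exceptions = []
    for i in range(baseline_len, len(series)):
        lcl, ucl = dynamic_thresholds(series[:i], baseline_len, k)
        if not lcl <= series[i] <= ucl:
            exceptions.append((i, series[i]))
    return exceptions

# Made-up CPU utilization samples with one spike.
cpu = [50, 52, 49, 51, 50, 48, 51, 90, 52, 50]
print(find_exceptions(cpu))  # [(7, 90)]
```

Note that once the spike slides into the baseline it widens the limits for the following points, which is the price of a self-adjusting baseline; a static reference set would not have that effect.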

The other part of the paper mentions a multiple-time-dimension approach, which is not really new. I explored a similar one during my IT-Control chart development, treating the data as a data cube with at least two time dimensions (weeks and hours) and also comparing the historical baseline with the most recent data; see details in the most popular post of this blog: One Example of BIRT Data Cubes Usage for Performance Data Analysis
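To make the two time dimensions concrete, here is a small sketch that groups baseline samples by hour of the week across several weeks and then checks the most recent week against per-hour limits. The tuple layout, the function names, and the 3-sigma limits are assumptions for illustration; this is not the BIRT data cube from the linked post.

```python
from collections import defaultdict
from statistics import mean, stdev

def weekhour_limits(samples, k=3.0):
    """Build {hour_of_week: (lcl, ucl)} from baseline samples.

    `samples` is an iterable of (week, hour_of_week, value) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    by_hour = defaultdict(list)
    for _week, hour, value in samples:
        by_hour[hour].append(value)
    limits = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # stdev needs at least two baseline weeks
        m, s = mean(values), stdev(values)
        limits[hour] = (m - k * s, m + k * s)
    return limits

def flag_latest_week(limits, latest):
    """Return the hours of the latest week whose value is outside its limits."""
    return sorted(
        hour for hour, value in latest.items()
        if hour in limits and not limits[hour][0] <= value <= limits[hour][1]
    )

# Four baseline weeks for two hours of the week (made-up numbers).
baseline = [(w, 3, v) for w, v in enumerate([10, 12, 11, 9])]
baseline += [(w, 9, v) for w, v in enumerate([60, 62, 61, 59])]
latest = {3: 40, 9: 61}
print(flag_latest_week(weekhour_limits(baseline), latest))  # [3]
```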

Section III of the paper describes the “Tukey” method, which is definitely valid as a non-parametric way to calculate UCL and LCL (I should try that). I usually just use percentiles (e.g. UCL = 95th percentile and LCL = 5th percentile) if the data are apparently not normally distributed.
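For comparison, a minimal sketch of both non-parametric options, using only the Python standard library. The multiplier k = 1.5 is the conventional Tukey fence and is an assumption here; the paper may use a different value. The sample data and function names are made up for the example.

```python
from statistics import quantiles

def tukey_limits(data, k=1.5):
    """Tukey fences: (Q1 - k*IQR, Q3 + k*IQR). k=1.5 is an assumed default."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def percentile_limits(data, lower=5, upper=95):
    """Percentile limits: LCL = lower-th percentile, UCL = upper-th percentile."""
    cuts = quantiles(data, n=100, method="inclusive")  # cuts[p - 1] is the p-th percentile
    return cuts[lower - 1], cuts[upper - 1]

# Made-up baseline response times (ms) and two new observations.
baseline = [12, 13, 13, 14, 14, 15, 15, 16, 17, 18, 19]
for new_value in (20, 95):
    t_lcl, t_ucl = tukey_limits(baseline)
    p_lcl, p_ucl = percentile_limits(baseline)
    print(new_value,
          "tukey exception:", not t_lcl <= new_value <= t_ucl,
          "percentile exception:", not p_lcl <= new_value <= p_ucl)
```

On this data the percentile limits are tighter than the Tukey fences, so a value of 20 is an exception for the percentile limits but not for Tukey, while 95 is an exception for both.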

Part B of Section III of the paper is about “windowing approaches”. It is interesting because it compares collections of data points by how well they fit a given distribution. It reminds me of a CMG paper that took a similar approach, calculating the entropy of different portions of the performance data. See my attempt to use an entropy-based approach to capture some anomalies here: Quantifying Imbalance in Computer Systems
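One way to picture the windowing idea is to bin each window of data and score how far its histogram is from the baseline histogram using relative entropy (Kullback-Leibler divergence). This is a sketch under assumptions, not the test used in the paper or in the CMG paper: the bin edges, the add-one smoothing, and all function names are chosen only for the example.

```python
from collections import Counter
from math import log2

def histogram(values, edges):
    """Bin probabilities for `values`, with add-one smoothing so no bin is zero.

    `edges` are sorted bin boundaries; len(edges) + 1 bins are produced.
    """
    counts = Counter(sum(v >= e for e in edges) for v in values)
    nbins = len(edges) + 1
    total = len(values) + nbins
    return [(counts[b] + 1) / total for b in range(nbins)]

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q))

def window_scores(series, baseline, edges, width):
    """Score each non-overlapping window of `series` against the baseline histogram."""
    q = histogram(baseline, edges)
    return [
        relative_entropy(histogram(series[i:i + width], edges), q)
        for i in range(0, len(series) - width + 1, width)
    ]

# Made-up data: the second window is stuck at one level.
baseline = [10, 20, 30, 40] * 5
series = [10, 20, 30, 40, 40, 40, 40, 40]
scores = window_scores(series, baseline, edges=[15, 25, 35], width=4)
print(scores)  # first score is 0.0, second is clearly larger
```

A window whose score exceeds some chosen threshold would be reported as anomalous; picking that threshold is a separate problem that this sketch does not address.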

Finally, the results of some tests are presented at the end of the paper. It is a really interesting comparison of different approaches. I am not sure they used MASF, and it would also be interesting to compare the results with SETDS. Unfortunately, in the “related work” part of the paper I did not notice any recent, well-known and widely used implementations of anomaly detection techniques (except MASF), which are well presented in this blog (including SEDS/SETDS).

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other tech. presentations at IBM z/Series Expo, SPEC.org, Southern and Central Europe CMG and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). Author of the “Performance Anomaly Detection” class (at CMG.com). He worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as IT Manager at the Cloud Engineering and since 2015 he is a member of CMG.org Board of Directors. Runs UT channel iTrubin
