SOURCE: HPL-2011-8, Statistical Techniques for Online Anomaly Detection in Data Centers - Wang, Chengwei; Viswanathan, Krishnamurthy; Choudur, Lakshminarayan; Talwar, Vanish; Satterfield, Wade; Schwan, Karsten
The subject of the paper is an excellent one, and this blog is the right place to discuss that type of material: you can find here numerous discussions about tools and methods that solve basically the same problem. Below are my comments on the key assumptions stated in the paper's introduction, some of which I have doubts about:
MASF uses a reference set as a baseline against which the statistical thresholds (UCL, LCL) are calculated. The original suggestion was to keep that baseline static (not changing over time), so the baseline is always the same. While developing my SETDS methodology I modernized the approach, and now SETDS mostly uses a baseline that slides from the past to the present, ending just where the most recent "actual data" starts (so the mean is actually a moving average). It is still a MASF-like way to build thresholds, but they change over time, self-adjusting to pattern changes. I call that "dynamic thresholding". BTW, after SETDS some other vendors implemented this approach, as you can see here: Baselining and dynamic thresholds features in Fluke and Tivoli tools
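To make the idea concrete, here is a minimal sketch of such dynamic thresholding (not the actual SETDS code); the window length, the 3-sigma multiplier, and the hourly-sample assumption are all illustrative:

    import pandas as pd

    def dynamic_thresholds(series: pd.Series, window: int = 24 * 7 * 4, k: float = 3.0):
        """Sliding MASF-like control limits: the baseline is a rolling window
        that ends just before each 'actual' point, so UCL/LCL self-adjust
        to pattern changes over time."""
        baseline_mean = series.shift(1).rolling(window).mean()  # moving average baseline
        baseline_std = series.shift(1).rolling(window).std()
        ucl = baseline_mean + k * baseline_std                   # upper control limit
        lcl = baseline_mean - k * baseline_std                   # lower control limit
        exceptions = (series > ucl) | (series < lcl)             # statistical exceptions
        return ucl, lcl, exceptions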
A few years ago I had an intensive discussion about the "normality" assumption for the data with the founder of the Alive (Integrien) tool (now part of VMware vCOPS): Real-Time Statistical Exception Detection. So vCOPS now has the ability to detect anomalies in real time by applying a non-parametric statistical approach. SETDS also has the ability to detect anomalies (my original term is statistical exceptions) in a real-time manner if applied to near-real-time data: Real-Time Control Charts for SEDS
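Here is a hedged sketch of what such a real-time, non-parametric check can look like (placeholder names, percentile levels, and window size; not the actual SEDS or vCOPS logic):

    import collections
    import numpy as np

    WINDOW = 24 * 7  # assumed size of the sliding reference history

    def realtime_exception(value, history, lo=5, hi=95):
        """Non-parametric real-time check: compare a new sample against
        percentile-based control limits from a sliding history."""
        lcl, ucl = np.percentile(history, [lo, hi])
        return value < lcl or value > ucl

    # usage on a near-real-time stream:
    # history = collections.deque(maxlen=WINDOW)
    # for sample in stream:
    #     if len(history) == WINDOW and realtime_exception(sample, history):
    #         print("statistical exception:", sample)
    #     history.append(sample)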
The other part of the paper mentions the usage of a multiple-time-dimension approach, which is not really new. I explored a similar one during my IT-Control chart development, treating the data as a cube with at least two time dimensions (weeks and hours) and also comparing the historical baseline with the most recent data; see details in the most popular post of this blog: One Example of BIRT Data Cubes Usage for Performance Data Analysis
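A rough sketch of that two-time-dimension aggregation, assuming a pandas DataFrame with a DatetimeIndex and a hypothetical cpu_util column (the one-week "actual" cutoff and 3-sigma limits are illustrative):

    import pandas as pd

    def it_control_chart(df: pd.DataFrame, value_col: str = "cpu_util") -> pd.DataFrame:
        """Treat the data as a cube with two time dimensions (day-of-week, hour-of-day):
        aggregate a historical baseline per cell and compare the most recent actuals."""
        df = df.copy()
        df["weekday"] = df.index.dayofweek
        df["hour"] = df.index.hour
        cutoff = df.index.max() - pd.Timedelta(days=7)   # last week = "actual" data
        baseline = df[df.index <= cutoff].groupby(["weekday", "hour"])[value_col]
        actual = df[df.index > cutoff].groupby(["weekday", "hour"])[value_col].mean()
        return pd.DataFrame({
            "baseline_mean": baseline.mean(),
            "ucl": baseline.mean() + 3 * baseline.std(),
            "lcl": baseline.mean() - 3 * baseline.std(),
            "actual": actual,
        })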
Section III of the paper describes the use of the "Tukey" method, which is definitely valid as a non-parametric way to calculate UCL and LCL (I should try to do that). I usually use just percentiles (e.g. UCL = the 95th percentile and LCL = the 5th) if the data are apparently not normally distributed.
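For comparison, a small sketch of both variants (the 1.5×IQR Tukey fences here are the textbook choice; the paper's exact multiplier may differ):

    import numpy as np

    def tukey_limits(x, k: float = 1.5):
        """Tukey's method: UCL/LCL from the quartiles and the IQR, no normality assumed."""
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return q3 + k * iqr, q1 - k * iqr        # UCL, LCL

    def percentile_limits(x, hi: float = 95, lo: float = 5):
        """The simpler alternative I usually use: plain percentile thresholds."""
        ucl, lcl = np.percentile(x, [hi, lo])
        return ucl, lcl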
Part B of Section III in the paper is about "windowing approaches". It is interesting, as it compares collections of data points and how well they fit a given distribution. It reminds me of another CMG paper that took a similar approach, calculating the entropy of different portions of the performance data. See my attempt to use an entropy-based approach to capture some anomalies here: Quantifying Imbalance in Computer Systems
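A minimal sketch of that entropy idea (the histogram bin count and window length are arbitrary choices, not taken from either paper):

    import numpy as np

    def window_entropy(x, bins: int = 10) -> float:
        """Shannon entropy of one window of samples, estimated from a histogram."""
        hist, _ = np.histogram(x, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def entropy_shifts(series, window: int = 288):
        """Entropy of consecutive non-overlapping windows; a sharp jump between
        neighboring windows may indicate a change in the underlying distribution."""
        chunks = [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
        ent = np.array([window_entropy(c) for c in chunks])
        return np.abs(np.diff(ent))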
Finally, the results of some tests are presented at the end of the paper. It is a really interesting comparison of different approaches; I am not sure they used MASF, and it would also be interesting to compare the results with SETDS... But in the "related work" part of the paper, unfortunately, I did not notice any of the recent, well-known and widely used implementations of anomaly detection techniques (except MASF) that are very well presented in this blog (including SEDS/SETDS).