System Management by Exception: October 2012

Wednesday, October 24, 2012

Non-MASF-Based Statistical Techniques (Entropy-Based) for Anomaly Detection in Data Centers (and Clouds)

The following papers, published on Mendeley, criticize the MASF Gaussian assumption and offer other methods (Tukey and Relative Entropy) for detecting anomalies statistically. (BTW, I tried to use entropy analysis to capture performance anomalies; check my other post.)

1. Statistical techniques for online anomaly detection in data centers
by Chengwei Wang, Krishnamurthy Viswanathan, Lakshminarayan Choudur, Vanish Talwar, Wade Satterfield, Karsten Schwan

Abstract

Online anomaly detection is an important step in data center management, requiring light-weight techniques that provide sufficient accuracy for subsequent diagnosis and management actions. This paper presents statistical techniques based on the Tukey and Relative Entropy statistics, and applies them to data collected from a production environment and to data captured from a testbed for multi-tier web applications running on server class machines. The proposed techniques are lightweight and improve over standard Gaussian assumptions in terms of performance.

2. Online detection of utility cloud anomalies using metric distributions

by Chengwei Wang, V Talwar, K Schwan, P Ranganathan

Abstract

The online detection of anomalies is a vital element of operations in data centers and in utility clouds like Amazon EC2. Given ever-increasing data center sizes coupled with the complexities of systems software, applications, and workload patterns, such anomaly detection must operate automatically, at runtime, and without the need for prior knowledge about normal or anomalous behaviors. Further, detection should function for different levels of abstraction like hardware and software, and for the multiple metrics used in cloud computing systems. This paper proposes EbAT - Entropy-based Anomaly Testing - offering novel methods that detect anomalies by analyzing for arbitrary metrics their distributions rather than individual metric thresholds. Entropy is used as a measurement that captures the degree of dispersal or concentration of such distributions, aggregating raw metric data across the cloud stack to form entropy time series. For scalability, such time series can then be combined hierarchically and across multiple cloud subsystems. Experimental results on utility cloud scenarios demonstrate the viability of the approach. EbAT outperforms threshold-based methods with on average 57.4% improvement in accuracy of anomaly detection and also does better by 59.3% on average in false alarm rate with a `near-optimum' threshold-based method.

3. EbAT : Online Methods for Detecting Utility Cloud Anomalies

Chengwei Wang in Middleware (2009)

4. Performance Metric Selection for Autonomic Anomaly Detection on Cloud Computing Systems

Song Fu in 2011 IEEE Global Telecommunications Conference GLOBECOM 2011 (2011)

5. Mining anomalies using traffic feature distributions

Anukool Lakhina, Mark Crovella, Christophe Diot in ACM SIGCOMM Computer Communication Review (2005)

6. Krishnamurthy Viswanathan, Lakshminarayan Choudur, Vanish Talwar et al. (2012) Ranking Anomalies in Data Centers, 1-8. In NOMS.

7. Greg Eisenhauer, Matthew Wolf, Chengwei Wang (2010) Monalytics : Online Monitoring and Analytics for Managing Large Scale Data Centers. In ICAC.

8. Fast Anomaly Detection for Large Data Centers

Ang Li, Lin Gu, Kuai Xu in 2010 IEEE Global Telecommunications Conference GLOBECOM 2010 (2010)

9. Online Reactive Anomaly Detection over Stream Data

Yan Fu, Jun-Lin Zhou, Yue Wu in 2008 International Conference on Apperceiving Computing and Intelligence Analysis (2008)

10. Semantic anomaly detection in online data sources

O Raz, P Koopman, M Shaw in Proceedings of the 24th International Conference on Software Engineering ICSE 2002 (2002)

11. Statistical anomaly detection via httpd data analysis

Daniel Q Naiman in Computational Statistics & Data Analysis (2004)

12. A comparative study of real-valued negative selection to statistical anomaly detection techniques

T Stibor, J Timmis, C Eckert in Comparative and General Pharmacology (2005)
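To make the criticism of the Gaussian assumption concrete, here is a minimal sketch (not the algorithm from any of the papers listed above) that compares Tukey fences with a plain mean ± 3σ band on one small sample. The sample values, the function names and the fence multiplier k=1.5 are assumptions chosen for illustration only; a real MASF setup would also compute its limits per time bucket rather than over one global sample.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def three_sigma_outliers(values):
    """Return values outside mean +/- 3 sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > 3 * sd]

# Hypothetical CPU-utilization samples (%) with one obvious spike.
samples = [10, 11, 12, 11, 10, 12, 11, 95]

print("Tukey fences flag:", tukey_outliers(samples))
print("Mean +/- 3 sigma flags:", three_sigma_outliers(samples))
```

On this sample the Tukey fences flag the value 95, while the 3σ band flags nothing, because the spike itself inflates the standard deviation. That masking effect is one common argument for quartile-based limits when the data are not close to Gaussian.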

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years, and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.

Tuesday, October 23, 2012

MASF Control Charts Against DB2 Performance Data

I have done that before... For that I used my own variation of the MASF Control Chart, called the "IT-Control Chart". You can see an example in my older post: Power of Control Charts and IT-Chart Concept (Part 1).

But I am not the only one who does that! I have found the following paper in MeasureIt:

Capacity Planning has an Important Role in Assisting Senior IT Management, April 2007, written by Rick Isom.

The paper has a good MASF reference, a list of DB2 performance metrics that are good candidates for analysis with MASF Control Charts, and a few examples of Control Charts in the form of a 24-hour profile. One example is below (linked to the picture from the original paper published on the Internet):

BTW, the actual data curve is hourly aggregated data for a particular month (October), compared against the historical baseline. A similar approach was taken in the exercise I published in the following post: Adrian Heald: A simple control chart using Captell Version 6
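The 24-hour profile idea can be sketched in a few lines: build per-hour limits from historical data and compare the current hourly values against them. This is only an illustration under stated assumptions, not the method from the MeasureIt paper or the IT-Control Chart itself; the function names, the mean ± 3σ limits and the sample numbers are all hypothetical.

```python
import statistics
from collections import defaultdict

def hourly_baseline(history, sigmas=3.0):
    """Build per-hour (lower, mean, upper) limits from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    limits = {}
    for hour, values in by_hour.items():
        mean = statistics.mean(values)
        # A single observation gives no spread estimate; use zero width.
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        limits[hour] = (mean - sigmas * sd, mean, mean + sigmas * sd)
    return limits

def exceptions(actual, limits):
    """Return the hours whose actual value falls outside the baseline limits."""
    flagged = []
    for hour, value in sorted(actual.items()):
        lower, _, upper = limits[hour]
        if value < lower or value > upper:
            flagged.append(hour)
    return flagged

# Hypothetical history: three earlier observations each for hours 9 and 10.
history = [(9, 40), (9, 42), (9, 44), (10, 50), (10, 52), (10, 54)]
# Hypothetical current values (e.g. hourly aggregates for the month under review).
actual = {9: 43, 10: 75}

limits = hourly_baseline(history)
print("Hours outside the control limits:", exceptions(actual, limits))
```

With these numbers hour 9 stays inside its limits (36 to 48) and hour 10 is reported as an exception, since 75 is above the upper limit of 58.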

Igor Trubin

Saturday, October 20, 2012

Theory of Anomaly Detection: Stanford University Video Lectures

This is part of the Machine Learning lectures: https://class.coursera.org/ml/lecture/preview/index

XV. Anomaly Detection (Week 9)

Igor Trubin

Tuesday, October 16, 2012

Availability vs. Capacity

Continuing the previous posts about the "Battle between 'Gut-feeling' and Engineering"...

Engineer 2: Igor - if you would like to find further extensions of your equation, you might check out Volume 1 of "Breaking the Availability Barrier," which I co-authored and which is available on Amazon. Also, check out several papers I published in the Availability Digest in the Geek Corner (http://www.availabilitydigest.com/articles.htm). A subscription is free.

Igor Trubin: Yes, I have already looked briefly at your book and referenced it in my blog. It is a very good book and I plan to read all of it.

UPDATE: After reading one of the papers suggested above ("Calculating Availability - Redundant Systems": "Some useful rules come out of the derivation of the availability equation."), I was able to show my client, using the following more general cluster (system) availability formula, that for a three-node cluster (n=3) the configuration with one spare node (s=1) could provide approximately the same cluster availability as the configuration with two spare nodes (s=2), but capacity usage could be a critical factor, as seen below.

That could be a way to save money: allocate less capacity by using the same number of nodes, but more reliable ones.
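The trade-off between spare nodes and capacity can be explored with a small sketch. It assumes a standard binomial model with independent, identical nodes, in which n is the total number of nodes and the cluster stays up while at most s of them are down; that interpretation of n and s, the node availability of 0.99 and the function names are assumptions made here, and the formula may differ from the one used for the client.

```python
from math import comb

def cluster_availability(n, s, a):
    """Availability of an n-node cluster that tolerates up to s failed nodes.

    Assumes independent nodes, each with availability a (binomial model).
    """
    f = 1.0 - a
    return sum(comb(n, i) * f**i * a**(n - i) for i in range(s + 1))

def max_normal_utilization(n, s):
    """Highest per-node utilization whose total load still fits on n - s nodes."""
    return (n - s) / n

node_a = 0.99  # hypothetical availability of a single node
for spares in (1, 2):
    avail = cluster_availability(3, spares, node_a)
    util = max_normal_utilization(3, spares)
    print(f"n=3, s={spares}: availability={avail:.6f}, "
          f"max per-node utilization={util:.0%}")
```

The second function is the capacity side of the question: with two spares each node may normally run at no more than a third of its capacity, while with one spare it may run at up to two thirds.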

Igor Trubin

Monday, October 8, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering. Round 3. (1 and 2 are in the previous posts)

UPDATE: the start is here

"Gut-feeler 1" • A comment on the math for availability calculations: Certainly not being a math hero, I can still understand that playing with formulas can be much fun. But we ought to take care whether the results generated by these formulas are actually producing value in real world situations.

Misleading formulas can be dangerous and ought to be kept in a pen and paper / chalk and blackboard environment. When those formulas can be "googled", there is the risk that some younger and less experienced person working in the role of an IT architect takes them at face value and bases real world decisions on those - or even worse, someone might write a related Wikipedia article and multiply the damage ...

Igor's initial formula suggests that any four node cluster is a hundred times more reliable than a two node cluster, and that any cluster can easily achieve 99.999% availability if you only add enough nodes. Bill's revised formula would suggest an even stronger growth in availability when adding cluster nodes ...

"Engineer 2" has already pointed out that these are moot calculations, as the formulas are just not applicable in those real world environments we are talking about. You would not construct a large building or a bridge using elegant formulas that produce over-optimistic results which are in stark contrast with real world experience. Everybody knows that after the collapse, the architect would go to jail ...

IT architects also carry responsibility - it might be somewhat limited when designing a webshop for selling cosmetics or toys, when only the shop owner would be disappointed after investing in additional nodes and not getting the expected reliability in return. But, for instance, when building 911 emergency communication systems controlling police, ambulance and fire brigade services, lives are at stake and could be lost due to system outages. Here, only the most reliable IT infrastructure is good enough - and creating false expectations with misleading formulas would be fatal.

"Engineer 3": "In Theory, there is no difference between Theory and Practice. In Practice, there is." I don't know where I found the quote above, but I like it. Nevertheless, I think one should in fact do both for any system of significant importance:

(1) Use Math, a.k.a. "Theory", to calculate the expected availability and adjust as needed to match the required availability. Ignoring Math/Theory and replacing it with only gut feeling and/or trust in vendor statements does not sound right to me.
(2) Apply gut feeling, maybe better called experience or "Practice".
Combine the two and you are off to a good start IMHO...

"Engineer 1" • ... I disagree with Gut-feeler 1's bridge-building analogy. My father was involved in the design and construction of real bridges, and he always started with pretty heavy math modeling and calculation. I cannot imagine that modern bridges are built just by "gut feeling". Again, I agree with "Engineer 3": I am for an engineering approach to cluster design that combines math modeling with best-practice experience.

UPDATE: other rounds are here

Igor Trubin

Wednesday, October 3, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering. Round 2.

This is a continuation of the previous post.

Gut-feeler 1 • Gut feelings aren't always bad - for instance, when working on an IT project meant to support really critical business processes, and hence with a lot of money or even lives at stake, your gut feeling might be that standard clustering just isn't good enough and you need something significantly better here.

Then it comes to curiosity - is there something better around than the usual standard clustering? Is there some other IT infrastructure, one that is fault tolerant and self-healing, providing much more reliability right out of the box than you could ever achieve using the plain vanilla stuff and the most sophisticated clustering conceivable?

If you are that curious, chances are you will end up at NonStop.

Not being curious and just doing calculations, you might end up adding more nodes to your standard cluster, hoping to make it more reliable - which in the real world often turns out to be a false hope ...

But don't get me wrong, I'm not at all against calculations. A very important one is on capacity: will (n - 1) nodes still support my workload when a node goes down for whatever reason? That's often overlooked ...

Engineer 2 • Two comments, one on the Trubin law, and one on .. comments [above]. .... Adding a node of m 9s to a node of m 9s adds m 9s to a cluster. The overall availability of a cluster of n nodes, each with m 9s availability, is mn 9s.... For instance, a three-node cluster of nodes with 2 9s availability will have six nines...

Let f = the failure probability of a node [A = 1 - f]. If the node availability is a whole number of nines, then the failure probability of a node is f = 0.1^m, where m is the number of nines (for instance, for three nines, f = 0.1^3 = 0.001 and A = 1 - 0.1^3 = 0.999). For an n-node cluster, the availability is 1 - (1 - A)^n = 1 - [1 - (1 - 0.1^m)]^n = 1 - (0.1^m)^n = 1 - 0.1^(mn).

In general, if a node has a failure probability of f, then an n-node cluster has an availability of 1 - f^n. Two nodes with availabilities of 0.95 will have an availability of 0.9975.

Of course, this assumes that the cluster fails only if all nodes fail. Generally, a cluster can withstand the failure of some nodes but not all. In this case, the above relations can be modified to accommodate this situation.

"Gut-feeler 1"'s suggestion that adding nodes does not result in this additional availability is quite correct. The above relations apply only to hardware failures (or whatever failures might be included in the nodal availability), and are accurate for those. However, once the hardware availability becomes high (say four 9s), other factors that are not node related come into play, such as software bugs, operator errors, and environmental faults (power, air conditioning). These limit the practical availability that can be achieved. In effect, after a certain point, increases in hardware availability become irrelevant to system availability.

Thanks for starting a very interesting and meaningful thread, Igor.
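Engineer 2's arithmetic above is easy to check numerically. The sketch below assumes, as the comment does, that node failures are independent and that the cluster is down only when all n nodes are down; the function names are made up for this illustration.

```python
import math

def cluster_availability(f, n):
    """Availability of n redundant nodes, each failing with probability f.

    Assumes independent failures and an outage only when all nodes are down.
    """
    return 1.0 - f**n

def nines(f, n):
    """Number of nines of the cluster: -log10 of its failure probability."""
    return -math.log10(f**n)

# Three nodes with two nines each (f = 0.01).
print("availability:", cluster_availability(0.01, 3), "nines:", nines(0.01, 3))

# Two nodes with availability 0.95 each (f = 0.05).
print("availability:", cluster_availability(0.05, 2))
```

Up to floating-point rounding, the first case gives six nines and the second gives 0.9975, which matches the numbers in the comment. It says nothing about the non-hardware factors mentioned there, which this simple model ignores.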

See the next post for the next round.

Igor Trubin

Tuesday, October 2, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering!

I posted my Cluster Availability 9's Equation post to the LinkedIn Continuous Availability forum and got 19 comments, divided into roughly two camps: "Gut-feelers" and "Engineers". Below are the first two comments. (See the next posts for other comments.)

Gut-feeler 1: "Nice formula, looks good in theory – but won’t hold true in the real world. The reason for this is the underlying assumption of an ideal cluster, which does not exist in the real world. When looking at some real world implementation like the Oracle RAC cluster, you will find that a simple two-node cluster configuration will typically deliver somewhere between three and four nines of availability.

Now, will adding a third node to that cluster add another 9 to the availability figure? Will a ten node cluster really provide 99.999999999999 % availability? Will a cluster with a hundred nodes run continuously for thousands of years without any outage?

Certainly not, and talking to system administrators running typical cluster installations will quickly reveal that large clusters are quite complex and difficult to handle, hence more prone to failure than simple two-node clusters.

Even when looking at the HP NonStop architecture – which comes pretty close to the ideal cluster – the formula would not apply. A NonStop system (which internally is indeed a cluster, each NonStop CPU resembling a node) delivers roughly five nines of availability – but there is no significant availability difference between systems with, e.g., four and sixteen CPUs (cluster nodes).

So it is not so important how many nodes you have – but it is very important what kind of cluster you have!"

Engineer 1 • I know this particular formula is too simple for the real world, so I completely agree with your comment. But the complexity of big clusters can still be modeled by more complex math models, e.g. by adding more boxes with parallel and series types of connections. The formula will be much uglier, but useful, I believe... Plus, each individual node could be decomposed into some structure to model both the HW and SW parts. The approach is described in some books I mentioned in my other posts, and there is supposed to be a tool to do that, but I am not aware of any. Are you?

In my real life this type of calculation/modeling is just a starting point to get a rough estimate, which is then adjusted using some monitoring data. What I do not like is when some architects decide on the level of cluster redundancy without ANY calculation (!), just based on their gut feelings....
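The "boxes with parallel and series connections" idea can be sketched as two small helper functions that are then composed. This is only an illustration assuming independent components; the availability figures, the shared-storage box and the function names are hypothetical, and it is not the tool Engineer 1 is asking about.

```python
import math

def series(*availabilities):
    """All components are required, so their availabilities multiply."""
    return math.prod(availabilities)

def parallel(*availabilities):
    """Any one component suffices, so the failure probabilities multiply."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)

# Hypothetical figures: each node is hardware and software in series.
hw, sw = 0.999, 0.99
node = series(hw, sw)

# Two such nodes in parallel form the redundant part of the cluster.
two_node_cluster = parallel(node, node)

# A hypothetical shared component (e.g. storage) in series with the pair.
shared_storage = 0.9999
system = series(two_node_cluster, shared_storage)

print(f"node={node:.5f} cluster={two_node_cluster:.7f} system={system:.7f}")
```

Nesting the two helpers is what makes the formula "uglier but useful": the shared box caps the system availability no matter how many nodes are added in parallel, which is one way to express the objection that more nodes do not automatically mean more nines.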

(NB: Real names can be found on the actual LinkedIn forum thread)
UPDATE: see the start point here

Igor Trubin
