Wednesday, November 20, 2013

MSDN Blog post: "Statistical Process Control Techniques in Performance Monitoring and Alerting" by M. Friedman

I met Mark Friedman again at CMG'13 and also attended his session (I will post my impressions later).

Mark is my teacher, and I respect him very much. I once attended his Windows Capacity Management class in Chicago. I always try to go to his presentations, read his books and follow his online activities. Just today, while checking his activities online, I ran into his 2010 MSDN Blog post, which relates very closely to this blog:

MSDN Blogs Developer Division Performance Engineering blog > Statistical Process Control Techniques in Performance Monitoring and Alerting

I very much appreciate that he mentioned my blog and my name (with a little misprint...):

".... a pointer to Igor Trobin's work, which I believe is very complementary. Igor writes an interesting blog called “System Management by Exception.” In addition, Jeff Buzen and Annie Shum published a very influential paper on this subject called “MASF: Multivariate Adaptive Statistical Filtering” back in 1995. (Igor’s papers on the subject and the original Buzen and Shum paper are all available at www.cmg.org.)... "

Mark's post was a response to a critique of Charles Loboz's CMG paper made by Uriel Carrasquilla, a Microsoft performance analyst. I attended that presentation and had some doubts too, which I expressed during the presentation. BTW, I have commented on another of Charles's CMG papers in my blog: Quantifying Imbalance in Computer Systems: CMG'11 Trip Report. In my opinion, that CMG'11 paper was much better!
(Normalized Imbalance Coefficient, from the paper)

BTW, I have also commented on Mark Friedman's CMG'08 paper: Mainstream NUMA and the TCP/IP stack. His presentation was, as usual, very influential! See details in my CMG'08 Trip Report.

And I am about to comment on his CMG'13 presentation. Check the next post!

CMG’13 workshops: "Application Profiling: Telling a story with your data"

The subject was introduced by R. Gilmarc (CA) in his CMG’11 paper: IT/EV-Charts as an Application Signature: CMG'11 Trip Report, Part 1. This time he showed us some further development of the idea, such as “BIFR”:

What is in our Application Profile?
• Workload – description of transaction arrival pattern
• Infrastructure – subset of infrastructure supporting our application
• Flow – server-to-server workflow
• Resource – CPU and I/O consumed per transaction at each server

Why is an Application Profile useful?
• Prerequisite for application performance analysis and capacity planning
• Directs & focuses application performance tuning efforts
• Building block for data center capacity planning
• Serves as input to a model

Some modeling approaches were included in the Application Profile idea (e.g. CPU% vs. business transactions), and the flow is presented as a diagram from the HyPerformix tool, which is now a CA tool.
I see that the BIFR profile is suitable as input for a predictive model run on the Performance Optimizer part of HyPerformix.

Also interesting is the attempt to use BIFR for consolidation of virtual servers (LPARs), which includes TPP (Total Processing Power) benchmarks. Most interesting is the use of a “Composite Resource Usage Index” to identify LPARs that have high resource usage across all three measures: TPP Percent, I/O Percent and Memory Percent. It looks like it allows LPARs to be combined optimally on different physical hosts in a “Tetris” way.

I appreciate that he mentioned my name in the slides (in the “related work” section), and during his presentation there was some discussion about IT-Control Charts. I still believe that an IT-Control Chart without the actual data plotted (see below a copy from my old post) could be a perfect representation of any application and can also be treated as an application profile! Such charts should be built for the usage of the main server resources (CPU, memory and I/O) and also for the main business transactions and response time (I published a couple of examples in my other papers).

For consolidation or workload placement exercises they can be condensed to a few numbers per application, for instance, maximum of weekly upper limits for each chart. Those numbers could be treated as application profile parameters and then used for placing/moving (in a cloud) purposes, for example to be analyzed by some statistical clustering algorithms. By the way, other Cloud management tools already do similar profiling for this. (CiRBA
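To make the "few numbers per application" idea concrete, here is a minimal sketch in Python. It assumes each chart has already been reduced to a list of weekly upper limits (one UCL per hour of the week); the function name, the metric names and the sample values are hypothetical and only illustrate the condensing step, not any particular tool.

```python
def profile_from_charts(upper_limits_by_metric):
    """Condense each control chart to one number: the maximum of its
    weekly upper limits. Input maps a metric name to a list of UCL values."""
    return {metric: max(ucls) for metric, ucls in upper_limits_by_metric.items()}


# Hypothetical weekly upper limits for one application (shortened lists).
charts = {
    "cpu_pct": [40.0, 55.0, 70.0, 62.0],
    "memory_pct": [30.0, 35.0, 33.0, 31.0],
    "io_per_sec": [120.0, 400.0, 250.0, 180.0],
}

profile = profile_from_charts(charts)
print(profile)
```

The resulting dictionary is one row per application, which is the kind of input a clustering or placement algorithm could take.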

Another interesting idea that was also presented in the workshop is “Application invariants”. I may discuss that in another post…

Tuesday, November 12, 2013

HP techreport: "Statistical Techniques for Online Anomaly Detection in Data Centers". My critique.

SOURCE: HPL-2011-8 Statistical Techniques for Online Anomaly Detection in Data Centers - Wang, Chengwei; Viswanathan, Krishnamurthy; Choudur, Lakshminarayan; Talwar, Vanish; Satterfield, Wade; Schwan, Karsten

The subject of the paper is extremely good, and this blog is the place to discuss this type of matter, as you can find here numerous discussions about tools and methods that solve basically the same problem. Below is the introductory paragraph with the key assumptions of the paper that I have some doubts about:

1. MASF uses a reference set as a baseline, from which the statistical thresholds (UCL, LCL) are calculated. Originally the suggestion was to keep it static (not changing over time), so the baseline is always the same. While developing my SETDS methodology I modernized the approach, and now SETDS mostly uses a baseline that slides from past to present, ending just where the most recent “actual data” starts (and the mean is actually a moving average!). So it is still a MASF-like way to build thresholds, but they change over time, self-adjusting to pattern changes. I call that “dynamic thresholding”. BTW, after SETDS some other vendors implemented this approach, as you can see here: Baselining and dynamic thresholds features in Fluke and Tivoli tools

2. A few years ago I had an intensive discussion about the data “normality” assumption with the founder of the Alive (Integrien) tool (now part of VMware vCOPS): Real-Time Statistical Exception Detection. So vCOPS now has the ability to detect real-time anomalies by applying a non-parametric statistical approach. SETDS can also detect anomalies (my original term is statistical exceptions) in real time if applied to near-real-time data: Real-Time Control Charts for SEDS
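The difference between a static and a sliding baseline can be sketched in a few lines of Python. This is only an illustration under my own assumptions: limits are taken as mean plus or minus k standard deviations with k = 3, the history holds past values for one time slot (for example, the same hour of the week), and the function names are hypothetical. It is not the actual SETDS or MASF code.

```python
from statistics import mean, stdev


def dynamic_limits(history, baseline_len, k=3.0):
    """Return (lcl, center, ucl) computed from the last `baseline_len`
    points of `history` (oldest first, newest actual value excluded).
    The k-sigma rule and k=3 are assumptions of this sketch."""
    baseline = history[-baseline_len:]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    center = mean(baseline)   # moving average over the sliding baseline
    spread = stdev(baseline)  # sample standard deviation
    return center - k * spread, center, center + k * spread


def is_exception(actual, history, baseline_len, k=3.0):
    """True if `actual` falls outside the limits built from the baseline."""
    lcl, _, ucl = dynamic_limits(history, baseline_len, k)
    return actual < lcl or actual > ucl


# A level shift: old values near 10, recent values near 50.
history = [10, 10, 11, 9, 50, 51, 49, 50]
print(is_exception(50, history[:4], 4))  # static baseline frozen on old data
print(is_exception(50, history, 4))      # sliding baseline follows the shift
```

With the frozen baseline the new level keeps being flagged forever; with the sliding one the thresholds adjust once the baseline window has moved onto the new pattern.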

The other part of the paper mentions the use of a multiple time dimension approach, which is not really new. I explored a similar one during my IT-Control Chart development by treating the data as a data cube with at least two time dimensions, weeks and hours, and also comparing the historical baseline with the most recent data; see details in the most popular post of this blog: One Example of BIRT Data Cubes Usage for Performance Data Analysis
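As a rough illustration of the two time dimensions, the Python sketch below cuts a flat hourly series into weeks and then transposes it, so every hour of the week gets its own history across weeks. The function names are hypothetical, and the series is assumed to start at hour 0 of a week; this is not the BIRT data cube itself, just the same idea in plain lists.

```python
def split_weeks(hourly_values, hours_per_week=168):
    """Cut a flat hourly series (oldest first) into complete weeks.
    A trailing incomplete week is dropped."""
    full_weeks = len(hourly_values) // hours_per_week
    return [
        hourly_values[w * hours_per_week:(w + 1) * hours_per_week]
        for w in range(full_weeks)
    ]


def by_hour_of_week(weeks):
    """Transpose weeks x hours into hours x weeks, so each inner list
    holds the values observed at the same hour of the week."""
    return [list(column) for column in zip(*weeks)]


def baseline_and_actual(weeks):
    """Treat all weeks but the last as the historical baseline and the
    last week as the most recent (actual) data."""
    return by_hour_of_week(weeks[:-1]), weeks[-1]


# Tiny demo with a 3-"hour" week to keep the output readable.
weeks = split_weeks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], hours_per_week=3)
baseline, actual = baseline_and_actual(weeks)
print(baseline, actual)
```

Each inner list of the baseline can then be fed to whatever limit calculation is used, and the result compared with the matching hour of the actual week.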

Section III of the paper describes how to use the “Tukey” method, which is definitely valid as a non-parametric way to calculate UCL and LCL (I should try that). I usually use just percentiles (e.g. UCL = 95th and LCL = 5th percentile) if the data are apparently not normally distributed.
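Both non-parametric options are easy to sketch with the Python standard library. In the example below the Tukey limits are the usual quartile fences with a multiplier k = 1.5, which is my assumption and not necessarily the value used in the paper; the function names and the choice of the "inclusive" quantile method are also mine.

```python
from statistics import quantiles


def tukey_limits(values, k=1.5):
    """Return (lcl, ucl) as Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def percentile_limits(values, low=5, high=95):
    """Return (lcl, ucl) as the `low`-th and `high`-th percentiles."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[low - 1], cuts[high - 1]


data = list(range(1, 102))  # 1..101, deliberately not normal (uniform)
print(tukey_limits(data))
print(percentile_limits(data))
```

Neither function assumes a normal distribution, which is the point: the limits come from the ordering of the data, not from a mean and a standard deviation.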

Part B of Section III of the paper is about “windowing approaches”. It is interesting, as it compares collections of data points by how well they fit a given distribution. It reminds me of another CMG paper that had a similar approach of calculating the entropy of different portions of the performance data. See my attempt to use an entropy-based approach to capture some anomalies here: Quantifying Imbalance in Computer Systems
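To show what "entropy of different portions of the data" can look like, here is a small Python sketch. It is not the algorithm from the HP paper or from the CMG paper: the fixed-width binning, the non-overlapping windows and the function names are all my assumptions. It only computes the Shannon entropy of a histogram per window, so that windows with an unusual spread of values stand out.

```python
from collections import Counter
from math import log2


def shannon_entropy(values, bin_width):
    """Shannon entropy (in bits) of the histogram of `values`,
    using fixed-width bins of size `bin_width`."""
    bins = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in bins.values())


def window_entropies(series, window, bin_width):
    """Entropy of each consecutive, non-overlapping window of `series`."""
    return [
        shannon_entropy(series[i:i + window], bin_width)
        for i in range(0, len(series) - window + 1, window)
    ]


# First window is flat, second window is spread across four bins.
series = [1, 1, 1, 1, 5, 15, 25, 35]
print(window_entropies(series, window=4, bin_width=10))
```

A jump in entropy from one window to the next is a hint that the distribution of the metric has changed, which can then be checked against the control chart.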

Finally, the results of some tests are presented at the end of the paper. It is a really interesting comparison of different approaches; I am not sure they used MASF, and it would also be interesting to compare the results with SETDS… Unfortunately, in the “related work” part of the paper I did not notice any recent, well-known and widely used implementations of anomaly detection techniques (except MASF), which are well presented in this blog (including SEDS/SETDS).