
Tuesday, December 27, 2011

IT/EV-Charts as an Application Signature: CMG'11 Trip Report, Part 1


I attended the following CMG’11 presentation (see my previous post):

Application Signature: A Way to Identify, Quantify and Report Change
Richard Gimarc (CA Technologies, Inc.) and Kiran Chennuri (Aetna Life Insurance Company)

Identifying change in application performance is a time consuming task. Businesses today have
hundreds of applications and each application has hundreds of metrics. How do you wade
through that mass of data to find an indication of change? This paper describes the use of an
Application Signature to identify, quantify and report change. A Signature is a compact
description of application performance that is used much like a template to judge if a change has
occurred. There are a concise set of visual indicators generated by the Signature that supports
the identification of change in a timely manner.

Here are my comments.

I like the idea of building an application characteristic called an Application Signature. As described in the paper, it is actually based on typical (standard) deviations of capacity usage during the peak hours of a day.

Looking closely at the approach, I see it is similar to the one I developed for SEDS, but it is a bit too simplified. Still, it is a great attempt to use the SEDS methodology to watch application capacity usage.

I think the weekly IT-Control Chart (see another previous post) is a way to compare the usual weekly profile with the last 168 hours of data (base-line vs. actual). So the base-line in the format of an IT-Control Chart, without the actual data, IS AN APPLICATION SIGNATURE, only in a much more accurate way. It even looks like somebody’s signature:

The actual data could be significantly different, as seen below:

And that difference should be automatically captured by a SEDS-like system as exceptions, and the system should calculate how much the actual data differ from the "Signature" using the EV meta-metric, either as a weekly sum of the hourly EV values or as an EV-Control Chart like the one shown here.

For instance, in this example week the application took a bit more than 23 unusual CPU-hours, as calculated below:
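For readers who want to reproduce this kind of weekly EV roll-up, a minimal SQL sketch might look like the following, assuming a hypothetical hourly_ev table that already holds the hourly Exception Values (all names here are made up for the example):

-- Hypothetical sketch only: roll hourly EV values up to a weekly total per server.
-- Assumes a table hourly_ev(server, dt, hr, ev_cpu_hours), where ev_cpu_hours is
-- the already-calculated hourly Exception Value expressed in CPU-hours.
SELECT server,
       YEARWEEK(dt)                AS yr_week,
       ROUND(SUM(ev_cpu_hours), 1) AS weekly_ev_cpu_hours
FROM   hourly_ev
GROUP  BY server, YEARWEEK(dt)
ORDER  BY server, yr_week;

A weekly total of about 23 CPU-hours, as in this example week, would immediately flag that week for a closer look.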

So, if the weekly EV number is 0, it means that most recently the application (server, LPAR and so on) stayed within its IT-Signature, which is GOOD: no changes happened!

The paper also shows a “calendar view” report that consists of a set of daily control charts. That is another good idea. I used that approach before I switched to weekly IT-Charts that cover 1/4 of a month, or bi-weekly ones that cover 1/2 of a month. So if you have IT-Charts there is no need for the "calendar view", which is sometimes not easy to read.

Another feature could be important for capacity usage estimates: the balance of hourly capacity usage over a day or week vs. the overall average (e.g. weekdays vs. weekends, or the daily “cowboy hat” profile with a lunch-time drop). That could be an additional IT-Signature feature. Another CMG’11 paper presents an interesting approach to analyzing and calculating that; I plan to publish my comments about that paper, so please check my next post soon.

Tuesday, December 6, 2011

Application Signature: some of my SEDS ideas are at work

I am at the CMG'11 conference now (in DC), presenting nothing this year (the first time in the last 11 years!), but I am enjoying the conference, especially when my work is referenced.

Here is an example from the paper called "Application Signature: A Way to Identify, Quantify and Report Change", which is being presented today at 4 pm by Richard Gimarc from CA Technologies, Inc. and Kiran Chennuri from Aetna Life Insurance Company:

'...We readily admit that we are “standing on the shoulders of giants”; leveraging the work of others in the field to develop our own interpretation, implementation and use of an Application Signature....
... Perhaps the most influential work is by Igor Trubin. Starting in 2001, Trubin built on the ideas proposed by Buzen and Shum to develop the Statistical Exception Detection System (SEDS). Basically, SEDS “is used for automatically scanning through large volumes of performance data and identifying measurements of global metrics that differ significantly from their expected values”. Again, we see common ground with our use of an Application Signature. The points we leverage from Trubin’s work are:
  • Identify when performance metrics exceed or fall below expectation
  • Note and record the exceptions
  • Estimate the size of each exception rather than just recording its occurrence
  • Use control charts as a visual tool for examining current performance versus expected performance
 ...
What do you do when a change is identified?
  • Quantify the change. Does your current measurement exceed the Signature by 5%, or 100%? We are considering implementing a technique similar to what was described by Trubin.
  • Grade the change as either good or bad. If a metric increases, is that an indication of a bad change? Not always. Consider workload throughput; an increase in workload throughput is probably a good change. We need to find a way to customize each Application Signature metric to recognize and highlight both good and bad changes.
  • Develop a historical record of changes. Again, this is an idea developed by Trubin. A historical record will provide the application development and support staff with a quantitative description of sensitive application characteristics that may warrant improvement. 
...'
Some other authors' work is referenced as well. I need to read it carefully and will report about it here in other posts. Looking forward to attending that presentation!

Richard and Kiran, thank you for referencing my work!



Tuesday, November 29, 2011

Finding the Edge of Surprise by Rich Olcott

I had definitely overlooked the following very good article by my CMG and IBM acquaintance:

MeasureIT - Issue 5.03 - Finding the Edge of Surprise by Rich Olcott 

At first glance the article has a good overview of classical SPC with some original suggestions on how to apply it to IT data. I also like the name of the article, which could be a good, short and metaphoric description of the main topic of this entire blog!

BTW, he provides there a reference to my CMG'2004 paper: “Mainframe Global and Workload Levels – Statistical Exception Detection System, Based on MASF,” CMG Proceedings (2004). The link to that paper is published in the very first posting of this blog!

And I have already mentioned his previous work in another posting:
Aug 13, 2007
Dials for a PM Dashboard: Velocity's Missing Twin, and Quantifying Surprise, Rich Olcott
I plan to reread both of his works and to add more comments and thoughts.

Wednesday, November 9, 2011

SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data

Last Thursday we had a very good Southern Computer Measurement Group meeting with 16 attendees in Richmond, VA, where I presented material on how to use R, BIRT, MySQL and Excel to analyze and report systems' performance data, using some real Unix server CPU utilization data for control charting as an example.

The agenda is still on the SCMG website, and my presentation slides are now published and linked there:

SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data
(slides).



Tuesday, October 11, 2011

My Southern CMG Presentation in Richmond Is About Open Source Tools for Capacity Management

I have been invited to give my new presentation at the 2011 Fall SCMG meeting. See the agenda here.


My presentation will actually be a compilation of some of my recent posts in this blog:

UCL=LCL : How many standard deviations do we use for Control Charting? Use ZERO! 
BIRT based Control Chart 
One Example of BIRT Data Cubes Usage for Performance Data Analysis 
How To Build IT-Control Chart - Use the Excel Pivot Table! 
Power of Control Charts and IT-Chart Concept (Part 1) 
Building IT-Control Chart by BIRT against Data from the MySQL Database 
EV-Control Chart


So please plan to attend! (Registration is here.)

Monday, October 10, 2011

Is Anomaly Detection Similar to Exception Detection? Apply SEDS for Information Security!

Sometimes I call my "Exception Detection" "Anomaly Detection". In some cases the performance degradation could be caused by a parasite program (like a badly written data collection agent), by an incompetent user (e.g. submitting a badly written ad-hoc database query), or even by a cyber attack (a denial-of-service (DoS) attack definitely degrades performance to the point of not performing at all, doesn't it?).

So, in my opinion, they are similar, and the Exception Detection methodology I am offering, based on the MASF technique, can be applied to the broader field of Information Security. And vice versa! Some intrusion detection techniques could be useful for automatic detection of performance issues!

I did a little Google research on that and found a few interesting approaches. See one of them:

See the abstract page for the dissertation written by Steven Gianvecchio:

Application of information theory and statistical learning to anomaly detection.


So the question is: can that information theory (entropy analysis) approach be applied to performance exception detection?

Friday, October 7, 2011

EV-Control Chart

I introduced the EV meta-metric in 2001 as a measure of anomaly severity. EV stands for Exception Value, and more explanation of that idea can be found here: The Exception Value Concept to Measure Magnitude of Systems Behavior Anomalies
Basically, it is the difference (integral) between the actual data and the control limits. So far I have used EV data mostly to filter out real issues or for automatic hidden-trend recognition. For instance, in my CMG’08 paper “Exception Based Modeling and Forecasting” I plotted that metric using Excel to explain how it could be used to recognize the starting point of a new trend. Here is the picture from that paper, where EV is called “Extra Volume” and, for the particular parent metric (CPU utilization), is named ExtraCPUtime:

The first chart of the EV meta-metric

But just plotting that meta-metric and/or its two components (EV+ and EV-) over time gives a valuable picture of system behavior. If the system is stable, that chart should be boring, showing near-zero values all the time. So with that chart it would be very easy (I believe even easier than with MASF control charts) to recognize an unusual and statistically significant increase or decrease in the actual data at a very early stage (Early Warning!).

Here is an example of that EV-Chart against the same sample data used in a few previous posts:
1. Excel example: 

2.  BIRT/MySQL example as a continuation of the exercise from the previous post:

IT-Control chart vs. EV-Chart
Here are the BIRT screenshots that illustrate how it is built:

A. An additional query to calculate EV, written directly in an additional BIRT Data Set object called “Data set for EV Chart”:
SQL query to calculate the EV meta-metric from the data kept in the MySQL table
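As an illustration only (this is not the original script), such a Data Set query might look like the following sketch, assuming the ActualVsHistoric table from the previous post with hypothetical columns weekhour, actual, UCL and LCL:

-- Illustrative sketch: per-weekhour EV for the bar chart, split into its
-- EV+ (above UCL) and EV- (below LCL) components.
-- Column names are assumptions, not the original query.
SELECT weekhour,
       GREATEST(actual - UCL, 0)                             AS ev_plus,
       GREATEST(LCL - actual, 0)                             AS ev_minus,
       GREATEST(actual - UCL, 0) + GREATEST(LCL - actual, 0) AS ev
FROM   ServerMetric.ActualVsHistoric
ORDER  BY weekhour;

EV+ and EV- could then be bound to the bar chart as two separate series if both directions of deviation are of interest.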

B. Then an additional bar-chart object is added to the report and bound to the new “Data set for EV Chart”:
The resulting report is already shown here.





Tuesday, October 4, 2011

Building IT-Control Chart by BIRT against Data from the MySQL Database

This post is just about another way to build an IT-Control Chart, assuming the raw data are in a real database like MySQL. In this case some SQL scripting is used.

1. The raw data is hourly CPU utilization, actually the same as in the previous posts: BIRT based Control Chart and One Example of BIRT Data Cubes Usage for Performance Data Analysis (see the raw data picture here).

2. That raw data need to be uploaded to a table (CPUutil) in the MySQL schema (ServerMetric) using the following script (sqlScriptToUploadCSVforSEDS.sql):

The uploaded data is seen at the bottom of the picture.
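For readers who want to try this themselves, a minimal sketch of such an upload script might look like the one below; the column names and CSV layout are assumptions, not the original sqlScriptToUploadCSVforSEDS.sql:

-- Sketch only. Assumes a CSV with a header row and three columns:
-- date, hour of day, and hourly CPU utilization.
CREATE SCHEMA IF NOT EXISTS ServerMetric;

CREATE TABLE IF NOT EXISTS ServerMetric.CPUutil (
    dt      DATE,            -- measurement date
    hr      TINYINT,         -- hour of the day, 0-23
    cpuutil DECIMAL(5,2)     -- hourly CPU utilization, %
);

LOAD DATA LOCAL INFILE 'CPUutil.csv'
INTO TABLE ServerMetric.CPUutil
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(dt, hr, cpuutil);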

3. Then the output (result) data (the ActualVsHistoric table) is built using the following script (sqlScriptToControlChartforSEDS.sql):
A fragment of the result data is also seen at the bottom of the picture. Everything is ready for building the IT-Control Chart, and the data is actually the same as that used in the BIRT based Control Chart post, so the result should be the same as well. Below is a more detailed explanation of how that was done.
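Again as a sketch only (not the original sqlScriptToControlChartforSEDS.sql), the general idea of that transformation could be expressed like this, assuming the CPUutil table above, the last 7 days treated as Actual, and a 3-standard-deviation control limit (the multiplier is just a common choice, not necessarily the one used here):

-- Base-line: per-weekhour mean and standard deviation over the history;
-- Actual: the most recent week; UCL/LCL: mean +/- 3 standard deviations.
-- weekhour = day-of-week * 24 + hour. All names are assumptions.
CREATE TABLE ServerMetric.ActualVsHistoric AS
SELECT b.weekhour,
       b.base_avg,
       b.base_avg + 3 * b.base_std AS UCL,
       b.base_avg - 3 * b.base_std AS LCL,
       a.cpuutil                   AS actual
FROM  (SELECT WEEKDAY(dt) * 24 + hr AS weekhour,
              AVG(cpuutil)          AS base_avg,
              STDDEV_SAMP(cpuutil)  AS base_std
       FROM   ServerMetric.CPUutil
       WHERE  dt <  CURDATE() - INTERVAL 7 DAY
       GROUP  BY WEEKDAY(dt) * 24 + hr) b
LEFT JOIN
      (SELECT WEEKDAY(dt) * 24 + hr AS weekhour,
              cpuutil
       FROM   ServerMetric.CPUutil
       WHERE  dt >= CURDATE() - INTERVAL 7 DAY) a
       ON a.weekhour = b.weekhour;

The resulting table then has one row per weekhour with the base-line average, UCL, LCL and the actual value, which is exactly the shape needed for the IT-Control Chart.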

4. First, using BIRT, the connection to the MySQL database is established (to MySQLti, schema ServerMetrics, table ActualVsHistorical):

5. Then the chart is developed the same way as was done in the BIRT based Control Chart post:


6. A nice thing is that in BIRT you can specify report parameters, which can then be used in any constants, including filters (to change a base-line or to provide server or metric names). Finally, the report should be run to get the following result, which is almost identical to the one built for the BIRT based Control Chart post:




Thursday, September 29, 2011

Power of Control Charts and IT-Chart Concept (Part 1)


This is a video presentation about Control Charts. It is based on a workshop I have already run a few times. It shows how to read and use control charts for reporting and analyzing IT systems performance (e.g. servers, applications). My original IT-(Control) Chart concept within SEDS (Statistical Exception Detection System) is also presented.

Part 2 will be about "how to build" a control chart using R, SAS, BIRT and other tools.


If anybody is interested, I would be happy to conduct this workshop again, remotely via the Internet or in person. Just put a request or a comment here.



UPDATE: See the version of this presentation with the Russian narration:

Friday, September 23, 2011

How To Build IT-Control Chart - Use the Excel Pivot Table!

Continuing the topic of the previous post “One Example of BIRT Data Cubes Usage for Performance Data Analysis”, I am showing here how to transform raw data into a “SEDS DB” format suitable for building an IT-Control Chart or for exception detection. Based on the SEDS-Lite introduction published on this blog, it is the “...building data for charting/detecting” task, which is seen in the picture:


But in this case it is a strictly manual process (unless someone wants to use VBA to automate it within MS Excel), and it requires basically the same approach as the Data Cube/CrossTab usage in BIRT; in MS Excel it is called a “PivotTable and PivotChart report”, listed under the “Data” menu item.

Below are a few screenshots that could help someone who is a bit familiar with EXCEL to understand how to build IT-Control Charts in order to analyze performance data in SEDS terms.

The input data is the same as in the previous post: just a date/hour stamped system utilization metric (link to it). Additionally, calculated columns were added, including Weekday (using the Excel WEEKDAY() function) and weekhour, as seen in the next picture:

/CPUdata/ sheet
Then the pivot table was built, as shown on the next screenshot, against the raw data plus the calculated weekhour field, which is specified in the “row” section of the Pivot Table Layout Wizard (it is a bit similar to the CrossTab object in BIRT; indeed, the Excel Pivot Table is another way to work with Data Cubes!):

/PivotForITcontrolChart/ sheet
Then three more columns were added right next to the pivot table to be able to compare Actual vs. Base-line and to calculate the control limits (UCL and LCL). To do that, the “CPU util. Actual” data were referenced from the raw /CPUdata/ sheet, where the last week of data is considered Actual. The control limit calculation was done by the usual spreadsheet formula, and the picture shows that formula for UCL.

The last step was to build a chart against the data range, which includes the pivot table and those three additional fields. See the resulting IT-Control Chart in the final picture:


Do you see where the exceptions (anomalies) happened there?

Note that this is an IT-Control Chart where the last day with actual data is at the very right, the last 24 hours on Saturday. So this report, made with Excel or BIRT, is good to run once a week (e.g. on Sundays before work hours) to get all of last week's exceptions. To be more dynamic, the report should be modified a bit (by adding a "refreshing" border) so it can run daily; then the minor exception that first happened on Thursday could be captured at least on Friday morning, and one could take some proactive measures to avoid the overutilization issue the chart shows for Friday and especially Saturday. The most dynamic way is to run it hourly (Excel is not good for that; use BIRT!) to be able to react to the first exception within the next few hours! See a live example of how that is supposed to work here: http://youtu.be/NTOODZAccvk or here: http://youtu.be/cQ4bk1HNuRk



By the way, I plan to prepare another workshop-style presentation to demonstrate the techniques discussed in my recent posts and also to share actual reports, maybe during some CMG.org events in the near future.

Thursday, September 22, 2011

One Example of BIRT Data Cubes Usage for Performance Data Analysis

I got a comment on my previous post “BIRT based Control Chart“ with questions about how the data are actually prepared in BIRT for control charting. Addressing this request, I’d like to share how I use a BIRT Cube to populate data into a CrossTab object, which is then used for building a control chart.


As I have already explained in my CMG paper (see IT-Control Chart), the data that describes the IT-Control Chart (or a MASF control chart) actually has 3 dimensions (2 time dimensions and one measurement, the metric, as seen in the picture at the left). The control chart is just a projection onto a 2D cut, with the actual (current or last) data overlaid. So, naturally, the OLAP Cube data model (Data Cubes) is suitable for grouping and summarizing time-stamped data into a cross table for further analysis, including building a control chart. In past SEDS implementations I did not use the Cubes approach and had to transform time-stamped data for control charting using basic SAS steps and procs. Now I find that Data Cubes usage is somewhat simpler and in some cases does not require any programming at all if modern BI tools (such as BIRT) are used.
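For readers who prefer SQL to a GUI, the same cube idea can be illustrated with a plain query that groups the date/hour-stamped data by weekhour and pivots a few weeks into the columns of a cross table; the table, column names and week numbers below are assumptions used only for illustration:

-- Illustration of the crosstable shape behind the cube: weekhour rows,
-- one column per (hypothetical) week, plus an overall base-line average.
SELECT WEEKDAY(dt) * 24 + hr                                 AS weekhour,
       AVG(CASE WHEN YEARWEEK(dt) = 201110 THEN cpuutil END) AS week_10,
       AVG(CASE WHEN YEARWEEK(dt) = 201111 THEN cpuutil END) AS week_11,
       AVG(CASE WHEN YEARWEEK(dt) = 201112 THEN cpuutil END) AS week_12,
       AVG(cpuutil)                                          AS all_weeks_avg
FROM   ServerMetric.CPUutil
GROUP  BY WEEKDAY(dt) * 24 + hr
ORDER  BY weekhour;

With BIRT the same grouping is done visually through the Cube and CrossTab objects, as the screenshots below show.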

Below are some screenshots with comments that illustrate the process of building the IT-Control Chart using a BIRT Cube.



The data source (input data) is a table with a date/hour stamped single metric and at least 4 months of history (in this case it is the CPU utilization of a Unix box). It could be in any database format; in this particular example it is the following CSV file:

The result (in the form of the BIRT report designer preview) is shown in the following picture (where UCL is the Upper Control Limit; LCL is not included for simplicity):

Before building the Cube, the following three data sets were built using the BIRT “Data Explorer”:
(1) The Reference set, or base-line (just “Data Set” in the picture), is based on the input raw data with some filtering and computed columns (weekday and weekhour), and
(2) the Actual data set, which is the same but has a different filter: (raw[“date”] Greater “2011-04-02”)


(3) To combine both data sets for comparing base-line vs. actual, “Data Set1” is built as a “Joint Data Set” using the BIRT Query Builder:
Then the Data Cube was built in the BIRT Data Cube Builder, with the structure shown on the following screen:
Note that only one dimension is used here (weekhour), as that is what is needed for the cross-table report below.

The next step is building the report, starting with a Cross Table (which is picked as an object from the BIRT Report Designer “Palette”):
The picture above also shows which fields are chosen from the Cube for the Cross Table.

The final step is dropping a “Chart” object from the “Palette” and adding the UCL calculation using the Expression Builder for an additional Value (Y) Series:

To see the result, one just needs to run the report or to use the "Preview" tab in the report designer window:

                FINAL COMMENTS

- The BIRT report package can be exported and submitted for running under any portal (e.g. IBM TCR).
- It makes sense to specify and use additional Cube dimensions, such as the server name and/or the metric name.
- The report can be designed in BIRT with some parameters. For example, a good idea is to use the server name as a report parameter.
- To follow the “SEDS” idea and to have the reporting process based on exceptions, a preliminary exception detection step is needed; it can again be done within a BIRT report using an SQL script similar to the one published in one of the previous posts:


   

Saturday, September 17, 2011

BIRT based Control Chart

Recently, while implementing a solution using IBM TCR, I noticed that one of the default reports in TCR/BIRT is a control chart in the classical (SPC) form. It looks like that was one of the requirements for the ability to build consistent reports using TCR/BIRT, as written here: Tivoli Common Reporting Enablement Guide

So I built a few TCR reports with control charts against Tivoli performance data, and that was somewhat useful.


I believe the IT-Control Chart (see my post about that type of control chart here) would give much more value for analyzing time-stamped historical data. Is it possible to build one using BIRT?


BIRT is a free, open source BI tool (it can be downloaded from here). I downloaded and installed it on my laptop and built a few reports for one of my customers. One of them was to filter out exceptionally "bad" objects (servers) using the EV criteria (see the linked post here).

Then I built the IT-Control Chart using BIRT. Below is the result:


Yes, it is possible, with some limitations I have noticed in the current version of the BIRT report designer. You can see them if you compare this chart with other IT-Control Charts I have built using R (see an example here), SAS (example here) or Excel (here).


Anyway, can you see how that chart reports proactively on an issue?


So it is another way (without programming as in R or SAS, and without manual work as in Excel) to build IT-Control Charts. Once built, the report can be submitted to TCR (or other reporting portals) to be run and viewed on the web.

Tuesday, August 23, 2011

CMG'11 papers about non-statistical ways to capture outliers/anomalies and trends

From the CMG'11 Abstract report:



Monitoring Performance QoS using Outliers
Eugene Margulis, Telus
Commonly used Performance Metrics often measure technical parameters that the end user neither knows nor cares about. The statistical nature of these metrics assumes a known underlying distribution when in reality such distributions are also unknown. We propose a QoS metric that is based on counting the outliers - events when the user is clearly “dis”-satisfied based on his/her expectation at the moment. We use outliers to track long term trends and changes in performance of individual transactions as well as to track system-wide freeze events that indicate system-wide resource exhaustion.

BTW, I have already tried to "count" outliers; see my 2005 paper listed here: http://itrubin.blogspot.com/2007/06/system-management-by-exception.html

I used the SEDS database to count and analyze exceptions:






Introduction to Wavelets and their Application for Computer Performance Trend and Anomaly Detection: 
Introduction to wavelets and their application for computer performance analysis. Wavelets are a set of waveforms that can be used to match a signal or noise. There are various families of wavelets, unlike Fourier Analysis. Wavelets are stretched (scaled) in time AND frequency and correlated with the signal. The correlation in time and frequency is displayed as a heat map. The color is the intensity, the X axis is the time and the Y axis is the frequency. The heat map shows the time the trend or anomaly starts and when it repeats (frequency).

CMG'11 Abstract Report shows my virtual presence

The CMG'11 agenda is online now. The Abstract report shows the following papers related to this blog's subject:

1. A Real-World Application of Dynamic Thresholds for Performance Management by Jonathan B Gladstone


He published some material on this blog that is most likely included in his CMG paper:


Feb 17, 2011
Jonathan Gladstone has worked with a team to implement pro-active Mainframe CPU usage monitoring, basing his design partly on presentations and conversations with Igor Trubin (currently of IBM) and Boris Ginis (of BMC Software).

Here is the abstract from the Abstract report:
The author describes a real application of dynamic thresholds as developed at BMO Financial Group. The case shown uses performance management data from IBM mainframes, but the method would work equally well for detecting deviations from normal patterns in any time-series data including resource utilization in distributed systems, storage, networks or even in non-IT applications such as traffic or health management. This owes much to previous work by well-regarded CMG participants Igor Trubin (currently at IBM), Boris Zibitsker (BEZ Systems) and Boris Ginis (BMC Software).

2. Automatic Daily Monitoring of Continuous Processes in Theory and Practice by Frank Bereznay

    Monitoring large numbers of processes for potential issues before they become problematic can be time consuming and resource intensive. A number of statistical methods have been used to identify change due to a discernable cause and separate it from the fluctuations that are part of normal activity. This session provides a case study of creating a system to track and report these types of changes. Determining the best level of data summarization, control limits, and charting options will be examined as well as all of the SAS code needed to implement the process and extend its functionality.

I believe that paper is based on the presentation he gave at the Southern California CMG earlier this year, which I have already mentioned in the following post: "The Master of MASF"

I have not written a paper this year (the first time in the last 10 years!), but I am glad that the technology I have been promoting for years is still represented at this year's CMG conference, with some references to my work!

Tuesday, August 16, 2011

"The Master of MASF"

The following paper was recently presented at the Southern California CMG (SCCMG):

Automatic Daily Monitoring of Continuous Processes
Theory and Practice

by 

MP Welch – Merrill Consultants
Frank Bereznay - IBM
That is another great paper that promotes the MASF approach to system performance monitoring, which is actually the main subject of this blog. Most likely that paper will be presented again and published at the international CMG'11 conference.

I am very proud that I was called "The Master of MASF" in that presentation! Thank you, Frank!
Here is the link to the presentation file I found via Google, which has the following pages referencing my work and also this blog:
[PPT] 

Automatic Daily Monitoring of Continuous Processes Theory and Practice



The paper also has good references to Ron Kaminski's and Dima Seliverstov's work. Both authors, as well as Frank Bereznay, have already been mentioned in this blog:


See the following posts for Frank Bereznay's work:


Aug 13, 2007
2006 Best Paper Award paper: Did Something Change? Using Statistical Techniques to Interpret Service and Resource Metrics. Frank M. Bereznay, Kaiser Permanente LINK: http://cmg.org/conference/cmg2006/awards/6139.pdf ...


Nov 05, 2010
Brian Barnett, Perry Gibson, and Frank Bereznay. That paper has a deep discussion about normality of performance data, showing examples where MASF approach does not work. The Survival Analysis that does not require any knowledge of how...



For Ron Kaminski's work:



Jan 24, 2009
... and Ron Kaminski, who expressed some interest in my EV algorithm to capture recent bad trends, as that solves some problems of workload pathology recognition on which he has been working recently. So you want to manage your z-Series MIPS?

And for Dima Seliverstov's work:


Dec 10, 2010
At the CMG'10 conference I met BMC Software specialist Dima Seliverstov, and he mentioned referencing my first CMG'01 paper in his CMG presentation (scheduled to be presented TODAY!). I looked at his paper "Application of Stock Market...