System Management by Exception

Sunday, June 3, 2012

Adrian Heald: A simple control chart using Captell Version 6

At CMG'11 I met Adrian and asked him to show me how his reporting tool “Captell” (www.reportingservices.com) can be used to build MASF Control Charts. Below is his response. See also the comment on this post with my feedback.
_____________________________________________________________________________
Introduction

A control chart uses data from a specified period to derive average and upper and lower control values. For this example we are using some CPU utilization data from a UNIX machine collected over the 4-month period January through April 2011 and delivered in a CSV file. The baseline period is January, from which we calculate average values and standard deviations for each clock hour. We can then plot our control chart and compare each successive month’s hourly averages with the control values to see a clear picture of change.

For more information and sample reports see www.reportingservices.com
or contact
Adrian Heald
on +61 (0)411 238 755
adrian@reportingservices.com

Step 1 - Import the CPU utilization data.

The following dialog shows the table definition selecting the “Delimited text file” source type. Specify a name and folder and choose the source type.

Here we see the text file definition; all that is required is the filename and a specification of the date time format.

Step 2 – Import and view the data

This window shows the main Captell dialog with the task importing the data.

And here a view of the imported data; during the import of the data Captell automatically determines correct data types.

Step 3 – Create a query to calculate the baseline

This query calculates the average and average +/- 2 standard deviations for data from January.

The query output.
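
For readers without Captell, the baseline calculation in Step 3 can be sketched in plain Python. This is only a minimal illustration, not Captell's query: the function name `hourly_baseline`, the input shape (timestamp, CPU %) and the use of the sample standard deviation are all assumptions, and the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def hourly_baseline(samples, year=2011, month=1, k=2.0):
    """Return {hour: (lcl, avg, ucl)} from (timestamp, cpu_pct) samples.

    Only samples from the baseline month are used. The limits are
    avg -/+ k sample standard deviations (k=2 as in the post).
    """
    by_hour = defaultdict(list)
    for ts, cpu in samples:
        if ts.year == year and ts.month == month:
            by_hour[ts.hour].append(cpu)

    baseline = {}
    for hour, values in sorted(by_hour.items()):
        avg = mean(values)
        # A single observation has no spread; treat it as zero (assumption).
        sd = stdev(values) if len(values) > 1 else 0.0
        baseline[hour] = (avg - k * sd, avg, avg + k * sd)
    return baseline


# Made-up data: three January readings for 09:00 and one February reading.
samples = [
    (datetime(2011, 1, 3, 9), 40.0),
    (datetime(2011, 1, 4, 9), 50.0),
    (datetime(2011, 1, 5, 9), 60.0),
    (datetime(2011, 2, 1, 9), 95.0),  # outside the baseline month, ignored
]
baseline = hourly_baseline(samples)
lcl, avg, ucl = baseline[9]
print(f"09:00  LCL={lcl:.1f}  avg={avg:.1f}  UCL={ucl:.1f}")
```

Grouping by clock hour gives at most 24 rows, one per hour, which is the shape the chart in Step 5 needs.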

Step 4 – Create a query to summarise a single month’s data

This query calculates the average CPU for each hour throughout the month selected by the Captell parameter ‘Data\Month’.

The query output:

Step 5 – Create a chart to combine the two queries

This chart shows the baseline average CPU utilisation and upper control limit along with the average values from the current month. Captell’s ability to plot data from different sources, in this case the baseline data and the data from the new month, makes reporting quite easy. The blue line with the square symbols shows the average hourly data for March, well within the control limit and with all hourly values below the baseline average.

Step 6 – Change the parameter to compare a different month

Here we can see the parameter changed to April and the resultant chart. The blue line with the square symbols shows the average hourly data for April, mostly above the upper control limit and all but one hour above the January mean, indicating a substantial increase in utilization.
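
The comparison made in Steps 4 to 6 can also be sketched in code. The snippet below is a rough illustration under assumptions: `compare_month` is a hypothetical helper, the baseline is passed in as `{hour: (lcl, avg, ucl)}`, and the numbers are invented rather than taken from the charts.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def compare_month(samples, baseline, year, month):
    """Classify each clock hour of the chosen month against the baseline.

    samples:  iterable of (timestamp, cpu_pct)
    baseline: {hour: (lcl, avg, ucl)} computed from the baseline month
    Returns {hour: (month_avg, status)}.
    """
    by_hour = defaultdict(list)
    for ts, cpu in samples:
        if (ts.year, ts.month) == (year, month):
            by_hour[ts.hour].append(cpu)

    report = {}
    for hour, values in sorted(by_hour.items()):
        if hour not in baseline:
            continue  # no control values for this hour
        lcl, avg, ucl = baseline[hour]
        month_avg = mean(values)
        if month_avg > ucl:
            status = "above UCL"
        elif month_avg < lcl:
            status = "below LCL"
        elif month_avg > avg:
            status = "above baseline mean"
        else:
            status = "at or below baseline mean"
        report[hour] = (month_avg, status)
    return report


# Invented baseline and April readings, for illustration only.
baseline = {9: (30.0, 50.0, 70.0), 10: (35.0, 55.0, 75.0)}
samples = [
    (datetime(2011, 4, 1, 9), 80.0),
    (datetime(2011, 4, 2, 9), 90.0),
    (datetime(2011, 4, 1, 10), 60.0),
    (datetime(2011, 3, 1, 9), 20.0),  # March, not selected
]
report = compare_month(samples, baseline, 2011, 4)
for hour, (value, status) in report.items():
    print(f"{hour:02d}:00  avg={value:.1f}  {status}")
```

Changing the `month` argument plays the role of the ‘Data\Month’ parameter: the baseline stays fixed while the month under review changes.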

(Posted with Adrian Heald's permission)

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics from St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences in the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He has given other technical presentations at IBM z/Series Expo, SPEC.org, Southern and Central Europe CMG and has run several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.org). He worked for 2 years as the Capacity team lead for IBM, for SunTrust Bank for 3 years and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.

Wednesday, May 23, 2012

STEEDd: Another Implementation of The Near-Real-Time Control Charting and EV Calculating

Thierry Déléris is a French System Programmer on Mainframe in a team dedicated to performance, metrology & capacity planning. He used some ideas published in Trubin's CMG papers to implement the following:

1. The first part of the solution sends a daily email per CEC with a spreadsheet broken down by LPAR and workload. Thresholds are calculated in the R language by day of the week, hour of the day, LPAR name and WLM workload, based on 6 months of historical data (SMF 72 records), with outliers excluded using Tukey's statistical method.

This initial part of the solution has a big drawback: the resulting spreadsheet for a CEC is available only the next day, because it is based on the previous day's SMF 72-3 records, collected overnight by TDSz...

2. The second part of the solution is called STEEDd (Statistical Tool for Enhanced Exceptions Detection and Diagnosis, and also a reference to John Steed, the character from the British TV show "The Avengers", and his legendary bowler hat). It was developed in Java and uses the same R-calculated thresholds, but on a 15-minute control cycle. It interacts with BMC Mainview on the host to collect the current data (in fact, the data for the last 15 minutes). The solution provides a main screen to select the metric to control, and a control screen per metric. An email alert is sent to the team if the result for some metric is above the high threshold or below the low threshold.

As an example, here is a picture of the control screen used for CPU Metric by Workload & LPAR :

Legend:

When the icon is selected, the associated control chart pops up showing the metric for the last 12 hours, like the one shown below:

The idea of EV (Extra Value or Exception Value, introduced in Trubin’s CMG papers and discussed in this blog) is used there (red bars for EV+ and yellow bars for EV- in the picture above). This helps to separate the right alerts from the false ones.

3. Third part of the solution: on the way! An Artificial Intelligence solution based on a rule engine is being studied to explore a detected problem in a hierarchical way... This application will be used to enhance the analysis of the metric alerts in an "expert system" fashion.
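
The threshold calculation from part 1 (grouping by day of week, hour, LPAR and workload, with Tukey outlier exclusion) can be sketched as follows. This is a Python stand-in for the R code, which is not shown in the post. The fence multiplier of 1.5, the "inclusive" quartile method and the mean -/+ 2 standard deviations used for the final limits are all assumptions, as are the names `tukey_filter` and `thresholds` and the sample values.

```python
from collections import defaultdict
from statistics import mean, quantiles, stdev


def tukey_filter(values, k=1.5):
    """Drop values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def thresholds(records, n_sigma=2.0):
    """Compute (low, high) limits per (weekday, hour, lpar, workload).

    records: iterable of (weekday, hour, lpar, workload, value).
    Limits are mean -/+ n_sigma standard deviations of the values that
    survive the Tukey filter (the limit formula is an assumption).
    """
    groups = defaultdict(list)
    for weekday, hour, lpar, workload, value in records:
        groups[(weekday, hour, lpar, workload)].append(value)

    result = {}
    for key, values in groups.items():
        # Quartiles are not meaningful for very small groups; skip filtering.
        kept = tukey_filter(values) if len(values) >= 4 else values
        avg = mean(kept)
        sd = stdev(kept) if len(kept) > 1 else 0.0
        result[key] = (avg - n_sigma * sd, avg + n_sigma * sd)
    return result


# Invented history for one Monday 09:00 cell; 100.0 is an obvious outlier.
key = ("Mon", 9, "LPAR1", "BATCH")
history = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
records = [key + (v,) for v in history]
limits = thresholds(records)
low, high = limits[key]
print(f"{key}: low={low:.2f} high={high:.2f}")
```

Filtering before computing the limits keeps a single abnormal day from widening the thresholds for that weekday and hour for the next six months.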

(Posted with Thierry Déléris's permission)

Igor Trubin

Monday, May 7, 2012

SEDS-Lite Presentation at Southern CMG Meeting in the SAS Institute

Last Friday I gave the presentation that was announced here: SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data. It was presented at the Southern CMG Meeting at the SAS Institute, Cary, NC. The presentation slides are linked within AGENDA and can also be downloaded from HERE

I plan to write a paper based on this presentation and to submit it to this year's CMG'12 conference.

Igor Trubin

Friday, April 20, 2012

Building IT-Control Chart with COGNOS

I am developing SEDS elements using IBM Cognos. Here is the first result, which is just a POC prototype of an IT-Control Chart report.

I used the test data (date-hour stamped utilization metric) that I developed to build the same IT-Control Charts with other tools (BIRT, MySQL, R). I have published some information about that in my previous blog posts (e.g. R-script to plot IT-Control Chart against MySQL).

This time I developed a very simple metadata package over an ODBC connection to the MySQL database using Cognos Framework Manager and published it in TCR locally on my laptop. Then I used Cognos Report Studio to build the report. The result of running the report is the following:

I got the same result as I got by using R or BIRT, but I noticed some nice features in COGNOS that helped me to build it faster and more accurately (e.g. adding the dates on the X-axis).

I am going to mention that progress, with some details, in my upcoming SCMG presentation:

SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data

UPDATE: I will be presenting that again at CMG'12 conference: http://itrubin.blogspot.com/2012/08/seds-lite-using-open-source-tools-r.html

Igor Trubin

Tuesday, April 17, 2012

Southern CMG Spring 2012 Meeting in Richmond - MXG is our Sponsor!

At SCMG we usually have two meetings each season (2 Spring and 2 Fall ones, both in Richmond and Raleigh). Last season, fall 2011, I gave my presentation; see the following post: "My Southern CMG Presentation in Richmond Is About Open Source Tools for Capacity Management". Presentation slides are published here: slides

This spring I have a similar but updated presentation at our Raleigh SCMG meeting: "SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data"

So this time I am not presenting in Richmond, but I found a very good sponsor for that meeting: Merrill Consultants (http://www.mxg.com). Barry Merrill himself responded to my invitation, and now we have a great opportunity to see and listen to the legendary Capacity Management inventor!

Please consider attending our Richmond VA SCMG meeting on May 11, 2012:

http://regions.cmg.org/regions/scmg/spring_12/richmond/meeting.htm

Igor Trubin

Wednesday, April 11, 2012

SEDS-Lite: Using Open Source Tools (R, BIRT and MySQL) to Report and Analyze Performance Data

My presentation with this name has been scheduled for the next Southern CMG meeting at SAS Institute:

SCMG Meeting Raleigh
May 04, 2012

You are welcome to attend!

Igor Trubin

Thursday, April 5, 2012

Prehistory of SEDS: Virtual CMG'90 Trip Report about Control Chart Usage. Part 1.

Using the keyword "Control Chart" I found in the www.CMG.org knowledge base a few very old CMG papers with some discussion of applying the classical SPC approach to computer performance data.

Here is the first one:

Fine-Grain Analysis (FGA): A Methodology for Analyzing Intermittent Performance Problems

By Robert Berry & Jeffrey Hedglin

The paper describes which Mainframe metrics are good to use for Control Charting. They should be of two types: a. Performance Quality Measure, which sounds like a modern KPI (e.g. response time); b. System performance metrics (e.g. CPU queue length). Then the paper describes how an intermittent problem can be detected just by plotting SPC Control Charts for both types of metrics in sync (correlated).

I use that approach a lot now, but with the MASF type of Control Chart and specifically my IT-Control Charts. BTW I am now writing my next CMG paper and plan to add a couple of very persuasive examples of correlated IT-Control Charts, such as the number of concurrent user LOGONS vs. the number of Ph. CPUs used by LPARs on some p770 AIX frame....

To be continued....

Igor Trubin

Tuesday, March 27, 2012

R-Script to Aggregate (ETL to MySQL) Actual data with Base-line data for IT-Control Charts

In my previous post (R-script to plot IT-Control Chart against MySQL) the task was set to write an R-script to pre-process (ETL) the raw date-hour stamped data into the DATA-cubical format for Control Charting.

Here is the solution:
I have just transformed the already developed SQL script into the RODBC-based R-script, which can be seen below:

The result of the script run is the "ActualVsHistorical" table in the servermentrics database on MySQL, with data identical to the data used for plotting the IT-Control Chart published in the previous post. The data itself can be seen by just typing the data frame name in the R-Console window:

So, all the main elements of the SEDS-lite project have been prototyped and published in my posts. Maybe one more task is left, which is to illustrate in R how the exception list of objects (servers), based on EV meta-metric filtering, can be created as a part of anomaly detection. So far that has been published in this blog only in the "DB2"-like SQL format to run within BIRT. See the post about that here: UCL=LCL : How many standard deviations do we use for Control Charting? Use ZERO!
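
As a rough illustration of the ETL step described in this post (not the RODBC script itself), here is a Python sketch that uses an in-memory SQLite database as a stand-in for MySQL. Only the table name "ActualVsHistorical" comes from the post. The column names, the choice of the last 7 days as "actual", the Monday-based hour-of-week index and the 3 standard deviation limits are all assumptions, and the data is generated.

```python
import sqlite3
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_actual_vs_historical(conn, rows, k=3.0):
    """Aggregate (timestamp, value) rows into one record per hour of the week.

    The most recent 7 days are treated as "actual"; everything older is
    the historical baseline. Column names are hypothetical.
    """
    rows = sorted(rows)
    cutoff = rows[-1][0] - timedelta(days=7)
    hist, actual = defaultdict(list), defaultdict(list)
    for ts, value in rows:
        weekhour = ts.weekday() * 24 + ts.hour  # 0..167, Monday 00:00 is 0
        (actual if ts > cutoff else hist)[weekhour].append(value)

    conn.execute("DROP TABLE IF EXISTS ActualVsHistorical")
    conn.execute(
        "CREATE TABLE ActualVsHistorical "
        "(weekhour INTEGER PRIMARY KEY, actual REAL, mean REAL, lcl REAL, ucl REAL)"
    )
    for weekhour in sorted(hist):
        values = hist[weekhour]
        avg = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        act = mean(actual[weekhour]) if weekhour in actual else None
        conn.execute(
            "INSERT INTO ActualVsHistorical VALUES (?, ?, ?, ?, ?)",
            (weekhour, act, avg, avg - k * sd, avg + k * sd),
        )
    conn.commit()


# Generated data: 5 weeks of hourly readings starting on a Monday,
# with the level creeping up by one unit per week.
start = datetime(2012, 2, 6)
rows = []
for h in range(5 * 7 * 24):
    ts = start + timedelta(hours=h)
    rows.append((ts, 40.0 + ts.hour + h // 168))

conn = sqlite3.connect(":memory:")
build_actual_vs_historical(conn, rows)
for row in conn.execute("SELECT * FROM ActualVsHistorical LIMIT 3"):
    print(row)
```

Each output row holds everything one point on the control chart needs: the position in the week, the actual value, and the mean and limits for that hour.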

Igor Trubin

Wednesday, March 21, 2012

R-script to plot IT-Control Chart against MySQL

Continuing to play with open-source tools to build some SEDS elements, I have developed a simple R-script to plot the IT-Control Chart against data stored in a MySQL database.

I used the same MySQL data that had already been built and used for IT-Control Charting by the BIRT reporting system. See the following post about how that was done: Building IT-Control Chart by BIRT against Data from the MySQL Database. To do that I used the RODBC package to connect to and query data from the MySQL database through the MySQL ODBC driver.

Actually, I have just slightly modified the R-script which I wrote for my "Power of Control Chart" workshop. That script can be found in the following post: IT-Chart: The Best Way to Visualize IT Systems Performance

Here is my new script (click on it to enlarge) :

Here is the result:

which is practically identical to what was done by BIRT (see link to BIRT based picture here).

If you are a programmer, you will notice how much easier it is to build charts using R than BIRT (a menu-based report generator not aimed at programmers).

The data used for this exercise was already preprocessed into the DATA-cubical format from raw date-hour stamped data (see the SQL script for that here). But what about doing this pre-processing in R as well?

That is the next task ... (could be your homework ;). The simplest approach is again to use the RODBC package just to run the above-mentioned SQL script from within R. Another, better approach is to do it using R's native data manipulation techniques.

Igor Trubin

Monday, February 27, 2012

Automatic Daily Monitoring of Continuous Processes in Theory and Practice: My CMG'11 Trip Report; Part 3

As I already announced in my earlier posting, CMG'11 Abstract Report shows my virtual presence, another great MASF paper was published at the CMG'11 conference:

"Automatic Daily Monitoring of Continuous Processes in Theory and Practice" written and presented by Frank Bereznay & MP Welch.

I attended the session and here are my comments:

1. The difference between MASF and SPC was stressed. "MASF is a framework and not a detailed statistical method".

2. "... key assumption, our workload is repeatable is some fashion over time. The concept of a repeatable workload is fundamental to any sort of detection testing and needs to be validated before making any investment of time and software into developing a detection system..." That is true!

3. The weekly 168-hour profile was acknowledged as the best one for MASF analysis:

- the picture from commented paper

I am glad they did that as I moved from the 24-hour profile to this one long ago. See my 2006 paper and here is the IT-Control Chart from that:

So they suggested having 168 separate groups of data, one for each hour of the week (separate reference sets), which is exactly the technique I have been using since 2006. They stressed that you need at least 5 months of historical data to build that weekly profile adaptive filtering policy. If you do not have this luxury, they describe a way to reduce the number of groups, for instance by separating shifts.

At this point I would slightly disagree. Having 6 months of hourly summarized historical data is not a problem anymore in the modern capacity planning process, especially on mainframes (they used that platform for demonstration).

4. They published some simple SAS code fragments. I have never done that! But I have started publishing R code and SQL scripts, as they are more popular (and open source) programming systems.

5. They reproduced my favorite IT-Control Chart, but against daily data:

- the picture from commented paper

That is similar to my very early attempt to build an IT-Control Chart in the same 2006 paper of mine:

- that is my 1st Control Chart builder!

But I believe the 168-hour control chart (I call that the IT-Control Chart) is better, even though it is a bit busy ... See another example below:

6. Some techniques for reduction of false positives were discussed.

I am glad they mentioned my way of doing that by using the EV meta-metric:

"One technique for reducing false positives is to measure the area between under the exception (one of Truben’s techniques) to determine the extent of the deviation. In this case, this exception would not likely warrant review and is common when using the Hourly stigmatization of this data." (I believe they misspelled my name. It is Trubin, not Truben...)

Anyway, they referenced some of my papers extensively and even mentioned this blog, and I greatly appreciate that!

All in all it is a very good paper and presentation!
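
For readers new to the EV meta-metric mentioned in item 6, here is a minimal sketch of the idea of measuring the area outside the control limits. It is written under assumptions: the function names are hypothetical, EV- is reported as a negative number, the "any non-zero EV" filter is just one possible rule, and the server data is invented.

```python
def exception_values(actual, lcl, ucl):
    """Return (ev_plus, ev_minus) for equally spaced observations.

    ev_plus  sums how far the actual values exceed the upper limit.
    ev_minus sums how far they fall below the lower limit (as a
    negative number; the sign convention is an assumption).
    Both are 0 when the metric stays inside the limits.
    """
    ev_plus = sum(a - u for a, u in zip(actual, ucl) if a > u)
    ev_minus = sum(a - l for a, l in zip(actual, lcl) if a < l)
    return ev_plus, ev_minus


def exception_list(servers, min_ev=0):
    """Keep only servers whose |EV+| or |EV-| exceeds min_ev."""
    flagged = {}
    for name, (actual, lcl, ucl) in servers.items():
        ev_plus, ev_minus = exception_values(actual, lcl, ucl)
        if ev_plus > min_ev or -ev_minus > min_ev:
            flagged[name] = (ev_plus, ev_minus)
    return flagged


# Invented hourly data for two hypothetical servers.
servers = {
    "srv-a": ([50, 60, 65, 40, 55], [30] * 5, [70] * 5),  # always inside
    "srv-b": ([50, 72, 80, 28, 60], [30] * 5, [70] * 5),  # three excursions
}
flagged = exception_list(servers)
for name, (ev_plus, ev_minus) in flagged.items():
    print(f"{name}: EV+={ev_plus} EV-={ev_minus}")
```

Raising `min_ev` is one way to suppress small excursions that cross a limit only briefly, which is the false-positive reduction the quote refers to.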

Igor Trubin

Friday, February 24, 2012

I was the professor at the technical university in Russia - List of courses I taught in 1999

Igor Trubin

Friday, February 17, 2012

Forrester’s “APM and BTM” about CEP - Complex Event Processing

Continuing the subject of the previous post, I looked at another piece of research on APM, published a bit earlier, in 2010, by Forrester Research, Inc. and called “Competitive Analysis: Application Performance Management And Business Transaction Monitoring”. The research can be downloaded here.

I found that this research also recognizes the importance of “self-learning” techniques for APM and treats them as a part of CEP - Complex Event Processing.

According to the research:

“..The Next Step: APM, BTM, BPM, And CEP Converge Complex event processing (CEP) is most probably the first step in the evolution of application performance management. All products reviewed are using some form of statistical-based analysis to distinguish normal from abnormal behavior of applications and transactions. Nastel seems to have taken this analysis one step further by adding a level of inference to its solution. Progress Software has already made the jump into CEP by combining its expertise in BTM and BPM. OpTier recently acquired a solution and announced its intention to enter the advanced field of CEP. SL Corporation, based on its process control automation past, has provided event correlation for a long time, and further integrates with major CEP vendors…”

Below are the vendors that Forrester’s research mentioned as having some CEP features (underlined)

BMC

BPPM Application, Database, and Middleware Monitoring with Analytics monitors transactions running through Web application servers and messaging middleware as well as packaged applications like SAP, Oracle Applications, PeopleSoft, and Siebel CRM. Data collected is automatically integrated with a self-learning analytics engine.

NetIQ

AppManager Performance Profiler is a self-learning, continuously configuring, and continuously adapting technology that profiles dynamic application behavior and sends Trusted Alarms that helps troubleshoot system incidents.

IBM

..(Tivoli) proactively defines autothresholds based on normal behavior.

Nastel Technologies.

AutoPilot CEP integrates events from AutoPilot and third-party monitoring solutions to provide a predictive analysis of application and transaction behavior (normal versus abnormal) and provides a role-based dashboard.

SL Corporation.

RTView Historian allows for persistence of performance metrics via relational databases. The historical data is used for predictive analysis of trends in component and application behavior; historical data provides the ability to create trusted alerts triggered not against fixed thresholds but against dynamically calculated baselines that take into account typical loads during different periods of the workday.

Correlsense

SharePath builds a transaction model for each transaction type to show how it typically utilizes the infrastructure and then creates automatic baselines to provide alerting capabilities and information about a deviation from normal operating tolerances.

Progress Software.

Progress Apama (also part of the RPM Suite) can take information from Actional and perform complex pattern detection activities around it, looking for anomalies that Actional might not otherwise detect. This might include, for example, detecting a cross-correlation between different transactions that might be the root cause of an issue.

Igor Trubin

Wednesday, February 15, 2012

Gartner's Magic Quadrant for Application Performance Monitoring and Behavior Learning Engine

I strongly believe that my SEDS or SETDS (Statistical Exception and Trend Detection System) could be treated as BLE – Behavior Learning Engine. SEDS or SETDS (new name I use now) is not recognized by the following Gartner’s research, but BLE is.

Gartner's 2011 research (G00215740), called “Magic Quadrant for Application Performance Monitoring” (can be downloaded here), recognized that one of the important functionality dimensions of APM is “Applications Performance Analytics”, which is described in the research and can be seen in the quotes below:

That includes BLE, which is indeed an essential component of Application Performance Analytics. The research includes analyses of several vendors and tools, shown in the Quadrant picture below:

But only the following vendors were indicated in the research as having tools with strong behavior learning features:

ASG

BMC software

CA Technologies

Compuware

IBM

I hope the SETDS implementation offering within the IBM consulting service (which I currently do) could shift the company in that Magic Quadrant to the right...

Igor Trubin
