System Management by Exception: Threshold


Friday, October 7, 2011

EV-Control Chart

I introduced the EV meta-metric in 2001 as a measure of anomaly severity. EV stands for Exception Value; more explanation of the idea can be found here: The Exception Value Concept to Measure Magnitude of Systems Behavior Anomalies

Basically, it is the difference (integral) between the actual data and the control limits. So far I have used EV data mostly to filter out real issues or for automatic recognition of hidden trends. For instance, in my CMG’08 paper “Exception Based Modeling and Forecasting” I plotted that metric in Excel to explain how it could be used to recognize the starting point of a new trend. Here is the picture from that paper, where EV is called “Extra Volume” and, for the particular parent metric (CPU util.), it is named ExtraCPUtime:

The EV meta-metric first chart

But just plotting that meta-metric and/or its two components (EV+ and EV-) over time gives a valuable picture of system behavior. If the system is stable, that chart should be boring, showing a near-zero value all the time. So that chart makes it very easy (I believe even easier than with MASF Control Charts) to recognize an unusual and statistically significant increase or decrease in the actual data at a very early stage (Early Warning!).

Here is an example of that EV-chart against the same sample data used in a few previous posts:
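To make the idea concrete, here is a minimal Python sketch of how EV+ and EV- could be computed for one period. It is only an illustration under stated assumptions, not the query or spreadsheet used for the charts in this post: the function name `exception_values`, the rectangle-rule approximation of the integral, and the convention that EV- is reported as a negative number are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def exception_values(
    actual: Sequence[float],
    upper: Sequence[float],
    lower: Sequence[float],
    dt: float = 1.0,
) -> Tuple[float, float]:
    """Return (EV+, EV-) for one period.

    EV+ sums the area where actual data exceeds the upper control limit.
    EV- sums the area where actual data falls below the lower control
    limit (kept negative here, which is an assumed sign convention).
    The integral is approximated by a rectangle rule with step dt.
    """
    ev_plus = 0.0
    ev_minus = 0.0
    for a, u, l in zip(actual, upper, lower):
        if a > u:
            ev_plus += (a - u) * dt
        elif a < l:
            ev_minus += (a - l) * dt
    return ev_plus, ev_minus


# Hypothetical hourly CPU utilization (%) with flat control limits.
actual = [50, 62, 70, 40, 28]
upper = [60] * 5
lower = [30] * 5
ev_plus, ev_minus = exception_values(actual, upper, lower)
print(ev_plus, ev_minus)
```

For a stable system every point stays inside the limits, so both components are zero, which is why the chart looks "boring" until something changes.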

1. Excel example:

2. BIRT/MySQL example as a continuation of the exercise from the previous post:

IT-Control chart vs. EV-Chart

Here are the BIRT screenshots that illustrate how it is built:

A. An additional query that calculates EV is written directly in an additional BIRT Data Set object called “Data set for EV Chart”:

SQL query to calculate EV meta-metric

SQL query to calculate EV metric from the data kept in MySQL table

B. Then an additional bar-chart object, bound to that new “Data set for EV Chart”, is added to the report:

The resulting report is already shown here.

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other tech. presentations at IBM z/Series Expo, SPEC.org, Southern and Central Europe CMG and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). Author of the “Performance Anomaly Detection” class (at CMG.com). He worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. Runs UT channel iTrubin

Tuesday, March 22, 2011

CMG Canada Paper about Threshold Management

CMG Canada will have a paper presented by J. Gladstone related to the following posting on this blog: http://itrubin.blogspot.com/2011/02/jonathan-gladstone-threshold-management.html

Igor Trubin

Thursday, February 17, 2011

Jonathan Gladstone: Threshold Management Diagram

Jonathan Gladstone has worked with a team to implement pro-active Mainframe CPU usage monitoring, basing his design partly on presentations and conversations with Igor Trubin (currently of IBM) and Boris Ginis (of BMC Software).

His system does not generate any alerts on this basis, but it’s a good place to go to

find out what’s been running hot (or cool) at the system level, and/or
figure out why at the service class level.

It compares each interval (in this case every 10 minutes) of the most recent day’s utilization (by system and by service class) with the average for a given hour on a given day of the week over the past six weeks. Each interval is compared to the set of the last 36 values in a similar timeframe (six 10-minute intervals per hour times six weeks). If more than one interval in an hour is higher than the 98th percentile for its hour & day, the hour is marked yellow; if more than four intervals are high, the hour is marked red. If more than one interval is lower than the 2nd percentile for its hour & day, the hour is marked blue. Anything in between (i.e. anything that falls within roughly x-bar±2SD) is green.
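The colouring rules above can be sketched in a few lines of Python. This is a hedged illustration, not Jonathan's implementation: the names `percentile` and `classify_hour` are hypothetical, the nearest-rank percentile method is an assumption (the post does not say how percentiles are computed), and so is the order in which the high and low checks are applied when both occur in the same hour.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (assumed method, not stated in the post)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def classify_hour(intervals: Sequence[float], baseline: Sequence[float]) -> str:
    """Colour one hour from its 10-minute intervals.

    intervals: the six utilization values of the hour being checked.
    baseline:  the 36 values for the same hour and weekday over the
               past six weeks.
    """
    high = percentile(baseline, 98)
    low = percentile(baseline, 2)
    n_high = sum(v > high for v in intervals)
    n_low = sum(v < low for v in intervals)
    if n_high > 4:
        return "red"
    if n_high > 1:
        return "yellow"
    if n_low > 1:  # assumed precedence: high flags are checked first
        return "blue"
    return "green"


# Hypothetical baseline: 36 utilization values from 1 to 36.
baseline = list(range(1, 37))
print(classify_hour([10, 20, 30, 15, 25, 12], baseline))
print(classify_hour([37, 38, 10, 20, 30, 15], baseline))
```

Note that with only 36 baseline values the nearest-rank 98th percentile is simply the largest value in the baseline, so an interval is "high" only when it exceeds everything seen in the previous six weeks.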

Here’s the main “CPU Overview” page from his system:

The thumbnails give an idea of what’s going on: green is within normal range. Let’s look at Sunday, Jan. 23rd (just because all the colours are there). Clicking on any thumbnail shows that day close up:

Without going into details about what runs in which systems, we can see that they’re listed in reverse alpha order and, of course, anyone who’s looking at this knows which system is which. The user can see that a lot of systems were running well below their normal utilization on this particular Sunday. That’s mostly because of some special testing: our developers were asked to stay off the systems if they could. To see more detail let’s choose SCA6, which has all of the colours. If we click anywhere on the bar for SCA6, this next level of detail is shown:

(Igor Trubin: That is very similar to IT-control charts, see http://itrubin.blogspot.com/2009/05/seds-charts-at-scmg.html , but in the older 24-hour format; I now prefer to use the 7x24 weekly profile shown here: https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjizmXcXEYkyoRxutIr4xS2dj83eQeVurIX3NQhrWlge4a6BYZeXM1JU9mCvdwiwvZEtMX-m4Q93BC98g9vhEKm_vSaJguMlE7RmVyU7ih_RViHc9FmjjscNzDPgsLtYglQ_SxCgERsii4/s1600-h/Untitled.jpg)

That chart shows the system’s total utilization (from SMF70s) for individual 10-minute intervals (green area) compared to the average, high (98%ile) and low (2%ile) values for each hour based on the last six weeks. We see why some hours are marked red, yellow or blue instead of green according to the rules above. Clicking anywhere on the green area gets a long page full of control charts that show the same information for each defined service class within that system (from SMF72s).

Among them the following, BATH_A6, is high-priority batch. Clearly it was driving some of the yellow and red flags for this system in the 2-3 and 5-7h windows:

(This post is published here with Jonathan Gladstone’s permission. He retains all publication rights and copyright for this material.)

Igor Trubin
