System Management by Exception: Jonathan Gladstone: Threshold Management Diagram

Thursday, February 17, 2011

Jonathan Gladstone has worked with a team to implement proactive mainframe CPU usage monitoring, basing his design partly on presentations by and conversations with Igor Trubin (currently of IBM) and Boris Ginis (of BMC Software).

His system does not generate any alerts on this basis, but it’s a good place to go to find out what’s been running hot (or cool) at the system level, and/or to figure out why at the service class level.

It compares each interval (in this case every 10 minutes) of the most recent day’s utilization (by system and by service class) with the average for the same hour on the same day of the week over the past six weeks. With six 10-minute intervals per hour and six weeks of history, each interval is compared to the set of the last 36 values from the matching timeframe. If more than one interval in an hour is higher than the 98th percentile for its hour and day, the hour is marked yellow; if more than four intervals are high, the hour is marked red. If more than one interval is lower than the 2nd percentile for its hour and day, the hour is marked blue. Anything in between (i.e. anything that falls within roughly x-bar ± 2SD) is green.
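
The colouring rules above can be sketched in a few lines of Python. This is only an illustration, not Jonathan’s actual code: the function name `classify_hour`, the choice of percentile interpolation, and the order in which the rules are checked (red, then yellow, then blue) are all assumptions, since the post does not say what happens when an hour has both high and low intervals.

```python
from statistics import quantiles


def classify_hour(intervals, history):
    """Colour one hour (hypothetical helper, not from the original system).

    intervals: the hour's 10-minute utilization readings (normally six).
    history:   readings for the same hour and weekday over the past
               six weeks (normally 36 values).
    """
    # quantiles(..., n=50) returns 49 cut points: the first is the
    # 2nd percentile and the last is the 98th percentile.
    cuts = quantiles(history, n=50, method="inclusive")
    low, high = cuts[0], cuts[-1]

    n_high = sum(1 for value in intervals if value > high)
    n_low = sum(1 for value in intervals if value < low)

    # Assumed precedence: high flags are checked before low flags.
    if n_high > 4:
        return "red"
    if n_high > 1:
        return "yellow"
    if n_low > 1:
        return "blue"
    return "green"


# Made-up data: 36 historical readings and one hour with two high intervals.
history = [float(v) for v in range(1, 37)]
print(classify_hour([40.0, 40.0, 20.0, 20.0, 20.0, 20.0], history))  # yellow
```

Counting intervals rather than flagging on a single breach means one short spike leaves the hour green, which matches the “more than one interval” wording of the rules.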

Here’s the main “CPU Overview” page from his system:

The thumbnails give an idea of what’s going on – green is within normal range. Let’s look at Sunday, Jan. 23rd (just because all the colours are there). Clicking on any thumbnail shows that day close up:

Without going into details about what runs on which systems, we can see that they’re listed in reverse alphabetical order and, of course, anyone who’s looking at this knows which system is which. The user can see that many systems were running well below their normal utilization on this particular Sunday. That’s mostly because of some special testing: our developers were asked to stay off the systems if they could. To see more detail, let’s choose SCA6, which has all of the colours. If we click anywhere on the bar for SCA6, this next level of detail is shown:

(Igor Trubin: That is very similar to IT-control charts: see http://itrubin.blogspot.com/2009/05/seds-charts-at-scmg.html , but in the older 24-hour format; I now prefer to use the 7x24 weekly profile shown here: https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjizmXcXEYkyoRxutIr4xS2dj83eQeVurIX3NQhrWlge4a6BYZeXM1JU9mCvdwiwvZEtMX-m4Q93BC98g9vhEKm_vSaJguMlE7RmVyU7ih_RViHc9FmjjscNzDPgsLtYglQ_SxCgERsii4/s1600-h/Untitled.jpg)

That chart shows the system’s total utilization (from SMF70s) for individual 10-minute intervals (green area) compared to the average, high (98th percentile) and low (2nd percentile) values for each hour based on the last six weeks. We see why some hours are marked red, yellow or blue instead of green according to the rules above. Clicking anywhere on the green area opens a long page full of control charts that show the same information for each defined service class within that system (from SMF72s).
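
As a rough sketch of how the reference lines on such a chart could be derived, the Python below groups historical samples by weekday and hour and computes the average, 2nd percentile and 98th percentile for each group. The function name `build_baseline`, the input layout (pairs of timestamp and utilization) and the synthetic data are assumptions for illustration; the post does not describe how the real system stores or reads its SMF-derived history.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, quantiles


def build_baseline(samples):
    """Map (weekday, hour) -> (average, 2nd percentile, 98th percentile).

    samples: iterable of (datetime, utilization) pairs covering the
    history window (six weeks in the post), excluding the day being judged.
    """
    buckets = defaultdict(list)
    for timestamp, utilization in samples:
        # weekday() is 0 for Monday through 6 for Sunday.
        buckets[(timestamp.weekday(), timestamp.hour)].append(utilization)

    baseline = {}
    for key, values in buckets.items():
        if len(values) < 2:
            continue  # too little history to estimate percentiles
        cuts = quantiles(values, n=50, method="inclusive")
        baseline[key] = (mean(values), cuts[0], cuts[-1])
    return baseline


# Synthetic history: six Sundays, 05:00-05:50, one reading every 10 minutes.
first_sunday = datetime(2011, 1, 2, 5, 0)
demo_samples = [
    (first_sunday + timedelta(weeks=week, minutes=10 * slot),
     50.0 + week + slot)
    for week in range(6)
    for slot in range(6)
]
average, low, high = build_baseline(demo_samples)[(6, 5)]
print(f"avg={average:.1f} low={low:.1f} high={high:.1f}")
```

Keying the baseline on weekday and hour is what lets the same structure serve both the 24-hour chart described here and a 7x24 weekly profile.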

Among them the following, BATH_A6, is high-priority batch. Clearly it was driving some of the yellow and red flags for this system in the 2-3 and 5-7h windows:

(This post is published here with Jonathan Gladstone’s permission. He retains all publication rights and copyright for this material.)

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences in the robotics and artificial intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked for 2 years as the Capacity team lead for IBM, for SunTrust Bank for 3 years, and then at IBM for 3 years as Sr. IT Architect. He now works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.

2 comments:

Tim Browning, February 17, 2011
Excellent use of statistical graphics. I think the process could be refined by filtering out outliers from the history (from which the average is calculated). For data that contains outliers, Tukey provides a method for detection: x is an outlier if x > p75 + 1.5 * IQR or x < p25 - 1.5 * IQR, where p75 is the 75th percentile, p25 is the 25th percentile, and IQR is the inter-quartile range (p75 - p25). It also works best for non-normal distributions.
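
The Tukey rule in this comment could be applied to the history before the averages and percentiles are calculated. The Python below is a minimal sketch under that assumption; the function name `drop_tukey_outliers` and the sample values are made up, and nothing in the post says the monitoring system actually does this filtering.

```python
from statistics import quantiles


def drop_tukey_outliers(history, k=1.5):
    """Return history without values outside Tukey's fences.

    A value x is dropped if x > p75 + k * IQR or x < p25 - k * IQR,
    where IQR = p75 - p25. Hypothetical helper for illustration.
    """
    p25, _median, p75 = quantiles(history, n=4, method="inclusive")
    iqr = p75 - p25
    lower_fence = p25 - k * iqr
    upper_fence = p75 + k * iqr
    return [x for x in history if lower_fence <= x <= upper_fence]


# Made-up utilization history with one runaway reading.
readings = [10, 11, 12, 13, 14, 15, 16, 17, 90]
print(drop_tukey_outliers(readings))  # [10, 11, 12, 13, 14, 15, 16, 17]
```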
Igor Trubin, March 22, 2011
CMG Canada is having a paper presented by J. Gladstone related to this posting. Check: http://regions.cmg.org/regions/cacmg/