His system does not generate any alerts on this basis, but it’s a good place to go to
- find out what’s been running hot (or cool) at the system level, and/or
- figure out why at the service class level.
It compares each interval (in this case every 10 minutes) of the most recent day’s utilization (by system and by service class) with the values recorded for the same hour on the same day of the week over the past six weeks. With six 10-minute intervals per hour and six weeks of history, each interval is compared to the set of the last 36 values from the same timeframe. If more than one interval in an hour is higher than the 98th percentile for its hour & day, the hour is marked yellow; if more than four intervals are high, the hour is marked red. If more than one interval is lower than the 2nd percentile for its hour & day, the hour is marked blue. Anything in between (i.e. anything that falls within roughly x-bar±2SD) is green.
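The colouring rules above can be sketched in a few lines of Python. This is only an illustration, not Jonathan’s actual implementation: the names `percentile` and `classify_hour` are hypothetical, the percentile uses simple linear interpolation, and checking the "high" rules before the "low" rule is an assumption, since the post does not say what happens when an hour has both.

```python
def percentile(values, p):
    """Linearly interpolated p-th percentile (0-100) of a non-empty list."""
    s = sorted(values)
    rank = (len(s) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def classify_hour(current, history, high_pct=98, low_pct=2):
    """Return the colour for one hour.

    current -- the hour's interval values for the most recent day
               (six values for 10-minute intervals)
    history -- the reference values for the same hour and day of week
               (36 values for six weeks of 10-minute intervals)
    """
    high_limit = percentile(history, high_pct)
    low_limit = percentile(history, low_pct)

    n_high = sum(1 for v in current if v > high_limit)
    n_low = sum(1 for v in current if v < low_limit)

    # Assumption: the "high" rules take precedence over the "low" rule.
    if n_high > 4:
        return "red"
    if n_high > 1:
        return "yellow"
    if n_low > 1:
        return "blue"
    return "green"
```

Calling `classify_hour` once per hour, per system and per service class, would give the grid of colours that the thumbnails summarize.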
Here’s the main “CPU Overview” page from his system:
The thumbnails give an idea of what’s going on – green is within normal range. Let’s look at Sunday, Jan. 23rd (just because all the colours are there). Clicking on any thumbnail shows that day close up:
Without going into details about what runs in which systems, we can see that they’re listed in reverse alphabetical order and, of course, anyone who’s looking at this knows which system is which. The user can see that a lot of systems were running well below their normal utilization on this particular Sunday. That’s mostly because of some special testing: our developers were asked to stay off the systems if they could. To see more detail, let’s choose SCA6, which has all of the colours. If we click anywhere on the bar for SCA6, this next level of detail is shown:
(Igor Trubin: That is very similar to IT-control charts, see http://itrubin.blogspot.com/2009/05/seds-charts-at-scmg.html , but in the older 24-hour format; I now prefer to use the 7x24 weekly profile shown here: https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjizmXcXEYkyoRxutIr4xS2dj83eQeVurIX3NQhrWlge4a6BYZeXM1JU9mCvdwiwvZEtMX-m4Q93BC98g9vhEKm_vSaJguMlE7RmVyU7ih_RViHc9FmjjscNzDPgsLtYglQ_SxCgERsii4/s1600-h/Untitled.jpg)
That chart shows the system’s total utilization (from SMF70s) for individual 10-minute intervals (green area) compared to the average, high (98th percentile) and low (2nd percentile) values for each hour based on the last six weeks. We can see why some hours are marked red, yellow or blue instead of green according to the rules above. Clicking anywhere on the green area brings up a long page full of control charts that show the same information for each defined service class within that system (from SMF72s).
Among them the following, BATH_A6, is high-priority batch. Clearly it was driving some of the yellow and red flags for this system in the 2-3 and 5-7h windows:
(This post is published here with Jonathan Gladstone’s permission. He retains all publication rights and copyright for this material.)
Excellent use of statistical graphics. I think the process could be refined by filtering out outliers from the history (from which the average is calculated). For data that contains outliers, Tukey provides a method for detection: x is an outlier if x > p75 + 1.5 * IQR or x < p25 - 1.5 * IQR, where p75 is the 75th percentile, p25 is the 25th percentile, and IQR is the interquartile range (p75 - p25). It also works best for non-normally distributed data.
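As a rough illustration of the Tukey rule described in this comment, the history could be filtered as sketched below. The function name `tukey_filter` and the sample values are made up for the example, and the quartiles come from Python’s `statistics.quantiles` with the "inclusive" method, which is one of several common quartile conventions; nothing here is part of the system described in the post.

```python
import statistics


def tukey_filter(history, k=1.5):
    """Split history into (kept, outliers) using Tukey's fences.

    A value x is treated as an outlier if
    x > p75 + k * IQR or x < p25 - k * IQR.
    """
    # n=4 returns the three quartile cut points: p25, median, p75.
    p25, _median, p75 = statistics.quantiles(history, n=4, method="inclusive")
    iqr = p75 - p25
    low_fence = p25 - k * iqr
    high_fence = p75 + k * iqr

    kept = [x for x in history if low_fence <= x <= high_fence]
    outliers = [x for x in history if x < low_fence or x > high_fence]
    return kept, outliers


# Hypothetical utilization values with one obvious spike.
kept, outliers = tukey_filter([10, 11, 12, 13, 14, 15, 16, 17, 90])
```

The average and percentile limits would then be computed from `kept` rather than from the full history.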
CMG Canada is having a paper presented by J. Gladstone related to this posting. Check: http://regions.cmg.org/regions/cacmg/