
Showing posts with label Threshold. Show all posts

Friday, October 7, 2011

EV-Control Chart

I introduced the EV meta-metric in 2001 as a measure of anomaly severity. EV stands for Exception Value; a fuller explanation of the idea can be found here: The Exception Value Concept to Measure Magnitude of Systems Behavior Anomalies
Basically, it is the difference (integral) between the actual data and the control limits. So far I have used EV data mostly to filter out real issues or to recognize hidden trends automatically. For instance, in my CMG’08 paper “Exception Based Modeling and Forecasting” I plotted that metric in Excel to explain how it can be used to recognize the starting point of a new trend. Here is the picture from that paper, where EV is called “Extra Volume” and, for the particular parent metric (CPU utilization), is named ExtraCPUtime:

The EV meta-metric first chart 

But just plotting that meta-metric and/or its two components (EV+ and EV-) over time gives a valuable picture of system behavior. If the system is stable, the chart should be boring, showing near-zero values all the time. So with that chart it is very easy (I believe even easier than with MASF control charts) to recognize an unusual and statistically significant increase or decrease in the actual data at a very early stage (Early Warning!).
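To make the EV+ and EV- components concrete, here is a minimal Python sketch. It is only an illustration, not the calculation from the paper: the function name `exception_values`, the sample numbers, the fixed `step` between samples and the sign convention (EV- reported as a negative number) are all my assumptions. The "integral" is approximated by a simple sum of the excursions outside the control limits.

```python
def exception_values(actual, upper, lower, step=1.0):
    """Return (ev_plus, ev_minus) for one series of samples.

    ev_plus  accumulates the area by which actual exceeds the upper limit.
    ev_minus accumulates the area by which actual falls below the lower
    limit (assumed convention: reported as a non-positive number).
    """
    ev_plus = 0.0
    ev_minus = 0.0
    for a, u, l in zip(actual, upper, lower):
        if a > u:
            ev_plus += (a - u) * step
        elif a < l:
            ev_minus += (a - l) * step
    return ev_plus, ev_minus


# Hypothetical hourly CPU utilization (%) with constant control limits.
actual = [50, 62, 71, 55, 38, 45]
upper = [60] * len(actual)
lower = [40] * len(actual)

ev_plus, ev_minus = exception_values(actual, upper, lower)
ev = ev_plus + ev_minus  # net EV; zero when the data stays inside the limits
print(ev_plus, ev_minus, ev)
```

For a stable system every sample stays between the limits, so both components stay at zero, which is the "boring" chart described above.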

Here is an example of the EV-chart built against the same sample data used in a few previous posts:
1. Excel example:

2. BIRT/MySQL example as a continuation of the exercise from the previous post:

IT-Control chart vs. EV-Chart
Here are the BIRT screenshots that illustrate how it is built:

A. An additional query that calculates EV is written directly in the additional BIRT Data Set object called “Data set for EV Chart”:
SQL query to calculate the EV meta-metric from the data kept in a MySQL table

B. Then an additional bar-chart object is added to the report and bound to the new “Data set for EV Chart”:
The resulting report is already shown here.





Tuesday, March 22, 2011

CMG Canada Paper about Threshold Management

CMG Canada will have a paper presented by J. Gladstone related to the following posting on this blog: http://itrubin.blogspot.com/2011/02/jonathan-gladstone-threshold-management.html

Thursday, February 17, 2011

Jonathan Gladstone: Threshold Management Diagram

Jonathan Gladstone has worked with a team to implement pro-active Mainframe CPU usage monitoring, basing his design partly on presentations and conversations with Igor Trubin (currently of IBM) and Boris Ginis (of BMC Software).

His system does not generate any alerts on this basis, but it’s a good place to go to
  • find out what’s been running hot (or cool) at the system level, and/or
  • figure out why at the service class level.

It compares each interval (in this case every 10 minutes) of the most recent day’s utilization (by system and by service class) with the average for a given hour on a given day of the week over the past six weeks. Each interval is compared to the set of the last 36 values in a similar timeframe. If more than one interval in an hour is higher than the 98th percentile for its hour & day, the hour is marked yellow; if more than four intervals are high the hour is marked red. If more than one interval is lower than the 2nd percentile for its hour & day, the hour is marked blue. Anything in between (i.e. anything that falls within roughly x-bar±2SD) is green.
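The colouring rules above can be sketched in a few lines of Python. This is a hedged illustration, not Jonathan's implementation: the function name `classify_hour`, the sample values and the percentile limits passed in are hypothetical, and the order of the checks (high counts tested before low counts) is my assumption, since the rules do not say what happens when an hour has both high and low intervals.

```python
def classify_hour(intervals, p98, p2):
    """Colour one hour from its 10-minute utilization samples.

    intervals: the six samples for the hour.
    p98, p2:   98th and 2nd percentile for this hour and day of week,
               taken from the matching historical values (assumed to be
               computed elsewhere).
    """
    high = sum(1 for v in intervals if v > p98)
    low = sum(1 for v in intervals if v < p2)
    if high > 4:       # more than four high intervals
        return "red"
    if high > 1:       # more than one high interval
        return "yellow"
    if low > 1:        # more than one low interval
        return "blue"
    return "green"     # anything in between


# Hypothetical samples with limits p98=90 and p2=20.
red_hour = classify_hour([91, 92, 93, 94, 95, 50], 90, 20)
yellow_hour = classify_hour([50, 52, 91, 93, 95, 97], 90, 20)
blue_hour = classify_hour([10, 12, 50, 50, 50, 50], 90, 20)
green_hour = classify_hour([95, 50, 50, 50, 50, 10], 90, 20)
print(red_hour, yellow_hour, blue_hour, green_hour)
```

Note that a single high or a single low interval leaves the hour green, because each rule requires more than one interval outside the limits.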

Here’s the main “CPU Overview” page from his system:



The thumbnails give an idea of what’s going on: green is within the normal range. Let’s look at Sunday, Jan. 23rd (just because all the colours are there). Clicking on any thumbnail shows that day close up:



Without going into details about what runs in which systems, we can see that they’re listed in reverse alpha order and, of course, anyone who’s looking at this knows which system is which. The user can see that a lot of systems were running well below their normal utilization on this particular Sunday. That’s mostly because of some special testing: our developers were asked to stay off the systems if they could. To see more detail let’s choose SCA6, which has all of the colours. If we click anywhere on the bar for SCA6, this next level of detail is shown:


  

That chart shows the system’s total utilization (from SMF70s) for individual 10-minute intervals (green area) compared to the average, high (98%ile) and low (2%ile) values for each hour based on the last six weeks. We see why some hours are marked red, yellow or blue instead of green according to the rules above. Clicking anywhere on the green area gets a long page full of control charts that show the same information for each defined service class within that system (from SMF72s).

Among them the following, BATH_A6, is high-priority batch. Clearly it was driving some of the yellow and red flags for this system in the 2-3 and 5-7h windows:





(This post is published here with Jonathan Gladstone’s permission. He retains all publication rights and copyright for this material.)