Wednesday, May 23, 2012

STEEDd: Another Implementation of The Near-Real-Time Control Charting and EV Calculating


Thierry Déléris is a French mainframe systems programmer on a team dedicated to performance, metrology and capacity planning. He used some ideas published in Trubin's CMG papers to implement the following:

1. The first part of the solution sends a daily email per CEC with a spreadsheet broken down by LPAR and workload. Thresholds are calculated with the R language by day of the week, hour of the day, LPAR name and WLM workload, based on six months of history (SMF 72 records), with outliers excluded using Tukey's statistical method.
This initial part of the solution has a big drawback: it delivers the spreadsheet for a CEC only the next day, because it is based on the previous day's SMF 72-3 records, collected overnight by TDSz...
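To make the threshold step more concrete, here is a minimal Python sketch (the original uses R, and its actual code is not shown in this post). It groups history by day of week, hour, LPAR and workload, drops outliers with Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR), and then derives low and high thresholds. The function names, the input layout, the minimum-history cutoff, and especially the "mean plus or minus 3 standard deviations" rule applied after the outlier exclusion are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, quantiles, stdev


def tukey_fences(values, k=1.5):
    """Return (low, high) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def thresholds_by_group(samples, k=1.5, sigmas=3.0, min_history=4):
    """Compute (low, high) thresholds per (weekday, hour, lpar, workload).

    samples: iterable of (weekday, hour, lpar, workload, value) tuples,
    a hypothetical layout standing in for data derived from SMF 72 records.
    """
    groups = defaultdict(list)
    for weekday, hour, lpar, workload, value in samples:
        groups[(weekday, hour, lpar, workload)].append(value)

    result = {}
    for key, values in groups.items():
        if len(values) < min_history:
            continue  # not enough history to say anything useful
        fence_low, fence_high = tukey_fences(values, k)
        # Exclude outliers before computing the limits.
        kept = [v for v in values if fence_low <= v <= fence_high]
        if len(kept) < 2:
            continue
        # Assumption: control limits are mean +/- sigmas * stdev of kept data.
        center, spread = mean(kept), stdev(kept)
        result[key] = (center - sigmas * spread, center + sigmas * spread)
    return result
```

With six months of history, each group holds roughly 26 values (one per week for a given weekday and hour), so a single runaway day is removed by the fences instead of inflating the limits.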

2. The second part of the solution, called STEEDd (Statistical Tool for Enhanced Exceptions Detection and Diagnosis, and a reference to John Steed, the character from the British TV show "The Avengers", and his legendary bowler hat), was developed in Java. It uses the same R-calculated thresholds but checks them every 15 minutes, interacting with BMC Mainview on the host to collect the current data (in fact, the data for the last 15 minutes). It provides a main screen to select the metric to monitor and a control screen per metric. An email alert is sent to the team if a metric's value goes above the high threshold or below the low threshold.
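The 15-minute check can be sketched as follows. This is written in Python for illustration, not the Java code of STEEDd, and it leaves out both the data collection from BMC Mainview and the actual email sending. The threshold table layout (keyed by weekday number, hour, LPAR and workload) and the function names are hypothetical.

```python
from datetime import datetime


def classify(thresholds, when, lpar, workload, value):
    """Compare one reading against the thresholds for its weekday and hour.

    thresholds: dict mapping (weekday, hour, lpar, workload) -> (low, high),
    where weekday follows datetime.weekday() (0 = Monday).
    Returns "high", "low", "ok", or "unknown" when no thresholds exist.
    """
    limits = thresholds.get((when.weekday(), when.hour, lpar, workload))
    if limits is None:
        return "unknown"
    low, high = limits
    if value > high:
        return "high"
    if value < low:
        return "low"
    return "ok"


def alerts_for(thresholds, when, readings):
    """Build alert messages for one 15-minute interval.

    readings: iterable of (lpar, workload, value) for the last 15 minutes.
    """
    messages = []
    for lpar, workload, value in readings:
        status = classify(thresholds, when, lpar, workload, value)
        if status in ("high", "low"):
            messages.append(
                f"{when:%Y-%m-%d %H:%M} {lpar}/{workload}: "
                f"value {value} is {status}"
            )
    return messages
```

A scheduler would call `alerts_for` once per interval and pass any returned messages to whatever mail facility the site uses.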

As an example, here is a picture of the control screen used for the CPU metric by workload and LPAR:

Legend:
When the icon is selected, the associated control chart pops up showing the metric for the last 12 hours, like the one shown below:


The idea of EV (Extra Value or Exception Value, introduced in Trubin’s CMG papers and discussed in this blog) is used there (red bars for EV+ and yellow bars for EV- in the picture above). This helps filter genuine alerts from false ones.
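As a rough illustration of the EV idea, the sketch below treats EV+ as the summed amount by which the metric exceeds the upper limit and EV- as the summed amount by which it falls below the lower limit, over equally spaced intervals. This is an approximation of the "area outside the limits" reading of EV and an assumption on my part; the exact formula used in STEEDd is not given here, and the function name and the choice to report both values as positive magnitudes are hypothetical.

```python
def exception_values(actual, lower, upper):
    """Return (ev_plus, ev_minus) for equally spaced intervals.

    actual, lower, upper: sequences of the same length holding the metric
    and its low/high limits for each interval.
    ev_plus sums the excess above the upper limit; ev_minus sums the
    shortfall below the lower limit (both as positive magnitudes,
    which is an assumption of this sketch).
    """
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("actual, lower and upper must have the same length")
    ev_plus = sum(max(a - u, 0.0) for a, u in zip(actual, upper))
    ev_minus = sum(max(l - a, 0.0) for a, l in zip(actual, lower))
    return ev_plus, ev_minus
```

A brief spike that barely crosses a limit yields a small EV, while a sustained or large excursion yields a large one, so ranking or filtering alerts by EV gives a way to ignore marginal exceptions.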

3. The third part of the solution is on the way: an artificial intelligence solution based on a rule engine is being studied to explore detected problems hierarchically. This application will be used to enhance the analysis of metric alerts in the manner of an "expert system".

(Posted with Thierry Déléris's permission)