System Management by Exception: Reporting By Exception

Tuesday, February 4, 2020

Reporting By Exception

Reporting is an important part of System Management, and it too should be done by exception.

During one of my past job interviews, a manager showed me a monthly capacity management report that consisted of several hundred pages, mostly filled with very busy charts. It looked overwhelming. The common interest, though, is to report on systems only if they currently have issues or, based on some modeling, will have them soon. Plus, that is supposed to be a regularly updated web report. I got that job...

Another challenge is how to build charts for that type of report. There are two ways to do that:

1. OLTP: as a modern portal does, by querying some PDB on the fly and using an online graph generator.
2. Batch: during off hours, a batch job pre-builds all charts, and the web report just links to those GIF files.

Almost all modern capacity/availability management tools known (at least to me) use the first approach (generating charts on the fly). I have to use it from time to time, and I HATE it, because building more or less detailed charts (e.g. a couple of weeks of history for a few metrics) can take minutes and minutes! Plus, the input web form for choosing metrics, time frames, systems and other options is usually very complicated and has a long learning curve.

The second one (a regularly updated set of pre-built charts) is the fastest way to get a report, but that approach has the following problem.

In one of my other past jobs we used that approach to pre-build charts for almost every system (servers, DBs and so on), and only for a few main metrics (as doing that for every metric is an impossible task!). As a result, we often had problems with the nightly jobs, and a few times our Capacity Management environment had its own capacity problem (BTW, I mentioned that challenge in my 2004 CMG paper about Disk Subsystem Capacity Management).

Finally, I found a better solution: using the second approach, but on an exception basis (using SEDS). That requires generating far fewer charts/reports overnight, so more metrics can be represented.
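To make the idea concrete, here is a minimal Python sketch of how a nightly batch job might decide which charts to pre-build. It is not SEDS itself: it assumes a simplified stand-in rule (a recent value outside the mean plus or minus three standard deviations of a historical baseline), and the function names, the data layout and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev

def is_exception(history, recent, k=3.0):
    """Return True if any recent value falls outside mean +/- k*stdev of history.

    Simplified stand-in for an exception detector; assumed rule, not SEDS.
    """
    if len(history) < 2:
        return False  # not enough baseline data to compute limits
    mu = mean(history)
    sigma = stdev(history)
    upper, lower = mu + k * sigma, mu - k * sigma
    return any(v > upper or v < lower for v in recent)

def charts_to_build(metrics, k=3.0):
    """metrics maps (server, metric) -> (history, recent).

    Returns the sorted list of (server, metric) pairs that need a chart.
    """
    return sorted(
        key
        for key, (history, recent) in metrics.items()
        if is_exception(history, recent, k)
    )

# Hypothetical sample data: baseline history and the latest readings.
metrics = {
    ("web01", "cpu_util"): ([40, 42, 41, 39, 43, 40, 41], [41, 42]),
    ("web01", "disk_io"): ([100, 110, 105, 95, 102, 98, 104], [180]),
    ("db01", "cpu_util"): ([60, 62, 61, 59, 63, 60, 61], [95, 97]),
}

for server, metric in charts_to_build(metrics):
    # A real job would render and save a chart file here.
    print(f"pre-build chart: {server} / {metric}")
```

With this sample data only two of the three server/metric pairs get a chart, which is the whole point: the nightly workload scales with the number of exceptions, not with the number of systems and metrics.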

The optimum is always between two extremes. Generating reports on the fly is still not a bad idea.
I guess that approach could be used to compile an exception report, such as a health check of a particular application or server. The input web form should offer only a few choices, like server or application name (e.g. based on the CMDB server-application mapping). Then, based on an exception database (e.g. with SEDS-type exceptions/issues), that list of servers/applications and metrics (subsystems) could be filtered to show only the systems/subsystems that had exceptions (anomalies).
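The filtering step could look roughly like the Python sketch below. It assumes the CMDB mapping is available as a plain dictionary from application name to server list, and the exception database as a set of (server, metric) pairs; both structures, the function name and the sample values are hypothetical.

```python
def health_check_targets(app, cmdb, exceptions):
    """Return {server: sorted metrics with exceptions} for the servers of `app`.

    Servers with no recorded exceptions are left out of the result.
    """
    result = {}
    for server in cmdb.get(app, []):
        flagged = sorted(m for (s, m) in exceptions if s == server)
        if flagged:
            result[server] = flagged
    return result

# Hypothetical CMDB mapping: application -> servers.
cmdb = {"billing": ["web01", "db01", "db02"], "hr": ["web02"]}

# Hypothetical exception database content: (server, metric) pairs.
exceptions = {("web01", "disk_io"), ("db01", "cpu_util"), ("db01", "memory")}

print(health_check_targets("billing", cmdb, exceptions))
print(health_check_targets("hr", cmdb, exceptions))
```

The form then only needs one input (the application or server name), and everything shown to the user is already narrowed down to what had anomalies.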

For instance, if the core part of the SEDS-lite application (check my previous post about the SEDS-lite project) were written in R, the presentation layer could be just a .NET application that takes pre-built charts, such as IT-control charts or other trend/run/forecast charts, and publishes them on the web using a simple GUI for choosing server or application names...

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other tech presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). Author of the “Performance Anomaly Detection” class (at CMG.com). He worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. Runs UT channel iTrubin.
