System Management by Exception: SPC


Thursday, April 5, 2012

Prehistory of SEDS: Virtual CMG'90 Trip Report about Control Chart Usage. Part 1.

Searching the www.CMG.org knowledge base for the keyword "Control Chart", I found a few very old CMG papers that discuss applying the classical SPC approach to computer performance data.

Here is the first one:

Fine-Grain Analysis (FGA): A Methodology for Analyzing Intermittent Performance Problems

By Robert Berry & Jeffrey Hedglin

The paper describes which mainframe metrics are good candidates for control charting. They should be of two types: (a) a Performance Quality Measure, which sounds like a modern KPI (e.g. response time); (b) system performance metrics (e.g. CPU queue length). The paper then describes how an intermittent problem can be detected just by plotting SPC control charts for both types of metrics in sync (correlated).
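To make the "in sync" idea concrete, here is a minimal Python sketch, not the FGA paper's actual procedure. It assumes classical Shewhart-style limits (mean plus or minus 3 standard deviations of a reference period) and flags only the intervals where the quality measure and the system metric both exceed their upper limits. The function names and the sample numbers are made up for illustration.

```python
from statistics import mean, stdev

def control_limits(reference, k=3.0):
    """Return (LCL, UCL) as mean -/+ k standard deviations of the reference data."""
    m = mean(reference)
    s = stdev(reference)
    return m - k * s, m + k * s

def joint_exceptions(quality, system, quality_ref, system_ref, k=3.0):
    """Indexes of intervals where BOTH metrics are above their upper control limit."""
    _, q_ucl = control_limits(quality_ref, k)
    _, s_ucl = control_limits(system_ref, k)
    return [
        i
        for i, (q, s) in enumerate(zip(quality, system))
        if q > q_ucl and s > s_ucl
    ]

# Hypothetical reference (normal) period
response_time_ref = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0, 1.1]  # seconds
cpu_queue_ref = [2, 3, 2, 2, 3, 2, 3, 3]                      # queue length

# Hypothetical period under investigation
response_time = [1.0, 1.1, 3.5, 1.0, 3.0]
cpu_queue = [2, 3, 9, 8, 2]

print(joint_exceptions(response_time, cpu_queue, response_time_ref, cpu_queue_ref))
# prints [2]
```

Only interval 2 is reported: in interval 3 the queue is long but response time is normal, and in interval 4 response time is bad but the queue is not, so neither points to this particular metric pair.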

I use that approach a lot now, but with the MASF type of control chart and specifically my IT-Control Charts. BTW, I am now writing my next CMG paper and plan to add a couple of very persuasive examples of correlated IT-Control Charts, such as the number of concurrent user logons vs. the number of physical CPUs used by LPARs on a p770 AIX frame.

To be continued....

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked for 12 years as a professor teaching CAD/CAM and Robotics. He published 30+ papers and made several presentations at conferences in the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years, and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the YouTube channel iTrubin.

Friday, October 7, 2011

EV-Control Chart

I introduced the EV meta-metric in 2001 as a measure of anomaly severity. EV stands for Exception Value; more explanation of the idea can be found here: The Exception Value Concept to Measure Magnitude of Systems Behavior Anomalies

Basically, it is the difference (integral) between the actual data and the control limits. So far I have used EV data mostly to filter out real issues or for automatic recognition of hidden trends. For instance, in my CMG’08 paper “Exception Based Modeling and Forecasting” I plotted that metric in Excel to explain how it can be used to recognize the starting point of a new trend. Here is the picture from that paper, where EV is called “Extra Volume” and, for the particular parent metric (CPU util.), is named ExtraCPUtime:

The EV meta-metric first chart

But just plotting that meta-metric and/or its two components (EV+ and EV-) over time gives a valuable picture of system behavior. If the system is stable, the chart should be boring, showing near-zero values all the time. So with that chart it should be very easy (I believe even easier than with MASF control charts) to recognize an unusual and statistically significant increase or decrease in the actual data at a very early stage (Early Warning!).
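For readers who want to try the EV idea on their own data, here is a minimal Python sketch. It assumes equally spaced samples (so the integral becomes a plain sum) and that the upper and lower control limits are already known for each interval. The function name and the sample numbers are hypothetical; this is an illustration of the concept, not the exact SEDS calculation.

```python
from typing import Sequence, Tuple

def exception_value(
    actual: Sequence[float],
    ucl: Sequence[float],
    lcl: Sequence[float],
) -> Tuple[float, float]:
    """Return (EV+, EV-) for one period.

    EV+ sums how far the actual data rises above the upper limit;
    EV- sums how far it falls below the lower limit (zero or negative).
    """
    if not (len(actual) == len(ucl) == len(lcl)):
        raise ValueError("actual, ucl and lcl must have the same length")
    ev_plus = sum(max(a - u, 0.0) for a, u in zip(actual, ucl))
    ev_minus = sum(min(a - l, 0.0) for a, l in zip(actual, lcl))
    return ev_plus, ev_minus

# Hypothetical CPU utilization (%) with flat control limits
actual = [50.0, 55.0, 90.0, 95.0, 20.0, 52.0]
ucl = [70.0] * 6
lcl = [30.0] * 6

ev_plus, ev_minus = exception_value(actual, ucl, lcl)
print(ev_plus, ev_minus)  # prints 45.0 -10.0
```

For a stable period both numbers stay at zero, which is exactly the "boring" chart described above; plotting EV+ and EV- per day or per week gives the EV-chart.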

Here is an example of that EV-chart against the same sample data used in a few previous posts:

1. Excel example:

2. BIRT/MySQL example as a continuation of the exercise from the previous post:

IT-Control chart vs. EV-Chart

Here are the BIRT screenshots that illustrate how it is built:

A. An additional query to calculate EV, written directly in the additional BIRT Data Set object called “Data set for EV Chart”:

SQL query to calculate EV meta-metric

SQL query to calculate EV metric from the data kept in MySQL table

B. Then an additional bar-chart object, bound to the new “Data set for EV Chart”, is added to the report:

The resulting report is already shown here.

Igor Trubin

Monday, November 15, 2010

My CMG'10 presentation - "IT-Control Charts"

I will go to the CMG conference this time for only one day, just to present my paper "IT-Control Charts" on Wednesday, December 8th, at 10:30. You are WELCOME!

Check it in the CMG conference agenda - http://www.cmg.org/cgi-bin/agenda_2010.pl?action=more&token=5030

For Russian readers (information in Russian is here), I posted about the event on my Russian mirror blog: http://ukor.blogspot.com/2010/11/cmg10_15.html

Igor Trubin

Monday, October 18, 2010

Statistical Process Control to Improve IT Services - one more CMG'10 paper related to this blog subject

Using Statistical Process Control to Improve the Quality and Delivery of IT Services
Nathan Shiffman Armin Roeseler, Townsend Analytics Mike Pecak
This session presents a framework for the delivery of IT services based on Continuous Quality Improvement (CQI). Starting with the Capability Maturity Model (CMM), we develop a process oriented approach based on Statistical Process Control (SPC). We apply the framework to the Change Management process of a large IT environment for a trading software firm, and show how failure-rates of the Change Management process were reduced dramatically.

Igor Trubin

Monday, June 7, 2010

Near-Real-Time IT-Control Chart R-Simulation

UPDATE: Now the following free web tool to build IT-control charts is available:

www.Perfomalist.com

See more explanation

Review of IT Control Chart

Igor Trubin

Wednesday, October 21, 2009

Lower Control Limit Usage Examples for IT Capacity Management

I recently posted the following question as a discussion subject in the LinkedIn "Statistical Process Control" group: "Does it make any sense to use Control Charts for capacity management?" and got one pessimistic comment, which included the following statement:

"...The only situation I can think of using a control chart for capacity is if you had a piece of equipment that if over utilized would cause damage or premature wear in which case you would only have an upper control..."

I disagree. My system (SEDS) has a special part (updated lists) called "Unusual Capacity Usage OUTSIDERS" that helps capture serious issues with servers, such as a database going down, an LPAR migrating off a host, and other unusual capacity releases that are not necessarily good things.

The following control charts from my upcoming CMG'09 workshop presentation are good illustrations of the types of findings SEDS captures:

1. VMware host issue (VM migration):

2. Unisys server database is down:

3. Mainframe application unusual low CPU usage:
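Since the charts themselves are images, here is a minimal Python sketch of how a below-the-lower-limit finding could be detected. It is only an illustration under my own assumptions, not the SEDS code: history is grouped by hour of the week (MASF-style), limits are mean plus or minus 3 standard deviations per hour, and all names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def hourly_limits(history, k=3.0):
    """Build {hour_of_week: (LCL, UCL)} from (hour_of_week, value) history pairs."""
    groups = defaultdict(list)
    for hour, value in history:
        groups[hour].append(value)
    limits = {}
    for hour, values in groups.items():
        if len(values) < 2:
            continue  # not enough history to estimate the spread
        m, s = mean(values), stdev(values)
        # Utilization cannot be negative, so the lower limit is clamped at zero
        limits[hour] = (max(m - k * s, 0.0), m + k * s)
    return limits

def below_lower_limit(current, limits):
    """Hours of the current week where usage dropped below the lower control limit."""
    return [
        hour
        for hour, value in current
        if hour in limits and value < limits[hour][0]
    ]

# Hypothetical CPU utilization (%) for hours 9 and 10 over four past weeks
history = [
    (9, 60.0), (9, 62.0), (9, 58.0), (9, 61.0),
    (10, 70.0), (10, 72.0), (10, 69.0), (10, 71.0),
]
# Current week: hour 10 collapses, e.g. because the database went down
current = [(9, 59.0), (10, 5.0)]

print(below_lower_limit(current, hourly_limits(history)))  # prints [10]
```

The same limits also give the usual upper-limit check, but here only the lower side is used, because an unexpected release of capacity is the signal of interest.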

Igor Trubin

Thursday, June 7, 2007