System Management by Exception: April 2010

Tuesday, April 13, 2010

Disk Subsystem Capacity Management - my CMG'03 paper - "Health Index" metric and Dynamic Thresholds

Here is the link to my CMG'03 paper: http://www.cmg.org/proceedings/2003/3099.pdf
(Free download but registration is required)
Presentation slides are freely available here:
Disk Subsystem Capacity Management, Based on Business ... - CMG

1. The paper showed an interesting way to report disk space usage via BMC Perceive:

2. The paper includes an example of using an interesting "Health Index" metric. I simply took it from the Concord performance data collector (now a CA product, I believe), where it is one of many performance metrics.

Based on the Concord eHealth tool documentation:

“System Health Index” is the sum of five components (variables):
– SYSTEM, which reports a CPU imbalance problem;
– MEMORY, which reports exceeding a memory utilization threshold or paging and/or swapping problems;
– CPU, which reports exceeding a utilization threshold;
– COMM., which reports network errors or exceeding network volume thresholds;
– STORAGE, which might be a combination of:
a. exceeding the user partition utilization threshold;
b. exceeding the system partition utilization threshold;
c. file cache miss rate and allocation failures;
d. disk I/O faults, which can add additional points to this Health Index component.
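To make the "sum of five components" idea concrete, here is a minimal sketch in Python. It is not the Concord eHealth implementation: the point values, the thresholds, and the names `HealthComponents` and `storage_points` are all assumptions made up for illustration, since the documentation summary above does not give the actual scoring rules.

```python
from dataclasses import dataclass


@dataclass
class HealthComponents:
    """Points per component; all values here are hypothetical."""
    system: int = 0   # CPU imbalance
    memory: int = 0   # memory utilization, paging/swapping
    cpu: int = 0      # CPU utilization
    comm: int = 0     # network errors or volume
    storage: int = 0  # partitions, file cache, I/O faults

    def health_index(self) -> int:
        # The index is simply the sum of the five components.
        return self.system + self.memory + self.cpu + self.comm + self.storage


def storage_points(user_util, system_util, cache_miss_rate,
                   alloc_failures, io_faults,
                   util_threshold=90.0, miss_threshold=20.0):
    """Assumed scoring: one point per storage condition that is violated."""
    points = 0
    if user_util > util_threshold:         # a. user partition utilization
        points += 1
    if system_util > util_threshold:       # b. system partition utilization
        points += 1
    if cache_miss_rate > miss_threshold:   # c. file cache miss rate ...
        points += 1
    if alloc_failures > 0:                 # c. ... and allocation failures
        points += 1
    if io_faults > 0:                      # d. disk I/O faults add points
        points += 1
    return points


storage = storage_points(user_util=95.0, system_util=50.0,
                         cache_miss_rate=5.0, alloc_failures=0, io_faults=2)
server = HealthComponents(cpu=3, storage=storage)
print(server.health_index())  # prints 5
```

With this kind of scoring, a higher index means a less healthy server, and each component shows where the points came from.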

I used that long ago; my current environment does not have that collector.
But I have started calculating my own kind of "health index", based on the numbers and types of exceptions (e.g. hot ones are defects such as run-aways; warning ones are just severe deviations from statistical norms; the number of hours/days with exceptions also matters). Filtered by application (using the CMDB), it gives an idea of how stable an application is. Some elements of that approach appear in my other papers.
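An exception-based index of this kind could be sketched as follows. This is only an illustration under stated assumptions: the weights for hot and warning exceptions, the way hours with exceptions are added to the score, and the names `WEIGHTS` and `app_health_index` are hypothetical, and the CMDB is reduced to a plain server-to-application dictionary.

```python
from collections import defaultdict

# Hypothetical weights: a hot exception (defect) counts more than a warning.
WEIGHTS = {"hot": 3, "warning": 1}


def app_health_index(exceptions, cmdb):
    """Score each application from its servers' exceptions.

    exceptions: iterable of (server, kind, hour) tuples
    cmdb: dict mapping server name to application name
    Returns {application: score}; a higher score means a less stable app.
    """
    weighted = defaultdict(int)
    hours = defaultdict(set)
    for server, kind, hour in exceptions:
        app = cmdb.get(server, "unknown")
        weighted[app] += WEIGHTS[kind]
        hours[app].add(hour)  # distinct hours with at least one exception
    # Assumed combination: weighted exception count plus hours affected.
    return {app: weighted[app] + len(hours[app]) for app in weighted}


exceptions = [
    ("srv1", "hot", "2010-04-13T10"),
    ("srv1", "warning", "2010-04-13T10"),
    ("srv2", "warning", "2010-04-13T11"),
]
cmdb = {"srv1": "billing", "srv2": "billing"}
print(app_health_index(exceptions, cmdb))  # prints {'billing': 7}
```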

2011 update: Another important idea in the paper is the suggestion to use Dynamic Thresholds, since high-level I/O-related metrics have no natural thresholds. Dynamic Thresholds have recently become popular, but I introduced them long ago!
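One common way to build dynamic thresholds, shown here only as a sketch and not necessarily the exact method from the paper, is to derive limits from the metric's own history: group past values by hour of day and use the mean plus or minus a few standard deviations. The function names, the grouping by hour of day, and the multiplier `k=3.0` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def dynamic_thresholds(history, k=3.0):
    """Compute per-hour limits from historical data.

    history: iterable of (hour_of_day, value) pairs
    Returns {hour_of_day: (lower_limit, upper_limit)}.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    limits = {}
    for hour, values in by_hour.items():
        m, s = mean(values), pstdev(values)
        limits[hour] = (m - k * s, m + k * s)
    return limits


def is_exception(limits, hour, value):
    """True if the value falls outside the limits learned for that hour."""
    lower, upper = limits[hour]
    return value < lower or value > upper


# I/O rate samples observed at 9:00 on four previous days (made-up numbers).
history = [(9, 100), (9, 110), (9, 90), (9, 100)]
limits = dynamic_thresholds(history)
print(is_exception(limits, 9, 130))  # prints True
print(is_exception(limits, 9, 105))  # prints False
```

The point of the design is that no fixed threshold is hard-coded: the limits follow whatever is normal for the metric at that hour.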

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked for 12 years as a professor teaching CAD/CAM and Robotics. He published 30+ papers and made several presentations at conferences in the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked for 2 years as the Capacity team lead for IBM, for 3 years at SunTrust Bank, and then for 3 years at IBM as Sr. IT Architect. Now he works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. Runs UT channel iTrubin

Capturing Workload Pathology By SEDS - my CMG'05 paper

The paper can be found here: https://www.researchgate.net/publication/221447101_Capturing_Workload_Pathology_by_Statistical_Exception_Detection_System
Here is the summary:
Problem definition: Server workload pathologies (defects) such as run-away processes and memory leaks capture spare server resources and cause the following issues:
- being a parasitic type of workload, they compete for resources with the real workload and cause performance degradation;
- they mimic a capacity issue but are not a real capacity problem; they just spoil the historical sample and cause wrong capacity trends, as seen in the figure below:

To fight this problem, I developed a way to capture those defects, report on them, and then remove them from the historical sample to see the real capacity trends. That was implemented as a part of the SEDS application. Detailed explanations are in my CMG'05 paper "Capturing Workload Pathology by Statistical Exception Detection System".
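The effect of removing pathological periods from the historical sample can be illustrated with a small sketch. This is not the SEDS code: it assumes the defect days have already been flagged, uses an ordinary least-squares line as the trend, and the names `linear_trend` and `clean_trend` and the sample data are made up for the example.

```python
def linear_trend(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def clean_trend(samples, pathological_days):
    """Fit the trend only on days not flagged as workload pathology."""
    kept = [(day, util) for day, util in samples
            if day not in pathological_days]
    return linear_trend(kept)


# Daily CPU utilization (%); days 3 and 4 had a run-away process.
samples = [(0, 10), (1, 11), (2, 12), (3, 60), (4, 65), (5, 15)]
raw_slope, _ = linear_trend(samples)
clean_slope, clean_intercept = clean_trend(samples, pathological_days={3, 4})
print(raw_slope > clean_slope)  # prints True: the defect inflates the trend
print(round(clean_slope, 2), round(clean_intercept, 2))  # prints 1.0 10.0
```

In this made-up sample the run-away days make the raw trend several times steeper than the real growth of about 1% per day.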

Another good result of implementing this solution was a dramatic reduction in the number of incidents related to run-away and memory leak defects. The chart below shows a more than twofold reduction over 2 years:

Other work in this area was done by Ron Kaminski. See his CMG paper here:

Automating Process and Workload Pathology Detection

presentation slides:

Automating Process and Workload Pathology