
Tuesday, April 13, 2010

Disk Subsystem Capacity Management - my CMG'03 paper - "Health Index" metric and Dynamic Thresholds

Here is the link to my CMG'03 paper:  http://www.cmg.org/proceedings/2003/3099.pdf
(Free download but registration is required)
Presentation slides are freely available here:
Disk Subsystem Capacity Management, Based on Business ... - CMG

1. The paper showed an interesting way to report disk space usage via BMC Perceive:


2. The paper includes an example of using an interesting "Health Index" metric. I took it from the Concord performance data collector (now a CA product, I believe), where it is one of many performance metrics.


Based on the Concord eHealth tool documentation:

“System Health Index” is the sum of five components (variables):
–SYSTEM, which reports a CPU imbalance problem;
–MEMORY, which reflects exceeding a memory utilization threshold or paging and/or swapping problems;
–CPU, which reflects exceeding a utilization threshold;
–COMM., which reports network errors or exceeding network volume thresholds;
–and STORAGE, which might be a combination of:
a. exceeding the user partition utilization threshold;
b. exceeding the system partition utilization threshold;
c. file cache miss rate and allocation failures; and
d. disk I/O faults, which can add additional points to this Health Index component.
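The "sum of five components" structure can be illustrated with a minimal sketch. The class name, field names, and point values below are hypothetical; the actual eHealth scoring rules are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class HealthComponents:
    """Points contributed by each of the five components (hypothetical scale)."""
    system: int = 0   # CPU imbalance
    memory: int = 0   # memory utilization, paging/swapping
    cpu: int = 0      # CPU utilization
    comm: int = 0     # network errors or volume
    storage: int = 0  # partitions, cache misses, allocation failures, I/O faults

    def health_index(self) -> int:
        # The index is simply the sum of the five component scores.
        return self.system + self.memory + self.cpu + self.comm + self.storage


# Example: a server with a CPU problem and a storage problem.
server = HealthComponents(cpu=2, storage=3)
print(server.health_index())
```

A higher value means more problems were detected, and a value of zero means no component reported an issue.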

I used that metric long ago; currently I do not have that collector in my environment.
But I have started calculating my own "health index", which is based on the number and types of exceptions (e.g. hot ones are defects such as run-aways; warning ones are just severe deviations from statistical norms; the number of hours/days with exceptions also matters). Filtering that by application (using the CMDB) gives you an idea of how stable each application is. Some elements of that approach are described in my other papers.
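An exception-based index of this kind could be sketched as follows. The weights, the tuple layout, and the `server_to_app` mapping (standing in for a CMDB lookup) are assumptions made for illustration, not the formula actually used.

```python
from collections import defaultdict

# Hypothetical weights: "hot" exceptions (defects such as run-aways) count
# more than "warning" exceptions (deviations from statistical norms).
WEIGHTS = {"hot": 5, "warning": 1}


def health_index_by_app(exceptions, server_to_app):
    """Score each application from (server, kind, hour) exception records."""
    points = defaultdict(int)
    hours_with_exceptions = defaultdict(set)
    for server, kind, hour in exceptions:
        app = server_to_app.get(server, "unknown")
        points[app] += WEIGHTS[kind]
        # Each distinct server-hour with an exception adds one extra point.
        hours_with_exceptions[app].add((server, hour))
    return {app: points[app] + len(hours_with_exceptions[app]) for app in points}


exceptions = [("s1", "hot", 1), ("s1", "warning", 1), ("s2", "warning", 3)]
server_to_app = {"s1": "billing", "s2": "billing"}
print(health_index_by_app(exceptions, server_to_app))
```

The lower the score, the more stable the application appears over the reporting period.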

2011 update: Another important idea in the paper is the suggestion to use Dynamic Thresholds, because high-level I/O-related metrics have no natural thresholds. Dynamic Thresholds have recently become popular, but I introduced them long ago!
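One common way to build a dynamic threshold is to derive it from the metric's own history instead of fixing it in advance. The sketch below assumes limits of mean plus or minus k standard deviations; the function names and the default k are illustrative, and the paper's exact method may differ.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Return (lower, upper) limits computed from historical values."""
    m = mean(history)
    s = stdev(history)  # sample standard deviation; needs at least 2 values
    return m - k * s, m + k * s


def is_exception(value, history, k=3.0):
    """True if the new value falls outside the dynamic limits."""
    lower, upper = dynamic_threshold(history, k)
    return value < lower or value > upper


# Example: I/O rate observed at the same hour on previous weeks.
history = [100, 102, 98, 101, 99]
print(is_exception(120, history))
print(is_exception(101, history))
```

In practice the history would be kept per metric and per hour of the week, so the limits follow the normal daily and weekly workload pattern.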

Capturing Workload Pathology By SEDS - my CMG'05 paper

The paper can be found here: https://www.researchgate.net/publication/221447101_Capturing_Workload_Pathology_by_Statistical_Exception_Detection_System
Here is the summary:
Problem definition: Server workload pathologies (defects) such as run-away processes and memory leaks capture spare server resources and cause the following issues:
- being a parasitic type of workload, they compete for resources with the real workload and cause performance degradation;
- they mimic a capacity issue, but they are not a real capacity problem; they just spoil the historical sample and cause wrong capacity trends, as seen in the figure below:

To fight this problem I developed a way to capture those defects, report on them, and then remove them from the historical sample to see the real capacity trends. That was implemented as a part of the SEDS application. Detailed explanations are in my CMG'05 paper "Capturing Workload Pathology by Statistical Exception Detection System".
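The effect of removing flagged defects from the historical sample can be shown with a small sketch. It assumes the pathological days are already identified, and it uses a plain least-squares slope as the trend; the data and names are made up for illustration and this is not the SEDS implementation.

```python
def slope(points):
    """Least-squares slope of (day, value) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def clean_history(points, flagged_days):
    """Drop the days flagged as workload pathology (e.g. a run-away process)."""
    return [(day, value) for day, value in points if day not in flagged_days]


# CPU utilization (%) per day; days 3 and 4 had a run-away process.
raw = [(0, 10), (1, 10), (2, 10), (3, 90), (4, 95), (5, 10)]
cleaned = clean_history(raw, flagged_days={3, 4})
print(slope(raw) > 0)   # the defect produces a false upward trend
print(slope(cleaned))   # the real workload is flat
```

With the flagged days removed, the trend reflects only the real workload, so it does not trigger a false capacity alarm.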

Another good result of implementing this solution was a dramatic reduction in the number of incidents related to run-away and memory leak defects. The chart below shows a more than twofold reduction over 2 years:


Other work in this area was done by Ron Kaminski. See his CMG paper here:

Automating Process and Workload Pathology Detection


Presentation slides:

Automating Process and Workload Pathology