System Management by Exception: Quantifying Imbalance in Computer Systems: CMG'11 Trip Report, Part 2

Monday, January 23, 2012

Quantifying Imbalance in Computer Systems: CMG'11 Trip Report, Part 2

UPDATE 2018:
The technique was successfully tested in the SonR (SEDS based Anomaly detection system) as described in the following post:

"My talk, "Catching Anomaly and Normality in Cloud by Neural Net and Entropy Calculation", has been selected for #CMGimpact 2019"

_______________________________________________________ original post:
As I promised in CMG'11 Trip Report, Part 1, here are my comments and some follow-up analysis of the paper "Quantifying Imbalance in Computer Systems", which was written and presented at CMG'11 by Charles Loboz from Windows Azure.

The idea is to calculate the imbalance of a system by using entropy, a property which is well known in physics, economics and information theory.

In one of my past postings I raised the following question:
"Can information theory (entropy analysis) be applied to performance exception detection?"

It looks like the idea from the CMG paper mentioned above, applying entropy calculation to system performance data, could lead to the answer to that question!

Here is the quote from the paper:

"...Theil index is based on entropy - it describes the excess entropy in a system. For a data set xi, i=1..n the Theil index is given by:

$$IC = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i}{x_{avg}} \log\frac{x_i}{x_{avg}}$$

where n is the number of elements in the data set and xavg is the average value of all elements in the data set. To underline the application of the Theil index to measure imbalance in computer systems we call it henceforth the Imbalance Coefficient (IC).

Examining closer the IC formula above we can derive several properties:

(1) the ratio xi/xavg describes how much element i is above or below the average for the whole set. Thus IC involves only the ratio of each element against the average, not the absolute values of the elements.

(2) IC is dimensionless - thus allows to compare imbalance between sets of substantially different quantities, for example when one set contains disk utilizations and another disk response times.

(3) The minimum value of IC is zero - when all elements of the data set are identical. The maximum value of the Imbalance Coefficient is log(n) - when all elements but one are equal [to zero]; the maximum IC depends thus on the set size.

(4) We can view Imbalance Coefficient as a description of how concentrated is the use of some resource - large values mean fewer users use most of the resource, small values mean more equal sharing.

We also define, for convenience, Normalized Imbalance Coefficient (nIC) as

$$nIC = \frac{IC}{\log(n)}$$

to account for both imbalance within the set and the maximum entropy in that set. The nIC value ranges from 0 to 1 thus enabling comparison of imbalance between data sets with differing number of elements..."
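To make the quoted definitions concrete, here is a minimal Python sketch. It assumes the standard Theil-index form, IC = (1/n) * sum of (xi/xavg) * log(xi/xavg), and nIC = IC / log(n), with the natural logarithm. The function names and the sample values are my own illustration, not code or data from the paper. Two conventions are also assumptions: a zero element contributes nothing to the sum (the limit of x*log(x) as x goes to 0), and an all-zero set is treated as perfectly balanced.

```python
import math


def imbalance_coefficient(values):
    """Theil index (IC) of a list of non-negative values, natural log."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    avg = sum(values) / n
    if avg == 0:
        return 0.0  # assumption: an all-zero set counts as balanced
    total = 0.0
    for v in values:
        if v > 0:  # a zero element adds nothing: x*log(x) -> 0 as x -> 0
            ratio = v / avg
            total += ratio * math.log(ratio)
    return total / n


def normalized_imbalance_coefficient(values):
    """nIC = IC / log(n), so results fall between 0 and 1."""
    n = len(values)
    if n < 2:
        return 0.0  # assumption: a single element cannot be imbalanced
    return imbalance_coefficient(values) / math.log(n)


# Hypothetical disk utilizations (percent), for illustration only.
balanced = [50.0, 50.0, 50.0, 50.0]
skewed = [90.0, 5.0, 3.0, 2.0]
concentrated = [100.0, 0.0, 0.0, 0.0]

for name, data in [("balanced", balanced), ("skewed", skewed),
                   ("concentrated", concentrated)]:
    print(name, round(normalized_imbalance_coefficient(data), 4))
```

The balanced set gives 0 and the fully concentrated set gives 1, which matches properties (3) and the 0 to 1 range of nIC quoted above.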

The author applied that to the analysis of utilization across multiple disks, but he mentioned that the approach could be used for measuring the imbalance of other computer subsystems. So I decided to try to calculate the imbalance of CPU utilization during a day (24 hours) and a week (168 hours), because the imbalance of capacity usage during a day or week is a pretty common concern. Also, using my way of grouping baseline vs. actual data, I applied it twice to compare an "average" weekly/daily utilization with the last week/days of actual utilization.

The raw data is the same as for the last Control Charting exercise I published here in the series of posts (see EV-Control Chart as an example), where the actual data (in black) vs. historical averages (in green) are shown below:

Here is the result of calculating the actual vs. averaged nIC imbalance difference for all 168 hours and for each weekday (7 days by 24 hours):

You can see that on the day when the anomaly of CPU usage started - Wednesday - the imbalance was significantly different, and all in all the weekly imbalance was significantly different too! So indeed that metric can be used to capture some performance metric anomalies (pattern changes).
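The comparison described above can be sketched in Python roughly as follows. This is not the spreadsheet I used; it is a small illustration under assumptions: the input is 168 hourly CPU utilization values in day order, nIC is the Theil index divided by log(n), and the data below is synthetic (a flat high-utilization day is injected as a made-up anomaly). All function names are hypothetical.

```python
import math


def nic(values):
    """Normalized Imbalance Coefficient: Theil index divided by log(n)."""
    n = len(values)
    avg = sum(values) / n
    if n < 2 or avg == 0:
        return 0.0  # assumption: degenerate sets count as balanced
    ic = sum((v / avg) * math.log(v / avg) for v in values if v > 0) / n
    return ic / math.log(n)


def daily_and_weekly_nic(hourly):
    """Return (list of 7 daily nIC values, weekly nIC) for 168 hourly values."""
    if len(hourly) != 168:
        raise ValueError("expected 168 hourly values (7 days x 24 hours)")
    days = [hourly[d * 24:(d + 1) * 24] for d in range(7)]
    return [nic(day) for day in days], nic(hourly)


def nic_differences(actual, baseline):
    """Actual minus baseline nIC, per day and for the whole week."""
    a_days, a_week = daily_and_weekly_nic(actual)
    b_days, b_week = daily_and_weekly_nic(baseline)
    return [a - b for a, b in zip(a_days, b_days)], a_week - b_week


# Synthetic baseline: 50% CPU in business hours (9:00-17:59), 20% otherwise.
baseline = [50.0 if 9 <= h % 24 < 18 else 20.0 for h in range(168)]

# Synthetic actual week: same pattern, but the third day (index 2) is
# pinned at 95% around the clock, as a made-up anomaly.
actual = list(baseline)
actual[48:72] = [95.0] * 24

day_diffs, week_diff = nic_differences(actual, baseline)
for day_index, diff in enumerate(day_diffs):
    print(f"day {day_index}: nIC difference = {diff:+.4f}")
print(f"week: nIC difference = {week_diff:+.4f}")
```

In this synthetic case only the anomalous day shows a non-zero daily difference (the flat day is perfectly "balanced", so its nIC drops to zero), while the weekly nIC shifts as well because that day pulls usage away from the normal pattern.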

FYI: Here is the spreadsheet snapshot with the actual calculation I used:

How much better this method of checking imbalance changes is, compared with more traditional ways of doing that (e.g. based on deviations), is hard to say. My personal preference is still the EV-concept. Anyway, someone needs to try it against more data...

BTW I have found another paper which relates to that topic:

Quantifying Load Imbalance on Virtualized Enterprise Servers by
Emmanuel Arzuaga and David R. Kaeli

That paper has a clear statement about imbalance: "A typical imbalance metric based on the resource utilization of physical servers is the standard deviation of the CPU utilization".
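One practical difference between the two metrics can be shown with a few lines of Python. This is only a sketch with made-up numbers: it computes the population standard deviation and nIC (assumed here to be the Theil index divided by log(n)) for the same utilization pattern expressed in two different scales. The standard deviation follows the scale of the data, while nIC depends only on the ratios to the average.

```python
import math
import statistics


def nic(values):
    """Normalized Imbalance Coefficient: Theil index divided by log(n)."""
    n = len(values)
    avg = sum(values) / n
    if n < 2 or avg == 0:
        return 0.0  # assumption: degenerate sets count as balanced
    ic = sum((v / avg) * math.log(v / avg) for v in values if v > 0) / n
    return ic / math.log(n)


# Hypothetical CPU utilization of four servers, in percent.
utilization = [80.0, 40.0, 30.0, 10.0]
# The same pattern in a different unit (for example, tenths of that scale).
scaled = [v / 10 for v in utilization]

std_a = statistics.pstdev(utilization)
std_b = statistics.pstdev(scaled)

print("std dev:", round(std_a, 3), "vs", round(std_b, 3))
print("nIC:    ", round(nic(utilization), 4), "vs", round(nic(scaled), 4))
```

The standard deviation changes by the same factor as the data, so comparing it across different metrics needs extra normalization; nIC stays the same for both sets.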

Still, entropy is an interesting system property that should give us an additional good source of information for pattern recognition, I believe. For instance, the balance of capacity usage of large frames with a lot of LPARs (AIX p7s or VMware hosts) could be monitored by using that nIC metric, possibly to drive an automatic way of rebalancing capacity usage by using partition mobility or vMotion technologies.
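A very rough sketch of what such monitoring could look like in Python is below. Everything in it is an assumption for illustration: the frame names, the utilization numbers, the 0.1 threshold and the function `check_pool_balance` are hypothetical, and the code only suggests a donor and a receiver frame; it does not call any partition mobility or vMotion interface.

```python
import math


def nic(values):
    """Normalized Imbalance Coefficient: Theil index divided by log(n)."""
    n = len(values)
    avg = sum(values) / n
    if n < 2 or avg == 0:
        return 0.0  # assumption: degenerate sets count as balanced
    ic = sum((v / avg) * math.log(v / avg) for v in values if v > 0) / n
    return ic / math.log(n)


def check_pool_balance(frame_utilization, threshold=0.1):
    """Flag a pool of frames/hosts whose nIC exceeds a threshold.

    frame_utilization: dict mapping frame name to its CPU utilization.
    threshold: hypothetical cut-off; it would need tuning on real data.
    Returns None if the pool looks balanced, otherwise a dict with the
    nIC value plus the busiest (donor) and least busy (receiver) frames.
    """
    value = nic(list(frame_utilization.values()))
    if value <= threshold:
        return None
    donor = max(frame_utilization, key=frame_utilization.get)
    receiver = min(frame_utilization, key=frame_utilization.get)
    return {"nic": value, "donor": donor, "receiver": receiver}


# Made-up pools of four frames each (CPU utilization in percent).
skewed_pool = {"frameA": 85.0, "frameB": 30.0, "frameC": 25.0, "frameD": 20.0}
even_pool = {"frameA": 40.0, "frameB": 42.0, "frameC": 38.0, "frameD": 41.0}

print(check_pool_balance(skewed_pool))
print(check_pool_balance(even_pool))
```

The skewed pool is flagged with frameA as the candidate to move load from and frameD as the candidate to move load to, while the even pool is left alone. A real implementation would also need to look at memory and other resources before moving anything.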

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations for conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other tech. presentations at IBM z/Series Expo, SPEC.org, Southern and Central Europe CMG and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). Author of the "Performance Anomaly Detection" class (at CMG.com). Worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. Runs UT channel iTrubin

1 comment:

AlexGilgur, February 11, 2019
Looking forward to seeing your presentation!