Monday, January 23, 2012

Quantifying Imbalance in Computer Systems: CMG'11 Trip Report, Part 2

As promised in CMG'11 Trip Report, Part 1, here are my comments and some follow-up analysis of the paper Quantifying Imbalance in Computer Systems, which was written and presented at CMG'11 by Charles Loboz of Windows Azure.

The idea is to calculate the imbalance of a system using entropy, a property that is well known in physics, economics, and information theory.

In an earlier post I raised the following question:
 "Can information theory (entropy analysis) be applied to performance exception detection?"

It looks like the idea from that CMG paper, applying an entropy calculation to system performance data, could lead to an answer to my question!

 Here is the quote from the paper: 



"...Theil index is based on entropy - it describes the excess entropy in a system. For a data set xi,
i=1..n the Theil index is given by:

where n is the number of elements in the data set and xavg is the average value of all elements in the data set. To underline the application of the Theil index to measure  imbalance in computer systems we call it henceforth the Imbalance Coefficient (IC). 

Examining closer the IC formula above we can derive several properties:
  • (1) the ratio xi/xavg describes how much element i is above or below the average for the whole set. Thus IC involves only the ratio of each element against the average, not the absolute values of the elements.
  • (2) IC is dimensionless - thus allows to compare imbalance between sets of substantially different quantities, for example when one set contains disk utilizations and another disk response times.
  • (3) The minimum value of IC is zero - when all elements of the data set are identical. The maximum value of the Imbalance Coefficient is log(n) - when all elements but one are equal; the maximum IC depends thus on the set size.
  • (4) We can view Imbalance Coefficient as a description of how concentrated is the use of some resource - large values mean fewer users use most of the resource, small values mean more equal sharing.

We also define, for convenience, Normalized Imbalance Coefficient (nIC) as

to account for both imbalance within the set and the maximum entropy in that set. The nIC value ranges from 0 to 1 thus enabling comparison of imbalance between data sets with differing number of elements..."
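For reference, assuming the paper uses the standard Theil T index (which matches the properties listed in the quote: it depends only on the ratios xi/xavg, its minimum is 0 and its maximum is log(n)), the two formulas can be written as:

```latex
IC = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i}{x_{avg}} \log\frac{x_i}{x_{avg}},
\qquad
nIC = \frac{IC}{\log n}
```

Terms with xi = 0 are taken as 0 by the usual convention. With this form the maximum log(n) is reached when a single element carries the whole total and all the others are zero: the one non-zero ratio is n, so IC = (1/n) · n · log(n) = log(n).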

The author applied this to utilization analysis across multiple disks, but he mentioned that the approach could be used to measure imbalance in other computer subsystems. So I decided to calculate the imbalance of CPU utilization over a day (24 hours) and a week (168 hours), because imbalance of capacity usage during a day or a week is a pretty common concern. Also, using my way of grouping baseline vs. actual data, I applied the calculation twice to compare an "average" weekly/daily utilization with the last week/days of actual utilization.
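As a rough illustration of that calculation, here is a minimal Python sketch. It assumes IC is the standard Theil T index and nIC is IC divided by log(n); the function names and the sample hourly profile are hypothetical and are not taken from the paper or from my spreadsheet.

```python
import math


def imbalance_coefficient(values):
    """Theil T index (IC) of a sequence of non-negative numbers."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    avg = sum(values) / n
    if avg == 0:
        return 0.0  # all zeros: treated here as perfectly balanced
    total = 0.0
    for v in values:
        if v > 0:  # convention: 0 * log(0) contributes 0
            ratio = v / avg
            total += ratio * math.log(ratio)
    # Clamp tiny negative results caused by floating-point rounding.
    return max(0.0, total / n)


def normalized_imbalance_coefficient(values):
    """nIC = IC / log(n), so that the result lies between 0 and 1."""
    n = len(values)
    if n < 2:
        return 0.0  # a single element cannot be imbalanced
    return imbalance_coefficient(values) / math.log(n)


# Hypothetical hourly CPU utilization (percent) for one day: quiet nights, busy daytime.
hourly_cpu = [20.0] * 8 + [60.0] * 10 + [20.0] * 6
daily_nic = normalized_imbalance_coefficient(hourly_cpu)
print(f"daily nIC = {daily_nic:.4f}")
```

The same function works for a 168-element weekly series; only the normalizing log(n) changes.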

The raw data is the same as for the last control charting exercise I published here in the series of posts (see EV-Control Chart as an example), where the actual data (in black) vs. historical averages (in green) are shown below:

Here is the result of calculating the difference between the actual and the averaged nIC for all 168 hours and for each weekday (7 days of 24 hours each):

You can see that on the day when the anomaly in CPU usage started, Wednesday, the imbalance was significantly different, and overall the weekly imbalance was significantly different too! So indeed this metric can be used to capture some performance metric anomalies (pattern changes).
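The comparison described above can be sketched roughly as follows. This is an assumption-laden illustration, not the spreadsheet calculation itself: nIC is taken as the Theil T index divided by log(n), and the helper names and the data (a flat repeating baseline and a made-up anomaly on day index 3) are hypothetical.

```python
import math

HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def nic(values):
    """Normalized Imbalance Coefficient: Theil T index divided by log(n)."""
    n = len(values)
    if n < 2:
        return 0.0
    avg = sum(values) / n
    if avg == 0:
        return 0.0
    ic = sum((v / avg) * math.log(v / avg) for v in values if v > 0) / n
    return max(0.0, ic) / math.log(n)


def nic_differences(baseline, actual):
    """Return (weekly_diff, daily_diffs): actual nIC minus baseline nIC."""
    expected = HOURS_PER_DAY * DAYS_PER_WEEK
    if len(baseline) != expected or len(actual) != expected:
        raise ValueError("both series must hold 168 hourly values")
    weekly = nic(actual) - nic(baseline)
    daily = []
    for day in range(DAYS_PER_WEEK):
        start, end = day * HOURS_PER_DAY, (day + 1) * HOURS_PER_DAY
        daily.append(nic(actual[start:end]) - nic(baseline[start:end]))
    return weekly, daily


# Hypothetical baseline: the same daily profile repeated for the whole week.
day_profile = [20.0] * 8 + [60.0] * 10 + [20.0] * 6
baseline = day_profile * DAYS_PER_WEEK

# Hypothetical anomaly: CPU pegged near 95% for all of day index 3.
actual = list(baseline)
for hour in range(3 * HOURS_PER_DAY, 4 * HOURS_PER_DAY):
    actual[hour] = 95.0

weekly_diff, daily_diffs = nic_differences(baseline, actual)
print(f"weekly nIC difference: {weekly_diff:+.4f}")
for day, diff in enumerate(daily_diffs):
    print(f"day {day}: {diff:+.4f}")
```

In this made-up case the daily nIC of the anomalous day drops (the day becomes flat) while the weekly nIC rises (one day now stands out from the rest), so both the sign and the size of the difference are worth looking at.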

FYI: Here is the spreadsheet snapshot with the actual calculation I used:

How much better this method of checking for imbalance changes is than more traditional ways (e.g. based on deviations) is hard to say. My personal preference is still the EV concept. Anyway, someone needs to try it against more data...

BTW, I have found another paper related to this topic:

Quantifying Load Imbalance on Virtualized Enterprise Servers by
Emmanuel Arzuaga and David R. Kaeli

That paper makes a clear statement about imbalance: "A typical imbalance metric based on the resource utilization of physical servers is the standard deviation of the CPU utilization".
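To see how the two kinds of metric behave, here is a small Python sketch. It assumes the population standard deviation for the "traditional" metric and the Theil-based nIC (IC divided by log(n)) for the entropy one; the server utilization numbers are hypothetical.

```python
import math
import statistics


def nic(values):
    """Normalized Imbalance Coefficient: Theil T index divided by log(n)."""
    n = len(values)
    if n < 2:
        return 0.0
    avg = sum(values) / n
    if avg == 0:
        return 0.0
    ic = sum((v / avg) * math.log(v / avg) for v in values if v > 0) / n
    return max(0.0, ic) / math.log(n)


# Hypothetical CPU utilization (percent) of four physical servers.
servers = [20.0, 40.0, 60.0, 80.0]
# Same relative distribution at half the overall load.
scaled = [v / 2 for v in servers]

std_original = statistics.pstdev(servers)
std_scaled = statistics.pstdev(scaled)
nic_original = nic(servers)
nic_scaled = nic(scaled)

print(f"std dev: {std_original:.2f} vs {std_scaled:.2f}")
print(f"nIC:     {nic_original:.4f} vs {nic_scaled:.4f}")
```

The standard deviation halves when the load halves, while nIC stays the same because it depends only on the ratios to the average. Which behavior is preferable depends on whether absolute load or the shape of the distribution is the concern.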

Still, entropy is an interesting system property that should give us a good additional source of information for pattern recognition, I believe. For instance, the balance of capacity usage on large frames with a lot of LPARs (AIX p7s or VMware hosts) could be monitored with the nIC metric to drive some, possibly automatic, way of rebalancing capacity usage using partition mobility or vMotion technologies.