Thursday, May 30, 2013

CMG papers: Knee detection vs. EV based trends detection (SETDS)

The CMG'12 paper "A Note on Knee Detection" (J. Ferrandiz, A. Gilgur) presented a method of "system phase change" detection by fitting a piecewise linear model to data with any supply-demand relationship, e.g. CPU vs. transactions or load vs. traffic.


The weakness of the approach is the underlying assumption in the methodology that most of the data points are in the low-load region. But all in all, it is a relatively simple and effective way to capture the fact that the data constantly exceeds some threshold (a confidence level, e.g. 95%), beyond just detecting the "knee".
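
Below is a minimal sketch (my own illustration, not the authors' actual algorithm) of the idea: fit a two-segment piecewise linear model to supply-demand data, e.g. x = transactions/sec and y = CPU utilization, and take the breakpoint that minimizes the total fitting error as the "knee":

```python
# Hypothetical knee-detection sketch: two independent least-squares line fits,
# with the breakpoint chosen to minimize the combined squared error.
import numpy as np

def find_knee(x, y):
    """Return the x-value of the breakpoint minimizing total SSE of the
    left and right line segments."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best_sse, best_knee = np.inf, None
    for i in range(2, len(x) - 2):                 # need >= 2 points per segment
        sse = 0.0
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            coef = np.polyfit(xs, ys, 1)            # slope, intercept
            sse += np.sum((ys - np.polyval(coef, xs)) ** 2)
        if sse < best_sse:
            best_sse, best_knee = sse, x[i]
    return best_knee

# Illustrative data: utilization grows slowly, then steeply after ~60 TPS
tps = np.linspace(1, 100, 200)
cpu = np.where(tps < 60, 0.3 * tps, 18 + 1.5 * (tps - 60)) + np.random.normal(0, 1, 200)
print("estimated knee at ~", find_knee(tps, cpu), "transactions/sec")
```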
I see some similarity with my method of detecting system phase changes (trend detection as implemented by SETDS).

Based on my CMG'08 paper "Exception Based Modeling and Forecasting", I use the EV (Exception Value) meta-metric to detect pattern changes in the data. The phases in the data should be separated by the roots of the equation EV = 0, because for EV > 0 the data mostly exceeds the upper control limit, for EV < 0 it is mostly below the lower control limit, and the data is stable where EV = 0.


But I have used my approach only for time-series data (EV = f(t)), so the detected phases are separated by points in time. Is it possible to apply my EV-based approach to non-time-series data? I am not sure. But knowing that EV is just the difference between actual data and control limits (e.g. percentile based), the above-mentioned knee detection algorithm could be seen as some kind of EV-based approach applied to non-time-series data…
And I believe the EV-based approach is free from the assumption that the data should be somewhat imbalanced, and it can also detect multiple "knees" in both directions (up and down). Only one caveat still exists (the same as with the knee detection algorithm): too many phases get detected, but that could be tuned by changing the limits, e.g. from the 95th percentile to the 99th, and by grouping the data (like aggregating minutes to hours for time-series data).
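
Here is a minimal sketch of that idea (a simplification of mine, not the full SETDS implementation): EV per interval is the signed excess of the actual data above the upper or below the lower percentile-based control limit, and phase changes are taken where the aggregated EV crosses zero:

```python
# Hypothetical EV-based phase detection for a time series.
import numpy as np

def ev_series(actual, history, lo_pct=5, hi_pct=95):
    """EV per point: positive excess above the upper control limit,
    negative excess below the lower control limit, zero in between."""
    lcl = np.percentile(history, lo_pct)
    ucl = np.percentile(history, hi_pct)
    return np.where(actual > ucl, actual - ucl,
           np.where(actual < lcl, actual - lcl, 0.0))

def phase_change_points(ev, window=24):
    """Aggregate EV over windows (e.g. hours into days) and report indexes
    where the aggregated EV changes sign, i.e. roots of EV = 0."""
    agg = np.add.reduceat(ev, np.arange(0, len(ev), window))
    return np.where(np.diff(np.sign(agg)) != 0)[0] * window

history = np.random.normal(50, 5, 1000)                   # reference (learning) data
actual = np.concatenate([np.random.normal(50, 5, 300),    # stable phase
                         np.random.normal(70, 5, 300)])   # upward phase change
print("phase change near index:", phase_change_points(ev_series(actual, history)))
```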


Tuesday, May 21, 2013

"Is your Capacity Available?" - A topic for CMG'13 Conference Paper

2016 UPDATE: Finally the paper was written, presented and published at www.CMG.org
Here are the presentation slides:

_____
Capacity Management and Availability Management are two interconnected services. That connection is getting more important in the current era of virtualization, clustering, and especially cloud computing. Obviously, IT customers not only want sufficient capacity for their applications, but even more importantly want that capacity to be highly available.

I have had to deal with that combination recently and firmly believe it could be a great topic for the upcoming CMG'13 conference. I plan to attend the conference, but unfortunately, although I have a topic, the title "Is your Capacity Available?" and some related materials already published in this blog, I do not have the ability to write the paper for this year's CMG conference.

Maybe somebody could pick that idea up and share their experience in this mixed area of Capacity and Availability? I would be extremely happy!

Here is the list of my posts related to this subject:




BTW, the last post has a suggestion to estimate each node's (component's) availability (which is needed for a cluster availability calculation) by just looking at the Incident records history and using the MTTR from there. Why not? If you have good Incident Management, that could be a very cheap solution! I would suggest calculating different degrees of estimated availability, such as an "Absolute" availability estimation based on up-time completely free from any incidents, or an "N-degree availability" number where only incidents with severity <= N are taken into account, or filtering out only the incidents related to a particular component's capacity - "Capacity Availability". Sure, if the Incident Management service is not mature enough, it will provide incorrect input for the estimation, so you may consider the other mechanisms I mentioned in that post... But on the other hand, that would encourage you (maybe via CSI) to improve your Incident Management service!
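
A minimal sketch of the idea (the incident record format here is hypothetical, just for illustration): downtime for a node is approximated by the restore time of its incidents, optionally filtered by severity, and availability is the fraction of the period left after that downtime:

```python
# Hypothetical "N-degree availability" estimation from Incident Management history.
from datetime import datetime

incidents = [  # (opened, restored, severity) -- illustrative data only
    (datetime(2013, 1, 10, 2, 0),  datetime(2013, 1, 10, 4, 30), 1),
    (datetime(2013, 2, 3, 14, 0),  datetime(2013, 2, 3, 14, 45), 3),
    (datetime(2013, 3, 20, 9, 0),  datetime(2013, 3, 20, 10, 0), 2),
]

def availability(incidents, period_start, period_end, max_severity=None):
    """1 - downtime/period; only incidents with severity <= max_severity count."""
    period = (period_end - period_start).total_seconds()
    downtime = sum((restored - opened).total_seconds()
                   for opened, restored, sev in incidents
                   if max_severity is None or sev <= max_severity)
    return 1.0 - downtime / period

start, end = datetime(2013, 1, 1), datetime(2013, 4, 1)
print("Absolute availability: %.5f" % availability(incidents, start, end))
print("1-degree availability: %.5f" % availability(incidents, start, end, max_severity=1))
```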





Tuesday, May 7, 2013

CMG'12: "Time‐Series: Forecasting + Regression"

I continue sharing my impressions of the 2012 international www.CMG.org conference.


(see previous: HERE and HERE).

This post is about Alex Gilgur's paper. I first met Alex at the CMG'06 conference and posted a note about his 2006 paper HERE. We met at several other CMG conferences and talked a lot, mixing high matters (philosophy, physics and math) with our Capacity Management needs.... One of those discussions inspired him to write the following paper:
Time-Series: Forecasting + Regression: “And” or “Or”?

In the agenda announcement he mentioned that fact and I really appreciate it:
"At CMG’11, I had a fascinating discussion with Dr. I.Trubin. We talked about Uncertainty, Second Law of Thermodynamics, and other high matters in relation to IT. That discussion prompted this paper..."

Reading the paper, I was impressed by how he combined trend analysis with the business driver data correlation technique. I do that quite often; one example was published in my CMG'12 paper as well, and the summary slide can be seen here:


But in his work the technique is formulated in a very elegant mathematical way, and he also used Little's Law to fight the most unpleasant statisticians' rule: "Correlation Does Not Imply Causation".
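
For readers not familiar with it, here is a tiny illustration (mine, not taken from the paper) of Little's Law, N = X * R: the average number of requests in a system equals its throughput times the average response time, which gives a queueing-theory reason, not just a correlation, for the link between load and traffic:

```python
# Little's Law sanity check with illustrative numbers (hypothetical values).
throughput = 120.0                         # X, transactions per second
response_time = 0.25                       # R, seconds per transaction
concurrency = throughput * response_time   # N = X * R
print("average concurrent transactions:", concurrency)            # 30.0
print("implied response time:", concurrency / throughput, "s")    # 0.25
```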


In the next post I will put some comments about another of his CMG'12 papers, called:
"A Note on Knee Detection"


Friday, May 3, 2013

SCMG report: Business and Application Based Capacity Management

Another presentation at our April 2013 SCMG meeting was very interesting. Ann Dowling presented the results of a large project she led to improve the capacity management of a big IT organization. Here is the title and the link:

Business and Application Based Capacity Management

Actually, I was involved in that project too, at the beginning of the effort to build a custom Cognos-based reporting process. We tried to use BIRT reporting and then successfully switched to Cognos.

Interestingly, the presentation mentioned "Dynamic thresholds". I thought it was something like what control charts do (and, by the way, during my work on that project I found "out-of-the-box" BIRT-based control chart reports in TCR and suggested using them or building better ones using COGNOS), but it looks like by "Dynamic thresholds" they mean "The ability to change (manually?) to meet customer department adjustments. The threshold metrics can be at the prompt page."…

I prefer using that term for thresholds that a reporting system automatically changes based on a behavioral learning process.
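
Here is a minimal sketch of what I mean (my interpretation, not code from those reports): control limits are learned per hour-of-week from historical data, so the thresholds adjust to the metric's normal behavior instead of being set manually:

```python
# Hypothetical behavior-learned ("dynamic") thresholds, percentile based.
import numpy as np

def learn_limits(history, hours, lo_pct=5, hi_pct=95):
    """history: metric values; hours: matching hour-of-week labels (0..167).
    Returns {hour: (lower limit, upper limit)}."""
    limits = {}
    for h in np.unique(hours):
        vals = history[hours == h]
        limits[h] = (np.percentile(vals, lo_pct), np.percentile(vals, hi_pct))
    return limits

# Illustrative data: a weekly CPU pattern observed over 8 weeks
rng = np.random.default_rng(0)
hours = np.tile(np.arange(168), 8)
history = 40 + 20 * np.sin(2 * np.pi * hours / 168) + rng.normal(0, 5, hours.size)

limits = learn_limits(history, hours)
hour_now, cpu_now = 10, 78.0
lcl, ucl = limits[hour_now]
if not (lcl <= cpu_now <= ucl):
    print("exception: %.1f is outside the dynamic limits (%.1f, %.1f)" % (cpu_now, lcl, ucl))
```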

Much later I played with BIRT and COGNOS to build control charts. See the examples below:
BIRT:
COGNOS:

Wednesday, May 1, 2013

SCMG report: Jobs Scheduling in Cloud and Hadoop

2015 UPDATE: Now, having access to HADOOP, I am thinking of how to use MapReduce to speed up SETDSing against big performance data. The following thesis could be very helpful for that:
Distributed Anomaly Detection and Prevention for Virtual Platforms by Ali Imran Jehangiri

Last week we ran our Richmond SCMG meeting following the agenda published HERE (links to the presentations are there too, including mine). The 1st presentation was titled "Some Workload Scheduling Alternatives for High Performance Computing Systems" and was presented by Jim McGalliard, a frequent CMG presenter and our friend. He mentioned an old topic he had already presented in the past – supercomputer batch job optimization by categorizing and scheduling the jobs. Then, after a brief description of MapReduce


( “method for simple implementation of parallelism in a program..”)

he explained how HADOOP
(“Designed for very large (thousands of processors) systems using commodity processors, including grid systems, Hadoop is a specific open source implementation of the MapReduce framework written in Java and licensed by Apache” )

does job scheduling using MapReduce and some other means.
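
To make the MapReduce idea concrete, here is a tiny single-process sketch (not actual Hadoop code, and the data format is hypothetical): map emits key/value pairs, the framework groups them by key ("shuffle"), and reduce aggregates each group; here the job averages CPU per (host, hour) from raw samples:

```python
# Minimal MapReduce-style aggregation of performance samples.
from collections import defaultdict

samples = [  # (host, hour, cpu%) -- illustrative data only
    ("web01", 9, 35.0), ("web01", 9, 55.0), ("web02", 9, 80.0),
    ("web01", 10, 40.0), ("web02", 10, 90.0), ("web02", 10, 70.0),
]

def map_fn(record):
    host, hour, cpu = record
    yield (host, hour), cpu                     # emit key -> value

def reduce_fn(key, values):
    return key, sum(values) / len(values)       # average per key

# "Shuffle": group mapped values by key, as the framework would do in parallel
groups = defaultdict(list)
for record in samples:
    for key, value in map_fn(record):
        groups[key].append(value)

for key in sorted(groups):
    print(reduce_fn(key, groups[key]))
```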

That presentation led me to another task to consider – job scheduling in the cloud. Ironically, just before the meeting I had read an interesting article about it (BTW, it was recommended reading from my current manager, as we are also moving to the cloud… What about you?). Here is the link to that article from one of the authors' webpages (Asit K Mishra) and the title:

”Towards Characterizing Cloud Backend Workloads: Insights from Google Compute Clusters”

I firmly believed that workload characterization was going away due to virtualization - each workload/app can have its own virtual server now. Right? But based on the article, it looks like job categorization could be useful for optimizing their schedule to run in the cloud and maybe in HADOOP…