System Management by Exception: How to Calculate Availability of Clustered Infrastructure for Multi-Tier Application

Friday, August 17, 2012

How to Calculate Availability of Clustered Infrastructure for Multi-Tier Application

That is the task I am working on right now. I have made some progress: the approach I found is to build an availability graph that treats the clustered infrastructure as a chain of nodes connected in parallel and in series, as described here with formulas. Below is a simple example:

And the availability calculation formula will be:

A = A1 * (1 - (1 - A2*A3)ⁿ) * A4
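To make the series/parallel composition concrete, here is a minimal Python sketch. It assumes that component failures are independent and that all n redundant branches are identical. The helper names `series` and `parallel` and the component availabilities are made up for illustration; they are not measured values.

```python
from math import prod


def series(*availabilities):
    """Series connection: every component must be up, so multiply."""
    return prod(availabilities)


def parallel(availability, n):
    """n identical redundant branches: down only if all n are down."""
    return 1.0 - (1.0 - availability) ** n


# Hypothetical component availabilities, for illustration only.
a1, a2, a3, a4 = 0.999, 0.99, 0.995, 0.999
n = 2  # redundancy level of the cluster

# A1 in series with n parallel copies of the (A2, A3) chain, then A4.
a = series(a1, parallel(series(a2, a3), n), a4)
print(f"A = {a:.6f}")
```

Building the graph out of two small helpers makes it easy to rearrange the nodes when the real topology differs from this simple example.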

You can play with different levels of redundancy "n" of the cluster here. Currently it is 2, but you could estimate it for n=3 or n=4. This approach opens the possibility of quantitatively justifying your architectural decisions (not just relying on "best practices" or "gut feelings").
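A quick way to play with "n" is to tabulate the result for several redundancy levels. The sketch below assumes independent failures and uses hypothetical availabilities; the function names `cluster_availability` and `nines` are mine, not from any library.

```python
import math

HOURS_PER_YEAR = 8760


def cluster_availability(a1, a2, a3, a4, n):
    """A1 in series with n parallel (A2*A3) chains, in series with A4."""
    return a1 * (1.0 - (1.0 - a2 * a3) ** n) * a4


def nines(availability):
    """Number of 9's, e.g. 0.999 gives roughly 3."""
    return -math.log10(1.0 - availability)


# Hypothetical component availabilities, for illustration only.
a1, a2, a3, a4 = 0.9999, 0.99, 0.99, 0.9999

for n in (1, 2, 3, 4):
    a = cluster_availability(a1, a2, a3, a4, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"n={n}  A={a:.6f}  nines={nines(a):.2f}  "
          f"downtime={downtime_hours:.1f} h/year")
```

Note that the result can never exceed A1*A4: once the cluster itself is reliable enough, the non-redundant nodes become the limit, so adding more cluster nodes brings diminishing returns.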

If you know the MTTR (together with the MTBF) of each individual component (SW and HW), you can estimate the availability of the whole infrastructure using this approach. But how do you get those individual MTTR values? From vendors? Good luck! Maybe from incident records? Or by setting up special monitoring for that (synthetic, robotic)?
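If incident records are available, one possible way to get per-component numbers is sketched below. It assumes each record is a (start, end) pair of outage timestamps inside a known observation window, and it uses the standard relation A = MTBF / (MTBF + MTTR). The function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def availability_from_incidents(incidents, window_start, window_end):
    """Estimate (MTBF, MTTR, availability) from outage intervals.

    incidents: list of (start, end) datetimes, assumed non-overlapping
    and fully inside the observation window.
    """
    if not incidents:
        raise ValueError("no incidents: MTBF/MTTR cannot be estimated")
    total = window_end - window_start
    downtime = sum((end - start for start, end in incidents), timedelta())
    count = len(incidents)
    mttr = downtime / count            # mean time to repair
    mtbf = (total - downtime) / count  # mean uptime between failures
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Hypothetical incident records for one component over a 30-day window.
incidents = [
    (datetime(2012, 6, 5, 10, 0), datetime(2012, 6, 5, 11, 0)),   # 1 h
    (datetime(2012, 6, 20, 2, 0), datetime(2012, 6, 20, 5, 0)),   # 3 h
]
mtbf, mttr, a = availability_from_incidents(
    incidents, datetime(2012, 6, 1), datetime(2012, 7, 1)
)
print(f"MTBF={mtbf}  MTTR={mttr}  A={a:.5f}")
```

With only a handful of incidents the estimate is rough, so a longer observation window gives more trustworthy inputs for the formula.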

Other useful resources with formulas that are relevant to this:

http://www.barringer1.com/ar.htm

http://www.angelfire.com/ca/summers/Business/MTBFAllocAnalysis1.html

http://www.pixelbeat.org/docs/reliability_calculator/

One more book: Breaking the Availability Barrier

and Availability Digest

_______________

BTW, this "Saga of 9's" continues in the next posts:
LinkedIn Discussion around Trubin's Availability Formula
The Right Number of Cluster Redundancy to Achieve the Availability Goal. Trubin's Law #4!
Cluster Availability 9's Equation

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences in the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked for 2 years as the Capacity team lead for IBM, for SunTrust Bank for 3 years, and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. Runs UT channel iTrubin
