System Management by Exception: Systems Availability Arena: Battle between "Gut-feeling" and Engineering. Round 2.

Wednesday, October 3, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering. Round 2.

This is a continuation of the previous post.

Gut-feeler 1 • Gut feelings aren't always bad - for instance, when working on an IT project meant to support really critical business processes, and hence with a lot of money or even lives at stake, your gut feeling might be that standard clustering just isn't good enough and you need something significantly better here.

Then it comes down to curiosity: is there something better around than the usual standard clustering? Is there some other IT infrastructure, one that is fault tolerant and self-healing, providing much more reliability right out of the box than you could ever achieve using the plain vanilla stuff and the most sophisticated clustering conceivable?

If you are that curious, chances are you will end up at NonStop.

If you are not curious and just do calculations, you might end up simply adding more nodes to your standard cluster, hoping to make it more reliable, which in the real world often turns out to be a false hope ...

But don't get me wrong, I'm not at all against calculations. A very important one is on capacity: will (n - 1) nodes still support my workload when a node goes down for whatever reason? That's often overlooked ...

Engineer 2 • Two comments, one on the Trubin law, and one on .. comments [above]. .... Adding a node of m 9s to a node of m 9s adds m 9s to a cluster. The overall availability of a cluster of n nodes, each with m 9s availability, is mn 9s.... For instance, a three-node cluster of nodes with 2 9s availability will be six nines...

Let f = the failure probability of a node [A = (1 - f)]. If A is a whole number of nines, then the failure probability of a node is f = 0.1^m, where m is the number of nines (for instance, for three nines, f = 0.1^3 = 0.001 and A = (1 - 0.1^3) = 0.999). For an n-node cluster, its availability is 1 - (1 - A)^n = 1 - [1 - (1 - 0.1^m)]^n = 1 - (0.1^m)^n = 1 - 0.1^(mn).

In general, if a node has a failure probability of f, then an n-node cluster has an availability of 1 - f^n. Two nodes with availabilities of 0.95 (f = 0.05) will have an availability of 1 - 0.05^2 = 0.9975.
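These relations are easy to check numerically. The following is only a minimal sketch, assuming independent node failures and a cluster that is down only when every node is down; the function names are made up for illustration and are not from the thread.

```python
import math


def cluster_availability(node_availability: float, n: int) -> float:
    """Availability of n independent nodes when the cluster
    fails only if all n nodes fail at the same time."""
    f = 1.0 - node_availability  # failure probability of one node
    return 1.0 - f ** n


def nines(availability: float) -> float:
    """Number of nines, i.e. -log10 of the unavailability."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    # Two nodes at 0.95 each (the example from the comment).
    print(cluster_availability(0.95, 2))
    # Three nodes with two nines each: number of nines of the cluster.
    print(nines(cluster_availability(0.99, 3)))
```

The second call illustrates the "mn 9s" rule: three nodes of two nines each give roughly six nines under these assumptions.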

Of course, this assumes that the cluster fails only if all nodes fail. Generally, a cluster can withstand the failure of some nodes but not all. In this case, the above relations can be modified to accommodate this situation.
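One common way to make that modification is a "k out of n" model: the cluster is considered up as long as at least k of its n nodes are up. The sketch below is an illustration under the assumption of independent, identical nodes, not something stated in the thread; `k_of_n_availability` is a hypothetical name.

```python
from math import comb


def k_of_n_availability(node_availability: float, n: int, k: int) -> float:
    """Probability that at least k of n independent nodes are up
    (binomial sum over the number of working nodes)."""
    a = node_availability
    return sum(
        comb(n, i) * a ** i * (1.0 - a) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # k = 1 reproduces the "fails only if all nodes fail" case.
    print(k_of_n_availability(0.95, 2, 1))
    # A three-node cluster that needs two nodes to carry the workload.
    print(k_of_n_availability(0.99, 3, 2))
```

With k = n - 1 this also expresses the capacity question raised earlier: the cluster only counts as available if the remaining nodes can still carry the workload.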

"Gut-feeler 1"'s suggestion that adding nodes does not result in this additional availability is quite correct. The above relations apply only to hardware failures (or whatever failures might be included in the nodal availability), and are accurate for those. However, once the hardware availability becomes high (say four 9s), other factors that are not node related come into play, such as software bugs, operator errors, and environmental faults (power, air conditioning). These limit the practical availability that can be achieved. In effect, after a certain point, increases in hardware availability become irrelevant to system availability.
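The ceiling described here can be illustrated with a simple series model. This is a sketch under stated assumptions: non-node faults (software, operators, environment) are lumped into a single availability figure that is independent of the nodes and takes the whole system down. The value 0.9999 is purely hypothetical and chosen only to show the effect.

```python
def system_availability(node_availability: float, n: int,
                        other_availability: float) -> float:
    """Cluster of n independent nodes in series with a lumped
    'everything else' component (software, operators, environment)."""
    cluster = 1.0 - (1.0 - node_availability) ** n
    return cluster * other_availability


if __name__ == "__main__":
    # Hypothetical figures: two-nines nodes, four-nines "everything else".
    for n in range(1, 6):
        print(n, system_availability(0.99, n, 0.9999))
```

Under these assumptions the result can never exceed the non-node availability, however many nodes are added, which is the point made in the comment.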

Thanks for starting a very interesting and meaningful thread, Igor.

See the next post for the next round.

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he got his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations for conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, Southern and Central Europe CMG and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). Author of the “Performance Anomaly Detection” class (at CMG.com). He worked 2 years as the Capacity team lead for IBM, worked for SunTrust Bank for 3 years and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. Runs UT channel iTrubin
