Tuesday, October 2, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering!

I posted my Cluster Availability 9's Equation post to the LinkedIn Continuous Availability forum and got 19 comments, which fall into roughly two camps: "Gut-feelers" and "Engineers". Below are the first two comments. (See the next posts for the other comments.)

Gut-feeler 1: "Nice formula, looks good in theory, but it won't hold true in the real world. The reason is the underlying assumption of an ideal cluster, which does not exist in the real world. When looking at a real-world implementation like the Oracle RAC cluster, you will find that a simple two-node cluster configuration will typically deliver somewhere between three and four nines of availability.

Now, will adding a third node to that cluster add another 9 to the availability figure? Will a ten-node cluster really provide 99.999999999999% availability? Will a cluster with a hundred nodes run continuously for thousands of years without any outage?

Certainly not, and talking to system administrators running typical cluster installations will quickly reveal that large clusters are quite complex and difficult to handle, hence more prone to failure than simple two-node clusters.

Even when looking at the HP NonStop architecture, which comes pretty close to the ideal cluster, the formula would not apply. A NonStop system (which internally is indeed a cluster, each NonStop CPU resembling a node) delivers roughly five nines of availability, but there is no significant availability difference between systems with, e.g., four and sixteen CPUs (cluster nodes).

So it is not so important how many nodes you have, but it is very important what kind of cluster you have!"
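To put numbers on the questions in this comment, here is a minimal sketch of what an idealized model predicts. It assumes the formula under discussion is the usual independent-parallel model, in which a cluster is down only when all of its nodes are down at the same time. The function names and the 99% per-node availability are illustrative assumptions, not figures from the thread.

```python
import math

def parallel_unavailability(node_unavailability: float, nodes: int) -> float:
    """Ideal cluster: down only if every node is down, failures independent."""
    return node_unavailability ** nodes

def nines(unavailability: float) -> float:
    """Number of nines, e.g. unavailability 0.001 (99.9% available) -> 3.0."""
    return -math.log10(unavailability)

def downtime_minutes_per_year(unavailability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return unavailability * 365 * 24 * 60

# Hypothetical node that is 99% available (two nines).
NODE_UNAVAILABILITY = 0.01

for n in (1, 2, 3, 10):
    u = parallel_unavailability(NODE_UNAVAILABILITY, n)
    print(f"{n:2d} nodes: {nines(u):5.1f} nines, "
          f"{downtime_minutes_per_year(u):.3g} min/year down")
```

Working in terms of unavailability rather than availability keeps the arithmetic away from floating-point values that round to exactly 1.0. In this idealized model each added node adds as many nines as a single node has, which is exactly the prediction the comment above disputes for real clusters.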

Engineer 1: I know this particular formula is too simple for the real world, so I completely agree with your comment. But the complexity of big clusters can still be captured by more complex mathematical models, e.g. by adding more boxes with parallel and series connections. The formula will be much uglier, but useful, I believe... Plus, each individual node could be decomposed into a structure that models both its HW and SW parts. The approach is described in some books I mentioned in my other posts, and there is supposed to be a tool for doing that, but I am not aware of any. Are you?

In my real life, this type of calculation/modeling is just a starting point for getting a rough estimate, which is then adjusted using monitoring data. What I do not like is when some architects make decisions about the level of cluster redundancy without ANY calculation (!), just based on their gut feelings...
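The series/parallel modeling mentioned in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: each node is modeled as a HW block in series with a SW block, the nodes are placed in parallel, and the whole group sits in series with one shared component (interconnect, cluster software, operators) that all nodes depend on. All names and all numbers are hypothetical, not measurements of any real product.

```python
import math

def parallel(unavailabilities):
    """Redundant blocks: the group is down only if every block is down."""
    return math.prod(unavailabilities)

def series(unavailabilities):
    """Chained blocks: the group is down if any one block is down."""
    # Equals 1 - prod(1 - u); log1p/expm1 keep precision for very small u.
    return -math.expm1(sum(math.log1p(-u) for u in unavailabilities))

def nines(unavailability):
    """Number of nines, e.g. unavailability 1e-5 -> 5.0."""
    return -math.log10(unavailability)

def cluster_unavailability(nodes, hw_u=5e-3, sw_u=5e-3, shared_u=1e-5):
    """Shared component in series with `nodes` redundant HW+SW nodes."""
    node_u = series([hw_u, sw_u])          # a node needs both HW and SW
    redundant_group = parallel([node_u] * nodes)
    return series([shared_u, redundant_group])

for n in (2, 4, 16):
    print(f"{n:2d} nodes: {nines(cluster_unavailability(n)):.2f} nines")
```

With these made-up inputs the shared component caps the result at about five nines, so going from four to sixteen nodes changes almost nothing, while two nodes land just under four nines. That is the same qualitative shape as the real-world observations in the first comment, which suggests the two camps may disagree less than it seems once common-mode failures are included in the model. The inputs would still need to be adjusted against monitoring data, as described above.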

(NB: Real names can be found on the actual LinkedIn forum thread.)
UPDATE: see the start point here