Showing posts with label Availability.

Tuesday, October 16, 2012

Availability vs. Capacity

Continuing the previous posts about the "Battle between 'Gut-feeling' and Engineering"...

Engineer 2: Igor - if you would like to find further extensions of your equation, you might check out Volume 1 of "Breaking the Availability Barrier," which I co-authored and which is available on Amazon. Also, check out several papers I published in the Availability Digest in the Geek Corner (http://www.availabilitydigest.com/articles.htm). A subscription is free.


Monday, October 8, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering. Round 3. (2 and 3 are in the previous posts)

UPDATE: the start is here
UPDATE: other rounds are here

Wednesday, October 3, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering. Round 2.

This is a continuation of the previous post.

Gut-feeler 1: Gut feelings aren't always bad - for instance, when working on an IT project meant to support really critical business processes, and hence with a lot of money or even lives at stake, your gut feeling might be that standard clustering just isn't good enough and that you need something significantly better here.
 
Then it comes down to curiosity - is there something better around than the usual standard clustering? Is there some other IT infrastructure, one that is fault tolerant and self-healing, providing much more reliability right out of the box than you could ever achieve using the plain vanilla stuff and the most sophisticated clustering conceivable?

If you are that curious, chances are you will end up at NonStop.

If you are not curious and just do calculations, you might end up adding more nodes to your standard cluster, hoping to make it more reliable - which in the real world often turns out to be a false hope...

But don't get me wrong, I'm not at all against calculations. A very important one concerns capacity: will (n - 1) nodes still support my workload when a node goes down for whatever reason? That's often overlooked...
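To make that check concrete, here is a minimal sketch in Python; the node capacity and workload figures are invented purely for illustration:

```python
# Hypothetical figures: a 4-node cluster where each node can sustain
# 2,500 requests/s, against a peak workload of 6,000 requests/s.
node_capacity = 2500.0   # requests/s per node (assumed)
n_nodes = 4
peak_workload = 6000.0   # peak requests/s (assumed)

# Capacity that survives the loss of one node:
surviving_capacity = (n_nodes - 1) * node_capacity

if surviving_capacity >= peak_workload:
    print(f"OK: {surviving_capacity:.0f} req/s available on n-1 nodes")
else:
    print(f"Undersized: only {surviving_capacity:.0f} req/s on n-1 nodes")
```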

Engineer 2: Two comments, one on the Trubin law, and one on the comments above. ... Adding a node of m 9s to a node of m 9s adds m 9s to a cluster. The overall availability of a cluster of n nodes, each with m 9s of availability, is mn 9s. ... For instance, a three-node cluster of nodes with two 9s of availability will have six 9s...

Let f be the failure probability of a node [A = 1 - f]. If A is an exact number of nines, then f = 0.1^m, where m is the number of nines (for instance, for three 9s, f = 0.1^3 = 0.001 and A = 1 - 0.1^3 = 0.999). For an n-node cluster, the availability is 1 - (1 - A)^n = 1 - [1 - (1 - 0.1^m)]^n = 1 - (0.1^m)^n = 1 - 0.1^(mn).

In general, if a node has a failure probability of f, then an n-node cluster has an availability of 1 - f^n. Two nodes with availabilities of 0.95 will have a combined availability of 0.9975.

Of course, this assumes that the cluster fails only if all nodes fail. Generally, a cluster can withstand the failure of some nodes but not all. In this case, the above relations can be modified to accommodate this situation.
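As a hedged illustration of both cases, here is a small Python sketch: the first function implements the "cluster fails only when all nodes fail" model above, while the second covers a "cluster is up while at least k of n nodes are up" variant using the binomial distribution (one standard way to modify the relations, not necessarily the only one):

```python
from math import comb

def availability_all_fail(a_node, n):
    """Cluster fails only if all n nodes fail: A = 1 - (1 - a)^n."""
    return 1 - (1 - a_node) ** n

def availability_k_of_n(a_node, n, k):
    """Cluster is up while at least k of n independent nodes are up."""
    return sum(comb(n, i) * a_node ** i * (1 - a_node) ** (n - i)
               for i in range(k, n + 1))

print(availability_all_fail(0.99, 3))   # 0.999999 -- six 9s, as above
print(availability_k_of_n(0.99, 3, 2))  # ~0.9997 -- if 2 of 3 nodes are needed
```

Note how requiring 2 of 3 nodes for capacity reasons drops the figure from six 9s to roughly three and a half.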

"Gut-feeler 1"'s suggestion that adding nodes does not result in this additional availability is quite correct. The above relations apply only to hardware failures (or whatever failures might be included in the nodal availability), and are accurate for those. However, once the hardware availability becomes high (say four 9s), other factors that are not node related come into play, such as software bugs, operator errors, and environmental faults (power, air conditioning). These limit the practical availability that can be achieved. In effect, after a certain point, increases in hardware availability become irrelevant to system availability.

Thanks for starting a very interesting and meaningful thread, Igor. 

See the next post for the next round. 
 

Tuesday, October 2, 2012

Systems Availability Arena: Battle between "Gut-feeling" and Engineering!

I posted my "Cluster Availability 9's Equation" post to the LinkedIn Continuous Availability forum and got 19 comments, divided into roughly two camps: "Gut-feelers" and "Engineers". Below are the first two comments. (See the next posts for the other comments.)


Gut-feeler 1: "Nice formula, looks good in theory – but it won't hold true in the real world. The reason for this is the underlying assumption of an ideal cluster, which does not exist in the real world. When looking at a real-world implementation like an Oracle RAC cluster, you will find that a simple two-node cluster configuration will typically deliver somewhere between three and four nines of availability.

Now, will adding a third node to that cluster add another 9 to the availability figure? Will a ten-node cluster really provide 99.999999999999% availability? Will a cluster with a hundred nodes run continuously for thousands of years without any outage?

Certainly not, and talking to system administrators running typical cluster installations will quickly reveal that large clusters are quite complex and difficult to handle, and hence more prone to failure than simple two-node clusters.

Even when looking at the HP NonStop architecture – which comes pretty close to the ideal cluster – the formula would not apply. A NonStop system (which internally is indeed a cluster, with each NonStop CPU resembling a node) delivers roughly five nines of availability – but there is no significant availability difference between systems with, e.g., four and sixteen CPUs (cluster nodes).

So it is not so important how many nodes you have – but it is very important what kind of cluster you have!"

Engineer 1: I know this particular formula is too simple for the real world, so I completely agree with your comment. But still, the complexity of big clusters can be modeled by more complex mathematical models, e.g. by adding more boxes with parallel and series types of connections. The formula will be much uglier, but useful, I believe... Plus, each individual node could be decomposed into some structure to model both the HW and SW parts. The approach is described in some books I mentioned in my other posts, and there is supposed to be a tool to do that, but I am not aware of any. Are you?

In my real life this type of calculation/modeling is just a starting point to get a rough estimate, which is then adjusted using monitoring data. What I do not like is when some architects make decisions about the level of cluster redundancy without ANY calculation (!), just based on their gut feelings...

(NB: Real names can be found in the actual LinkedIn forum thread)
UPDATE: see the start point here
 

Monday, September 17, 2012

LinkedIn Discussion around Trubin's Availability Formula

The previous post, "Cluster Availability 9's Equation", triggered a very good discussion on the LinkedIn Continuous Availability Forum. It currently has 19 comments (!)... I plan to re-post some comments from the discussion here in my blog. (UPDATE: it is re-posted here)

BTW, in one of the comments Bill Highleyman (co-author of "Breaking the Availability Barrier") pointed out a mistake in my formula, which I corrected by replacing "n+n" with "mn". He also pointed to an excellent resource about availability calculation: The Geek Corner of the Availability Digest, where he writes articles. One of the articles there extends the subject of this (and a couple of previous) posts and is called "Calculating Availability – Redundant Systems".

As I suspected, my formula (the "Trubin law") is just a particular case of a more generic rule that Bill Highleyman formulates in that article:

"... Adding a spare node adds the number of nines associated with that node to the system
availability but reduced by the increase in failure modes.


That is, adding an additional spare node adds the number of 9s of that node to the system
availability – almost. This improvement in availability is reduced a bit by the increase in the
number of failure modes in the system. More nodes mean more failure modes..."


 

Friday, September 14, 2012

Cluster Availability 9's Equation


Based on the "Trubin" Law (see my previous post), each additional node adds one more 9 to the overall cluster availability. That is exactly true only if a single node has exactly one 9 (A = 0.9), as the "Trubin" equation shows.

But how would that work for other single-node availability numbers? What if a node has two or three 9s? I have generalized my previous equation to cover that, and it shows that the number of 9s in the cluster availability grows in arithmetic progression!
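A quick numeric check of that progression (a Python sketch; it simply evaluates A = 1 - 0.1^(mn) for growing n):

```python
import math

def cluster_availability(m, n):
    """Ideal n-node cluster of nodes with m 9s each: A = 1 - 0.1^(m*n)."""
    return 1 - 0.1 ** (m * n)

def nines(a):
    """Number of 9s in an availability figure, e.g. 0.999 -> 3."""
    return -math.log10(1 - a)

for n in (1, 2, 3, 4):
    a = cluster_availability(2, n)    # nodes with two 9s each
    print(n, a, round(nines(a), 1))   # 9s grow as 2, 4, 6, 8
```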
Check more in the next post "LinkedIn Discussion around Trubin's Availability Formula"


Friday, August 17, 2012

How to Calculate Availability of Clustered Infrastructure for Multi-Tier Application

That is the task I am working on right now. I have made some progress, and the approach I found is to build an availability graph, considering the clustered infrastructure as a chain of parallel- and series-connected nodes, described here with formulas. So below is a simple example:

[Figure: availability graph – component A1 in series with n parallel chains of A2 and A3, followed by A4]
And the availability calculation formula will be: 

A = A1 * (1 - (1 - A2*A3)^n) * A4

You can play with different levels of redundancy "n" in the cluster here. Currently it is 2, but you could estimate it for n = 3 or n = 4. This approach opens up the possibility of quantitatively justifying your architectural decisions (not just relying on "best practices" or "gut feelings").
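Here is a minimal sketch of that calculation in Python; the component availabilities A1..A4 below are placeholders to be replaced with your own estimates:

```python
def multi_tier_availability(a1, a2, a3, a4, n):
    """A = A1 * (1 - (1 - A2*A3)^n) * A4:
    A1 and A4 in series with n parallel A2-A3 chains."""
    return a1 * (1 - (1 - a2 * a3) ** n) * a4

# Placeholder component availabilities (assumptions, not measurements):
a1, a2, a3, a4 = 0.999, 0.99, 0.995, 0.999

for n in (2, 3, 4):
    print(n, multi_tier_availability(a1, a2, a3, a4, n))
```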

If you know the MTBF and MTTR for each individual component (SW and HW), you could estimate the availability of the whole infrastructure using this approach. But how do you get those individual numbers? From vendors - good luck! Maybe from incident records? Or by setting up special monitoring for that (synthetic/robotic?).
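If you do manage to collect MTBF and MTTR per component (for example from incident records), the standard conversion to steady-state availability is A = MTBF / (MTBF + MTTR); a small sketch with made-up figures:

```python
def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    """Steady-state availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Made-up figures: a component failing about once a year, 4 hours to repair.
a = availability_from_mtbf_mttr(mtbf_hours=8760.0, mttr_hours=4.0)
print(a)  # ~0.99954 -- about three 9s for this single component
```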

Other useful resources with formulas relevant to this: