Assume you are tasked with running a set of mission-critical web services with a target uptime of 99.5%.
To achieve this, you have two identical hardware clusters in two entirely separate data centers, with different bandwidth providers, etc.
At the top of each rack sits some top-of-rack hardware (switching and firewall equipment).
Your software setup makes it easy, with human intervention, to reroute all services from one data center to the other if one of the clusters is knocked offline. Say this takes about 15 minutes.
The question is this: would you spend an extra $8k per data center on top-of-rack equipment with no single point of failure (hot-failover switching and firewalls)? By way of reference, each rack holds about $50k in servers.
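For context, it helps to see what the 99.5% target actually allows. A quick back-of-the-envelope sketch (the failure count below is an assumed figure for illustration, not from the question):

```python
# Downtime budget for a 99.5% annual uptime target,
# and how much of it the 15-minute manual failover consumes.

HOURS_PER_YEAR = 365 * 24  # 8760
target_uptime = 0.995

budget_hours = HOURS_PER_YEAR * (1 - target_uptime)
print(f"Allowed downtime per year: {budget_hours:.1f} hours")  # 43.8

failover_minutes = 15    # manual reroute time from the question
failures_per_year = 2    # assumed number of top-of-rack failures (hypothetical)
used_hours = failures_per_year * failover_minutes / 60
print(f"Downtime from {failures_per_year} manual failovers: {used_hours:.1f} hours")
print(f"Budget remaining: {budget_hours - used_hours:.1f} hours")
```

On these assumptions, a handful of 15-minute manual failovers barely dents a 99.5% budget; the target only becomes hard to hit if an outage lasts days rather than minutes.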
The redundant setup is much more complex, has more failure modes, and is guaranteed to cost more, both in initial purchase price and in the time spent maintaining and troubleshooting the more complicated setup.
Also, we have been running the (redundant) setup for over 8 years and have never once (knock on wood) had a failure of the top-of-rack equipment. We replace all our hardware on three-year in-service cycles.
The downside of the single-point-of-failure model is that if we lose a switch or a firewall, the entire data center goes down. Also, a human must throw a switch to route the failed services to the other data center (for various reasons, there is no reliable way to do this automatically).
I suspect that, because it uses less hardware and is simpler, the single-point-of-failure option will result in greater uptime in the real world. In my experience, hardware failures occur much less frequently than people failures, and switches/firewalls (with no spinning disks, hardened OSes, etc.) rarely fail.
What does the server fault community think?
-
No matter what you do, hardware will fail. People will make mistakes.
Without a shadow of a doubt I'd upgrade each rack to have multiple of everything.
You say that you have $50k of servers in each rack, but only a single switch connecting them to the outside world? Also a single router and a single firewall, I presume.
I'm not sure I could personally cope with that if I were the sysadmin. I'd demand multiple diverse transit providers, a pair of edge routers in HA/HSRP mode, a pair of HA firewalls, at least two switches, and dual NICs in every server, with a different switch on each NIC. STP handles the failure of a switch or a port; that's automagic. Failure of a router is handled by the HA software on the pair. Ditto firewalls. As for losing a datacentre and switching the traffic between them, I'm assuming you use some form of GSLB device?
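To put rough numbers on why the paired setup helps (the 99.9% per-device availability below is an assumed figure for illustration, assuming independent failures):

```python
# Series vs. parallel availability, assuming independent failures.
# The per-device availability is an assumed value, not a measurement.

device_availability = 0.999  # assumed: each switch/router/firewall is up 99.9%

# Single device in the path: path availability equals device availability.
single = device_availability

# HA pair: the path is down only if BOTH devices are down at once.
pair = 1 - (1 - device_availability) ** 2

print(f"Single device: {single:.4%}")  # 99.9000%
print(f"HA pair:       {pair:.6%}")    # 99.999900%

# Chaining stages in series multiplies availabilities,
# e.g. a path through router -> firewall -> switch, each unpaired:
series_three = device_availability ** 3
print(f"Three single devices in series: {series_three:.4%}")
```

The point of the arithmetic: single devices in series compound their unavailability, while pairing each stage drives its failure probability to the square of an already-small number.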
I completely grok your idea, but here's the problem: say DC1 goes offline because of a major incident (fire, flood, act of $imaginary_deity) that will take days or weeks to recover from... and then you have a router failure in DC2. That's hardly an implausible scenario. Your entire infrastructure is now unreachable from the internet, based on what you've told us.
Is this one of your acceptable failure modes? I certainly wouldn't stand for this kind of outage when it's so easily (if not cheaply) avoided.
I suppose if you do the risk assessment for this kind of outage and factor in the lost business your employer would suffer, then if the cost of the upgrade is less than the loss of a week's business, it's a good deal.
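That risk assessment can be sketched as a simple expected-value comparison. Only the $8k-per-DC upgrade cost comes from the question; every probability and revenue figure below is a placeholder to swap for your own estimates:

```python
# Expected-loss sketch: compare the upgrade cost against the annualized
# cost of the double-failure scenario. All risk figures are assumed.

upgrade_cost_per_dc = 8_000   # from the question
num_dcs = 2
upgrade_cost = upgrade_cost_per_dc * num_dcs

p_double_failure = 0.02       # assumed annual chance of DC loss + ToR failure
outage_days = 7               # assumed outage length for the scenario
revenue_per_day = 20_000      # assumed lost business per day of outage

expected_annual_loss = p_double_failure * outage_days * revenue_per_day

print(f"Upgrade cost (both DCs):        ${upgrade_cost:,}")
print(f"Amortized over a 3-year cycle:  ${upgrade_cost / 3:,.0f}/year")
print(f"Expected annual loss, no HA:    ${expected_annual_loss:,.0f}")
```

If the expected annual loss comes out above the amortized yearly cost of the upgrade, the redundancy pays for itself on paper; the honest answer depends entirely on the probability and revenue estimates you plug in.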
From Tom O'Connor