I recently read a news article about Intermedia's service level agreement 'miss' that was linked to a performance issue on an EMC CLARiiON array
.
http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1510721,00.html
There have also been a couple of subsequent posts and email responses linked to this story;
http://itknowledgeexchange.techtarget.com/storage-soup/one-storage-pros-response-to-intermedias-hosted-email-outage/
http://chucksblog.emc.com/chucks_blog/2010/04/helping-to-avoid-a-really-bad-day.html
I wanted to make a few commetns myself in regards to the story and the responses shown above.
Firstly, I agree with everything Chuck Hollis at EMC says in his post, and I wanted to emphasis and elaborate on his points.
Products Fail?
Damn right they do, all the time...sometimes without causing much of a fuss, but trust me failures don't seem that common because you only hear about the big ones (like Intermedia's). It is a testament to IT hardware vendor's engineering that alot of these "failures" go unnoticed because fo the rigorous redundancy build into their systems...not to mention field support services which, in the case of EMC, are some of the best around.
A short anecdote that relates to this story; an insurance client of ours suffered a similar failure on their IBM N-Series (NetApp) devices a few years back. A controller panicked due to a power supply issue and tried to hand over its load to the other controller but due to incorrect configuration of multi-pathing, dropped all the workloads that it was serving. Result; reboots, reboots, reboots. Missed SLA.
Design for Failure
It will happen...not if, but when. You will have a component failure somewhere in your data path at some point in the future. Design for it (or insure for it!).
CLARiiON arrays (like N-Series, HDS and many other array vendors) have controllers that operate in active/active configuration, which is great when both controllers are working, and 99.99% of the time it works fine when one fails (the beauty of PowerPath). But the disadvantage of running and active/active architecture in a disk array is that, unless you religiously monitor your workloads, you can never be sure if you can meet performance demands in a degredated state (this principle applies all down the data path, even to RAID Group design and LUN layout). My favourite disk array of the last 10 years is EqualLogic's PS Series, now owned by Dell. These fellas only operate in active/passive mode to ensure customers don't accidentally find themselves in Intermedia's situation where peak load cannot be accommodated in degradated mode.
The Alarm is Ringing but Everyone's Asleep
This is an interesting point...vendors and integrators like ourselves put effort into engineering and deploying monitoring and alerting for systems in client sites. That's great but if the client doesn't put in place procedural steps that are triggered into action by these tools, all is for nought. There is no point in having a tight RPO and the ability to deliver a quick RTO unless you have the procedural surety to act when issues are identified. EMC's DialHome feature is a good example of removing this dependency but its simply not possible (nor do you want it to be possible) for all system or component failures. In short your recovery time is only as good the weakest trigger point and usually that trigger point is simply deciding to act on a error/mis-configuration event.
Practice Failure
Great tip here...I hear clients and prospects talk about their highly redundant environments and their sub-minute failover setups and ask have they tested it...usually the answer is no. Reminds me of people who love to talk about how much their house has gone up in value...inevitably when they actually want to sell they are a little disappointed. Proof is in the pudding. You must test your failure recovery procedures. VMware's SRM product is an excellent tool for doing this non-disruptively. Clients should regularly test failover of their Tier 1/2 applications to ensure that the 'best laid plans' are also the 'tried and true' method.