I recently read a news article about Intermedia's service level agreement 'miss' that was linked to a performance issue on an EMC CLARiiON array.
http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1510721,00.html
There have also been a couple of subsequent posts and email responses linked to this story:
http://itknowledgeexchange.techtarget.com/storage-soup/one-storage-pros-response-to-intermedias-hosted-email-outage/
http://chucksblog.emc.com/chucks_blog/2010/04/helping-to-avoid-a-really-bad-day.html
I wanted to make a few comments myself regarding the story and the responses shown above.
Firstly, I agree with everything Chuck Hollis at EMC says in his post, and I want to emphasise and elaborate on his points.
Products Fail?
Damn right they do, all the time...sometimes without causing much of a fuss. Trust me, failures only seem uncommon because you only hear about the big ones (like Intermedia's). It is a testament to IT hardware vendors' engineering that a lot of these "failures" go unnoticed because of the rigorous redundancy built into their systems...not to mention field support services which, in the case of EMC, are some of the best around.
A short anecdote that relates to this story: an insurance client of ours suffered a similar failure on their IBM N-Series (NetApp) devices a few years back. A controller panicked due to a power supply issue and tried to hand its load over to the other controller but, because multi-pathing was incorrectly configured, it dropped all the workloads it was serving. Result: reboots, reboots, reboots. Missed SLA.
Design for Failure
It will happen...not if, but when. You will have a component failure somewhere in your data path at some point in the future. Design for it (or insure for it!).
CLARiiON arrays (like N-Series, HDS and arrays from many other vendors) have controllers that operate in an active/active configuration, which is great when both controllers are working, and 99.99% of the time it works fine when one fails (the beauty of PowerPath). But the disadvantage of running an active/active architecture in a disk array is that, unless you religiously monitor your workloads, you can never be sure you can meet performance demands in a degraded state (this principle applies all the way down the data path, even to RAID Group design and LUN layout). My favourite disk array of the last 10 years is EqualLogic's PS Series, now owned by Dell. These fellas only operate in active/passive mode to ensure customers don't accidentally find themselves in Intermedia's situation, where peak load cannot be accommodated in degraded mode.
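To make the degraded-state question concrete, here is a rough back-of-envelope sketch in Python. It is my own illustration, not anything from EMC or Intermedia: the function names are made up, and it assumes identical controllers, load that spreads evenly across the survivors, and performance that scales linearly with utilisation. Real arrays are messier than that, so treat the result as a warning light rather than a guarantee.

```python
from typing import Sequence


def degraded_utilisation(peak_utilisation: Sequence[float]) -> float:
    """Projected utilisation of each surviving controller after one fails.

    peak_utilisation holds one value per controller (0.0 to 1.0), measured
    at peak load while every controller is healthy.
    Assumption: identical controllers and an even spread of the failed
    controller's load across the survivors.
    """
    if len(peak_utilisation) < 2:
        raise ValueError("need at least two controllers to model a failover")
    survivors = len(peak_utilisation) - 1
    return sum(peak_utilisation) / survivors


def survives_controller_loss(peak_utilisation: Sequence[float],
                             ceiling: float = 1.0) -> bool:
    """True if peak load still fits under `ceiling` with one controller down."""
    return degraded_utilisation(peak_utilisation) <= ceiling
```

Two controllers peaking at 45% and 40% project to about 85% on the survivor, which passes at the default ceiling but fails if you insist on staying under 80%. Two controllers peaking at 65% and 60% project to 125%, which is the Intermedia situation.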
The Alarm is Ringing but Everyone's Asleep
This is an interesting point...vendors and integrators like ourselves put effort into engineering and deploying monitoring and alerting for systems at client sites. That's great, but if the client doesn't put in place procedural steps that these tools trigger into action, all is for nought. There is no point in having a tight RPO and the ability to deliver a quick RTO unless you have the procedural surety to act when issues are identified. EMC's DialHome feature is a good example of removing this dependency, but it's simply not possible (nor do you want it to be possible) for all system or component failures. In short, your recovery time is only as good as the weakest trigger point, and usually that trigger point is simply deciding to act on an error or misconfiguration event.
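One simple way to find the sleeping alarms is to list every alert your monitoring can raise and check that each one maps to a documented response. The sketch below is purely illustrative: the `Response` fields, the alert names and the runbook structure are all hypothetical, and it is not tied to any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Response:
    """A documented procedure for one alert type (hypothetical structure)."""
    owner: str               # who gets paged
    first_action: str        # the first procedural step to take
    act_within_minutes: int  # how long before someone must act


def alerts_without_procedure(alert_types: Iterable[str],
                             runbook: Dict[str, Response]) -> List[str]:
    """Return, sorted and de-duplicated, the alert types nobody is told to act on."""
    return sorted(a for a in set(alert_types) if a not in runbook)
```

Any alert type this returns is an alarm that can ring while everyone is asleep: the tool will report the event, but no procedure says who decides to act or how quickly.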
Practice Failure
Great tip here...I hear clients and prospects talk about their highly redundant environments and their sub-minute failover setups, and when I ask whether they have tested them...usually the answer is no. Reminds me of people who love to talk about how much their house has gone up in value...inevitably, when they actually want to sell, they are a little disappointed. The proof is in the pudding. You must test your failure recovery procedures. VMware's SRM product is an excellent tool for doing this non-disruptively. Clients should regularly test failover of their Tier 1/2 applications to ensure that the 'best laid plans' are also the 'tried and true' method.
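If you keep a record of your failover drills, the "have you tested it?" question becomes something you can check rather than argue about. The following is a minimal sketch under my own assumptions: the function name, the 90-day freshness window and the drill record format are made up for illustration, and it has nothing to do with SRM's actual interfaces.

```python
from datetime import date, timedelta
from typing import List, Tuple

# One drill record: (date of the test, measured failover time in seconds).
Drill = Tuple[date, float]


def failover_is_proven(drills: List[Drill],
                       rto_seconds: float,
                       today: date,
                       max_age_days: int = 90) -> bool:
    """True only if the most recent drill is recent enough and met the RTO.

    An untested failover (empty drill list) is never considered proven.
    """
    if not drills:
        return False
    last_date, last_seconds = max(drills, key=lambda d: d[0])
    is_recent = (today - last_date) <= timedelta(days=max_age_days)
    return is_recent and last_seconds <= rto_seconds
```

A "sub-minute failover" claim with no drills on record returns False, as does a drill that met the target a year ago or one that ran last week but took two minutes.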
DavsDisorder
This blog captures some of the observations of Tim Davoren, Data Engines' founder and Managing Consultant. Do not expect an especially coherent delivery here!