DavsDisorder

This blog captures some of the observations of Tim Davoren, Data Engines' founder and Managing Consultant. Do not expect an especially coherent delivery here!

Cloud does not equal better BC/DR

Tim Davoren - Friday, October 08, 2010
I refer here to a recent article penned by Tony Pearson of IBM discussing a recent catastrophic failure of an EMC Symmetrix within the State of Virginia's IT environment. Aside from some cheap inter-vendor point scoring, Tony mentions as one of the 'lessons' from this event: "Lesson 4: This can serve as a wake-up call to consider Cloud Computing as an alternative option". There is a faint tinge of irony here in that this post of Tony's was written in Australia in early September. At the end of that month the IT community (and the broader travelling public) witnessed how a 'cloud' provider can be just as exposed to downtime as your now unfashionable internal IT team. Virgin Blue's ticketing and reservation systems were brought offline due to an as yet unidentified systems failure within the storage data path. Virgin Blue do not own/operate their own ticketing and reservation system, but source such a service from Navitaire (disclosure: Navitaire are an old client of my firm). It took Virgin Blue (and I assume Navitaire) almost 7 days to return this service to normal operation. I won't call these observations 'lessons', rather just 'comments':

  1. I agree with Tony (and probably every other seasoned data storage professional)...storage systems fail, that's why we have backup systems. These systems in turn are definitely only as good as their last successful test...if regular testing is too much of a burden then you ought to at the very least audit the environment according to some baseline.
  2. Whilst the person in question at the State of Virginia may be a little ashen-faced currently, I can assure you that the "service delivery manager" (as they were called in the heyday of outsourcing) at Virgin Blue for the reservation and ticketing system will be feeling the same churning in the lower part of the stomach...his contact at Navitaire probably likewise. Just because a cloud provider is 'big' or 'branded' or (insert alarm bells) 'multi-tenanted' does not mean for one second that they can do a better/cheaper job of helping you meet your SLAs for service uptime and/or data recovery.
I advise strong governance of how storage systems are used in medium to large organisations, as well as a strict focus on 'recoverability' in governance of backup systems.

In the former instance, remember that 'speeds and feeds', as Tony puts it, indicate in my experience the 'bleeding edge' of what a product can reliably do...divide by 2 and set that as your peak load. The more complex the data storage layout on a disk array (fragmented RAID groups, meta-LUNs, concatenated LUNs, etc.), the longer your restore/rebuild will be. Remember that in the never-ending race toward better storage performance, there is a necessary compromise around recoverability.
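As a rough illustration of the 'divide by 2' rule of thumb, the sketch below derates a vendor's quoted maximum and checks a measured peak against the result. The function names and the example figures are illustrative assumptions, not vendor numbers, and the factor of 2 is simply the rule of thumb expressed as a parameter.

```python
def planned_peak(rated_max, derate_factor=2.0):
    """Return the peak load to plan for, given a vendor's quoted maximum.

    rated_max: the 'speeds and feeds' figure (e.g. IOPS or MB/s).
    derate_factor: hypothetical safety divisor; 2.0 mirrors 'divide by 2'.
    """
    if derate_factor <= 0:
        raise ValueError("derate_factor must be positive")
    return rated_max / derate_factor


def within_plan(measured_peak, rated_max, derate_factor=2.0):
    """True if the measured peak stays at or under the derated ceiling."""
    return measured_peak <= planned_peak(rated_max, derate_factor)
```

For example, an array quoted at 100,000 IOPS would be planned for a 50,000 IOPS peak, so a measured peak of 60,000 would already be outside the plan.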

In the latter instance, just 'think' in terms of recovery, not 'did it get backed up'...build backup systems that focus on the process of data restoration (we talk only of data availability here; compute availability is a whole different story). It is far better to have a backup that runs for 8-10 hours, completes, validates and is easily restorable than a backup that runs in half that time but requires multi-step, error-prone recovery procedures.

What all prospective SaaS buyers should never see

Tim Davoren - Wednesday, June 30, 2010
Whilst doing some research around options for moving our internal mail and file serving/collaboration requirements into a 'cloud' provider, I came across the details of the Telstra T-Suite offering, part of which is the Microsoft BPOS suite. As we are thinking of consolidating telecoms with Telstra also, I thought I'd test how their "implementation" (assuming they are actually hosting it in Australia somewhere) of the suite responded (browser refreshes etc). As a Microsoft partner we can use their BPOS, but I am guessing it is hosted in a US or other remote DC, so I thought I would try Telstra; unfortunately, whilst trying to secure a trial I got the following screen:

 
Not an encouraging way to greet prospective SaaS buyers!!!

8 Steps to Effective DR Planning and Budget Requisition

Tim Davoren - Sunday, June 20, 2010
Similar to a post from a few weeks back, I found this little summary amongst some old client correspondence which you may find useful.

8 Steps to Effective DR Planning and Budget Requisition

  1. Use the term Disruption Recovery rather than Disaster Recovery.
  2. Ensure you currently have some kind of DR/BCP management framework no matter how rudimentary. Show business management that there are current documented processes, metrics and testing that can be optimised. Technology supports the business insurance requirements, it doesn’t necessitate it. You need to show that DR planning is an ongoing process not a point in time flight of fancy.
  3. Engage the right people in your organisation. Applications support and owners, facilities management, and of course, when ready, financial and executive management.
  4. Conduct a joint Business Impact Analysis or Risk Assessment. What are we insuring/protecting against...delineate the actual dangers. Threats = Impacts = $$.
  5. Then proceed to a ‘costs of downtime’ calculation. Dependency mapping of all business applications is the starting point for this. This is never a clear-cut equation, but the revenue that a particular application or ecosystem of applications supports is the dividend, downtime impact is the divisor and, roughly speaking, the cost of downtime is the quotient (which should align roughly with a budget!).
  6. Position a DR investment as having some competitive market value (ROI). Present ‘best-in-class’ industry peer data. Reputational loss and client retention may be affected.
  7. Develop a DR services catalogue...align costs to system criticality (RPO/RTO). Relative costs vs. absolute costs wherever possible. Suggest chargeback to BUs etc.
  8. Align DR investment with other IT budgets. Try to link the technologies used in providing DR services to other areas of IT ‘spend’, e.g. data centre consolidation, server consolidation (virtualisation), and utilising DR infrastructure for development or test purposes.
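To make step 5 a little more concrete, here is a minimal sketch of one common way to approximate a cost of downtime. It is a simplification and not necessarily the exact quotient described in the list: it assumes revenue is spread evenly over operating hours, and it introduces a hypothetical `impact_fraction` for the share of that revenue actually lost while the application is down. All names and figures are illustrative assumptions.

```python
def hourly_downtime_cost(annual_revenue, operating_hours_per_year,
                         impact_fraction=1.0):
    """Approximate revenue lost per hour of downtime for one application.

    annual_revenue: revenue the application (or ecosystem) supports.
    operating_hours_per_year: hours per year the application is expected up.
    impact_fraction: assumed share of revenue lost while down (0 to 1).
    """
    if operating_hours_per_year <= 0:
        raise ValueError("operating_hours_per_year must be positive")
    if not 0 <= impact_fraction <= 1:
        raise ValueError("impact_fraction must be between 0 and 1")
    return annual_revenue / operating_hours_per_year * impact_fraction


def outage_cost(annual_revenue, operating_hours_per_year, outage_hours,
                impact_fraction=1.0):
    """Approximate cost of a single outage of the given length."""
    hourly = hourly_downtime_cost(annual_revenue, operating_hours_per_year,
                                  impact_fraction)
    return hourly * outage_hours
```

An application supporting 8,760,000 in annual revenue around the clock works out at roughly 1,000 per hour under these assumptions; a 4-hour outage that loses half of that revenue would cost about 2,000. The point is to produce a defensible order of magnitude to set beside the DR budget, not a precise figure.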

Designing for Failure

Tim Davoren - Friday, April 30, 2010
I recently read a news article about Intermedia's service level agreement 'miss' that was linked to a performance issue on an EMC CLARiiON array.

http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1510721,00.html


There have also been a couple of subsequent posts and email responses linked to this story;

http://itknowledgeexchange.techtarget.com/storage-soup/one-storage-pros-response-to-intermedias-hosted-email-outage/

http://chucksblog.emc.com/chucks_blog/2010/04/helping-to-avoid-a-really-bad-day.html


I wanted to make a few comments myself in regard to the story and the responses shown above.

Firstly, I agree with everything Chuck Hollis at EMC says in his post, and I wanted to emphasise and elaborate on his points.

Products Fail?
Damn right they do, all the time...sometimes without causing much of a fuss, but trust me, failures don't seem that common because you only hear about the big ones (like Intermedia's). It is a testament to IT hardware vendors' engineering that a lot of these "failures" go unnoticed because of the rigorous redundancy built into their systems...not to mention field support services which, in the case of EMC, are some of the best around.

A short anecdote that relates to this story; an insurance client of ours suffered a similar failure on their IBM N-Series (NetApp) devices a few years back. A controller panicked due to a power supply issue and tried to hand over its load to the other controller but due to incorrect configuration of multi-pathing, dropped all the workloads that it was serving. Result; reboots, reboots, reboots. Missed SLA.

Design for Failure
It will happen...not if, but when. You will have a component failure somewhere in your data path at some point in the future. Design for it (or insure for it!).

CLARiiON arrays (like N-Series, HDS and many other array vendors) have controllers that operate in active/active configuration, which is great when both controllers are working, and 99.99% of the time it works fine when one fails (the beauty of PowerPath). But the disadvantage of running an active/active architecture in a disk array is that, unless you religiously monitor your workloads, you can never be sure if you can meet performance demands in a degraded state (this principle applies all down the data path, even to RAID Group design and LUN layout). My favourite disk array of the last 10 years is EqualLogic's PS Series, now owned by Dell. These fellas only operate in active/passive mode to ensure customers don't accidentally find themselves in Intermedia's situation where peak load cannot be accommodated in degraded mode.
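The degraded-state question can be framed as a simple headroom check. The sketch below is an illustration only: it assumes identical controllers, that all load from a failed controller moves to the survivors, and that peak load and capacity are measured in the same unit. The function name and the numbers are hypothetical.

```python
def survives_controller_loss(controller_peaks, controller_capacity):
    """True if the remaining controllers can carry the total peak load
    after any single controller fails.

    controller_peaks: observed peak load on each controller.
    controller_capacity: load one controller can sustain on its own.
    """
    count = len(controller_peaks)
    if count < 2:
        raise ValueError("need at least two controllers")
    total_peak = sum(controller_peaks)
    surviving_capacity = controller_capacity * (count - 1)
    return total_peak <= surviving_capacity
```

With two controllers each rated for 100 units, peaks of 40 and 45 leave room for a failover, while peaks of 70 and 60 do not. In an active/passive pair the second controller carries no load, so the check passes whenever the active controller is within its own capacity.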

The Alarm is Ringing but Everyone's Asleep
This is an interesting point...vendors and integrators like ourselves put effort into engineering and deploying monitoring and alerting for systems in client sites. That's great, but if the client doesn't put in place procedural steps that are triggered into action by these tools, all is for nought. There is no point in having a tight RPO and the ability to deliver a quick RTO unless you have the procedural surety to act when issues are identified. EMC's DialHome feature is a good example of removing this dependency, but it's simply not possible (nor do you want it to be possible) for all system or component failures. In short, your recovery time is only as good as the weakest trigger point, and usually that trigger point is simply deciding to act on an error/mis-configuration event.

Practice Failure
Great tip here...I hear clients and prospects talk about their highly redundant environments and their sub-minute failover setups, and I ask whether they have tested them...usually the answer is no. Reminds me of people who love to talk about how much their house has gone up in value...inevitably when they actually want to sell they are a little disappointed. Proof is in the pudding. You must test your failure recovery procedures. VMware's SRM product is an excellent tool for doing this non-disruptively. Clients should regularly test failover of their Tier 1/2 applications to ensure that the 'best laid plans' are also the 'tried and true' method.


