Sunday 27 July 2008

Tech Tip du Jour: Disaster Recovery

Blogging has been light this weekend, as I spent some time with an old (very old, in fact) crony of mine. We swapped a number of war stories, and I had to pass on a couple of amusing ones:

Can you patch us back up? We're in a bit of a hurry!

I just had to laugh. It was that, or cry.

Customer: "Hi, we've had a crash, we're restoring the backup but it's taking too long because we have level 0, level 1, level 2 and eight days of logical logs to restore. We can't wait for the logs any more, can you just patch the system 'up' for us?"
Long-suffering engineer: "Yes, we can, but you are aware that if we do this, your system could be inconsistent?"
C: "Yes, it's fine, where do I sign and can you do it immediately, please?"
LSE: "OK."

Everyone's happy, right? Well, kind of. Six weeks later:

C: "Hi, sorry to bother you, but our data seems to be inconsistent, and we can't figure out why. Is it possible that you could fix this for us?"
LSE: " ... "

We don't have enough disk space, so let's just expire the storage pools quickly.

Customer, proudly walking me through their (not so) massive storage manager setup: "We had a real problem with getting enough space on the Legato servers, but we managed to work around it by expiring the logical logs storage pools after 20 minutes."
Me, edging for the door in a suitably restrained fashion: "And how long did you say it took to restore your level 0? Two hours?"
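
For anyone wondering why I was edging for the door, here is a rough back-of-the-envelope check in Python. It is only a sketch under a simplifying assumption: that you need every logical log written since the last level 0 was taken, and that those logs must still exist when the restore finishes. The function name and the daily level 0 interval are hypothetical; only the 20 minutes and the two hours come from the conversation above.

```python
from datetime import timedelta


def logs_cover_restore(log_retention, backup_interval, restore_duration):
    """Return True if logical logs are kept long enough to roll forward.

    Simplified worst case: the crash happens just before the next level 0,
    so logs going back one full backup interval are needed, and they must
    survive until the physical restore has completed.
    """
    required = backup_interval + restore_duration
    return log_retention >= required


# 20-minute expiry and two-hour restore are from the story;
# the daily level 0 interval is an assumption for illustration.
print(logs_cover_restore(
    log_retention=timedelta(minutes=20),
    backup_interval=timedelta(days=1),
    restore_duration=timedelta(hours=2),
))  # prints False
```

Under these assumptions the logs are long gone before the level 0 has even finished restoring, never mind rolled forward.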

The moral of the above stories is: don't assume that because you're taking a backup, everything in DR-land is cool.

  1. Test your recovery process before you need it in anger.
  2. Get your DR plan vetted by a disinterested expert.
  3. Assume the worst and don't depend on backups alone to get you out of a bad situation -- look at HDR (High-Availability Data Replication) and/or RSS (Remote Standalone Secondary) servers as well.
