Wednesday 30 July 2008

Tech Tip du Jour: The Idiot's Guide to Disaster Recovery

Non-geeks may want to look away now.

Well, now that I've stopped laughing, I guess I'd better get on with more serious issues, like what a DR plan should really look like.

A lot depends on what your availability requirements are, how critical your data is, and mostly, sadly, what your budget looks like. For the sake of the argument, I'm going to assume that your data is critical. If nobody cares about the loss of data, cool! I've never met anyone in that fortunate position, though.

So, anyway, here's "The idiot's guide to disaster recovery."

Hardware

Hardware is definitely getting more reliable ... on average! However, it's a brave man who bets the farm on his hardware never failing.

There are things that you can do with hardware to ensure greater availability, such as RAID disks, server clusters, redundant hardware, etc.

It's still a very brave man who bets the farm on his hardware never failing!

Backups

If your budget is low and your data doesn't need to be available all the time, then a backup is a good, simple solution.

You do need to be aware that restoring a backup is often slower than taking it, and that restoring multiple levels of backups and then replaying hours' worth of logical logs is going to be depressingly slow. For that reason, I always advocate taking only Level-0 (and not Level-1, Level-2) backups if you can get away with it. Secondly, unless you're running a data warehouse that you can rebuild from scratch, ALWAYS back up your logical logs. To a tape. Backing up logical logs to /dev/null is quick, but the restore takes forever!
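To put rough numbers on why extra backup levels and long log replays hurt, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the `RestoreStep` and `total_restore_hours` names, the sizes and the throughput figures are all made up, so plug in measurements from your own test restores.

```python
from dataclasses import dataclass


@dataclass
class RestoreStep:
    name: str                 # e.g. "level-0", "level-1", "logical logs"
    size_gb: float            # data to read back (assumed figure)
    rate_gb_per_hour: float   # restore throughput for this step (assumed figure)


def total_restore_hours(steps):
    """Add up the time for each step; restore steps run one after another."""
    return sum(step.size_gb / step.rate_gb_per_hour for step in steps)


# Hypothetical figures, purely for illustration.
level0_only = [
    RestoreStep("level-0", 500, 100),
    RestoreStep("logical logs", 20, 10),
]
multi_level = [
    RestoreStep("level-0", 500, 100),
    RestoreStep("level-1", 80, 100),
    RestoreStep("level-2", 30, 100),
    RestoreStep("logical logs", 20, 10),
]

if __name__ == "__main__":
    print(f"level-0 only: {total_restore_hours(level0_only):.1f} h")
    print(f"multi-level:  {total_restore_hours(multi_level):.1f} h")
```

Note how, with these made-up rates, the small pile of logical logs still accounts for a large share of the total. That is the part worth measuring on your own system.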

Another option to consider if you are using a SAN and have a massive amount of data is to take "business continuity volume" backups: you have your disk mirrored within the SAN, pause Informix, break the mirror and let Informix continue. You then take a backup of the off-line, broken mirror copy. When you restore the mirroring, the SAN "catches up" the mirror to the online disks behind the scenes and all is well.

Update: Generally you wouldn’t resync until immediately before you wanted to split again. One reason is that for some advanced SANs (and BCV implies EMC Symmetrix, to which this does apply), the syncing software maintains a list of the pages that have changed between the two copies since the split. This potentially allows for exceptionally fast restore: if you had a 1 TB database but a delta of page changes of only 10 GB a day, you could expect to do an external restore in a few seconds.
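The arithmetic behind that example is worth a quick look. This is only a sketch: the `changed_fraction` helper is a made-up name, it treats 1 TB as 1024 GB, and it assumes every changed page is distinct, which makes it an upper bound rather than a prediction.

```python
def changed_fraction(database_gb, daily_delta_gb, days_since_split=1.0):
    """Upper bound on the share of the database that differs from the split copy.

    Assumes every changed page is a different page; in practice the same
    pages are often rewritten, so the real figure may be lower.
    """
    changed_gb = min(database_gb, daily_delta_gb * days_since_split)
    return changed_gb / database_gb


if __name__ == "__main__":
    # The example from the update: 1 TB database, 10 GB of page changes a day.
    share = changed_fraction(database_gb=1024, daily_delta_gb=10)
    print(f"At most {share:.2%} of the pages need to be copied back")
```

In other words, one day after the split, the SAN only has to deal with about one page in a hundred.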

Hat tip to Neil Truby of Ardenta for the technology-specific advice.

HDR

If you can spare the extra cash for a High-availability Data Replication secondary, you are onto a winner: you can still take backups to keep things safe and sound in the worst case (and I will assume that you will always take backups!) but you can also keep your data immediately available. It's very easy to set it up, if a little pernickety about how exactly the primary and the secondary have to match up. HDR works by taking an exact copy of what is happening on the primary and applying it continuously to the secondary via the logical logs. The secondary is also available for read-only work like reporting, but you have to remember that it's also doing all the work that the primary is doing, alongside whatever else you're putting on it.

With 11.5 and the connection manager, you can hide failover pretty much completely from your users as well, so they would probably never notice the database going down. (Developers still have to code for a potential failure, though! The connection manager will not re-submit any failed SQL to the "new" server.)
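Since nothing re-submits failed SQL for you, the application has to do it. Here is a minimal sketch of the idea in Python. The names `ConnectionLost`, `run_with_retry`, `connect` and `work` are all hypothetical and do not come from any Informix driver: swap in whatever exception your driver actually raises and whatever function opens your connection.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for the error your driver raises on failover."""


def run_with_retry(connect, work, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Run work(connection); if the connection is lost, reconnect and resubmit.

    connect: callable returning a fresh connection (assumed to go via the
             connection manager, so it lands on whichever server is primary).
    work:    callable taking that connection and doing one unit of work.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            connection = connect()
            return work(connection)
        except ConnectionLost as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # give the failover a moment to complete
    raise last_error
```

The catch is that `work` has to be safe to run twice: wrap it in a single transaction, or make it idempotent, so that a half-finished first attempt cannot leave duplicates behind.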

RSS

One of the most common observations I've heard about HDR is "we'd really like to have more than one secondary!" Well, with IDS 11 and Remote Standalone Secondaries, you can. Kind of. It's conceptually the same as an HDR secondary but there are differences: it isn't guaranteed to be synchronised at checkpoint time, it uses a different communication protocol and an RSS cannot be promoted directly to a primary. It can, however, be promoted to HDR secondary. And you can have as many RSS nodes as you like!
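The promotion rules are easier to keep straight as a tiny lookup table. This sketch only encodes what is described above (an RSS can become an HDR secondary, but not a primary directly), plus the assumption that an HDR secondary can take over as primary on failover. The `NEXT_ROLE` table and `promotion_path` function are illustrative names, not anything Informix provides.

```python
# Each role maps to the role it can be promoted to next.
NEXT_ROLE = {
    "RSS": "HDR secondary",      # an RSS cannot jump straight to primary
    "HDR secondary": "primary",  # assumed: HDR failover promotes the secondary
}


def promotion_path(start, target="primary"):
    """Return the sequence of roles a node passes through to reach target."""
    path = [start]
    current = start
    while current != target:
        if current not in NEXT_ROLE:
            raise ValueError(f"no promotion path from {start!r} to {target!r}")
        current = NEXT_ROLE[current]
        path.append(current)
    return path


if __name__ == "__main__":
    print(" -> ".join(promotion_path("RSS")))
```

So getting an RSS node to primary is a two-step job, which is worth knowing before you are doing it at 3 a.m.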

RSS is effectively another level of backup. It doesn't do much for availability, but it does offer the ability to do a very far off-site backup or replicate the database instance out over a WAN, where HDR might suffer because of network latency.

SDS

Shared Disk Secondary technology seemed a bit odd in IDS 11.1. After all, what really is the point of being able to point multiple servers at a shared disk? You'd have to be running very slow servers and/or insanely fast disk to justify it.

However, even with IDS 11.1, there is a fantastic upside to SDS: failing over a primary to an SDS node is very quick and painless.

In IDS 11.5, Redirected Writes make SDS a much more compelling option. You can make a clustered server more or less instantly. Setting up a basic 4-node SDS cluster can easily be done in a day, starting from the IDS install point!

SDS gives you a fantastic availability (and performance) option, but does nothing for recovery if you lose the disk.

ER

Enterprise Replication is the ultimate in flexible replication for increased availability and performance. It is entirely flexible in how you choose to deploy it: you can replicate everything in a multi-node update-anywhere model, replicate some columns of some rows of some tables, disseminate out, consolidate in, forests and trees: "the possibilities are endless."

However, the more flexible the topology you choose, the more work you have to put into designing it. Even a simple two-node update-anywhere topology is, relatively speaking, a lot more effort to configure than the comparable "HDR with Redirected Writes" setup. Even so, ER has one crucial advantage over HDR, RSS and SDS: it allows rolling upgrades, because the versions of Informix on different nodes do not have to be absolutely identical.

Your DR plan

So, what should your DR plan look like?

  • Well, you should ALWAYS take backups.
  • RAID1+0 is the next step.
  • If you can afford it, go for HDR.
  • If you can afford even more, go for RSS.
  • If availability is critical, go for SDS.
  • If you need rolling upgrades and 24x7 uptime, you're going to need an ER update-anywhere stack in there somewhere.
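If you like your checklists executable, the list above boils down to something like the sketch below. The `recommend` function and its 0-to-3 `budget_level` scale are invented for this example; the only thing taken from the list is the order of the measures.

```python
def recommend(budget_level, availability_critical=False, rolling_upgrades=False):
    """Return the DR measures to consider, in the order of the list above.

    budget_level is a made-up scale:
    0 = backups only, 1 = add RAID, 2 = add HDR, 3 = add RSS.
    """
    plan = ["backups"]  # always, whatever the budget
    if budget_level >= 1:
        plan.append("RAID 1+0")
    if budget_level >= 2:
        plan.append("HDR")
    if budget_level >= 3:
        plan.append("RSS")
    if availability_critical:
        plan.append("SDS")
    if rolling_upgrades:
        plan.append("ER update-anywhere")
    return plan


if __name__ == "__main__":
    print(recommend(budget_level=2, availability_critical=True))
```

It is deliberately crude, since real decisions depend on far more than a single budget number, but it does make the point that backups are never optional.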

In a truly resilient environment, you'd probably have a number of ER update-anywhere nodes, possibly quite widely geographically dispersed. Each one of those nodes would be an SDS cluster pointing at a high-speed SAN and also an HDR cluster, with the HDR secondary pointed at a different SAN or RAID. Each of the ER nodes would also have at least one RSS server backing it up in yet another distant "bunker". And a lot of money.

1 comment:

Unknown said...

One thing to remember. No matter how much hardware you throw at a DR scenario, if the engine itself throws a wobbly you can be left trying to explain hours of down-time to management, and why all the money they have just spent couldn't help !! :-((