In the mid-2000s, our organization started to get serious about disaster recovery. By that time our core application was a heavily used e-learning application (a hundred thousand students on a typical day), and it had become critical to our mission.
To bootstrap a DR capability we paid consultants for what was at best a craptastic DR plan. The plan was not implementable under any realistic scenario.
The consultants ignored our total lack of a DR site, insisted that we could buy servers overnight, and claimed that because every server had its own tape drive, we could hire an army of techs from Geek Squad and recover all servers simultaneously from individual tape backups. Of course we had no failover site and no hardware, and instead of individual tape drives in each server we had tape-changers and a Legato infrastructure that streamed and interleaved multiple backups onto a single tape. I couldn’t imagine buying dozens of servers and successfully recovering in any reasonable time frame. The consultants formally presented a 56-hour RTO to our Leadership, when my own Gantt charts showed a 3-week RTO, and that was counted from the point where we already had a DR site leased, a data center network built, and hardware purchased and racked. So I pushed back hard – and stopped getting invited to the meetings.
They used nice fonts though. Give them credit for that.
After seeing where the consultants were taking us, I pushed our organization toward full hardware and application redundancy and full data center failover capability for all data center hosted systems. My goal was to have two fully functional data centers, identically configured, with identical hardware, full redundancy at the failover site, and near real-time data replication between them, all matched to realistic and achievable RPO and RTO. My rationale was that an organization as small and under-resourced as ours would not be able to build, maintain, and routinely test a disaster recovery site that was not already built, running, and replicated; and that the failover hardware would be usable as pre-production, staging or some other purpose.
The way we accomplished this was to tackle the longest lead-time constraints first, starting with space. We learned that our partners at the State of Minnesota had several thousand square feet of data center space sitting empty – as they had just consolidated down to smaller mainframes. I offered to lease that space, and then worked with their electricians to preposition the correct power under the floor, having them build out PDUs and pigtails for the servers and storage that we’d parachute in if we had a disaster. That took care of the longest lead-time items – space and power. We then built out a data center network – stubbed out at first, but eventually fully configured and routed to the backbone.
We then invested heavily in failover hardware.
The ‘full failover’ strategy meant that if a back end database required ‘N’ CPUs in production, we had to purchase and maintain ‘2N’ in each of primary and secondary data centers, and in most cases a fraction of N in one or more QA and development instances. The QA and development instances were configured behind a fully redundant network stack that was used by the network team to QA network, firewall and load balancer technologies.
As we cycled through normal hardware replacement and rotation, we first filled out the failover data center with one-generation-old hardware, figuring that half a loaf was better than none. Later we started buying and configuring identical hardware in both data centers – ideally upgrading failover first, so we were never in a spot where failover was behind production.
Hardware vendors loved us.
I felt strongly that if we didn’t use the failover environment regularly, it would fall behind production and become unusable for failover – primarily because of configuration rot. This meant that wherever possible we needed to automate the configuration of devices and systems. It simply is not possible to ensure that two systems are identical if they are configured by hand. In other words, you must have Structured System Management – scripts, not clicks. For UNIX systems this was fairly straightforward. For Windows, the options were few and painful. On the Windows systems we had far more clicks than scripts.
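To make the configuration-rot point concrete, here is a minimal sketch of the kind of check that scripted configuration makes possible. It is an illustration only, not the tooling we actually used: the function name `config_drift` and every setting name and value below are hypothetical, and it assumes each system's configuration has already been exported as simple key/value pairs.

```python
from typing import Dict, List, Tuple

MISSING = "<missing>"  # placeholder for a setting present on only one side


def config_drift(primary: Dict[str, str],
                 failover: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (key, primary_value, failover_value) for each setting that differs."""
    drift = []
    # Walk the union of keys so settings missing on either side are reported too.
    for key in sorted(set(primary) | set(failover)):
        p = primary.get(key, MISSING)
        f = failover.get(key, MISSING)
        if p != f:
            drift.append((key, p, f))
    return drift


# Hypothetical exported settings for a production server and its failover twin.
primary = {
    "os_patch_level": "2009-06",
    "ntp_server": "ntp1.example.org",
    "max_connections": "500",
}
failover = {
    "os_patch_level": "2009-03",
    "ntp_server": "ntp1.example.org",
}

for key, p, f in config_drift(primary, failover):
    print(f"{key}: primary={p} failover={f}")
```

A check like this only works if the configuration exists as data that a script can read, which is exactly what click-configured systems do not give you.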
We achieved usable DR capability for our primary e-learning application in 2006, full capability for that application in about 2009, and for our ERP some years later. The team that ran the e-learning environment conducted a run-from-failover exercise annually, so we were assured that we could meet our published RPO and RTO for that application and its supporting technology.
Selling Disaster Recovery is hard. Most teams did not buy into the ‘Failover is a first class citizen’ mantra that I’d been preaching. For example, even though we had identical failover hardware for the ERP, the ERP team did not maintain failover in a fully configured state – often not even acknowledging the existence of the failover servers – and hence was not capable of conducting a failover within a reasonable RTO.
We did, however – after a 6-month reconfiguration and testing effort – fail the ERP over to a new data center, so we knew it was possible. That effort required reverse-engineering an app (one that we had written ourselves) well enough to understand exactly how it was configured. We were then able to re-configure both production and failover identically, and successfully fail over the application. The team that ran the app didn’t think it could be done. My team proved them wrong.