Anatomy of Disaster Recovery
Jersey City, November, 2001. It could have been yesterday. I was working for Credit Suisse. "You're a DBA. A database system is down. Go fix it," said my boss. On site, I replaced the failed drives – there were two – reconfigured the systems, reloaded from backups, tested and verified functionality, documented my findings, and headed home.
The immediate cause of the problem was the failure of two (2) drives in a RAID system. RAID, "Redundant Array of Inexpensive Disks," provides fault tolerance via redundancy. Looking at it simplistically, RAID 5 essentially uses one drive out of three (3) or more to store parity information about the data stored on the other drives. A RAID 5 system can tolerate the failure of any one drive; when a second drive fails, you have to restore from backup. (This is a very simplistic description. A more detailed one can be found at "The Geek Stuff," here: https://www.thegeekstuff.com/2010/08/raid-levels-tutorial.)
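To make the parity idea concrete, here is a toy sketch – not how a real RAID controller is implemented, just the underlying XOR arithmetic. With one parity block per stripe, any single lost block can be rebuilt from the survivors; lose a second block and nothing can be rebuilt.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three "data drives" holding one stripe each; a fourth drive holds parity.
data = [b"ACCOUNTS", b"LEDGER01", b"INVOICES"]
parity = xor_blocks(data)

# Lose any single drive: XOR the survivors with the parity block
# and the missing stripe comes back.
lost = 1
survivors = [d for i, d in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)  # b'LEDGER01'

# Lose a second drive and you have one XOR equation with two unknowns --
# nothing can be rebuilt, which is why the second failure forces a
# restore from backup.
```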
It is as if you’re driving down the road and you get a flat tire. You mount the spare, and then you continue on your way. If you get a second flat, then you have to find a new tire.
For reasons that were never explained to me, this was an unmonitored production system. When the first hard drive failed, nobody noticed – the system kept going. They noticed the second: the system crashed to a halt.
I had been involved in Disaster Recovery planning since March, 1996, when I joined the Professional Services team at CommVault Systems. My focus, and CommVault’s, was backup and recovery, specifically automated and verifiable backups. And disaster recovery, of course, begins with good backups.
Less than two years later, a law firm had the same problem as the financial company. After the Northeast Blackout of August, 2003, two (2) hard drives failed at essentially the same time. A local system administrator installed new drives, reloaded Windows and MS SQL Server, and called me to restore the database from backups.
Lightning struck again, figuratively speaking, in September, 2010, when two (2) old hard drives failed in rapid succession. The first probably failed of old age; our hypothesis is that the second failed under all the disk activity required to rebuild the RAID array. To return to the car tire analogy: suppose you're driving down a bumpy road on old, bald tires. If you get one flat, there's a high probability of a second. The key lessons are to replace all drives in a RAID array long before they are likely to fail, and to maintain good backups.
The process was the same:
- Identify the failed hard drives. This is easy – read the logs or look for blinking red lights.
- Install good hard drives.
- Reload the operating system.
- Reload the applications.
- Restore the data from backup.
- Test and verify, then figure out what needs to be done differently to prevent events like this from happening again.
Fortunately, we had spare parts on hand. Finding, buying, and shipping spare parts would have added, at best, 24 hours to recovery times.
The financial company understood that it needed to monitor ALL production systems. The law firm's technology team knew it needed good backups and spare hardware, but it was unable to convince management of the need to upgrade systems in a more timely manner.
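Monitoring doesn't have to be elaborate. As a hypothetical illustration – assuming Linux software RAID, where the kernel exposes array health in `/proc/mdstat`; the hardware-RAID systems above would use vendor tools instead – a few lines of Python can flag a degraded array before the second drive dies:

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays running with a failed member.

    In /proc/mdstat, a healthy two-disk mirror shows '[UU]'; an
    underscore, as in '[U_]', marks a dead member.
    """
    failed = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        if current:
            status = re.search(r"\[[U_]+\]", line)
            if status:
                if "_" in status.group(0):
                    failed.append(current)
                current = None
    return failed

# Example /proc/mdstat contents: md0 has lost one of three RAID 5 members.
sample = (
    "md0 : active raid5 sdc1[2] sdb1[1] sda1[0]\n"
    "      1953262592 blocks level 5 [3/2] [U_U]\n"
    "md1 : active raid1 sdb2[1] sda2[0]\n"
    "      976630336 blocks [2/2] [UU]\n"
)
print(degraded_arrays(sample))  # ['md0'] -- page someone NOW
```

Run from cron or a monitoring agent, a check like this turns the first drive failure into an alert instead of a surprise.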
How can virtualization (whether in your own data center, Microsoft’s Azure Cloud or Amazon Web Services) speed up disaster recovery and add additional layers of fault tolerance?
When Hurricane Sandy hit, I was managing Information Technology for a law firm based in lower Manhattan. Our building was closed, without power, for one week. The Disaster Recovery Plan in place before the storm allowed us to emerge with essentially no damage to the information technology infrastructure.
However, information technology exists to facilitate the work that brings in revenue. While the building was without power, the firm's data center was unavailable for an entire week, and the attorneys and staff could not work. Even attorneys and staff in satellite offices, or those who had power at home and routinely worked remotely, could not access email and various other systems; they could not work.
As described in the table below, the firm lost an estimated $1.4 million. In addition, the firm lost an unknown amount of new business because phone calls and voice mails went unanswered.
COSTS OF HURRICANE SANDY TO ONE LAW FIRM
| Item | Amount |
|---|---|
| Avg. Hourly Rate | $400 |
| Avg. Daily Billable Hours | 8.75 |
| Avg. Daily Billings | $280,000 |
| Days the Firm was Closed | 5 |
| Lost New Business | UNKNOWN |
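The arithmetic behind the estimate is straightforward; the snippet below just multiplies out the table. (The count of 80 billing timekeepers is an inference from the table's numbers, not a figure from the firm.)

```python
hourly_rate = 400            # avg. hourly rate, USD
daily_billable_hours = 8.75  # avg. billable hours per timekeeper per day
daily_billings = 280_000     # firm-wide avg. daily billings, USD
days_closed = 5

# The firm-wide daily figure implies this many billing timekeepers:
implied_timekeepers = daily_billings / (hourly_rate * daily_billable_hours)
print(implied_timekeepers)    # 80.0

lost_billings = daily_billings * days_closed
print(f"${lost_billings:,}")  # $1,400,000
```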
This loss could have been minimized, if not avoided entirely. Beginning with the disaster of 2010, Morris Djavaheri and I had proposed a series of disaster recovery plans built around virtualization. These started at $50,000, with annual operating costs of $5,000, and were rejected by management as expensive and unnecessary.
The law firm had good telecommunication links between its NYC and Long Island offices. All we needed was a $35,000 "virtualization host" in the firm's Long Island offices, plus $10,000 for software and $5,000 for services, to configure virtual copies of the NYC-based database servers.
Had the primary data center become unavailable – as it did after Hurricane Sandy – we would have been able to shift operations to the backup data center.
The firm lost an estimated $1.4 million in one week because it wouldn’t spend $50,000.
Virtualization Would Have Enabled Recovery in Minutes, not Hours or Days
The system we presented to the law firm required that the firm purchase hardware for its remote disaster recovery site. Today, Microsoft's Azure* (azure.microsoft.com) and Amazon Web Services* (aws.amazon.com) provide fault tolerance and disaster recovery without the need for customers to own and maintain computers.
Virtualization, in and of itself, allows us to focus on the application or the information system, not the server – thus the term "Serverless Computing." We can think about email, not email servers; about accounting, document management, or medical imaging systems, not the accounting software, document management software, and imaging software running on database servers and associated file servers.
Effective use of virtualization on reliable and monitored hardware would have prevented the hardware-related incidents of 2001, 2003, and 2010 and facilitated recovery in the other incidents.
Azure Site Recovery and real-time backups allow rapid recovery to points in time. They would have minimized the time that voice mail and email were unavailable in the incidents of 2009 and 2011, and would have facilitated recovery and minimized data loss in the ransomware incidents of 2016 and 2017. Azure Availability Zones ensure that systems in one location can be replicated to alternate locations. This would not have prevented the terrorist attack of Sept. 11 or the force majeure of Hurricane Sandy in 2012 and the subsequent loss of access to various buildings in lower Manhattan. However, Azure Availability Zones would have allowed continued access to information systems in the face of these kinds of events.
Links for Azure & AWS
| Resource | URL |
|---|---|
| Amazon Web Services | https://aws.amazon.com |
| Serverless on Azure | https://azure.microsoft.com/en-us/overview/serverless-computing/ |
| Serverless on AWS | https://aws.amazon.com/lambda/ |
| Azure Site Recovery | https://azure.microsoft.com/en-us/services/site-recovery/ |
| Azure Availability Zones | https://docs.microsoft.com/en-us/azure/availability-zones/az-overview |
Microsoft Azure, Azure Availability Zones, Azure Site Recovery, and other terms are trademarks of Microsoft. AWS is a trademark of Amazon. CommVault is a trademark of CommVault Systems, Inc.