By Planning Ahead, Businesses Can Survive Disasters

In 2010, we presented a plan to a law firm to expand its Backup and Disaster Recovery system (DR) into a Business Continuity system (BC). Disaster Recovery enables a business to resume operations after a disaster concludes; Business Continuity allows the business to operate during and after the disaster.

The DR system was built around preserving data by backing up to tape and restoring data to repaired or replaced computers. The risk is the time that might be needed to obtain replacement parts or replacement computers. The BC system would copy production data to a second location daily. In an emergency, the second location would become primary. To keep costs down, one offsite computer would contain the information processing capability of the 12 computers at the firm’s headquarters. 

At the time, we budgeted $80,000 to implement and $6,400 per year to test and maintain the system. Unfortunately, the managing partner declined to fund the project. Two years later, when Hurricane Sandy hit, the firm was closed for one week. Had he funded the project, the attorneys who lived in areas with power and internet access would have been able to work remotely. But because the firm only had a Disaster Recovery system and not a Business Continuity plan, they were unable to work until power came back to their lower Manhattan offices. We estimate that the firm lost $390,000 to $1.1 million in revenue in the disaster. 

As illustrated below, the costs of the system would have been negligible compared to the billings if some or all attorneys had been able to work remotely. 

A year and a half later, in June of 2014, the firm closed its doors. The details are below, in Risk Mis-Management at a Law Firm.

We proposed a similar plan to an auto parts company with offices in New York and Connecticut. We implemented a DR & BC plan that replicates critical systems and data to the Microsoft Azure Cloud. The details are below in A Home Run in Auto Parts.

Risk Mis-Management at a Law Firm 

The original Backup and DR plan at the law firm was implemented in 2006. The plan was tested in 2009, when we had to re-establish email after an incident that corrupted the email system. It took three (3) days to recover, including one (1) day lost to obtaining a computer from a facility on Long Island. This illustrates that the best-case scenario for obtaining replacement servers, hard drives, or other parts is 24 hours, meaning a critical system would be unavailable for an additional 24 hours.

After the e-mail incident, we proposed using offsite backup technology to replicate the production systems, located in lower Manhattan, to the firm’s satellite office in Mineola, NY. The key benefit would be the ability to immediately transfer Information Technology operations to Mineola, if a disaster hit New York City. People would not have to go to Mineola to work. They would be able to work from home, while accessing email, the accounting system, and other critical data.  Unfortunately, the firm’s managing partner did not see the need for this backup safety system. 

When Hurricane Sandy hit in October, 2012, the building housing the firm’s main offices and data center lost power for a week. All 65 attorneys, including those who worked from home or in remote offices, were unable to work. 

As is detailed in Table 1, below, if all 65 attorneys had been able to work from home, the firm could have billed $1.056 million during that week, over 10 times the cost of the BC system. 

Cost of Business Continuity versus the Cost of Disaster
– All Attorneys Working Remotely

BC with DR | Cost | Disaster | Value
Upfront (2010) | $80,000 | Hourly Rate | $325
Testing in 2011 | $6,400 | Hours | 50
Testing in 2012 | $6,400 | Attorneys | 65
Total Cost | $92,800 | Total | $1,056,250

Table 1: Cost of BC v Cost of Disaster, All Attorneys Working Remotely

Even if only the 24 attorneys who routinely worked from Mineola and other remote offices had been able to work, the firm could have billed $390,000 for the week, over four times the cost of the system.

Business Continuity versus Cost of Disaster
– With 24 Attorneys Working Remotely

BC with DR | Cost | Disaster | Value
Total Cost | $92,800 | Total | $390,000

Table 2: Cost of BC v Cost of Disaster, Some Attorneys Working Remotely
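
The arithmetic behind both scenarios can be sketched in a few lines of Python. This is only an illustration: the function and variable names are ours, and the inputs are the figures quoted above ($325 per hour, 50 billable hours in the week, and 65 or 24 attorneys).

```python
def lost_billings(attorneys: int, hourly_rate: int, hours_per_week: int) -> int:
    """Revenue that goes unbilled when these attorneys cannot work for one week."""
    return attorneys * hourly_rate * hours_per_week


# Upfront cost in 2010 plus one round of testing in 2011 and in 2012.
BC_TOTAL_COST = 80_000 + 6_400 + 6_400

all_remote = lost_billings(attorneys=65, hourly_rate=325, hours_per_week=50)
some_remote = lost_billings(attorneys=24, hourly_rate=325, hours_per_week=50)

for label, loss in (("All 65 attorneys", all_remote), ("24 remote attorneys", some_remote)):
    ratio = loss / BC_TOTAL_COST
    print(f"{label}: ${loss:,} unbilled, {ratio:.1f} times the cost of the BC system")
```

Changing the hourly rate, the hours, or the head count shows how quickly a week of lost billings overtakes the cost of the system.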

While we don’t know the extent, if any, to which this incident influenced their thinking, in June, 2014, a year and a half after the incident, the managing partners sold the firm and closed its doors. Our final project for the firm was migrating the firm’s servers to “The Cloud” in order to allow the equity partners and their accountants to close the books properly.

Virtualization

Computer Virtualization installs the programs, storage, and processing capabilities of several computers in one powerful computer. This one computer is referred to as the “Virtualization Host” or “Host.” The individual computers in the “Host” are referred to as “Virtual Computers,” “Virtual Machines,” or “VMs.” All of the VMs in a single Host can run at the same time. Each VM has its own processor, memory and disk resources, and the resources of one VM are not accessible from the other VMs. Common virtualization software environments include VMware and Microsoft Hyper-V.

Without virtualization, when designing Business Continuity systems we need one spare computer for every business-critical computer. We would need one accounting server for every accounting server, one email server for every email server, and so on. With 12, 15, or 100 servers in our production environment we would need 12, 15, or 100 servers in our DR facility. In addition, we need duplicate copies of all software. We also need to duplicate our networking, power and air conditioning systems to support this environment. Then we need to copy, or replicate, the data from the production environment to the BC environment, and copy all changes on a daily basis. 

In addition to paying the full price of these systems, we only plan on using them three or four days per year, when we test the Business Continuity Plan. And, of course, we need to perform other standard maintenance. As with other production systems, the best practice is to replace the systems every four (4) years.

With Virtualization, we could replicate the data processing capabilities of five or 10 individual computers in one Virtualization host. However, with virtualization in a company’s offices, we still need to invest in local, on-premises computers, software, power, air conditioning and maintenance. This may be why the Managing Partner at the law firm decided not to spend the money on the BC system. We presented it as an investment in a fail-safe plan to manage risk; he may have seen it as wasting money on a system that he would never use.

The 2010 Plan

The 2010 plan was to duplicate the systems and data located in the data center in lower Manhattan in a “Virtualization Host” located in the firm’s offices in Mineola, NY. The systems in the data center are illustrated in Figure 1, below. The “Virtualization Host” is illustrated in Figure 2.

Fig 1: Typical Server Rack
Fig 2: Typical Virtualization Host

Enter the Cloud

Today, we can also replace actual, physical computers with virtual machines in “The Cloud.” These Virtual Machines can be used for backup storage, disaster recovery and business continuity, or daily operations. If we are building a so-called “Private Cloud” in our own facility, then we need to buy or lease the computers and storage systems. If we are using Microsoft Azure, Amazon Web Services (AWS), or another “Public Cloud” under the model they call “Infrastructure as a Service,” or “IaaS,” then Microsoft, Amazon, or the other service provider pays for the hardware, software, power, air conditioning and other costs. We effectively rent data storage space and processing power when we need it.

Cost Comparison of DR & BC Options

Item | Premises Based* | Cloud Based
Upfront Costs | $80,000 | $10,000
Annual Testing | $4,000 | $4,000
Annual Maintenance | $2,400 | –
Annual Data Storage | – | $1,200
Annual Costs | $6,400 | $5,200
Costs over 4 years | $105,600 | $30,800

* Estimates from 2010.
Table 3: Long Term Cost of Local DR versus DR in Public Cloud

Both upfront costs and annual costs are significantly lower with cloud-based business continuity than they would be with premises-based systems. In addition, the upfront costs for a premises-based system must be incurred every four (4) years when hardware is replaced. The upfront costs for a cloud-based system are not repeated because there are no significant hardware systems that need to be replaced. The costs of Amazon Web Services (AWS) should be in line with the costs of Microsoft Azure. Please note that all environments are different. These estimates are based on the infrastructure that would support a small to mid-sized law firm or accounting firm.
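
As a rough sketch, the four-year totals in Table 3 follow from a single formula: upfront cost plus annual cost times the number of years. The function name and structure below are ours, for illustration only; the dollar figures are the estimates from the table.

```python
def cost_over_years(upfront: int, annual: int, years: int) -> int:
    """Total cost of ownership: one upfront payment plus recurring annual costs."""
    return upfront + annual * years


# Annual costs are testing plus maintenance (premises) or testing plus storage (cloud).
premises_annual = 4_000 + 2_400
cloud_annual = 4_000 + 1_200

premises_4yr = cost_over_years(upfront=80_000, annual=premises_annual, years=4)
cloud_4yr = cost_over_years(upfront=10_000, annual=cloud_annual, years=4)

print(f"Premises based, 4 years: ${premises_4yr:,}")
print(f"Cloud based, 4 years:    ${cloud_4yr:,}")
```

Your own upfront and annual estimates can be substituted for ours; as noted above, all environments are different.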

Cost Comparison of DR & BC Options over Time

Item | Premises Based* | Cloud Based
Costs over first 4 years | $105,600 | $30,800
Costs over next 4 years | $105,600 | $25,600
Total over 8 years | $211,200 | $56,400
Annual Cost | $26,400 | $7,050

* Estimates from 2010.
Table 4: Long Term Cost of Local DR versus DR in Public Cloud

A Home Run in Auto Parts

We also developed a Disaster Recovery & Business Continuity system for an auto body parts company. The company has a data center warehoused in the New York Metro area. Our goal was to create a disaster recovery and business continuity plan that would allow the company to continue to operate even if its primary location was not accessible.

Both locations, New York and Connecticut, are connected to the Internet using local cable providers and to each other using a Virtual Private Network. (VPNs use encryption to simulate a private line.) New York has a virtualization host running Microsoft Hyper-V with a local iSCSI storage array. This is illustrated in Schematic 1, below.

Schematic 1: Network Topology

The original plan was to replicate the systems in the New York data center to Connecticut.

Strike 1: Hyper-V Replication

The systems are virtualized using Microsoft Hyper-V software. We attempted Hyper-V Replication, but that failed. Because of insufficient bandwidth from the Internet Service Providers (ISPs), data storage needs outstripped replication capacity: data were being generated faster than they could be transmitted. This was “Strike 1.”

Strike 2: iSCSI Replication

As is often the case, the database servers used local disk volumes on Network Attached Storage (NAS) systems. These are configured using the iSCSI protocol. We tried replicating the data on these disk volumes across the networks using technology built into the NAS systems. This also failed, again due to insufficient network bandwidth. Strike 2.

Home Run: Azure Site Recovery Services, Azure SRS, or Azure.

Rather than replicate the New York site to Connecticut, we configured Microsoft Azure Site Recovery Services, Azure SRS, or Azure, to replicate the production data center from New York into the Microsoft Azure cloud. As illustrated in Schematic 2, below, New York connects to Azure via the Internet, and the data are protected via VPN and firewalls.

Schematic 2: Network Topology

We could have used Amazon Web Services (AWS). However, Microsoft’s Azure SRS is a better choice in this case, because the systems are virtualized using Microsoft’s Hyper-V software.

It took us less than 20 hours to configure Azure SRS and prove the technology with successful tests. Thus, a “Home Run.”

What’s Next?

We have tried to present the business case for disaster recovery and business continuity using The Cloud. The next posts will discuss moving infrastructure – computer servers and data storage systems – out of premises-based data centers and into The Cloud using Microsoft Azure, Amazon Web Services, IBM Cloud, and other Cloud service offerings.

The team at ana’s cloud is available for long term and short term projects relating to disaster recovery planning, cloud migration, software development and other information technology consulting.

Appendices

Appendix 1: Project Plan: Cloud Based Disaster Recovery and Business Continuity

  1. Identify the systems that must be protected and available during a disaster.
  2. Replicate those systems to the Cloud.
  3. Verify data integrity: the data must be complete and correct.
  4. Add Data to the primary site and verify that it replicates to the DR site on schedule.
  5. Test “Fail Over” from Primary to Replica.
  6. Test Return to the Primary site. This is called “Fail Back.”
  7. Document the Fail Over and Return processes.
  8. Test Fail Over and Return on a quarterly basis.
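
Step 3 can be approached in several ways. One minimal sketch, assuming the primary data and a test copy of the replica are both reachable as ordinary folders (an assumption on our part; it is not a feature of any particular replication product), is to compare checksums file by file. The function names and folder layout here are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MB chunks so large files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(primary: Path, replica: Path) -> dict[str, list[str]]:
    """Report files missing from the replica (complete?) and files that differ (correct?)."""
    missing, different = [], []
    for source in sorted(p for p in primary.rglob("*") if p.is_file()):
        relative = source.relative_to(primary)
        target = replica / relative
        if not target.is_file():
            missing.append(relative.as_posix())
        elif sha256_of(source) != sha256_of(target):
            different.append(relative.as_posix())
    return {"missing": missing, "different": different}
```

An empty report for both lists suggests the replica is complete and correct as of the last replication. Files changed on the primary since then will naturally show up as different, so a check like this is best run right after a replication cycle.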

Appendix 2: Technical Notes regarding Azure SRS Replication Methods

Azure Site Recovery Services offers different replication methods for three (3) sources of data: Physical Servers, Hyper-V Clients, and VMware Clients.

We set up replication to Azure based on the storage type of each server. As is often the case, for better performance, the SQL and ERP database servers were configured to use direct drive access. Replication on these systems requires installation and configuration of Azure SRS agent software. They are configured as if they were standalone computers, or “physical” servers, rather than “virtual” servers. The other servers use Virtual Hard Disk, or VHD, storage. For these, Hyper-V and Azure SRS connect seamlessly without the need for additional software.

Roles, Storage Type & Replication Method

Server Role | Storage Type | Replication Method
Domain Controller | Hyper-V with VHD Storage | Hyper-V Replication, Azure SRS
Database Server, MS SQL | Hyper-V with Direct Drive Access | Physical Server, Azure SRS Agent
Web Server | Hyper-V with VHD Storage | Hyper-V Replication, Azure SRS
File Server | Hyper-V with Large VHD Storage | Hyper-V Replication, Azure SRS
Database Server, ERP | Hyper-V with Direct Drive Access | Physical Server, Azure SRS Agent

Table 5: Server, Storage & Replication
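
The rule behind the table is simple enough to write down as a lookup. The sketch below is purely illustrative: it does not call any Azure API, and the type and function names are ours. It only encodes the decision described above, in which direct drive access means the server is treated as a physical server with the agent, and VHD storage means Hyper-V replication.

```python
from enum import Enum


class StorageType(Enum):
    VHD = "Hyper-V with VHD Storage"
    DIRECT_DRIVE = "Hyper-V with Direct Drive Access"


def replication_method(storage: StorageType) -> str:
    """Pick the replication approach from the table for a given storage type."""
    if storage is StorageType.DIRECT_DRIVE:
        # Direct drive access: configure the VM as if it were a physical server.
        return "Physical Server, Azure SRS Agent"
    return "Hyper-V Replication, Azure SRS"


servers = {
    "Domain Controller": StorageType.VHD,
    "Database Server, MS SQL": StorageType.DIRECT_DRIVE,
    "Web Server": StorageType.VHD,
    "File Server": StorageType.VHD,
    "Database Server, ERP": StorageType.DIRECT_DRIVE,
}

plan = {role: replication_method(storage) for role, storage in servers.items()}
for role, method in plan.items():
    print(f"{role}: {method}")
```

Writing the inventory down this way, even informally, makes it harder to overlook a server whose storage type calls for the agent.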

Appendix 3: Technical Notes regarding Troubleshooting Replication

1. Start with the Domain Controllers (DCs).

  • Always start the failover test with the DC. Failover is impossible if the DC does not replicate correctly or does not boot with all correct roles. Common issues include:
    • SYSVOL is not created. (For more information, see MS Support article 94722.)
    • The DC is not advertising its services. (For more information, see Spiceworks article 2129243, “AD Problems, DC Not Advertising.”)

2. Sufficient local storage for successful replication?

“Sufficient local storage” is a function of the volume of changes during the replication interval. For example, when replication is configured as a nightly process, and there are 250 GB of data to transmit, you need at least 250 GB of space. You should also allow a margin of extra space, because the amount of data to transmit may vary from one interval to the next.
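
As a back-of-the-envelope sketch, the sizing rule is the change volume plus a safety margin. The 25% margin used here is our assumption for illustration, not a vendor recommendation, and the function name is hypothetical.

```python
def required_staging_gb(change_gb_per_interval: float, margin: float = 0.25) -> float:
    """Local space to reserve: changes per replication interval plus a safety margin."""
    return change_gb_per_interval * (1 + margin)


nightly_changes_gb = 250  # the example figure used above
print(f"Reserve at least {required_staging_gb(nightly_changes_gb):.1f} GB")
```

Measure the actual change volume over several intervals before settling on a margin; month-end processing, for example, may generate far more changes than a typical night.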

3. Sufficient Bandwidth to carry the traffic?

There are three (3) ways to resolve issues arising from low bandwidth Internet service such as from cable modems or small business class Internet.

  • 1. Turn off verification during upload.
    • Verification can generate so much log data and network traffic that replication will fail.
  • 2. Force traffic to external connection that is less busy.
    • You may need to set up a second Internet connection for each site, and force certain kinds of traffic, e.g., email, over one connection and the replication over the other.
  • 3. Don’t start initial copy of all the servers simultaneously.
    • Wait until a server has completed its initial replication before starting replication for the next server. Schedule daily replication to start after backups have completed.
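
Before trying any of these, it helps to estimate whether the link can carry the traffic at all. The sketch below is a rough estimate only. The uplink speeds, the eight-hour window, and the 70% efficiency factor (an allowance for protocol overhead and other traffic) are all assumptions for illustration; the 250 GB figure is the example used earlier.

```python
def transfer_hours(data_gb: float, uplink_mbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to send data_gb over an uplink rated at uplink_mbps megabits per second."""
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 1,000 MB = 8,000 megabits
    seconds = megabits / (uplink_mbps * efficiency)
    return seconds / 3600


def fits_window(data_gb: float, uplink_mbps: float, window_hours: float,
                efficiency: float = 0.7) -> bool:
    """True if one interval's changes can be sent before the next interval begins."""
    return transfer_hours(data_gb, uplink_mbps, efficiency) <= window_hours


for uplink in (20, 100):
    hours = transfer_hours(250, uplink)
    verdict = "fits" if fits_window(250, uplink, window_hours=8) else "does not fit"
    print(f"250 GB over {uplink} Mbps: {hours:.1f} hours, {verdict} in an 8-hour window")
```

If the estimate does not fit the window, data are being generated faster than they can be transmitted, and no amount of tuning on the replication side will fix that without more bandwidth or less data.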

4. VMware or Linux?

MS Azure supports VMware and virtual machines running the Linux operating system. VMware Admin credentials are required for successful installation and execution of the Azure SRS software. These include the Domain Administrator and VMware Linux server login IDs and passwords.

5. Hyper-V Virtual Machines using Direct Access to local Drives?

As noted above, for reasons having to do with performance, databases are often set up with direct access to local or iSCSI disk resources. Azure SRS can’t sync VM clients with local drives using Hyper-V. You must set up each such VM as a local physical server with the Azure SRS agent.


Morris Djavaheri, CEO of Ana’s Cloud, has over 25 years of experience in the financial industry and legal community. He has planned and executed virtualization, Disaster Recovery, Business Continuity, and agile software development projects. Morris can be reached at MorrisD@anascloud.com 

Lawrence Furman, MBA, PMP, currently a Project Manager at the U. S. Dept. of Veterans Affairs, has over 25 years of experience in the financial industry and the public sector. He has worked on backup and disaster recovery and infrastructure projects since 1996. Larry can be reached at LarryF@anascloud.com.

Disaster Recovery Without Virtualization or the Cloud

Anatomy of Disaster Recovery

Jersey City, November, 2001. It could have been yesterday. I was working for Credit Suisse. “You’re a DBA. A database system is down. Go fix it,” said my boss. On site, I replaced the failed drives – there were two – reconfigured the systems, reloaded from backups, tested and verified functionality, documented my findings, and headed home.

If you fail to plan, you are planning to fail – Ben Franklin

The immediate cause of the problem was the failure of two (2) drives in a “RAID” system. RAID, “Redundant Array of Inexpensive Disks,” provides fault tolerance via redundancy. Looking at it simplistically, RAID 5 uses the equivalent of one drive out of three (3) or more to store parity information about the data stored on the other drives. A RAID 5 system can tolerate the failure of any one drive. When a second drive fails, you have to restore from backup. (This is a very simplistic description. A more detailed description can be found at “The Geek Stuff”, here: https://www.thegeekstuff.com/2010/08/raid-levels-tutorial.)
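
For readers who want to see why one failure is survivable and two are not, here is a toy sketch of the parity idea. It is a deliberate simplification: real RAID 5 controllers rotate parity across all the drives and work on large stripes, and the three “drives” and their contents below are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "drives" and one parity block computed from them.
data = [b"ACCT", b"MAIL", b"DOCS"]
parity = xor_blocks(data)

# One drive fails: XOR of the survivors and the parity rebuilds the lost block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

# Two drives fail: the survivors only yield the XOR of the two lost blocks,
# which is not enough to recover either one. Time to restore from backup.
leftover = xor_blocks([data[0], parity])
print(leftover == xor_blocks([data[1], data[2]]))
```

The second case is the situation described in this article: with two drives gone, the array no longer holds enough information to reconstruct the data.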

five (5) drives in a RAID device

It is as if you’re driving down the road and you get a flat tire. You mount the spare, and then you continue on your way. If you get a second flat, then you have to find a new tire.

For reasons that were never explained to me, this was an unmonitored production system. When the first hard drive failed, the system kept going and no one noticed. They noticed the second failure: the system crashed to a halt.

I had been involved in Disaster Recovery planning since March, 1996, when I joined the Professional Services team at CommVault Systems. My focus, and CommVault’s, was backup and recovery, specifically automated and verifiable backups. And disaster recovery, of course, begins with good backups.

A few months later a law firm had the same problem as the financial company. After the Blackout of August, 2002, two (2) hard drives failed at essentially the same time. A local system administrator installed new drives, reloaded Windows and MS SQL Server, and called me to restore the database from backups.

Lightning struck again, figuratively speaking, in September, 2010, when two (2) old hard drives failed in rapid succession. Both probably failed of old age. Our hypothesis is that the second probably failed due to all the disk activity required to rebuild the RAID array. To return to the car tire analogy, suppose you’re driving a bumpy road, on old bald tires. If you get one flat, there’s a high probability of a second flat. The key lessons learned are to replace all drives in a RAID array long before they are likely to fail and maintain good backups.

The process was the same:

  1. Identify the failed hard drives. This is easy – read the logs or look for blinking red lights.
  2. Install good hard drives.
  3. Reload the operating system.
  4. Reload the applications.
  5. Restore the data from backup.
  6. Test, verify, and figure out what needs to be done differently moving forward to prevent events like this from happening again.

Fortunately, we had spare parts on hand. Finding, buying and shipping spare parts would have added, at best, 24 hours to recovery times.

LESSONS LEARNED

The financial company understood that it needed to monitor ALL production systems. The law firm technology team knew it needed good backups and spare hardware, but it was unable to convince management of the need to upgrade systems in a more timely manner.

How can virtualization (whether in your own data center, Microsoft’s Azure Cloud or Amazon Web Services) speed up disaster recovery and add additional layers of fault tolerance?

Water St, NYC, looking north to the Brooklyn Bridge

When Hurricane Sandy hit I was managing Information Technology for a law firm based in lower Manhattan. Our building was closed, without power, for one week. The Disaster Recovery Plan in place before Hurricane Sandy allowed us to emerge from the storm with essentially no damage to the information technology infrastructure.

However, information technology exists to facilitate the work that brings in revenue. The firm’s data center was unavailable for the entire week that the building was without power, and therefore the attorneys and staff could not work. Even attorneys and staff in satellite offices, or who had power at home and worked from home when they needed to, could not access email and various other systems; they could not work.

As described in the table below, the firm lost an estimated $1.4 million. In addition, the firm lost an unknown amount of new business because phone calls and voice mails went unanswered.

COSTS OF HURRICANE SANDY TO ONE LAW FIRM

Lawyers | 80
Avg. Hourly Rate | $400
Avg. Daily Billable Hours | 8.75
Avg. Daily Billings | $280,000
Days the Firm was Closed | 5
Lost Revenue | $1,400,000
Lost New Business | UNKNOWN

This loss could have been minimized, if not avoided entirely. Beginning with the disaster of 2010, Morris Djavaheri and I had proposed a series of disaster recovery plans built around virtualization. These started at $50,000 with annual operations costs of $5,000. They were rejected by management as expensive and unnecessary.

The law firm had good telecommunication links between its NYC and Long Island offices. All we needed was to add a $35,000 “Virtualization Host” in the firm’s Long Island offices, spend $10,000 on software and $5,000 on services, and configure virtual copies of the NYC based database servers.

In the event that the primary data center became unavailable, as it did after Hurricane Sandy, we would have been able to shift operations to the backup data center.

The firm lost an estimated $1.4 million in one week because it wouldn’t spend $50,000.

Virtualization Would Have Enabled Recovery in Minutes, not Hours or Days

The system we presented to the law firm required that the firm purchase hardware for its remote disaster recovery site. Today Microsoft’s Azure*[1] (azure.microsoft.com) and Amazon Web Services* (aws.amazon.com) provide fault tolerance and disaster recovery without the need for customers to own and maintain computers.

Virtualization, in and of itself, allows us to focus on the application or the information system, not the server. Thus we now use the term “Serverless Computing.” We can think about email without thinking about email servers. We can think about accounting, document management, or medical imaging systems without thinking about the accounting software, document management software, or imaging software running on database servers and associated file servers.

Effective use of virtualization on reliable and monitored hardware would have prevented the hardware related incidents of 2001, 2002, and 2010 and facilitated recovery in the other incidents.

Azure Site Recovery and Real Time Backups allow rapid recovery to points in time. They would have minimized the time that voice mail or email were unavailable in the incidents of 2009 and 2011, and would have facilitated recovery and minimized the loss of data in the ransomware incidents of 2016 and 2017. Azure Availability Zones ensure that systems in one location can be replicated to alternate locations. This would not have prevented the terrorist attack of Sept. 11 or the force majeure of Hurricane Sandy in 2012 and the subsequent loss of access to various buildings in lower Manhattan. However, Azure Availability Zones would allow continued access to information systems in the face of these kinds of events.

Links for Azure, AWS, & ana’s cloud

Microsoft Azure https://azure.microsoft.com/en-us/
Amazon Web Services https://aws.amazon.com
Serverless Computing https://www.techopedia.com/definition/32477/serverless-computing
Serverless on Azure https://azure.microsoft.com/en-us/overview/serverless-computing/
Serverless on AWS https://aws.amazon.com/lambda/
Azure Site Recovery https://azure.microsoft.com/en-us/services/site-recovery/
Azure Backup https://azure.microsoft.com/en-us/services/backup/
Azure Availability Zones https://docs.microsoft.com/en-us/azure/availability-zones/az-overview
https://azure.microsoft.com/en-us/global-infrastructure/availability-zones/
Ana’s Cloud https://www.anascloud.com

[1] Microsoft Azure, Azure Availability Zones, Azure Site Recovery and other terms are trademarks of Microsoft. AWS is a trademark of Amazon. CommVault is a trademark of CommVault Systems, Inc.