Disaster Recovery and Business Continuity Basics

Business Continuity and Disaster Recovery are critical business issues for every organization. Global businesses cannot simply stop operating because the location from which they run their major IT operation fails. Such a failure can be caused by anything from floods, storms, serious power outages, and terrorist attacks to malicious manipulation targeting data and applications in order to bring the operation down.

The difference between Disaster Recovery and Business Continuity

While many people mean the same thing when they use the terms Business Continuity and Disaster Recovery, the two terms are not identical. Their common denominator is that both describe measures to prepare the IT operation for a worst-case scenario.

Business Continuity targets the organizational aspects of potential outages. Questions raised in the context of Business Continuity include: What are the contingency plans for the business organization? Who has which responsibility in the process of getting back in business as quickly as possible? What needs to happen before a specific situation is declared a disaster? How much time is granted until the operation needs to be up again? And so on.

Disaster Recovery targets the technical aspects of a potential outage: How will the IT operation recover from a disaster? Which technologies are used to address which kind of incident? Which applications will be switched to the Disaster Recovery site, when, and how? How will users get access to the remote site? And so on.

How we understand Disaster Recovery

We understand Disaster Recovery as all plans and technologies that are in place to prevent downtime when a disaster strikes. The Disaster Recovery architecture has to include a remote site that can take over the key IT operations in the shortest possible time frame when the primary site fails. From a business perspective, the remote site should be as far away as possible: the farther, the safer. It is not sufficient to stay within a radius of a couple of miles. What about power outages? What about long-term global perspectives on data safety? Real protection requires a distance that spans individual states, countries, or continents.
The weaknesses of hardware-based Disaster Recovery concepts

The typical way a customer starts a Disaster Recovery project is to ask the existing hardware vendor or a system integrator for possible solutions. The hardware vendor may then come up with a concept based on traditional server replication (clustering) over a large distance, combined with storage replication (block-based replication). In addition to serious technical hurdles (network latency, data consistency, distance limitations, bandwidth requirements), the return on investment of such a solution is not great. It is important to understand that hardware vendors usually propose tools that worked fine in the past. New requirements, such as much larger distances between two datacenters, sometimes require thinking out-of-the-box and coming up with something new instead of re-engineering and enhancing existing technologies. Please do not get this wrong: cluster servers and RAID systems are still excellent ways to ensure local hardware availability; they are simply not ideal for mirroring data over hundreds or thousands of miles for Disaster Recovery purposes.

A different approach to Business Continuity and Disaster Recovery

As mentioned earlier, new requirements sometimes call for out-of-the-box thinking. Before talking about a more progressive way to address the Disaster Recovery project, let's face the major criteria for an ideal Disaster Recovery project:

- The application should be up and running at the remote site within a short period of time when a disaster strikes (e.g. 20 minutes)
- Potential data loss should be either eliminated or minimized
- A minimum distance between both datacenters (e.g. at least 200-500 miles)
- Manual work for switching over the application should be minimized
- Data consistency should be guaranteed
- Interference with production servers should be kept to a minimum

In addition, a Disaster Recovery project is typically on a tight budget, and the bandwidth between remote sites is always limited.
Overcoming Disaster Recovery challenges with Business Shadow

Instead of using block-level storage mechanisms, we operate on the database and application level. The key idea is to work with dedicated changes in the database or in the file system, which are copied to the mirror system. Communication is based purely on standard TCP/IP protocols. The good news about the Business Shadow software is that all of the Disaster Recovery requirements mentioned above can be addressed.

Business Shadow consists of three components: DB Shadow for mirroring databases, FS Shadow for mirroring file systems, and Switch Application to cover the automated switch of NetBIOS names and IP addresses between servers (the Switch Application software only makes sense if the primary and Disaster Recovery datacenters are within the same network segment, or if you are working with your own name servers).

With DB Shadow, the production database is copied to the remote site only once. After that, all changes (database archive files) are continuously picked up and copied to the backup site automatically, say every two or five minutes. In addition, the mirror server works with a built-in time delay when applying archive files. This offers additional protection against user or software errors within your application, since you can perform a point-in-time recovery to any time before an error occurred. The same happens with file systems using FS Shadow. The sketch below illustrates this shipping-and-delayed-apply loop.
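To illustrate the principle, here is a minimal Python sketch of such a loop. It is not Business Shadow code: the directory paths, the .arc suffix, the intervals, and the recover_database hook are assumptions for illustration, and both the shipping side and the applying side run in one process purely for brevity (in practice, the apply side runs on the mirror server).

    import shutil
    import time
    from pathlib import Path

    ARCHIVE_DIR = Path("/db/archive")     # where the database writes archive/log files (assumed path)
    MIRROR_SPOOL = Path("/mirror/spool")  # spool directory on the mirror server (assumed path)
    SHIP_INTERVAL = 120                   # pick up new archive files every two minutes
    APPLY_DELAY = 3600                    # apply files one hour late: the point-in-time recovery window

    def recover_database(archive: Path) -> None:
        """Hypothetical hook into the database's own roll-forward recovery mechanism."""
        print(f"applying {archive.name} to the mirror database")

    def ship_new_archives(shipped: set[str]) -> None:
        """Copy archive files that have not been shipped yet over to the mirror spool."""
        for archive in sorted(ARCHIVE_DIR.glob("*.arc")):
            if archive.name not in shipped:
                shutil.copy2(archive, MIRROR_SPOOL / archive.name)  # stands in for the TCP/IP transfer
                shipped.add(archive.name)

    def apply_delayed_archives() -> None:
        """Apply only archive files older than APPLY_DELAY, preserving the recovery window."""
        cutoff = time.time() - APPLY_DELAY
        for archive in sorted(MIRROR_SPOOL.glob("*.arc")):
            if archive.stat().st_mtime < cutoff:
                recover_database(archive)
                archive.unlink()

    if __name__ == "__main__":
        shipped: set[str] = set()
        while True:
            ship_new_archives(shipped)
            apply_delayed_archives()
            time.sleep(SHIP_INTERVAL)

The deliberate delay between shipping and applying is what turns the mirror into more than a copy: an erroneous transaction that reaches the spool has not yet been applied, so the mirror can be recovered to any point before the error.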
What do you need to implement Business Shadow

There are only a few very basic requirements to implement this Disaster Recovery concept. First of all, you need a second server with the same operating system as the production server. Supported operating systems are HP Tru64, HP-UX, IBM AIX, Linux, SINIX, Sun Solaris, and Microsoft Windows Server 2003/2000/NT. The second server needs to have the same disk space as the production server, but there is no restriction on the kind of storage system you use on your Disaster Recovery site; many clients use expensive RAID systems on the production site and more cost-effective storage systems on the DR site. Since DB Shadow works with the backup and recovery mechanisms of the database itself, Business Shadow has to support your database, which can be any of the following: DB2 UDB, Microsoft SQL Server, Oracle, MySQL, and SAP DB/MaxDB. The next step is to have a network connection between both servers and at least one open IP port. The necessary bandwidth between both sites depends on a variety of factors. We have a variety of tools available and would be happy to give you a free estimate of the potential network requirements.

Disaster Recovery projects do not have to be unaffordable once you check the alternatives on the market. Do not let your hardware vendor talk your business into a two-million-dollar Disaster Recovery project when you can have the same coverage for up to 90% less investment.

Overcoming network bottlenecks in Disaster Recovery projects

Moving away from block-based replication towards (log-)file-based mirroring is a huge step towards being able to handle network, bandwidth, and distance limitations at all. However, we are still not where we need to be. There is a lot of work left to make sure that the communication between your production server and your Disaster Recovery server works smoothly. Let's look at the major network limitations.

Disaster Recovery Major Network Limitations

Even when working with files instead of block-based replication, there are still some major network limitations when moving a large number of large files over the network:

- Network bandwidth is still limited. If the changes to your production application amount to, for example, 10 GB per day, those changes need to be moved to the Disaster Recovery site with minimal delay. Otherwise those files potentially clog up your production server, and the changes are nowhere safe in case of a disaster.
- Network latency is still a problem. Part of the latency problem was removed by moving from block-based to log-based replication. However, standard TCP/IP works sequentially: IP packets are sent one after another, with each packet waiting for the acknowledgement of the previous one. This causes serious delay in getting your data across (see the back-of-the-envelope numbers after this list).
- Wide Area Networks are not stable. The larger the distance, the more unstable networks get. It is not unusual for the connection between two sites to be lost every once in a while for a couple of minutes, and it is frustrating to continuously intervene to get the DR site back up and running.
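Two back-of-the-envelope numbers make these limits concrete. The 10 GB per day comes from the example above; the 64 KB TCP window and 100 ms round-trip time are assumed typical long-haul values, not measurements from any specific setup.

    # Illustrative arithmetic only; window size and RTT are assumed typical values.

    gb_per_day = 10                                   # daily change volume from the example above
    avg_mbit_s = gb_per_day * 8 * 1024 / (24 * 3600)  # ~0.95 Mbit/s on average
    print(f"average shipping rate: {avg_mbit_s:.2f} Mbit/s (peaks are far higher)")

    window_kb, rtt_ms = 64, 100                       # classic TCP window, long-haul round trip
    ceiling = window_kb * 8 / 1024 / (rtt_ms / 1000)  # one window per round trip
    print(f"single sequential stream ceiling: ~{ceiling:.1f} Mbit/s, regardless of line speed")

The second number is why the optimizations described next pay off: a single sequential TCP stream over a long distance tops out at a few Mbit/s no matter how fast the line is, so parallel streams and larger packages are needed to fill the pipe.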
Utilizing the Potential for Network Optimization

Business Shadow includes basic mechanisms to address these limitations. For example, copy processes can be highly parallelized, which increases performance and network utilization, and the number of parallel copy processes can be changed dynamically to fine-tune copy performance. Another feature is that the copy process can be set to compressed or high-compressed mode: compressed copy focuses on compression speed, while high compression uses more CPU power to achieve a higher compression ratio.

In addition to those basic mechanisms, the Business Shadow Long Distance version offers dedicated features such as the following:

Parallel Archive Shipping (PAS technology): An important functionality of the Long Distance Edition is the so-called PAS algorithm. The core of this concept is the ability to ship archive files in parallel over the Wide Area Network. This includes sending single files in parallel as well as shipping multiple packages of the same file in parallel. Besides the initial copy, this functionality offers a much better utilization of the existing network bandwidth; a minimal sketch of the idea follows below.
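The following Python sketch shows the parallel-shipping idea under stated assumptions: the 4 MB chunk size, the worker count, and the send_chunk stub are illustrative, not the actual PAS implementation, and the compression level mirrors the compressed/high-compressed choice described above.

    import zlib
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB packages (assumed size)
    WORKERS = 8                    # number of parallel copy processes, tunable at runtime

    def send_chunk(seq: int, payload: bytes) -> None:
        """Hypothetical network send; stands in for one TCP connection to the DR site."""
        compressed = zlib.compress(payload, 1)  # level 1 = fast "compressed"; level 9 = "high compressed"
        # transmit(seq, compressed) would go here
        print(f"chunk {seq}: {len(payload)} -> {len(compressed)} bytes")

    def ship_file_parallel(path: Path) -> None:
        """Split one archive file into chunks and ship the chunks concurrently."""
        data = path.read_bytes()
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            for seq, chunk in enumerate(chunks):
                pool.submit(send_chunk, seq, chunk)
        # The receiver reassembles the chunks by sequence number before applying the file.

Because each stream is individually limited by the window-per-round-trip ceiling shown earlier, running several streams side by side is what actually fills a long-haul line.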
Data Encryption: When customers use Wide Area Networks to mirror critical data, we suggest using well-secured lines between the locations. One effective mechanism is to rely on Virtual Private Networks or to use additional hardware-based encryption. On top of that, Aivant's Long Distance Edition offers additional security mechanisms: the encryption methods are based on proprietary procedures that result, for example, from compressing the data or from specific bit-rotation mechanisms.

Optimizing IP Communication (VLP technology): A traditional TCP/IP stack is usually not the best solution for mirroring standard Enterprise Resource Planning (ERP) databases or typical file systems. For maximum utilization of the bandwidth, the Long Distance Edition offers intelligent technologies for optimizing the basic IP communication. This communication optimization package includes the so-called VLP technology (very large packages) along with additional package checksums (a framing sketch follows at the end of this section). With these features, the overall IP communication is clearly oriented towards typical database and file system traffic and the DB Shadow/FS Shadow processes. Besides the PAS algorithm, the optimized IP communication is a great contribution to increasing performance and utilizing limited bandwidth... and of course there is much more.
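To make the "very large packages with checksums" idea concrete, here is a small framing sketch. The frame layout (4-byte length, 4-byte CRC32, length-prefixed records) is an assumption for illustration only, not the actual VLP wire format.

    import struct
    import zlib

    def pack_vlp(records: list[bytes]) -> bytes:
        """Batch many small records into one large package: [length][crc32][payload]."""
        payload = b"".join(struct.pack(">I", len(r)) + r for r in records)
        header = struct.pack(">II", len(payload), zlib.crc32(payload))
        return header + payload

    def unpack_vlp(package: bytes) -> list[bytes]:
        """Verify the checksum, then split the payload back into individual records."""
        length, crc = struct.unpack(">II", package[:8])
        payload = package[8:8 + length]
        if zlib.crc32(payload) != crc:
            raise ValueError("package corrupted in transit; request retransmission")
        records, offset = [], 0
        while offset < len(payload):
            (rec_len,) = struct.unpack(">I", payload[offset:offset + 4])
            records.append(payload[offset + 4:offset + 4 + rec_len])
            offset += 4 + rec_len
        return records

    # Example: three small database changes travel as one large, checksummed package.
    frame = pack_vlp([b"update row 1", b"update row 2", b"update row 3"])
    assert unpack_vlp(frame) == [b"update row 1", b"update row 2", b"update row 3"]

Sending one large, checksummed package instead of many small ones reduces the per-packet acknowledgement overhead described earlier, while the checksum lets the receiver detect corruption and request retransmission of the whole package.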