Disaster Recovery Stanley Lopez Premier Field Engineer Premier Field Engineering Southeast Asia Customer Services and Support
Categories of Risk
  Financial
  Operational
  Reputational
  Market share
  Revenue
  Regulatory
  Other risks that are more specific to a particular organization or industry, e.g. healthcare, national security for the government, or a presently occurring activity (such as a merger or acquisition)
Classification of Risks
  People
  Process
  Technology
  Environmental
Potential Causes of Disasters
  Natural
  Political
  Hardware
  Software
  Human error
Avoiding Disaster
  Prevention causes fewer headaches than having to go through a DR effort
  Preventive measures should target known hazards:
    Natural
    Political
    Hardware
    Software
    Human error
Preparing for Recovery from Disasters
  Objectives:
    Identify what data to collect prior to a disaster
    Ways that preparedness efforts can go awry
    Limitations on preparedness efforts
    Identify key points of information that need to be communicated to lines of business during a disaster
What Can Go Wrong?
  Murphy's law: what can go wrong will go wrong
  It is difficult to anticipate every eventuality. Focus on preventing key scenarios
  Preventable failures during recovery:
    Backups not functioning
    Hardware errors with tape drives, or tape failures
    Tapes lost
    Administrative passwords lost
    Operational error due to lack of practice in recovery
Limits on Preparedness Efforts
  Restoring from backups can roll back state changes
  Frequency of changes vs. duration of backups impacts recoverability
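As a rough illustration of how restoring from backup rolls back state changes, the sketch below estimates how many changes would be lost if the last good backup had to be restored. It assumes a steady average change rate and reads "duration of backups" as the time elapsed since the last backup; the function name and parameters are hypothetical and not taken from the slides.

```python
from datetime import datetime


def changes_at_risk(last_backup: datetime, failure_time: datetime,
                    changes_per_hour: float) -> float:
    """Estimate how many changes a restore would roll back.

    Assumption: changes arrive at a steady average rate between the
    last backup and the failure. Real change patterns are burstier.
    """
    if failure_time < last_backup:
        raise ValueError("failure_time must not be earlier than last_backup")
    hours_exposed = (failure_time - last_backup).total_seconds() / 3600
    return hours_exposed * changes_per_hour


if __name__ == "__main__":
    # Hypothetical figures: nightly backup at 02:00, failure at 14:00,
    # about 40 directory changes per hour.
    lost = changes_at_risk(datetime(2024, 1, 1, 2, 0),
                           datetime(2024, 1, 1, 14, 0),
                           changes_per_hour=40)
    print(f"Changes that a restore would roll back: {lost:.0f}")
```

The more frequently the data changes, or the longer the gap between backups, the larger this number becomes and the more work must be redone after a restore.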
Estimating Time for Recovery
  How long does it take to provision the server?
    Consider environment, build process, and ease of finding an available computer or accessing the computer in the datacenter
    Usually 1 to 4 hours is a reasonable estimate
  How long does it take to restore data from backup?
    Test restore times so that you can make a reasonable estimate of recovery time
    Your estimate should include requesting a tape from offsite storage
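The estimate described above can be sketched as a simple sum of its parts. This is a minimal illustration, assuming the steps run one after another; the function name and parameters are hypothetical, and the default provisioning range is the 1 to 4 hours mentioned on the slide. Restore and tape retrieval times should come from your own tested measurements.

```python
def estimate_recovery_window(restore_hours: float,
                             tape_retrieval_hours: float,
                             provision_range: tuple[float, float] = (1.0, 4.0)
                             ) -> tuple[float, float]:
    """Return a (best case, worst case) recovery estimate in hours.

    Assumption: provisioning, tape retrieval, and restore happen
    sequentially, so their durations simply add up.
    """
    low, high = provision_range
    if min(restore_hours, tape_retrieval_hours, low, high) < 0:
        raise ValueError("durations must not be negative")
    if low > high:
        raise ValueError("provision_range must be (low, high)")
    fixed = restore_hours + tape_retrieval_hours
    return (low + fixed, high + fixed)


if __name__ == "__main__":
    # Hypothetical figures: a tested restore of 3 hours and 2 hours
    # to get a tape back from offsite storage.
    best, worst = estimate_recovery_window(restore_hours=3.0,
                                           tape_retrieval_hours=2.0)
    print(f"Estimated recovery time: {best:.1f} to {worst:.1f} hours")
```

Reporting a range rather than a single number makes it easier to give the business a time frame that errs on the side of more rather than less.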
Communicating with Lines of Business
  What to communicate:
    Scope of impact
      Communicate what is broken
      Explain how the business will experience the issue
    Time to resolution
      Err on the side of more rather than less
      Give the business a time frame to work within
Estimating Costs (Example)
  Compare cost of prevention vs. cost of service outage
  Estimating cost of service outage:
    Lost profitability = Revenue lost due to an outage
    Lost productivity = (Number of users affected) * (effect on productivity) * (percent chance of event) * (duration of downtime) * (average hourly wage)
    Soft costs: for example, loss of customer confidence
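The lost productivity formula above can be worked through in a few lines of code. This is a sketch only: the class and function names are hypothetical, all figures are made up for illustration, and adding lost revenue to lost productivity to get a single outage cost is an assumption rather than something the slide states. Soft costs such as loss of customer confidence are not modeled.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    users_affected: int
    productivity_effect: float   # fraction of productivity lost, 0.0 to 1.0
    chance_of_event: float       # probability of the event, 0.0 to 1.0
    downtime_hours: float
    average_hourly_wage: float
    revenue_lost: float = 0.0    # lost profitability, estimated separately


def lost_productivity(s: OutageScenario) -> float:
    """Product of the five factors listed on the slide."""
    return (s.users_affected * s.productivity_effect * s.chance_of_event
            * s.downtime_hours * s.average_hourly_wage)


def outage_cost(s: OutageScenario) -> float:
    """Assumption: total hard cost = lost revenue + lost productivity."""
    return s.revenue_lost + lost_productivity(s)


def prevention_is_cheaper(prevention_cost: float, s: OutageScenario) -> bool:
    """Compare the cost of prevention with the estimated outage cost."""
    return prevention_cost < outage_cost(s)


if __name__ == "__main__":
    # Hypothetical scenario: 500 users at half productivity for 8 hours,
    # a 10% chance of the event, and an average wage of 30 per hour.
    scenario = OutageScenario(users_affected=500, productivity_effect=0.5,
                              chance_of_event=0.1, downtime_hours=8,
                              average_hourly_wage=30.0, revenue_lost=10000.0)
    print(f"Lost productivity: {lost_productivity(scenario):.2f}")
    print(f"Outage cost:       {outage_cost(scenario):.2f}")
    print(f"Prevention at 12000 is cheaper: "
          f"{prevention_is_cheaper(12000.0, scenario)}")
```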
Case Study: User Descriptions Are Accidentally Deleted
  It is often not the technical skill of the IT personnel that determines DR success or failure.
Other factors can include:
  Underestimating Recovery Time
    Underestimating the time needed to recover from an Active Directory (AD) disaster (the exercise of restoring from backup represents only a fraction of the total recovery time)
  Lack of Communication
    Failure to communicate during a crisis (no one told the driver)
  Inadequate Training for DR Scenarios
    Lack of training for DR scenarios
  Lack of Contingency Planning
    Failure to implement contingency planning as part of the DR process (such as asking for all the tapes from the past week instead of just the most recent one)
Summary of What Happened
  Murphy's law: what can go wrong will go wrong
  Little things add up
  Training and practice are essential
  Contingency plans should be implemented within the DR strategy
In Summary
  Evaluate risks
  Identify possible causes of disasters
  Categorize disasters (critical to low risk)
  DR documents should be in place for every service the organization utilizes
  Regular fire drills should be done
  Awareness and preparedness
Microsoft Premier Support
  Risk Assessment Programs
    Active Directory Risk Assessment
    Exchange Risk Assessment
    SQL Risk Assessment
    Microsoft Office SharePoint Server Risk Assessment
    Cluster Risk Assessment
  Workshops
    Active Directory Troubleshooting
    Active Directory Disaster Recovery
    Exchange Disaster Recovery
    Exchange Troubleshooting
    And a lot more