Disaster Recovery Awareness 1
Threats to Business Environmental: Fire, flood, earthquake, etc. Loss of utilities & services: Power interruption, etc. Systems or equipment failure Information security breach Civil protest/unrest Disruptions in third parties and business partners Pandemic outbreaks Acts of terrorism 2
Impacts Due to Disaster Revenue Loss Angry Customers Non-Compliance with Laws and Regulations Embarrassment 3
Definitions, Acronyms and Abbreviations High Availability (failover) is the characteristic of a system to protect against or recover from minor IT outages in a short time frame with largely automated means. Disaster Recovery (DR) is the ability to continue with services in the case of major IT outages, often with reduced capabilities or performance. Disaster Recovery solutions typically involve manual activities. Business Continuity (BC) takes Disaster Recovery one step further, and includes facilities, work force continuity, vital business operations, etc. 4
Definitions, Acronyms and Abbreviations (Continued) Recovery Time Objective (RTO) Recovery Point Objective (RPO) Tier 0 The maximum elapsed time a business function can sustain a service interruption from the time a crisis is identified to the restoration of service. The amount of data loss that is acceptable. For example, if an RPO of 6 hours is acceptable, SPE will be able to restore systems back to the state they were in 6 hours before the disaster or less. Any data created or modified inside a recovery point objective will be either lost or must be recreated during a recovery. Basic Infrastructure (Such as DNS, LDAP, Active Directory, network services, etc.), required for business application to run. Tier 1 Application with Recovery Time Objective of less than 12 hours and Recovery Point Objective of less than 5 minutes. Tier 2 Tier 3 Application with Recovery Time Objective of more than 12 hours but less than 48 hours, and Recovery Point Objective of less than 24 hours. Application with Recovery Time Objective of more than 48 hours and less than 15 days, and Recovery Point Objective of less than 36 hours. Tier 4 Application with Recovery Time Objective of more than 15 days and Recovery Point Objective of less than 36 hours. 5
DR Plan Disaster Recovery Plan: Aims to provide an organized way to respond to a disruptive event. Addresses the IT procedures to be followed before, during and after a loss to identified applications at Primary Datacenter (Chandler, AZ) The primary objective is to provide the capability to bring up critical applications and data at an alternate site within a time frame that minimizes the loss to the organization. Per Business Impact Analysis 2008 Tier s determined & DR Environment implemented for the following SPE Business Critical Applications. As of January 2013 DR Systems located at El Segundo colo. (SAP in process of establishing DR environment) evmi ITSM/C2C PRISM TAAS GPMS 6
DR Site DR Site Name: Digital Realty Trust co-location. (El Segundo Datacenter) Address: 2260 E El Segundo Blvd, City, State ZIP: El Segundo, CA 90245 Directions: Contact Name / Phone: 15 minute drive from Corporate Pointe, Culver City, CA Ron Goede Director Facilities SPE - EIS 600 Corporate Pointe 310.665 6639 Office 310.678.6461 Cell E-mail: Ron_Goede@spe.sony.com Web: http://www.digitalrealtytrust.com/ 7
DR Process Life Cycle Project Planning- Completed in 2008 Completed in 2008 Risk Assessment & Business Impact Analysis Established in 2008 updates in progress DR Environment Implementation Maintenance & Updating evmi, PRISM, ITSM/C2C, TAAS, GPMS DR Procedure Documentation Awareness & Training DR Testing & Exercising DR testing & exercising is performed after implementation phase to ensure DR environment can be brought up within the required time (RTO) using the documented DR procedure. 8
DR Methodology Applications were classified as Tier1/Tier2/Tier3 based on the results of Risk Assessment Determined RTO and RPO to ensure minimum business impact DR environments were set up at the DR site (currently being updated to mirror Prod.) DR Procedure Documents developed (currently being updated) Tests conducted Tier Maintain RTOthe Document RPO 1 Less than 12 hrs Minutes to 6 hrs 2 12-48 hrs 6-24 hrs 3 2 wks Less than 36 hrs 4 4 wks Less than 36 hrs 9
Tiers based on Business Impact Analysis 2008 Application name Division Tier Financial Impact Per Day 5 Time And Attendance System (TAAS) Corporate Tier 1 $6 M (If it occurs on any Monday) 25 evmi HE Tier 1 $4,693,334 38 ITSM, C2C, ITSM-S, SARA TV-I Tier 2 $3.2 Million 30 SPIRIT MP Tier 1 $2.7 Million 8 Timecapture Imageworks Corporate Tier 1 > $2 M (If it occurs on any Monday or Friday) 24 PRISM & Pathfinder HE Tier 2 $1 Million 34 Domestic TV Sales Management TV-D Tier 2 $64,286 23 Sales Estimation Tool (SET) HE Tier 2 (Tier 1?) $50,000 31 SPIRITworld MP Tier 1 $18,600 28 GPMS Suite MP Tier 3 $18,000 27 InterPlan MP Tier 2 (Tier 1?) $16,000 7 Hyperion Enterprise Corporate Tier 1 (Tier 2?) $5,000 14 Sony Pictures Stock Footage DMG Tier 1 $5,000 22 Revenue Sharing System (REVSHARE) HE Tier 3 $5,000 11 cineshare+ / cineview DMG Tier 2 $3,900 4 HR Connection Corporate Tier 3 $3,000 1 IRIMS Incident Management Corporate Tier 3 $800 6 Medical Services System - Medgate Corporate Tier 3 $350 3 Remedy IT Corporate Tier 1 $0 9 E-911 LAN Alert (Siemens Phone System) Corporate Tier 1 (Tier 3?) 17 SPiDR (Stellent) ETS Tier 1 (Tier 0?) $0 12 Secure File Distribution Services DMG Tier 2 $0 20 myspe, myspej, myspti ETS Tier 2 $0 13 Digital Media Repository DMG Tier 3 $0 36 DealMaker TV-D Tier 3 $0 37 SPT B2B Marketing TV-D Tier 3 $0 10 ACORN/DMI DMG Tier 1 * Tier 1 Tier 2 Tier 3 29 Spent and Committed MP Tier 1 * 32 Automated System for Ad/Pub fulfillment (ASAP) MP Tier 2 * $0 39 Harris Landmark Traffic and Ad Sales TV-I Tier 2 * 40 Harris VISION Scheduling TV-I Tier 2 * 2 CorpTax Corporate Tier 3 * * Awaiting final number 10
Disaster Levels Asset Level Recovery Level I Site Level Recovery Level II City Level Recovery Level III Can we work from the same premises? Can we operate from any other location within LA (SPE Lot)? Recovery Team Leader to mobilize all Recovery Teams and guide them to relocate to another SPE location outside LA or mobilize teams from other SPE location outside LA (London, etc.) Operations at Recovery Location Recovery Team Leader to mobilize the required Recovery Teams and guide them to relocate to another area within the same premises Recovery Team Leader to mobilize the required Recovery Teams and guide them to relocate to another SPE location within LA Restoration Begin Recovery Procedures Begin Recovery Procedures Begin Recovery Procedures 11
DR Network Strategy Emergency Types Description Causes Network Recovery Strategy Level 1 Asset Level Failure System crashes, disk failures, theft, hacking, virus threats SPE LAN to be used primarily. However, if administrators are away from the primary site, they could use VPN and make the DR site as primary. Level 2 Site Level Failure Major power outages, communication link failures, network component failures, sabotage Administrators would be required to use Sony LAN from the alternate site or VPN and make the DR site as primary. Level 3 City Level Failure Total outage because of natural or manmade disasters Administrators in the US may not be able to use either LAN or VPN and would need to depend on administrators outside US (Europe/India etc). 12
DR Operations Flow Recovery Flowchart Disaster Notify DR Team & Crisis Management Team Crisis Management Team/Facilities Management Alert Damage Assessment Team Emergency Plan Activation (Evacuation) Verification & Assessment DR Plan Activation & Notification Procedures Assemble Teams DR Teams IT Recovery Team Salvage Team Perform Recovery at Alternate Site Begin Salvage Operations at Primary Site Normal Operations Team Return to Normal Operation End of Crisis 13
Teams & Responsibilities Team Crisis Management Team Responsibilities Directs, allocates resources and oversees the situation Damage Assessment Team Assesses the impact of disaster event and informs Crisis Management Team Coordinates the collection of information & evidence required for insurance claims Salvage Team Disaster Recovery Team Normal Operations Team Returns primary site to normal operating conditions Clears and repairs the primary processing facility Implements recovery procedure at DR Site Recovers the systems & data at DR Site Returns Production from DR to Primary Site Other teams required during Disaster Recovery are: Emergency Response Team Evacuation of employees, coordinating with non-spe emergency teams (Fire Dept., etc.) Public Relations Team Handling media, public, etc. Security Incident Response Team If there is any security incident (Denial of Service (DoS), Virus Attack, etc.) 14
Current Status DR architecture has not been updated since 2011 to match Production architecture. Operational Testing documents for the 5 identified applications that have DR in place (PRISM, evmi, ITSM/C2C, GPMS and TAAS ) date back to 2011 In the process of determining the requirements for additional infrastructure: oto account for changes in the architecture and Infrastructure for currently established DR applications oto provision Infrastructure for changes in the architecture and infrastructure for existing DR platforms as well as new DR platforms (such as Oracle on AIX, JBoss, etc.)
Current Status Business Impact Analysis was completed in 2007 since then no formal updates have been published. New applications may not have a DR environment/plan SAP -Draft of Disaster Recovery Plan being finalized. Infrastructure has been provisioned. Per Business Continuity Team (Scot Falkenstein) - we are aligned with policy effective May 12, 2012 Testing will need to be re-scheduled once the architectural and design changes, Version upgrades and WebLogic to JBoss conversions are completed. (TAAS, ITSM/C2C, GPMS, PRISM)
Process Flow for 2013 DR Parallel Testing SAP evmi 2. 1. Review current documents on File Perform Initial Analysis Architecture updates? Personnel changes? Call tree? 3. Meet with Application & Platform Leads 4. Coordinate Application review and updates 5. Confirm DR Servers including platform interfaces 6. Upgrade platforms 7. Update DR Env. & Testing Documents 8. Schedule Test 9. Conduct Test 10. Document results, learnings and resolve issues 11. Prepare for Full Test ITSM, PRISM, TAAS, WebMethods, WDG, Datastage, IDM, D-Series, Alfresco, Oracle, JBoss, Weblogic, Apache, BO
Disaster Recovery Flow Chart During Disaster (evmi Recovery at DR Site) SAMPLE Crisis Management Team Disaster Declared @ Chandler Network/ Storage Team Server Team Database Team ETL Team Application Team IT Recovery Coordinator Activate the Salvage team to recover/salvage from the Primary site Send notifications To DR Teams Ensure connectivity to DR Site Check & stop applications or services running at Chandler (if any) Stop replication Recover & import volumes Mount volumes & check file systems E No All volumes mounted & file systems consistent? Network team Storage team Yes Mount and start the Oracle and DB2 databases Start middleware (webmethods, Datastage & D-Series) Change Pegasus interface to point to DR site A 18
Disaster Recovery Flow Chart During Disaster (Recovery at DR Site) Continued... SAMPLE Crisis Management Team Network/ Storage Team Server Team Database Team ETL Team Application Team IT Recovery Coordinator A Start evmi Application & Services Test evmi services & Test evmi application E No Is evmi application running? Yes Inform users to Switch over to DR URL Declare DR site up Inform Crisis Management Team Monitor the situation Is Primary site up? B Yes No 19
Disaster Recovery Flow Chart After Disaster (Restoration to Normal Operations) SAMPLE Crisis Management Team B Network/ Storage Team Ensure connectivity to primary site Server Team Database Team ETL Team Application Team IT Recovery Coordinator Start reverse replication of data from DR site Stop replication, recover & import primary volumes Mount Primary volumes & check file systems No All volumes mounted & file systems consistent? Yes Mount and start the Oracle and DB2 databases Start middleware (webmethods, Datastage & D-Series) Change Pegasus interface to point to Primary site Start Siteminder & evmi Application & Services Test evmi services & Test evmi application Network team Storage team Is evmi application running? Yes 20 C No
Disaster Recovery Flow Chart After Disaster (Restoration to Normal Operations) Continued SAMPLE Crisis Management Team Network/ Storage Team Server Team Database Team ETL Team Application Team IT Recovery Coordinator C Check & stop applications or services running at DR (if any) Inform users to Switch-over to Chandler Prod. URL Declare Primary site up Inform Crisis Management Team Start replication to DR STOP Network team Storage team 21
Important Contact Details SAMPLE Notification Tree Sample - evmi. 22
Type of Tests Parallel The DR site is activated Only test transactions are carried out at the DR site Main business continues at the primary site without any interruptions Full The primary (Prod) site is shut down DR site is activated as the primary Business is conducted at the DR site 23
Testing Strategy and Prioritization DR testing for current DR Applications has been prioritized by determining when their changes, upgrades and conversions will be complete. oif changes are due to complete in the next 3 months (May 2013), we have scheduled DR testing to occur after changes are complete. (PRISM & ITSM/C2C) oif changes are greater than 3 months away we will test as soon as architecture has been updated, then re-test after changes are made. o evmi has fewer dependencies therefore we will conduct some testing in April 2013, a Parallel test in June 2013 & re-test in the 4 th Quarter when all changes are complete. othe GPMS team has specifically requested that we delay all DR testing till October 2013 osap failover to El Segundo has been tested successfully (March 2013). DR Parallel testing will be conducted once all changes/updates are complete
Tentative Schedule DR Parallel Testing TAAS MAY 2013 PRISM MAY/JUNE 2013 ITSM/C2C MAY/JUNE 2013 GPMS 4th QUARTER PER REQUEST evmi APRIL 2013 SAP MARCH 2013 (FAILOVER TESTED)
You cannot control disaster. But you can be prepared for it. 26