Sound Transit Internal Audit Report No. 2014-6
Maturity Assessment: Information Technology Division Disaster Recovery Planning
Report Date: June 5, 2015

Table of Contents
Executive Summary
Background
Audit Approach and Methodology
Maturity Assessment
Management Response

Audit Timeline
Audit Entrance Meeting: 02/13/15
Exit Meeting: 06/04/15
First draft report issued: 06/05/15
Final management responses received: 08/12/15
Final report issued: 08/19/15
Presented to Audit & Reporting Committee: 10/15/15
Executive Summary

The Information Technology Division's Disaster Recovery Plan was included on the Internal Audit Division's Work Plan in 2012, 2013 and 2014. The audit was deferred in both 2012 and 2013 because the IT Division was in the process of updating its plan. When we were advised that the IT Disaster Recovery Plan was to be updated again in 2014, we again considered deferring the audit, but instead decided to conduct a maturity assessment of the current state of disaster recovery planning in the IT Division.

To perform the assessment, we used two well-established industry standards. First, we used COBIT 5 as the framework for establishing disaster recovery requirements. Second, we used the capability rating standards established by the International Organization for Standardization (ISO).

According to IT Division management, their 2014 effort focused on the data center as an expedient way to develop a disaster recovery plan for the most critical agency applications. The agency built two new data centers in 2013 and 2014, which provide fail-over redundancy. Our maturity assessment found that the 2014 Data Center Disaster Recovery Plan did not score well against COBIT 5 requirements, primarily for three reasons. First, it was developed using a top-down approach that assumed all agency business needs would be captured within the data center, when in fact they were not. Second, the plan assumed recovery time objectives,[1] rather than analyzing business practices to determine how long agency personnel could operate while awaiting service restoration following a disruption. Third, it assumed all applications within the data center were of equal criticality, and thus did not provide guidance on the priority for restoring service to each application. Because of these limitations, the current IT Division disaster recovery planning effort scored low in this assessment. Please refer to the detailed reporting within.
Note that IT Division management is aware of this and has contracted with a consulting firm to create a new Disaster Recovery Plan. We plan to review this effort and report a revised maturity assessment in future years.

This audit pertains only to information technology under the control of the IT Division, which includes information technology infrastructure located in the data center and certain transit systems located throughout the region, such as ticket vending machines (TVM) and closed-circuit television (CCTV). This audit does not include the Supervisory Control and Data Acquisition (SCADA) or Positive Train Control systems, because disaster recovery planning for these systems is the responsibility of the Operations Department and is considered outside the scope of this maturity assessment.

[1] Recovery Time Objective: the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
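The recovery time objective defined in the footnote above can be illustrated with a short sketch. This is not part of the audited plan; the dates, the four-hour RTO, and the function name are hypothetical, chosen only to show how an outage duration is compared against an RTO:

```python
from datetime import datetime, timedelta

def within_rto(disruption_start: datetime, service_restored: datetime,
               rto: timedelta) -> bool:
    """True if the service was restored within the recovery time objective."""
    return (service_restored - disruption_start) <= rto

# Hypothetical outage: service down at 08:00, restored at 11:30,
# measured against an assumed 4-hour RTO for a critical application.
start = datetime(2015, 6, 5, 8, 0)
restored = datetime(2015, 6, 5, 11, 30)
print(within_rto(start, restored, timedelta(hours=4)))  # True (3.5 h <= 4 h)
```

The point of the comparison is that the RTO should be derived from how long the business can tolerate the outage, not assumed after the fact, which is the deficiency the assessment identifies.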
Background

IT Division management is currently in the process of aligning business processes with the COBIT 5 framework and also with division strategic planning and performance monitoring. COBIT 5 provides a manageable and logical structure for internal controls. The COBIT 5 business process that aligns with the IT Division Disaster Recovery Program at Sound Transit is DSS04, titled Manage Continuity. The main focus of the IT Division Disaster Recovery Program is to continue agency operations effectively and efficiently after a disaster or unexpected business interruption.

Prior to the 2014 Data Center disaster recovery planning effort, the program was last updated in 2007. A plan to update the program was presented to the Technology Governance Team (TGT) in June 2014. The plan included a three-year, three-phase process that focused on IT infrastructure in year one, IT business applications in year two and Supervisory Control and Data Acquisition (SCADA) in year three. In 2014, the IT Division worked with an IT disaster recovery consultant to complete the Data Center Disaster Recovery Plan. In 2015, the IT Division is working with a new IT disaster recovery planning consultant to complete the IT business applications and SCADA disaster recovery plans. The consultant will:

1. Define and document the IT Continuity/Disaster Recovery Program policy, objectives and scope.
2. Maintain a continuity strategy.
3. Develop and implement a business continuity response.
4. Manage backup arrangements.

Audit Approach & Methodology

Internal Audit approached the audit by gaining an understanding of the current plans and processes described above. We reviewed COBIT 5 guidance and other resources to gain an understanding of industry best practices in disaster recovery planning. We met with management to discuss audit scope, objectives and timing, and to obtain general knowledge of current practices. Based on analysis of the data gathered and discussion with ST management, the following objective was developed:

1.
Perform a COBIT 5 maturity assessment regarding IT Division disaster recovery planning and management.

During the fieldwork phase of the audit, all collected information was examined, including the IT Data Center Disaster Recovery Plan, TGT presentations and IT Division procedure documents. All information collected was used to formulate conclusions and recommendations.

The final phase was reporting. All information was summarized and organized. Preliminary results were communicated to management, findings were clarified, and conclusions and recommendations were presented. The report was provided to appropriate Sound Transit personnel for review and comment, and was revised to include the required management responses.

We conducted this performance audit in accordance with Generally Accepted Government Auditing Standards and the International Standards for the Professional Practice of Internal Auditing. Those standards require that we plan and perform the audit to obtain sufficient, appropriate evidence to provide a reasonable basis for our findings and conclusions based on our audit objectives. We believe that the evidence obtained provides a reasonable basis for our findings and conclusions based on our audit objectives.
IT Division Disaster Recovery Planning Maturity Assessment

This audit evaluated the maturity of eight management activities within the COBIT 5 process, Manage Continuity (see table below). According to COBIT 5 standards and ISO rating methodology,[2] seven management activities are rated Level 1 (Partial) and one is rated Level 0. The management activities rated Level 1 are qualified as partially achieved because controls within the processes are not adequate to ensure predictable outcomes. The key to achieving Levels 2 and 3 is improved documentation and the development of self-audit processes that evaluate the effectiveness of control processes.

The following table describes the current state of the eight defined management activities applicable to the COBIT 5 process, Manage Continuity, which is described as: "Establish and maintain a plan to enable the business and IT to respond to incidents and disruptions in order to continue operation of critical business processes and required IT services and maintain availability of information at a level acceptable to the enterprise."

Management Activity / Description of Current State

1. Define the business continuity policy, objectives and scope.
The Data Center Disaster Recovery Plan is not aligned with the agency-wide Emergency Management Plan because: it was not developed based on analysis of the agency services and business processes that are critical to the agency; not all business processes and systems are included; and performance metrics to track progress of the Data Center Disaster Recovery Plan have not been developed.

2. Maintain a continuity strategy.
The Data Center Disaster Recovery Plan does not assess the likelihood of disasters and is not based on business impact analyses or recovery time objectives for critical agency services and business processes.

3. Develop and implement a business continuity response.
The Data Center Disaster Recovery Plan does not include agency operational continuity plans, key suppliers' or outsourced partners' plans, or backup requirements.

4. Exercise, test and review the Business Continuity Plan.
An annual continuity testing plan has not been developed, documentation of existing testing should be improved, and performance metrics should be used to determine whether test results were addressed adequately and timely.

5. Review, maintain and improve the continuity plan.
The Data Center Disaster Recovery Plan has not been reviewed and approved and needs further improvement.

6. Conduct continuity plan training.
A continuity training program has not been developed. Staff competencies required for continuity training and testing have not been defined, and training plans have not been documented.

7. Manage backup arrangements.
The IT Division procedure document addressing backup and retention requirements was last reviewed in 2012.

8. Conduct post-resumption review.
The Data Center Disaster Recovery Plan includes steps to conduct a post-resumption review; however, it does not address all applicable systems, and the procedures have not been tested.

[2] Refer to Appendix I for a detailed description of the rating system.
Recommendations:

Based on interviews with management and analysis of the two consultant agreements for Disaster Recovery Planning (the 2014 effort and the current contract), it appears that IT Division management understands the current plan needs improvement. As noted previously in this report, the current plan is deficient primarily because it was based upon three incorrect assumptions. First, it was developed using a top-down approach, assuming that all agency business needs would be captured. Second, the plan assumed recovery time objectives,[3] rather than analyzing business practices to determine how long agency personnel could operate while awaiting service restoration following a disruption. Third, it assumed all applications within the data center were of equal criticality, and thus did not provide guidance on the priority for restoring service to each application.

We recommend the Information Technology Division consider the following:

Planning

1. In order to better align the IT Business Continuity and Disaster Recovery Plan with the ST Emergency Management Plan, identify IT responsibilities and document the policies and procedures required to continue business operations after a disaster.
2. Assess the likelihood of business disruption for each incident type. This can help focus training and preparation based on incident type.
3. Develop a backup and restore test plan that includes periodic testing of on-site and off-site data for critical systems.
4. Develop an annual business continuity and disaster recovery testing plan. A testing plan should include a schedule of types of testing, systems to test and a full test of the Disaster Recovery Plan.

Monitoring

5. Identify all internal IT services and business processes that are critical to the agency. Creating and maintaining this list will help IT focus its resources.
6. The IT Business Continuity and Disaster Recovery plan development should be reported to the TGT and executive management annually, to improve management controls and agency involvement in the process. The three-year, three-phase process to develop the plan was last presented to the TGT in June 2014.
7. Review the IT Data Center Disaster Recovery Plan on a regular basis against major changes to: agency organization, business processes, outsourcing arrangements, technologies, infrastructure, operating systems and application systems. Track and report the frequency of updates to the risk profile.
8. Ensure business impact assessments are revised when changes to agency business practices are identified.
9. Ensure that all changes in policy, plans, procedures, infrastructure, and roles and responsibilities are approved by agency management and communicated to appropriate agency staff.
10. Develop, track and report performance metrics for the IT Business Continuity and Disaster Recovery plan development project. The COBIT 5 framework for managing continuity recommends many performance metrics, which we have provided for consideration in Appendix II.
11. Periodically assess adherence to the documented Disaster Recovery Plan.
12. Determine the effectiveness of the plan, continuity capabilities, roles and responsibilities, skills and competencies, resilience to incidents, technical infrastructure, and organizational structures and relationships.
13. Identify weaknesses or omissions in the plan and capabilities and make recommendations for improvement. Track and report the percent of agreed-on improvements that have been reflected in the plan and the percent of issues identified that have been subsequently addressed in the plan.

Documenting

14. Determine whether agency divisions have developed operational business continuity plans for critical business processes and/or temporary processing arrangements, including links to plans of outsourced service providers. Track and report the percent of agency divisions satisfied that IT service delivery required in their continuity plans meets agreed-upon service levels.
15. Develop adequate Business Impact Assessments and Recovery Time Objectives (RTO). The assessments should include input from relevant stakeholders of each IT system and business function and should be documented and approved by stakeholders. The business impact assessment and RTO are listed as required outputs by COBIT 5.
16. Ensure that key suppliers and outsource partners have effective continuity plans in place. Track and report the percent of critical key suppliers and outsource partners who do not have effective continuity plans in place.
17. Review and update the Information Technology Procedure No. 9, Backup and Retention Requirements: Production Environment.
18. Include a reference to system backup requirements in the policy and procedures required to support the IT Data Center Disaster Recovery Plan.
19. Include systems, applications, data and documentation maintained or processed by third parties in the ST Procedure No. 9, Backup and Retention Requirements: Production Environment.
20. Improve the documentation of test results by:
   a. Recording the date of the test
   b. Recording the roles and responsibilities of test participants
   c. Labeling recommendations identified from the post-test debriefing and analysis
   d. Including review and approval signature from IT management
21. Document post-resumption review following the successful resumption of business processes after service interruption.
22. Obtain management approval of the post-resumption review documentation.

Training

23. Create an IT business continuity and disaster recovery training program. Track and report the percent of issues identified that have been subsequently addressed in the training materials.
24. Define training requirements for agency staff performing continuity planning, impact assessments, risk assessments, media communication and incident response. Ensure that the training plans consider frequency of training and training delivery mechanism. Track and report the percent of internal and external stakeholders that have received training.
25. Document agency staff competencies in business continuity and disaster recovery based on completed trainings and participation in business continuity tests.

[3] Recovery Time Objective: the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
Management Response

Recommendations - Planning

1. In order to better align the IT Business Continuity and Disaster Recovery Plan with the ST Emergency Management Plan, identify IT responsibilities and document the policies and procedures required to continue business operations after a disaster.
Management Response: Agree. Disaster Recovery/Business Continuity should be an Agency-level policy. Procedures for IT recovery will be in the form of updated Runbooks, as part of the 2016-2017 deliverables of the DR Program.

2. Assess the likelihood of business disruption for each incident type. This can help focus training and preparation based on incident type.
Management Response: Partly Agree. Incident types have been described in the IT Vulnerability Assessment Report in 2015. This assessment will influence how much investment we make in the DR program going forward. It will not likely be used for classifying incident response processes or training. This is not intended to be a repetitive task in the DR program.

3. Develop a backup and restore test plan that includes periodic testing of on-site and off-site data for critical systems.
Management Response: Partly Agree. Procedure #9 will be updated to include Backup and Recovery and will be completed by the end of July 2015.

4. Develop an annual business continuity and disaster recovery testing plan. A testing plan should include a schedule of types of testing, systems to test and a full test of the Disaster Recovery Plan.
Management Response: Not Agree. IT will conduct a planned DR test annually, beginning January 2016. IT does not have the resources to fully test the DR plan annually.

Recommendations - Monitoring

5. Identify all internal IT services and business processes that are critical to the agency. Creating and maintaining this list will help IT focus its resources.
Management Response: Partly Agree. A Business Impact Analysis was completed in the 2015 deliverables of the DR Program. This is not intended to be a repetitive task in the DR Program.

6. The IT Business Continuity and Disaster Recovery plan development should be reported to the TGT and executive management annually, to improve management controls and agency involvement in the process. The three-year, three-phase process to develop the plan was last presented to the TGT in June 2014.
Management Response: Partly Agree. The DR plan is scheduled to be presented to the TGT by 3Q2016.

7. Review the IT Data Center Disaster Recovery Plan on a regular basis against major changes to: agency organization, business processes, outsourcing arrangements, technologies, infrastructure, operating systems and application systems. Track and report the frequency of updates to the risk profile.
Management Response: Partly Agree. The DR Plan will be reviewed at a minimum every three years, beginning in 2019.

8. Ensure business impact assessments are revised when changes to agency business practices are identified.
Management Response: Partly Agree. Business impacts will be identified as part of new application/system rollouts. The BIAs are not intended to be a repetitive task.

9. Ensure that all changes in policy, plans, procedures, infrastructure, and roles and responsibilities are approved by agency management and communicated to appropriate agency staff.
Management Response: Partly Agree. Changes in DR/BC policy will be communicated as part of the Agency policy. Communication internal to IT may not take a formal communication channel. The DR program will not be requesting approval from anyone other than IT management, and the policy committee for policy changes.

10. Develop, track and report performance metrics for the IT Business Continuity and Disaster Recovery plan development project. The COBIT 5 framework for managing continuity recommends many performance metrics, which we have provided for consideration in Appendix II.
Management Response: Partly Agree. Given IT resources, this will not occur other than as a result of the annual IT DR exercise, which will provide Pass/Fail on the exercise and lessons learned.

11. Periodically assess adherence to the documented Disaster Recovery Plan.
Management Response: Agree. This will be completed with annual DR exercises.

12. Determine the effectiveness of the plan, continuity capabilities, roles and responsibilities, skills and competencies, resilience to incidents, technical infrastructure, and organizational structures and relationships.
Management Response: Agree. The effectiveness of the plan will be determined by the success or failure of annual DR exercises.

13. Identify weaknesses or omissions in the plan and capabilities and make recommendations for improvement. Track and report the percent of agreed-on improvements that have been reflected in the plan and the percent of issues identified that have been subsequently addressed in the plan.
Management Response: Not Agree. Tracking and reporting on a percentage basis will be overly cumbersome for the available resources and Agency commitment to the DR Program and therefore will not be developed.

Recommendations - Documenting

14. Determine whether agency divisions have developed operational business continuity plans for critical business processes and/or temporary processing arrangements, including links to plans of outsourced service providers. Track and report the percent of agency divisions satisfied that IT service delivery required in their continuity plans meets agreed-upon service levels.
Management Response: Partially Agree. During the Business Impact Analysis developed in the DR Program, IT has documented whether plans for business continuity exist. It will be the Agency COOP that provides continued oversight with the business to keep critical business processes updated.

15. Develop adequate Business Impact Assessments and Recovery Time Objectives (RTO). The assessments should include input from relevant stakeholders of each IT system and business function and should be documented and approved by stakeholders. The business impact assessment and RTO are listed as required outputs by COBIT 5.
Management Response: Completed. This task was completed during Phase 1 of the DR Program. Stakeholders filled out and went through a thorough review of the business processes and desired RTOs.

16. Ensure that key suppliers and outsource partners have effective continuity plans in place. Track and report the percent of critical key suppliers and outsource partners who do not have effective continuity plans in place.
Management Response: Agree. This task is being planned as part of the 2017 deliverables of the DR Program development.

17. Review and update the Information Technology Procedure No. 9, Backup and Retention Requirements: Production Environment.
Management Response: Agree. Updated as part of the 2015 DR Program.

18. Include a reference to system backup requirements in the policy and procedures required to support the IT Data Center Disaster Recovery Plan.
Management Response: Reference item 1.

19. Include systems, applications, data and documentation maintained or processed by third parties in the ST Procedure No. 9, Backup and Retention Requirements: Production Environment.
Management Response: Agree. Will be updated in the 2016 DR deliverables.

20. Improve the documentation of test results by:
   a. Recording the date of the test
   b. Recording the roles and responsibilities of test participants
   c. Labeling recommendations identified from the post-test debriefing and analysis
   d. Including review and approval signature from IT management
Management Response: Agree. This will be included as part of the annual IT DR exercises.

21. Document post-resumption review following the successful resumption of business processes after service interruption.
Management Response: Agree. This will be included as part of annual IT DR exercises and any significant actual incidents.

22. Obtain management approval of the post-resumption review documentation.
Management Response: Agree. This will be included as part of the annual IT DR exercise and any significant incident(s).

Recommendations - Training

23. Create an IT business continuity and disaster recovery training program. Track and report the percent of issues identified that have been subsequently addressed in the training materials.
Management Response: Not Agree. Developing a training program, tracking and reporting on a percentage basis will be overly cumbersome for the available resources and Agency commitment to the DR Program.

24. Define training requirements for agency staff performing continuity planning, impact assessments, risk assessments, media communication and incident response. Ensure that the training plans consider frequency of training and training delivery mechanism. Track and report the percent of internal and external stakeholders that have received training.
Management Response: Partly Agree. IT will develop and maintain a DR training program for appropriate staff. However, developing a training program, tracking and reporting on a percentage basis will be overly cumbersome for the available resources and Agency commitment to the DR Program.

25. Document agency staff competencies in business continuity and disaster recovery based on completed trainings and participation in business continuity tests.
Management Response: Partly Agree. IT will maintain a training inventory for staff to ensure key DR personnel have received proper training. However, developing a training program, tracking and reporting on a percentage basis will be overly cumbersome for the available resources and Agency commitment to the DR Program.
Appendix I: Description of COBIT 5 and ISO Ratings

In COBIT 5, an ISO/IEC 15504-compliant capability assessment system is used to assess whether process goals have been achieved. The capability level of a process is determined on the basis of the achievement of specific process attributes.

Table: COBIT 5 Process Capability Ratings (based on ISO/IEC 15504)

Level 5 - Optimizing: The level 4 predictable process is continuously improved to meet relevant current and projected business goals.
Level 4 - Predictable: The level 3 established process now operates within defined limits to achieve its process outcomes.
Level 3 - Established: The level 2 managed process is now implemented using a defined process that is capable of achieving its process outcomes.
Level 2 - Managed: The level 1 performed process is now implemented in a managed fashion (planned, monitored and adjusted) and its work products are appropriately established, controlled and maintained.
Level 1 - Performed: The implemented process achieves its process purpose.
Level 0 - Incomplete: The process is not implemented or fails to achieve its process purpose.

Attributes may also be rated with a standard qualification scale that is also defined in the ISO/IEC 15504 standard:

N - Not achieved: There is little or no evidence of achievement of the defined attribute in the assessed process.
P - Partially achieved: There is some evidence of an approach to, and some achievement of, the defined attribute in the assessed process. Some aspects of achievement of the attribute may be unpredictable.
L - Largely achieved: There is evidence of a systematic approach to, and significant achievement of, the defined attribute in the assessed process. Some weakness related to this attribute may exist in the assessed process.
F - Fully achieved: There is evidence of a complete and systematic approach to, and full achievement of, the defined attribute in the assessed process. No significant weaknesses related to this attribute exist in the assessed process.
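The N/P/L/F qualification scale above is conventionally tied to percentage bands of attribute achievement in ISO/IEC 15504 (N: 0-15%, P: over 15% up to 50%, L: over 50% up to 85%, F: over 85% up to 100%). A minimal sketch, assuming those conventional band boundaries; the function name is illustrative, not part of the standard:

```python
def attribute_rating(achievement_pct: float) -> str:
    """Map percent achievement of a process attribute to the ISO/IEC 15504
    N/P/L/F qualification scale, assuming the conventional band boundaries."""
    if not 0.0 <= achievement_pct <= 100.0:
        raise ValueError("achievement must be a percentage in [0, 100]")
    if achievement_pct <= 15.0:
        return "N"   # Not achieved
    if achievement_pct <= 50.0:
        return "P"   # Partially achieved
    if achievement_pct <= 85.0:
        return "L"   # Largely achieved
    return "F"       # Fully achieved

print(attribute_rating(40.0))  # P
```

Under this scale, the "Level 1 (Partial)" ratings in this report correspond to processes whose attributes show some, but unpredictable, evidence of achievement.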
Appendix II:

The COBIT 5 framework for managing continuity recommends many performance metrics, including the following:

a. Percent of critical business processes, IT services and IT-enabled business programs covered by risk assessment.
b. Number of significant IT-related incidents that were not identified in risk assessment.
c. Frequency of updates to the risk profile.
d. Number of business disruptions due to IT service incidents.
e. Percent of business stakeholders satisfied that IT service delivery meets agreed-on service levels.
f. Percent of users satisfied with the quality of IT service delivery.
g. Level of business user satisfaction with quality and timeliness (or availability) of management information.
h. Number of business incidents caused by non-availability of information.
i. Ratio and extent of erroneous business decisions where erroneous or unavailable information was a key factor.
j. Percent of IT services meeting uptime requirements.
k. Percent of successful and timely restoration from backup or alternate media copies.
l. Percent of backup media transferred and stored securely.
m. Number of critical business systems not covered by the plan.
n. Number of exercises and tests that have achieved recovery objectives.
o. Frequency of tests.
p. Percent of agreed-on improvements to the plan that have been reflected in the plan.
q. Percent of issues identified that have been subsequently addressed in the plan.
r. Percent of internal and external stakeholders that have received training.
s. Percent of issues identified that have been subsequently addressed in the training materials.
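Most of these metrics are simple ratios over tracked counts. The sketch below shows how two of them (items p and q) might be computed; all counts are hypothetical illustrations, not audit data:

```python
def percent(numerator: int, denominator: int) -> float:
    """Percentage of numerator over denominator, guarding against
    an empty population (returns 0.0 rather than dividing by zero)."""
    return 100.0 * numerator / denominator if denominator else 0.0

# Hypothetical tracking counts for metrics (p) and (q) above.
improvements_agreed = 12
improvements_reflected_in_plan = 9
issues_identified = 20
issues_addressed_in_plan = 15

print(percent(improvements_reflected_in_plan, improvements_agreed))  # 75.0
print(percent(issues_addressed_in_plan, issues_identified))          # 75.0
```

The arithmetic is trivial by design; the management-response objection in this report concerns the cost of collecting and maintaining the underlying counts, not the calculation itself.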