CU*ANSWERS DISASTER RECOVERY TEST GAP ANALYSIS
MAY 17, 2013

OVERVIEW

The annual CU*Answers Disaster Recovery Test was conducted on 5/7-5/9, 2013 at the IBM BCRS (Business Continuity and Recovery Services) hot-site facility in Sterling Forest, NY. The test incorporated the successful recovery of our core processing (CU*BASE/GOLD) production host and network, multiple third parties including ATM/Debit transactions, Credit Card and member statement processing, and proxy credit union verification over MPLS and VPN networks. This report highlights the key changes implemented for this test to increase the scope and complexity, identifies issues that surfaced requiring troubleshooting and/or procedural changes, and provides recommendations for improving the effectiveness of our disaster recovery program.

For the second year in a row, recovery teams were divided among the IBM hot-site facility and the CU*Answers secondary (HA) datacenter in Muskegon, MI. Recovery team members at the secondary datacenter accessed the recovered IBM host via remote access tools. By dividing the staff, we were once again able to rotate additional (new) team members operating in multiple shifts throughout the 60-hour event.

The outline of the recovery event is shown below. At each step, multiple tests and audits are performed to ensure system and data accuracy:

The first 24 hours of the test included the following activities:
- IBM host operating system installed and configured
- Network and USER security configured
- CU*BASE/GOLD environment and applications restored
- Credit union libraries and objects restored
- Daily operational tasks confirmed, concluding with full EOD/BOD system processing

The next 24 hours of the test included the following activities:
- Application testing including ItsMe247 (online banking) performed
- Connectivity and transaction testing with proxy third party vendors completed
- CU*BASE/GOLD availability and data verification testing by proxy credit union sites confirmed

The remaining 12 hours concluded the test with:
- Any pending issues that surfaced during the test are resolved or identified for new projects
- Test completion and remote host sanitization (full data wipe)
- Full test debriefing by Recovery Teams to document required changes to plans and procedures
- Gap analysis report generated and submitted

APPRECIATION FOR PARTICIPANTS

In addition to the internal staff members who are trained for specific roles and responsibilities for the CU*Answers Recovery Teams, we would like to extend a special thank you to the following clients and vendors for their participation in this year's successful disaster recovery test:
- State Transportation Employees Credit Union (Proxy credit union)
- Lake Huron Credit Union (Proxy credit union)
- Honor Credit Union (ACH Buddy Bank partner)
- SAGE Direct (Member credit card and end-of-month statement processing)
- Vantiv (ISOFTH) (Third party transactions)
- Federal Reserve Bank (ACH file transmissions)

IMPROVED COORDINATION OF FUTURE TESTS AND EXERCISES

For an ideal recovery test, the element of surprise contributes to the true measure of an organization's ability to recover within an acceptable timeframe. For the purpose of our recovery tests, we require the participation of multiple vendors and clients, all within a 60-hour window, reserved months in advance. To maximize this limited opportunity, careful preplanning and coordination are required. In past tests, this was performed in a decentralized manner, with each recovery team coordinating tests with their respective vendors and clients. With the increasing complexity and scope of tests being performed each year, the need for improved coordination became very evident during the 2013 recovery test.

For all recovery tests and exercises moving forward, communication and coordination with external vendors and clients will be performed by the event coordinator (Business Resumption Manager). This will include the selection of participating vendors and clients based on the determined business objectives of each recovery test. In addition, future tests will be designed and orchestrated with a progressive degree of difficulty to strengthen our procedures and problem solving skills.

ENHANCED SECURITY MEASURES ENFORCED AT RECOVERY SITES

One of the controls put in place for the purpose of this test included the blocking of traffic from each recovery site to the production network at the firewall level. This included access to VoIP phones, corporate email, network file and print services, production hosts, etc. Because this was the first annual test to strictly enforce these restrictions, recovery team members were forced to be creative in their troubleshooting methods, often crafting tools and processes on the fly, as might be the case in a true disaster scenario.

Another security control put in place was the omission of command-line access for recovery team members. In its place, a TOOLBOX menu (command-line replacement app) has been implemented in the production environment. As an additional security measure, recovery team USER profiles on all CU*BASE/GOLD production hosts were disabled for the duration of the test.

In hindsight, recovery team leaders had actually anticipated even more obstacles and issues than were observed, given the number of controls added to this year's test. The ability of teams to quickly identify these obstacles, evaluate their options and navigate around them provided a valuable learning experience. A key takeaway from these increased restrictions is the value of designing these simulated recovery events to closely reflect actual disaster scenarios.

SAGE DIRECT DISASTER RECOVERY TEST

Although this key statement processing vendor has participated in past recovery tests, 2013 marked the first time both parties performed recovery tests simultaneously, each involved as a participant in the other's test. For the purpose of this test, we were able to generate and securely send encrypted statement files from our recovered host in NY to our redundant FTP server at our secondary (HA) datacenter, and then to the redundant host at SAGE Direct's offsite office location for processing. From there the completed files were sent securely to SAGE Direct's alternate print provider (Presort Services) for printing and mailing. A CU*Answers recovery team member was on hand to witness the file reception, processing, and printing to confirm the vendor's ability to provide critical services in the event of a disaster. Details from the SAGE Direct recovery test are available in a separate report.

AUTOMATED CLEARING HOUSE TRANSMISSIONS

Multiple alternate options exist during a scenario in which the transmission of ACH files through our production datacenter is prohibited due to a service disruption or disaster. The preferred option is the utilization of a redundant FedLine VPN connection through our secondary (HA) datacenter. This redundant VPN option is tested multiple times each year during our HA rollover exercises. An alternate option is the use of our FRB Buddy Bank partnership with Honor Credit Union. Since this method had not been tested since 2011, it was included in the 2013 test. As noted elsewhere in this report, we learned that the process required from the FRB to initiate Buddy Bank transmissions had changed since our last test. These changes have been noted in our documentation for future use in case this method for ACH transmissions is required.

ITSME247 APPLICATION TESTING

Also included in the annual recovery test is the availability and functionality of the ItsMe247 application hosted on the redundant stand-by server pool at the secondary (HA) datacenter, retrieving data from the recovered host in NY. During this test, internal and remote users were able to point their browsers to the recovered site (without interruption to the production site), authenticate and check application features. For this test, servers hosting OBC (Online Banking Community) were not included, taking the users directly to the login prompt. For future tests, we will look to expand the scope and objectives for the ItsMe247 testing process, including a more direct presence and involvement from the ASP Programming Team.

SYNCHRONIZATION OF DATA RESTORED FOR TESTING

The rapid pace of new application revisions requires careful planning of data archiving strategies to ensure that recovery timelines can be met under unexpected circumstances. This includes system, application, and member data in a rotation mix of monthly, weekly, and twice-daily (EOD/BOD) backups, using a combination of full, incremental and differential strategies to optimize available time and resources. Media tapes are then rotated between off-site secure storage facilities to ensure availability should a recovery incident occur.

To maximize the 60-hour window of opportunity for recovery testing at the IBM BCRS hot-site location, a portion of the encrypted tapes required are shipped in advance of the recovery team. The most current daily tapes travel with the recovery team to the restoration site.

Recovery tests are planned in advance and are performed parallel to production core processing. This means that the application and configuration versions in the data restored on the test host may not match those in production at the time of the test. In an actual disaster scenario, there would be no application and configuration changes until the host has been recovered.

To better accommodate recovery tests that include participation with production clients at proxy credit union locations, the synchronization of data and application versions requires consideration of shipping, travel, and restoration timing. This was evident during this year's test, where proxy credit union PCs running current production version software attempted to access data on the restored host from a prior backup (for the purpose of testing). This slight difference in versions produced application error messages that, although easily correctable in this situation, could in fact present a greater impact under certain recovery scenarios (e.g., beta software users).

As noted elsewhere in this report, a project has been launched to reevaluate the process for archiving and restoring data to see where efficiencies can be implemented and the recovery of synchronized data and application versions made more effective. Recommended changes will be implemented and tested at the next host recovery opportunity.
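A version mismatch of the kind described above could be detected before proxy testing begins by comparing the application version restored on the recovery host with the version reported by each client PC. The following is a minimal Python sketch under stated assumptions: the `major.minor` version format, the site names, and the `check_sync` function are hypothetical illustrations, not how CU*BASE/GOLD actually identifies releases.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical 'major.minor' application version."""
    major: int
    minor: int

    @classmethod
    def parse(cls, text: str) -> Version:
        major, minor = text.strip().split(".")
        return cls(int(major), int(minor))


def check_sync(host_version: str, client_versions: dict[str, str]) -> list[str]:
    """Return the sorted names of clients whose version differs from the restored host."""
    host = Version.parse(host_version)
    return sorted(
        name
        for name, version in client_versions.items()
        if Version.parse(version) != host
    )


# Example with made-up values: one proxy site runs a newer release than the restored host.
out_of_sync = check_sync("13.0", {"proxy_a": "13.1", "proxy_b": "13.0"})
print(out_of_sync)
```

Running a check like this as part of the restoration checklist would flag the affected sites early, so that error messages during proxy testing are expected rather than discovered.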

GAP ANALYSIS

The remainder of this report highlights the issues noted and lessons learned as well as recommendations for changes to existing operations and/or future testing.

ISSUES NOTED AND LESSONS LEARNED

1. iSeries Administrators were not able to configure PCs at the secondary (HA) datacenter for IBM remote access control of the recovered host due to insufficient privileges on the local PCs. This was resolved by escalating privileges for iSeries Administrators on each recovery site PC.

2. As a result of the firewall air-gapping, network administrators at the secondary (HA) datacenter were not initially able to access/manage the firewall appliance at the IBM recovery site, nor were they able to access/manage proxy client firewall appliances for configuration during proxy testing. A configuration change was required on the firewall at the IBM recovery site.

3. Early in the restoration process it was learned that a required tape was not included in those that traveled with the recovery team members to the IBM BCRS facility. Vital information on this tape included necessary data to enable the decryption of the remaining tapes. This forced recovery teams to brainstorm alternate methods of transmitting the data to the recovery site while honoring the assumptions established for this test (that the production datacenter is not available), short of sending another team member on the next plane to NY. Recovery teams were able to securely transmit the needed data from an alternate location (surviving site) to the recovery host in NY. In addition, lost recovery time was offset by making effective use of multiple tape drives for portions of the restoration process.

4. During the restoration process, the recovery of the TOOLBOX (command-line replacement app) required the use of command-line access. This required intervention by an iSeries Administrator. Modifications to the TOOLBOX application will be made and tested during the next system restoration opportunity.

5. It was discovered that some of the encrypted system password databases stored on the monthly DR/DVD were not current versions. Recovery teams were able to retrieve current system password database files from an alternate location at a surviving site. See Future Recommendations below.

6. Restoring the daily configuration file on the redundant 400ftp server did not create new directories. This was due to a permissions issue on the stand-by host. (Issue resolved for future tests.)

7. Procedures for initiating and configuring FRB ACH Buddy Bank had changed since the last time this process was tested (2011). Documentation has been updated.

FUTURE RECOMMENDATIONS

1. A tape containing critical data required to decrypt all other tapes did not travel with the recovery team to the remote location. The procedures followed before recovery teams leave for the remote site need to be reevaluated to ensure travel teams are properly prepared and equipped (media, equipment, plan documents and diagrams, etc.). These procedures should be documented in a manner that assumes a hurried scenario during an actual disaster (confirmation steps).

2. Server-side application data restored on the recovered host did not match the client-side (production) application version, resulting in some version error messages for proxy credit unions.
   a. The procedures for selecting and restoring data on the recovery host should be reviewed to ensure that recovery tests include data that is as current as feasible.
   b. CU*BASE/GOLD USER passwords that had been changed in the days since the media was shipped to the recovery site resulted in some proxy test participants and recovery team members needing to recall their previous passwords.
   c. System errors were generated during the EOD procedures regarding missing project libraries due to incorrect application versions restored on the recovery host.

3. System and application passwords that are critical during the first hours of a recovery need to be accessible to all critical business units company-wide, not just IT recovery team members. Although select password databases are archived, a project should be considered to archive all employees' password databases and to replicate those databases to surviving datacenters in the case of an actual disaster.

4. Reevaluate the procedures for archiving data stored on the monthly DR/DVD media to ensure data is current, and identify additional files and documentation that would be beneficial during an actual recovery event.

5. Future tests involving ItsMe247 should include an increase in participation from the ASP Programming Team as recovery scope and objectives are enhanced (especially due to recent personnel changes within the department).
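The confirmation steps called for in recommendation 1 could be supported by a simple pre-departure manifest check. The Python sketch below is illustrative only: the media labels (`KEY-TAPE`, `DAILY-1`, `MONTHLY-1`) and the function names are hypothetical, and the real manifest would come from the recovery plan documents.

```python
from __future__ import annotations


def missing_media(required: set[str], packed: set[str]) -> list[str]:
    """Return required items that are not in the travel kit, in sorted order."""
    return sorted(required - packed)


def confirm_departure(required: set[str], packed: set[str], critical: set[str]) -> dict:
    """Summarize whether the travel team may leave.

    'blocking' lists missing items flagged as critical, such as the media
    needed to decrypt the other tapes (hypothetical label: KEY-TAPE).
    """
    missing = missing_media(required, packed)
    blocking = [item for item in missing if item in critical]
    return {"ready": not missing, "missing": missing, "blocking": blocking}


# Example with made-up labels: the decryption key tape was left behind.
result = confirm_departure(
    required={"KEY-TAPE", "DAILY-1", "MONTHLY-1"},
    packed={"DAILY-1", "MONTHLY-1"},
    critical={"KEY-TAPE"},
)
print(result)
```

Separating critical items from the rest keeps the check usable in a hurried scenario: a team could still decide to travel with a non-critical item missing, but a blocking item stops departure.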