
Perforce Backup Strategy & Disaster Recovery at National Instruments
Steven Lysohir

Why This Topic?
- Case study on a large Perforce installation
- Something for smaller sites to ponder as they grow
- Stress the importance of planning for a disaster
- Entertainment

Topics
- Personal & Company Intro
- NI's Development Environment
- Perforce Architecture
- Backup Strategy
- Real Life Disasters & Lessons Learned
- Best Practices
- Questions?

Personal & Company Info
Steven Lysohir
- Systems Analyst focused mainly on global Perforce support at National Instruments
- 3 years in this role
National Instruments
- Produces hardware and software for the Test & Measurement industry (PXI chassis & PCI cards, LabVIEW)
- Global company headquartered in Austin, TX
- Sales & Support offices worldwide
- Distributed development (R&D branches)

Development Environment
- Distributed development
- Perforce used in 10 countries
- Seven servers globally
- Server operating systems vary from Linux AS to Windows 2000/2003
- Perforce Server versions range from 2001.1 to 2004.2
- Used across the organization (multiple business units)

Global Architecture

Dev Environment: Main Server
Main Perforce server in Austin, TX
Perforce metrics for main server:
  Number of Users            1,050
  Number of Versioned Files  4,800,000
  Size of Depots             600 GB
  Size of Database           21 GB
  Number of Changelists      1,000,000
  Number of Clients          3,500
  Number of Commands Daily   100,000

Hardware Architecture: Servers
- Two identical servers
  - Dell PowerEdge 2600
  - Dual 3.2 GHz Xeon processors
  - 4 GB RAM
  - Dual Gigabit network adapters
  - 8 x 36 GB U320 15K drives
    - 2-disk RAID 1 array
    - 6-disk RAID 10 array
- First server is the primary Perforce production server
- Second server has multiple roles
  - Failover server
  - Runs custom scripts
  - Storage for checkpoint and journal files

Hardware Architecture: Storage
- Network Attached Storage (NAS) device
  - NetApp FAS 940c cluster
  - Capable of performing point-in-time copies of data (snapshots)
  - Connected to Perforce server through shared Gigabit over LAN

Architecture Overview

Application Architecture
- Journal & log files stored on RAID 1 array
- Database stored on RAID 10 array
- Versioned files on NAS appliance
- Offline database stored in offline directory on RAID 10 array

Architecture Benefits
- Identical servers
  - Failover with limited user impact
  - One step closer to clustering
- NAS solution
  - Reliability
  - Recovery
  - Scalability
  - Performance
- Offline database
  - Ability to perform nightly checkpoints without locking the production database

Backup Strategy
- Weekly checkpoint of production database
- Daily checkpoint of offline database
- Multiple daily snapshots & journal copies
- Nightly tape backups

Backup Strategy: Weekly Checkpoint

Backup Strategy: Daily Checkpoint

Backup Strategy: Snapshots
- Snapshot of versioned files every 4 hours
- Copy of P4 journal every 4 hours
- The timing of these two events coincides to maintain data integrity
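
The pairing of snapshot and journal copy described above could be sketched roughly as follows. This is only an illustration, not the scripts used at NI: `backup_cycle`, the `take_snapshot` hook, and the `journal.<label>` naming are all hypothetical, and the real snapshot call depends on the storage vendor's tooling.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Callable


def backup_cycle(journal: Path, dest_dir: Path,
                 take_snapshot: Callable[[str], None],
                 now: datetime) -> Path:
    """Run one cycle: snapshot the versioned files and copy the journal together."""
    # One shared label ties the snapshot and the journal copy to the same moment.
    label = now.strftime("%Y%m%d-%H%M")
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical hook: how a snapshot is triggered is vendor specific.
    take_snapshot(label)
    # Copy the journal right afterwards so the pair stays consistent.
    target = dest_dir / f"journal.{label}"
    shutil.copy2(journal, target)
    return target
```

A scheduler (cron or Windows Task Scheduler, for example) would call `backup_cycle` every 4 hours. Using one label for both artifacts makes it easy to tell later which journal copy belongs to which snapshot.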

Backup Strategy: Benefits
- Checkpoints & journal copy
  - Always a copy of the DB available on disk, current to within 4 hours
  - Switching to failover is more efficient since all data already resides on the failover server
- Snapshots
  - Ability to restore data directly from disk
  - No file locking during backups
  - Able to create backups on the fly during business hours

Backup Strategy: Tape Backups
- Application & failover servers receive full, nightly backups
  - P4 database & P4 journal excluded
  - Backed up to tape through offline database & journal copy
- Versioned files on NAS device
  - Receive full, nightly backups from a snapshot
  - Eliminates file locking

Backup Strategy: Test Restores
- Test restore of all versioned files and checkpoints performed every 6 months

Disaster 1: Untested Recovery Plan
Background
- Moving depots from one share to another
- Deleting depots from old share after move complete
Issue
- Wrong depot deleted from original share
- Delete occurred around 11:30 PM and error not realized until next business day
- Approximately 5 GB of versioned files deleted from file system (and right before a release!)

Disaster 1: Where Do We Stand?
Current State
- Developers could continue to work on unaffected files
- Deleted files reported librarian errors in Perforce
Initial Response Plan
- Notify users
- Recover as many files from the snapshot as possible
- Restore remaining files from tape
- Run p4 verify to check for data integrity

Disaster 1: Roadblocks
- Only 30% of the data could be recovered from snapshot
  - Recovered files were randomly located
  - No easy process to identify missing files
- The full depot had to be recovered from tape
  - Restore of this magnitude was never tested
  - Restore continued to fail
- No clear communication to users on status

Disaster 1: Final Resolution
- Storage & backup vendors contacted for support
- Custom work-around finally enabled the restore to complete successfully
- Restored full depot to restore directory on NAS device
- Script written to identify files that needed to be copied to production share
- Files copied through another script
- p4 verify run to test for data integrity
- Users notified of successful recovery
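
The two scripts mentioned above (one to identify files, one to copy them) are not shown in the slides. A minimal sketch of the idea, assuming the restore directory mirrors the layout of the production share, might look like this. The function names `find_missing` and `copy_missing` are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def find_missing(restore_root: Path, production_root: Path) -> List[Path]:
    """Return relative paths present in the restore tree but absent in production."""
    missing = []
    for path in sorted(restore_root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(restore_root)
            if not (production_root / rel).exists():
                missing.append(rel)
    return missing


def copy_missing(restore_root: Path, production_root: Path,
                 missing: List[Path]) -> None:
    """Copy only the listed files into production, creating directories as needed."""
    for rel in missing:
        target = production_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(restore_root / rel, target)
```

Keeping identification and copying separate lets the list be reviewed before anything touches the production share. It also never overwrites a file that already exists in production, so work submitted after the deletion is left alone.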

Disaster 1: Lessons Learned
Good News
- All but 3 files were able to be recovered
- This happened on a Friday
Bad News
- Took 3 days to perform untested recovery from tape
Opportunities for Improvement
- Test the restore process & document the procedure
- Create more frequent snapshots
- Develop clear channels of communication
- Document a disaster recovery plan

Disaster 2: Benefit of Frequent Snapshots
Background
- Implemented new backup hardware
- Performed test restore of all Perforce versioned files
Issue
- Bug in backup software led to restore over production data
- Error realized within 15 minutes of test restore
- Roughly 20% of production versioned files replaced with zero-length files

Disaster 2: Where Do We Stand?
Current State
- Developers could continue to work on unaffected files
- Corrupted files reported RCS errors in Perforce
Initial Response Plan
- Notify users
- Identify corrupted files
- Restore corrupted files from snapshot
- Run p4 verify to check for data integrity

Disaster 2: Solution
- Perl script written to identify files
- Used Perl script from last disaster to recover files from snapshot to production share
- p4 verify run to test for data integrity
- Lost/unrecoverable revisions obliterated from Perforce
- Owners of these revisions notified
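
The original identification and recovery scripts were written in Perl and are not reproduced in the slides. The following Python sketch only illustrates the approach under stated assumptions: corrupted files are recognizable by their zero length, and the snapshot is reachable as a directory tree with the same layout as the production share. `find_zero_length` and `restore_from_snapshot` are hypothetical names.

```python
import shutil
from pathlib import Path
from typing import List, Tuple


def find_zero_length(production_root: Path) -> List[Path]:
    """Return relative paths of files in production that are zero bytes long."""
    return [p.relative_to(production_root)
            for p in sorted(production_root.rglob("*"))
            if p.is_file() and p.stat().st_size == 0]


def restore_from_snapshot(snapshot_root: Path, production_root: Path,
                          damaged: List[Path]) -> Tuple[List[Path], List[Path]]:
    """Copy each damaged file back from the snapshot when a usable copy exists."""
    restored, unrecoverable = [], []
    for rel in damaged:
        source = snapshot_root / rel
        # A missing or empty snapshot copy cannot repair the file.
        if source.is_file() and source.stat().st_size > 0:
            shutil.copy2(source, production_root / rel)
            restored.append(rel)
        else:
            unrecoverable.append(rel)
    return restored, unrecoverable
```

The `unrecoverable` list corresponds to the revisions that had to be obliterated and whose owners were notified: files created or changed after the last snapshot have no usable copy to restore from.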

Disaster 2: Snapshot Benefits
- Ease of recovery
  - Simple system copy commands
  - Ability to quickly automate recovery
- Speed of recovery
  - Issue was resolved in 8 hours (from discovery to resolution)
  - Actual recovery took 2 hours vs. 3 days recovering from tape
- Ability to restore 99% of corrupted files
  - Snapshot of files taken at 12:00 PM
  - Corruption occurred at 12:30 PM

Disaster 2: Lessons Learned
Roadblocks
- No clear channel of user communication
  - Users not notified of status
  - Perforce admin bombarded with support calls
Opportunities for Improvement
- The admin working on a technical solution should not have the burden of user communication
- Funnel all communication through IT Operations group
- Document a disaster recovery plan (still lacking any documentation)

Disaster 3: Application Server Crash
Background
- Users started to experience extremely slow performance from the Perforce application
- Some users could not connect to the Perforce server
- All file access on the Perforce server came to a virtual halt
- Windows failed to start on 2nd reboot of server
Issue
- RAID controller on Perforce server crashed

Disaster 3: Where Do We Stand?
Current State
- Server and application unavailable
- Production database unavailable
- Versioned files unaffected
- Journal file recovered (copied before 2nd reboot)
Next Steps
- Notify users
- Rebuild database from checkpoint on failover server
- Replay journal into database
- Switch production application to failover server

Disaster 3: Solution
- Communication to users channeled through IT Operations group
- Copied journal file from production server to failover server (was possible before second reboot)
- Rebuilt database from most current checkpoint on failover server
- Replayed journal file into rebuilt database
- Swapped names and IP addresses for production and failover servers
- Started Perforce service and crossed fingers
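
The rebuild and replay steps above map onto the journal-restore option of the Perforce server, `p4d -jr`. The sketch below only builds the command lines. The helper names and the example paths are hypothetical, and it assumes the server root on the failover machine holds no old `db.*` files before the checkpoint is restored.

```python
import subprocess
from pathlib import Path
from typing import List


def recovery_commands(p4root: Path, checkpoint: Path,
                      journal: Path) -> List[List[str]]:
    """Build the p4d invocations: restore the checkpoint, then replay the journal."""
    return [
        # Step 1: rebuild the database files from the most recent checkpoint.
        ["p4d", "-r", str(p4root), "-jr", str(checkpoint)],
        # Step 2: replay the copied journal on top of the rebuilt database.
        ["p4d", "-r", str(p4root), "-jr", str(journal)],
    ]


def run_recovery(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `recovery_commands(Path("/p4/root"), Path("/backup/checkpoint.42"), Path("/backup/journal"))` returns the two commands for inspection before `run_recovery` executes them. Separating the building of commands from running them makes it possible to document and rehearse the procedure without touching a live server.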

Disaster 3: Benefits of Architecture
- Failover server
  - Failover with limited effort (journal copy & name/IP change)
  - Failover with no impact to users (other than downtime)
- NAS device (external storage)
  - Eliminated the need to restore versioned files
  - Preserved data integrity of versioned files

Disaster 3: Lessons Learned
- Journal file should be backed up (copied) more frequently
  - Copy journal file on same schedule as snapshots
- Need to finally document a disaster recovery plan

Best Practices: Backup Strategy
- Frequent checkpoints
- Frequent copies of journal file
- Point-in-time copies (even disk-based backups) can speed up recovery times
- Test your ability to restore data
- Have some type of failover server in place that stores the most recent Perforce data
- Your backup/restore process is the first and most crucial step in disaster recovery

Best Practices: Disaster Recovery
- Set up clear channels of communication
- Have a plan and have it documented
- Have related Perforce documentation, specific to your site
- Make your documentation idiot proof
- Test recovery scenarios
- Be able to verify your recovery was successful
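
Inside Perforce, p4 verify covers the last point by checking stored file digests. For test restores to a scratch directory, a filesystem-level check in the same spirit could be sketched as below. It assumes a manifest of MD5 digests was recorded from the source tree beforehand. All function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's relative path (POSIX style) to its MD5 digest."""
    return {p.relative_to(root).as_posix(): md5_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify_restore(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose digest differs."""
    problems = []
    for rel, expected in sorted(manifest.items()):
        path = root / rel
        if not path.is_file() or md5_of(path) != expected:
            problems.append(rel)
    return problems
```

An empty list from `verify_restore` means every file in the manifest was restored intact. Anything else is a concrete list of files to investigate.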

Final Note
Make your Perforce environment as redundant as possible. If that can be accomplished, you may never have to resort to your disaster recovery plan.

Questions?