e Shandor Simon Director, Networking Services Latin School of Chicago

Similar documents

Managing, Maintaining Data in a Virtual World

29/07/2010. Copyright 2010 Hewlett-Packard Development Company, L.P.

Veeam Backup and Replication Architecture and Deployment. Nelson Simao Systems Engineer

Daly Computers Webinar for MEEC: P4000 SAN Solutions

Table of contents. Matching server virtualization with advanced storage virtualization

DeltaV Virtualization High Availability and Disaster Recovery

(Formerly Double-Take Backup)

The future is in the management tools. Profoss 22/01/2008

VMware and VSS: Application Backup and Recovery

Top Ten Private Cloud Risks. Potential downtime and data loss causes

HRG Assessment: Stratus everrun Enterprise

NETWORK SERVICES WITH SOME CREDIT UNIONS PROCESSING 800,000 TRANSACTIONS ANNUALLY AND MOVING OVER 500 MILLION, SYSTEM UPTIME IS CRITICAL.

Top 10 downtime and data-loss risks. Quick Reference Guide for Private Cloud Disaster Recovery and High Availability Protection

Bosch Video Management System High Availability with Hyper-V

CASE STUDY: INTERNET SYSTEMS CONSORTIUM (ISC)

Using Emergency Restore to recover the vcenter Server has the following benefits as compared to the above methods:

HP P4000 SAN Solutions

BEST PRACTICES GUIDE: VMware on Nimble Storage

M6422A Implementing and Managing Windows Server 2008 Hyper-V

HP + Veeam: Fast VMware Recovery from SAN Snapshots

Availability Guide for Deploying SQL Server on VMware vsphere. August 2009

UNSTOPPABLE COMPUTING ENTERPRISE NETWORK AVAILABILITY AT AN SMB PRICE

- Brazoria County on coast the north west edge gulf, population of 330,242

Improving availability with virtualization technology

How To Fix A Fault Fault Fault Management In A Vsphere 5 Vsphe5 Vsphee5 V2.5.5 (Vmfs) Vspheron 5 (Vsphere5) (Vmf5) V

Virtualization Backup/Replication Solution Comparison:

Building Storage Service in a Private Cloud

Flash Use Cases Traditional Infrastructure vs Hyperscale

VMware Virtual Machine File System: Technical Overview and Best Practices

Granular recovery from SAN Snapshots

EMC VPLEX FAMILY. Continuous Availability and data Mobility Within and Across Data Centers

Real-time Protection for Hyper-V

DR-to-the- Cloud Best Practices

South Colonie Central School District Server Virtualization and Disaster Recovery Plan

Shared Machine Room / Service Opportunities. Bruce Campbell November, 2011

Internet Resiliency and Recovery

Disaster Recovery Solution Achieved by EXPRESSCLUSTER

Leveraging Virtualization for Disaster Recovery in Your Growing Business

Enterprise Storage Services (ESS) for Windows

In addition to their professional experience, students who attend this training should have technical knowledge in the following areas.

Protecting SQL Server in Physical And Virtual Environments

NETAPP WHITE PAPER USING A NETWORK APPLIANCE SAN WITH VMWARE INFRASTRUCTURE 3 TO FACILITATE SERVER AND STORAGE CONSOLIDATION

A Project Summary: VMware ESX Server to Facilitate: Infrastructure Management Services Server Consolidation Storage & Testing with Production Servers

Support Guide Comprehensive Hosting at Nuvolat Datacenter

Server and Storage Virtualization with IP Storage. David Dale, NetApp

DISASTER RECOVERY WITH AWS

Quorum DR Report. Top 4 Types of Disasters: 55% Hardware Failure 22% Human Error 18% Software Failure 5% Natural Disasters

Complete Storage and Data Protection Architecture for VMware vsphere

High-Availability Fault Tolerant Computing for Remote and Branch Offices HA/FT solutions for Cisco UCS E-Series servers and VMware vsphere

MS 20417B: Upgrading Your Skills to MCSA Windows Server 2012

Implementing and Managing Windows Server 2008 Hyper-V

Planning and Deploying a Disaster Recovery Solution

ovirt and Gluster hyper-converged! HA solution for maximum resource utilization

ovirt and Gluster hyper-converged! HA solution for maximum resource utilization

Online Storage Replacement Strategy/Solution

Downtime, whether planned or unplanned,

Microsoft SMB File Sharing Best Practices Guide

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

6422: Implementing and Managing Windows Server 2008 Hyper-V (3 Days)

Vess A2000 Series HA Surveillance with Milestone XProtect VMS Version 1.0

Designing, Optimizing and Maintaining a Database Administrative Solution for Microsoft SQL Server 2008

Belgacom Group Carrier & Wholesale Solutions. ICT to drive Your Business. Hosting Solutions. Datacenter Services

Disaster Recovery Cookbook Guide Using VMWARE VI3, StoreVault and Sun. (Or how to do Disaster Recovery / Site Replication for under $50,000)

WASHTENAW COUNTY FINANCE DEPARTMENT Purchasing Division

Solution Brief Availability and Recovery Options: Microsoft Exchange Solutions on VMware

What s in Installing and Configuring Windows Server 2012 (70-410):

Surround SCM Backup and Disaster Recovery Solutions

DISASTER RECOVERY. Omniture Disaster Plan. June 2, 2008 Version 2.0

BME CLEARING s Business Continuity Policy

Big company HA does not have to be complicated or expensive for SMBs

Westek Technology Snapshot and HA iscsi Replication Suite

Storage Infrastructure as a Service! Andres Rodriguez, CEO

Top 5 reasons to virtualize Exchange

How To Fix A Powerline From Disaster To Powerline

Module: Business Continuity

June Blade.org 2009 ALL RIGHTS RESERVED

CONVERGED VIRTUAL STORAGE

EMC Integrated Infrastructure for VMware

Softverski definirani data centri - 2. dio

MS Design, Optimize and Maintain Database for Microsoft SQL Server 2008

Overview of I/O Performance and RAID in an RDBMS Environment. By: Edward Whalen Performance Tuning Corporation

Deployment Topologies

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

RFP-MM Enterprise Storage Addendum 1

Version: Page 1 of 5

Appendix C to DIR Contract Number DIR-TSO-2736 SunGard Availability Services Discount Level: 25% Managed Data Center Services - Cloud Hosting

The Promise of Virtualization for Availability, High Availability, and Disaster Recovery - Myth or Reality?

McAfee Endpoint Encryption Hot Backup Implementation

HCS Redundancy and High Availability

Availability Digest. Stratus Avance Brings Availability to the Edge February 2009

Barracuda Backup Server. Introduction

Evolving Datacenter Architectures

Virtualized Disaster Recovery from VMware and Vision Solutions Cost-efficient, dependable solutions for virtualized disaster recovery and business

Trends in Application Recovery. Andreas Schwegmann, HP

Leveraging Virtualization in Data Centers

AUTOMATED DISASTER RECOVERY SOLUTION USING AZURE SITE RECOVERY FOR FILE SHARES HOSTED ON STORSIMPLE

Bosch Video Management System High availability with VMware

NETGEAR SMB Storage Line Update and ReadyNAS 2100 Introduction

Transcription:

Lessons Learned about Consolidation, Virtualization at and Disaster Recovery e Shandor Simon Director, Networking Services Latin School of Chicago

Agenda The Latin School of Chicago Why redundancy and resiliency is important to us How it was supposed to work What went wrong? How did we respond? What have we learned? Results Q&A

Latin School of Chicago Founded in 1888, The Latin School of fchicago is a private independent day school located on the Near North Side of Chicago 1,100 students in JK through 12 th Grade 3 Buildings, 300,000 sq ft 250+ Faculty/Staff 80+ servers, 2 Data Centers 700+ school-owned computers

Latin School of Chicago Technology Environment Academic Technology (projectors & interactive whiteboards in every classroom, class websites, blogs, wikis, etc.) Administrative Technology (accounting, payroll, development, attendance, grades, etc.) Operational Technology (building controls, access control, surveillance, etc.) Technology Infrastructure (DNS, DHCP, Active Directory, etc.)

The Challenge Our school has become dependent on our technology infrastructure functioning. In our new building, the network must work for the classroom doors to unlock. We have invested in redundant d server, storage and network virtualization, multiple server rooms. Yet we still had a failure. What happened, and more importantly what can we learn?

Virtual Server Infrastructure VMware Virtual Infrastructure 3.5 Cisco VSS Lefthand Networks SAN/iQ

How VM failure is supposed to work Virtual Machine Host Virtual Machine Host Virtual Machine Host

How our iscsi SAN is supposed to work Our SAN is a software based iscsi solution that runs on commodity hardware. We have four servers, each with 6TB of RAID 5 storage. The SAN writes the same data to at least two different servers, and spreads the data across all four servers. The SAN is configured to write one copy of the data to an even server, the other to an odd server. 1 2 3 4

How our iscsi SAN is supposed to work By placing all the even servers in one building, 1 and the odd servers in another, we will not lose data with the loss of one location. 3 If the link between the sites fail, it is important t to allow changes to happen only on one side of the SAN. Our vendor makes use of a quorum where one side has the most votes. We have a special 2 tie-breaking server. 4

Network Redundancy Our core virtual switch / router physically spans two locations. Two 10 Gb links connect the two physical boxes. The VSS switch appears as a single logical l switch to the rest of the network. All of our access switches are wired back to both locations. Either physical box can fail without t impacting the core network functions. We have redundant Internet connections to We have redundant Internet connections to both locations, and redundant firewalls

Facilities Utility Power Generator UPS Power Distribution Cooling

Facilities Middle School Server Room Upper School Server Room

Putting it together US Virtual Machine Host Virtual Machine Host Virtual Machine Host Virtual Machine Host Virtual Machine Host MS US Utility Power Cooling Generator UPS Utility Power Power Distribution Cooling Generator UPS

What we expected We could survive failure of either data center. Servers in a failed location would restart in the other location. That the overall reliability of the system was sufficiently high. That our setup could handle failure of an entire data center location, or of a component on one side.

Expectations US Virtual Machine Host Virtual Machine Host Virtual Machine Host Virtual Machine Host Virtual Machine Host MS US Utility Power Cooling Generator UPS Utility Power Power Distribution Cooling Generator UPS

Actual Failure FAIL FAIL Virtual Machine Host Virtual Machine Host Virtual Machine Host Virtual Machine Host Virtual Machine Host FAIL Utility Power Cooling Generator UPS Utility Power Generator FAIL Power Distribution Cooling FAIL UPS

What went wrong? We didn t properly account for load on our PDUs under full power, nor account for a cascading failure caused by servers with multiple power supplies shifting load. Since our tie-breaking server was improperly placed, our SAN did not achieve a quorum at the site that survived, and went read-only. Functioning i virtual servers crashed at the good site when their disks went read-only, and failed servers were unable to be restarted. A ti Di t D N S i d DHCP ll t d Active Directory, Doman Name Service, and DHCP all went down. AD and DNS should have survived, but the backup server that hosted the VMs that did not use the SAN was down for maintenance.

Our response Switched to DNS/DHCP appliances that can failover. Moved database and mail servers off of SAN (performance win), and replicate the data between data centers twice a day. Added additional PDUs and spread the load across them. Enhanced monitoring of data center facilities to a service that notifies via phone and uses a call list. Moved SAN tie-breaking server outside of either data center. During scheduled downtime, we test that our failover works the way we expect it to.

Recovery Time vs. Recovery Point Recovery Time: how long it takes to recover from a failure. Recovery Point: how much data is lost. In our initial design we focused on preventing failure and data loss. We did not plan on recovery time, and it showed (took way too long to get back on-line). Our users would prefer to lose a few hours of mail than a few hours of work. We now assume everything will fail, and run recovery exercises to see how long it takes us to recover.

Results Our air conditioner failed again, but this time did not cause additional failures. With better notification, we remotely forced our infrastructure to failover to the other site, and shut down our equipment, dealing with the problem in the morning. In another case a PDU failed, causing half of our SAN to go down. The SAN continued to function out of the other site. Recently we were able to recover from a server failure by reverting to a replicated copy, causing only a 15-minute outage and minimal data loss.

Avoiding our mistakes Make sure you understand the limitations of your facilities infrastructure (cooling, power, etc.) Test how your infrastructure reacts to failure, considering how different systems interact. Understand that recovery time might be as or more important then recovery point. Watch out for redundant parts of your infrastructure having the same dependences (ex: our redundant servers all using the same SAN).

Shandor Simon Director, Networking Services Latin School of Chicago