Highly-Available Distributed Storage. UF HPC Center Research Computing University of Florida



Similar documents
New Storage System Solutions

Lustre failover experience

Optimizing Large Arrays with StoneFly Storage Concentrators

Introduction to Gluster. Versions 3.0.x

Mellanox Cloud and Database Acceleration Solution over Windows Server 2012 SMB Direct

Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre

High-Availability and Scalable Cluster-in-a-Box HPC Storage Solution

Block based, file-based, combination. Component based, solution based

Cluster Implementation and Management; Scheduling

NetApp High-Performance Computing Solution for Lustre: Solution Guide

Architecting a High Performance Storage System

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

InfiniBand Software and Protocols Enable Seamless Off-the-shelf Applications Deployment

Performance Analysis of RAIDs in Storage Area Network

GPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"

Lessons learned from parallel file system operation

EMC SYMMETRIX VMAX 20K STORAGE SYSTEM

Implementing Storage Concentrator FailOver Clusters

Chapter 10: Mass-Storage Systems

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services

Current Status of FEFS for the K computer

Building a Scalable Storage with InfiniBand

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

Traditionally, a typical SAN topology uses fibre channel switch wiring while a typical NAS topology uses TCP/IP protocol over common networking

SAN Conceptual and Design Basics

Lustre SMB Gateway. Integrating Lustre with Windows

The safer, easier way to help you pass any IT exams. Exam : Storage Sales V2. Title : Version : Demo 1 / 5

Large Scale Storage. Orlando Richards, Information Services LCFG Users Day, University of Edinburgh 18 th January 2013

Multiple Public IPs (virtual service IPs) are supported either to cover multiple network segments or to increase network performance.

Lustre Networking BY PETER J. BRAAM

PADS GPFS Filesystem: Crash Root Cause Analysis. Computation Institute

LUSTRE FILE SYSTEM High-Performance Storage Architecture and Scalable Cluster File System White Paper December Abstract

Using EonStor FC-host Storage Systems in VMware Infrastructure 3 and vsphere 4

IBM System x GPFS Storage Server

Cisco Active Network Abstraction Gateway High Availability Solution

High Throughput File Servers with SMB Direct, Using the 3 Flavors of RDMA network adapters

SMB Direct for SQL Server and Private Cloud

Application Performance for High Performance Computing Environments

Expert Reference Series of White Papers. Planning for the Redeployment of Technical Personnel in the Modern Data Center

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

HITACHI VIRTUAL STORAGE PLATFORM FAMILY MATRIX

The functionality and advantages of a high-availability file server system

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

Introduction to Infiniband. Hussein N. Harake, Performance U! Winter School

SMB Advanced Networking for Fault Tolerance and Performance. Jose Barreto Principal Program Managers Microsoft Corporation

HPC Advisory Council

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Solution Brief July All-Flash Server-Side Storage for Oracle Real Application Clusters (RAC) on Oracle Linux

Volume Replication INSTALATION GUIDE. Open-E Data Storage Server (DSS )

HP iscsi storage for small and midsize businesses

<Insert Picture Here> Refreshing Your Data Protection Environment with Next-Generation Architectures

Implementing a Digital Video Archive Using XenData Software and a Spectra Logic Archive

Feature Comparison. Windows Server 2008 R2 Hyper-V and Windows Server 2012 Hyper-V

PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation

How To Write An Article On An Hp Appsystem For Spera Hana

Chapter 12: Mass-Storage Systems

SFA10K-X & SFA10K-E. ddn.com. Storage Fusion Architecture TM. DDN Whitepaper

Implementing an Automated Digital Video Archive Based on the Video Edition of XenData Software

Building a Flash Fabric

High Performance Computing OpenStack Options. September 22, 2015

How To Create A Multi Disk Raid

IBM BladeCenter H with Cisco VFrame Software A Comparison with HP Virtual Connect

Lustre & Cluster. - monitoring the whole thing Erich Focht

Solutions Brief. Unified All Flash Storage. FlashPoint Partner Program. Featuring. Sustainable Storage. S-Class Flash Storage Systems

SUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION

Virtual SAN Design and Deployment Guide

Zadara Storage Cloud A

Contingency Planning and Disaster Recovery

Using Multipathing Technology to Achieve a High Availability Solution

Connecting the Clouds

PCI Express and Storage. Ron Emerick, Sun Microsystems

Open-E Data Storage Software and Intel Modular Server a certified virtualization solution

Hyper-V over SMB: Remote File Storage Support in Windows Server 2012 Hyper-V. Jose Barreto Principal Program Manager Microsoft Corporation

OVERVIEW. CEP Cluster Server is Ideal For: First-time users who want to make applications highly available

Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture

High Availability with Windows Server 2012 Release Candidate

THE SUN STORAGE AND ARCHIVE SOLUTION FOR HPC

Moving Virtual Storage to the Cloud

Sonexion GridRAID Characteristics

How to Choose your Red Hat Enterprise Linux Filesystem

Fine-grained File System Monitoring with Lustre Jobstat

WHITEPAPER: Understanding Pillar Axiom Data Protection Options

HITACHI VIRTUAL STORAGE PLATFORM FAMILY MATRIX

Private cloud computing advances

Storage Technologies for Video Surveillance

Implementing a Digital Video Archive Based on XenData Software

MESOS CB220. Cluster-in-a-Box. Network Storage Appliance. A Simple and Smart Way to Converged Storage with QCT MESOS CB220

Moving Virtual Storage to the Cloud. Guidelines for Hosters Who Want to Enhance Their Cloud Offerings with Cloud Storage

Stovepipes to Clouds. Rick Reid Principal Engineer SGI Federal by SGI Federal. Published by The Aerospace Corporation with permission.

An Oracle White Paper September Oracle Exadata Database Machine - Backup & Recovery Sizing: Tape Backups

Hyper-V over SMB Remote File Storage support in Windows Server 8 Hyper-V. Jose Barreto Principal Program Manager Microsoft Corporation

Using High Availability Technologies Lesson 12

Best Practices Guide: Network Convergence with Emulex LP21000 CNA & VMware ESX Server

SUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION

EMC SYMMETRIX VMAX 10K

Implementing Enterprise Disk Arrays Using Open Source Software. Marc Smith Mott Community College - Flint, MI Merit Member Conference 2012

broadberry.co.uk/storage-servers

Transcription:

Highly-Available Distributed Storage UF HPC Center Research Computing University of Florida

Storage is Boring Slow, troublesome, albatross around the neck of high-performance computing UF Research Computing 2

HA Storage Previous Implementation Scalable (Lustre-based) High-Performance Affordable (very cost effective) Relatively Reliable Still want to improve Provide a better experience for users UF Research Computing 3

HA Storage Lustre MDS MDS 1 MDS 2 Lustre OSS OSS 1 Commodity Storage Servers OSS 2 Shared Storage InfiniBand OSS 3 Lustre Clients Ethernet OSS 4 OSS 5 Enterprise Storage Arrays SAN Fabrics Failover Pair OSS 6 UF Research Computing 4

HA Storage Traditional Approach Multiple Servers External Storage Chassis Dual-Active RAID Controllers Dual-Attached Fibre Channel or SAS Dual-Homed Drives HA Cluster Software Costs Quickly Mount ($2800 - $3500 TB) UF Research Computing 5

HA Storage Our Approach Integrated Server + Storage Chassis Lower Cost Higher Density Internal PCIe RAID Controllers Lower Cost As good or better performance Commodity SATA Disk Drives (Enterprise) HA Cluster Software UF Research Computing 6

HA Storage Problem All storage is internal to each chassis No way for one server to take over the storage of the other server in the event of a server failure Without dual-ported storage and RAID controllers how can one server take over the other s storage? Solution InfiniBand SCSI Remote/RDMA Protocol (SRP) UF Research Computing 7

HA Storage InfiniBand Low-latency, high-bandwidth interconnect Used natively for distributed memory applications (MPI) Encapsulation layer for other protocols (IP, SCSI, FC, etc.) SCSI Remote Protocol (SRP) Think of it as SCSI over IB Provides a host with block-level access to storage devices in another host. Via SRP host A can see host B s drives and vice-versa UF Research Computing 8

HA Storage Host A can see host B s storage and host B can see host A s storage but there s a catch If host A fails completely, host B still won t be able to access host A s storage since host A will be down and all the storage is internal. So SRP/IB doesn t solve the whole problem. But what if host B had a local copy of Host A s storage and vice-versa (pictures coming stay tuned). Think of a RAID-1 mirror, where the mirrored volume is comprised of one local drive and one remote (via SRP) drive UF Research Computing 9

HA Storage Mirrored (RAID-1) Volumes Two (or more) drives Data is kept consistent across both/all drives Writes are duplicated to each disk Reads can take place from either/all disk(s) UF Research Computing 10

Remote Mirrors Not Possible? UF Research Computing 11

Remote Mirrors Remote targets exposed via SRP UF Research Computing 12

Remote Mirrors Mirroring Possibilities UF Research Computing 13

Remote Mirrors Normal Operating Conditions UF Research Computing 14

Remote Mirrors Host A is down UF Research Computing 15

Remote Mirrors Degraded mirrors on host B UF Research Computing 16

UF Research Computing 17

UF Research Computing 18

HA Software High-Availability Software (Open Source) Corosync Pacemaker Corosync Membership Messaging Pacemaker Resource monitoring and management framework Extensible via Resource agent templates Policy Engine UF Research Computing 19

HA Software Pacemaker Resources Highly-available services IP Addresses, disk volumes, http servers, DNS, File systems, etc. Pacemaker Resource Agents Typically BASH shell scripts Conform to certain conventions (API) Know how to stop, start, monitor, and validate a particular resource within the Pacemaker framework UF Research Computing 20

HA Resources What are our HA resources? Mirrored disk volumes Lustre File System instances Disk Volume + File System = Lustre OST/MDT OST = Object Storage Target (many) MDT = MetaData Target (one) Host A & Host B: Failover Pair Mirrored volumes can be assembled on either host File system mounted where ever mirror is assembled Hence, a specific OST or MDT can reside on either server. UF Research Computing 21

HA Resources In Practice MDT Active-Standby Servers Only one MDT (currently supported) OSTs Active-Active 4 OSTs per Server (normally) 8 OSTs per server (degraded) UF Research Computing 22

HA Storage Split Brain Syndrome UF Research Computing 23

HA Storage Not just Highly Available Also high-performance 4 PCI-E RAID Controllers per Server 2 RAID-6 (4+2) Logical Disk per Controller 8 Logical Disks per Server (4 local, 4 remote) 490 MB/sec per Logical Disk 650 MB/sec per Controller (parity limited) Three IB Interfaces per Server IB Clients (QDR, Dedicated) IPoIB Clients (SDR, Dedicated) SRP Mirror Traffic (QDR, Dedicated) UF Research Computing 24

HA Storage High-Performance ( continued) Per Server Throughput 1.1 GB/sec per server (writes as seen by clients) 1.7 GB/sec per server (reads as seen by clients) Actual server throughput is 2x for writing (mirrors!) That s 2.2 GB/s per Server 85% of the 2.6 GB/s for the raw storage UF Research Computing 25

HA Storage Keeping your data safe Mirrors enable failover Provide a second copy of the data Each Mirror Hardware RAID RAID-6 (4+2), two copies of parity data Servers protected by UPS Orderly shutdown of servers in the event of a sudden power outage. 3+1 Redundant power supplies each to a different UPS. UF Research Computing 26

HA Storage Thank you! Questions? UF Research Computing 27