Lessons learned from parallel file system operation




Lessons learned from parallel file system operation
Roland Laifer
Steinbuch Centre for Computing (SCC)
KIT - University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu

Overview
- General introduction to parallel file systems
- Lustre, GPFS and pNFS compared
- Basic Lustre concepts
- Lustre systems at KIT
- Experiences with Lustre and with the underlying storage
- Options for sharing data
  - by coupling InfiniBand fabrics
  - by using Grid protocols

Parallel file system vs. distributed file system
What is a distributed file system?
- File system data is usable at the same time from different clients
- With multiple servers, applications see separate file systems
- Examples: NFS, CIFS
What is a parallel file system (PFS)?
- Distributed file system with parallel data paths from clients to disks
- Even with multiple servers, applications typically see one file system
- Examples: Lustre, GPFS

When and why is a PFS required?
Main PFS advantages:
- Throughput performance
- Scalability: usable by 1000s of clients
- Lower management costs for huge capacity
Main PFS disadvantages:
- Metadata performance is low compared to many separate file servers
- Complexity: management requires skilled administrators
- Most PFS require adaptation of clients for new Linux kernel versions
Which solution is better?
- This depends on the applications and on the system environment
- Price also depends on quality and is hard to compare, e.g. there are huge price differences between NFS products
- If a PFS is not required, a distributed file system is much easier

PFS products (1): Lustre
Status:
- Huge user base: 70% of the Top100 systems recently used Lustre
- Lustre products available from many vendors: DDN, Cray, Xyratex, Bull, SGI, NEC, Dell
- Most developers left Oracle and now work at Whamcloud
- OpenSFS is mainly driving Lustre development
Pros and cons:
+ Nowadays runs very stable
+ Open source, open bugzilla
+ Scalable up to 10000s of clients
+ High throughput with multiple network protocols and LNET routers
- Client limitations:
  - Only supports Linux; NFS/CIFS gateways possible
  - Not in the kernel, i.e. adaptations required to be usable with new kernels
- Limited in its features, e.g. no data replication or snapshots

PFS products (2): IBM GPFS
Status:
- Large user base, also widely used in industry
- Underlying software for other products, e.g. IBM Scale Out File Services
Pros and cons:
+ Runs very stable
+ Offers many useful features:
  - Snapshots, data and metadata replication, online disk removal
  - Information Lifecycle Management (ILM), e.g. allows easy storage renewal
+ Scalable up to 1000s of clients
+ Natively supported on AIX, Linux and Windows Server 2008
- Client limitations:
  - Not in the kernel, i.e. adaptations required to be usable with new kernels
- Vendor lock-in:
  - IBM is known to frequently change their license policy

PFS products (3): Parallel NFS (pNFS)
Status:
- Standard is defined: NFS version 4.1, see RFCs 5661, 5663, 5664
- 3 different implementations: files, blocks, objects
- Servers largely ready from different vendors: NetApp, EMC, Panasas, BlueArc
- Client for the files implementation in Linux kernel 3.0 and RHEL 6.1
- Windows client developed by the University of Michigan (CITI)
Pros and cons:
+ Standard and open source
+ Will be part of the Linux kernel
+ Server solutions from multiple vendors
+ Metadata and file data separation allows increased performance
+ Fast migration or cloning of virtual machine disk files with VMware ESX
- Stability:
  - Linux client still not completely ready
  - Complicated product, e.g. because of the 3 different implementations
  - Lots of bugs expected when the first production sites start

Basic Lustre concepts
[Diagram: clients contact the metadata server for directory operations, file open/close, metadata and concurrency, and the object storage servers for file I/O and file locking; recovery, file status and file creation are handled between the servers.]
Lustre components:
- Clients (C) offer a standard file system API
- Metadata Servers (MDS) hold metadata, e.g. directory data
- Object Storage Servers (OSS) hold file contents and store them on Object Storage Targets (OSTs)
- All communicate efficiently over interconnects, e.g. with RDMA
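As an illustration of how the components fit together, the following minimal sketch mounts a Lustre client; the MGS NID, file system name and mount point are hypothetical, and the exact mount syntax depends on the Lustre version.

    import subprocess

    # Hypothetical values: the MGS/MDS node is reached via InfiniBand (o2ib),
    # the file system is called "pfs" and is mounted at /pfs on the client.
    MGS_NID = "172.26.0.1@o2ib"
    FSNAME = "pfs"
    MOUNTPOINT = "/pfs"

    # A Lustre client mounts the whole file system through the management/metadata
    # server; file contents are then transferred directly between client and OSSs.
    subprocess.run(["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", MOUNTPOINT],
                   check=True)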

How does Lustre striping work?
[Diagram: files on $HOME, $PFSWORK and $WORK are striped across OSTs; the default stripe size is 1 MB and the default stripe counts are 1, 2 and 4 for the different file systems; clients reach the MDS and OSS pairs over an InfiniBand interconnect.]
- Parallel data paths from clients to storage
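Striping can be controlled per directory or file with the lfs tool. Below is a minimal sketch, assuming a hypothetical directory on /pfs/work; flag details (e.g. the stripe size option) differ between Lustre versions, so only the stripe count is set here.

    import subprocess

    def set_stripe_count(path, count):
        # New files created below 'path' will be striped across 'count' OSTs.
        subprocess.run(["lfs", "setstripe", "-c", str(count), path], check=True)

    def show_stripe(path):
        # Show the layout (stripe count, stripe size, OST objects) of a file or directory.
        subprocess.run(["lfs", "getstripe", path], check=True)

    if __name__ == "__main__":
        set_stripe_count("/pfs/work/myproject", 4)   # hypothetical directory
        show_stripe("/pfs/work/myproject")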

Lustre file systems at XC1
[Diagram: XC1 cluster with 120 clients on a Quadrics QSNet II interconnect, one admin node, one MDS and six OSS attached to seven EVA arrays; file systems /lustre/data and /lustre/work.]
- HP SFS appliance with HP EVA5000 storage
- Production from Jan 2005 to March 2010

Lustre file systems at XC2
[Diagram: XC2 cluster with 762 clients on an InfiniBand 4X DDR interconnect, two MDS and eight OSS; file systems /lustre/data and /lustre/work.]
- HP SFS appliance (initially) with HP storage
- Production since Jan 2007

Lustre file systems at HC3 and IC1
[Diagram: IC1 cluster (213 clients, InfiniBand DDR) and HC3 cluster (366 clients, InfiniBand QDR); the pfs file systems /pfs/data and /pfs/work with 2 MDS and 2x + 8x OSS pairs, and /hc3/work with 2 MDS and a 2x OSS pair, each with their OSTs.]
- Production on IC1 since June 2008 and on HC3 since Feb 2010
- pfs is a transtec/Q-Leap solution with transtec provigo (Infortrend) storage
- hc3work is a DDN (HP OEM) solution with DDN S2A9900 storage

bwgrid storage system (bwfs) concept
- Lustre file systems at 7 sites in the state of Baden-Württemberg
- Grid middleware for user access and data exchange
- Production since Feb 2009
- HP SFS G3 with MSA2000 storage
[Diagram: gateway systems connect the 7 university sites over the BelWü network to central Grid services and to backup and archive.]

Summary of current Lustre systems

System name        hc3work              xc2                     pfs                             bwfs
Users              KIT                  universities, industry  departments, multiple clusters  universities, grid communities
Lustre version     DDN Lustre 1.6.7.2   HP SFS G3.2-3           Transtec/Q-Leap Lustre 1.6.7.2  HP SFS G3.2-[1-3]
# of clients       366                  762                     583                             >1400
# of servers       6                    10                      22                              36
# of file systems  1                    2                       2                               9
# of OSTs          28                   8 + 24                  12 + 48                         7*8 + 16 + 48
Capacity (TB)      203                  16 + 48                 76 + 301                        4*32 + 3*64 + 128 + 256
Throughput (GB/s)  4.5                  0.7 + 2.1               1.8 + 6.0                       8*1.5 + 3.5
Storage hardware   DDN S2A9900          HP                      transtec provigo                HP MSA2000
# of enclosures    5                    36                      62                              138
# of disks         290                  432                     992                             1656

General Lustre experiences (1)
Using Lustre for home directories works
- Problems with users creating 10000s of files per small job
  - Convinced them to use local disks (we have at least one per node)
- Problems with inexperienced users using home for scratch data
  - Also puts high load on the backup system
- Enabling quotas helps to quickly identify bad users (see the sketch below)
  - Enforcing quotas for inodes and capacity is planned
- Restore of home directories would last for weeks
  - Idea is to restore important user groups first
  - Luckily, until 2 weeks ago a complete restore was never required
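One simple way to spot problematic users is to walk the password database and print each user's Lustre quota report. This is a minimal sketch, assuming quotas are enabled and using a hypothetical mount point.

    import pwd
    import subprocess

    MOUNT = "/pfs/home"   # hypothetical Lustre mount point

    def print_quota(user):
        # 'lfs quota -u' reports capacity (kbytes) and inode usage for one user.
        result = subprocess.run(["lfs", "quota", "-u", user, MOUNT],
                                capture_output=True, text=True)
        print(result.stdout)

    for entry in pwd.getpwall():
        if entry.pw_uid >= 1000:   # skip system accounts
            print_quota(entry.pw_name)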

General Lustre experiences (2)
Monitoring performance is important
- Check the performance of each OST during maintenance (a sketch of the idea follows below)
  - We use dd and parallel_dd (our own Perl script)
- Check which users are heavily stressing the system
  - We use collectl and a script attached to bugzilla 22469
  - Then discuss more efficient system usage, e.g. striping parameters
Nowadays Lustre is running very stable
- After months the MDS might stall
  - Usually the server is shot by heartbeat and failover works
- Most problems are related to the storage subsystems
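The per-OST check with dd can be approximated by pinning one test file to each OST and timing a streaming write. The sketch below follows that idea; the mount point, OST count and file size are illustrative, and it must run on a Lustre client with enough free space.

    import os
    import subprocess
    import time

    MOUNT = "/pfs/work"   # hypothetical Lustre mount point
    NUM_OSTS = 28         # e.g. the hc3work value from the summary table
    SIZE_MB = 1024        # amount of data written per OST

    def ost_write_bandwidth(index):
        # Create a file with stripe count 1 on the given OST, then time a write.
        path = os.path.join(MOUNT, f"ostcheck.{index}")
        subprocess.run(["lfs", "setstripe", "-c", "1", "-i", str(index), path],
                       check=True)
        block = b"\0" * (1 << 20)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(SIZE_MB):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start
        os.remove(path)
        return SIZE_MB / elapsed

    for i in range(NUM_OSTS):
        print(f"OST {i:3d}: {ost_write_bandwidth(i):7.1f} MB/s")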

Complexity of parallel file system solutions (1)
Complexity of underlying storage
- Lots of hardware components
  - Cables, adapters, memory, caches, controllers, batteries, switches, disks
  - All can break
- Firmware or drivers might fail
- Extreme performance causes problems not seen elsewhere
  - Disks fail frequently
  - Timing issues cause failures
[Diagram: clients, MGS, MDS and OSS connected via 4X DDR InfiniBand, 4 Gbit FC and 3 Gbit SAS.]

Complexity of parallel file system solutions (2)
Complexity of parallel file system (PFS) software
- Complex operating system interface
- Complex communication layer
- Distributed system: components on different systems are involved
  - Recovery after failures is complicated
  - Not easy to find out which component is causing trouble
- Scalability: 1000s of clients use it concurrently
- Performance: low level implementation required
  - Higher level solutions lose performance
Expect bugs in any PFS software
- Vendor tests at scale are very important
- Lots of similar installations are beneficial

Experiences with storage hardware
- HP arrays hang after disk failure under high load
  - Happened at different sites for years
  - System stalls, i.e. no file system check required
- Data corruption at bwgrid sites with HP MSA2000
  - Firmware of the FC switch and of the MSA2000 is the most likely reason
  - Largely fixed by an HP action plan with firmware / software upgrades
- Data corruption with transtec provigo 610 RAID systems
  - File system stress test on XFS causes the RAID system to hang
  - Problem is still under investigation by Infortrend
- SCSI errors and OST failures with DDN S2A9900
  - Caused by a single disk with media errors
  - Happened twice; new firmware provides better bad disk removal
Expect severe problems with midrange storage systems

Interesting OSS storage option
OSS configuration details:
- Linux software RAID6 over RAID systems
- RAID systems have hardware RAID6 over disks
- RAID systems have one partition for each OSS
No single point of failure:
- Survives 2 broken RAID systems
- Survives 8 broken disks
Good solution with single RAID controllers:
- Mirrored write cache of dual controllers is often the bottleneck
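The configuration can be sketched with mdadm: a Linux software RAID6 is built across one partition from each external RAID system, and the resulting md device is then used as an OST. Device names and the number of RAID systems below are hypothetical.

    import subprocess

    # Hypothetical device names: one partition per external RAID system,
    # each of which is itself a hardware RAID6 over its disks.
    partitions = [f"/dev/mapper/raidbox{i}p1" for i in range(10)]

    # Software RAID6 over the external RAID systems: the md device (later the OST)
    # survives the loss of two complete RAID systems.
    subprocess.run(
        ["mdadm", "--create", "/dev/md0", "--level=6",
         f"--raid-devices={len(partitions)}", *partitions],
        check=True,
    )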

Future requirements for PFS / Lustre
Need better storage subsystems
- Fight against silent data corruption
  - It really happens
  - Finding the responsible component is a challenge
  - Checksums quickly show data corruption
  - They also increase the chance of avoiding huge data corruptions
- Storage subsystems should also check data integrity
  - E.g. by checking the RAID parity during read operations
  - T10 DIF and T10 DIX might help for future systems
Support efficient backup and restore
- Need point-in-time copies of the data at a different location
- Fast data paths for backup and restore required
- Checkpoints and differential backups might help
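Until storage subsystems verify integrity end to end, checksums maintained next to the data can at least detect silent corruption. A minimal sketch (manifest location and invocation are illustrative): run it once with a directory argument to record checksums, and later without arguments to verify them.

    import hashlib
    import json
    import os
    import sys

    MANIFEST = "checksums.json"   # hypothetical manifest file

    def sha256(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def build(root):
        # Record a checksum for every file below 'root'.
        sums = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                sums[path] = sha256(path)
        with open(MANIFEST, "w") as f:
            json.dump(sums, f, indent=2)

    def verify():
        # Re-read every file and report mismatches, i.e. possible silent corruption.
        with open(MANIFEST) as f:
            sums = json.load(f)
        for path, expected in sums.items():
            if sha256(path) != expected:
                print(f"checksum mismatch: {path}")

    if __name__ == "__main__":
        build(sys.argv[1]) if len(sys.argv) > 1 else verify()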

Sharing data (1): Extended InfiniBand fabric
Examples: IC1 and HC3; bwgrid clusters in Heidelberg and Mannheim (28 km distance)
- InfiniBand coupled with Obsidian Longbow over DWDM
Requirements:
- Select an appropriate InfiniBand routing mechanism and cabling
- Host based subnet managers might be required
Advantages:
- Same file system visible and usable on multiple clusters
- Normal Lustre setup without LNET routers
- Low performance impact
Disadvantages:
- InfiniBand possibly less stable
- More clients possibly cause additional problems

Sharing data (2): Grid protocols
Example: bwgrid
- gridftp and rsync over gsissh to copy data between clusters
Requirements:
- Grid middleware installation
Advantages:
- Clusters remain usable during external network or file system problems
- Metadata performance is not shared between clusters
- User ID unification not required
- No full data access for remote root users
Disadvantages:
- Users have to synchronize multiple copies of data
- Some users do not cope well with Grid certificates
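Such a transfer can be wrapped in a small script; the sketch below runs rsync over gsissh (host names and paths are illustrative, and a valid Grid proxy certificate is assumed).

    import subprocess

    SRC = "/pfs/work/myproject/"                              # local source (illustrative)
    DST = "user@cluster2.example.org:/bwfs/work/myproject/"   # remote target (illustrative)

    # Copy data between clusters with rsync tunnelled over GSI-enabled ssh;
    # authentication uses the Grid proxy certificate instead of unified user IDs.
    subprocess.run(["rsync", "-av", "-e", "gsissh", SRC, DST], check=True)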

Further information
SCC talks on Lustre administration, performance monitoring and best practices:
- http://www.scc.kit.edu/produkte/lustre.php
Parallel file systems:
- Lustre: http://www.lustre.org/, http://www.whamcloud.com/, http://www.opensfs.org/
- IBM GPFS: http://www.ibm.com/systems/software/gpfs/
- pNFS: http://www.pnfs.com/