The Parallel File System HP SFS/Lustre on xc2

Computing Centre (SSCK), University of Karlsruhe, Germany
Laifer@rz.uni-karlsruhe.de

Outline
» What is Lustre?
» What is HP SFS?
» Overview of HP SFS on xc2
» Properties of the different file systems
» Restrictions for using HP SFS
» Performance diagrams
» IO performance monitoring
» Backup and archiving
» Quota
» How does striping work?
» Possible optimization with striping parameters

What is Lustre?
» "Lustre" is an amalgam of the terms "Linux" and "Clusters"
» Lustre is a scalable, high-performance file system for Linux
» Main development by Cluster File Systems, Inc. (CFS)
  Roadmap, FAQs and source code at http://www.clusterfs.com/
  Lustre products are available from many vendors
» Pros and cons
  + Runs very stably; the user base is growing rapidly
  + Scalable up to 10000s of clients
  + Allows failover support on servers
  + Excellent throughput and metadata performance
  + High throughput with multiple network protocols
  + POSIX file system semantics
  - Administration is not easy
  - Currently supports only Linux clients

What is HP SFS?
» HP SFS stands for HP StorageWorks Scalable File Share
  Our experiences: http://www.rz.uni-karlsruhe.de/dienste/lustretalks
» HP SFS is a Lustre appliance from HP
  Only dedicated hardware is supported:
  Servers are Xeon-based ProLiant systems from HP
  Storage arrays use SATA disks, or EVA3000 arrays with FC disks
  Includes a special software package
» Advantages of the HP SFS software
  HP supplies a hardened Lustre version
  Includes additional software
    for failover and management
    for sending problem alerts and verifying the system's health
    for performance monitoring
    client build kits and client rpm packages
  Easy installation, configuration and upgrade
  Good support

Overview of HP SFS on xc2
[Diagram: 760 Opteron clients, an MDS pair, and the $HOME and $WORK file systems, connected by the InfiniBand 4X DDR interconnect]

                              $HOME             $WORK
Capacity                      8 TB              48 TB
Total write / read perf.      360 / 800 MB/s    2100 / 3200 MB/s
Single client write / read    360 / 320 MB/s    400 / 320 MB/s

Note: the $HOME file system is mirrored.

Properties of the different file systems

Property             $TMP         $HOME            $WORK
Visibility           local        global           global
Lifetime             batch job    project          > 7 days
Capacity             70 GB        8 TB             48 TB
Quotas               no           planned          no
Backup               no           yes (default)    no
Read perf. / node    60 MB/s      320 MB/s         320 MB/s
Write perf. / node   60 MB/s      360 MB/s         400 MB/s
Total read perf.     n*60 MB/s    800 MB/s         3200 MB/s
Total write perf.    n*60 MB/s    360 MB/s         2100 MB/s

Restrictions for using HP SFS
» Nearly everything works like on a local file system!
» Unsupported features
  flock(2) does not lock files over Lustre (see the sketch below)
    Identical to the behaviour of NFS
  Direct IO (the O_DIRECT flag) is not supported
    Rarely used in applications
  atime is not always accurate
    Similar to other parallel file systems, for performance reasons
  tail -f and ls -l do not always show the latest data
  Total metadata performance is limited to several 1000 operations / sec
    Operations are open, close, create, delete, stat
» Good practice is to avoid lots of metadata operations
  Write or read data in large blocks
  Change striping parameters if lots of clients use one very large file
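
The flock caveat above can be made concrete with a small program. This is a minimal sketch, not from the original slides; the file name is illustrative. The call may succeed, but over Lustre (as on NFS) the lock only has an effect on the local client:

    /* Minimal sketch of the flock(2) caveat: the call may succeed,
     * but over Lustre (as on NFS) it does not lock the file across
     * clients. The file name is illustrative. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/file.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* On a local file system this serializes access between processes;
         * on an HP SFS mount the lock is not effective across clients. */
        if (flock(fd, LOCK_EX) != 0)
            perror("flock");
        else
            puts("got exclusive lock (local effect only over Lustre)");

        flock(fd, LOCK_UN);
        close(fd);
        return 0;
    }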

Performance of one server for $HOME
» InfiniBand SDR: 770 MB/s measured
» 133 MHz PCI-X bus for IB: about 1000 MB/s
» 2x 100 MHz PCI-X bus for SCSI adapters: about 1600 MB/s
» 4x storage array (mirrored, RAID6): about 180 MB/s for writes, about 400 MB/s for reads
[Bar chart: write and read throughput (MB/s, 0-1750) for InfiniBand SDR, PCI-X 133 MHz, 2x PCI-X 100 MHz, 4x write (mirror), 4x read (mirror)]

Performance of one server for $WORK
» InfiniBand SDR: 770 MB/s measured
» 133 MHz PCI-X bus for IB: about 1000 MB/s
» 2x 100 MHz PCI-X bus for SCSI adapters: about 1600 MB/s
» 4x storage array (RAID6): about 360 MB/s for writes, about 600 MB/s for reads
[Bar chart: write and read throughput (MB/s, 0-1750) for InfiniBand SDR, PCI-X 133 MHz, 2x PCI-X 100 MHz, 4x write, 4x read]

Write performance of $WORK
[Graph: throughput (MB/s, 0-1600) vs. number of nodes (1-10), for 1 task / node and 2 tasks / node]
» Measurement done only once, with an unstable switch
» 360 MB/s from one task is possible

Read performance of $WORK
[Graph: throughput (MB/s, 0-2200) vs. number of nodes (1-10), for 1 task / node and 2 tasks / node]
» Measurement done only once, with an unstable switch
» 320 MB/s from one task is possible

Write / read scalability of $WORK
[Graph: write and read throughput (MB/s, 0-2750) vs. number of nodes (1-51)]
» Measurement done only once
» Performance impact from other MPI applications due to internal IB link saturation
» A total write / read performance of 2100 / 3200 MB/s is possible

IO performance monitoring
» Unfortunately there is no easy way for users to do performance monitoring
  Hopefully a tool will become available in a later XC version
  The problem is mainly the restricted access to batch nodes
» Consider adding IO performance measurement to your application
  Measure the time of large IO operations to obtain the bandwidth (see the sketch below)
  This is an easy way to get portable performance monitoring
» Contact the SFS admins if you suspect IO performance problems
  We have tools for fine-grained performance monitoring
  A RAID rebuild or other applications may be impacting performance
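
One way to follow this advice is to time a large write inside the application. This is a minimal sketch assuming a POSIX environment; the file name, block size and block count are illustrative, not prescribed by the slides:

    /* Minimal sketch: estimate write bandwidth by timing a large
     * sequential write. Block size and count are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t block   = 4 * 1024 * 1024;  /* 4 MB per write */
        const int    nblocks = 256;              /* 1 GB in total */
        char *buf = malloc(block);
        if (!buf) return 1;
        memset(buf, 'x', block);

        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { free(buf); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < nblocks; i++)
            if (write(fd, buf, block) != (ssize_t)block) return 1;
        fsync(fd);   /* include the time to flush data to the servers */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        double s  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mb = (double)block * nblocks / (1024.0 * 1024.0);
        printf("wrote %.0f MB in %.2f s -> %.1f MB/s\n", mb, s, mb / s);
        free(buf);
        return 0;
    }

Writing 4 MB per call also keeps the number of IO operations small, in line with the large-block advice from the restrictions slide.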

Backup and Archiving
» Commands to show and restore data from the backup:
  tsm_q_backup   shows one, multiple or all files stored in the backup
  tsm_restore    restores saved files
  Use option -h to get help
  Only $HOME data goes into the backup
» Commands to show, archive and retrieve data:
  tsm_q_archive  shows files in the archive
  tsm_archiv     archives files
    Files on $HOME or $WORK can be archived
  tsm_retrieve   retrieves archived files
  tsm_d_archiv   deletes files from the archive

Quota
» Show reserved disk space:
  kontingent_get shows the reserved amount of disk space for your project
  Initially set depending on your project proposal
  Ask the XC hotline if you need more disk space
  The disk space limit (or quota) is not yet enforced
» Show quota in a future SFS version:
  Syntax:  lfs quota -g <group name> /lustre/data
  Example: lfs quota -g wkrz00 /lustre/data

  Disk quotas for group wkrz00 (gid 40997):
  Filesystem         blocks     quota  limit       grace  files   quota  limit  grace
  /lustre/data       117342008  0      1000000000         279957  0      0
  xc3-ls-mds1_uuid   59856      0      204800             279957  0      0
  xc3-ls-ost1_uuid   117282152  0      117350400

How does striping work?
[Diagram: clients and an MDS pair; a file on $HOME and File1, File2 on $WORK, each striped across several OSTs]
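
To make the diagram concrete: with Lustre's RAID-0-style striping, a file is split into stripe-size chunks that are distributed round-robin over stripe-count OSTs. The following sketch (illustrative values, not from the slides) computes where a given file offset lands:

    /* Illustrative sketch: round-robin mapping of a file offset to a
     * stripe object under RAID-0 style striping. Values match the
     * defaults mentioned later (4 MB stripe size, stripe count 4). */
    #include <stdio.h>

    int main(void)
    {
        const long long stripe_size  = 4 * 1024 * 1024;    /* 4 MB */
        const int       stripe_count = 4;
        const long long offset       = 21LL * 1024 * 1024; /* byte 21 MB */

        long long chunk = offset / stripe_size;            /* chunk index */
        int obj = (int)(chunk % stripe_count);             /* target object/OST */
        long long off_in_obj = (chunk / stripe_count) * stripe_size
                             + offset % stripe_size;

        printf("file offset %lld -> object %d, offset %lld in that object\n",
               offset, obj, off_in_obj);
        return 0;
    }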

Possible optimization with striping parameters
» Change striping parameters:
  lfs setstripe <dirname> <stripe size> <stripe start> <stripe count>
  Always use -1 as stripe start!
  Only has an effect on new files in this directory!
  Changed parameters are not saved in the backup! Keep your own documentation of changes
» Small file performance can be improved with stripe count 1
  Example for stripe count 1: lfs setstripe . 4194304 -1 1
» Large IO on $WORK is slightly improved with stripe count 8
  Example for stripe count 8: lfs setstripe . 4194304 -1 8
» If many clients use one file > 100 MB on $WORK, use stripe count -1!
  Example to stripe over all OSTs: lfs setstripe . 4194304 -1 -1

Possible optimization with striping parameters (cont.)
» Change the stripe size if your application uses a special chunk size
  For MPI-IO the best performance was observed with a stripe size of 16 MB
  Example to change the stripe size to 16 MB: lfs setstripe . 16777216 -1 -1
» If your application uses many files, do not change the striping parameters
  The default stripe count is 4 and files are automatically distributed
» Check the stripe count: lfs getstripe <filename>

  OBDS:
  0: xc2-ls-ost5_uuid
  ...
  23: xc2-ls-ost28_uuid
  ./my_striped_file
       obdidx      objid     objid  group
           12    3430634  0x3458ea      0
           13    3430885  0x3459e5      0
           14    3430874  0x3459da      0
           15    3431045  0x345a85      0

  The number of lines below obdidx shows the stripe count (4 here)
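
Striping can also be set programmatically when a file is created, which avoids changing the directory default. This is a minimal sketch assuming the llapi_file_create() call from Lustre's liblustreapi; header and library names vary between Lustre versions, so treat these details as an assumption rather than the documented HP SFS interface:

    /* Minimal sketch (assumption: llapi_file_create() from liblustreapi;
     * header/library names differ across Lustre versions). Creates a file
     * with a 4 MB stripe size over all OSTs, the programmatic analogue of
     * "lfs setstripe . 4194304 -1 -1". Link with -llustreapi. */
    #include <stdio.h>
    #include <lustre/liblustreapi.h>

    int main(void)
    {
        /* name, stripe_size, stripe_offset (-1 = MDS chooses),
         * stripe_count (-1 = all OSTs), stripe_pattern (0 = default) */
        int rc = llapi_file_create("bigfile", 4 * 1024 * 1024, -1, -1, 0);
        if (rc != 0) {
            fprintf(stderr, "llapi_file_create failed: %d\n", rc);
            return 1;
        }
        /* The file now exists with the requested layout; open it as usual. */
        return 0;
    }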

Questions?