Some say people have evolved into something intelligent?!?!




Abstract: GPFS has matured significantly over the years since its version 1.x release for AIX, yet many HPC practitioners do not fully realize GPFS's flexibility and ease of use today. This presentation examines these new features, including multi-clustering, GPFS in a mixed AIX/Linux environment, its ability to work with a wide variety of disk vendors, and other things. Sample configurations with benchmark results will be given. The presentation will include a cursory review of the GPFS roadmap.

Some say people have evolved into something intelligent?!?! But what about HPC systems? How have they evolved?

A Common HPC Evolutionary Path: Mainframe (IBM mainframe with attached vector processor(s)); Vector Processor (Cray, with an attached mainframe?); then, after n steps... Proprietary Clusters (a room full of SPs).

A Common HPC Evolutionary Path (continued): then some renegade starts experimenting with Beowulf clusters... starting like this... then evolving into this... alongside the Mainframe, the Vector Processor (Cray, with an attached mainframe?), and, after n steps, the Proprietary Clusters (a room full of SPs).

A Common HPC Evolutionary Path (continued): the Beowulf cluster joins the family tree. Mainframe (IBM mainframe with attached vector processor(s)); Vector Processor (Cray, with an attached mainframe?); after n steps... Proprietary Clusters (a room full of SPs); Proprietary SMP Clusters (a room full of IBM p690s); Rack-mounted Linux Nodes (IBM Cluster 1350); Blades running Linux (IBM BladeCenter).

What Next? Cluster 1 nodes and Cluster 2 nodes with local disk access to the Site 1 and Site 2 SANs, remote disk access to a Cluster 1 file system over a global SAN interconnect, a Site 3 SAN, and a visualization system... and everybody is talking about grids!

But Where Does Storage I/O Fit? Storage I/O... the oft-forgotten stepchild. Early adopters of proprietary clusters (e.g., IBM SP) generally adopted vendor storage solutions (e.g., SSA and GPFS or JFS). GPFS is NOT the same beast it used to be! Early adopters of commodity clusters approached storage I/O with a potpourri of approaches (e.g., NFS). There are alternatives to NFS. Customers trying to integrate proprietary and commodity systems often feel forced to use NFS; again, there are alternatives to NFS. And what about grids?

Let's take a closer look at this. I will begin with the Linux cluster's perspective; I will get to the SP-to-pSeries perspective in a moment. He who does not study history is predestined to relive it... errr, but is NFS really history?

Common First Step For Something Small: for a setup like this, it is common to do NFS and FTP between servers over a GbE or 100 MbE Ethernet network. Example configuration: sam (Intellistation M Pro, 2 CPUs, 4 GB, Linux 2.4.21 SuSE, sda scsi); frodo (p615, 2 CPUs, 4 GB, AIX 5.2, hdisk0..hdisk3 scsi); gandalf (p615, 2 CPUs, 4 GB, AIX 5.2, hdisk0..hdisk3 scsi). Each machine serves its local file system and NFS-mounts the other two: sam has local /fs_sam and NFS mounts /fs_frodo and /fs_gandalf; frodo has local /fs_frodo and NFS mounts /fs_sam and /fs_gandalf; gandalf has local /fs_gandalf and NFS mounts /fs_sam and /fs_frodo.

Common Second Step: Make the Small Solution BIGGER. A node switch connects a head node (internal SCSI or SATA, NFS-exporting its local file system) to the "client" nodes (e.g., x336 running Linux), which access the NFS-mounted file system from the head node, use internal SCSI for local scratch, and FTP files between clients as necessary. The head node (e.g., x346 running Linux) hosts a file system based on internal storage and NFS-exports it to the clients. The node switch is generally an IP-based network: GbE, Myrinet, HPS, IB?

Common Application Organization: use the IP network to distribute data via MPI, NFS and/or other "home-grown mid-layer" codes. This works well for applications using minimal or no parallel I/O, but do application developers (i.e., computational scientists) really want to become computer scientists? (A sketch of this funnel-the-data-through-one-task pattern appears below, after the Common Third Step comments.)

Common Third Step: Create "Islands of Nodes" as You Get Even BIGGER. Several head-node-plus-node-switch islands, each NFS-exporting a local file system, are connected via a higher-level switch (LAN or WAN), i.e., a hierarchical switching network. COMMENTS: this provides "any to any" connectivity the poor man's way; it works well when the I/O model is not parallel, but may require aggregating files; ISLs can be bottlenecks; NFS semantics are inadequate (especially for parallel writes); I/O performance is poor; storage capacity is limited (you can add more head nodes); storage BW is limited by the head node's switch adapter; it is inconvenient to move files manually with ftp or other utilities; the fabric is shared between message passing and storage I/O; and it naturally generalizes to a grid with all of its issues, compounded by the variabilities of geographic separation!
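To make the "home-grown mid-layer" idea above concrete, here is a minimal, hypothetical sketch (mine, not from the presentation) of the pattern it describes: rank 0 reads the whole input file from the head node's NFS-exported file system and scatters fixed-size chunks to the other tasks over the IP network with MPI. The path /fs_head/input.dat and the 1 MB chunk size are illustrative assumptions.

    /* Sketch: rank 0 reads from the NFS-mounted head-node file system and
     * scatters equal chunks to all tasks over the IP network. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (1 << 20)   /* 1 MB per task, assumed for the example */

    int main(int argc, char **argv)
    {
        int rank, ntasks;
        char *all = NULL, *mine = malloc(CHUNK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        if (rank == 0) {                 /* only the head task touches the file */
            all = malloc((size_t)CHUNK * ntasks);
            FILE *f = fopen("/fs_head/input.dat", "rb");  /* hypothetical NFS path */
            fread(all, 1, (size_t)CHUNK * ntasks, f);
            fclose(f);
        }

        /* distribute the data over the node network instead of doing parallel I/O */
        MPI_Scatter(all, CHUNK, MPI_BYTE, mine, CHUNK, MPI_BYTE, 0, MPI_COMM_WORLD);

        /* ... each task now computes on its private chunk ... */

        MPI_Finalize();
        free(all);
        free(mine);
        return 0;
    }

The weakness the slide points out shows up immediately in this sketch: every byte passes through one node and one NFS mount, so aggregate bandwidth does not grow with the number of tasks.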

Common Third Step, continued: the islands can also be fed by iSCSI-based storage or by a bank of NFS server nodes in front of a disk controller and disk enclosures, all hanging off the higher-level switch (LAN or WAN); the clusters remain connected via a hierarchical switching network. More sophisticated storage systems can be adopted to work within this NFS/IP-based model over a WAN/grid: iSCSI-based systems (NFS is not necessarily required; "plain vanilla" iSCSI can be used, but more sophisticated schemes are being investigated, e.g., file replication at the Univ. of Tokyo), or a SAN-based file system that provides local I/O while the local file systems are NFS-exported over the WAN (you still must deal with NFS shortcomings). In the SAN example, the SAN-attached storage would typically be distributed over the 6 servers with 6 or more different NFS-exported file systems (at least 1 file system per NFS server).

The Final(?) Step: Global Storage with a Parallel File System. Storage servers with external storage (a SAN disk controller and disk enclosures) sit on the node switch (LAN or WAN) alongside the compute nodes. COMMENTS: homogeneous network; a parallel FS facilitates the effective use of this architecture; disks are accessed via the LAN/WAN as virtual disks; performance scales linearly in the number of servers, so increasing the number of servers increases BW; capacity can be added without increasing the number of servers; server switch adapters can become a bottleneck; the number of nodes can be scaled out inexpensively; the largest GPFS cluster with a single file system is 2300 nodes; this is the natural model for a grid-based file system.

The Final(?) Step: Another Global Storage Parallel File System Model. Here a SAN switch connects the disk controller and disk enclosures (SAN-attached storage) directly to the nodes, which are also on a node switch (LAN or WAN). COMMENTS: separate switch fabrics; a parallel FS facilitates the effective use of this architecture; performance scales in the number of disk controllers; capacity can be added without increasing the number of controllers; scaling out the number of direct-attached nodes is limited by the SAN (the largest SAN cluster is 200+ nodes); scaling larger requires remote nodes accessing storage over the IP network; direct-attached nodes get better file system BW; BW is not restricted by the server's node switch adapters (typically a FC HBA is faster than GbE... but does this change with IB?); it allows greater aggregate BW, e.g., 15 GB/s on 40 nodes; a SAN works well in a processing center; use a LAN/WAN to scale out beyond SAN limits.

In the early days of HPC clusters, there were limited choices for parallel/global file systems... and generally it was necessary to use the vendor's file system. Today there are other choices (at least 18 at my last count) that have been enabled by the development of Linux-based clusters. In order to understand more clearly how GPFS fits into this environment, the following pages discuss a coarse HPC storage architecture taxonomy covering the range of file systems used on HPC systems... this is a work in progress!

HPC Storage Architecture Taxonomy. The following pages examine an architectural taxonomy of storage I/O architectures commonly used in HPC systems. They support varying degrees of parallel I/O and do not represent mutually exclusive choices: Conventional I/O; Asynchronous I/O; Network File Systems; Basic Parallel I/O (single component architecture); Centralized Metadata Server with SAN-Attached Disk (dual component architecture); Recent Developments (triple component architecture); High-Level Parallel I/O.

Conventional I/O: local file systems; the basic, "no frills, out of the box" file system; journaled, extent-based semantics. Journaling: logging information about operations performed on the file system metadata as atomic transactions; in the event of a system failure, the file system is restored to a consistent state by replaying the log and applying the log records for the appropriate transactions. Extent: a sequence of contiguous blocks allocated to a file as a unit, described by a triple of <logical offset, length, physical location>. If they are a native FS, they are integrated into the OS (e.g., caching done via the VMM) and are more favorable toward temporal than spatial locality. Intra-node process parallelism; disk-level parallelism is possible via striping; not truly a parallel file system. Examples: Ext3, JFS2, XFS. COMMENT: the GPFS cache (i.e., the pagepool) is more favorable toward spatial than temporal locality; very large pagepools (up to 8 GB using a 64-bit OS) may do better with temporal locality.

Asynchronous I/O: abstractions allowing multiple threads/tasks to safely and simultaneously access a common file. Built on top of a base file system; parallelism is available if it is supported in the base file system. Part of POSIX 4, but not supported on all Unix-based file systems (e.g., not in Linux 2.4, though Linux 2.6 now includes it?); AIX, Irix and Solaris support AIO. (A minimal AIO sketch appears below, after the NFS notes.)

Network File System (NFS): disk access from remote nodes via network access (e.g., TCP/IP over Ethernet). NFS is ubiquitous and the most common example, but it is not truly parallel; old versions are not cache coherent (is V3 or V4 truly safe?); writes require the O_SYNC and -noac options to be safe; it gives poorer performance for I/O-intensive HPC jobs (write: only 90 MB/s on a system capable of 400 MB/s with 4 tasks; read: only 381 MB/s on a system capable of 740 MB/s with 16 tasks). COMMENT: enhancements have been proposed for NFS V4 under AIX that should improve NFS parallel writes. NFS uses the POSIX I/O API, but not its semantics; it is useful for on-line interactive access to smaller files. While NFS is not designed for general parallel file access on an HPC system, by placing restrictions on an application's storage I/O model, some customers get "good enough" performance from it.
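As an illustration of the AIO abstraction described above, the following minimal sketch (mine, not from the presentation) issues one asynchronous read with the POSIX aio_* interface and overlaps it with computation. It assumes a platform whose libc provides POSIX AIO (e.g., AIX, or Linux 2.6 linked with -lrt); the file path is made up.

    /* Minimal POSIX AIO sketch: start a read, compute, then collect the result. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/gpfs/example.dat", O_RDONLY);   /* illustrative path */
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* the task is free to compute here while the read is in flight */

        while (aio_error(&cb) == EINPROGRESS)
            ;                               /* or block with aio_suspend() */

        ssize_t n = aio_return(&cb);
        printf("read %zd bytes asynchronously\n", n);
        close(fd);
        return 0;
    }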

Basic Parallel I/O (single component architecture): parallelizes file, metadata and control operations; a single component architecture does not require a distinction between metadata, storage and client nodes. Uses the POSIX I/O model with extensions: a byte stream accessed through an API with read(), write(), open(), close(), lseek(), stat(), etc.; the POSIX model is extended to support safe parallel data access semantics; these options guarantee portability to other POSIX-based file systems for applications using the POSIX I/O API; there are generally API extensions as well, but these compromise portability. Good performance for large-volume, I/O-intensive jobs; works best for large-block, sequential access patterns, but vendors can add optimizations for other patterns. Examples: GPFS (IBM... best of class), GFS (Sistina/Red Hat). (A sketch of the per-task byte-range access pattern appears below, after the dual component architecture notes.)

Centralized Metadata Servers with SAN-Attached Disk (dual component architecture): parallel user data semantics, but non-parallel metadata semantics; supports the POSIX API with parallel data access semantics. Dual component architecture (storage client/server, metadata server): metadata is maintained and accessed from a single common server; failover features allow a backup metadata server to take over if the primary fails; Ethernet (100 MbE or 1 GbE) is used for metadata access, a potential scaling bottleneck (but SANs already limit scaling), and latency more than BW is the issue. All "disks" connect to all nodes via the SAN; file data access is via the SAN, not the node network, which removes the need for an expensive node network (e.g., Myrinet) but inhibits scaling due to the cost of the FC switch tree (i.e., the SAN). Ideal for smaller numbers of nodes: SNFS advertises up to 50 clients (and can go as high as 100 nodes) and is capable of very high BW; on a very carefully configured/tuned p690, GPFS, JFS2 and SNFS all got 15 GB/s; CXFS scales only to 10-12 servers (for some users, perhaps at most 30?), good enough for large SMPs? Examples: CXFS (SGI), SNFS (ADIC), SanFS (IBM). SanFS and SNFS place heavy emphasis on storage virtualization.
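The following sketch, again illustrative rather than from the presentation, shows the access pattern a single-component parallel file system such as GPFS is optimized for: each task writes a large, disjoint, sequential byte range of one shared file through the plain POSIX API. The task id, file path and 1 MB record size are assumptions for the example.

    /* Sketch: each task writes its own disjoint region of one shared file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK   (1 << 20)    /* 1 MB records, matching the benchmark bsz   */
    #define NBLOCKS 64

    int main(void)
    {
        int taskid = 0;          /* assumed; normally from the job launcher    */
        char *buf = malloc(BLOCK);
        memset(buf, taskid, BLOCK);

        int fd = open("/gpfs/shared.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* each task owns a contiguous region; offsets never overlap, so no
         * application-level locking is needed on a parallel file system     */
        off_t base = (off_t)taskid * NBLOCKS * BLOCK;
        for (int i = 0; i < NBLOCKS; i++)
            pwrite(fd, buf, BLOCK, base + (off_t)i * BLOCK);

        close(fd);
        free(buf);
        return 0;
    }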

Recent Developments (triple component architecture): Lustre and Panasas are two recently developed HPC-style parallel file systems which began "from a clean sheet of paper", a design history that distinguishes them from the other file systems in this taxonomy; they have a number of architectural similarities. Triple component architecture: storage clients, storage servers and metadata servers, with file data access over the node network between storage clients and servers (e.g., GbE, Myrinet). Object-oriented architecture: object-oriented disks are not generally available yet, so the current implementation is in software and not fully generalized; the OO design is invisible to the application (i.e., it uses the POSIX API with parallel semantics). Designed to facilitate storage management (i.e., "storage virtualization"); focused on Linux/COTS environments.

High-Level Parallel I/O: a high-level abstraction layer providing a parallel model, built on top of a base file system (conventional or parallel). MPI-IO is the ubiquitous model: the parallel disk I/O extension to MPI in the MPI-2 standard; a semantically richer API; portable. It requires significant source code modification for use in legacy codes, but it has the advantage of being a standard (e.g., syntactic portability). (A minimal MPI-IO sketch follows.)
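For comparison, here is a minimal MPI-IO version of the same per-task pattern, a sketch based only on the standard MPI-2 file interface (the path and block size are again illustrative): each rank writes its block at a rank-derived offset with a collective call, which lets the library coalesce requests instead of relying on hand-placed pwrite() offsets.

    /* Minimal MPI-IO sketch: every rank writes one block of a shared file. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK (1 << 20)

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char *buf = malloc(BLOCK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank, BLOCK);

        MPI_File_open(MPI_COMM_WORLD, "/gpfs/shared_mpiio.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* collective write: the MPI library may coalesce requests and pass
         * access hints down to the file system                             */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                              MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        free(buf);
        return 0;
    }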

Which Architecture is Best? There is no concise answer to this question; it is application/customer specific. All of them serve specific needs, and all of them work well if properly deployed and used according to their design specs. Issues to consider are the application requirements (which often force compromises between competing needs) and how a given product implements a specific architecture.

What Others Say About GPFS: two recent papers compare/contrast parallel file systems. Margo, Kovatch, Andrews, Banister, "An Analysis of State-of-the-Art Parallel File Systems for Linux", 5th International Conference on Linux Clusters: The HPC Revolution 2004, Austin, TX, May 2004, compared GPFS, Lustre and PVFS on the criteria of performance, system administration, redundancy and special features: "In both SAN and NSD modes, GPFS performed the best. It was also easy to install and had numerous redundancy and special features." Cope, Oberg, Tufo, Woitaszek, "Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments", 6th International Conference on Linux Clusters: The HPC Revolution 2005, Chapel Hill, NC, April 2005, compared GPFS, Lustre, NFS, PVFS2 and TerraFS on the criteria of performance, usability, stability and special features: "Our experiences with GPFS were very positive, and we found it to be a very powerful filesystem with well documented administrative tools."

What is GPFS? This question no longer has a simple answer.

GPFS in the "Good Old Days": a typical "Winterhawk" configuration with GPFS 1.3. Nodes 1 through 32 on the SP Switch are compute clients; nodes 33 through 36 are VSD servers in front of hdisks 1..64 on SSA disk. All nodes run AIX; thin nodes are compute clients, wide nodes are VSD servers; GPFS packets transit the switch; the disk is SSA. Peak aggregate BW for this configuration: at most 440 MB/s. It took an experienced sysadmin a day or two to configure.

...but GPFS can't do that!?!?! Old ideas die hard. GPFS is far more versatile than it was in its early days; the following pages highlight many of these newer features.

GPFS Today: GPFS under Linux with GbE. A file server system of four NSD servers (x345-29 through x345-32) serves client systems x345-1 through x345-28 over a 1 GbE Ethernet switch; while not officially supported, 100 MbE can also be used among the nodes instead of 1 GbE. Scaling out: actual GbE-based designs have been extended up to 1100 Linux nodes (e.g., Intel or Opteron) with GPFS 2.2; the current design maximum for GPFS 2.3 is 2000+ nodes. The SAN behind the NSD servers is a Brocade 2109-F16 switch with a FAStT900 controller and six EXP700 enclosures. Software: Red Hat Linux 9.0 (kernel 2.4.24-st2), GPFS 2.2.

Benchmark results on the storage nodes. I/O Benchmark (IBM), command line ./ibm_vg{w r} /gpfs/xxx -nrc 4k -bsz 1m -pattern seq -ntasks 4; summary: write rate 491.4 MB/s*, read rate 533.0 MB/s. iozone, command line ./iozone.206 -c -R -b output.xls -C -r 32k -s 1024m -i 0 -i 1 -i 2 -i 5 -i 6 -i 7 -i 8 -W -t 4 -+m hostlist.cfs2a; summary: initial write 554.1 MB/s*, rewrite 264.1 MB/s*, read 526.7 MB/s, re-read 533.6 MB/s, stride read 31.3 MB/s, random read 11.4 MB/s, random mix 26.2 MB/s, random write 54.0 MB/s. (* write caching = on.)

Benchmark results on the client nodes (I/O Benchmark (IBM), write caching off; BW constrained by the single GbE adapter on each NSD server): 1 client: write = 92.3 MB/s, read = 111 MB/s; 8 clients: write = 360 MB/s, read = 384 MB/s. Using Myrinet on 8 clients and 2 FAStT900s: write = 397 MB/s, read = 585 MB/s; a second FAStT900 was needed since the peak read BW exceeded the ability of 1 FAStT900.

GPFS Today: Mixed AIX/Linux Configuration. An existing user network and GbE switch connect p690-1 and p690-2, a p615 CSM server, cluster management nodes 1 and 2, 256 e325 nodes, RIO drawers (RIO-1.1, RIO-1.2, RIO-2.1, RIO-2.2), a p615-2 TSM client/server, and application and scheduling nodes e325-3..6. Two 16-port SAN switches (SAN-1, SAN-2) connect FAStT900-1, FAStT900-2 and an LTO tape library (tape_1..tape_4). The p690s run AIX, the e325s run Linux; the p690s are NSD servers and compute clients, the e325s are compute clients; GPFS is mounted on the TSM/HSM server; GPFS packets transit the GbE network; the disk is FC. Overall peak aggregate BW < 800+ MB/s; peak aggregate BW on the e325s < 640 MB/s.

GPFS Today: Mixed AIX/Linux Configuration, Benchmarks. Benchmark configuration: 1 p690 and 32 x335s. Each row below lists nodes x tasks per node, then natural aggregate write/read rates, harmonic aggregate write/read rates, harmonic mean write/read rates (all MB/s), and the total file size (GB).

p690 (write caching = off, pattern = sequential, bsz = 1 MB):
  1 node  x  1 task:  640.10 / 765.28;                                            file 4 GB
  1 node  x  2 tasks: 641.07 / 760.75;  648.02 / 761.71;  324.01 / 380.85;        file 8 GB
  1 node  x  4 tasks: 631.84 / 739.31;  646.62 / 758.13;  161.65 / 189.53;        file 16 GB
  1 node  x  8 tasks: 649.14 / 724.88;  670.83 / 755.63;   83.85 /  94.45;        file 32 GB
  1 node  x 16 tasks: 646.97 / 721.26;  671.29 / 788.48;   41.96 /  49.28;        file 64 GB

x335 (write caching = off, pattern = sequential, bsz = 1 MB):
   1 node  x 1 task:  109.27 / 110.89;                                            file 4 GB
   2 nodes x 1 task:  198.92 / 218.65;  198.92 / 219.51;   99.46 / 109.76;        file 8 GB
   4 nodes x 1 task:  269.91 / 435.37;  269.95 / 437.68;   67.49 / 109.42;        file 16 GB
   8 nodes x 1 task:  282.44 / 624.19;  283.44 / 626.81;   35.43 /  78.35;        file 32 GB
  16 nodes x 1 task:  253.23 / 595.50;  281.78 / 598.18;   17.61 /  37.39;        file 64 GB
  32 nodes x 1 task:  269.77 / 577.53;  269.83 / 581.23;    8.43 /  18.16;        file 128 GB

Mixed nodes (write caching = off, pattern = sequential, bsz = 1 MB, ntasks = 4); the last rate column is the aggregate over x335 and p690 (write / read):
  x335 only***:    4 nodes x 1 task:  267.33 / 435.72;  268.29 / 437.23;  N/A;               file 16 GB
  p690 only:       1 node  x 4 tasks: 632.45 / 771.00;  650.98 / 788.25;  N/A;               file 16 GB
  x335 with p690:  4 nodes x 1 task:  233.03 / 386.27;  233.03 / 389.61;  707.48* / 801.66*; file 24 GB**
  p690 with x335:  1 node  x 4 tasks: 470.22 / 407.79;  477.30 / 412.81;  (combined above);  file 32 GB**

* Job times are nearly identical; therefore the iostat-measured rate was very close to the harmonic aggregate rate. ** Size of the combined files from each job; file sizes were adjusted so job times were approximately equal (combined files for the write = 24 GB, combined files for the read = 32 GB). *** x335 aggregate read rates were gated by the 4 GbE adapters at a little over 100 MB/s per adapter.

GPFS Today: GPFS in a BladeCenter, Standard Configuration. A BladeCenter chassis holds 14 HS20 blades (GPFS NSD clients) with GbE ports and FC ports. WARNING: do not connect a FC controller to the FC ports on the blade chassis... it is not supported. Behind the blades sit an optional SAN switch (Brocade 2109-F16), four x345 GPFS SAN nodes acting as disk server systems, and a DS4500 with EXP710 enclosures. SYSTEM ANALYSIS: 1. DS4500: sustained peak performance < 540 MB/s. 2. FC network: sustained peak performance < 600 MB/s. 3. GbE network (adapter aggregate measured over all 4 x345s): sustained peak performance < 360 MB/s. 4. Aggregate x345 rate: sustained peak performance < 500 MB/s. 5. Predicted aggregate HS20 rate: sustained peak performance < 360 MB/s. Comments: HS20 performance is constrained by the limited GbE ports; the rates for items 1-4 are based on benchmark tests; the SAN switch is optional, and using it may reduce the load on the GbE network and reduce aggregate application disk I/O bandwidth.

Lower Cost/Bandwidth Alternative: if less file access bandwidth is required, or a lower cost solution is required, then the x345/FAStT900 system can be replaced with: 2 disk servers (x345), each with 1 GbE adapter and 1 FC HBA; 1 FAStT600 and 1 disk enclosure; the SAN switch is optional; sustained peak performance < 200 MB/s.

Global File System over Multiple HS20 Systems: this standalone system can be replicated N times. By routing HS20 and x345 GbE traffic through a switch, the NSD layer in GPFS will enable all blades to see all LUNs; i.e., multiple HS20 systems can all safely mount the same GPFS file system, and performance will scale linearly.

GPFS Today: GPFS in a BladeCenter, Alternative Configuration. Blades 01 through 14 each use their internal IDE disks and are connected by the BladeCenter's internal GbE network to a GbE switch with external GbE ports. NOTE: this is not a configuration recommended for GPFS since the disks are not twin-tailed... "but it works."

BENCHMARK ANALYSIS: A private internal GbE network connects all blades; each blade has a single GbE adapter with an effective BW of 80 to 100 MB/s. Baseline performance (read from a single local disk using ext2): application read rate = 30 MB/s. Single-task GPFS performance: application read rate = 80 MB/s, 2.7x faster than the single-disk rate. Further analysis (assuming the active task is on blade 01): GPFS stripes over all 28 IDE drives in this configuration; GPFS uses the GbE network for striping activity, therefore single-blade GPFS performance is limited to the GbE adapter BW (i.e., up to 100 MB/s); each blade has only one GbE adapter, shared by GPFS and the general system, and acts as both a disk server and a GPFS client, so single-task performance = 80 MB/s; in a similar test on x345s, where there were separate GPFS clients and disk servers and 2 GbE adapters per node (one for GPFS and one for everything else), single-task performance = 100 MB/s. Aggregate rate (1 task per blade): read rate = 560 MB/s. Analysis: each blade is acting as a disk server as well as a client, and since GPFS scales linearly in the number of disk servers, this yields good aggregate performance; in a Winterhawk 2 system using SSA disk with each node acting as a GPFS client and disk server, the aggregate rate over 14 nodes was < 420 MB/s.

GPFS Today: SDSC/IBM StorCloud Architecture. 40 Linux/Intel nodes (intra-node network: GbE?), 3 FC HBAs per node, one connection via each of 3 FC switches (2 Gb FC). All disks are directly attached to the servers via the FC switches: 15 storage frames (Frame-01 through Frame-15), each with 4 x FAStT600 controllers (FAStT600-01 through FAStT600-04 per frame) and 8 connections per frame to the switches. Aggregate BW = 15 GB/s (sustained). See: http://www.sdsc.edu/press/2004/11/111204_sc04.html

GPFS Today: Selected SDSC/IBM StorCloud Statistics. 40 Linux nodes; 3 FC HBAs per node; 15 storage frames; 60 FAStT600s; 2520 disks; 240 LUNs; 8+P, 4 LUNs per FAStT600; 73 GB/disk at 15 Krpm. Sustained aggregate rate: 15 GB/s, i.e., 380 MB/s per node and 256 MB/s per FAStT600. COMMENT on IP vs FC: with today's technology, direct-attached disk models (e.g., SAN attached) can yield greater per-node BW than IP-based models; IP-based systems are limited by the Ethernet adapter (e.g., 80 MB/s for 1 GbE, 120 MB/s for dual bonded GbE), while direct-attached systems can have multiple FC HBAs (e.g., with 3 HBAs per node the BW is 380 MB/s). Will 10 GbE change this? Will IB change this?

GPFS Today: Storage Pools. Motivation: some newer file systems implement a concept called "storage pools"; GPFS supports a form of this. Disks present themselves to GPFS as LUNs; a GPFS file system can be mounted on any subset of these LUNs; there is 1 storage pool per FS; maximum of 32 file systems per GPFS cluster. Example: with a monolithic disk architecture (e.g., SATA) where access is "bursty", to avoid striping over and stressing all disks, divide the disks into 16 disjoint subsets and hence 16 file systems; file striping is confined to a file system, so when a file system is not in use, GPFS is not spinning its disks (n.b., all FSes are seen by all nodes in the GPFS cluster... up to 2000+ nodes). Example: with 2 classes of disk, FC and SATA, use 1 FS on the FC disk for constant access and 1 FS on the SATA disk for infrequent use.

GPFS Today: Storage Pools in a Mixed BladeCenter/pSeries Cluster. Blade centers #1 through #4 connect through a GbE switch; p575-1 through p575-12 connect through a Federation switch; a SAN switch connects a FAStT900 and a FAStT100.

GPFS Today: Cross-cluster Mounts. Problem: nodes outside the cluster need access to GPFS files. Solution: allow nodes outside the cluster to mount the file system. The owning cluster is responsible for administration, managing locking, recovery, etc. Separately administered remote nodes have limited status: they can request locks and other metadata operations; they can do I/O to the file system disks over a global SAN (IP, Fibre Channel, etc.); they are trusted to enforce access control and map user IDs. Uses: high-speed data ingestion, postprocessing (e.g., visualization); sharing data among clusters; separate data and compute sites (Grid); forming multiple clusters into a supercluster for grand challenge problems. (Diagram: Cluster 1 nodes with local disk access to the Site 1 SAN and the Cluster 1 file system; a global SAN interconnect; Cluster 2 nodes on the Site 2 SAN and a visualization system on the Site 3 SAN, both with remote disk access.)

GPFS Today: Cross-cluster Mounts, Example. Cluster_A (the home cluster, /fsa) has NSD nodes nsd_a1..nsd_a4 on an IP-based switch and a SAN switch; Cluster_B (the remote cluster) has NSD_B1 and NSD_B2 and mounts /fsa locally as /fsaonb; the clusters are connected by inter-switch links (at least GbE speed!). Cluster_B accesses /fsa from Cluster_A via the NSD nodes (see the example on the next page); OpenSSL (secure socket layer) provides secure access between the clusters.

UID MAPPING EXAMPLE (i.e., credential mapping): 1. pass the Cluster_B UID/GID(s) from the I/O thread node to mmuid2name; 2. map the UID to GUN(s) (Globally Unique Names); 3. send the GUN(s) to mmname2uid on a node in Cluster_A; 4. generate the corresponding Cluster_A UID/GID(s); 5. send the Cluster_A UID/GIDs back to the Cluster_B node running the I/O thread (for the duration of the I/O request). COMMENTS: mmuid2name and mmname2uid are user-written scripts made available to all users in /var/mmfs/etc; these scripts are called ID remapping helper functions (IRHF) and implement access policies; simple strategies (e.g., a text-based file with UID <-> GUN mappings) or third-party packages (e.g., the Globus Security Infrastructure from TeraGrid) can be used to implement the remapping procedures. See http://www-1.ibm.com/servers/eserver/clusters/whitepapers/uid_gpfs.html for details.

GPFS Today: Cross-cluster Mounts, Sysadmin Commands. To mount a GPFS file system from Cluster_A onto Cluster_B (assume the diagram from the previous page):

On Cluster_A: 1. Generate a public/private key pair: mmauth genkey (creates a public key file with the default name id_rsa.pub; start the GPFS daemons after this command!). 2. Enable authorization: mmchconfig cipherList=AUTHONLY. 3. The sysadmin gives the following file to Cluster_B: /var/mmfs/ssl/id_rsa.pub (renamed as cluster_a.pub). 7. Authorize Cluster_B to mount the FS owned by Cluster_A: mmauth add cluster_b -k cluster_b.pub.

On Cluster_B: 4. Generate a public/private key pair: mmauth genkey (creates a public key file with the default name id_rsa.pub; start the GPFS daemons after this command!). 5. Enable authorization: mmchconfig cipherList=AUTHONLY. 6. The sysadmin gives the following file to Cluster_A: /var/mmfs/ssl/id_rsa.pub (renamed as cluster_b.pub). 8. Define the cluster name, contact nodes and public key for cluster_a: mmremotecluster add cluster_a -n nsd_a1,nsd_a2,nsd_a3,nsd_a4 -k Cluster_A.pub. 9. Identify the FS to be accessed on cluster_a: mmremotefs add /dev/fsaonb -f /dev/fsa -C Cluster_A -T /dev/fsaonb. 10. Mount the FS locally: mount /fsaonb. See http://publib.boulder.ibm.com/clresctr/docs/gpfs/gpfs23/200412/bl1am10/bl1am1031.html#ammcch for details.

GPFS Today: GPFS is Easier to Administer. Example configuration on Ethernet (GbE): sam, an NSD client (Intellistation M Pro, 2 CPUs, 4 GB, Linux 2.4.20, GPFS 2.2, disk0 scsi, NSDs 11..26); frodo, an NSD server (p615, 2 CPUs, 4 GB, AIX 5.2, GPFS 2.2, hdisk0..hdisk3 scsi, 1 VG over hdisks 4..10, 8 LVs over 1 VG, NSDs hdisk 11..26); gandalf, an NSD server configured the same way. Disk: an EXP300 with 14 SCSI disks (36 GB, 15 Krpm; pdisks 0..6 and 7..13) and SSA with 16 disks (9 GB, 10 Krpm; pdisks 0..15). COMMENT: this could be a FAStT disk controller instead.

To build the file system (GPFS 2.3), do the following on gandalf: 1. mmcrcluster; 2. mmstartup; 3. mmcrnsd (specify the primary, secondary, client and server nodes in the disk descriptor file); 4. mmcrfs; 5. mount /<FS name>. COMMENTS: once the SAN zoning and low-level disk formats are complete, one can build GPFS in under 5 minutes on smaller systems; for StorCloud it took roughly 30 minutes, but that was over a 135 TB file system (n.b., 240 LUNs or 2520 disks). Other dynamic features: mmadddisk, mmaddnode, mmdeldisk, mmdelnode, mmchattr, mmchfs, mmchcluster, mmchconfig, mmchnsd, mmpmon.

So what is GPFS?... in one line or less: it is IBM's shared-disk, parallel file system for AIX and Linux clusters.

What is GPFS? GPFS = General Parallel File System, IBM's shared-disk, parallel file system for AIX and Linux clusters. Cluster: 2300+ nodes (tested), fast reliable communication, a common admin domain. Shared disk: all data and metadata on disk is accessible from any node through a disk I/O interface (i.e., "any to any" connectivity). Parallel: data and metadata flow from all of the nodes to all of the disks in parallel. RAS: reliability, accessibility, serviceability. General: supports a wide range of HPC application needs over a wide range of configurations.

GPFS Features: 1. General Parallel File System: a mature IBM product, generally available for 7 years. 2. A clustered, shared-disk, parallel file system for AIX and Linux. 3. Adaptable to many customer environments by supporting a wide range of basic configurations and disk technologies. 4. Provides safe, high-BW access using the POSIX I/O API. 5. Provides non-POSIX advanced features, e.g., DMAPI, data shipping, multiple access hints (also used by MPI-IO). 6. Provides good performance for large-volume, I/O-intensive jobs. 7. Works best for large-record, sequential access patterns; has optimizations for other patterns (e.g., strided, backward). 8. Strong RAS features (reliability, accessibility, serviceability). 9. Converting to GPFS does not require application code changes, provided the code works in a POSIX-compatible environment.

GPFS Features, Performance: 1. striping; 2. large blocks (with support for sub-blocks); 3. byte-range locking (rather than file or extent locking; a sketch using the standard POSIX fcntl() interface appears after the RAS list below); 4. access pattern optimizations; 5. file caching (i.e., the pagepool) that optimizes streaming access; 6. prefetch and write-behind; 7. multi-threading; 8. distributed management functions (e.g., metadata, tokens); 9. multi-pathing (i.e., multiple, independent paths to the same file data from anywhere in the cluster).

GPFS Features, RAS: GPFS provides many of its own RAS features and exploits RAS features provided by various subsystems. 1. If a node providing GPFS management functions fails, an alternative node assumes responsibility, reducing the risk of losing the file system. 2. When using dedicated NSD servers with "twin-tailed disks", specifying primary and secondary nodes lets the secondary node provide access to the disk if the primary node fails. WARNING: internal SCSI and IDE drives are not twin-tailed. 3. In a SAN environment, failover reduces the risk of lost access to data. 4. GPFS on RAID architectures reduces the risk of lost data. 5. Online and dynamic system management allows file system modifications without bringing down the file system: mmadddisk, mmdeldisk, mmaddnode, mmdelnode, mmchconfig, mmchfs.
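Item 3 in the performance list above (byte-range rather than whole-file locking) maps directly onto the standard POSIX record-locking interface. The hypothetical sketch below, mine rather than the presentation's, locks only the first 1 MB of a shared file with fcntl(), so another task can simultaneously hold a lock on a different range of the same file; the path and offsets are made up, and how GPFS coordinates such locks internally is the distributed token management of item 8.

    /* Sketch: take an exclusive lock on one byte range of a shared file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/gpfs/locked.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock fl = {0};
        fl.l_type   = F_WRLCK;      /* exclusive write lock           */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;            /* lock only bytes 0 .. 1 MB - 1  */
        fl.l_len    = 1 << 20;

        if (fcntl(fd, F_SETLKW, &fl) == -1) { perror("fcntl"); return 1; }

        /* ... write within the locked range; another task may hold a lock
         *     on a different range of the same file at the same time ... */

        fl.l_type = F_UNLCK;        /* release the range */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }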

GPFS Features, Other: 1. Disk scaling, allowing large, single-instantiation global file systems (100's of TB now, PB in the future). 2. Node scaling (2300+ nodes), allowing large clusters and high BW (many GB/s). 3. Multi-cluster architecture (i.e., grid). 4. Journaling (logging) file system: logs information about operations performed on the file system metadata as atomic transactions that can be replayed. 5. Data Management API (DMAPI): an industry-standard interface that allows third-party applications (e.g., TSM) to implement hierarchical storage management.

Why is GPFS needed? Cluster applications impose new requirements on the file system: "any to any" access (any node in the cluster has access to any data in the cluster); parallel applications need access to the same data from multiple nodes; serial applications dynamically assigned to processors based on load need high-performance access to their data from wherever they run; both good availability of data and normal file system semantics are required; and scalability to large numbers of nodes. GPFS supports this via: uniform access (a single-system image across the cluster); a conventional POSIX API (no program modification); high capacity (large files, 100 TB+ file systems); high throughput (wide striping, large blocks, many GB/s to one file); parallel data and metadata access (shared disk and distributed locking); reliability and fault tolerance (node and disk failures); online system management (dynamic configuration and monitoring).

Example GPFS Applications: customer applications that require fast, scalable access to large amounts of file data; these applications may be serial or parallel, reading or writing. Applications that serve data to visualization engines; seismic data acquisition processing with serial or parallel reading/writing of files; environments with very large data, especially when single file servers (such as NFS) reach capacity limits; digital library file serving; access to large CAD/CAM file sets; data mining applications; data cleansing applications preparing data for data warehouses; Oracle RAC; applications requiring data rates which exceed what can be delivered by other file systems; large aggregate scratch space for commercial or scientific applications; Internet serving of content to users with balanced performance; applications with high-availability (HA) file system requirements.

Selected New Features in GPFS 2.3: scale-out to larger clusters (over 2000 nodes); the current largest production cluster is 2300 blades. Bigger file systems (100's of TB). Bigger LUNs: there is no GPFS limitation; it is now an OS and disk limitation only (over 2 TB on AIX in 64-bit mode, up to 2 TB on "other supported" OSes). Depends less on disk protocols for many features (e.g., SCSI persistent reserve), so GPFS is portable to a wider variety of disk hardware. No longer requires RSCT. Only one cluster type (n.b., no need to specify sp, rpd or hacmp). Simpler quorum definition rules. A GPFS-specific performance monitor that measures latency and bandwidth. GPFS is easier to administer and use.