Data Layout Optimization for Petascale File Systems

Xian-He Sun
Illinois Institute of Technology
u@iit.edu

Yong Chen
Illinois Institute of Technology
yog.che@iit.edu

Yanlong Yin
Illinois Institute of Technology
yyi@iit.edu

ABSTRACT
In this study, the authors propose a simple performance model to promote a better integration between the parallel I/O middleware layer and parallel file systems. They show that application-specific data layout optimization can improve overall data access delay considerably for many applications. Implementation results under MPI-IO middleware and the PVFS file system confirm the correctness and effectiveness of their approach, and demonstrate the potential of data layout optimization in petascale data storage.

Categories and Subject Descriptors
B.4.3 [Input/Output and Data Communications]: Interconnections (Subsystems) - parallel I/O. D.4.2 [Operating Systems]: Storage Management - allocation/deallocation strategies, secondary storage.

General Terms
Measurement, Performance, Experimentation.

Keywords
Data layout, parallel file system, parallel I/O

1. INTRODUCTION
High-performance computing (HPC) has crossed the Petaflop mark and is moving forward to reach the Exaflop range [15]. However, while computing resources are making rapid progress, there is a significant gap between processing capacity and data-access performance. Due to this gap, although processing resources are available, they have to stay idle waiting for data to arrive, which leads to a severe overall performance degradation. Figure 1 shows the number of CPU cycles required to access cache memory (SRAM), main memory (DRAM), and disk storage [2]. It can be seen that accessing disk is hundreds of thousands of times slower. This trend is predicted to continue in the near future. In the meantime, applications are becoming more and more data intensive.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Supercomputing PDSW'09, Nov. 15, 2009. Portland, OR, USA. Copyright 2009 ACM 97-1-55-3-/9/11... $1.

Due to the growing performance disparity and emerging data-intensive applications, I/O and storage have become a critical performance bottleneck in HPC machines, especially when we are dealing with petascale data storage. Data layout mechanisms decide how data is distributed among multiple file servers. Data layout is a crucial factor in determining the data access latency and the I/O subsystem performance for high-performance computing. Recent work on log-like reordering of data [1][7] has demonstrated the importance of arranging data in a proper manner and the performance improvement it brings. Historically, however, parallel I/O middleware systems, such as ROMIO [14], and parallel file systems have been developed separately with a simplified modular design in mind. Parallel I/O middleware systems often assume the underlying system is one big file system, while parallel file systems often rely on the I/O middleware for data access optimization and do little in data layout optimization. In this study, we argue that purely depending on I/O middleware for data retrieval optimization is costly and may not be effective in many situations. We argue that if we pass some of the application-specific I/O request information to file systems for data layout optimization, the result could be much better.

Existing parallel file systems, such as PVFS [3], Lustre [5], and GPFS [11], provide high bandwidth for simple, well-formed, and generic I/O access characteristics, but their performance varies from application to application [][]. Tuning data layout according to specific I/O access patterns for a parallel I/O system is a necessity.
This tuning requires understanding file system abstractions, gaining knowledge of disk storage, knowing the design of high-level libraries, and making intelligent decisions. While PVFS and high-level parallel I/O libraries, such as MPI-IO [13] and HDF-5 [12], provide some functionality to customize data layout according to specific I/O workloads, few know how to use them effectively.

[Figure 1. Comparison of data access latency: data access time in CPU cycles, by year, for SRAM, DRAM, and disk.]

In this research, we study data layout optimization of parallel file systems. We show that, with the consideration of application-specific I/O requests, the data layout optimization can be totally different in a parallel file system. We present a system-level application-specific data layout optimization strategy for petascale data storage. By system-level, we mean that the proposed approach is integrated into the file system and is transparent to programmers and users. By application-specific, we mean that the proposed approach can adapt to specific data access patterns for a proper data layout.

The contribution of this study is twofold. First, we show that data layout optimization has a significant impact on petascale data storage performance. Second, we demonstrate with a simple performance model and the current simple data layout functionalities provided by PVFS that we can achieve noticeable performance gains. While our results are preliminary, they demonstrate the potential of the data layout optimization approach.

2. APPLICATION-SPECIFIC DATA LAYOUT MODELING
Modeling and evaluating the performance of data layout strategies is essential in providing an application-specific data layout optimization. The conventional round-robin distribution (referred to as simple striping in some existing work) is in place in many parallel file systems [3][5][11]. However, under parallel I/O systems, this simple distribution may not be the best data layout and can be improved. We present a simple data layout performance model herein. In this model, we assume that the connection between compute (I/O) nodes and file servers is not a performance bottleneck and that the significant overhead is in accessing file servers. We further assume that each file server's performance can be measured as $\alpha + s\beta$, where $\alpha$ is the startup time (latency), $s$ is the data size, and $\beta$ is the transmission time of a single unit of data (the reciprocal of the transmission rate).

In this model, we differentiate three data layout strategies: 1-D Horizontal Layout, 1-D Vertical Layout and 2-D Layout. The 1-D Horizontal Layout (or 1-DH in short) refers to the strategy in which data is distributed among all available file servers in a traditional round-robin fashion. This layout matches the existing simple striping or round-robin strategy.
The 1-D Vertical Layout (or 1-DV in short) refers to the strategy in which the data to be accessed by each process is stored on one given file server. The 2-D Layout (or 2-D in short) is the strategy in which the data to be accessed by each process is stored on a subset of file servers. Figure 2 illustrates these three strategies with an example.

Assume that we have $p$ computing (I/O) nodes and $n$ file servers, where all computing nodes participate in an SPMD form of parallel computing, with a block-cyclic or some similar, even data partitioning. With the 1-DH data layout, i.e., with the simple striping round-robin layout where exactly $s/n$ of the data are in any of the $n$ file servers, the costs of accessing data of size $s$ by one process and by $p$ processes are

$$\alpha + \frac{s}{n}\beta \quad\text{and}\quad p\left(\alpha + \frac{s}{n}\beta\right) = p\alpha + \frac{ps}{n}\beta \tag{1}$$

respectively. With the 1-DH layout, each process accesses its required data concurrently, but multiple processes have to access data one by one sequentially; and the data of each process is distributed evenly over different file servers. This strategy makes accesses in a sequential concurrent way. The value of Equation (1) depends on the values of $p$, $n$, $\alpha$ and $\beta$. In any case, however, the 1-DH layout or the conventional round-robin layout may not be the best choice when $p \ge n$. If we take the 1-DV layout, i.e., a concurrent sequential approach, we can get better performance, with cost $\frac{p}{n}(\alpha + s\beta)$. If $p < n$, then the data can be stored either on $n$ servers using the 1-DH layout or using the 2-D layout, where each of the $p$ processes gets $n/p$ file servers for data storage. For the former layout, the cost is $\alpha + s\beta$, and for the latter, the cost is $\alpha + \frac{sp}{n}\beta$.

[Figure 2. Data layout strategies (1-DH, 1-DV, and 2-D) for processes P0 to P3, annotated with the access cost of each layout.]

3. APPLICATION-SPECIFIC DATA LAYOUT OPTIMIZATION
With application-specific data layout modeling, we are able to guide data layout toward a better choice by considering data access characteristics. With the values of $p$, $n$ and $\alpha$, $\beta$, the proper data layout can be determined with the aforementioned data layout formulas for a given parallel I/O request. The data layout of a given application can then be determined based on the weighted summation of the costs of its I/O requests.
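The cost formulas of the model can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, 1-DH uses Equation (1), 1-DV is assumed to cost $\alpha + s\beta$ when every process has a server to itself ($p \le n$) and $\frac{p}{n}(\alpha + s\beta)$ otherwise, and 2-D is only evaluated for $p < n$.

```python
def cost_1dh(p, n, s, alpha, beta):
    """Equation (1): p processes served one after another,
    each request striped evenly over all n servers."""
    return p * (alpha + (s / n) * beta)


def cost_1dv(p, n, s, alpha, beta):
    """Each process's data sits on one server. Assumption: a server
    handles p/n processes in turn when p >= n, and one otherwise."""
    rounds = max(1.0, p / n)
    return rounds * (alpha + s * beta)


def cost_2d(p, n, s, alpha, beta):
    """Each process is striped over its own group of n/p servers."""
    if p >= n:
        raise ValueError("2-D layout is modeled here only for p < n")
    return alpha + (s * p / n) * beta


def best_layout(p, n, s, alpha, beta):
    """Return the cheapest layout name and the full cost table."""
    costs = {
        "1-DH": cost_1dh(p, n, s, alpha, beta),
        "1-DV": cost_1dv(p, n, s, alpha, beta),
    }
    if p < n:
        costs["2-D"] = cost_2d(p, n, s, alpha, beta)
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    # alpha, beta and s are made-up values in arbitrary units.
    print(best_layout(p=4, n=8, s=8.0, alpha=1.0, beta=1.0))
    print(best_layout(p=16, n=8, s=8.0, alpha=1.0, beta=1.0))
```

With these illustrative parameters the sketch selects 2-D when $p < n$ and 1-DV when $p \ge n$; real values of $\alpha$ and $\beta$ would have to be measured on the target file servers.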
The above model is deterministic and is ready to use under existing parallel file systems, such as PVFS.

In addition, in parallel I/O applications, it is common that an application accesses multiple files, and each file on multiple occasions. We store each file in a different layout to improve performance. When an application accesses a file in multiple patterns, it is necessary to find a layout that is beneficial for all patterns. For example, a file may be read in a contiguous access pattern and written in a complex non-contiguous pattern. From many observations [][13], accessing data in non-contiguous patterns performs worse than accessing it contiguously. Storing data to facilitate non-contiguous accesses may deteriorate contiguous access performance. We have to find a balance between these performance benefits when we decide on the layout. Based on pattern analysis, we can utilize a strategy that assigns each pattern a weight to represent its scope for I/O performance improvement.
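One way to picture the weighted selection is the sketch below. It is a hypothetical illustration rather than the paper's procedure: each access pattern is reduced to a tuple `(weight, p, s)`, the weight is assumed to be a relative importance chosen by the analyst, the per-request costs follow the model of Section 2 with the same assumptions about 1-DV and 2-D as stated there, and only layouts defined for every request are compared.

```python
def layout_costs(p, n, s, alpha, beta):
    """Model cost of one request under each layout (2-D only if p < n)."""
    costs = {
        "1-DH": p * (alpha + (s / n) * beta),
        "1-DV": max(1.0, p / n) * (alpha + s * beta),
    }
    if p < n:
        costs["2-D"] = alpha + (s * p / n) * beta
    return costs


def choose_layout(requests, n, alpha, beta):
    """requests: list of (weight, p, s) tuples, one per access pattern.
    Returns the layout with the smallest weighted cost sum, plus the sums."""
    weighted = {}
    for weight, p, s in requests:
        for layout, cost in layout_costs(p, n, s, alpha, beta).items():
            weighted.setdefault(layout, []).append(weight * cost)
    # Keep only layouts that are defined for every request.
    totals = {name: sum(parts) for name, parts in weighted.items()
              if len(parts) == len(requests)}
    return min(totals, key=totals.get), totals


if __name__ == "__main__":
    # Made-up workload: a dominant 16-process pattern and a minor 4-process one.
    patterns = [(0.75, 16, 8.0), (0.25, 4, 8.0)]
    print(choose_layout(patterns, n=8, alpha=1.0, beta=1.0))
```

The design choice here is deliberately simple: a single layout per file is picked by minimizing the weighted sum, so a rarely used pattern cannot dominate the decision.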

Based on the modeling and observations, we define a set of data layout heuristics as shown in Table 1. When I/O access characteristics are unknown or completely random, we rely on the 1-DH strategy, i.e., the default simple round-robin strategy. When the degree of I/O concurrency is high, it is beneficial to use the 1-DV layout. The 1-DH layout or the 2-D layout can be configured for a low degree of concurrency. In the case of TCP Incast [10], it is better to stripe data among a certain set of file servers instead of all available file servers, which is the 2-D layout. File systems such as PVFS provide features to extend and create new distributions [9]. We utilize these features in generating new application-specific distributions in our implementation.

Table 1. Heuristics for Choosing Layout

Access Pattern Features             Data Layout Strategy
Random                              1-DH (round-robin) layout
High degree of I/O concurrency      1-DV data layout
Low degree of I/O concurrency       1-DH or 2-D data layout
Too many I/O servers on TCP/IP      2-D data layout

After making decisions on the layout, we store data on file servers using the new layout. The 1-DH layout strategy, or the simple round-robin layout, with different stripe sizes and striping factors can be set with MPI-IO hints, such as striping_factor and striping_unit. A more complex distribution, such as the 1-DV or 2-D data layout, needs to be modified at the file system level to provide general support, but can be emulated with different striping_factor and striping_unit configurations. In addition, it is common for parallel file systems, such as PVFS, to provide flexible and extendable data distributions [9]. PVFS includes a modular system for adding new data distributions to the system and using these for new files and optimized layouts. Since our current implementation focuses on prototyping the idea and verifying the potential performance gain, we employ a relatively quick prototyping strategy by using parallel file system configurations to provide support for various layout strategies. The current prototyping system has demonstrated a significant performance improvement over existing strategies, as the following sections show.
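The heuristics of Table 1 and the idea of emulating layouts through striping hints can be sketched as follows. Everything beyond the hint names striping_factor and striping_unit is an assumption made for illustration: the function names are hypothetical, "high concurrency" is taken to mean $p \ge n$, the low-concurrency case arbitrarily prefers 2-D, and the emulation assumes each process accesses one contiguous block of $s$ bytes in a file striped round-robin over all $n$ servers, with $n$ divisible by $p$ and $s$ divisible by $n$.

```python
def choose_by_heuristic(random_access, p, n, incast_risk=False):
    """Table 1 as a function (thresholds are assumptions, see lead-in)."""
    if random_access:
        return "1-DH"
    if incast_risk:            # too many I/O servers on TCP/IP
        return "2-D"
    if p >= n:                 # high degree of I/O concurrency
        return "1-DV"
    return "2-D"               # low concurrency: Table 1 allows 1-DH or 2-D


def striping_unit_for(layout, p, n, s):
    """Stripe unit (bytes) that emulates `layout` for contiguous
    per-process blocks of s bytes striped over n servers."""
    if layout == "1-DH":
        return s // n          # every block spreads over all n servers
    if layout == "1-DV":
        return s               # every block lands on a single server
    if layout == "2-D":
        return s * p // n      # every block spans n/p servers
    raise ValueError(f"unknown layout: {layout}")


def servers_touched(rank, s, unit, n):
    """Servers holding the block of process `rank` under round-robin striping."""
    first = rank * s // unit
    last = ((rank + 1) * s - 1) // unit
    return sorted({stripe % n for stripe in range(first, last + 1)})


def mpi_io_hints(layout, p, n, s):
    """Hint key/value strings, in the form they would be set on an MPI_Info object."""
    return {"striping_factor": str(n),
            "striping_unit": str(striping_unit_for(layout, p, n, s))}


if __name__ == "__main__":
    s = 1 << 20  # 1 MiB per process, an arbitrary example size
    for layout in ("1-DH", "1-DV", "2-D"):
        unit = striping_unit_for(layout, 4, 8, s)
        print(layout, mpi_io_hints(layout, 4, 8, s), servers_touched(1, s, unit, 8))
```

Whether a given file system honors these hints, and at which point in a file's lifetime they take effect, depends on the MPI-IO implementation and the file system, so the mapping should be verified on the target installation.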
General, full-fledged support for data layout strategies at the parallel file system level is under development as well.

4. PRELIMINARY EXPERIMENTAL RESULTS
We have carried out a prototype implementation of application-specific data layout on the PVFS parallel file system based on the previously discussed model and optimization strategy. We currently support three strategies: the 1-DH, 1-DV and 2-D layouts. The following subsections present the initial experimental results of these application-specific strategies under different scenarios.

4.1 Experimental Setup
Our experiments were conducted on a 17-node Dell PowerEdge Linux-based cluster and a 5-node Sun Fire Linux-based cluster. The Dell cluster is composed of one Dell PowerEdge 5 head node, with dual. GHz Xeon processors and GB memory, and 1 Dell PowerEdge 5 compute nodes with dual 3. GHz Xeon processors and 1 GB memory. The head node has two 73 GB U320 10K-RPM SCSI drives. Each compute node has a GB 7.2K-RPM SATA hard drive. The Sun cluster is composed of one Sun Fire X head node, with dual.7 GHz Opteron quad-core processors and GB memory, and Sun Fire X compute nodes with dual.3GHz Opteron quad-core processors and GB memory. The head node has 5GB 7.2K-RPM SATA-II drives configured as a RAID-5 system. Each compute node has a 5GB 7.2K-RPM SATA hard drive.

The experiments were run on the PVFS file system. For the Dell cluster, PVFS was configured with one metadata server node, the head node, and I/O server nodes. All nodes are used as compute nodes. For the Sun Fire cluster, PVFS was configured with 3 I/O server nodes. The rest of the nodes are used as compute nodes.

4.2 Experimental Results and Analyses
4.2.1 Synthetic Benchmark
We have coded a synthetic benchmark which performs sequential reads over files stored with different layouts. We have performed a series of tests on the Dell cluster. The first set of experiments compares the performance of different layout strategies with four compute processes. In this scenario, four processes retrieve data from MB, 1MB, 3MB, MB and MB files respectively. These files are stored on eight file servers with three layouts: 1-DH, 1-DV and 2-D.
We measured the performance of retrieving data in each case, and the results are shown in Figure 3. The reported results are the average of three runs. We flushed the system buffer cache between each run.

[Figure 3. I/O performance with different layout strategies, by file size.]

Figure 3 clearly shows that different layout strategies do have a considerable impact on the performance of parallel I/O systems. Among the three layouts, the 2-D layout achieved the best performance in all cases. This is consistent with our model and analysis that the 2-D layout is desired when the number of compute processes is less than that of I/O server nodes. In the meantime, the 1-DH layout, or the default round-robin layout, performed worse than both the 1-DV and 2-D layouts, and the performance disparity was up to .%.
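As a worked instance of the model for this configuration (four processes, eight file servers), the costs can be written out explicitly. This is a sketch under stated assumptions: each process reads $s$ units of data, 1-DH follows Equation (1), and 1-DV is assumed to give each process a server to itself because $p \le n$.

```latex
\begin{align*}
T_{\text{1-DH}} &= p\left(\alpha + \frac{s}{n}\beta\right) = 4\alpha + \frac{s}{2}\beta, \\
T_{\text{1-DV}} &= \alpha + s\beta, \\
T_{\text{2-D}}  &= \alpha + \frac{sp}{n}\beta = \alpha + \frac{s}{2}\beta .
\end{align*}
```

Under these assumptions the 2-D layout is cheaper than 1-DH by $3\alpha$ and cheaper than 1-DV by $\frac{s}{2}\beta$, whatever the file size. The ordering of 1-DH and 1-DV depends on the parameters: 1-DH is the more expensive of the two whenever $3\alpha > \frac{s}{2}\beta$, i.e., when startup cost dominates transfer time.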

We have also performed a detailed analysis to verify the proposed model. We compute the theoretical values with the model and the measured disk transfer time and startup time. The theoretical and experimental results are shown in Figure 4 (the 1-DH layout is omitted here due to the space limit). As can be seen from the results, there is a close match between the experimental results and the theoretical results, which shows that the model can estimate the performance of these layout strategies well.

[Figure 4. Experimental and theoretical results, by file size. (Left: 1-DV layout; Right: 2-D layout)]

The other set of experiments we have conducted compares the impact of layout strategies with 1 compute processes. This set of tests is similar to the previous tests, but the file sizes are doubled in order to compare the performance with various file sizes. The results show that the 1-DV layout outperformed the other two strategies in all cases, which is consistent with the model and analysis presented in Section 2. The results are shown in Figure 5.

[Figure 5. I/O performance with different layout strategies, by file size.]

4.2.2 IOR Benchmark
In addition to the synthetic benchmark measurements, we have performed a series of tests on the Sun cluster with the IOR-.1. benchmark from Lawrence Livermore National Laboratory [6]. In these experiments, we performed testing at a larger scale. We configured PVFS with 3 I/O server nodes and ran the tests with processes on 3 client nodes (client nodes are separate from I/O server nodes). We performed both sequential read/write and random read/write tests, and varied the stripe size and the file size.

[Figure 6. Random read/write with KB stripe size, by file size (MB).]

[Figure 7. Sequential read/write with KB stripe size, by file size (MB).]

[Figure 8. Random read/write with 1MB stripe size, by file size (MB).]
Figure 6 and Figure 7 report the bandwidth results of accessing files with different layouts in a random or sequential manner, respectively, with KB stripe size for the 1-DH and 2-D layouts. Figure 8 and Figure 9 report the results in a similar scenario, but with 1MB stripe size for the 1-DH and 2-D layouts.

[Figure 9. Sequential read/write with 1MB stripe size, by file size (MB).]

As can be seen from these results, different layout strategies can affect the IOR benchmark performance considerably. Among the three strategies we specifically analyze, the 1-DV strategy generally performs better than the other two, while the 2-D strategy performs better than the 1-DH strategy.

Although the current experimental results are preliminary, they have demonstrated that data layout strategies have a considerable impact on parallel I/O systems. The proposed model and application-specific data layout optimization are desired to dynamically adapt the layout to achieve better performance under different scenarios.

5. ONGOING WORK
We have reported some initial results, while several studies are ongoing and are not ready to report at this time. For instance, we are working on a comprehensive data layout model to characterize the performance impact of layout strategies in general cases based on probability and queuing theory. The basic idea of the general model is that each I/O node can be modeled as an independent queue. I/O requests come into these queues and are serviced for either storing or retrieving data. When contention occurs, requests have to wait in the queue to be serviced. Multiple queues are independent from each other, and data layout optimizations on parallel file servers are derived accordingly. This model characterizes concurrency (parallelism) and contention, the two major roles that a data layout strategy plays in affecting system performance, to guide an optimal layout selection. We have developed a theoretical model and are working on the experimental part to verify the model. We are also moving the experimental testing to a much larger computer cluster than the ones we have used.

6. CONCLUSION
Parallel I/O middleware and parallel file systems are fundamental and critical components for petascale storage. While both technologies have had their successes, little has been done on application-specific data layout. In most existing file systems, data is distributed among multiple servers primarily with a simple round-robin strategy. This simple data layout strategy does not always work well for parallel I/O systems, where I/O requests are generated concurrently. In this study, we have proposed an application-specific data layout strategy to optimize the performance of accessing data according to distinct application features.
This data layout strategy optimization is built upon a simple but effective data layout model, and has been prototyped with the configuration facility of the underlying PVFS parallel file system. Parallel file systems have been designed as one-set-for-all and have been static. There is a great need for research into next-generation I/O architectures to support access awareness, intelligence, and application-specific adaptive data distribution and redistribution. Although our current results are very limited, our prototyping system has demonstrated the great potential of improving parallel I/O access performance via data layout optimization when access characteristics are taken into consideration. We believe that the application-specific data layout optimization approach needs community attention. This approach appears to be a feasible solution to mitigating the I/O wall problem, especially for petascale data storage.

7. ACKNOWLEDGMENTS
The authors are thankful to Dr. Rajeev Thakur, Dr. Rob Ross and Sam Lang of Argonne National Laboratory for their constructive and thoughtful suggestions toward this study. This research was supported in part under NSF grants CCF-35 and CCF- 93777.

8. REFERENCES
[1] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, M. Wingate, PLFS: A Checkpoint Filesystem for Parallel Applications, in Proc. of ACM/IEEE SuperComputing'09.
[2] R. E. Bryant and D. O'Hallaron, Computer Systems: A Programmer's Perspective, Prentice-Hall, 2003.
[3] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, PVFS: A Parallel File System For Linux Clusters, in Proceedings of the 4th Annual Linux Showcase and Conference, 2000.
[4] P. E. Crandall, R. A. Aydt, A. A. Chien, and D. A. Reed, Input/Output Characteristics of Scalable Parallel Applications, in Proceedings of the ACM/IEEE Conference on Supercomputing, 1995.
[5] Cluster File Systems Inc., Lustre: A Scalable, High Performance File System, Whitepaper, http://www.lustre.org/docs/whitepaper.pdf.
[6] Interleaved or Random (IOR) Benchmarks, http://sourceforge.net/projects/ior-sio/.
[7] J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki and C. Jin, Flexible IO and Integration for Scientific Codes Through the Adaptable IO System (ADIOS), in Proc. of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, 2008.
[8] J. May, Parallel I/O for High Performance Computing, Morgan Kaufmann Publishing, 2001.
[9] PVFS Development Team, PVFS Developer's Guide, http://www.pvfs.org/cvs/pvfs---branch-docs/doc//pvfs-guide.pdf.
[10] A. Phanishayee, E. Krevat, V. Vasudevan, D. Andersen, G. Ganger, G. Gibson and S. Seshan, Measurement and Analysis of TCP Throughput Collapse in Cluster-Based Storage Systems, in Proceedings of File and Storage Technologies (FAST), 2008.
[11] F. Schmuck and R. Haskin, GPFS: A Shared-Disk File System for Large Computing Clusters, in 1st USENIX Conference on File and Storage Technologies, USENIX, 2002.
[12] The HDF5 Project, HDF5 - A New Generation of HDF, NCSA, Univ. of Illinois at Urbana Champaign. Available at http://hdf.ncsa.uiuc.edu/hdf5.
[13] R. Thakur, W. Gropp and E. Lusk, Optimizing Noncontiguous Accesses in MPI-IO, Parallel Computing, (28)1:83-105, 2002.
[14] R. Thakur, W. Gropp and E. Lusk, Data Sieving and Collective I/O in ROMIO, in Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, 1999.
[15] Top 500 Supercomputing Website. http://www.top500.org.