A parallel file system made in Germany
Tiered Storage and HSM
May 7th, 2012, HLRS Stuttgart
Franz-Josef Pfreundt, Competence Center for HPC
Sven Breuner
Fraunhofer Institute for Industrial Mathematics
Mathematical models, algorithms, simulations, software, visualization, data mining
Fluid dynamics, Li-ion battery simulation, optimization
CC-HPC
Fraunhofer Competence Center for HPC - Business Fields
- GPI, GPI-Space, HPC Tools, Visualization, Green IT, HPC Apps, Seismic
- Maximizing efficiency: parallel programming models, distributed computing, parallel algorithms, parallel file systems
- Seismic imaging, ray tracing in visualization, distributed energy management
R. Fontana (IBM) on storage media development
Cost relationship (R. Fontana, IBM):
- SSDs will stay expensive
- Tape will still play an important role and will grow fast in capacity
FhGFS Key Features - Maximum Scalability
- Distributed file contents & metadata
- Low server load, efficient multithreading
- Cloud-type installations with more than 300 servers
- Object storage: servers use a local file system (XFS, EXT, ZFS, ...)
FhGFS Key Features - Flexibility
- Add clients and servers without downtime
- Clients and servers can run on the same machine: dedicated storage cluster or combined compute + storage
- On-the-fly storage init (mkfs)
- Multiple networks with dynamic failover
- Flexible striping: individual settings on a per-file / per-directory basis (see the sketch below)
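To make per-directory striping concrete, here is a minimal conceptual sketch, not the actual FhGFS implementation; the target names, chunk size and placement rule are assumptions for illustration only:

```python
# Conceptual sketch (not FhGFS code): how per-directory stripe settings
# (number of targets, chunk size) could map a byte offset in a file to a
# storage target and a chunk index. Names and defaults are illustrative.

def locate_chunk(offset, stripe_targets, chunksize=512 * 1024):
    """Return (target, chunk_index, offset_in_chunk) for a byte offset."""
    chunk_number = offset // chunksize                    # global chunk counter
    target = stripe_targets[chunk_number % len(stripe_targets)]
    chunk_index = chunk_number // len(stripe_targets)     # chunk file on that target
    return target, chunk_index, offset % chunksize

# Example: a directory striped over 4 storage targets
targets = ["storage01", "storage02", "storage03", "storage04"]
for off in (0, 600 * 1024, 3 * 1024 * 1024):
    print(off, locate_chunk(off, targets))
```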
Fraunhofer Seislab - Interactive Seismic Imaging, Compute & Storage
- 20 compute nodes: 48-96 GB RAM, 4 x 256 GB SSD (striped), QDR InfiniBand
- 5 compute & storage nodes: 20 TB HDD, RAID5 (archive), QDR InfiniBand
- Tier 1: 20 TB SSD; Tier 2: 120 TB HDD
- On-demand FhGFS using the SSDs: per job up to 20 TB, read 30 GB/sec, write 20 GB/sec
- Network bisection bandwidth ~ I/O performance for out-of-core applications
(a back-of-the-envelope check follows below)
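A rough consistency check of the Seislab figures above, assuming decimal units and an even split of the aggregate bandwidth across the 20 nodes:

```python
# Back-of-the-envelope check of the Seislab numbers on this slide
# (assumptions: SSD capacity in decimal GB, bandwidth split evenly per node).

nodes = 20
ssds_per_node = 4
ssd_capacity_gb = 256

tier1_tb = nodes * ssds_per_node * ssd_capacity_gb / 1000
print(f"Tier 1 SSD capacity: ~{tier1_tb:.1f} TB")        # ~20.5 TB, matches "20 TB SSD"

read_gb_s, write_gb_s = 30, 20
print(f"Per-node read:  {read_gb_s / nodes:.2f} GB/s")   # ~1.5 GB/s per node
print(f"Per-node write: {write_gb_s / nodes:.2f} GB/s")  # ~1.0 GB/s per node
```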
FhGFS Key Features - Easy to Use
- Automated cluster installation
- Client kernel module, user-space servers, no kernel patches
- Graphical system administration & monitoring
- No specific Linux distribution, no special hardware required
Question: What is the right direction?
- Compute nodes -> POSIX -> fast NAND storage
- Cloud-type storage (HDFS, non-POSIX): read, write, delete, append; extended by tape archives
- FhGFS can run on a massive number of I/O nodes
- We could easily implement an HDFS-like API - should we? (a sketch of such an interface follows below)
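For illustration only, a minimal sketch of what an HDFS-like, non-POSIX interface (whole-file read/write/delete/append) could look like; this is not an existing FhGFS API, and the class and method names are hypothetical:

```python
# Illustration only: a minimal HDFS-like (non-POSIX) interface, to make the
# contrast with POSIX concrete. Not an existing FhGFS API.

class HdfsLikeStore:
    """Whole-file operations only: no in-place updates, no byte-range writes."""

    def __init__(self):
        self._files = {}                      # path -> bytes

    def write(self, path, data: bytes):
        if path in self._files:
            raise FileExistsError(path)       # files are write-once
        self._files[path] = data

    def append(self, path, data: bytes):
        self._files[path] = self._files.get(path, b"") + data

    def read(self, path) -> bytes:
        return self._files[path]

    def delete(self, path):
        del self._files[path]

store = HdfsLikeStore()
store.write("/seismic/shot001.dat", b"header")
store.append("/seismic/shot001.dat", b" trace data")
print(store.read("/seismic/shot001.dat"))
```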
FhGFS - We focus on performance & reliability
- Light-weight client kernel module
- High single-stream throughput (>2.7 GB/s on QDR IB); two streams saturate QDR IB
- Efficient metadata implementation (measurements on Seislab):
  Single metadata server (SSD): 15,369 file creates/sec, 106,807 file stats/sec
  Four metadata servers (SSD): 58,564 file creates/sec, 373,484 file stats/sec
  (FhGFS will increase the file create rate with the next release)
- TU Vienna: 12 servers, 300 TByte, 6 GB/sec, 1200 clients, 12 metadata servers (SSD, one on each server); the network is the bottleneck
- FhGFS is made for HPC and on-demand file systems
(a quick scaling check follows below)
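A quick scaling check derived from the create/stat rates above:

```python
# Scaling check for the metadata numbers on this slide
# (1 vs. 4 metadata servers on SSDs, measured on Seislab).

creates = {1: 15369, 4: 58564}
stats   = {1: 106807, 4: 373484}

for name, vals in (("creates", creates), ("stats", stats)):
    speedup = vals[4] / vals[1]
    efficiency = speedup / 4
    print(f"file {name}: {speedup:.1f}x with 4 servers "
          f"({efficiency:.0%} parallel efficiency)")
# -> creates scale ~3.8x, stats ~3.5x, i.e. close to linear with server count.
```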
FhGFS Customer Feature - Server Preference
- Clients can prefer a subset of servers => support for multiple data centers
- WAN setup @ Uni FFM, by Jan Heichler, Clustervision
(a conceptual sketch of preference-based selection follows below)
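A conceptual sketch, not FhGFS code, of what preference-based target selection means in practice; the target names and the availability set are invented:

```python
# Conceptual sketch (not FhGFS code): a client prefers storage targets in its
# own data center and only falls back to remote targets when needed.

def choose_targets(all_targets, preferred, available, count):
    """Pick `count` targets, preferred (local) ones first, then the rest."""
    local  = [t for t in all_targets if t in preferred and t in available]
    remote = [t for t in all_targets if t not in preferred and t in available]
    return (local + remote)[:count]

targets   = ["ffm-stor01", "ffm-stor02", "remote-stor01", "remote-stor02"]
preferred = {"ffm-stor01", "ffm-stor02"}           # targets in the local data center
available = {"ffm-stor02", "remote-stor01", "remote-stor02"}

print(choose_targets(targets, preferred, available, count=2))
# -> ['ffm-stor02', 'remote-stor01']: local target first, remote as fallback.
```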
100 GBit Testbed (Dresden <-> Freiberg), by Michael Kluge, TU Dresden
Uni-directional:
- GPFS: 10.1 GB/s (60 km)
- Lustre: 11.8 GB/s (60 km)
- FhGFS: 12.4 GB/s (400 km)
Bi-directional:
- GPFS: n/a
- Lustre: 21.9 GB/s (60 km)
- FhGFS: 22.5 GB/s (400 km)
(a link-utilization estimate follows below)
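A rough utilization estimate for these numbers, assuming a single 100 Gbit/s link per direction, decimal units, and ignoring protocol overhead:

```python
# Rough link-utilization estimate for the 100 GBit testbed numbers above
# (assumption: one 100 Gbit/s link per direction, overhead ignored).

link_gb_s = 100 / 8            # 100 Gbit/s ~= 12.5 GB/s per direction

uni = {"GPFS": 10.1, "Lustre": 11.8, "FhGFS": 12.4}
bi  = {"Lustre": 21.9, "FhGFS": 22.5}

for fs, bw in uni.items():
    print(f"{fs} uni-directional: {bw / link_gb_s:.0%} of the link")
for fs, bw in bi.items():
    print(f"{fs} bi-directional:  {bw / (2 * link_gb_s):.0%} of both directions")
# FhGFS reaches ~99% uni-directional and ~90% bi-directional utilization.
```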
Last Major Release (August 2011) - Faster, more flexible, easier to use
- Re-designed metadata request handling to scale to high numbers of CPU cores
- All file attributes stored on the metadata server
- Distributed POSIX file locking
- Parallel online file system check/repair
- Client operation counters
- Simplified automatic updates via software repositories
- Multiple storage targets per server
Business Model
- No license fees; pay for support and maintenance
- Open source on an individual basis - so far not a community request
Our Supported Customers (> 50)
- HPC centers, Oil & Gas, medical research, media world, cloud computing, social media... and more
- University of Oslo: no system halt for software reasons, happy users
About the FhGFS Roadmap
- Some FhGFS roadmap pillars are fixed, e.g.: HA, HSM
- We leave some room to implement interesting user ideas, e.g.: server affinity, client operation counters
- We learned that we need to leave some room to improve Linux kernel / tools, e.g.: tail, ls -l, Linux RDMA
- And we have enough people in the institute who develop HPC applications with disruptive new ideas and challenge the I/O subsystems every day.
Next Major Release (Q3 2012)
- Data/metadata mirroring over multiple FhGFS servers, configurable on a per-file (per-directory) basis
- Later: server groups for remote mirroring
- Quota/ACL support
- Improved NFS re-export
- MAC support (2012)
- ... and more
Typical Cloud-Type Storage Solution
- Cheap storage systems that include a server board: 100 TB/system, 1 PB < 250 KEuro
- High availability by data mirroring; no server failover (complex setup and expensive)
- Synchronous mirrors per site, asynchronous mirroring across sites
Main advantage of FhGFS:
- Run client, server, metadata server and application server on the same machine
- Internal data mirroring with the next release
(a cost-per-TB estimate follows below)
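The cost arithmetic implied by these figures, assuming decimal units (1 PB = 1000 TB):

```python
# Simple cost arithmetic from the slide's rough numbers.

pb_cost_keur = 250          # "1 PB < 250 KEuro"
tb_per_system = 100         # "100 TB/system"

cost_per_tb = pb_cost_keur * 1000 / 1000          # Euro per TB
systems_per_pb = 1000 / tb_per_system
cost_per_system = pb_cost_keur / systems_per_pb   # KEuro per 100 TB system

print(f"< {cost_per_tb:.0f} Euro/TB")             # < 250 Euro/TB
print(f"~{systems_per_pb:.0f} systems per PB, < {cost_per_system:.0f} KEuro each")
```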
FhGFS & HSM
- Archive and backup need to utilize the capabilities of a PFS
- We want to use our scalable MDS system to support archive & backup
- For large systems, non-POSIX access to data is required; we may support the Hadoop API in the future
- No tapes? Tape capacity grows faster than HDD capacity (2014: 6-10 TB/tape, LTO), and tapes do not consume electricity
- Grau Data and Fraunhofer decided to work together to provide a competitive system
Hierarchical Storage Management
- Grau Data provides Grau ArchiveManager (GAM) as a solid single-server HSM solution
- We will integrate HSM information into our MDS, and the MDS will communicate directly with GAM => scalable, fast HSM solution
- The combined solution will support: parallel data migration (e.g. recall all file chunks at once), collocation IDs, asynchronous recalls
- More in the next talk
(a conceptual sketch of a parallel recall follows below)
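A conceptual sketch of what "recall all file chunks at once" means; this is neither GAM nor FhGFS code, and the function names are hypothetical:

```python
# Conceptual sketch: the chunks of a migrated file live on different storage
# targets, so they can be recalled from the archive tier in parallel rather
# than serially. Not GAM or FhGFS code.

from concurrent.futures import ThreadPoolExecutor

def recall_chunk(target, chunk_id):
    """Placeholder for fetching one chunk back from the archive tier."""
    # In a real system this would trigger the HSM back end for this chunk.
    return f"{target}:chunk{chunk_id} recalled"

def parallel_recall(chunks):
    """Recall every (target, chunk_id) pair concurrently."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda c: recall_chunk(*c), chunks))

# Example: a file striped over four storage targets
file_chunks = [("storage01", 0), ("storage02", 1), ("storage03", 2), ("storage04", 3)]
for line in parallel_recall(file_chunks):
    print(line)
```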
- FhGFS is the fastest system today
- FhGFS is scalable in every respect
- FhGFS is easy to install and maintain
- FhGFS will be combined with HSM in 2012
http://www.fhgfs.com
fhgfs-user@googlegroups.com
Franz-Josef Pfreundt, Sven Breuner
pfreundt@itwm.fhg.de, breuner@itwm.fhg.de