The Lustre File System
Eric Barton
Lead Engineer, Lustre Group
Sun Microsystems
Lustre Today
> What is Lustre
> Deployments
> Community
Lustre Development Topics
> Industry Trends
> Scalability Improvements
Lustre File System
The world's fastest, most scalable file system
Parallel shared POSIX file system
Scalable
> High performance
> Petabytes of storage
> Tens of thousands of clients
Coherent
> Single namespace
> Strict concurrency control
Heterogeneous networking
High availability
Open source (GPL)
Multi-platform, multi-vendor
Lustre File System
Major components (diagram)
> Many clients
> MGS: configuration
> MDS: namespace
> OSS: data
Lustre Networking
Simple
> Message queues
> RDMA: active (get/put), passive (attach)
Asynchronous (see the sketch below)
> Events
> Error handling
> Unlink
Layered
> LNET / LND
> Multiple networks
> Routers
RPC
> Queued requests
> RDMA bulk
> RDMA reply
Recovery
> Resend
> Replay
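To make the asynchronous pattern above concrete, here is a minimal toy sketch in C: an operation is posted, and its completion (send done, RDMA reply, or an error requiring resend) arrives later as an event on a queue. All names and structures are hypothetical illustrations of the pattern, not the real LNET API.

```c
/* Toy event queue: posted operations complete asynchronously as events.
 * Hypothetical names; the real LNET API differs. */
#include <stdio.h>

enum ev_type { EV_SEND_DONE, EV_REPLY, EV_UNLINK, EV_ERROR };

struct event { enum ev_type type; int msg_id; };

#define EQ_SIZE 16
static struct event eq[EQ_SIZE];
static int eq_head, eq_tail;

static void eq_post(enum ev_type type, int msg_id)
{
    eq[eq_tail++ % EQ_SIZE] = (struct event){ type, msg_id };
}

static int eq_poll(struct event *ev)
{
    if (eq_head == eq_tail)
        return 0;                     /* nothing pending */
    *ev = eq[eq_head++ % EQ_SIZE];
    return 1;
}

int main(void)
{
    eq_post(EV_SEND_DONE, 1);         /* small request landed on the wire */
    eq_post(EV_REPLY, 1);             /* server pushed its reply via RDMA */
    eq_post(EV_ERROR, 2);             /* peer failure: caller must resend */

    struct event ev;
    while (eq_poll(&ev))
        printf("msg %d: event %d\n", ev.msg_id, ev.type);
    return 0;
}
```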
A Lustre Cluster (diagram)
> Lustre clients (10s - 10,000s) on multiple networks: TCP/IP, QSNet, Myrinet, InfiniBand, iWARP, Cray SeaStar
> LNET routers bridge the networks
> Metadata servers: MDS 1 (active), MDS 2 (standby)
> I/O servers (OSS) on commodity storage servers
> Shared storage enables failover
> Enterprise-class storage arrays & SAN fabrics
Lustre Today
Lustre is the leading HPC file system
> 7 of the Top 10 systems
> Over 40% of the Top 100
Demonstrated scalability and performance
> 190 GB/sec I/O
> 26,000 clients
> Many systems with over 1,000 nodes
Livermore Blue Gene/L SCF
> 3.5 PB storage; 52 GB/s I/O throughput
> 131,072 processor cores
TACC Ranger
> 1.73 PB storage; 40 GB/s I/O throughput
> 62,976 processor cores
Sandia Red Storm
> 340 TB storage; 50 GB/s I/O throughput
> 12,960 multi-core compute sockets
ORNL Jaguar
> 10.5 PB storage; 240 GB/s I/O throughput goal
> 265,708 processor cores
Center-wide File System
Spider will provide a shared, parallel file system for all systems
> Based on the Lustre file system
> Demonstrated bandwidth of over 190 GB/s
> Over 10 PB of RAID-6 capacity: 13,440 1 TB SATA drives
> 192 storage servers, 3 TB of memory
> Available from all systems via our high-performance, scalable I/O network: over 3,000 InfiniBand ports, over 3 miles of cable
> Scales as storage grows
> Undergoing system checkout, with deployment expected in summer 2009
Future LCF Infrastructure (diagram)
> Spider at the center of SION, connecting the XT5, XT4, login nodes, Everest powerwall, remote visualization cluster, end-to-end cluster, application development cluster, and a 25 PB data archive
Lustre Success - Media
Customer challenges
> Eliminate data storage bottlenecks resulting from scalability issues NFS can't handle
> Increase system performance and reliability
Lustre value
> Doubled data storage at a third of the cost of competing solutions
> The ability to provide a single file system namespace to its production artists
> Easy-to-install open source software with great flexibility in storage and server hardware
"While we were working on The Golden Compass, we faced the most intensive I/O requirements of any project to date. Lustre played a vital role in helping us to deliver this project."
- Daire Byrne, senior systems integrator, Framestore
Lustre Success - Telecommunications
Customer challenges
> Provide scalable service
> Ensure continuous availability
> Control costs
NBC broadcast the 2008 Summer Olympics live online over the Level 3 network using Lustre
Lustre value
> The ability to scale easily
> Works well with commodity equipment from multiple vendors
> High performance and stability
"With Lustre, we can achieve that balancing act of maintaining a reliable network with less costly equipment. It allows us to replace servers and expand the network quickly and easily."
- Kenneth Brookman, Level 3 Communications
Lustre Success - Energy
Customer challenges
> Process huge and growing volumes of data
> Keep hardware costs manageable
> Scale existing clusters easily
Lustre value
> Ability to handle exponential growth in data
> Capability to scale compute clusters easily
> Reduced hardware costs
> Reduced maintenance costs
Open Source Community
Lustre OEM Partners (partner logos)
Open Source Community Resources
Web: http://www.lustre.org
> News and information
> Operations Manual: detailed technical documentation
Mailing lists
> lustre-discuss@lists.lustre.org: general/operational issues
> lustre-devel@lists.lustre.org: architecture and features
Bugzilla: https://bugzilla.lustre.org
> Defect tracking and patch database
CVS repository
Lustre Internals training material
HPC Trends
Processor performance and RAM are growing faster than I/O
> The relative number of I/O devices must grow to compensate
Storage component reliability is not increasing with capacity
> Failure is not an option: it's guaranteed
Trend toward shared file systems
> Multiple compute clusters
> Direct access from specialized systems
Storage scalability is critical
DARPA HPCS
Capacity
> 1 trillion files per file system
> 10 billion files per directory
> 100 PB system capacity
> 1 PB single file size
> >30k client nodes
> 100,000 open files
Reliability
> End-to-end data integrity
> No performance impact during RAID rebuild
Performance
> 40,000 file creates/sec from a single client node
> 30 GB/sec streaming data from a single client node
> 240 GB/sec aggregate I/O, both file-per-process and shared-file
Lustre and the Future
Continued focus on extreme HPC
Capacity
> Exabytes of storage
> Trillions of files
> Many client clusters, each with 100,000s of clients
Performance
> TB/sec of aggregate I/O
> 100,000s of aggregate metadata ops/sec
Community-driven tools and interfaces
> Management and performance analysis
HPC Center of the Future (diagram)
> Compute: capability system (500,000 nodes); capacity systems (250,000, 150,000, and 50,000 nodes); test system (25,000 nodes); visualization clusters; WAN access
> Shared storage network: 10 TB/sec
> Lustre storage cluster: user data, plus metadata on 1,000 MDTs served by 25 MDSs
> HPSS archive
Lustre Scalability
Definition
> Performance and capacity grow nearly linearly with hardware
> Component failure does not have a disproportionate impact on availability
Requirements
> Scalable I/O and metadata performance
> Expanded component size/count limits
> Increased robustness to component failure
> Overhead grows sub-linearly with system size
> Timely failure detection and recovery
Lustre Scaling
Architectural Improvements
Clustered Metadata (CMD)
10s - 100s of metadata servers
Distributed inodes
> Files local to their parent directory entry; subdirectories may be non-local
Distributed directories (see the hashing sketch below)
> Hashing
> Striping
Distributed operation resilience/recovery
> Uncommon HPC workload: cross-directory rename
> Short term: sequenced cross-MDS operations
> Longer term: transactional (ACID), non-blocking, deeper pipelines
> Hard: cascading aborts, synchronous operations
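As a minimal sketch of the directory hashing idea, assuming a directory striped over four MDTs: each entry lands on the MDT chosen by hashing its name, so a single huge directory spreads across metadata servers. The hash function and layout here are illustrative assumptions, not Lustre's actual on-disk format.

```c
/* Place each directory entry on one of the MDTs the directory is
 * striped over, selected by a hash of the entry name (illustrative). */
#include <stdio.h>

static unsigned long hash_name(const char *name)
{
    unsigned long h = 5381;             /* djb2 string hash */
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}

/* Map a directory entry to one of 'stripe_count' MDTs. */
static int mdt_for_entry(const char *name, int stripe_count)
{
    return (int)(hash_name(name) % stripe_count);
}

int main(void)
{
    const char *files[] = { "input.dat", "output.dat", "checkpoint.0001" };
    for (int i = 0; i < 3; i++)
        printf("%-16s -> MDT%d\n", files[i], mdt_for_entry(files[i], 4));
    return 0;
}
```

Lookups stay one hop: any client recomputes the same hash to find the right MDT directly.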
Epochs (diagram)
> Clients tag operations with epochs; each server tracks its local oldest volatile epoch across its uncommitted updates
> A reduction network computes the current globally known oldest volatile epoch
> Epochs below the global minimum are stable (committed); newer ones are volatile (uncommitted) and held for redo
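The sketch below, under the same assumptions as the figure, shows the heart of the epoch mechanism: the reduction is a global minimum over each server's local oldest volatile epoch, and everything older than that minimum is stable everywhere. Numbers and names are illustrative.

```c
/* Epoch reduction sketch: each server reports the oldest epoch for
 * which it still holds volatile (uncommitted) state; the network
 * reduces these to a global minimum. */
#include <stdio.h>

int main(void)
{
    /* Local "oldest volatile epoch" reported by each server. */
    unsigned long oldest_volatile[] = { 41, 38, 44 };
    int nservers = 3;

    unsigned long global_oldest = oldest_volatile[0];
    for (int i = 1; i < nservers; i++)
        if (oldest_volatile[i] < global_oldest)
            global_oldest = oldest_volatile[i];

    /* Epochs strictly older than the global minimum are stable:
     * clients may discard redo records for them.  Anything newer
     * must be kept for replay after a failure. */
    printf("globally known oldest volatile epoch: %lu\n", global_oldest);
    printf("epochs < %lu are committed everywhere\n", global_oldest);
    return 0;
}
```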
Architectural Improvements
Fault Detection Today
RPC timeout
> Timeouts must scale O(n) to distinguish death from congestion
Pinger
> No aggregation across clients or servers
> O(n) ping overhead
Routed networks
> Router failure can be confused with end-to-end peer failure
Fully automatic failover scales with the slowest time constant
> Many tens of minutes on large clusters
> Failover could be much faster if useless waiting were eliminated
Architectural Improvements
Scalable Health Network
Burden of monitoring clients is distributed, not replicated
> ORNL: 35,000 clients, 192 OSSs, 7 OSTs/OSS
Fault-tolerant status reduction/broadcast network (see the sketch below)
> Servers and LNET routers
LNET high-priority small message support
> Health network stays responsive
Prompt, reliable detection
> Time constants in seconds
> Failed servers, clients, and routers
> Recovering servers and routers
Interface with existing RAS infrastructure
> Receive and deliver status notifications
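A back-of-envelope sketch of why the reduction/broadcast tree scales: monitoring latency grows with tree depth, O(log n), rather than with client count. The fanout value below is an assumption for illustration.

```c
/* Depth of a status reduction tree at ORNL scale (compile with -lm). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double clients = 35000.0;           /* ORNL-scale client count */
    double fanout  = 32.0;              /* children per monitor: assumed */

    /* Each level aggregates 'fanout' reports into one, so
     * depth = ceil(log_fanout(clients)). */
    int depth = (int)ceil(log(clients) / log(fanout));
    printf("tree depth for %.0f clients at fanout %.0f: %d hops\n",
           clients, fanout, depth);
    return 0;
}
```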
Health Monitoring Network (diagram)
> Primary and failover health monitors overseeing clients
Architectural Improvements
Metadata Writeback Cache
Avoids unnecessary server communication
> Operations logged/cached locally
> Performance of a local file system when uncontended
Aggregated distributed operations (see the sketch below)
> Server updates batched and transferred using bulk protocols (RDMA)
> Reduced network and service overhead
Sub-tree locking
> Lock aggregation: a single lock protects a whole subtree
> Reduced lock traffic and server load
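A minimal sketch of the batching behavior described above, with illustrative structures and an assumed flush threshold: creates are logged locally with no RPC, then shipped to the server in one bulk request.

```c
/* Writeback-cache batching sketch: log metadata ops locally and flush
 * them in bulk.  Structures and threshold are illustrative only. */
#include <stdio.h>
#include <string.h>

struct md_op { char name[32]; int type; };   /* e.g. 0 = create */

#define BATCH_MAX 8
static struct md_op batch[BATCH_MAX];
static int batch_len;

static void flush_batch(void)
{
    if (batch_len == 0)
        return;
    /* One bulk RPC replaces batch_len individual round trips. */
    printf("flushing %d ops in a single bulk request\n", batch_len);
    batch_len = 0;
}

static void log_create(const char *name)
{
    strncpy(batch[batch_len].name, name, sizeof(batch[0].name) - 1);
    batch[batch_len].type = 0;
    if (++batch_len == BATCH_MAX)
        flush_batch();                  /* threshold reached */
}

int main(void)
{
    char name[32];
    for (int i = 0; i < 20; i++) {
        snprintf(name, sizeof(name), "file.%d", i);
        log_create(name);               /* no RPC issued here */
    }
    flush_batch();                      /* push the remainder */
    return 0;
}
```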
Architectural Improvements
Current: Flat Communications Model
> Stateful client/server connection required for coherence and performance
> Every client connects to every server
> O(n) lock conflict resolution
Future: Hierarchical Communications Model (see the worked example below)
> Aggregate connections, locking, I/O, and metadata operations
> Caching clients aggregate local processes (cores)
> I/O forwarders scale another 32x or more
> Caching proxies aggregate whole clusters
> Implicit broadcast: scalable conflict resolution
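A tiny worked example of the aggregation arithmetic, using illustrative counts: if each I/O forwarder aggregates roughly 32 clients, the number of connections each server must maintain falls by the same factor.

```c
/* Flat vs. hierarchical connection counts (illustrative numbers). */
#include <stdio.h>

int main(void)
{
    long clients   = 500000;                 /* compute cores */
    long fwd_ratio = 32;                     /* clients per I/O forwarder */

    long flat_conns = clients;               /* every client connects */
    long hier_conns = clients / fwd_ratio;   /* only forwarders connect */

    printf("server connections, flat model:         %ld\n", flat_conns);
    printf("server connections, hierarchical model: %ld\n", hier_conns);
    return 0;
}
```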
Hierarchical Communications (diagram)
> User processes reach Lustre via system calls to WBC clients, or via I/O forwarding clients to I/O forwarders
> Proxy clusters (proxy servers with WBC clients) aggregate across WAN / security domains
> All paths converge on the Lustre storage cluster's metadata servers
ZFS
End-to-end data integrity (sketched below)
> Checksums in block pointers
> Ditto blocks
> Transactional mirroring/RAID
Removes ldiskfs size limits
> Immense capacity (128-bit)
> No limits on files, dirents, etc.
Copy-on-write, transactional
> Snapshots
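The end-to-end integrity idea can be sketched in a few lines: the parent block pointer carries the child's checksum, so a reader verifies data against what the parent recorded and catches corruption anywhere along the path. The checksum below is a stand-in, not ZFS's actual fletcher/SHA-256 algorithms.

```c
/* Checksums-in-block-pointers sketch: the PARENT holds the child's
 * checksum, so the reader verifies end to end. */
#include <stdio.h>

struct blkptr {
    const unsigned char *data;          /* location of child block */
    unsigned long len;
    unsigned long cksum;                /* checksum kept in the parent */
};

static unsigned long cksum(const unsigned char *p, unsigned long len)
{
    unsigned long h = 0;
    while (len--)
        h = h * 131 + *p++;
    return h;
}

static int read_verified(const struct blkptr *bp)
{
    return cksum(bp->data, bp->len) == bp->cksum;   /* 1 = intact */
}

int main(void)
{
    unsigned char block[] = "user data";
    struct blkptr bp = { block, sizeof(block), 0 };
    bp.cksum = cksum(block, sizeof(block));         /* set at write time */

    printf("verify: %s\n", read_verified(&bp) ? "ok" : "BAD");
    block[0] ^= 1;                                  /* simulate corruption */
    printf("verify after bit-flip: %s\n", read_verified(&bp) ? "ok" : "BAD");
    return 0;
}
```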
Performance Improvements
SMP Scaling
> Improve MDS performance / small message handling
> CPU affinity (sketched below)
> Finer-granularity locking
(chart: RPC throughput vs. total client processes, for varying numbers of client nodes)
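A minimal sketch of the CPU-affinity idea, with assumed details: requests are hashed by client identity onto per-CPU queues, so each core works from its own cache-hot queue instead of all cores contending on one global lock.

```c
/* Per-CPU request queues keyed by client identity (illustrative). */
#include <stdio.h>

#define NCPUS 4

static int queue_len[NCPUS];            /* one request queue per CPU */

static int cpu_for_client(unsigned long client_nid)
{
    return (int)(client_nid % NCPUS);   /* same client -> same CPU */
}

int main(void)
{
    /* Requests from 1000 clients spread evenly over the queues. */
    for (unsigned long nid = 0; nid < 1000; nid++)
        queue_len[cpu_for_client(nid)]++;

    for (int c = 0; c < NCPUS; c++)
        printf("cpu %d: %d requests\n", c, queue_len[c]);
    return 0;
}
```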
Load (Im)Balance
(chart: request queue depth over time, per server)
Network Request Scheduler
Much larger working set than a disk elevator
Higher-level information: client, object, offset, job/rank
Prototype
> Initial development on a simulator
> Scheduling strategies: quanta, offset, fairness, etc. (sketched below)
> Testing at ORNL pending
Future
> Exchange global information: gang scheduling
> QoS: real-time / bandwidth reservation (min/max)
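As a sketch of one offset-based strategy named above (details assumed): reordering the queued requests by (object, offset) turns interleaved arrivals into sequential disk work; a real scheduler would also bound each client by a quantum for fairness.

```c
/* NRS offset-ordering sketch: dispatch queued requests in
 * (object, offset) order rather than arrival order. */
#include <stdio.h>
#include <stdlib.h>

struct req { int client; long object; long offset; };

static int by_object_offset(const void *a, const void *b)
{
    const struct req *x = a, *y = b;
    if (x->object != y->object)
        return x->object < y->object ? -1 : 1;
    return x->offset < y->offset ? -1 : (x->offset > y->offset);
}

int main(void)
{
    struct req q[] = {                  /* arrival order: interleaved */
        { 1, 7, 4096 }, { 2, 7, 0 }, { 1, 3, 8192 }, { 2, 3, 0 },
    };
    int n = sizeof(q) / sizeof(q[0]);

    /* Reorder the whole queue before dispatch; a real scheduler would
     * also cap each client at a quantum to preserve fairness. */
    qsort(q, n, sizeof(q[0]), by_object_offset);

    for (int i = 0; i < n; i++)
        printf("client %d obj %ld off %ld\n",
               q[i].client, q[i].object, q[i].offset);
    return 0;
}
```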
Metadata Protocol Improvements
Size on MDT (SOM)
> Avoid multiple RPCs for attributes derived from OSTs
> OSTs remain definitive while the file is open
> Size computed on close and cached on the MDT (see the worked sketch below)
Readdir+
> Aggregates directory I/O, getattrs, and locking
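A worked sketch of the size computation SOM caches at close, assuming a plain RAID-0 striping layout: the file size is the maximum logical end offset implied by any stripe object's length. Stripe parameters and object lengths below are example values.

```c
/* Size-on-MDT sketch: derive logical file size from the lengths of
 * the RAID-0 stripe objects (illustrative parameters). */
#include <stdio.h>

/* Logical end-of-file implied by object j holding 'len' bytes, with
 * stripe size S and stripe count C. */
static long long logical_end(int j, long long len, long long S, int C)
{
    if (len == 0)
        return 0;
    long long unit = (len - 1) / S;     /* last stripe unit in object */
    long long rem  = (len - 1) % S;     /* offset within that unit */
    return (unit * C + j) * S + rem + 1;
}

int main(void)
{
    long long S = 1 << 20;              /* 1 MB stripe size */
    int C = 4;                          /* 4 OST objects */
    long long obj_len[] = { 2097152, 2097152, 1048576, 1048580 };

    long long size = 0;
    for (int j = 0; j < C; j++) {
        long long end = logical_end(j, obj_len[j], S, C);
        if (end > size)
            size = end;
    }
    printf("file size = %lld bytes\n", size);   /* cached on the MDT */
    return 0;
}
```

With the result cached on the MDT at close, a later stat needs no OST RPCs while the file stays closed.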
Lustre Scalability

Attribute              Today                        Future
Number of clients      10,000s (flat comms model)   1,000,000s (hierarchical comms model)
Server capacity        ext3: 8 TB                   ZFS: petabytes
Metadata performance   Single MDS                   CMD + SMP scaling
Recovery time          RPC timeout: O(n)            Health network: O(log n)
THANK YOU
Eric Barton
eeb@sun.com
lustre-discuss@lists.lustre.org
lustre-devel@lists.lustre.org