Distributed File Systems: An Overview
Nürnberg, 30.04.2014
Dr. Christian Boehme, GWDG

Introduction
- A distributed file system allows shared, file-based access without sharing disks
- History starts in the 1960s
- Vast selection for different use cases
- Complex taxonomy
  - Distributed access
  - Federated access
- This presentation covers more recent (free) solutions for typical, current use cases

FhGFS: High Performance Computing
Core Features and Use Cases

[Diagram: direct parallel access from clients to the metadata server and the storage servers]

Core Features
- Distributed files and metadata
- Native support for HPC networks (InfiniBand)
- Easy to set up and maintain
- POSIX support
- Now marketed as BeeGFS

Use Cases
- Data storage for HPC clusters: requires performance, but no high availability.
- On-demand provisioning of cross-server storage: requires easy setup, but no high availability.
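
Distributing a file across storage servers so that clients can reach them in parallel is commonly done by striping. The following is a minimal sketch of round-robin striping, not FhGFS's actual layout algorithm; the function names `locate_chunk` and `split_read`, and the 512 KiB chunk size used in the usage note, are assumptions made for illustration.

```python
from typing import List, Tuple


def locate_chunk(offset: int, chunk_size: int, num_servers: int) -> Tuple[int, int]:
    """Map a byte offset in a file to (server index, offset in that server's data).

    Hypothetical round-robin layout: chunk 0 goes to server 0, chunk 1 to
    server 1, and so on, wrapping around after the last server.
    """
    if offset < 0 or chunk_size <= 0 or num_servers <= 0:
        raise ValueError("offset must be >= 0; chunk_size and num_servers must be > 0")
    chunk_index = offset // chunk_size
    server = chunk_index % num_servers
    # Number of earlier chunks of this file stored on the same server.
    local_chunk = chunk_index // num_servers
    local_offset = local_chunk * chunk_size + offset % chunk_size
    return server, local_offset


def split_read(offset: int, length: int, chunk_size: int,
               num_servers: int) -> List[Tuple[int, int, int]]:
    """Break one read into per-server pieces: (server, local offset, length)."""
    pieces = []
    end = offset + length
    while offset < end:
        server, local_offset = locate_chunk(offset, chunk_size, num_servers)
        # A piece never crosses a chunk boundary.
        n = min(chunk_size - offset % chunk_size, end - offset)
        pieces.append((server, local_offset, n))
        offset += n
    return pieces
```

With four servers and 512 KiB chunks, `split_read(0, 4 * 512 * 1024, 512 * 1024, 4)` yields one piece per server, which is what would let a client fetch the four pieces concurrently.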

Hadoop FS: Big Data
Core Features and Use Cases

[Diagram: Client (data access), NameNode, BackupNode (state information)]

Core Features
- Same server for data and compute
- Replication prevents data loss
- Part of the Hadoop framework
- Extensive ecosystem of big data tools
  - MapReduce, Pig (computation)
  - HBase (database)
  - Hive (data warehouse)

Use Cases
- Really big data: 5000+ nodes, 100+ PB of data per cluster at Yahoo, Facebook...
- Any application using the Hadoop ecosystem: performance and scalability, no POSIX required.
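
The claim that replication prevents data loss can be illustrated with a small simulation. This is a sketch under simplifying assumptions: blocks are placed on random distinct nodes, whereas HDFS uses its own rack-aware placement policy; `place_blocks`, `lost_blocks`, and the node names are hypothetical.

```python
import random
from typing import Dict, List, Set


def place_blocks(num_blocks: int, nodes: List[str], replication: int,
                 seed: int = 0) -> Dict[int, Set[str]]:
    """Assign each block to `replication` distinct nodes, chosen at random."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return {block: set(rng.sample(nodes, replication)) for block in range(num_blocks)}


def lost_blocks(placement: Dict[int, Set[str]], failed: Set[str]) -> List[int]:
    """Return the blocks whose replicas all sit on failed nodes."""
    return [block for block, replicas in placement.items() if replicas <= failed]


if __name__ == "__main__":
    nodes = [f"datanode{i}" for i in range(6)]
    placement = place_blocks(100, nodes, replication=3)
    # With three replicas per block, two failed nodes cannot take out a block.
    print(lost_blocks(placement, {"datanode0", "datanode1"}))
```

A block is lost only if every node holding one of its replicas fails, so a replication factor of r tolerates any r - 1 simultaneous node failures in this model.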

Ceph: Cloud and Data Center Storage
Core Features and Use Cases

Interfaces
- Low-level API: LIBRADOS, library access to RADOS (Java, C, C++, Python)
- Object-based: RADOSGW, REST / S3 interface
- Block-based: RBD, block devices (KVM / QEMU)
- File-based: CEPH FS, POSIX (kernel and FUSE clients)

Core Features
- Utilizes compute power of storage nodes (OSDs) and clients
- Data distribution for performance
- Data replication for redundancy
- Easily scalable by adding OSDs
- Self-healing, self-managing reliable autonomic distributed object store (RADOS)

Use Cases
- Cost-efficient, flexible and scalable high-availability storage
- Storage for cloud and virtualization infrastructures (OpenStack)
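
One way to see how clients can compute object locations themselves, and why adding OSDs moves only part of the data, is hash-based placement. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in; it is not Ceph's CRUSH algorithm, and the function `place` and the OSD names are assumptions for illustration.

```python
import hashlib
from typing import List


def _score(obj: str, osd: str) -> int:
    """Deterministic pseudo-random weight for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{osd}:{obj}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj: str, osds: List[str], replicas: int = 3) -> List[str]:
    """Pick the `replicas` OSDs with the highest score for this object.

    Any client that knows the OSD list computes the same answer, so no
    central lookup table is needed.
    """
    if replicas > len(osds):
        raise ValueError("not enough OSDs for the requested replica count")
    ranked = sorted(osds, key=lambda osd: _score(obj, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
    before = place("volume-42", osds)
    after = place("volume-42", osds + ["osd.4"])
    # Either the placement is unchanged, or the new OSD replaced one replica.
    print(before, after)
```

When an OSD is added, an object's replica set changes only if the new OSD ranks among its top scores, so existing replicas never shuffle between the old OSDs in this scheme.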

iRODS: Federated Data Access
Core Features and Use Cases

[Diagram: replication between Trier, Karlsruhe and Göttingen]

Core Features
- Data management middleware
- Rule engine for policy enforcement
- Data replication between sites and data centers
- Creation of federated repositories beyond organizational boundaries
- Transparent access to remote site data from any site in the federation
- Central catalogue of access rights

Use Cases
- Replication of archival research data between data centers (disaster prevention)
- Implementation of data management policies and workflows
- Federated data infrastructure

Conclusion
- Choose a file system with a scope that overlaps well with your use case
- Advanced policy requirements in data federations exceed the scope of typical distributed file systems
  - Data management middleware, such as iRODS, is a possible choice for realizing distributed data scenarios
  - Solutions for simpler site distribution scenarios exist (replication)
- Choosing the wrong file system can be very expensive when you have to migrate petabytes of data

Distributed File System OpenAFS
- Over 20 years old and well tested
- Used by large organizations (CERN, DESY, Stanford University and many others)
- Designed for use over the Internet
- Replicated read-only content
- Open source; very active development
- Available for a broad range of heterogeneous systems including UNIX, Linux, macOS, Windows, iOS
- Commercial support is available
- http://openafs.org/

OpenAFS
- Uses Kerberos (e.g., Active Directory) for security
- Federated access through Kerberos trust relations
- Encryption of network traffic between clients and servers

Contact

Dr. Christian Boehme
T +49 551 201 1839
F +49 551 201 1576
E christian.boehme@gwdg.de

Oliver Schmitt
T +49 551 39 20512
F +49 551 201 1576
E oliver.schmitt@gwdg.de

GWDG - Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Am Faßberg 11, 37077 Göttingen
http://www.gwdg.de