XtreemFS - a distributed and replicated cloud file system
Michael Berlin, Zuse Institute Berlin
DESY Computing Seminar, 16.05.2011
Who we are
Zuse Institute Berlin: operates the HLRN supercomputer (#63+64); research in computer science and mathematics.
Parallel and Distributed Systems group, led by Prof. Alexander Reinefeld (Humboldt University): distributed and failure-tolerant storage systems.
Who we are
Michael Berlin: PhD student since 03/2011; studied computer science at Humboldt-Universität zu Berlin; Diplom thesis dealt with XtreemFS; currently working on the XtreemFS client.
Motivation
Problem: multiple copies of data. Where are they? Is each copy complete? Are there different versions?
[Diagram: PC, internal nodes, external nodes, local file server, internal storage, external storage]
Motivation (2)
Problem: each location requires a different access interface.
[Diagram: laptop via 3G/Wi-Fi reaching the local file server over VPN/SSHFS; PC using NFS/Samba; external storage reached via SCP; external nodes using a <parallel file system>]
Motivation (3)
XtreemFS goals: transparency and availability.
[Diagram: laptop via 3G/Wi-Fi, PC, internal nodes and external nodes all access the same XtreemFS installation]
File Systems Landscape
Outline
1. XtreemFS Architecture
2. Client Interfaces
3. Read-Only Replication
4. Read-Write Replication
5. Metadata Replication
6. Customization through Policies
7. Security
8. Use Case: MoSGrid
9. Snapshots
XtreemFS Architecture (1)
Volumes reside on a metadata server, which provides the hierarchical namespace.
File content resides on storage servers and is accessed directly by clients.
[Diagram: PC and internal nodes, local file server, internal storage]
XtreemFS Architecture (2)
Metadata and Replica Catalog (MRC): holds the volumes.
Object Storage Devices (OSDs): file content is split into objects; objects can be striped across OSDs.
This is an object-based file system architecture.
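To make the striping idea concrete, here is a minimal sketch (assumed parameters, not the actual XtreemFS code) of how a byte offset maps to an object number and to the OSD holding that object under a simple round-robin striping pattern:

```java
// Minimal striping sketch (illustrative only, not the XtreemFS implementation).
// Assumptions: fixed object size, fixed stripe width, round-robin placement.
public class StripingSketch {
    static final int OBJECT_SIZE = 128 * 1024; // assumed object size in bytes
    static final int NUM_OSDS = 4;             // assumed stripe width

    // Object number that contains the given byte offset of the file.
    static long objectNumber(long offset) {
        return offset / OBJECT_SIZE;
    }

    // Index of the OSD that stores this object (round-robin striping).
    static int osdIndex(long objectNumber) {
        return (int) (objectNumber % NUM_OSDS);
    }

    public static void main(String[] args) {
        long offset = 1_000_000L;
        long obj = objectNumber(offset);
        System.out.println("offset " + offset + " -> object " + obj + " on OSD #" + osdIndex(obj));
    }
}
```

Because consecutive objects land on different OSDs, reads and writes of large files can proceed in parallel, which is the basis of the throughput scaling discussed on the next slide.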
Scalability
File I/O throughput: parallel I/O scales with the number of OSDs.
Storage capacity: OSDs can be added and removed; OSDs may be used by multiple volumes.
Metadata throughput: limited by the MRC hardware; use many volumes spread over multiple MRCs.
Accessing Components
Directory Service (DIR): central registry; all servers (MRC, OSD) register there with their id.
Provides: the list of available volumes, the mapping from a service id to its URL, and the list of available OSDs.
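Conceptually, the DIR's id-to-URL mapping is a registry lookup. The following is a hypothetical sketch of that idea only; the real DIR is a separate service with its own protocol and record format:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry sketch illustrating the DIR's role (not its actual interface):
// servers register under an id, clients resolve the id to an address.
public class DirRegistrySketch {
    private final Map<String, String> idToUrl = new ConcurrentHashMap<>();

    // Called by an MRC or OSD when it starts up.
    public void register(String serviceId, String url) {
        idToUrl.put(serviceId, url);
    }

    // Called by a client (or another server) to locate a service.
    public String resolve(String serviceId) {
        return idToUrl.get(serviceId);
    }

    public static void main(String[] args) {
        DirRegistrySketch dir = new DirRegistrySketch();
        dir.register("osd-uuid-1", "osd1.int-cluster:32640"); // example id and address, assumed
        System.out.println(dir.resolve("osd-uuid-1"));
    }
}
```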
Client Interfaces
XtreemFS supports the POSIX interface and semantics.
mount.xtreemfs: FUSE-based client; runs on Linux, FreeBSD, OS X and Windows (Dokan).
libxtreemfs: client library for Java and C++.
[Diagram: laptop via 3G/Wi-Fi, PC, internal nodes and external nodes each access XtreemFS through mount.xtreemfs]
Read-Only Replication
Requirement: the file must be marked as read-only.
Replica types:
a. Full replica: requires a complete copy.
b. Partial replica: fills itself on demand; instantly ready to use.
[Diagram: external nodes, internal storage, external storage]
Read-Only Replication (2)
Read-Only Replication (3)
Receiver-initiated transfer at the object level; OSDs exchange object lists.
Filling strategies: fetch objects in order, or rarest first (see the sketch below); prefetching is available.
On-close replication: automatic replica creation when a file is closed.
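To illustrate the rarest-first filling strategy (a sketch only, not the XtreemFS implementation): using the per-object availability counts that can be derived from the exchanged object lists, a partial replica fetches the object that currently exists on the fewest replicas first.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative rarest-first selection for filling a partial replica
// (sketch only, not the actual OSD code).
public class RarestFirstSketch {
    // replicaCount: for each object number, on how many replicas it is already present.
    // locallyPresent: objects this replica already holds.
    static Optional<Long> nextObjectToFetch(Map<Long, Integer> replicaCount, Set<Long> locallyPresent) {
        return replicaCount.entrySet().stream()
                .filter(e -> !locallyPresent.contains(e.getKey())) // skip objects we already have
                .min(Map.Entry.comparingByValue())                 // pick the rarest object
                .map(Map.Entry::getKey);
    }
}
```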
Read-Write Replication
Goals: availability, data safety, and allowing modifications.
[Diagram: PC and local file server each hold a copy of important.cpp on internal storage]
Read-Write Replication (2)
Primary/backup approach.
Read-Write Replication (3)
Primary/backup:
1. Lease acquisition: at most one valid lease per file; revocation = lease timeout.
Read-Write Replication (4)
Primary/backup:
1. Lease acquisition: at most one valid lease per file; revocation = lease timeout (see the sketch below).
2. Data dissemination.
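The lease-timeout idea in step 1 can be sketched as follows (illustrative only; the actual lease coordination is done by Flease, described on the next slide): a lease is valid only until its timeout, so revoking it simply means letting it expire.

```java
// Illustrative lease object (not Flease itself). Because a lease expires on its own,
// no explicit revocation message is needed: "revocation = lease timeout".
public class LeaseSketch {
    final String holderOsd;   // replica currently acting as primary
    final long expiresAtMs;   // absolute expiry time

    LeaseSketch(String holderOsd, long expiresAtMs) {
        this.holderOsd = holderOsd;
        this.expiresAtMs = expiresAtMs;
    }

    boolean isValid(long nowMs) {
        return nowMs < expiresAtMs;
    }
}
```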
Read-Write Replication (5)
Lease acquisition: XtreemFS uses Flease, a scalable, majority-based protocol (rather than a central lock service).
Data dissemination: update strategies are "Write All, Read 1" and "Write Quorum, Read Quorum" (see the sketch below).
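The two update strategies differ in how many replicas must acknowledge a write and how many must be contacted for a read; a generic sketch of the quorum sizes (illustrative arithmetic, not the XtreemFS code), in which any read set always overlaps any write set:

```java
// Generic quorum arithmetic for the two update strategies above (illustrative only).
public class QuorumSketch {
    // "Write All, Read 1": all N replicas acknowledge a write, any single replica can serve a read.
    static int writeAllWriteQuorum(int n) { return n; }
    static int writeAllReadQuorum(int n)  { return 1; }

    // "Write Quorum, Read Quorum": majorities on both sides, so W + R > N always holds.
    static int majorityQuorum(int n) { return n / 2 + 1; }

    public static void main(String[] args) {
        int n = 3;
        System.out.println("N=" + n
                + ": write-all -> W=" + writeAllWriteQuorum(n) + ", R=" + writeAllReadQuorum(n)
                + "; quorum -> W=R=" + majorityQuorum(n));
    }
}
```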
Metadata Replication
Primary/backup replication: a volume corresponds to a database; the database is replicated transparently; leases are used to elect the primary; insert/update/delete operations are replicated.
The database is a key/value store; own implementation: BabuDB.
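At the level of single operations, primary/backup replication of the metadata database can be sketched like this (generic illustration; BabuDB's real interfaces and its log-based replication differ): the elected primary applies an insert/update/delete locally and forwards the same operation to all backups.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic primary/backup sketch for key/value operations (not BabuDB's actual API).
public class MetadataReplicationSketch {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final List<MetadataReplicationSketch> backups;
    private final boolean isPrimary;

    MetadataReplicationSketch(boolean isPrimary, List<MetadataReplicationSketch> backups) {
        this.isPrimary = isPrimary;
        this.backups = backups;
    }

    // Insert/update: only the primary (elected via a lease) accepts writes,
    // applies them locally and disseminates them to the backups.
    public void put(String key, String value) {
        if (!isPrimary) throw new IllegalStateException("writes must go through the primary");
        store.put(key, value);
        for (MetadataReplicationSketch backup : backups) {
            backup.applyReplicated(key, value);
        }
    }

    void applyReplicated(String key, String value) {
        store.put(key, value);
    }
}
```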
Customization through Policies
Example: which replica shall the client select? This is determined by policies.
[Diagram: a client on the external nodes must choose between a replica on internal storage and one on external storage]
Policies exist for: authentication, authorization, UID/GID mappings, replica placement, replica selection.
Customization through Policies (2)
Replica placement/selection policies filter, sort or group the replica list.
Available default policies: FQDN-based, datacenter map, Vivaldi (latency estimation). Policies can be chained; own policies are possible (Java), see the sketch below.
[Diagram: on open(), the MRC returns a sorted replica list to node1.ext-cluster; candidate replicas are osd1.int-cluster on internal storage and osd1.ext-cluster on external storage]
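Since custom policies are written in Java, the chaining of replica-selection policies can be sketched roughly as follows (hypothetical interface names; the real XtreemFS plug-in API differs in detail): each policy rewrites the replica list, and the final ordering is what the MRC returns on open().

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical policy interface for illustration; not the actual XtreemFS plug-in API.
interface ReplicaSelectionPolicy {
    List<String> apply(List<String> replicas, String clientFqdn);
}

// Example policy in the spirit of the FQDN-based default: replicas sharing the
// client's DNS suffix are sorted to the front of the list.
class FqdnSuffixPolicy implements ReplicaSelectionPolicy {
    public List<String> apply(List<String> replicas, String clientFqdn) {
        String suffix = clientFqdn.substring(clientFqdn.indexOf('.') + 1);
        return replicas.stream()
                .sorted(Comparator.comparing((String r) -> !r.endsWith(suffix)))
                .collect(Collectors.toList());
    }
}

class PolicyChain {
    // Policies are applied in order; each one refines the result of the previous one.
    static List<String> run(List<ReplicaSelectionPolicy> chain, List<String> replicas, String clientFqdn) {
        for (ReplicaSelectionPolicy policy : chain) {
            replicas = policy.apply(replicas, clientFqdn);
        }
        return replicas;
    }
}
```

For a client on node1.ext-cluster, such a chain would sort osd1.ext-cluster before osd1.int-cluster, matching the diagram above.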
Security
X.509 certificates are supported for authentication; SSL is used to encrypt communication.
[Diagram: a laptop via 3G/Wi-Fi runs mount.xtreemfs with a user certificate; external nodes run mount.xtreemfs with a host certificate]
Use case: MoSGrid
MoSGrid eases running experiments in computational chemistry: grid resources are used through a web portal, which allows users to submit compute jobs and retrieve the results.
XtreemFS serves as the global data repository.
Use case: MoSGrid (2)
[Diagram: the user's PC submits jobs and retrieves results through a browser; the web portal accesses XtreemFS via libxtreemfs (Java) and passes jobs to a UNICORE frontend; compute nodes read input data and write results via mount.xtreemfs with host certificates, while the user mounts with a user certificate; the XtreemFS scope spans Berlin, Dresden and Köln]
Snapshots
Backups are needed in case of accidental deletion/modification or virus infections.
A snapshot is a stable image of the file system at a given point in time.
[Diagram: a PC issues unlink(important.cpp) against the local file server, whose internal storage holds important.cpp]
Snapshots (2)
MRC: creates a snapshot when requested.
OSDs: copy-on-write. On modify, a new object version is created instead of overwriting; on delete, the object is only marked as deleted.
[Timeline: a write creates file.txt version V1 at t1, snapshot() is called at t0, a later write creates version V2 at t2]
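A minimal sketch of the copy-on-write versioning described above (illustrative only; the OSD's on-disk layout and version bookkeeping differ): each write creates a new version keyed by its timestamp, and a snapshot reads the newest version that existed at its point in time.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative copy-on-write versioning of a single object (not the OSD implementation).
public class CowObjectSketch {
    // timestamp -> object contents; older versions are kept instead of being overwritten
    private final NavigableMap<Long, byte[]> versions = new TreeMap<>();
    private boolean deletedMarker = false;

    public void write(long timestamp, byte[] data) {
        versions.put(timestamp, data);   // create a new version instead of overwriting
    }

    public void delete() {
        deletedMarker = true;            // only mark as deleted; existing versions stay readable
    }

    // Newest version that existed at or before the given snapshot timestamp.
    public byte[] readAsOf(long snapshotTime) {
        Map.Entry<Long, byte[]> entry = versions.floorEntry(snapshotTime);
        return entry == null ? null : entry.getValue();
    }
}
```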
Snapshots (3)
There is no exact global time: clocks are only loosely synchronized, with an assumed maximum drift of ε. Snapshots are therefore based on a time span rather than an exact point in time.
[Timeline: writes of file.txt produce versions V1 at t1 and V2 at t2; writes that fall into the window [t0 - ε, t0 + ε] around the snapshot time t0 cannot be ordered exactly relative to the snapshot]
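Because clocks may drift by up to ε, a version written close to the snapshot time cannot be assigned to the snapshot with certainty; a sketch of this time-span check (illustrative, with ε as an assumed constant):

```java
// Illustrative time-span check for snapshots under loosely synchronized clocks
// (sketch only, not the XtreemFS code). EPSILON_MS is the assumed maximum clock drift.
public class SnapshotWindowSketch {
    static final long EPSILON_MS = 500;

    // Written clearly before the snapshot: always included.
    static boolean definitelyInSnapshot(long versionTime, long snapshotTime) {
        return versionTime < snapshotTime - EPSILON_MS;
    }

    // Written clearly after the snapshot: never included.
    static boolean definitelyAfterSnapshot(long versionTime, long snapshotTime) {
        return versionTime > snapshotTime + EPSILON_MS;
    }

    // Inside the window [t0 - EPSILON_MS, t0 + EPSILON_MS]: inclusion cannot be decided exactly.
    static boolean ambiguous(long versionTime, long snapshotTime) {
        return !definitelyInSnapshot(versionTime, snapshotTime)
            && !definitelyAfterSnapshot(versionTime, snapshotTime);
    }
}
```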
Snapshots (4)
OSDs limit the number of versions: instead of a version on every write, versioning follows close-to-open semantics.
Problem: the client sends no explicit close. Implicit close: a new version is created if the last write was at least X seconds ago.
A cleanup tool deletes versions that belong to no snapshot.
Snapshots on the directory level are also possible.
Future Research
Self-tuning, quota support, data de-duplication, hierarchical storage management.
XtreemFS Software
Open source: www.xtreemfs.org
Development: 5 core developers at ZIB; integration tests for quality assurance.
Community: users and bug reporters; mailing list with 102 subscribers.
Release 1.3: experimental support for read/write replication and snapshots.
Thank You!
References: http://www.xtreemfs.org/publications.php
www.contrail-project.eu
The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization. GA nr.: FP7-ICT-257438.