Replication and Consistency in Cloud File Systems
Alexander Reinefeld and Florian Schintke, Zuse Institute Berlin
Cloud Computing Day at the IKMZ, BTU Cottbus, 14 April 2011
Let's start with a little quiz: Who invented Cloud Computing?
a) Werner Vogels
b) Ian Foster
c) Konrad Zuse
The correct answer is c): "Eventually, computing centers too will be interconnected via telecommunication lines." Konrad Zuse, Rechnender Raum (1969). (Konrad Zuse, 22.06.1910 - 18.12.1995)
Zuse Institute Berlin
Research institute for applied mathematics and computer science
Peter Deuflhard, chair for scientific computing, FU Berlin
Martin Grötschel, chair for discrete mathematics, TU Berlin
Alexander Reinefeld, chair for computer science, HU Berlin
HPC Systems @ ZIB
1984 Cray 1M, 160 MFlops
1987 Cray X-MP, 471 MFlops
1994 Cray T3D, 38 GFlops
1997 Cray T3E, 486 GFlops
2002 IBM p690, 2.5 TFlops
2008/09 SGI ICE, XE, 150 TFlops
A 1,000,000-fold performance increase in 25 years (1984 to 2009).
HLRN: 2 sites, 98 computer racks, 26,112 CPU cores, 128 TB memory, 1,620 TB disk, 300 TFlops peak performance.
Storage: 3 SL8500 robots, 39 tape drives, 19,000 slots.
What is Cloud Computing? Is Cloud Computing just Grid Computing on data centers? It is not that simple. Cloud and Grid both abstract resources through interfaces: the Grid does so via new middleware and requires Grid APIs; the Cloud does so via virtualization and allows legacy APIs. The layered stack:
Software as a Service (SaaS): applications, application services
Platform as a Service (PaaS): programming environment, execution environment
Infrastructure as a Service (IaaS): infrastructure services, resource set
Why Cloud?
Pros: It scales, because the resources are theirs, not yours. It is simple, because they operate it. You pay for what you need and do not pay for empty spinning disks.
Cons: It is expensive: Amazon S3 charges $0.15 per GB per month, i.e. about $1,800 per TB per year. It is not 100% secure: S3 now lets you bring your own RSA key pair, but would you put your bank account into the cloud? It is not 100% available: S3 provides service credits if availability drops (10% credit for 99.0-99.9% availability).
File System Landscape
PC, local system: ext3, ZFS, NTFS
Network FS / centralized: NFS, SMB, AFS/Coda
Cluster FS / datacenter: Lustre, Panasas, GPFS, Ceph, ...
Cloud/Grid: grid file systems such as GFarm, GDM, "gridftp"
Consistency, Availability, Partition tolerance: pick two of three!
Consistency: all clients have the same view of the data.
Availability: each client can always read and write.
Partition tolerance: operations complete even if individual components are unavailable.
C + A: single server, Linux-HA (one data center)
A + P: Amazon S3, Mercurial, Coda/AFS
C + P: distributed databases, distributed file systems
Brewer, Eric: "Towards Robust Distributed Systems." PODC Keynote, 2004.
Which semantics do you expect? Distributed file systems should provide C + P, but the recent hype was on A + P with eventual consistency (e.g. Amazon S3).
Grid File Systems provide access to heterogeneous storage resources, but the middleware causes additional complexity and vulnerability. They require explicit file transfer: whole-file transfer (latency until first access, bandwidth, disk storage), but also partial file access (gridftp) and pattern access (falls). There is no consistency among replicas (the user must take care) and no access control on replicas.
Cloud File System: XtreemFS
Focus on data distribution: data replication, object-based storage.
Key features: MRCs are separated from OSDs; a fat client is the link between them.
MRC = metadata and replica catalogue
OSD = object storage device
Client = file system interface
A closer look at XtreemFS
Features: a distributed, replicated, POSIX-compliant file system. The server software (Java) runs on Linux, OS X, and Solaris; the client software (C++) runs on Linux, OS X, and Windows. Secure: X.509 and SSL. Open source (GPL).
Assumptions: synchronous clocks with a maximum time drift (needed for OSD lease negotiation, a reasonable assumption in clouds), an upper limit on the round-trip time, no need for FIFO channels (runs on either TCP or UDP).
XtreemFS Interfaces
File access protocol: the user application calls into the Linux VFS, which forwards to the XtreemFS client (FUSE); the client reads and writes objects on the OSDs and reports size changes to the MRC, e.g. Update(Cap, FileSize=128k).
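A minimal sketch of the size-update step shown above, assuming a hypothetical MrcClient interface and Capability record (these names are illustrative, not the real XtreemFS client API): after a write acknowledged by an OSD grows the file, the client reports the new size to the MRC, authorized by the capability it received at open().

```java
// Sketch only: MrcClient, Capability and onWriteAck are hypothetical names.
public class FileSizeUpdater {

    interface MrcClient {
        void updateFileSize(Capability cap, long newSizeBytes);
    }

    record Capability(String fileId, String mrcSignature) {}

    private final MrcClient mrc;
    private long sizeKnownToMrc = 0;

    public FileSizeUpdater(MrcClient mrc) {
        this.mrc = mrc;
    }

    // Called by the client after an OSD has acknowledged a write.
    public void onWriteAck(Capability cap, long offset, int length) {
        long newEnd = offset + length;
        if (newEnd > sizeKnownToMrc) {          // the file grew beyond what the MRC knows
            sizeKnownToMrc = newEnd;
            mrc.updateFileSize(cap, newEnd);    // e.g. Update(Cap, FileSize=128k)
        }
    }
}
```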
The Client
gets the list of OSDs from the MRC
gets a capability (signed by the MRC) per file
selects the best OSD(s) for parallel I/O
supports various striping policies: scatter/gather, RAIDx, erasure codes (a sketch of a simple striping policy follows below)
scalable and fast access: no communication between OSD and MRC is needed; the client is the missing link
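To illustrate how a striping policy lets the client do parallel I/O without further MRC involvement, here is a minimal RAID0-style sketch: object k of a file lives on OSD (k mod n). Class and method names are illustrative, not the actual XtreemFS striping interface.

```java
import java.util.List;

// Illustrative sketch of a RAID0-style striping policy: object k of a file is
// stored on OSD (k mod n), so a client can read/write stripes in parallel.
public class Raid0StripingPolicy {
    private final List<String> osdUuids;   // OSDs listed in the file's capability
    private final int stripeSizeBytes;     // e.g. 1 MB chunks

    public Raid0StripingPolicy(List<String> osdUuids, int stripeSizeBytes) {
        this.osdUuids = osdUuids;
        this.stripeSizeBytes = stripeSizeBytes;
    }

    // Which object (stripe) does a byte offset fall into?
    public long objectNumber(long fileOffset) {
        return fileOffset / stripeSizeBytes;
    }

    // Which OSD stores a given object? Round-robin over the OSD list.
    public String osdForObject(long objectNumber) {
        return osdUuids.get((int) (objectNumber % osdUuids.size()));
    }

    public static void main(String[] args) {
        Raid0StripingPolicy p = new Raid0StripingPolicy(
                List.of("osd-1", "osd-2", "osd-3"), 1 << 20);
        long offset = 5L << 20;             // byte 5 MB of the file
        long obj = p.objectNumber(offset);  // object 5
        System.out.println("object " + obj + " -> " + p.osdForObject(obj)); // osd-3
    }
}
```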
MRC, Metadata and Replica Catalogue
provides open(), close(), readdir(), rename(), ...
attributes per file: size, last access, access rights, location (OSDs)
issues a capability (file handle) to authorize a client to access objects on OSDs
implemented with a key/value store (BabuDB): fast index, append-only DB, allows snapshots (see the sketch below)
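A toy sketch of why a sorted key/value store fits the MRC: if directory entries are keyed by their path, readdir() becomes a prefix range query. A TreeMap stands in for BabuDB's index here; the key layout and metadata format are illustrative, not the real MRC schema.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: an in-memory TreeMap as a stand-in for BabuDB's index.
public class MiniMetadataCatalogue {
    private final TreeMap<String, String> index = new TreeMap<>();

    public void createFile(String path, String metadata) {
        index.put(path, metadata);   // size, mtime, access rights, OSD list, ...
    }

    // readdir()-like prefix query: all entries whose key starts with dir + "/"
    // (a real MRC would additionally filter out deeper sub-paths).
    public SortedMap<String, String> readdir(String dir) {
        String from = dir + "/";
        String to = dir + "0";       // '0' sorts directly after '/'
        return index.subMap(from, to);
    }

    public static void main(String[] args) {
        MiniMetadataCatalogue mrc = new MiniMetadataCatalogue();
        mrc.createFile("/vol/a.txt", "size=128k osds=osd-1,osd-2");
        mrc.createFile("/vol/b.txt", "size=4M osds=osd-3");
        System.out.println(mrc.readdir("/vol").keySet()); // [/vol/a.txt, /vol/b.txt]
    }
}
```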
OSD, Object Storage Device
serves file content: read(), write(), truncate(), flush(), ...
implements object replication, including partial replicas for read access: data is filled on demand; the OSD gets the OSD list from the MRC
a slave OSD redirects to the master OSD: write operations run only on the master OSD, and since POSIX requires linearizable reads, reads are also redirected (see the sketch below)
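A minimal sketch of the redirect rule, with illustrative types rather than the OSD's actual request handler: a slave that receives a read or write for a replicated file redirects the client to the current master, because POSIX linearizability forbids serving possibly stale local reads.

```java
// Illustrative only: in XtreemFS the master is the current lease holder.
public class OsdRequestRouter {

    enum Op { READ, WRITE, TRUNCATE }

    private final String localOsdUuid;
    private final String masterOsdUuid;   // current master for this file's replicas

    public OsdRequestRouter(String localOsdUuid, String masterOsdUuid) {
        this.localOsdUuid = localOsdUuid;
        this.masterOsdUuid = masterOsdUuid;
    }

    /** Returns null if this OSD may serve the operation, else the OSD to redirect to. */
    public String route(Op op) {
        if (localOsdUuid.equals(masterOsdUuid)) {
            return null;                   // the master serves reads and writes itself
        }
        // A relaxed mode could serve Op.READ from the local replica, but POSIX
        // requires linearizable reads, so a slave redirects every operation.
        return masterOsdUuid;
    }
}
```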
OSD, Object Storage Device
Which OSD to select, and which object to fetch next? Criteria include the object list, available bandwidth, rarest-first selection, network coordinates, a datacenter map, and prefetching (for partial replicas). A sketch of rarest-first selection follows below.
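A small sketch of rarest-first object selection for filling a partial replica on demand (the data structures are illustrative, not the XtreemFS OSD code): among the objects still missing locally, fetch the one with the fewest copies in the replica set first.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch: pick the rarest missing object first.
public class RarestFirstSelector {

    /** @param replicaCount missing object number -> how many OSDs currently hold it */
    public Optional<Long> nextObjectToFetch(Map<Long, Integer> replicaCount) {
        return replicaCount.entrySet().stream()
                .min((a, b) -> Integer.compare(a.getValue(), b.getValue()))
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        RarestFirstSelector selector = new RarestFirstSelector();
        Map<Long, Integer> missing = Map.of(0L, 3, 1L, 1, 2L, 2);
        System.out.println(selector.nextObjectToFetch(missing)); // Optional[1], the rarest object
    }
}
```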
OSD, Object Storage Device
implements concurrency control for replica consistency: POSIX-compliant master/slave replication with failover
group membership service provided by the MRC
lease service "Flease": distributed, scalable and failure-tolerant, 50,000 leases/sec with 30 OSDs, based on quorum consensus (Paxos)
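The Flease protocol itself is beyond a slide, but the timeout reasoning behind leases is easy to sketch under the clock-drift assumption stated earlier: the holder stops acting as master epsilon before the nominal expiry, and others may only take over epsilon after it. The class, field names and constants below are illustrative.

```java
import java.time.Instant;

// Sketch only: shows the safety margins a time-bounded lease needs when clocks
// may drift by at most epsilon. Agreement on the holder is done via quorums.
public class Lease {
    private final String holderOsdUuid;
    private final Instant expiresAt;          // expiry time agreed on by a quorum
    private final long maxClockDriftMs;       // epsilon, e.g. 500 ms (assumed bound)

    public Lease(String holderOsdUuid, Instant expiresAt, long maxClockDriftMs) {
        this.holderOsdUuid = holderOsdUuid;
        this.expiresAt = expiresAt;
        this.maxClockDriftMs = maxClockDriftMs;
    }

    /** The holder must stop acting as master epsilon before the nominal expiry. */
    public boolean stillValidForHolder(Instant now) {
        return now.isBefore(expiresAt.minusMillis(maxClockDriftMs));
    }

    /** Others may only grab the lease epsilon after the nominal expiry. */
    public boolean safeToReacquire(Instant now) {
        return now.isAfter(expiresAt.plusMillis(maxClockDriftMs));
    }

    public String holder() { return holderOsdUuid; }
}
```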
Quorum consensus
Basic idea: when a majority is informed, every other majority has at least one member with up-to-date information; a minority may crash at any time.
Paxos consensus:
Step 1: check whether a consensus c has already been established.
Step 2: re-establish c, or try to establish an own proposal x.
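A tiny illustration of the majority argument: with n = 5 replicas, any two sets of size greater than n/2 share at least one member, so a later majority always contains someone who knows an already established consensus. The replica names are made up.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration of majority intersection with n = 5 replicas.
public class MajorityIntersection {
    public static boolean isMajority(Set<String> replicas, int n) {
        return replicas.size() > n / 2;
    }

    public static void main(String[] args) {
        int n = 5;
        Set<String> writeQuorum = Set.of("r1", "r2", "r3");
        Set<String> readQuorum  = Set.of("r3", "r4", "r5");

        Set<String> overlap = new HashSet<>(writeQuorum);
        overlap.retainAll(readQuorum);

        System.out.println(isMajority(writeQuorum, n)); // true
        System.out.println(isMajority(readQuorum, n));  // true
        System.out.println(overlap);                    // [r3], never empty
    }
}
```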
Paxos roles (pseudocode):

Proposer:
  init: r = 1                      // local round number
        r_latest = 0               // number of the highest acknowledged round
        latest_v = ⊥                // value of the highest acknowledged round
  // send a new proposal
  ack_num = 0                      // number of valid acknowledgements
  send prepare(r) to all acceptors
  on receiving ack(r_ack, v_i, r_i) from acceptor i:
    if r == r_ack:
      ack_num++
      if r_i > r_latest:           // a more recent accepted round
        r_latest = r_i
        latest_v = v_i             // a more recent value
    if ack_num >= maj:             // end of phase 1
      if latest_v == ⊥: propose an own value as latest_v
      send accept(r, latest_v) to all acceptors

Acceptor:
  init: r_ack = 0                  // last acknowledged round
        r_accepted = 0             // last accepted round
        v = ⊥                       // current local value
  on receiving prepare(r) from a proposer:
    if r > r_ack and r > r_accepted:   // higher round
      r_ack = r
      send ack(r_ack, v, r_accepted) to the proposer
  on receiving accept(r, w):
    if r >= r_ack and r > r_accepted:
      r_accepted = r
      v = w
      send accepted(r_accepted, v) to the learners

Learner:
  init: num_accepted = 0           // number of collected accepts
  on receiving accepted(r, v) from acceptor i:
    if r increases: num_accepted = 0
    num_accepted++
    if num_accepted == maj:
      decide v; inform the client  // v is the consensus
Striping Performance on a Cluster
Striping means parallel transfer from/to many OSDs. READ and WRITE bandwidth scales with the number of OSDs; the client is the bottleneck (the slower reads are caused by a TCP ingress problem).
Setup: one client writes/reads a single 4 GB file using asynchronous writes, read-ahead, 1 MB chunk size, and 29 OSDs. Nodes are connected with IP over InfiniBand (1.2 GB/s).
Snapshots & Backups
Metadata snapshots (MRC): need an atomic operation without service interruption; asynchronous consolidation in the background; granularity: subdirectories or volumes; implemented by BabuDB or Scalaris.
File snapshots (OSD): taken implicitly when a file is idle, or explicitly when the file is closed or on fsync(); versioning of file objects via copy-on-write (see the sketch below).
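A minimal sketch of copy-on-write object versioning, using an in-memory stand-in for the OSD's on-disk object files (illustrative, not the actual OSD implementation): a write after a snapshot adds a new version of the object, so the snapshot keeps reading the old bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of copy-on-write versioning of file objects for snapshots.
public class CowObjectStore {
    // object number -> list of versions (index in the list = version id)
    private final Map<Long, List<byte[]>> versions = new HashMap<>();

    /** A write creates a new version of the object instead of overwriting it. */
    public int write(long objectNumber, byte[] data) {
        List<byte[]> v = versions.computeIfAbsent(objectNumber, k -> new ArrayList<>());
        v.add(data.clone());
        return v.size() - 1;               // version id a new snapshot would record
    }

    /** A snapshot remembers a version id and keeps reading exactly that version. */
    public byte[] read(long objectNumber, int version) {
        return versions.get(objectNumber).get(version);
    }

    /** Latest version, as seen by normal (non-snapshot) reads. */
    public byte[] readLatest(long objectNumber) {
        List<byte[]> v = versions.get(objectNumber);
        return v.get(v.size() - 1);
    }
}
```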
Atomic Snapshots in the MRC
implemented with the BabuDB backend, a large-scale database for data that exceeds the system's main memory
two components: small mutable in-memory overlay trees (LSM trees) and a large immutable memory-mapped index on disk
non-transactional key/value store with prefix and range queries
primary design goal: performance! 300,000 lookups/sec (30 million entries), fast crash recovery, fast start-up
Log-Structured Merge Trees: a lookup takes O(s log n), with s = number of snapshots (overlay trees) and n = number of files.
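A sketch of the lookup path that yields the O(s log n) bound, using TreeMaps as simplified stand-ins for BabuDB's overlay trees and on-disk index: check the s in-memory overlays from newest to oldest, then the immutable index, each step costing O(log n).

```java
import java.util.List;
import java.util.TreeMap;

// Simplified stand-in for an LSM-style lookup as in BabuDB.
public class LsmLookup {
    private final List<TreeMap<String, String>> overlays; // newest first, small and mutable
    private final TreeMap<String, String> onDiskIndex;    // large and immutable

    public LsmLookup(List<TreeMap<String, String>> overlays,
                     TreeMap<String, String> onDiskIndex) {
        this.overlays = overlays;
        this.onDiskIndex = onDiskIndex;
    }

    public String lookup(String key) {
        for (TreeMap<String, String> overlay : overlays) {   // s overlay trees
            String v = overlay.get(key);                      // O(log n) each
            if (v != null) {
                return v;                                     // newest version wins
            }
        }
        return onDiskIndex.get(key);                          // final O(log n) step
    }
}
```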
Replicating MRCs and OSDs
Master/slave scheme. Pros: fast local reads, no distributed transactions, easy to implement. Cons: the master is a performance bottleneck; there is an interruption when the master fails, so it needs stable master election.
Replicated state machine (Paxos). Pros: no master, no single point of failure, no extra latency on failure. Cons: slower (two round trips per operation), needs distributed consensus.
XtreemFS Features
Release 1.2.1 (current): RAID and parallel I/O, POSIX compatibility, read-only replication, partial replicas (on demand), security (SSL, X.509), Internet-ready, checksums. Extensions: OSD and replica selection (Vivaldi, datacenter maps), asynchronous MRC backups, metadata caching, graphical admin console, Hadoop file system driver (experimental).
Release 1.3 (very soon): DIR and MRC replication with automatic failover, read/write replication.
Release 2.x: consistent backups, snapshots, automatic replica creation, deletion and maintenance.
Source Code
XtreemFS: http://code.google.com/p/xtreemfs - 35,000 lines of C++ and Java code, GNU GPL v2 license
BabuDB: http://code.google.com/p/babudb - 10,000 lines of Java code, new BSD license
Scalaris: http://code.google.com/p/scalaris - 28,214 lines of Erlang and C++ code, Apache 2.0 license
Summary
Cloud file systems require replication: for availability and for fast access (striping).
Replication requires a consistency algorithm: when crashes are rare, use master/slave replication; with frequent crashes, use Paxos.
This gives only Consistency + Partition tolerance from the CAP theorem.
Our next step: faster high-level data services for MapReduce, Dryad, key/value stores, SQL, ...