The Google File System

Motivations of NFS
NFS (Network File System)
- Allows access to files on remote systems as if they were local files
- Actually a network protocol (initially a single server)
- Simple and fast server crash recovery

Motivations of AFS
AFS (Andrew File System)
- Initially built to support campus-wide file sharing at CMU
- Main purpose is scalability (NFS's server bandwidth is a bottleneck)
- Secure access over the network (transparency)

Motivations of GFS
GFS (Google File System)
- Reliability, scalability, high performance
- Growing demands of Google's data processing
- High disaster tolerance
Assumptions in NFS & AFS
- Size of most files is small => local caching
- Multiple copies may exist at one time
  => NFS clients need periodic checking
  => AFS clients rely on callback promises
- Though a file may be shared, there is usually only one writer at a time
- Reads are more frequent than writes
- Sequential access is more frequent than random access

Assumptions in GFS
- Files are very large (multi-MB to multi-GB) => large chunk size, bandwidth support
- Most writes append data rather than modify it in place
  => optimize for large, sequential reads/writes
  => support (but do not optimize) small, random reads/writes
- High fault tolerance & fast recovery (built on commodity PCs) => real-time monitoring, error detection
- Simultaneous writes are very common => "defined" and "consistent" states in the consistency model

Architecture: Master, Chunkservers, Clients
- Chunk size: 64 MB
- Replication: 3 replicas by default
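As a rough sketch of what the large chunk size means for clients: a client translates a byte offset in a file into a chunk index and asks the master only for that chunk's handle and replica locations. The function names below are illustrative, not part of the GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index_for(offset: int) -> int:
    """Translate a byte offset within a file into a chunk index.

    The client sends (filename, chunk index) to the master; the master
    replies with the chunk handle and the locations of the (by default
    three) replicas holding that chunk."""
    return offset // CHUNK_SIZE

def chunks_touched(start: int, end: int) -> list[int]:
    """Chunk indexes covered by a read of byte range [start, end)."""
    return list(range(chunk_index_for(start), chunk_index_for(end - 1) + 1))
```

With 64 MB chunks, even a multi-gigabyte sequential read contacts the master only a handful of times, which is what keeps the single master from becoming a bottleneck.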
Supporting components: chunk handles, metadata, heartbeat messages, operation log, checkpoints

System Interactions: master to chunkservers
- Heartbeat messages: how the master verifies that chunkservers are alive
- Leases: granted to one replica, the primary chunkserver, to minimize the master's management overhead
- Mutation order: the primary replica determines it; the secondary replicas follow it
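A minimal sketch of the lease idea: the master hands a time-limited lease to one replica, which becomes the primary and serializes mutations without further master involvement until the lease expires. The class and method names here are hypothetical; only the 60-second initial lease timeout comes from the GFS paper.

```python
import time

LEASE_DURATION = 60.0  # GFS uses a 60-second initial lease timeout

class Master:
    """Toy model of the master's lease bookkeeping."""

    def __init__(self):
        # chunk handle -> (primary chunkserver, lease expiry time)
        self.leases = {}

    def grant_lease(self, chunk: str, chunkserver: str, now=time.time):
        """Grant (or refresh) a lease, making `chunkserver` the primary."""
        self.leases[chunk] = (chunkserver, now() + LEASE_DURATION)

    def primary_for(self, chunk: str, now=time.time):
        """Return the current primary, or None if no valid lease exists."""
        entry = self.leases.get(chunk)
        if entry is None or entry[1] <= now():
            return None  # master is free to grant a fresh lease to any replica
        return entry[0]
```

The expiry time matters for recovery: if the primary becomes unreachable, the master simply waits out the lease and then grants a new one to another replica, with no risk of two primaries ordering mutations at once.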
System Interactions: data flow and atomic record appends
- Master: accepts requests carrying chunk indexes; answers with the identity of the primary and the locations of the secondaries
- GFS, not the client, chooses the offset at which the record is appended and returns that offset to the client
- Clients: cache information sent by the master, push data to the chunkservers (step 3), send the write request to the primary (step 4), and receive any error information back (step 7)
- If the record would overflow the chunk, the primary pads the current chunk on all replicas and the client retries the append on the next chunk, so the data lands at the same offset on every replica
- Primary and replicas: the primary serializes the operations and forwards the order to the replicas; the replicas report completion back to the primary (steps 5, 6). If any replica fails, the client retries, which may leave duplicate append records on some replicas

Namespace management
- No per-directory data structure
- No aliases (hard or symbolic links) for files or directories
- A lookup table maps each full pathname to its metadata

Architecture comparison
- GFS: two-level architecture; large data flows; control flow separated from data flow
- AFS/NFS: single-level architecture; no master level; the server also stores the data
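The pad-and-retry rule above can be sketched as a toy model of the primary's decision. Real GFS pads the remainder with filler the client library knows to skip; here the class and method names are illustrative only.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk

class PrimaryChunk:
    """Toy model of the primary's decision for a record append."""

    def __init__(self, used: int = 0):
        self.used = used  # bytes already occupied in the current chunk

    def try_append(self, record_len: int):
        """Return the offset the primary chose for this record, or None
        if the record does not fit. On None, the primary pads the chunk
        to its boundary on every replica and the client retries the
        append on the next chunk."""
        if self.used + record_len > CHUNK_SIZE:
            self.used = CHUNK_SIZE  # pad the remainder; chunk is now full
            return None
        offset = self.used
        self.used += record_len
        return offset
```

Because a failed append is simply retried, the same record may be written more than once: GFS promises only that each record appears atomically at least once at some offset, and readers must tolerate duplicates and padding.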
Stateful or Stateless
- GFS/NFS: stateless; chunk handles & file handles; fast & simple recovery via simply retried requests, which idempotency makes safe (GFS additionally retries against chunkserver replicas)
- AFS: stateful; low network load and transparency, but a complicated and costly recovery algorithm

Caching / Buffering
- GFS: no client caching/buffering; operations are mostly appends, and the files are too large for client buffering to help
- AFS/NFS: client-side caching; WRITE/READ buffering

References
- Sandberg, Russel. "The Sun Network File System: Design, Implementation and Experience." Distributed Computing Systems: Concepts and Structures (1987): 300-316.
- Sandberg, Russel, et al. "Design and Implementation of the Sun Network Filesystem." Proceedings of the Summer USENIX Conference, 1985.
- Arpaci-Dusseau, Remzi H. "Sun's Network File System (NFS)." http://pages.cs.wisc.edu/~remzi/ostep/dist-nfs.pdf
- Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review 37.5 (2003).