Cloud Computing at Google: GFS, Bigtable, and MapReduce
Web Systems and Algorithms
Chris Brooks
Department of Computer Science, University of San Francisco

Google has developed a layered system to handle web-scale applications:
- Google File System (GFS)
- Bigtable
- MapReduce

What are the primary design issues surrounding GFS?

Design Issues
- Commodity hardware: failures are the rule, not the exception.
- Huge files.
- Files tend to be written once, then either appended to or streamed; random writes are rare. What sorts of applications would have this behavior?
- Multiple users may simultaneously write to a file.
- API and application design should happen in tandem.
- Sustained bandwidth is more important than latency.

Architecture
- A single master, many chunkservers, many clients.
- Files are divided into 64 MB chunks.
- Each chunk is redundantly stored at several chunkservers.

What is the master's role?
The Master
- Maintains metadata: the namespace, access control information, the mapping of files to chunks, and chunk locations.
- Refers clients to chunkservers.
- Controls lease management, garbage collection, and chunk migration.

Chunkservers
What is the chunkserver's role?
- Serve up chunks to clients.

Control Flow
What is the typical order of operations for a client that wants to read a file?
- The client sends a filename and offset to the master.
- The master returns a chunk handle and the replica locations.
- The client chooses a replica and requests a byte range within the chunk.
- The master is not needed for further data exchange.

What are some advantages of this approach?
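The client's half of the read path above can be sketched in a few lines. The chunk size is the real GFS value; the filename, chunk handles, and chunkserver names are illustrative placeholders, and the master is simulated as an in-memory table:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

# Simulated master state: (filename, chunk index) -> (chunk handle, replicas).
# Handles and chunkserver names here are hypothetical.
chunk_table = {
    ("/logs/web.log", 0): ("handle-001", ["cs1", "cs4", "cs7"]),
    ("/logs/web.log", 1): ("handle-002", ["cs2", "cs4", "cs9"]),
}

def lookup(filename, byte_offset):
    # The client translates a byte offset into a chunk index, then asks
    # the master (here, a dict) for the handle and replica locations.
    chunk_index = byte_offset // CHUNK_SIZE
    handle, replicas = chunk_table[(filename, chunk_index)]
    return chunk_index, handle, replicas

# A read at offset 70 MB falls in the second chunk (index 1).
print(lookup("/logs/web.log", 70 * 1024 * 1024))
```

After this one exchange the client talks to a chosen replica directly, which is why the master stays out of the data path.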
Advantages
- Simplicity of the master: all data is held in memory, and the master does not handle chunk access itself.
- Failures are easy to handle: the client just requests a different replica.

Is the single master a potential point of failure? How can the master recover from a crash?

Persistence
- The master keeps all of its data structures in memory.
- Each file action is recorded in an operation log.
- The master also periodically checkpoints its state.
- On failure, it reloads the latest checkpoint and replays the log from that point.

Chunk Info
How does the master know what chunks are stored at each chunkserver?
- The master periodically sends a heartbeat to each chunkserver.
- The chunkserver responds with a list of all of its stored chunks and their status.
- Occasionally the master may have stale information; accepting this simplifies the master and reduces overhead.

Consistency
What does consistent mean? What does defined mean?
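The checkpoint-plus-log recovery scheme above can be sketched as follows. The operations and data structures are simplified stand-ins for the master's real metadata, not the actual GFS implementation:

```python
import copy

class Master:
    """Minimal sketch of checkpoint + operation-log recovery."""

    def __init__(self):
        self.namespace = {}        # filename -> list of chunk handles
        self.log = []              # operation log, appended before applying
        self.checkpoint = ({}, 0)  # (namespace snapshot, log position)

    def create(self, filename):
        self.log.append(("create", filename))
        self.namespace[filename] = []

    def add_chunk(self, filename, handle):
        self.log.append(("add_chunk", filename, handle))
        self.namespace[filename].append(handle)

    def take_checkpoint(self):
        # Snapshot the in-memory state and remember how much log it covers.
        self.checkpoint = (copy.deepcopy(self.namespace), len(self.log))

    def recover(self):
        # Reload the checkpoint, then replay only the log suffix after it.
        self.namespace = copy.deepcopy(self.checkpoint[0])
        for op in self.log[self.checkpoint[1]:]:
            if op[0] == "create":
                self.namespace[op[1]] = []
            else:
                self.namespace[op[1]].append(op[2])
```

Replaying only the suffix is the point of checkpointing: recovery time is bounded by the log written since the last checkpoint, not by the full history.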
Consistency
- Consistent: all clients see the same data.
- Defined: all clients see the complete results of a mutation.
- If a single mutation succeeds, the affected region is consistent and defined.
- Concurrent writes may leave a region consistent but not defined.
- Appends are handled more efficiently than random writes.

Implications for Applications
What implications does this model have for an application?
- Applications should append when possible.
- Applications need to keep track of the defined regions of a file.
- Applications will need to tolerate or filter occasional duplicate records.

Leases
What is a lease? How is it used?
- A lease is a grant that allows mutations to a chunk.
- The master grants the lease to one replica (the primary), which then coordinates writes with the other replicas.

What is the order of operations for writing replicated data?
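The duplicate-record implication above can be sketched like this. The per-record unique id is an application-level convention assumed for illustration; GFS itself does not provide it:

```python
def filter_duplicates(records):
    # records: iterable of (record_id, payload) pairs, where the writer
    # embedded a unique id in each record (an application convention).
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # drop the duplicate left by a retried append
        seen.add(record_id)
        yield record_id, payload

# Record 2 appears twice, as if a record append had been retried.
stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
print(list(filter_duplicates(stream)))  # -> [(1, 'a'), (2, 'b'), (3, 'c')]
```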
Writing Replicated Data
- The client asks the master which chunkserver holds the lease for the chunk (the primary).
- The client pushes the data to all replicas, where it is cached.
- The client sends a write request to the primary.
- The primary forwards the write request to all replicas.
- All replicas apply writes to that chunk in the same order.
- What if a replica fails during this operation?

Data Flow
- Data is pushed between replicas in a linear chain.
- This is an interesting choice; they could have used multicast, or a tree. Why is this?

Bigtable
- Bigtable is implemented on top of GFS.
- What are the goals of Bigtable? High availability, scalability, and high performance.
- What does it not provide? Complex relational queries and rich datatypes.

Data Model
What is Bigtable's data model?
- A multidimensional map: (row name, column name, timestamp) maps to a data cell (an uninterpreted string).
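The (row, column, timestamp) → string data model above can be sketched as a plain dictionary, ignoring tablets, persistence, and garbage collection of old versions. The row and column names follow the style of the Bigtable paper's web-table example but are illustrative:

```python
class TinyBigtable:
    """Sketch of the data model only: a map from
    (row, column, timestamp) -> string."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the most recent version of a cell, or None if absent.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = TinyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1")
t.put("com.cnn.www", "contents:", 2, "<html>v2")
print(t.get("com.cnn.www", "contents:"))  # -> <html>v2
```

The timestamp dimension is what lets Bigtable keep multiple versions of a cell and expire or return them by recency.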
Rows
- Rows are broken into ranges called tablets, arranged lexicographically by row key. What is the thinking behind this?

Column Families
- Column keys are grouped into column families. What is the thinking behind this?

Data Storage
- GFS is used to store the data; Bigtable can coexist with other applications.
- Data files are written out using the SSTable file format.
- Chubby is used to provide locking and synchronization.

Architecture
- A master, many tablet servers, clients, and Chubby.

Tablet Servers
What do tablet servers do?
- Handle interactions with clients; read and write data.
- Tablets are not replicated; replication happens in GFS underneath.
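The SSTable format mentioned above can be approximated in a few lines: an immutable, sorted sequence of key/value pairs with binary-search lookup. This is a sketch of the idea only; the real format adds blocks, a block index, and optional compression:

```python
import bisect

class MiniSSTable:
    """Sketch of an SSTable: write-once, sorted, searched by key."""

    def __init__(self, items):
        self.items = sorted(items)  # immutable after construction
        self.keys = [k for k, _ in self.items]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.items[i][1]
        return None

table = MiniSSTable([("b", 2), ("a", 1), ("c", 3)])
print(table.get("b"))  # -> 2
```

Immutability is what makes the format cheap on top of GFS: files are written sequentially once and then only read, which matches the write-once/append workload GFS is built for.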
Finding a Tablet
How does a client find a tablet?
- The location of the root tablet is obtained via Chubby.
- This leads to a map of tablets to tablet servers.
- That information is then cached by the client.
- The client then communicates directly with the tablet server.

The Master
What is the role of the master?
- Keep track of tablet servers.
- Place unassigned tablets.

Discussion
How can the master tell that a tablet server has died?
- When a tablet server starts, it creates a lock in Chubby.
- The master queries each server for the status of its lock.
- If a server does not reply, the master attempts to acquire the lock itself.
- If successful, it redistributes that server's tablets.

How does Bigtable's architecture compare to GFS? What advantages does this structure have? How does this compare to architectures such as CAN or Chord that you might've learned about in 682?
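Client-side tablet location lookup with caching can be sketched as follows. The tablet ranges and server names are made up, and the metadata lookup is simulated with a sorted list of end keys rather than a real root tablet:

```python
import bisect

# Hypothetical metadata: tablets sorted by their end row key,
# each range assigned to a tablet server.
tablet_end_keys = ["apple", "mango", "zzzzz"]
tablet_servers  = ["ts1",   "ts2",   "ts3"]

location_cache = {}

def find_tablet(row_key):
    # Consult the client-side cache first; otherwise do the (simulated)
    # metadata lookup and cache the result for next time.
    if row_key in location_cache:
        return location_cache[row_key]
    i = bisect.bisect_left(tablet_end_keys, row_key)
    server = tablet_servers[i]
    location_cache[row_key] = server
    return server

print(find_tablet("banana"))  # "banana" sorts after "apple" -> ts2
```

Because tablets are lexicographic ranges, a binary search over end keys finds the owning tablet, and caching keeps the master and metadata tablets off the common-case read path.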
MapReduce
What is the basic paradigm of MapReduce?
- Define a map operation that is applied to each record in the input to generate intermediate key/value pairs.
- Define a reduce operation that is applied to all values with the same key to aggregate the results.

Example
The classic example, counting words:

    def map(document, text):
        for word in text.split():
            yield word, 1

    def reduce(key, counts):
        yield key, sum(counts)

Parallelizing
- Structuring a problem this way allows the map function to run simultaneously on many different machines, each over a subset of the data.
- Reduce can then run in parallel for each key.

Implementation
- The input data is split into a number of pieces, and the intermediate keyspace is subdivided.
- A master assigns tasks to workers.
- Each map task is performed independently; its results are buffered, and their locations are returned to the master.
- The master forwards the locations of mapped output to the reduce workers.
- Reduce workers collect all data associated with their keys, perform the reduce, and write the output to a file.

How is worker failure handled?
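The steps above — map every input, shuffle (group intermediate pairs by key), then reduce each group — can be sketched as a single-process driver. `run_mapreduce` is an illustrative stand-in for the framework, not the real distributed implementation:

```python
from collections import defaultdict

def word_count_map(key, text):
    for word in text.split():
        yield word, 1

def word_count_reduce(key, values):
    yield key, sum(values)

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: run the mapper over every input record.
    # Shuffle phase: group the intermediate pairs by key.
    groups = defaultdict(list)
    for key, text in inputs:
        for k, v in mapper(key, text):
            groups[k].append(v)
    # Reduce phase: run the reducer once per distinct key.
    output = {}
    for k, vs in groups.items():
        for rk, rv in reducer(k, vs):
            output[rk] = rv
    return output

docs = [("d1", "the cat"), ("d2", "the dog")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# -> {'the': 2, 'cat': 1, 'dog': 1}
```

In the real system the map phase, shuffle, and reduce phase each run across many machines; collapsing them into one loop makes the data flow easy to see.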
Failure
How is worker failure handled?
- Workers are pinged periodically.
- In-progress tasks belonging to non-responsive workers are reassigned.
- Completed map tasks must also be redone, since their output sits on the failed worker's local disk.

How is master failure handled?
- Checkpointing and restarting.

Refinements
The authors describe a number of refinements to MapReduce. What are they, and why are they useful?
- User-defined partitioning functions
- User-defined combiners
- Specialized input readers
- Skipping bad records

MR vs. DBMS
Stonebraker et al. identify the sorts of tasks at which MapReduce (Hadoop) excels, and those at which an RDBMS excels.

MapReduce:
- Extract-Transform-Load
- Complex analytics that require multiple passes
- Semi-structured data (key/value pairs)
- Quick-and-dirty problems
- Limited budget
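The combiner refinement above can be sketched like this: pre-aggregating counts on the map side, so one (word, count) pair per distinct word crosses the network instead of one pair per occurrence. The function name is illustrative:

```python
from collections import Counter

def map_with_combiner(key, text):
    # A combiner folded into the map task: count locally, then emit
    # one pair per distinct word rather than one pair per occurrence.
    counts = Counter(text.split())
    for word, n in counts.items():
        yield word, n

pairs = list(map_with_combiner("d1", "to be or not to be"))
print(sorted(pairs))  # -> [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

This works because word-count's reduce (summation) is associative and commutative; a combiner is only safe for reductions with that property.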
MR vs. DBMS
Parallel DBMS:
- Grep and log mining with group-by
- Joins (e.g., combining a table of user visits to URLs with a PageRank table)

Stonebraker et al. suggest some reasons why a DBMS might do better even on tasks that seem to be in MapReduce's area of expertise:
- Repeated parsing of records in MapReduce
- Tuned compression in the DBMS
- Intermediate data is streamed between operators, rather than written to disk
- Scheduling: DBMSs construct a query plan

Takeaway
- Hadoop could incorporate streaming and more job-aware scheduling.
- SQL is arguably easier to write than MapReduce code.
- DBMSs need to be more plug-and-play.
- DBMSs should work with filesystem data.