Cloud Computing with Mathematical Applications
Lecture, Summer Semester 2009
Dr. Marcel Kunze
Karlsruhe Institute of Technology (KIT), Steinbuch Centre for Computing (SCC)
KIT - the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
www.kit.edu
Agenda: Cloud Computing
1. Introduction: What is cloud computing?
2. Foundations: Virtualization, web services
3. Cloud architectures: Infrastructure, platform, application
4. Cloud services: Amazon Web Services, Google App Engine
5. Building a cloud: OpenCirrus project, Eucalyptus, Hadoop, Google
6. Cloud algorithms: MapReduce, optimization methods, practical exercises and applications
Lecture on the web: http://www.mathematik.uni-karlsruhe.de/mitglieder/lehre/cloud2009s/
Infrastructure for Search Systems
Several key pieces of infrastructure:
- GFS: large-scale distributed file system
- MapReduce: makes it easy to write/run large-scale jobs
  - generate production index data more quickly
  - perform ad-hoc experiments more rapidly
- BigTable: semi-structured storage system
  - online, efficient access to per-document information at any time
  - multiple processes can update per-doc info asynchronously
  - critical for updating documents in minutes instead of hours
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/bigtable.html
Google Cluster
- Thousands of computers
- Distributed: computers have their own disks, and the file system spans those disks
- Failures are the norm: disks, networks, processors, power supplies, application software, operating system software, human error
- Files are huge: multi-gigabyte files, each containing many objects
- Read/write characteristics:
  - Files are mutated by appending
  - Once written, files are typically only read
  - Large streaming reads and small random reads are typical
Google File System (GFS)
- A GFS cluster has one master and many chunkservers
- Files are divided into 64 MB chunks
- Chunks are replicated and stored in the Unix file systems of the chunkservers
- The master holds metadata
- Clients get metadata from the master, and data directly from chunkservers
Google File System - Read
- From the byte offset within the file, the client computes the chunk index
- The client sends the filename and chunk index to the master
- The master returns a list of replicas of the chunk
- The client interacts with a replica to access the data
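The read path above can be sketched in a few lines of Python. This is a toy illustration of the protocol, not Google's code; the `Master` and chunkserver structures are invented stand-ins, and the 64 MB chunk size is the one stated on the previous slide.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as in GFS

class Master:
    """Holds only metadata: (filename, chunk index) -> replica locations."""
    def __init__(self):
        self.chunk_locations = {}  # (filename, chunk_index) -> [replica names]

    def lookup(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

def read(master, chunkservers, filename, offset, length):
    # 1. From the byte offset, the client computes the chunk index itself
    chunk_index = offset // CHUNK_SIZE
    # 2./3. Ask the master which replicas hold that chunk
    replicas = master.lookup(filename, chunk_index)
    # 4. Fetch the data directly from one replica
    #    (the master never sees file data, only metadata)
    server = chunkservers[replicas[0]]
    within = offset % CHUNK_SIZE
    return server[(filename, chunk_index)][within:within + length]

# Toy setup: one chunkserver "cs1" storing chunk 0 of /logs/a
master = Master()
master.chunk_locations[("/logs/a", 0)] = ["cs1"]
chunkservers = {"cs1": {("/logs/a", 0): b"hello gfs"}}
print(read(master, chunkservers, "/logs/a", 6, 3))  # b'gfs'
```

Note how the master only answers the metadata question; the bulk data transfer bypasses it entirely, which is what lets a single master scale to thousands of clients.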
Google File System - Write
- The client asks the master for the identity of the primary and secondary replicas
- The client pushes the data to memory at all replicas via a replica-to-replica chain
- The client sends the write request to the primary
- The primary orders concurrent requests and triggers disk writes at all replicas
- The primary reports success or failure to the client
What is BigTable?
- A BigTable is a sparse, distributed, persistent, multidimensional sorted map
- The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes
- Google's implementation of a database
- Lots of semi-structured data, at enormous scale
(row:string, column:string, time:int64) -> string
Relative to a DBMS, BigTable provides:
- Simplified data retrieval mechanism: (row, column, time) -> string only
- No relational operators
- Atomic updates only possible at row level
+ Arbitrary number of columns per row
+ Arbitrary data type for each column
Designed and optimized for Google's application set; provides extremely large scale (data, throughput) at extremely small cost
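The (row, column, time) -> value data model can be made concrete with a small sketch. This is a toy in-memory illustration of the model only (no distribution, no persistence); the class name and the example row/column keys are invented for illustration.

```python
import bisect

class TinyBigTable:
    """Toy sketch of BigTable's data model: a sparse map
    (row: str, column: str, time: int) -> bytes.
    Real BigTable distributes this map and persists it in GFS."""
    def __init__(self):
        # row key -> {column key -> {timestamp -> value}}; absent cells cost nothing,
        # which is what "sparse" means here
        self.rows = {}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (default: latest)."""
        cells = self.rows.get(row, {}).get(column, {})
        if not cells:
            return None
        ts = sorted(cells)
        if timestamp is None:
            return cells[ts[-1]]
        i = bisect.bisect_right(ts, timestamp) - 1
        return cells[ts[i]] if i >= 0 else None

t = TinyBigTable()
t.put("com.cnn.www", "contents:", 1, b"<html>v1")
t.put("com.cnn.www", "contents:", 2, b"<html>v2")
print(t.get("com.cnn.www", "contents:"))     # latest version: b'<html>v2'
print(t.get("com.cnn.www", "contents:", 1))  # as of timestamp 1: b'<html>v1'
```

The timestamp dimension is what lets multiple processes update per-document info asynchronously: writers add new versions instead of overwriting, and readers pick the version they need.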
Physical Representation of Data
- A logical table is divided into multiple tablets
- Each tablet is one or more SSTable files in GFS
- Each tablet stores an interval of table rows
- Rows are lexicographically ordered (by key)
- If a tablet grows beyond a certain size, it is split into two new tablets
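Tablet splitting can be sketched as follows. This is a deliberately simplified illustration: the split threshold here is a row count, whereas real tablets split by byte size, and the `Tablet` class and example keys are invented for the sketch.

```python
import bisect

class Tablet:
    """Toy tablet: a lexicographically sorted interval of rows."""
    MAX_ROWS = 4  # toy threshold; real tablets split by size, not row count

    def __init__(self, rows=None):
        self.rows = rows or []  # sorted list of (row_key, value)

    def insert(self, key, value):
        bisect.insort(self.rows, (key, value))  # keeps rows ordered by key

    def maybe_split(self):
        """If the tablet grew too large, split it into two new tablets,
        each covering one half of the original row interval."""
        if len(self.rows) <= self.MAX_ROWS:
            return [self]
        mid = len(self.rows) // 2
        return [Tablet(self.rows[:mid]), Tablet(self.rows[mid:])]

t = Tablet()
for key in ["com.cnn.www", "com.bbc.www", "de.kit.www",
            "org.apache", "org.wikipedia"]:
    t.insert(key, b"...")
left, right = t.maybe_split()
print([k for k, _ in left.rows])   # ['com.bbc.www', 'com.cnn.www']
print([k for k, _ in right.rows])  # ['de.kit.www', 'org.apache', 'org.wikipedia']
```

Because rows stay lexicographically ordered, each tablet after the split still covers a contiguous key interval, so a row lookup only needs to find the one tablet whose interval contains the key.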
Software Structure of a BigTable Cell
- One master server: communicates only with tablet servers
- Multiple tablet servers: perform the actual client accesses
- Chubby lock service: holds metadata (e.g., the location of the root metadata tablet for the table), handles master election
- GFS servers provide the underlying storage
High-Level Structure (figure)
The Eight Design Fallacies
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
-- Peter Deutsch and James Gosling, Sun Microsystems
ATC Architecture (figure: clients, network infrastructure, ATC state servers)
- ATC status is a kind of temporal database: for each ATC sector, it tells us what flights might be in that sector and when they will be there
Server Replication
- Let's think about the service that tracks the status of ATC sectors
  - Client systems are like web browsers
  - The server is like a web service
- ATC is a cloud, but one with special needs: it speaks with one voice
- ATC needs highly available servers, else a crash could leave a controller unable to make decisions
- So: how can we make a service highly available?
- Most obvious option: primary/backup
  - We run two servers on separate platforms
  - The primary sends a log to the backup
  - If the primary crashes, the backup soon catches up and can take over
A Primary-Backup Scenario (figure: clients -> primary, primary -log-> backup)
Clients are initially connected to the primary, which keeps the backup up to date. The backup collects the log.
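The log-shipping idea above can be sketched minimally. This is an illustration of the mechanism only, with invented class names and no real networking; in particular it ships the log synchronously, which is one of several possible design choices.

```python
class Backup:
    """Replays the primary's log to stay in sync."""
    def __init__(self):
        self.state = {}

    def apply(self, record):
        op, key, value = record
        if op == "set":
            self.state[key] = value

class Primary:
    """Applies each update locally, then ships a log record to the backup
    before acknowledging, so the backup can take over with the same state."""
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        self.backup.apply(("set", key, value))
        return "ok"

backup = Backup()
primary = Primary(backup)
primary.update("sector-7", "US227 expected 14:05")
print(backup.state == primary.state)  # True: the backup has caught up
```

The fragility the next slides expose is exactly here: `primary.update` assumes the link to the backup either works or is known to be down. When the link merely *looks* down on one side, both servers can end up believing they are primary.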
Split-Brain Syndrome (figure: primary and backup with some links broken)
A transient problem causes some links to break, but not all. The backup thinks it is now primary; the primary thinks the backup is down.
Split-Brain Syndrome
- Primary: "Safe for US227 to land on Ithaca's NW runway"
- Backup: "Safe for NW111 to land on Ithaca's NW runway"
Some clients are still connected to the primary, but one has switched to the backup, and one is completely disconnected from both.
Oh No! But How Could This Happen?
- How do web service systems detect failures? The specifications don't really answer this question
- A web client senses a failure if it can't connect to a server, or if the connection breaks
- The connections are usually TCP, so how does TCP detect failures?
  - Under the surface, TCP sends data in IP packets, and the receiver acknowledges receipt
  - TCP channels break if a timeout occurs
- Fixing the problem:
  - Just have the backup unplug the primary
  - Alternative: install a consistency mechanism (lock service)
Chubby
- A coarse-grained lock service
- Other distributed systems can use it to synchronize access to shared resources
- Intended for use by loosely coupled distributed systems
- Design goals: high availability, reliability
- Anti-goals: high performance, throughput, storage capacity
Chubby - Intended Use Cases
- GFS: elect a master
- BigTable: master election, client discovery, table service locking
- Well-known location to bootstrap larger systems
- Partition workloads
- Locks should be coarse: held for hours or days; build your own fast locks on top
Chubby - External Interface
- Presents a simple distributed file system
- Clients can open/close/read/write files
- Reads and writes are whole-file
- Also supports advisory reader/writer locks
- Clients can register for notification of file updates
Files == Locks?
- Files are just handles to information
- The contents of the file is one (primary) attribute
- As is the owner of the file, permissions, date modified, etc.
- A file can also have an attribute indicating whether it is locked or not
Chubby - Topology (figure)
Chubby Master Election
- Master election is simple: all replicas try to acquire a write lock on a designated file; the one that gets the lock is the master
- The master can then write its address to the file; other replicas can read this file to discover the chosen master's name
- Chubby thus doubles as a name service
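The election recipe can be sketched with an exclusive file as a stand-in for Chubby's write lock. This is only an analogy on a local file system (Chubby itself replicates the lock via Paxos, as the next slides explain); the file path and replica names are invented for the sketch.

```python
import os
import tempfile

# Toy stand-in for the designated Chubby file: created with O_EXCL, so
# exactly one process can "acquire the lock" by creating it first.
LOCK_FILE = os.path.join(tempfile.gettempdir(), "toy_master_lock")

def try_become_master(address):
    """Try to acquire the write lock; the winner writes its own address."""
    try:
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else already holds the lock
    with os.fdopen(fd, "w") as f:
        f.write(address)  # the lock file doubles as a name service entry
    return True

def current_master():
    """Any replica can read the file to discover the elected master."""
    with open(LOCK_FILE) as f:
        return f.read()

if os.path.exists(LOCK_FILE):
    os.remove(LOCK_FILE)                        # fresh election
print(try_become_master("replica-1:9000"))      # True: wins the election
print(try_become_master("replica-2:9000"))      # False: lock already held
print(current_master())                         # replica-1:9000
```

The point of the analogy: mutual exclusion plus a readable name in one object is all an election needs, which is why Chubby exposes locks as files rather than as a separate lock API.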
Chubby - Distributed Consensus
- A Chubby cell usually consists of 5 replicas
- 3 must be alive for the cell to be viable
- How do the replicas in Chubby agree on their own master and on official lock values? The PAXOS algorithm
PAXOS
Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.
Processor assumptions:
- Operate at arbitrary speed
- Independent, random failures
- Processors with stable storage may rejoin the protocol after a failure
- Do not lie, collude, or attempt to maliciously subvert the protocol
Network assumptions:
- All processors can communicate with ("see") one another
- Messages are sent asynchronously and may take arbitrarily long to deliver
- Order of messages is not guaranteed: they may be lost, reordered, or duplicated
- Messages, if delivered, are not corrupted in the process
A Fault-Tolerant Memory of Facts
- Paxos provides a memory for individual facts in the network
- A fact is a binding from a variable to a value
- Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail
Roles:
- Proposer: an agent that proposes a fact
- Leader: the authoritative proposer
- Acceptor: holds agreed-upon facts in its memory
- Learner: may retrieve a fact from the system
Safety guarantees:
- Nontriviality: only proposed values can be learned
- Consistency: at most one value can be learned
- Liveness: if at least one value V has been proposed, eventually any learner L will get some value
Step 1: Prepare (figure)
Step 2: Promise (figure)
Step 3: Accept (figure)
Step 4: Accepted (figure)
Learning Values (figure)
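The four message steps can be condensed into a single-decree Paxos sketch. This is a simplified, single-process illustration (no real messaging, no retries, no separate learner role), but the prepare/promise/accept/accepted logic follows the protocol.

```python
class Acceptor:
    """Single-decree Paxos acceptor (sketch)."""
    def __init__(self):
        self.promised = -1           # highest proposal number promised
        self.accepted = (-1, None)   # (proposal number, value) accepted so far

    def prepare(self, n):
        """Steps 1/2, Prepare -> Promise: promise to ignore proposals < n
        and report any value already accepted."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        """Steps 3/4, Accept -> Accepted."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    """One proposer round; needs a majority at both phases."""
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority: retry later with a higher n
    # Key safety rule: if any acceptor already accepted a value,
    # the proposer must adopt the highest-numbered one instead of its own
    prior = max((p[1] for p in granted), key=lambda av: av[0])
    if prior[1] is not None:
        value = prior[1]
    oks = [a.accept(n, value) for a in acceptors]
    return value if oks.count("accepted") > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]  # a 5-replica cell, as in Chubby
print(propose(acceptors, n=1, value="replica-1 is master"))
print(propose(acceptors, n=2, value="replica-2 is master"))  # still replica-1
```

The second call demonstrates the consistency guarantee from the previous slides: once "replica-1 is master" has been chosen, a later proposer with a higher number rediscovers it in the promise messages and re-proposes it, so at most one value is ever learned.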
Cross-Language Information Retrieval
- Translate all the world's documents into all the world's languages
  - Increases index size substantially
  - Computationally expensive
  - But huge benefits if done well
- Challenges:
  - Continuously improving translation quality
  - Large-scale systems work to deal with larger and more complex language models
  - Translating one sentence requires ~1M lookups in a multi-TB model
ACLs in Information Retrieval Systems
- Retrieval systems with a mix of private, semi-private, widely shared and public documents
  - e.g. e-mail vs. a doc shared among 10 people vs. messages in a group with 100,000 members vs. public web pages
- Challenge: building retrieval systems that efficiently deal with ACLs that vary widely in size
  - The best solution for a doc shared with 10 people differs from that for a doc shared with the world
  - The sharing patterns of a document might change over time
Automatic Construction of Efficient IR Systems
- Several retrieval systems are currently in use
  - e.g. one system for sub-second update latencies, one for a very large number of documents with daily updates, ...
  - Common interfaces, but very different implementations, primarily for efficiency
  - Works well, but lots of effort to build, maintain and extend the different systems
- Challenge: can we have a single parameterizable system that automatically constructs an efficient retrieval system based on these parameters?
Information Extraction from Semi-Structured Data
- Data with clearly labelled semantic meaning is a tiny fraction of all the data in the world
- But there is lots of semi-structured data: books and web pages with tables, data behind forms, ...
- Challenge: algorithms/techniques for improved extraction of structured information from unstructured/semi-structured sources
  - Noisy data, but lots of redundancy
  - Want to be able to correlate/combine/aggregate info from different sources
Summary
- Google Cloud: GFS, BigTable
- Consistency: the split-brain problem
- Chubby lock service
- Distributed consensus: the Paxos algorithm
Karlsruhe Institute of Technology, Steinbuch Centre for Computing (SCC)
Thank you for your attention.