Cloud (data) management
Ahmed Ali-Eldin

Agenda
First part: ZooKeeper (Yahoo!), a highly available, scalable, distributed configuration, consensus, group membership, leader election, naming, and coordination service.
Second part: Kafka (LinkedIn), a distributed logging service that uses ZooKeeper.

Some examples from Yahoo!: Crawling
Fetch pages from the web. Rough estimate: 200 billion documents (200 x 10^9).
If we use a single server and 1 s to fetch each page, then even fetching 100 pages in parallel takes 2 billion seconds, about 63 years!
More complications: pages are removed, pages change their content, politeness (e.g., the crawl-delay directive).

Some examples from Yahoo!: Hadoop
Large-scale data processing with Map-Reduce on large clusters of compute nodes, on the order of thousands of computers (Yahoo!: 13,000+).
Jobs distribute the computation across nodes (Yahoo!: hundreds of thousands of jobs a month).
An example: WebMap
Number of links between pages in the index: roughly 1 trillion links.
Size of output: over 300 TB, compressed!
Number of cores used to run a single Map-Reduce job: over 10,000.
Raw disk used in the production cluster: over 5 petabytes.
Distributed systems
Large number of processes, running on heterogeneous hardware, communicating using messages.
Such systems are often asynchronous: a process may take an unbounded amount of time to execute a step, and messages may be delayed for an unbounded time. This makes it difficult to determine whether a process has failed or is just slow.

DS suck
Distributed algorithms are not trivial to understand and implement. Debugging is more than painful. The same functionality is implemented over and over again. It takes forever to build anything and get it to work. Edge cases are very hard to handle.

Or let us make things simpler first
So let us use the Cloud, which is even more distributed, and build a coordination service to coordinate the processes of distributed applications. Programmers should be able to use it easily. Therefore: ZooKeeper (Yahoo!).
ZooKeeper is used widely in the industry: Kafka (LinkedIn), Storm (Twitter), Hadoop (Cloudera), Haystack (Facebook).
Provide API rather than service
One approach is to provide certain primitives. They decided not to, but rather to expose an API that enables application developers to implement their own primitives: a coordination kernel that allows developers to specify new primitives without requiring changes to the service core. They also decided against providing any blocking primitives.
Observation: programmers suck at using locks and tend to create distributed deadlocks, but they use (shared) file systems, which also have locks, well.
ZooKeeper hence implements an API that manipulates simple wait-free data objects organized hierarchically as in file systems, and provides guarantees (and some recipes):
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound: either system changes will be seen by a client within this bound, or the client will detect a service outage.

Terminology
Server: a process providing the ZooKeeper service.
znode: an in-memory data node in the ZooKeeper data, which is organized in a hierarchical namespace referred to as the data tree.
Update and write: any operation that modifies the state of the data tree.
Session: established when a client connects to ZooKeeper.
Session handle: obtained on connection, used to issue requests.
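A minimal sketch of establishing a session with the standard Apache ZooKeeper Java client (the ensemble address and timeout below are placeholders):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionExample {
        public static void main(String[] args) throws Exception {
            // Connecting establishes a session; the returned handle is used for all requests.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    // Session state changes and watch notifications are delivered here.
                    System.out.println("event: " + event);
                }
            });
            // ... issue requests through zk ...
            zk.close(); // terminates the session; the server then removes the session's ephemeral znodes
        }
    }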
znodes
Organized according to a hierarchical name space. Two types of znodes:
Regular: following the UNIX notation /A/B/C, znode C has B as its parent and B has A as its parent. Regular znodes can have children; clients manipulate them by creating and deleting them explicitly.
Ephemeral: cannot have children. Clients create them and can delete them; the system removes them once the session is terminated.
Clients may also create sequential znodes: each created znode obtains the value of a monotonically increasing counter appended to its name, so for /A/B/C, C's counter is greater than the counter of any sibling created under /A/B before C.

Watches
When a client issues a read operation with a watch flag set, it gets notified whenever the data it read is changed. Watch flags are one-time triggers: they expire once triggered, or when the session ends.
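A sketch of how a client might create these znode types with the Java client (the /app1 and p_ names anticipate the group-membership example later; error handling is omitted):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeExamples {
        static String joinGroup(ZooKeeper zk, String memberId) throws Exception {
            // Regular znode: persists until explicitly deleted and may have children.
            if (zk.exists("/app1", false) == null) {
                zk.create("/app1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            // Ephemeral + sequential child: the server appends a monotonically increasing
            // counter to the name and deletes the znode when this client's session ends.
            return zk.create("/app1/p_", memberId.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);
        }

        static byte[] readWithWatch(ZooKeeper zk, String path) throws Exception {
            // watch=true is a one-time trigger: one notification on change, then re-register.
            return zk.getData(path, true, null);
        }
    }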
Data model
Essentially a file system with a simplified API and only full data reads and writes, or a key/value table with hierarchical keys. Unlike files in file systems, znodes are not designed for general data storage; znodes map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes.

API: Create
create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag.

API: Updates
delete(path, version): Deletes the znode path if that znode is at the expected version.
setData(path, data, version): Writes data[] to znode path if the version number is the current version of the znode.
Note that the use of version enables the implementation of conditional updates: if the actual version number of the znode does not match the expected version number, the update fails with an unexpected version error. If the version number is -1, no version checking is performed.
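The version argument supports optimistic concurrency. A minimal sketch of a conditional update (assuming the Apache ZooKeeper Java client and a znode that stores an integer as a string):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConditionalUpdate {
        // Read-modify-write guarded by the znode version.
        static void increment(ZooKeeper zk, String path) throws Exception {
            while (true) {
                Stat stat = new Stat();
                int value = Integer.parseInt(new String(zk.getData(path, false, stat)));
                try {
                    // Fails if another client changed the znode since our read.
                    zk.setData(path, Integer.toString(value + 1).getBytes(), stat.getVersion());
                    return;
                } catch (KeeperException.BadVersionException e) {
                    // Lost the race; re-read and retry with the fresh version.
                }
            }
        }
    }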
API: Watches
exists(path, watch): Returns true if the znode with path name path exists, and returns false otherwise. The watch flag enables a client to set a watch on the znode.
getData(path, watch): Returns the data and meta-data, such as version information, associated with the znode. The watch flag works in the same way as it does for exists(), except that ZooKeeper does not set the watch if the znode does not exist.
getChildren(path, watch): Returns the set of names of the children of a znode.

API: Sync
sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to (a small usage sketch follows after the disclaimer below).

Use cases inside of Yahoo!
Leader election, group membership, work queues, configuration management, cluster management, load balancing, sharding.

Disclaimer
The ZooKeeper service knows nothing about these more powerful primitives, since they are entirely implemented at the client using the ZooKeeper client API.
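Before moving to the examples, a small sketch of getChildren() and sync() with the Java client (the callback here only logs completion):

    import java.util.List;
    import org.apache.zookeeper.AsyncCallback;
    import org.apache.zookeeper.ZooKeeper;

    public class ReadExamples {
        // List children and leave a one-time watch to hear about additions/removals.
        static List<String> childrenWithWatch(ZooKeeper zk, String path) throws Exception {
            return zk.getChildren(path, true);
        }

        // sync() asks the connected server to catch up with pending updates, so a read
        // issued afterwards sees at least everything written before the sync.
        static void syncThenRead(final ZooKeeper zk, final String path) {
            zk.sync(path, new AsyncCallback.VoidCallback() {
                public void processResult(int rc, String p, Object ctx) {
                    System.out.println("sync completed for " + p + " rc=" + rc);
                }
            }, null);
        }
    }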
Let us give an example: Group membership
Application1 implements a (simple) group membership: each client process p_i creates a znode p_i under /app1, which persists as long as the process is running.

Example 2: Work queues
(znode tree in the figure: /tasks contains task-1, task-2, task-3; /machines contains m-1; /machines/m-1 contains task-1)
Monitoring process:
1. Watch /tasks for published tasks.
2. Pick tasks from /tasks on watch trigger.
3. Assign each task to a machine-specific queue by creating /machines/m-${i}/task-${j}.
4. Watch for deletion of tasks (maps to task completion).
Machine process:
1. Each machine watches /machines/m-${i} for any creation of tasks.
2. After executing task-${j}, delete task-${j} from /tasks and /machines/m-${i}.
(A rough sketch of the monitoring process is given below.)

ZooKeepers: keep order, are reliable, are efficient, avoid contention, are timely, and are ambition free.
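A rough sketch of the monitoring process in the work-queue recipe (Java client; path names follow the example above; task-payload handling, machine selection, and error handling are simplified placeholders):

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TaskMonitor implements Watcher {
        private final ZooKeeper zk;

        TaskMonitor(ZooKeeper zk) { this.zk = zk; }

        // Steps 1-2: watch /tasks and, on trigger, pick up the published tasks.
        void run() throws Exception {
            List<String> tasks = zk.getChildren("/tasks", this); // re-registers the one-time watch
            for (String task : tasks) {
                assign(task, pickMachine());
            }
        }

        // Step 3: copy the task into a machine-specific queue.
        void assign(String task, String machine) throws Exception {
            byte[] payload = zk.getData("/tasks/" + task, false, null);
            zk.create("/machines/" + machine + "/" + task, payload,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        String pickMachine() { return "m-1"; } // placeholder assignment policy

        public void process(WatchedEvent event) {
            try { run(); } catch (Exception e) { /* log and retry in a real implementation */ }
        }
    }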
Distributed Logging

What is the problem?
Example: Facebook has a lot of servers distributed worldwide. The servers produce more than 1 million logging messages/second. We need to analyze them together, maybe using MapReduce. How do you send them to the MapReduce cluster?

Distributed logging systems
Apache Kafka (LinkedIn), Apache Flume (Cloudera and others), Chukwa (UCB), Scribe (Facebook, now less used), and more.

Kafka
"A distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency." (from the paper)

Logged data
User activity events corresponding to logins, page views, clicks, likes, sharing, comments, and search queries.
Operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine.
Usage of logged data (LinkedIn)
Search relevance; recommendations driven by item popularity or co-occurrence in the activity stream; ads; security applications (abusive behavior, e.g., spam); newsfeed features that aggregate user status updates or actions for their friends or connections to read.

Challenge
Log data is larger than the real data: it is not just what you click, it is also what you did not click. In 2009, Facebook collected (on average) 6 TB of log data per day.

Kafka architecture
Written in Scala. A stream of messages of a particular type is defined by a topic. Producers publish messages to a topic; published messages are stored at servers called brokers; consumers subscribe to topics and consume the messages from the brokers.
Sample producer code:
producer = new Producer(...);
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);

Sample consumer code:
streams[] = Consumer.createMessageStreams("topic1", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}

Message streams
Unlike traditional iterators, the message stream iterator never terminates. If there are currently no more messages, it blocks until new messages are published to the topic.
Kafka supports both the point-to-point delivery model (multiple consumers jointly consume a single copy of all messages in a topic) and the publish/subscribe model (multiple consumers each retrieve their own copy of a topic).

Load balancing
Divide a topic into partitions; each broker stores one or more of those partitions.
Simple storage
One partition == one (logical) log. One (logical) log == a set of segment files of approximately the same size. One segment file is open for writing per partition, and new messages are appended to that file. Segment files are flushed to disk only after a configurable number of messages have been published or a certain amount of time has elapsed. A message is only exposed to the consumers after it is flushed.
Messages are addressed by their offset in the log; there is no special message id. Message offset + message length = next message offset.

Message consumption
A consumer always consumes messages from a particular partition sequentially. Brokers keep a sorted list of offsets, including the offset of the first message in every segment file. When a consumer acknowledges a particular message offset, it implies that it has received all messages prior to that offset in the partition.
Under the covers, the consumer issues asynchronous pull requests to the broker to have a buffer of data ready for the application to consume. Each pull request contains the offset of the message from which the consumption begins and an acceptable number of bytes to fetch. After a consumer receives a message, it computes the offset of the next message to consume and uses it in the next pull request.
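A toy sketch of the offset arithmetic (the fixed 4-byte length header here is an assumption for illustration; the real on-disk format has more fields):

    public class OffsetMath {
        static final int LENGTH_HEADER_BYTES = 4; // assumed per-message header size (illustrative)

        // Next offset = current offset + bytes occupied by the current message.
        static long nextOffset(long currentOffset, int payloadLength) {
            return currentOffset + LENGTH_HEADER_BYTES + payloadLength;
        }

        public static void main(String[] args) {
            long offset = 0;
            for (int payloadLength : new int[] {100, 250, 80}) {
                System.out.println("message at offset " + offset);
                offset = nextOffset(offset, payloadLength);
            }
            // A consumer would send this running offset in its next pull request.
        }
    }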
Stateless broker
The broker does not keep track of who consumed what; it is the consumers who keep track of what they have consumed. But then how do you delete something if you are not sure that all consumers have already used it?
Retention policy: your message is kept safe and sound for X time units (typically 7 days). Most consumers consume their messages daily, hourly, or in real time.
Does performance degrade with a larger stored data size? No, since you consume using offsets, and files are kept within limits, e.g., 1 GB.
A consumer can deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but proves to be an essential feature for many consumers. For example, when there is an error in application logic in the consumer, the application can re-play certain messages after the error is fixed.

Distributed coordination
Consumer groups: consumers interested in the same topic(s). No coordination is needed between consumer groups.
Decision 1: make a partition within a topic the smallest unit of parallelism. All messages from one partition are consumed only by a single consumer within each consumer group. No locking and no state-maintenance overhead. For the load to be truly balanced, many more partitions are needed in a topic than there are consumers in each group; achieve this by over-partitioning a topic (see the assignment sketch below).
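To see why over-partitioning helps, here is a sketch of a simple range-style assignment of partitions to the consumers of one group (the policy is illustrative, not necessarily Kafka's exact rebalance algorithm):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RangeAssignment {
        // Split partitions 0..numPartitions-1 among consumers: each partition gets exactly
        // one owner within the group, and a consumer may own several partitions.
        static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
            Map<String, List<Integer>> owned = new HashMap<>();
            int perConsumer = numPartitions / consumers.size();
            int extra = numPartitions % consumers.size();
            int next = 0;
            for (int i = 0; i < consumers.size(); i++) {
                int count = perConsumer + (i < extra ? 1 : 0);
                List<Integer> mine = new ArrayList<>();
                for (int j = 0; j < count; j++) {
                    mine.add(next++);
                }
                owned.put(consumers.get(i), mine);
            }
            return owned;
        }

        public static void main(String[] args) {
            // With many more partitions than consumers the load stays roughly balanced.
            System.out.println(assign(List.of("c1", "c2", "c3"), 10));
        }
    }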
Distributed coordination
Decision 2: no master (central) node, so no worries about master failures. Use ZooKeeper to facilitate the coordination.

ZooKeeper usage in Kafka
Detect the addition and removal of brokers and consumers. Trigger a rebalance process in each consumer when a new broker or consumer is added. Maintain the consumption relationship and keep track of the consumed offset of each partition.
When a broker or consumer starts up, it stores its information in a broker or consumer registry in ZooKeeper. The broker registry contains the broker's host name and port, and the set of topics and partitions stored on it. The consumer registry includes the consumer group to which a consumer belongs and the set of topics that it subscribes to.
Each consumer group is associated with an ownership registry and an offset registry in ZooKeeper. The ownership registry has one path for every subscribed partition, and the path value is the id of the consumer currently consuming from this partition. The offset registry stores, for each subscribed partition, the offset of the last consumed message in the partition.
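A rough sketch of the registration step using the ZooKeeper API from the first part of the lecture (the path layout and the ephemeral/persistent choices are assumptions for illustration; Kafka's real layout differs in detail, and parent znodes are assumed to exist):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class KafkaStyleRegistry {
        // A consumer registers itself, claims a partition, and records its consumed offset.
        static void registerConsumer(ZooKeeper zk, String group, String consumerId,
                                     String topic, int partition, long offset) throws Exception {
            // Consumer registry: ephemeral, so the entry disappears when the session ends,
            // which lets the remaining consumers detect the failure and rebalance.
            zk.create("/consumers/" + group + "/ids/" + consumerId, topic.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Ownership registry: one path per subscribed partition, value = owning consumer id.
            zk.create("/consumers/" + group + "/owners/" + topic + "/" + partition,
                    consumerId.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Offset registry: persistent, so the last consumed offset survives restarts.
            zk.create("/consumers/" + group + "/offsets/" + topic + "/" + partition,
                    Long.toString(offset).getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
        }
    }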
Kafka usage at LinkedIn
One Kafka cluster is co-located with each datacenter; the frontend services generate various kinds of log data and publish it to the local Kafka brokers in batches. Another deployment runs in an analysis datacenter. Kafka accumulates hundreds of gigabytes of data and close to a billion messages per day (2011).

Questions?