
Cloud (data) management. Ahmed Ali-Eldin.

Agenda. First part: ZooKeeper (Yahoo!), a highly available, scalable, distributed service for configuration, consensus, group membership, leader election, naming, and coordination. Second part: Kafka (LinkedIn), a distributed logging service that uses ZooKeeper.

Some examples from Yahoo!: Crawling. Fetch pages from the web; a rough estimate is 200 billion documents (200 x 10^9). With a single server and 1 s to fetch each page, that is 200 billion seconds; even fetching 100 pages in parallel it is still 2 billion seconds, about 63 years. There are more complications: pages are removed, pages change their content, and the crawler has to be polite (e.g., honor the crawl-delay directive).

Some examples from Yahoo!: Hadoop. Large-scale data processing with Map-Reduce on large clusters of compute nodes, on the order of thousands of computers (Yahoo!: 13,000+). Jobs distribute the computation across nodes; Yahoo! runs hundreds of thousands of jobs a month. An example is WebMap: roughly 1 trillion links between pages in the index, over 300 TB of output (compressed!), over 10,000 cores used to run a single Map-Reduce job, and over 5 petabytes of raw disk used in the production cluster.

Distributed systems. A large number of processes run on heterogeneous hardware and communicate using messages. Such systems are often asynchronous: a step may take an unbounded amount of time to execute and messages may be delayed arbitrarily, which makes it difficult to determine whether a process has failed or is just slow.

DS suck. Distributed algorithms are not trivial to understand and implement, debugging is more than painful, the same functionality is implemented over and over again, it takes forever to build anything and get it to work, and edge cases are very hard to handle.

So let us make things simpler first. We use the Cloud, which is even more distributed, and build a coordination service to coordinate the processes of distributed applications; programmers should be able to use it easily. Therefore: ZooKeeper (Yahoo!), which is used widely in the industry, e.g., by Kafka (LinkedIn), Storm (Twitter), Hadoop (Cloudera), and Haystack (Facebook).

Provide an API rather than a service. One approach is to provide certain primitives; they decided not to, but rather to expose an API that enables application developers to implement their own primitives. They implemented a coordination kernel, which allows developers to specify new primitives without requiring changes to the service core, and decided against providing any blocking primitives. Observation: programmers suck at using locks and tend to create distributed deadlocks, but they use (shared) file systems well, which also have locks. ZooKeeper hence implements an API that manipulates simple wait-free data objects organized hierarchically as in file systems, and provides guarantees (and some recipes).

Guarantees:
Sequential consistency - updates from a client will be applied in the order in which they were sent.
Atomicity - updates either succeed or fail; there are no partial results.
Single system image - a client will see the same view of the service regardless of the server it connects to.
Reliability - once an update has been applied, it persists from that time forward until a client overwrites it.
Timeliness - the clients' view of the system is guaranteed to be up-to-date within a certain bound: either system changes are seen by a client within this bound, or the client detects a service outage.

Terminology:
Server - a process providing the ZooKeeper service.
znode - an in-memory data node in the ZooKeeper data, which is organized in a hierarchical namespace referred to as the data tree.
Update/write - any operation that modifies the state of the data tree.
Session - established when a client connects to ZooKeeper.
Session handle - obtained on connection, used to issue requests.
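
To make the session and session-handle terminology concrete, here is a minimal sketch using the standard ZooKeeper Java client; the connect string localhost:2181 and the timeout are illustrative assumptions, not values from the slides.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import java.util.concurrent.CountDownLatch;

    public class SessionExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // The ZooKeeper object is the session handle: all requests are issued through it.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // Session events (connected, expired, ...) arrive here.
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();               // wait until the session is established
            System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
            zk.close();                      // terminating the session removes its ephemeral znodes
        }
    }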

znodes. Organized according to a hierarchical name space, following the UNIX notation: in /A/B/C, znode C has B as its parent and B has A as its parent. There are two types of znodes. Regular znodes can have children; clients manipulate them by creating and deleting them explicitly. Ephemeral znodes cannot have children; clients create them and can delete them, and the system removes them once the session that created them terminates.

Clients may also create sequential znodes: each created znode obtains the value of a monotonically increasing counter appended to its name. For /A/B/C, C's counter is greater than B's counter and greater than the counters of any siblings created before C.

Watches. When a client issues a read operation with the watch flag set, it gets notified whenever the data it read changes. Watch flags are one-time triggers: a watch expires once it has been triggered or when the session ends.
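
A small sketch of the three znode flavours using the ZooKeeper Java client; the paths (/app1, /app1/p_1, /app1/task-) follow the naming used later in the examples, and the payloads are made up.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeExamples {
        // zk is an already-connected session handle (see the previous sketch).
        static void createZnodes(ZooKeeper zk) throws Exception {
            // Regular znode: persists until deleted explicitly, may have children.
            zk.create("/app1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Ephemeral znode: removed automatically when this session terminates, no children.
            zk.create("/app1/p_1", "host:port".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // Sequential znode: the server appends a monotonically increasing counter,
            // so the name actually created looks like /app1/task-0000000007.
            String name = zk.create("/app1/task-", new byte[0],
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            System.out.println("created " + name);
        }
    }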

Data model. ZooKeeper's data model is essentially a file system with a simplified API and only full data reads and writes, or a key/value table with hierarchical keys. Unlike files in file systems, znodes are not designed for general data storage; they map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes.

API: create.
create(path, data, flags): creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag.

API: updates.
delete(path, version): deletes the znode path if that znode is at the expected version.
setData(path, data, version): writes data[] to znode path if the version number is the current version of the znode. Note that the use of version enables the implementation of conditional updates: if the actual version number of the znode does not match the expected version number, the update fails with an unexpected version error. If the version number is -1, no version check is performed.
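
A sketch of how the version argument enables conditional updates, assuming an already-connected handle and a znode that stores a number as text (both assumptions for illustration):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConditionalUpdate {
        // Read the current version, then write only if nobody changed the znode meanwhile.
        // Retrying on a version conflict gives a simple compare-and-set over a znode.
        static void incrementCounter(ZooKeeper zk, String path) throws Exception {
            while (true) {
                Stat stat = new Stat();
                byte[] raw = zk.getData(path, false, stat);   // returns data and fills in the version
                long value = Long.parseLong(new String(raw)) + 1;
                try {
                    // Succeeds only if the znode is still at stat.getVersion();
                    // passing -1 instead would skip the version check entirely.
                    zk.setData(path, Long.toString(value).getBytes(), stat.getVersion());
                    return;
                } catch (KeeperException.BadVersionException e) {
                    // Another client updated the znode concurrently; read again and retry.
                }
            }
        }
    }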

API: reads and watches.
exists(path, watch): returns true if the znode with path name path exists, and false otherwise. The watch flag enables a client to set a watch on the znode.
getData(path, watch): returns the data and meta-data, such as version information, associated with the znode. The watch flag works in the same way as for exists(), except that ZooKeeper does not set the watch if the znode does not exist.
getChildren(path, watch): returns the set of names of the children of a znode.

API: sync.
sync(path): waits for all updates pending at the start of the operation to propagate to the server that the client is connected to.

Use cases inside of Yahoo!: leader election, group membership, work queues, configuration management, cluster management, load balancing, sharding.

Disclaimer: the ZooKeeper service knows nothing about these more powerful primitives, since they are entirely implemented at the client using the ZooKeeper client API.
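
As an illustration of the leader-election use case listed above, here is one common recipe built purely from the client API; the /election path and the candidate- prefix are hypothetical names, not taken from the slides.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import java.util.Collections;
    import java.util.List;

    public class LeaderElection {
        // Every candidate creates an ephemeral sequential znode under /election; the owner
        // of the znode with the smallest counter is the leader. If the leader's session dies,
        // its znode disappears and the remaining candidates can re-check who is first.
        static String enterElection(ZooKeeper zk) throws Exception {
            String path = zk.create("/election/candidate-", new byte[0],
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            return path.substring(path.lastIndexOf('/') + 1);   // keep just the znode name
        }

        static boolean isLeader(ZooKeeper zk, String myZnodeName) throws Exception {
            List<String> candidates = zk.getChildren("/election", false);
            Collections.sort(candidates);                        // smallest sequence number first
            return candidates.get(0).equals(myZnodeName);
        }
    }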

Let us give an example: group membership. Application1 implements a (simple) group membership: each client process p_i creates a znode p_i under /app1, which persists as long as the process is running.

Example 2: work queues. Monitoring process: 1. watch /tasks for published tasks; 2. pick tasks from /tasks when the watch triggers; 3. assign each task to a machine-specific queue by creating /machines/m-${i}/task-${j}; 4. watch for the deletion of tasks, which maps to task completion. Machine process: 1. each machine watches /machines/m-${i} for the creation of tasks; 2. after executing task-${i}, delete task-${i} from /tasks and from /machines/m-${i}. (Figure: a data tree with /tasks/task-1, task-2, task-3 and /machines/m-1/task-1.)

ZooKeepers keep order, are reliable, are efficient, avoid contention, are timely, and are ambition free.

Distributed Logging.
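
The group membership example translates almost directly into client code; this sketch assumes a connected handle and uses the /app1 path from the slide:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import java.util.List;

    public class GroupMembership {
        // Each process registers an ephemeral znode under /app1; listing /app1's children
        // gives the current membership, and a crashed process disappears automatically
        // when its session expires.
        static void joinGroup(ZooKeeper zk, String processId) throws Exception {
            zk.create("/app1/" + processId, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        static List<String> listMembers(ZooKeeper zk) throws Exception {
            // Passing watch=true would additionally notify us on the next membership change.
            return zk.getChildren("/app1", false);
        }
    }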

What is the problem? Example: Facebook has a lot of servers distributed worldwide, and those servers produce more than 1 million logging messages per second. We need to analyze them together, maybe using MapReduce. How do you send them to the MapReduce cluster?

Distributed logging systems: Apache Kafka (LinkedIn), Apache Flume (Cloudera and others), Chukwa (UCB), Scribe (Facebook, now less used); there are more systems.

Kafka: "a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency" (from the paper).

Logged data: user activity events corresponding to logins, page views, clicks, likes, sharing, comments, and search queries; operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine.

Usage of logged data (LinkedIn): search relevance; recommendations driven by item popularity or co-occurrence in the activity stream; ads; security applications such as detecting abusive behavior, e.g., spam; newsfeed features that aggregate user status updates or actions for their friends or connections to read.

Challenge: log data is larger than the real data. It is not just what you click, it is also what you did not click. In 2009, Facebook collected (on average) 6 TB of log data per day.

Kafka architecture. Kafka is written in Scala. A stream of messages of a particular type is defined by a topic. Producers publish messages, published messages are stored at brokers, and consumers consume the messages.

Sample producer code (from the paper):

    producer = new Producer(...);
    message = new Message("test message str".getBytes());
    set = new MessageSet(message);
    producer.send("topic1", set);

Sample consumer code (from the paper):

    streams[] = Consumer.createMessageStreams("topic1", 1);
    for (message : streams[0]) {
        bytes = message.payload();
        // do something with the bytes
    }

Message streams. Unlike traditional iterators, the message stream iterator never terminates: if there are currently no more messages, it blocks until new messages are published to the topic. Kafka supports both the point-to-point delivery model, in which multiple consumers jointly consume a single copy of all messages in a topic, and the publish/subscribe model, in which multiple consumers each retrieve their own copy of a topic.

Load balancing. A topic is divided into partitions, and each broker stores one or more of those partitions.

Simple storage. One partition corresponds to one (logical) log, and one (logical) log is a set of segment files of approximately the same size. There is one segment file open for writing per partition, and new messages are appended to that file. The broker flushes the segment files to disk only after a configurable number of messages have been published or a certain amount of time has elapsed; a message is only exposed to the consumers after it is flushed. Messages are addressed by their offset in the log; there is no special message id. Message id + message length = the id of the next message.

Message consumption. A consumer always consumes messages from a particular partition sequentially. Brokers keep a sorted list of offsets, including the offset of the first message in every segment file. When a consumer acknowledges a particular message offset, it implies that it has received all messages prior to that offset in the partition. Under the covers, the consumer issues asynchronous pull requests to the broker in order to have a buffer of data ready for the application to consume. Each pull request contains the offset of the message from which consumption begins and an acceptable number of bytes to fetch. After a consumer receives a message, it computes the offset of the next message to consume and uses it in the next pull request.
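
A toy sketch of the offset arithmetic, not the real Kafka on-disk format (which also carries checksums and other header fields); it assumes each message is stored as a 4-byte length followed by the payload, so that nextId = id + 4 + payloadLength:

    import java.nio.ByteBuffer;

    public class OffsetScan {
        // Walk a buffer of length-prefixed messages, computing each next message id
        // exactly as a consumer would when preparing its next pull request.
        static void scan(ByteBuffer log, long baseOffset) {
            long offset = baseOffset;
            while (log.remaining() >= 4) {
                int length = log.getInt();                   // message length header
                byte[] payload = new byte[length];
                log.get(payload);                            // message payload
                System.out.println("message at offset " + offset + ", " + length + " bytes");
                offset += 4 + length;                        // id of the next message
            }
        }
    }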

Stateless broker. The broker does not keep track of who consumed what; it is the consumers who keep track of what they have consumed. But then how do you delete something if you are not sure that all consumers have already used it? Retention policy: your message is kept safe and sound for X time units (typically 7 days), and most consumers consume their messages daily, hourly, or in real time. Does performance degrade as the stored data grows? No, since you consume using offsets and the files are kept within limits, e.g., 1 GB. A consumer can also deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but proves to be an essential feature for many consumers: for example, when there is an error in the application logic of a consumer, the application can re-play certain messages after the error is fixed.

Distributed coordination. Consumers are organized into consumer groups, each interested in the same topic(s); no coordination is needed between consumer groups. Decision 1: a partition within a topic is the smallest unit of parallelism, so all messages from one partition are consumed only by a single consumer within each consumer group. This avoids locking and state-maintenance overhead. For the load to be truly balanced, many more partitions are needed in a topic than there are consumers in each group; this is achieved by over-partitioning a topic.
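
The rewind-and-replay feature can be illustrated with today's Kafka Java consumer (the paper itself predates this client API); the broker address, topic name, partition, and offset below are made-up values:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class RewindExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "replay-after-bugfix");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("topic1", 0);
                consumer.assign(Collections.singletonList(tp));
                consumer.seek(tp, 0L);                       // rewind: re-consume from the start
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    System.out.println("offset " + r.offset() + ": " + r.value().length + " bytes");
                }
            }
        }
    }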

Distributed coordination. Decision 2: there is no master (central) node, so there are no worries about master failures; ZooKeeper is used to facilitate the coordination.

ZooKeeper usage in Kafka: detecting the addition and removal of brokers and consumers; triggering a rebalance process in each consumer when a new broker or consumer is added; maintaining the consumption relationship and keeping track of the consumed offset of each partition. When a broker or consumer starts up, it stores its information in a broker or consumer registry in ZooKeeper. The broker registry contains the broker's host name and port and the set of topics and partitions stored on it. The consumer registry includes the consumer group to which the consumer belongs and the set of topics that it subscribes to. Each consumer group is associated with an ownership registry and an offset registry in ZooKeeper: the ownership registry has one path for every subscribed partition, and the path value is the id of the consumer currently consuming from that partition; the offset registry stores, for each subscribed partition, the offset of the last consumed message in the partition.
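
Purely as an illustration of how such registries map onto the ZooKeeper primitives from the first part of the lecture (the actual registry paths and payload formats are Kafka implementation details not given here), a broker could register itself as an ephemeral znode and consumers could watch the parent path:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import java.util.List;

    public class BrokerRegistry {
        // A broker registers itself as an ephemeral znode; if its session dies, the znode
        // vanishes, so broker removal is detected without any central master node.
        static void registerBroker(ZooKeeper zk, int brokerId, String hostPort) throws Exception {
            zk.create("/brokers/ids/" + brokerId, hostPort.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        static List<String> watchBrokers(ZooKeeper zk) throws Exception {
            // A consumer lists the registry with watch=true; a broker addition or removal
            // fires the watch and can trigger the rebalance process described above.
            return zk.getChildren("/brokers/ids", true);
        }
    }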

Kafka usage at LinkedIn. One Kafka cluster is co-located with each datacenter; the frontend services generate various kinds of log data and publish it to the local Kafka brokers in batches. There is another deployment in an analysis center. Kafka accumulates hundreds of gigabytes of data and close to a billion messages per day (2011).

Questions?