STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day



Similar documents
I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

Putting Apache Kafka to Use!

Architectures for massive data management

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Apache Kafka Your Event Stream Processing Solution

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Using Kafka to Optimize Data Movement and System Integration. Alex

Data Pipeline with Kafka

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Kafka: a Distributed Messaging System for Log Processing

Introduction to Apache Kafka And Real-Time ETL. for Oracle DBAs and Data Analysts

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Kafka & Redis for Big Data Solutions

Hadoop & Spark Using Amazon EMR

NoSQL Data Base Basics

Chapter 7. Using Hadoop Cluster and MapReduce

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran)

Getting Real Real Time Data Integration Patterns and Architectures

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Apache Hadoop. Alexandru Costan

Real-time Big Data Analytics with Storm

Big Data Analytics - Accelerated. stream-horizon.com

WSO2 Message Broker. Scalable persistent Messaging System

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

YARN Apache Hadoop Next Generation Compute Platform

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Unified Big Data Processing with Apache Spark. Matei

HDMQ :Towards In-Order and Exactly-Once Delivery using Hierarchical Distributed Message Queues. Dharmit Patel Faraj Khasib Shiva Srivastava

Hadoop Ecosystem B Y R A H I M A.

Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

CSE-E5430 Scalable Cloud Computing Lecture 2

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Amazon EC2 Product Details Page 1 of 5

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Design and Evolution of the Apache Hadoop File System(HDFS)

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

THE HADOOP DISTRIBUTED FILE SYSTEM

Extending Hadoop beyond MapReduce

BIG DATA ANALYTICS For REAL TIME SYSTEM

the missing log collector Treasure Data, Inc. Muga Nishizawa

JoramMQ, a distributed MQTT broker for the Internet of Things

[Hadoop, Storm and Couchbase: Faster Big Data]

HDFS: Hadoop Distributed File System

Wisdom from Crowds of Machines

Scaling Out With Apache Spark. DTL Meeting Slides based on

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

MapReduce with Apache Hadoop Analysing Big Data

Apache Hama Design Document v0.6

Unified Batch & Stream Processing Platform

Streaming items through a cluster with Spark Streaming

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Large scale processing using Hadoop. Ján Vaňo

BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO

Real-time Data Replication

Apache HBase. Crazy dances on the elephant back

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Hadoop: Embracing future hardware

Using distributed technologies to analyze Big Data

I/O Considerations in Big Data Analytics

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

NOT IN KANSAS ANY MORE

How To Use Big Data For Telco (For A Telco)

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Improve performance and availability of Banking Portal with HADOOP

Big Data Technology Core Hadoop: HDFS-YARN Internals

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Big Data Trends and HDFS Evolution

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Scalable Architecture on Amazon AWS Cloud

GigaSpaces Real-Time Analytics for Big Data

NextGen Infrastructure for Big DATA Analytics.

Removing Failure Points and Increasing Scalability for the Engine that Drives webmd.com

HiBench Introduction. Carson Wang Software & Services Group

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Dominik Wagenknecht Accenture

Introducing Storm 1 Core Storm concepts Topology design

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Hadoop IST 734 SS CHUNG

Real Time Analytics for Big Data. NtiSh Nati

Hadoop & its Usage at Facebook

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Transcription:

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day

Neha Narkhede Co-founder and Head of Engineering @ Stealth Startup Prior to this Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza) One of the initial authors of Apache Kafka, committer and PMC member Reach out at @nehanarkhede

Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

The Data Needs Pyramid Self actualization Automation Esteem Understanding Love/Belonging Safety Physiological Data processing Data collection Maslow's hierarchy of needs Data needs

Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

Increase in diversity of data 1980+ Database data (users, products, orders etc) 2000+ Events (clicks, impressions, pageviews) Application logs (errors, service calls) Application metrics (CPU usage, requests/sec) Siloed data feeds 2010+ IoT sensors

Explosion in diversity of systems Live Systems Voldemort Espresso GraphDB Search Samza Batch Hadoop Teradata

Data integration disaster Espresso Espresso Espresso Voldemort Voldemort Voldemort Oracle Oracle Oracle User Tracking Logs Operational Metrics Hadoop Log Search Monitoring Data Warehous e Social Graph Rec. Engine Search Security... Email Production Services

Centralized service Espresso Espresso Espresso Voldemort Voldemort Voldemort Oracle Oracle Oracle User Tracking Logs Operational Metrics Data Pipeline Hadoop Log Search Monitorin g Data Warehous e Social Graph Rec Engine & Life Search Security... Email Production Services

Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

Kafka at 10,000 ft Producer Producer Producer Producer Cluster of brokers Producer Producer Distributed from ground up Persistent Multi-subscriber Producer Consumer Producer Consumer Producer Consumer

Key design principles Scalability of a file system Hundreds of MB/sec/server throughput Many TBs per server Guarantees of a database Messages strictly ordered All data persistent Distributed by default Replication model Partitioning model

Kafka adoption

Apache Kafka @ LinkedIn 175 TB of in-flight log data per colo Low-latency: ~1.5ms Replicated to each datacenter Tens of thousands of data producers Thousands of consumers 7 million messages written/sec 35 million messages read/sec Hadoop integration

Logs The data structure every systems engineer should know

The Log 1st record next record written Ordered 0 1 2 3 4 5 6 7 8 9 10 11 12 Append only Immutable

The Log: Partitioning Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12 Partition 1 0 1 2 3 4 5 6 7 8 9 Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Logs: pub/sub done right Data source writes 0 1 2 3 4 5 6 7 8 9 10 11 12 reads reads Destination system A (time = 7) Destination system B (time = 11)

Logs for data integration User updates profile with new job KAFKA Newsfeed Search Hadoop Standardization engine

Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

Stream processing = f(log) Log A Job 1 Log B

Stream processing = f(log) Log A Job 1 Log B Log C Job 2 Log D Log E

Apache Samza at LinkedIn User updates profile with new job KAFKA Newsfeed Search Hadoop Standardization engine

Latency spectrum of data systems RPC Latency Asynchronous processing (seconds to minutes) Synchronous (milliseconds) Batch (Hours)

Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

Samza API getkey(), getmsg() public interface StreamTask { void process (IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator); } sendmsg(topic, key, value) commit(), shutdown()

Samza Architecture (Logical view) Log A partition 0 partition 1 partition 2 Task 1 Task 2 Task 3 partition 0 partition 1 Log B

Samza Architecture (Logical view) Log A partition 0 partition 1 partition 2 Samza container 1 Samza container 2 Task 1 Task 2 Task 3 partition 0 partition 1 Log B

Samza Architecture (Physical view) Samza container 1 Samza container 2 Host 1 Host 2

Samza Architecture (Physical view) Node manager Node manager Samza container 1 Samza container 2 Samza YARN AM Host 1 Host 2

Samza Architecture (Physical view) Node manager Node manager Samza container 1 Samza container 2 Samza YARN AM Kafka Kafka Host 1 Host 2

Samza Architecture: Equivalence to Map Reduce Node manager Node manager Map Reduce Map Reduce YARN AM HDFS HDFS Host 1 Host 2

M/R Operation Primitives Filter records matching some condition Map record = f(record) Join Two/more datasets by key Group records with same key Aggregate f(records within the same group) Pipe job 1 s output => job 2 s input

M/R Operation Primitives on streams Filter records matching some condition Map record = f(record) Join Two/more datasets by key Group records with same key Aggregate f(records within the same group) Pipe job 1 s output => job 2 s input Requires state maintenance

Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

Example: Newsfeed User... posted "..." User 989 posted "Blah Blah" User 567 posted "Hello World" Status update log External connection DB Fan out messages to followers 567 -> [123, 679, 789,...] 999 -> [156, 343,... ] Push notification log Refresh user 123's newsfeed Refresh user 679's newsfeed Refresh user...'s newsfeed

Local state vs Remote state: Remote 100-500K msg/sec/node 100-500K msg/sec/node Samza task partition 0 Samza task partition 1 Performance Isolation Limited APIs Disk Remote state 1-5K queries/sec?? ex: Cassandra, MongoDB, etc

Local state: Bring data closer to computation Samza task partition 0 Samza task partition 1 Local Local LevelDB/RocksDB LevelDB/RocksDB

Local state: Bring data closer to computation Samza task partition 0 Samza task partition 1 Local Local LevelDB/RocksDB LevelDB/RocksDB Disk Change log stream

Example Revisited: Newsfeed User... posted "..." User 989 posted "Blah Blah" User 567 posted "Hello World" Status update log User... followed... User 123 followed 567 User 890 followed 234 New connection log Fan out messages to followers 567 -> [123, 679, 789,...] 999 -> [156, 343,... ] Push notification log Refresh user 123's newsfeed Refresh user 679's newsfeed Refresh user...'s newsfeed

Fault tolerance? Node manager Node manager Samza container 1 Samza container 2 Samza YARN AM Kafka Kafka Host 1 Host 2

Fault tolerance in Samza Samza task partition 0 Samza task partition 1 Local Local LevelDB/RocksDB LevelDB/RocksDB Durable change log

Slow jobs Log A Job 1 Log B Log C Drop data Backpressure Job 2 Queue In memory On disk (KAFKA) Log D Log E

Summary Real time data integration is crucial for the success and adoption of stream processing Logs form the basis for real time data integration Stream processing = f(logs) Samza is designed from ground-up for scalability and provides fault-tolerant, persistent state

Thank you! The Log http://bit.ly/the_log Apache Kafka http://kafka.apache.org Apache Samza http://samza.incubator.apache.org Me @nehanarkhede http://www.linkedin.com/in/nehanarkhede