Putting Apache Kafka to Use


Putting Apache Kafka to Use: Building a Real-time Data Platform for Event Streams
Jay Kreps, Confluent

A Couple of Themes

Theme 1: Rise of Events

Theme 2: Immutability Everywhere

Level                         Example                        Immutable Alternative
Mutable local state           Counter in a for loop          Functional Programming
Mutable process-wide state    ConcurrentHashMap              Functional Programming
Mutable on disk structures    B-Tree                         LSM
Distributed systems           Dynamo-like key-value store    State machine replication
Mutability in databases       RDBMS                          Event Sourcing
Company-wide data flow        Double write                   Kafka
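The "Event Sourcing" entry on this slide can be illustrated with a minimal sketch. The `Event` type, the account names, and the `apply`/`replay` helpers are hypothetical and not from the talk; the point is only that state is derived from an immutable history rather than updated in place.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Tuple


@dataclass(frozen=True)
class Event:
    """An immutable fact: something that happened (hypothetical example type)."""
    account: str
    delta: int  # positive = deposit, negative = withdrawal


def apply(balances: Dict[str, int], event: Event) -> Dict[str, int]:
    """Return a new balances mapping; the input mapping is never modified."""
    updated = dict(balances)
    updated[event.account] = updated.get(event.account, 0) + event.delta
    return updated


def replay(events: Tuple[Event, ...]) -> Dict[str, int]:
    """Derive state by folding over the event history from the start."""
    return reduce(apply, events, {})


events = (Event("alice", 100), Event("bob", 50), Event("alice", -30))
current = replay(events)              # state after all events
as_of_first_two = replay(events[:2])  # any earlier state can be rebuilt
```

Because the history is never overwritten, earlier states stay recoverable by replaying a prefix of the events, which an in-place update to a row would lose.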

Theme 3: Datacenter-Level Thinking

Experience at LinkedIn

2009: We want all our data in Hadoop!

What is all our data?

Initial approach: gut it out

Problems
- Data coverage
- Many source systems
  - Relational DBs
  - Log files
  - Metrics
  - Messaging systems
- Many data formats
- Constant change
  - New schemas
  - New data sources

Needed: organizational scalability
Θ(N) => Θ(1)

How does everything else work?!

Relational database changes
[Diagram labels: Apps and Services, OLTP Queries, Relational Databases, Cache, Data Guard, CSV Dump, Poll For Changes, ODS, Hadoop, Caches & Derived Stores, Relational Data Warehouse, Transforms]

NoSQL
[Diagram labels: Apps, Key-value Store, ETL, Load, Hadoop]

User events
[Diagram labels: Apps and Services, HTTP, Log Aggregation, NFS, rsync, Load, Transform & Load, Hadoop, Relational Data Warehouse, Transform]

Application Logs
[Diagram labels: Apps and Services, Splunk]

Messaging
[Diagram labels: Apps, Brokers, Processors]

Metrics and operational data
[Diagram labels: Apps, Monitoring]

This is a giant mess
[Diagram: the pipelines from the preceding slides overlaid. Labels: Apps and Services, OLTP Queries, ActiveMQ, HTTP, Monitoring, Splunk, Relational Databases, Cache, Poll For Changes, Data Guard, ODS, CSV Dump, Hadoop, Key-value Store, Load, Log Aggregation, NFS, rsync, Caches & Derived Stores, Relational Data Warehouse, Transforms, Transform & Load]

Impossible ideas
- Publish data from Hadoop to a search index
- Run a SQL query to find the biggest latency bottleneck
- Run a SQL query to find common error patterns
- Low latency monitoring of database changes or user activity
- Incorporate popularity in real-time display and relevance algorithms
- Products that incorporate user activity

An infrastructure solution?

Idea: Stream Data Platform
[Diagram labels: Search, Impala, Apps, Monitoring, Hive, RDBMS, "Stream Data Platform: ?", "Hadoop: Offline Data", DWH, NoSQL, Real-time Analytics, Stream Processing, Spark, MapReduce. Latency tiers: synchronous req/response (0-100s ms), near real time (> 100s ms), offline batch (> 1 hour)]

First Attempt: Messaging systems!

Problems
- Throughput
- Batch systems
- Persistence
- Stream Processing
- Ordering guarantees
- Partitioning

Second Attempt: Build Kafka!

What does it do?
[Diagram: many Producers write to the Kafka Cluster; many Consumers read from it]

Commit Log Abstraction
[Diagram: an append-only log with offsets 0 to 12, ordered old to new; writes append at the new end; Reader 1 and Reader 2 each read at their own position]
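The commit log idea on this slide can be sketched in a few lines. This is an in-memory toy, not Kafka's implementation; the `CommitLog` and `Reader` classes and their method names are invented for illustration.

```python
class CommitLog:
    """Append-only sequence of records, addressed by offset (toy, in-memory)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the 'new' end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; never mutates the log."""
        return self._records[offset:offset + max_records]


class Reader:
    """Each reader tracks its own position; readers do not affect one another."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for i in range(13):  # offsets 0 to 12, as drawn on the slide
    log.append(f"event-{i}")

reader_1 = Reader(log)
reader_2 = Reader(log)
first = reader_1.poll(max_records=5)         # reader 1 is now at offset 5
everything = reader_2.poll(max_records=100)  # reader 2 is now at offset 13
```

Reading does not consume anything: the only per-reader state is an offset, so a slow reader and a fast reader can share the same log.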

Logs & Publish-Subscribe Messaging
[Diagram: a source system writes to a log (offsets 0 to 12); Destination System A and Destination System B each read from it]

A Kafka Topic
[Diagram: three partitions, each an ordered log from old to new; Partition 0 with offsets 0 to 12, Partition 1 with offsets 0 to 9, Partition 2 with offsets 0 to 12; writes append at the new end]
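One way to picture a partitioned topic is a list of independent logs plus a rule that maps each message key to one partition. The sketch below is an assumption-laden toy: the `Topic` class is hypothetical, and CRC32 is used only as a deterministic stand-in hash, not as the hash Kafka clients actually use.

```python
import zlib


class Topic:
    """A topic as a fixed set of independent, ordered partitions (toy model)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash for illustration; the same key always maps
        # to the same partition as long as the partition count is fixed.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append to the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = Topic(num_partitions=3)
for i in range(9):
    topic.send(f"user-{i % 3}", f"click-{i}")

# All messages for one key live in one partition, in the order they were sent.
p = topic.partition_for("user-0")
user_0_clicks = [value for key, value in topic.partitions[p] if key == "user-0"]
```

Ordering holds within a partition, not across the whole topic, which is why keying by something like a user ID keeps that user's events in sequence.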

Replication
[Diagram: Server 1, Server 2, and Server 3 holding replicas of partitions A:0, A:1, and B:0; one Controller]

Scaling Consumers
[Diagram: Kafka cluster with Server 1 and Server 2 hosting partitions P0, P3, P1, P2; consumers C1 to C6 split across Consumer Group A and Consumer Group B]
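The consumer-group picture can be sketched as an assignment problem: within a group, each partition goes to exactly one consumer, and every group independently sees all partitions. The round-robin strategy and the split of C1 to C6 into a two-member and a four-member group are assumptions made for this example, and `assign` is a hypothetical helper.

```python
def assign(partitions, consumers):
    """Round-robin: each partition is owned by exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Assumed membership: group A has two consumers, group B has four.
group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])
```

With this layout, group A's two consumers each handle two partitions while group B's four consumers handle one each; adding consumers to a group spreads the same partitions over more workers, up to one consumer per partition.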

Kafka: A Modern Distributed System for Streams
- Scalability of a filesystem
  - Hundreds of MB/sec/server throughput
  - Many TB per server
- Guarantees of a database
  - Messages strictly ordered
  - All data persistent
- Distributed by default
  - Replication
  - Partitioning model
- Producers, Consumers, and Brokers all fault tolerant and horizontally scalable

Stream Data Platform
[Diagram labels: Search, Impala, Apps, Monitoring, Hive, RDBMS, "Kafka: Stream Data Platform", "Hadoop: Offline Data", DWH, NoSQL, Real-time Analytics, Stream Processing, Spark, MapReduce. Latency tiers: synchronous req/response (0-100s ms), near real time (> 100s ms), offline batch (> 1 hour)]

Batch Data => Batch Processing

Stream processing is a generalization of batch processing and request/response processing

Request/Response processing:
One input => One output

Batch processing:
All inputs => All outputs

Stream Processing:
Some inputs => some outputs
(you choose how much "some" is)
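The claim that stream processing generalizes the other two modes can be made concrete with a single function parameterized by how many inputs it groups together. This is a hedged sketch: `process`, the sample numbers, and the use of `sum` as the computation are all chosen for illustration.

```python
from itertools import islice


def process(inputs, window, fn):
    """Apply fn to consecutive chunks of `window` inputs, yielding one output each."""
    iterator = iter(inputs)
    while True:
        chunk = list(islice(iterator, window))
        if not chunk:
            return
        yield fn(chunk)


events = [3, 1, 4, 1, 5, 9]

# "One input => one output": a window of 1.
request_response = list(process(events, 1, sum))

# "All inputs => all outputs": a window covering the whole data set.
batch = list(process(events, len(events), sum))

# "Some inputs => some outputs": any window size in between.
stream = list(process(events, 2, sum))
```

Here the window is a count of records; real stream processors more often window by time, but the relationship between the three modes is the same.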

Stream Processing a la carte
[Diagram: Input Kafka Topic -> Transforms ("your code") -> Intermediate Kafka Topic -> Transforms -> Output Kafka Topic -> Hadoop, Live Data Store]
cat input | grep foo | wc -l

Stream Processing with Frameworks
[logo] + [logo] = Stream Processing

Unix Pipes, Modernized
cat /usr/share/dict/words | wc -l
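The cat, grep, and wc stages from these slides map naturally onto chained generators, where each stage pulls records from the previous one just as a process reads from a pipe. The function names mirror the Unix tools but are hypothetical Python helpers, and the input lines are made up.

```python
def cat(lines):
    """Source stage: emit records one at a time."""
    for line in lines:
        yield line


def grep(pattern, lines):
    """Filter stage: pass through only records containing the pattern."""
    return (line for line in lines if pattern in line)


def wc_l(lines):
    """Sink stage: count the records that reach the end of the pipeline."""
    return sum(1 for _ in lines)


source = ["foo bar", "baz", "another foo", "qux"]

# Equivalent in spirit to: cat input | grep foo | wc -l
count = wc_l(grep("foo", cat(source)))
```

In the stream data platform picture, the in-memory hand-off between stages is replaced by Kafka topics, so each transform can run as a separate process, possibly on another machine.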

On Schemas
Bad Schemas < No Schemas < Good Schemas

Put it all together
[Diagram labels: Apps, Social Graph, Search, Key-Value, Oracle, Newsfeed, OLAP, Storage, Log Search, Monitoring, Security & Fraud, Real-time Analytics, Kafka, Samza, Hadoop, Teradata]

At LinkedIn
- Everything in the company is a real-time stream
- > 800 billion messages written per day
- > 2.9 trillion messages read per day
- ~ 1 PB of stream data
- Tens of thousands of producer processes
- Backbone for data stores
  - Search
  - Social Graph
  - Newsfeed
  - Primary storage (in progress)
- Basis for stream processing

Elsewhere

Why this is the future
1. System diversity is increasing
2. Data diversity and volume are increasing
3. The world is getting faster
4. The technology exists

Mission: Make this a practical reality everywhere
Product
- Apache Kafka
- Schemas and metadata management
- Connectors for common systems
- Monitor data flow end-to-end
- Stream processing integration

Questions?
Confluent: @confluentinc, http://confluent.io, http://blog.confluent.io/2015/02/25/stream-data-platform-1
Apache Kafka: @apachekafka, http://kafka.apache.org, http://linkd.in/199imwy
Me: @jaykreps