I ♥ Logs: Apache Kafka, Stream Processing, and Real-time Data. Jay Kreps



The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
5. Logs and Stream Processing

Data Integration

Maslow's Hierarchy (top to bottom): Self-Actualization, Esteem, Love & Belonging, Safety, Physiological

For Data (top to bottom): Automation, Understanding, Semantics, Acquisition/Collection

New Types of Data
Database data: users, products, orders, etc.
Events: clicks, impressions, pageviews, etc.
Application metrics: CPU usage, requests/sec
Application logs: service calls, errors

New Types of Systems
Live stores: Voldemort, Espresso, Graph, OLAP, Search, InGraphs
Offline: Hadoop, Teradata

Bad
[Diagram: the sources (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services) are each wired directly to the destinations (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...).]

Good
[Diagram: the sources (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services) write to a central Log, and the destinations (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...) read from it.]

Apache Kafka
[Diagram: many producers write to a Kafka cluster; many consumers read from it.]

A brief history of Kafka

Three design principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters, not servers

Characteristics
Scalability of a filesystem: hundreds of MB/sec/server throughput; many TB per server
Guarantees of a database: messages strictly ordered; all data persistent
Distributed by default: replication; partitioning model

Kafka At LinkedIn
175 TB of in-flight log data per colo
Low latency: ~1.5 ms
Replicated to each datacenter
Tens of thousands of data producers
Thousands of consumers
7 million messages written/sec
35 million messages read/sec
Hadoop integration

Kafka is about logs

What is a log?

[Diagram: a log drawn as a sequence of records numbered 0 to 12. The 1st record is at position 0; the next record is written after position 12.]
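The idea in this slide can be sketched in a few lines of Python. This is only an illustration of an append-only, offset-addressed sequence, not Kafka's implementation; the class and method names (`Log`, `append`, `read`, `next_offset`) are made up for the example.

```python
class Log:
    """An append-only sequence of records, each addressed by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Write a record at the end and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=None):
        """Return records starting at `offset`, in the order they were written."""
        end = None if max_records is None else offset + max_records
        return self._records[offset:end]

    def next_offset(self):
        """Offset that the next appended record will receive."""
        return len(self._records)


# Thirteen records occupy offsets 0..12, as in the slide.
log = Log()
offsets = [log.append(f"record-{i}") for i in range(13)]
```

Records are never modified in place; a reader only needs to remember one number, the offset it has read up to.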

Partitioning
Partition 0: 0 1 2 3 4 5 6 7 8 9 10 11 12
Partition 1: 0 1 2 3 4 5 6 7 8 9
Partition 2: 0 1 2 3 4 5 6 7 8 9 10 11
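A rough sketch of what partitioning implies: each partition is its own ordered log with its own offsets, and a record's key decides which partition it lands in. The hashing scheme here (CRC32 modulo the partition count) and the names `PartitionedLog` and `partition_for` are assumptions for illustration, not a description of Kafka's partitioner.

```python
import zlib


class PartitionedLog:
    """Several independent append-only logs, selected by record key."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumed scheme: a stable hash of the key, modulo the partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


plog = PartitionedLog(3)
first = plog.append("member-42", "pageview")
second = plog.append("member-42", "click")
```

Records with the same key go to the same partition, so their relative order is preserved; there is no ordering across partitions, which is why the three partitions in the slide can have different lengths.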

Logs: pub/sub done right
[Diagram: a Data Source writes to a Log (positions 0 to 12). Destination System A reads at time = 7; Destination System B reads at time = 11.]
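One way to picture this slide in code: the log is shared, and each destination keeps only its own read position. The sketch below uses a plain Python list as the log and a hypothetical `Consumer` class; the positions 7 and 11 are taken from the slide.

```python
class Consumer:
    """A reader that tracks its own position in a shared log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.position = 0  # index of the next record to read

    def poll(self, max_records):
        """Read up to `max_records` records and advance the position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


log = [f"event-{i}" for i in range(13)]  # positions 0..12
system_a = Consumer("Destination System A", log)
system_b = Consumer("Destination System B", log)

system_a.poll(7)   # A has consumed positions 0..6 and is now at 7
system_b.poll(11)  # B has consumed positions 0..10 and is now at 11

# The source keeps writing; neither consumer affects the other.
log.append("event-13")
remaining_a = system_a.poll(100)
remaining_b = system_b.poll(100)
```

A slow reader does not hold back a fast one, and the writer does not need to know who is reading.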

Logs And Distributed Systems

Example: A Fault-tolerant CEO Hash Table

Operations
PUT('microsoft', 'bill gates')
PUT('apple', 'steve jobs')
PUT('microsoft', 'steve ballmer')
PUT('google', 'larry page')
PUT('yahoo', 'terry semel')
PUT('google', 'eric schmidt')
PUT('yahoo', 'jerry yang')
PUT('yahoo', 'carol bartz')
PUT('apple', 'tim cook')
PUT('google', 'larry page')
PUT('yahoo', 'scott thompson')
PUT('yahoo', 'marissa mayer')
PUT('microsoft', 'satya nadella')

Final State (Replica 1, Replica 2)
{
  'microsoft': 'satya nadella',
  'apple': 'tim cook',
  'google': 'larry page',
  'yahoo': 'marissa mayer'
}

 0  PUT(microsoft, bill gates)
 1  PUT(apple, steve jobs)
 2  PUT(microsoft, steve ballmer)
 3  PUT(google, larry page)
 4  PUT(yahoo, terry semel)
 5  PUT(google, eric schmidt)
 6  PUT(yahoo, jerry yang)
 7  PUT(yahoo, carol bartz)
 8  PUT(apple, tim cook)
 9  PUT(google, larry page)
10  PUT(yahoo, scott thompson)
11  PUT(yahoo, marissa mayer)
12  PUT(microsoft, satya nadella)
Replica 1 (offset=10)
Replica 2 (offset=12)
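The hash-table example can be replayed directly. The sketch below assumes that a replica's offset means "the last log entry it has applied"; the slide does not say whether the offset is inclusive, so treat that as an interpretation. The function names are invented for the example.

```python
# The thirteen PUT operations from the slide, at log offsets 0..12.
LOG = [
    ("microsoft", "bill gates"),
    ("apple", "steve jobs"),
    ("microsoft", "steve ballmer"),
    ("google", "larry page"),
    ("yahoo", "terry semel"),
    ("google", "eric schmidt"),
    ("yahoo", "jerry yang"),
    ("yahoo", "carol bartz"),
    ("apple", "tim cook"),
    ("google", "larry page"),
    ("yahoo", "scott thompson"),
    ("yahoo", "marissa mayer"),
    ("microsoft", "satya nadella"),
]


def replay(log, through_offset):
    """Build a hash table by applying entries 0..through_offset in order."""
    state = {}
    for key, value in log[:through_offset + 1]:
        state[key] = value
    return state


def catch_up(state, log, last_applied, through_offset):
    """Apply only the entries a lagging replica has not yet seen."""
    for key, value in log[last_applied + 1:through_offset + 1]:
        state[key] = value
    return state


replica_1 = replay(LOG, 10)  # lagging replica
replica_2 = replay(LOG, 12)  # up-to-date replica

# Replica 1 converges by applying entries 11 and 12.
caught_up = catch_up(dict(replica_1), LOG, 10, 12)
```

Because every replica applies the same entries in the same order, a single number per replica is enough to describe how far behind it is and what it still needs to apply.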

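Two System Design Styles
State-machine Replication: [Diagram: Requests are written to The Log; each Peer reads from it.]
Primary-backup: [Diagram: Writes and Reads go to the Master; the Master's state changes are written to The Log; each Slave reads from it.]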
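The difference between the two styles is what gets written to the log. The sketch below is one reading of the diagram labels, using a single integer as the replicated state: in state-machine replication the log holds the incoming requests and every peer executes them; in primary-backup the primary executes each request and logs the resulting state change, which the backups simply apply. The request format and function names are assumptions made for the example.

```python
def apply_request(value, request):
    """Execute one request against the current state and return the new state."""
    op, arg = request
    if op == "add":
        return value + arg
    if op == "mul":
        return value * arg
    raise ValueError(f"unknown operation: {op}")


requests = [("add", 3), ("mul", 4), ("add", 1)]

# State-machine replication: the log holds the requests themselves,
# and every peer executes each request in log order.
request_log = list(requests)
peers = [0, 0, 0]
for i in range(len(peers)):
    for request in request_log:
        peers[i] = apply_request(peers[i], request)

# Primary-backup: only the primary executes requests; it logs the
# resulting state change, and backups apply the changes in log order.
primary = 0
change_log = []
for request in requests:
    primary = apply_request(primary, request)
    change_log.append(primary)

backups = [0, 0]
for i in range(len(backups)):
    for new_value in change_log:
        backups[i] = new_value
```

Both approaches end in the same state on every replica. The first requires request execution to be deterministic on every peer; the second only requires that backups can apply a recorded change.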

Logs and Data Integration

Example: User views job
[Diagram: the Jobs Frontend publishes Job Views to Kafka; the Job Views stream is read by Hadoop, Security, Job Poster Analytics, Rec. Engine, and Monitoring.]

It's all one big distributed system
[Diagram: the sources (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services) write to a central Log, and the destinations (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...) read from it.]

Comparing Data Transfer Mechanisms

Stream Processing

Stream Processing = Logs + Jobs
[Diagram: Jobs 1, 2, and 3 connect Logs A through F; each job reads from input logs and writes to output logs.]
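A minimal sketch of "logs + jobs": a job remembers its position in each input log, transforms the records it has not yet seen, and appends the results to an output log that another job can read in turn. The exact wiring of the slide's diagram is not recoverable from the transcript, so the chain used here (Log A to Job 1 to Log D to Job 3 to Log F) and the `Job` class are assumptions.

```python
class Job:
    """Reads new records from input logs and appends results to an output log."""

    def __init__(self, inputs, output, transform):
        self.inputs = inputs
        self.output = output
        self.transform = transform  # maps one record to a list of results
        self.positions = [0] * len(inputs)

    def run_once(self):
        """Process only the records appended since the last run."""
        for i, log in enumerate(self.inputs):
            for record in log[self.positions[i]:]:
                self.output.extend(self.transform(record))
            self.positions[i] = len(log)


log_a, log_d, log_f = [], [], []

# Job 1 keeps only view events; Job 3 extracts the viewed item.
job_1 = Job([log_a], log_d, lambda r: [r] if r.startswith("view:") else [])
job_3 = Job([log_d], log_f, lambda r: [r.split(":", 1)[1]])

log_a.extend(["view:job1", "click:ad7", "view:job2"])
job_1.run_once()
job_3.run_once()

# New input arrives later; each job picks up where it left off.
log_a.append("view:job3")
job_1.run_once()
job_3.run_once()
```

Running a job once over a complete, finite input is a batch job; running it repeatedly as records arrive is a stream job.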

Stream processing is a generalization of batch processing

Examples
Monitoring
Security
Content processing
Recommendations
Newsfeed
ETL

Systems Can Help

Samza Architecture
[Diagram: Jobs run on Samza, which sits on top of Kafka and YARN.]

Log-centric Architecture
[Diagram labels: Graph DB, OLAP Store, etc.; Key-Value Query Layer; Search Query Layer; Monitoring & Graphs; Log; Stream Processing; Hadoop.]

Kafka: http://kafka.apache.org
Samza: http://samza.incubator.apache.org
Log Blog: http://linkd.in/199imwy
Me: http://www.linkedin.com/in/jaykreps, @jaykreps