Using Kafka to Optimize Data Movement and System Integration. Alex Holmes @



Similar documents
The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Putting Apache Kafka to Use!

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Introduction to Apache Kafka And Real-Time ETL. for Oracle DBAs and Data Analysts

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Architectures for massive data management

NOT IN KANSAS ANY MORE

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

How To Scale Out Of A Nosql Database

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Apache Hadoop: Past, Present, and Future

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Luncheon Webinar Series May 13, 2013

Wisdom from Crowds of Machines

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Comprehensive Analytics on the Hortonworks Data Platform

Openbus Documentation

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran)

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Data Pipeline with Kafka

Kafka: a Distributed Messaging System for Log Processing

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Practical Hadoop by Example

Application Development. A Paradigm Shift

Constructing a Data Lake: Hadoop and Oracle Database United!

Case Study : 3 different hadoop cluster deployments

The Future of Data Management with Hadoop and the Enterprise Data Hub

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

WSO2 Message Broker. Scalable persistent Messaging System

Implement Hadoop jobs to extract business value from large and varied data sets

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Dominik Wagenknecht Accenture

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TRENDS AND TECHNOLOGIES

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

GigaSpaces Real-Time Analytics for Big Data

ROCANA WHITEPAPER Rocana Ops Architecture

Moving From Hadoop to Spark

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

How To Use Big Data For Telco (For A Telco)

Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared

Apache Kafka Your Event Stream Processing Solution

Workshop on Hadoop with Big Data

SAP and Hortonworks Reference Architecture

Big Data Infrastructure at Spotify

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Big Data Analytics - Accelerated. stream-horizon.com

Unified Big Data Processing with Apache Spark. Matei

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

Oracle Database 12c Plug In. Switch On. Get SMART.

Introduction to Big Data Training

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Creating Big Data Applications with Spring XD

Upcoming Announcements

Hadoop implementation of MapReduce computational model. Ján Vaňo

A Brief Outline on Bigdata Hadoop

The Internet of Things and Big Data: Intro

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Apache HBase. Crazy dances on the elephant back

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Future Internet Technologies

Big Data Analytics - Accelerated. stream-horizon.com

HDP Hadoop From concept to deployment.

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO

Hadoop Ecosystem B Y R A H I M A.

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

MapReduce with Apache Hadoop Analysing Big Data

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

Big Data Course Highlights

Dell In-Memory Appliance for Cloudera Enterprise

Kafka & Redis for Big Data Solutions

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Can the Elephants Handle the NoSQL Onslaught?

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!)

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Transcription:

Using Kafka to Optimize Data Movement and System Integration Alex Holmes @

https://www.flickr.com/photos/tom_bennett/7095600611

THIS SUCKS E T (circa 2560 B.C.E.) L

a few years later...

2,014 C.E. i need you to copy data from oracle to HADOOP... 3 weeks later... this sucks!

Agenda Look at challenges with data and system integration Provide an overview of Kafka, including its performance Examine Kafka data and system integration patterns

whoami Alex Holmes Software engineer Working on distributed systems for many years @grep_alex grepalex.com

Challenges with data integration

i need you to copy data from oracle into Hadoop for our data scientists

Sqoop Custom ETL Hadoop Problems you ll need to solve: Reliability/fault tolerance Scalability/throughput Handle large tables Throttle network IO Idempotent writes Scheduling

OH, AND WE NEED To use the same data to calculate aggregates in real-time

Sqoop Hadoop? NoSQL

Hadoop OLTP OLAP / EDW HBase Cassan dra Voldem ort Rec. Engine Analytics Security Search Monitoring Social Graph

Hadoop OLTP OLAP / EDW HBase Cassan dra Voldem ort kafka Rec. Engine Analytics Security Search Monitoring Social Graph

Kafka

Background Apache project Originated from LinkedIn Open-sourced in 2011 Written in Scala and Java Borrows concepts in messaging systems and logs

Powered by Twitter LinkedIn Netflix Spotify Mozilla

A log (partition in Kafka parlance) First record Most recent record 0 1 2 3 4 5 6 7 8 9 10 11...

Producer-consumer system Producer Producer Producer Kafka Consumer Consumer Consumer

Kafka!= JMS Doesn t use the JMS API Ordering guarantees Mid-to-long term message storage Rewind capabilities Many-to-many messaging Broker doesn t maintain delivery state

Concepts A Kafka cluster consists of 1..N brokers and manages 1..N topics Topics are a category or feed name where messages are published Topics have one or more partitions, which are ordered, immutable sequence of messages that are continually appended to, much like a commit log

Data scaling and parallelism Partition 0 0 1 2 3 4 5 6 7 8 9 10 11... Partition 1 0 1 2 3 4 5 6 7 8... Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13...

Partition distribution Round-robin/ load balance Producer Domain-specific semantic partition strategy Server Server Server Partition 0 Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

Consumers Server Server Server Partition 0 Partition 1 Partition 2 Consumer A Consumer B Consumer A Consumer group 1 Queue-balancing load Pub-sub - all messages broadcasted to all consumers

Consumer state High-level consumer Low-level consumer (manage your own state) ZooKeeper Pick your own offset storage adventure Consumer Partition 0 Consumer Partition 0 Oldest Oldest Newest Newest

Partition replication Producers Server Server Server "Leader" partition "Follower" partition "Follower" partition Consumers

Performance + Scalability

Scalability Producer Producer Producer Kafka Server Kafka Server Kafka Server Consumer Consumer Consumer

Sequential disk IO Consumer A Consumer B Producer reads reads writes Disk 0 1 2 3 4 5 6 7 8 9 10 11...

O.S. page cache is leveraged Consumer A Consumer B Producer reads reads writes OS page cache Disk 0 1 2 3 4 5 6 7 8 9 10 11...

Zero-copy IO Traditional Zero-copy

Throughput Producer Producer Producer 2,024,032 TPS 2ms Kafka Server Kafka Server Kafka Server 2,615,968 TPS Consumer Consumer Consumer http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Kafka data integration patterns

External data subscription Subscriber Subscriber Your App Kafka Subscriber Subscriber

Internal data subscription Hadoop Analytics Your App Kafka Recomm endations Monitoring Security

Stream processing Your App Kafka Your App NoSQL Your App

Lambda architecture Kafka NoSQL Hadoop

Multi-datacenter Datacenter 1 Datacenter 2 Kafka Kafka Datacenter 3 Kafka

Additional patterns Partition on user ID so all events for a user land on a single partition Publish system and application logs for downstream aggregation and monitoring System of record for your data

Data standardization Important to pick a lingua franca for your data Considerations should include compactness, schema evolution, tooling and third-party support Avro is a good fit (coupled with Parquet for on-disk)

Conclusion Kafka is a scalable system with useful traits such as ordered messages, durability, availability Popular for data movement and feeding multiple downstream systems using the same data pipe Excellent documentation available at http://kafka.apache.org/documentation.html

Shameless plug Book signing at the JavaOne bookstore @ 1pm tomorrow CON3515 - The Top 10 Hadoop Patterns and Anti-patterns @ 11am tomorrow Parc 55 - Embarcadero