Introduction to Apache Kafka and Real-Time ETL for Oracle DBAs and Data Analysts





About Myself
Gwen Shapira, System Architect @ Confluent
Committer @ Apache Kafka, Apache Sqoop
Author of Hadoop Application Architectures and Kafka: The Definitive Guide
Previously: Software Engineer @ Cloudera, Oracle ACE Director, Senior Consultant @ Pythian, DBA @ Mercury Interactive
Find me: gwen@confluent.io, @gwenshap

There's a Book on That!

An Optical Illusion: "Apache Kafka is publish-subscribe messaging rethought as a distributed commit log"... turned into a stream data platform.

We'll talk about:
- Redo logs
- Pub-sub message queues
- So what is Kafka?
- Awesome use cases for Kafka
- Data streams and real-time ETL
- Where you can learn more

Transaction Log: the most crucial structure for recovery operations; it stores all changes made to the database as they occur.

Important Point: The redo log is the only reliable source of information about the current state of the database.

The redo log is used to:
- Recover a consistent state of the database
- Replicate the database (Data Guard, Streams, GoldenGate...)
If you look far enough into the archive logs, you can reconstruct the entire database.

Publish-Subscribe Messaging

The Problem: We have more user requests than database connections.

What do we do?
1. "Your call is important to us"
2. Show them static content
3. Distract them with funny cat photos
4. Prioritize them
5. Acknowledge them and handle the request later
6. Jump them to the head of the line
7. Tell them how long the wait is

Solution: a message queue (Msg 1, Msg 2, Msg 3, ..., Msg N).

The application business layer writes to a message queue; the application data layer consumes from it through a DataSource interface, a JDBC driver, and a connection pool.

Queues decouple systems: the database is protected from user load.
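
The decoupling above can be sketched with a toy example (this is Python's standard `queue` module, not Kafka; the request names and connection count are illustrative): the business layer enqueues every request immediately, while the data layer drains the queue at the pace the database can sustain.

```python
# Toy sketch (not Kafka): a queue decoupling request handlers from a
# limited pool of database connections. Names are illustrative.
import queue

DB_CONNECTIONS = 2          # far fewer connections than requests
requests = [f"req-{i}" for i in range(6)]

work = queue.Queue()        # the message queue between the two layers

# Business layer: enqueue every request immediately (user gets an ack)
for r in requests:
    work.put(r)

# Data layer: drain the queue at the database's own pace
processed = []
while not work.empty():
    # each round uses at most DB_CONNECTIONS items, simulating the pool
    batch = [work.get() for _ in range(min(DB_CONNECTIONS, work.qsize()))]
    processed.extend(batch)

print(processed)  # all six requests handled, never more than 2 at a time
```

The database only ever sees work at its own pace, while the user-facing layer stays responsive.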

Raise your hand if this sounds familiar: "My next project was to get a working Hadoop setup. Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms." --Jay Kreps, Kafka PMC

Data pipelines start like this: a client and a source.

Then we reuse them: more clients share the same source.

Then we add consumers to the existing sources: clients now feed multiple backends.

Then it starts to look like this: every client connected to every backend.

With maybe some of this.

Queues decouple systems: adding new systems doesn't require changing existing systems.

This is where we are trying to get: producers (the source systems) write to Kafka brokers, and consumers (Hadoop, security systems, real-time monitoring, the data warehouse) read from them. Kafka decouples data pipelines.

Important notes:
- Producers and consumers don't need to know about each other
- Performance issues on consumers don't impact producers
- Consumers are protected from herds of producers
- Lots of flexibility in handling load
- Messages are available for anyone: lots of new use cases, monitoring, audit, troubleshooting
http://www.slideshare.net/gwenshap/queues-pools-caches

That's nice, but what is Kafka?

Kafka provides a fast, distributed, highly scalable, highly available publish-subscribe messaging system. In turn, this solves part of a much harder problem: communication and integration between components of large software systems.

The Basics
- Messages are organized into topics
- Producers push messages
- Consumers pull messages
- Kafka runs in a cluster. Nodes are called brokers
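
The basics above can be illustrated with a minimal in-memory sketch (illustrative only, not the real Kafka protocol or client API; the class and topic names are made up): topics hold ordered messages, producers push, and consumers pull by offset.

```python
# Minimal in-memory sketch of the basics: topics hold ordered messages,
# producers push, consumers pull from an offset. Not real Kafka code.
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(list)      # topic -> append-only log

    def push(self, topic, message):          # producer side
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1   # offset of the new message

    def pull(self, topic, offset):           # consumer side
        return self.topics[topic][offset:]   # everything from `offset` on

broker = ToyBroker()
broker.push("clicks", "page=/home")
broker.push("clicks", "page=/cart")
print(broker.pull("clicks", 0))   # ['page=/home', 'page=/cart']
print(broker.pull("clicks", 1))   # ['page=/cart']
```

Note the pull model: the broker never tracks or pushes to individual consumers; each consumer just asks for data from an offset it chooses.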

Topics, Partitions and Logs

Each partition is a log
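
"Each partition is a log" means an append-only sequence of records, each addressed by its position (its offset). A minimal sketch of the idea (illustrative names, not Kafka internals):

```python
# "Each partition is a log": an append-only sequence of records, each
# addressed by its position (offset). Toy sketch, not Kafka internals.
partition = []                      # the log for one partition

def append(record):
    partition.append(record)
    return len(partition) - 1       # the record's offset

assert append("first") == 0
assert append("second") == 1
assert partition[0] == "first"      # reads address records by offset
```

This is the same structure as a database redo log: records are only ever appended, and position in the log defines order.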

Each broker has many partitions (here, partitions 0, 1, and 2, replicated across the brokers).

Producers load balance between partitions.
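
When a message has a key, the producer's partitioner maps that key to a partition so the same key always lands in the same place (the Java client's default uses murmur2; the CRC32 below is just a stand-in to illustrate the idea, and the key names are made up):

```python
# Sketch of key-based partition assignment. Kafka's default partitioner
# hashes the message key (murmur2 in the Java client); CRC32 here is a
# stand-in to show that the same key always maps to the same partition.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# The same customer always maps to the same partition, so per-key
# ordering is preserved:
assert partition_for("customer-42") == partition_for("customer-42")

# Different keys spread across partitions (keyless messages are
# distributed round-robin instead):
print({k: partition_for(k) for k in ["a", "b", "c", "d"]})
```

This is why ordering in Kafka is a per-partition guarantee: all messages for one key share a partition, but there is no global order across partitions.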

Consumers: each consumer group (e.g., Consumer Group X and Consumer Group Y) reads the topic's partitions (A, B, C, each a file) at its own offsets. Order is retained within a partition, but not across partitions. Offsets are kept per consumer group.
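
Per-group offsets can be sketched as follows (a toy illustration with made-up group names, not the real consumer protocol): two groups read the same partition independently, and committing an offset in one group does not affect the other.

```python
# Toy illustration of per-group offsets: two consumer groups read the
# same partition log independently. Not the real consumer protocol.
log = ["m0", "m1", "m2", "m3"]          # one partition, ordered
offsets = {"group-x": 0, "group-y": 0}  # committed offset per group

def poll(group, max_records=2):
    start = offsets[group]
    records = log[start:start + max_records]
    offsets[group] = start + len(records)   # commit after processing
    return records

assert poll("group-x") == ["m0", "m1"]
assert poll("group-x") == ["m2", "m3"]
assert poll("group-y") == ["m0", "m1"]   # group-y is unaffected
```

Because the broker only stores data plus a small offset per group, adding another consumer group costs almost nothing.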

Kafka Magic: Why is it so fast?
- 250M events per sec on one node at 3ms latency
- Scales to any number of consumers
- Stores data for a set amount of time, without tracking who read what data
- Replicates, but no need to sync to disk
- Zero-copy writes from memory/disk to network

How do people use Kafka?
- As a message bus
- As a buffer for replication systems (like Advanced Queuing in Streams)
- As a reliable feed for event processing
- As a buffer for event processing
- To decouple apps from databases (both OLTP and DWH)

My Favorite Use Cases
- Shops consume inventory updates
- Clicking around an online shop? Your clicks go to Kafka and recommendations come back.
- Flagging credit card transactions as fraudulent
- Flagging game interactions as abuse
- Least favorite: surge pricing in Uber
A huge list of users is at kafka.apache.org

Got it! But what about real-time ETL?

Remember this? Producers (the source systems) write to Kafka brokers, and consumers (Hadoop, security systems, real-time monitoring, the data warehouse) read from them. Kafka is smack in the middle of all data pipelines; Kafka decouples data pipelines.

If data flies into Kafka in real time, why wait 24 hours before pulling it into a DWH?


Why does Kafka make real-time ETL better?
- It can integrate with any data source: RDBMS, NoSQL, applications, web applications, logs
- Consumers can be real-time, but they don't have to be
- Reading and writing to/from Kafka is cheap, so it is a great place to store intermediate state
- You can fix mistakes by re-reading some of the data: same data, same order
- Adding more pipelines/aggregations has no impact on source systems = low risk
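
The "fix mistakes by re-reading" point deserves a sketch (toy code, not a real Kafka client; the transforms are made up): because the log retains messages for a set period, a consumer with a buggy transform can simply rewind its offset to zero and reprocess the same data in the same order.

```python
# Sketch of reprocessing: the log keeps messages for its retention
# period, so a consumer can rewind its offset and re-read the same
# data in the same order. Toy code, not a real Kafka client.
log = [10, 20, 30]                  # retained messages, in order

def consume_from(offset):
    return log[offset:]

buggy_run = [x * 0 for x in consume_from(0)]   # oops, a bad transform
fixed_run = [x * 2 for x in consume_from(0)]   # rewind to 0 and redo

assert buggy_run == [0, 0, 0]
assert fixed_run == [20, 40, 60]               # same input, same order
```

In a traditional pipeline the bad output would have overwritten the input; with a retained log, the input is still there to replay.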

It is all valuable data: raw, clean, aggregated, enriched, and filtered data all feed dashboards, reports, data scientists, and alerts.

But wait, how do we process the data? However you want: you just consume data, modify it, and produce it back.
Built into Kafka: KProcessor, KStream
Popular choices: Storm, Spark Streaming
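
The consume-modify-produce pattern above can be shown as a toy pipeline stage (illustrative topics and fields; a real deployment would use a Kafka client or a stream processor): read from an input topic, transform each message, write to an output topic.

```python
# Toy consume-modify-produce stage: read from an input topic,
# transform each message, write to an output topic. The topic names
# and fields are illustrative, not a real Kafka client.
import json

input_topic = ['{"amount": 5}', '{"amount": 12}']
output_topic = []

for raw in input_topic:                             # consume
    event = json.loads(raw)
    event["amount_cents"] = event["amount"] * 100   # modify
    output_topic.append(json.dumps(event))          # produce back

print(output_topic)
```

Because the output goes back into Kafka, the next stage (or several independent stages) can consume it the same way, which is how multi-step real-time ETL pipelines are composed.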

Need More Kafka?
- https://kafka.apache.org/documentation.html
- My video tutorial: http://shop.oreilly.com/product/0636920038603.do
- http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
- Our website: http://confluent.io
- Oracle guide to real-time ETL: http://www.oracle.com/technetwork/middleware/dataintegrator/overview/best-practices-for-realtime-data-wa-132882.pdf