Data Pipeline with Kafka



Similar documents
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Building your Big Data Architecture on Amazon Web Services

Architectures for massive data management

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

A Performance Analysis of Distributed Indexing using Terrier

Big Data Infrastructure at Spotify

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Openbus Documentation

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Kafka & Redis for Big Data Solutions

Building a logging pipeline with Open Source tools. Iñigo Ortiz de Urbina Cazenave

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Using Kafka to Optimize Data Movement and System Integration. Alex

docs.hortonworks.com

Search and Real-Time Analytics on Big Data

Comparing Scalable NOSQL Databases

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

The Hadoop Distributed File System

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Wisdom from Crowds of Machines

Kafka: a Distributed Messaging System for Log Processing

The Game of Big Data! Analytics Infrastructure at KIXEYE

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Introduction to Apache Kafka And Real-Time ETL. for Oracle DBAs and Data Analysts

Testing Spark: Best Practices

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Putting Apache Kafka to Use!

HADOOP MOCK TEST HADOOP MOCK TEST II

Hadoop. Sunday, November 25, 12

MapReduce, Hadoop and Amazon AWS

Fast Data in the Era of Big Data: Twitter s Real-

Virtualizing Apache Hadoop. June, 2012

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

The Cloud to the rescue!

Integrating VoltDB with Hadoop

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Big Data Spatial Analytics An Introduction

Hadoop IST 734 SS CHUNG

Best Practices for Sharing Imagery using Amazon Web Services. Peter Becker

Introduction to Hbase Gkavresis Giorgos 1470

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

HDFS Cluster Installation Automation for TupleWare

BIG DATA TRENDS AND TECHNOLOGIES

How Companies are! Using Spark

Moving From Hadoop to Spark

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

THE HADOOP DISTRIBUTED FILE SYSTEM

NoSQL for SQL Professionals William McKnight

CS 378 Big Data Programming

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Productionizing a 24/7 Spark Streaming Service on YARN

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

[Hadoop, Storm and Couchbase: Faster Big Data]

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Large Scale Text Analysis Using the Map/Reduce

Chase Wu New Jersey Ins0tute of Technology

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Introduction to HDFS. Prasanth Kothuri, CERN

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Comparing SQL and NOSQL databases

Media Upload and Sharing Website using HBASE

Scalable Architecture on Amazon AWS Cloud

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Application Development. A Paradigm Shift

Unified Big Data Processing with Apache Spark. Matei

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

Big Data Analytics - Accelerated. stream-horizon.com

Luncheon Webinar Series May 13, 2013

Workshop: From Zero. Budapest DW Forum 2014

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Apache Hadoop FileSystem and its Usage in Facebook

CSE 344 Introduction to Data Management. Section 9: AWS, Hadoop, Pig Latin TA: Yi-Shu Wei

Performance Testing of Big Data Applications

How To Scale Out Of A Nosql Database

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Creating Big Data Applications with Spring XD

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

How To Create A Data Visualization With Apache Spark And Zeppelin

BIG DATA USING HADOOP

Hadoop Distributed File System (HDFS) Overview

Transcription:

Data Pipeline with Kafka Peerapat Asoktummarungsri AGODA

Senior Software Engineer Agoda.com Contributor Thai Java User Group (THJUG.com) Contributor Agile66

AGENDA Big Data & Data Pipeline Kafka Introduction Quick Start Monitoring Data Pipeline for Search API Hadoop integration with Camus

Big Data Hadoop Information MapReduce + HDFS

Pipeline Website hadoop log

Growth Website log hadoop Mobile

Complex Website log hadoop Mobile message realtime monitoring

Features becomes the problem Website hadoop New Mobile realtime monitoring New API NEW Data Warehouse

Data Pipeline Website Produce hadoop Mobile Data Pipeline realtime monitoring Consume API Warehouse

1 Consumer Queue 2 Consumer 3 Consumer compare 1 Consumer Topic 1 Consumer 1 Consumer

General Topic Implement 2 Consumer 1 Topic Consumer 2 2 Consumer 3 This consumer will lose a message.

Distributed by Design Fast Scalable - It can be elastically and transparently expanded without downtime. Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.

msg gid = hadoop msg 4 1 Topic Consumer 1 msg 7 2 Consumer 2 6 5 3 Consumer 3 gid = Group ID

msg gid = hadoop msg 3 2 1 Topic msg hadoop gid = rtmon 3 2 1 realtime monitoring gid = warehouse 3 2 1 data warehouse gid = Group ID

msg msg 9 Topic 3 gid = hadoop hadoop gid = Group ID New Consumer gid = newconsumer 2 1 9 9 gid = rtmon realtime monitoring gid = warehouse data warehouse

Vagrant Install Vagrant Install Virtual Box Clone https://github.com/stealthly/scala-kafka vagrant up

BREW brew update brew install zookeeper kafka -y

Some Kafka Config # The id of the broker. This must be set to a unique integer for each broker. broker.id=0 # The port the socket server listens on port=9092 # Zookeeper connection string (see zookeeper docs for details). zookeeper.connect=localhost:2181 # Timeout in ms for connecting to zookeeper zookeeper.connection.timeout.ms=6000 # The minimum age of a log file to be eligible for deletion log.retention.hours=168

Kafka @ Linkedin (2013) 10 billion message writes per day 55 billion messages delivered to real-time consumers 367 topics that cover both user activity topics and operational data the largest of which adds an average of 92GB per day of batch-compressed messages Messages are kept for 7 days, and these average at about 9.5 TB of compressed messages across all topics.

KafkaOffsetMonitor 1 Jar file, KafkaOffsetMonitor-assembly-0.2.1.jar java -cp KafkaOffsetMonitor-assembly-0.2.1.jar \ com.quantifind.kafka.offsetapp.offsetgetterweb \ --zk localhost \ --port 8080 \ --refresh 10.seconds \ --retain 2.days Download KafkaOffsetMonitor from Github https://github.com/quantifind/kafkaoffsetmonitor

Produce Change CHANGE API Kafka Price & Inventory Consumer Calculate Price Hotel Manager HTTP Search API Hotels Cassandra

API CHANGE Kafka Price & Inventory Consumer Hotel Manager A Consumer Hotels B Consumer

Camus

Slide available here http://www.slideshare.net/nuboat Sourcecode available here https://github.com/nuboat/akkakafkaexam

REFERENCES http://www.slideshare.net/charmalloc/ developingwithapachekafka-29910685 http://www.infoq.com/articles/apache-kafka http://kafka.apache.org/ https://github.com/stealthly/scala-kafka https://github.com/quantifind/kafkaoffsetmonitor

Q & A