Data Pipeline with Kafka



1 Data Pipeline with Kafka Peerapat Asoktummarungsri AGODA

2 Senior Software Engineer, Agoda.com; Contributor, Thai Java User Group (THJUG.com); Contributor, Agile66

3 AGENDA
- Big Data & Data Pipeline
- Kafka Introduction
- Quick Start
- Monitoring
- Data Pipeline for Search API
- Hadoop integration with Camus

4 Big Data Hadoop Information MapReduce + HDFS

5 Pipeline Website hadoop log

6 Growth Website log hadoop Mobile

7 Complex Website log hadoop Mobile message realtime monitoring

8 Features become the problem Website hadoop New Mobile realtime monitoring New API NEW Data Warehouse

9 Data Pipeline Website Produce hadoop Mobile Data Pipeline realtime monitoring Consume API Warehouse

10 Queue vs. Topic (diagram): with a queue, messages 1, 2, and 3 each go to a different consumer; with a topic, message 1 goes to every consumer.
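The queue/topic comparison on this slide can be sketched in a few lines of Python. This is a toy model, not Kafka code: the function names and consumer names are hypothetical, and round-robin is only one possible way a queue might hand out messages.

```python
from collections import defaultdict
from itertools import cycle


def deliver_queue(messages, consumers):
    """Queue semantics: each message goes to exactly one consumer.

    Assumption: messages are handed out round-robin.
    """
    received = defaultdict(list)
    for msg, consumer in zip(messages, cycle(consumers)):
        received[consumer].append(msg)
    return dict(received)


def deliver_topic(messages, consumers):
    """Topic (publish-subscribe) semantics: every consumer gets every message."""
    return {consumer: list(messages) for consumer in consumers}


messages = [1, 2, 3]
consumers = ["consumer-a", "consumer-b", "consumer-c"]  # hypothetical names

queue_result = deliver_queue(messages, consumers)
topic_result = deliver_topic(messages, consumers)
```

With three messages and three consumers, the queue gives each consumer one message, while the topic gives each consumer all three.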

11 General Topic Implementation (diagram): one topic with Consumer 1, Consumer 2, and Consumer 3. This consumer will lose a message.

12 Distributed by Design
Fast
Scalable - It can be elastically and transparently expanded without downtime.
Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.

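13 (Diagram) One topic read by Consumer 1, Consumer 2, and Consumer 3, all in the same consumer group, gid = hadoop. The group's consumers share the topic's messages between them. gid = Group ID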

14 (Diagram) One topic read by three consumer groups: gid = hadoop (hadoop), gid = rtmon (realtime monitoring), gid = warehouse (data warehouse). gid = Group ID

15 (Diagram) A new consumer joins the topic with its own group, gid = newconsumer, alongside gid = hadoop (hadoop), gid = rtmon (realtime monitoring), and gid = warehouse (data warehouse). gid = Group ID
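The group ID slides can be illustrated with a toy single-partition log that tracks one read position per group ID. This is a sketch of the idea only, not Kafka's implementation: the `TopicLog` class is hypothetical, and it assumes a group seen for the first time starts reading at offset 0 (in a real deployment this depends on the consumer's offset-reset setting).

```python
class TopicLog:
    """Toy single-partition log that tracks one read offset per group ID."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # group ID -> next offset to read

    def produce(self, msg):
        self.messages.append(msg)

    def poll(self, gid, max_messages=10):
        # Assumption: a group seen for the first time starts at offset 0.
        start = self.offsets.get(gid, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[gid] = start + len(batch)
        return batch


log = TopicLog()
for i in range(1, 4):
    log.produce(f"msg-{i}")

hadoop_first = log.poll("hadoop")    # hadoop group reads msg-1..msg-3
rtmon_first = log.poll("rtmon")      # rtmon group independently reads the same three
log.produce("msg-4")
hadoop_second = log.poll("hadoop")   # hadoop only sees what it has not read yet
new_group = log.poll("newconsumer")  # a new group ID starts from the beginning
```

Because each group ID has its own offset, adding `newconsumer` does not affect what `hadoop`, `rtmon`, or `warehouse` receive.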

17 Vagrant
- Install Vagrant
- Install VirtualBox
- Clone
- vagrant up

18 BREW

    brew update
    brew install zookeeper kafka

19 Some Kafka Config

    # The id of the broker. This must be set to a unique integer for each broker.
    broker.id=0

    # The port the socket server listens on
    port=9092

    # Zookeeper connection string (see zookeeper docs for details).
    zookeeper.connect=localhost:2181

    # Timeout in ms for connecting to zookeeper
    zookeeper.connection.timeout.ms=6000

    # The minimum age of a log file to be eligible for deletion
    log.retention.hours=168
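To make the settings above easier to read, here is a small Python sketch that parses a properties-style fragment and converts the retention setting to days. The `parse_properties` helper is hypothetical and handles only simple `key=value` lines and `#` comments; it is not a full Java properties parser.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# The values shown on the slide.
SAMPLE = """
# The id of the broker.
broker.id=0
port=9092
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
log.retention.hours=168
"""

props = parse_properties(SAMPLE)
retention_days = int(props["log.retention.hours"]) / 24  # 168 hours = 7 days
zk_host, zk_port = props["zookeeper.connect"].split(":")
```

The 168-hour retention works out to 7 days, the same retention period quoted in the LinkedIn figures.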

20 LinkedIn (2013)
- 10 billion message writes per day
- 55 billion messages delivered to real-time consumers
- 367 topics that cover both user activity topics and operational data, the largest of which adds an average of 92GB per day of batch-compressed messages
- Messages are kept for 7 days, and these average at about 9.5 TB of compressed messages across all topics.
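A quick back-of-the-envelope check puts these figures in per-second terms. The sketch below uses only the numbers on the slide and assumes the 55 billion deliveries are also a per-day figure and that load is spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60

writes_per_day = 10_000_000_000       # message writes per day (from the slide)
deliveries_per_day = 55_000_000_000   # assumed to be per day as well
retained_tb = 9.5                     # compressed data held across all topics
retention_days = 7

avg_writes_per_sec = writes_per_day / SECONDS_PER_DAY  # about 115,741 writes/s
fan_out = deliveries_per_day / writes_per_day          # deliveries per write
avg_tb_per_day = retained_tb / retention_days          # about 1.36 TB/day

print(f"{avg_writes_per_sec:,.0f} writes/s, fan-out {fan_out}, "
      f"{avg_tb_per_day:.2f} TB/day")
```

On average each written message is delivered about 5.5 times, which fits a model where several independent consumer groups read the same topics.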


21 KafkaOffsetMonitor
1 Jar file, KafkaOffsetMonitor-assembly jar

    java -cp KafkaOffsetMonitor-assembly jar \
        com.quantifind.kafka.offsetapp.OffsetGetterWeb \
        --zk localhost \
        --port 8080 \
        --refresh 10.seconds \
        --retain 2.days

Download KafkaOffsetMonitor from GitHub
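The main quantity an offset monitor tracks is consumer lag: how far a group's committed offset trails the newest offset in each partition. A minimal Python sketch of that calculation follows; the `consumer_lag` function and the sample offsets are hypothetical and are not taken from KafkaOffsetMonitor itself.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the group's committed offset.

    Assumption: a partition with no committed offset counts as unread from 0.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical sample offsets for a three-partition topic.
log_end = {0: 1200, 1: 950, 2: 1010}
committed = {0: 1200, 1: 900, 2: 700}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
```

A lag that keeps growing for one group ID means that consumer is falling behind the producers, which is the situation monitoring is meant to catch.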

28 Produce Change CHANGE API Kafka Price & Inventory Consumer Calculate Price Hotel Manager HTTP Search API Hotels Cassandra

29 API CHANGE Kafka Price & Inventory Consumer Hotel Manager A Consumer Hotels B Consumer

30 Camus

31 Slides available here. Source code available here.

32 REFERENCES developingwithapachekafka

33 Q & A
