Using Kafka to Optimize Data Movement and System Integration. Alex Holmes

1 Using Kafka to Optimize Data Movement and System Integration. Alex Holmes

3 THIS SUCKS: E T L (circa 2560 B.C.E.)

4 a few years later...

5 2,014 C.E. "I need you to copy data from Oracle to Hadoop..." 3 weeks later... "This sucks!"

6 Agenda: look at challenges with data and system integration; provide an overview of Kafka, including its performance; examine Kafka data and system integration patterns.

7 whoami: Alex Holmes. Software engineer. Working on distributed systems for many years. grepalex.com

8 Challenges with data integration

9 "I need you to copy data from Oracle into Hadoop for our data scientists"

10 Sqoop or custom ETL into Hadoop. Problems you'll need to solve: reliability/fault tolerance; scalability/throughput; handling large tables; throttling network IO; idempotent writes; scheduling.

11 "Oh, and we need to use the same data to calculate aggregates in real time"

12 Sqoop Hadoop? NoSQL

13 Hadoop, OLTP, OLAP / EDW, HBase, Cassandra, Voldemort, Rec. Engine, Analytics, Security, Search, Monitoring, Social Graph

14 Hadoop, OLTP, OLAP / EDW, HBase, Cassandra, Voldemort, all connected through Kafka to: Rec. Engine, Analytics, Security, Search, Monitoring, Social Graph

15 Kafka

16 Background: Apache project; originated at LinkedIn; open-sourced in 2011; written in Scala and Java; borrows concepts from messaging systems and logs.

17 Powered by: Twitter, LinkedIn, Netflix, Spotify, Mozilla

18 A log (partition in Kafka parlance), running from the first record to the most recent record

19 Producer-consumer system Producer Producer Producer Kafka Consumer Consumer Consumer

20 Kafka != JMS: doesn't use the JMS API; ordering guarantees; mid-to-long term message storage; rewind capabilities; many-to-many messaging; broker doesn't maintain delivery state.

21 Concepts: a Kafka cluster consists of 1..N brokers and manages 1..N topics. A topic is a category or feed name to which messages are published. Each topic has one or more partitions; a partition is an ordered, immutable sequence of messages that is continually appended to, much like a commit log.
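The partition-as-commit-log idea can be sketched in a few lines. This is a toy in-memory model, not Kafka's storage code; the `PartitionLog` class and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of messages."""
    messages: List[bytes] = field(default_factory=list)

    def append(self, message: bytes) -> int:
        """Append a message and return its offset (its position in the log)."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> List[bytes]:
        """Return up to max_messages starting at offset; reading removes nothing."""
        return self.messages[offset:offset + max_messages]


log = PartitionLog()
first = log.append(b"signup:alice")   # offset 0
second = log.append(b"login:alice")   # offset 1
```

Because reads never delete anything, two readers can call `log.read(0)` and both see the full history, which is the property the later slides on rewind and multiple consumers rely on.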

22 Data scaling and parallelism Partition Partition Partition

23 Partition distribution. Producer strategies: round-robin/load balancing, or a domain-specific semantic partitioning strategy. Diagram: three servers hosting partitions 0 through 5.
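The two producer strategies on this slide might look roughly like the following. The function names are hypothetical, and CRC32 is used only as a convenient stable hash; it is an assumption for the sketch, not the hash Kafka's own partitioner uses.

```python
import itertools
import zlib
from typing import Callable


def round_robin_partitioner(num_partitions: int) -> Callable[[], int]:
    """Return a function that spreads messages evenly across partitions."""
    counter = itertools.count()

    def choose() -> int:
        return next(counter) % num_partitions

    return choose


def keyed_partition(key: str, num_partitions: int) -> int:
    """Semantic strategy: the same key always maps to the same partition."""
    # zlib.crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Round-robin gives the best load balance but scatters related messages; keying on something like a user ID keeps all of that user's events in one partition, and therefore in order.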

24 Consumers. Diagram: three servers hosting partitions 0, 1 and 2, read by Consumer A, Consumer B and Consumer A of consumer group 1. Queue: balancing load. Pub-sub: all messages broadcast to all consumers.
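A rough sketch of how partitions could be shared out inside a consumer group. The `assign_partitions` helper and the consumer names are hypothetical, and simple round-robin is an assumption; Kafka's actual assignment logic is more involved.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group, round-robin."""
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


# Queue semantics: two consumers in one group split the partitions between them.
group_1 = assign_partitions([0, 1, 2], ["consumer-a", "consumer-b"])

# Pub-sub semantics: a second group is assigned every partition again,
# so it sees all of the messages independently of group 1.
group_2 = assign_partitions([0, 1, 2], ["consumer-c"])
```

With these inputs, `consumer-a` gets partitions 0 and 2 and `consumer-b` gets partition 1, matching the A, B, A layout in the diagram.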

25 Consumer state. High-level consumer: ZooKeeper. Low-level consumer (manage your own state): pick your own offset storage adventure. Diagram: two consumers, each at its own position in Partition 0 between the oldest and newest message.
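Since the broker does not track delivery state, a consumer's progress is just an offset stored somewhere. A minimal sketch, assuming a plain dict as the offset store; the `SimpleConsumer` class is hypothetical and is not Kafka's consumer API.

```python
from typing import Dict, List, Tuple


class SimpleConsumer:
    """Toy consumer that keeps its own position in a single partition."""

    def __init__(self, log: List[str], offset_store: Dict[Tuple[str, int], int],
                 group: str, partition: int = 0) -> None:
        self.log = log
        self.offset_store = offset_store
        self.key = (group, partition)
        # Resume from the last committed offset, or from the oldest message.
        self.position = offset_store.get(self.key, 0)

    def poll(self, max_messages: int = 2) -> List[str]:
        batch = self.log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch

    def commit(self) -> None:
        """Persist the current position to the offset store."""
        self.offset_store[self.key] = self.position

    def seek(self, offset: int) -> None:
        """Rewind (or skip ahead) by moving the position; the log is unchanged."""
        self.position = offset


log = ["m0", "m1", "m2", "m3"]
store: Dict[Tuple[str, int], int] = {}

consumer = SimpleConsumer(log, store, group="analytics")
first_batch = consumer.poll()
consumer.commit()

restarted = SimpleConsumer(log, store, group="analytics")
resumed_batch = restarted.poll()

restarted.seek(0)
replayed = restarted.poll(max_messages=4)
```

Swapping the dict for ZooKeeper, a database, or a file changes where the offset lives but not the logic, which is the "pick your own adventure" point of the slide.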

26 Partition replication Producers Server Server Server "Leader" partition "Follower" partition "Follower" partition Consumers
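A heavily simplified sketch of the leader/follower arrangement: writes go to the leader's log and followers copy whatever they are missing. The `ReplicatedPartition` class, the broker names, and the "most caught-up follower wins" failover rule are all assumptions for illustration; Kafka's real replication and leader election are considerably more careful.

```python
from typing import Dict, List


class ReplicatedPartition:
    """Toy model of one partition replicated across several brokers."""

    def __init__(self, replica_ids: List[str]) -> None:
        self.replicas: Dict[str, List[str]] = {rid: [] for rid in replica_ids}
        self.leader = replica_ids[0]

    def produce(self, message: str) -> int:
        """Producers append to the leader only; returns the message offset."""
        leader_log = self.replicas[self.leader]
        leader_log.append(message)
        return len(leader_log) - 1

    def replicate(self) -> None:
        """Followers copy the messages they have not yet seen from the leader."""
        leader_log = self.replicas[self.leader]
        for rid, follower_log in self.replicas.items():
            if rid != self.leader:
                follower_log.extend(leader_log[len(follower_log):])

    def fail_leader(self) -> None:
        """Drop the leader and promote the follower with the longest log."""
        del self.replicas[self.leader]
        self.leader = max(self.replicas, key=lambda rid: len(self.replicas[rid]))


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.produce("a")
partition.produce("b")
partition.replicate()
partition.fail_leader()
```

After the failover, `broker-2` leads and still holds both messages, which is the availability property the conclusion slide refers to.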

27 Performance + Scalability

28 Scalability Producer Producer Producer Kafka Server Kafka Server Kafka Server Consumer Consumer Consumer

29 Sequential disk IO. Diagram: Consumer A and Consumer B read from the disk while the producer writes to it.

30 O.S. page cache is leveraged. Diagram: Consumer A and Consumer B read, and the producer writes, through the OS page cache that sits in front of the disk.

31 Zero-copy IO: traditional vs. zero-copy
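The contrast between the two paths can be illustrated with Python's standard library. This is only an analogy for what the slide depicts: the function names are hypothetical, and `socket.sendfile()` uses the operating system's `sendfile` call where available and silently falls back to ordinary sends elsewhere.

```python
import os
import socket
import tempfile


def send_file_traditional(path: str, sock: socket.socket, chunk_size: int = 4096) -> int:
    """Copy file bytes into user space with read(), then back out with sendall()."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sock.sendall(chunk)
            sent += len(chunk)
    return sent


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Ask the kernel to move file bytes straight to the socket where supported."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo(send_fn):
    """Send a small temp file over a local socket pair; return (bytes sent, intact?)."""
    payload = b"kafka-log-segment " * 50
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    writer, reader = socket.socketpair()
    try:
        sent = send_fn(tmp.name, writer)
        writer.close()  # signals end-of-stream to the reader
        received = b""
        while True:
            chunk = reader.recv(4096)
            if not chunk:
                break
            received += chunk
    finally:
        writer.close()
        reader.close()
        os.unlink(tmp.name)
    return sent, received == payload
```

Both functions deliver the same bytes; the difference is how many times the data is copied between kernel and user space on the way.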

32 Throughput. Diagram: three producers at 2,024,032 TPS, 2ms, three Kafka servers, three consumers at 2,615,968 TPS.

33 Kafka data integration patterns

34 External data subscription Subscriber Subscriber Your App Kafka Subscriber Subscriber

35 Internal data subscription: Your App publishes to Kafka, which feeds Hadoop, Analytics, Recommendations, Monitoring and Security.

36 Stream processing Your App Kafka Your App NoSQL Your App

37 Lambda architecture Kafka NoSQL Hadoop

38 Multi-datacenter Datacenter 1 Datacenter 2 Kafka Kafka Datacenter 3 Kafka

39 Additional patterns: partition on user ID so all events for a user land on a single partition; publish system and application logs for downstream aggregation and monitoring; use Kafka as the system of record for your data.

40 Data standardization: it is important to pick a lingua franca for your data. Considerations should include compactness, schema evolution, tooling and third-party support. Avro is a good fit (coupled with Parquet for on-disk storage).
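To make "schema evolution" concrete, here is a sketch of an Avro-style record schema written as a Python dict, plus a toy resolver that mimics one Avro rule: a field missing from old data takes the reader schema's default. The `UserEvent` schema, its fields, and the `resolve` helper are hypothetical; a real pipeline would use an Avro library rather than this hand-rolled logic.

```python
from typing import Any, Dict

# Version 2 of a hypothetical schema: "country" was added later, with a default
# so that records written with version 1 can still be read.
USER_EVENT_V2: Dict[str, Any] = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action", "type": "string"},
        {"name": "country", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Fill fields absent from the record using the reader schema's defaults."""
    resolved: Dict[str, Any] = {}
    for schema_field in reader_schema["fields"]:
        name = schema_field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in schema_field:
            resolved[name] = schema_field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"user_id": "u1", "action": "login"}   # written before "country" existed
upgraded = resolve(old_record, USER_EVENT_V2)
```

Adding fields with defaults is what lets producers and the many downstream consumers of a topic upgrade at different times.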

41 Conclusion: Kafka is a scalable system with useful traits such as ordered messages, durability and availability. It is popular for data movement and for feeding multiple downstream systems using the same data pipe. Excellent documentation available at

42 Shameless plug Book signing at the JavaOne 1pm tomorrow CON The Top 10 Hadoop Patterns and 11am tomorrow Parc 55 - Embarcadero
