STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day



1 STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day

2 Neha Narkhede. Co-founder and Head of Stealth Startup. Prior to this: Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza). One of the initial authors of Apache Kafka, committer and PMC member. Reach out

3 Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

4 The Data Needs Pyramid. Maslow's hierarchy of needs (base to top): Physiological, Safety, Love/Belonging, Esteem, Self actualization. Data needs (base to top): Data collection, Data processing, Understanding, Automation.

5 Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

6 Increase in diversity of data Database data (users, products, orders etc) Events (clicks, impressions, pageviews) Application logs (errors, service calls) Application metrics (CPU usage, requests/sec) Siloed data feeds IoT sensors


7 Explosion in diversity of systems Live Systems Voldemort Espresso GraphDB Search Samza Batch Hadoop Teradata

8 Data integration disaster Espresso Espresso Espresso Voldemort Voldemort Voldemort Oracle Oracle Oracle User Tracking Logs Operational Metrics Hadoop Log Search Monitoring Data Warehouse Social Graph Rec. Engine Search Security... Production Services


9 Centralized service Espresso Espresso Espresso Voldemort Voldemort Voldemort Oracle Oracle Oracle User Tracking Logs Operational Metrics Data Pipeline Hadoop Log Search Monitoring Data Warehouse Social Graph Rec Engine & Life Search Security... Production Services


10 Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

11 Kafka at 10,000 ft. Diagram: Producers -> Cluster of brokers -> Consumers. Distributed from ground up. Persistent. Multi-subscriber.

12 Key design principles. Scalability of a file system: hundreds of MB/sec/server throughput, many TBs per server. Guarantees of a database: messages strictly ordered, all data persistent. Distributed by default: replication model, partitioning model.


13 Kafka adoption

14 Apache Kafka at LinkedIn: 175 TB of in-flight log data per colo; low latency (~1.5 ms); replicated to each datacenter; tens of thousands of data producers; thousands of consumers; 7 million messages written/sec; 35 million messages read/sec; Hadoop integration

15 Logs The data structure every systems engineer should know

16 The Log 1st record next record written Ordered Append only Immutable

17 The Log: Partitioning Partition Partition Partition
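The partitioning idea on this slide can be sketched in a few lines. This is a minimal illustration, not Kafka's implementation: the class name `PartitionedLog` and the choice of routing by `hashCode()` are assumptions made here so that records with the same key always land in the same partition and keep their relative order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a log split into partitions, each an ordered, append-only list.
class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumption: route by key hash, so equal keys always share a partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the end of the chosen partition and returns the record's offset there.
    long append(String key, String value) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(value);
        return partition.size() - 1;
    }

    // Read-only view of one partition, in append order.
    List<String> read(int partition) {
        return Collections.unmodifiableList(partitions.get(partition));
    }
}
```

Ordering is guaranteed only within a partition; records in different partitions have no defined order relative to each other.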

18 Logs: pub/sub done right Data source writes reads reads Destination system A (time = 7) Destination system B (time = 11)
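What makes the log work for pub/sub is that each destination tracks its own position, so a slow reader never holds back a fast one. The sketch below is an illustration under assumptions: `AppendOnlyLog` and `LogReader` are hypothetical names, and the log is a single in-memory list rather than a replicated, on-disk structure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory log: records are only ever appended, never changed.
class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Returns the offset assigned to the new record.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Offset that the next appended record will receive.
    long endOffset() {
        return records.size();
    }

    // Returns up to max records starting at the given offset.
    List<String> read(long offset, int max) {
        int from = (int) Math.min(offset, records.size());
        int to = Math.min(from + max, records.size());
        return new ArrayList<>(records.subList(from, to));
    }
}

// Each destination system owns one reader and therefore one position.
class LogReader {
    private final AppendOnlyLog log;
    private long position = 0;

    LogReader(AppendOnlyLog log) {
        this.log = log;
    }

    // Reads the next batch and advances this reader's position only.
    List<String> poll(int max) {
        List<String> batch = log.read(position, max);
        position += batch.size();
        return batch;
    }

    long position() {
        return position;
    }
}
```

With two readers on the same log, one can sit at position 7 while the other is at 11, matching the "time = 7" and "time = 11" destinations on the slide.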

19 Logs for data integration User updates profile with new job KAFKA Newsfeed Search Hadoop Standardization engine

20 Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

21 Stream processing = f(log) Log A Job 1 Log B

22 Stream processing = f(log) Log A Job 1 Log B Log C Job 2 Log D Log E
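Read literally, "stream processing = f(log)" says a job is a function from input logs to output logs, and jobs compose by feeding one job's output log to the next. A minimal sketch, assuming a hypothetical `StreamJob.run` helper and in-memory lists standing in for logs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: a job is a function applied to every record of an input log.
class StreamJob {
    // Applies f to each input record in order; each record may yield zero or more outputs.
    static <A, B> List<B> run(List<A> inputLog, Function<A, List<B>> f) {
        List<B> outputLog = new ArrayList<>();
        for (A record : inputLog) {
            outputLog.addAll(f.apply(record));
        }
        return outputLog;
    }
}
```

Chaining two calls to `run`, with the first result passed as the second input, mirrors the Log A -> Job 1 -> Log B -> Job 2 picture. A real job would consume an unbounded log incrementally rather than a finished list.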

23 Apache Samza at LinkedIn User updates profile with new job KAFKA Newsfeed Search Hadoop Standardization engine

24 Latency spectrum of data systems RPC Latency Asynchronous processing (seconds to minutes) Synchronous (milliseconds) Batch (Hours)

25 Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

26 Samza API
public interface StreamTask { void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator); }
envelope: getkey(), getmsg(). collector: sendmsg(topic, key, value). coordinator: commit(), shutdown()
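To make the `process` callback concrete, here is a sketch of a task that filters events. It deliberately does not depend on Samza: the four interfaces are simplified stand-ins declared locally, reduced to the calls the slide lists, and the real `org.apache.samza` types have richer signatures (for example, sending goes through an outgoing envelope object). `PageViewFilterTask`, the `pageview:` prefix, and the `pageviews` topic name are all hypothetical.

```java
// Simplified stand-ins for the Samza types named on the slide. Assumption: these are
// NOT the real org.apache.samza interfaces, only the shape the slide describes.
interface IncomingMessageEnvelope {
    Object getKey();
    Object getMessage();
}

interface MessageCollector {
    void send(String topic, Object key, Object value);
}

interface TaskCoordinator {
    void commit();
    void shutdown();
}

interface StreamTask {
    void process(IncomingMessageEnvelope envelope,
                 MessageCollector collector,
                 TaskCoordinator coordinator);
}

// Hypothetical task: forwards only page-view events to a "pageviews" topic.
class PageViewFilterTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String event = String.valueOf(envelope.getMessage());
        if (event.startsWith("pageview:")) {
            // Keep the incoming key so downstream partitioning stays stable.
            collector.send("pageviews", envelope.getKey(), event);
        }
    }
}
```

The framework calls `process` once per message; the task itself never polls, which is what lets the framework own partition assignment and checkpointing.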

27 Samza Architecture (Logical view) Log A partition 0 partition 1 partition 2 Task 1 Task 2 Task 3 partition 0 partition 1 Log B

28 Samza Architecture (Logical view) Log A partition 0 partition 1 partition 2 Samza container 1 Samza container 2 Task 1 Task 2 Task 3 partition 0 partition 1 Log B

29 Samza Architecture (Physical view) Samza container 1 Samza container 2 Host 1 Host 2

30 Samza Architecture (Physical view) Node manager Node manager Samza container 1 Samza container 2 Samza YARN AM Host 1 Host 2

31 Samza Architecture (Physical view) Node manager Node manager Samza container 1 Samza container 2 Samza YARN AM Kafka Kafka Host 1 Host 2

32 Samza Architecture: Equivalence to Map Reduce Node manager Node manager Map Reduce Map Reduce YARN AM HDFS HDFS Host 1 Host 2

33 M/R Operation Primitives. Filter: records matching some condition. Map: record = f(record). Join: two or more datasets by key. Group: records with same key. Aggregate: f(records within the same group). Pipe: job 1's output => job 2's input.

34 M/R Operation Primitives on streams. Filter: records matching some condition. Map: record = f(record). Join: two or more datasets by key. Group: records with same key. Aggregate: f(records within the same group). Pipe: job 1's output => job 2's input. Requires state maintenance
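Why state is needed on streams: a filter or map can look at one record and forget it, but an aggregate over an unbounded stream has to remember something between records. A minimal sketch of a running count per key, with the hypothetical name `RunningCountByKey` and an in-memory map as the state:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateful operator: group by key and aggregate with a running count.
class RunningCountByKey {
    // The state that must survive between records (and, in production, between restarts).
    private final Map<String, Long> counts = new HashMap<>();

    // Processes one record and returns the updated count for its key.
    long process(String key) {
        return counts.merge(key, 1L, Long::sum);
    }

    long countFor(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```

In a batch job the same count is computed from a complete, bounded input, so nothing needs to be kept once the job finishes.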

35 Agenda Real-time Data Integration Introduction to Logs & Apache Kafka Logs & Stream processing Apache Samza Stateful stream processing

36 Example: Newsfeed User... posted "..." User 989 posted "Blah Blah" User 567 posted "Hello World" Status update log External connection DB Fan out messages to followers 567 -> [123, 679, 789,...] 999 -> [156, 343,... ] Push notification log Refresh user 123's newsfeed Refresh user 679's newsfeed Refresh user...'s newsfeed

37 Local state vs Remote state: Remote K msg/sec/node K msg/sec/node Samza task partition 0 Samza task partition 1 Performance Isolation Limited APIs Disk Remote state 1-5K queries/sec?? ex: Cassandra, MongoDB, etc

38 Local state: Bring data closer to computation Samza task partition 0 Samza task partition 1 Local Local LevelDB/RocksDB LevelDB/RocksDB

39 Local state: Bring data closer to computation Samza task partition 0 Samza task partition 1 Local Local LevelDB/RocksDB LevelDB/RocksDB Disk Change log stream

40 Example Revisited: Newsfeed User... posted "..." User 989 posted "Blah Blah" User 567 posted "Hello World" Status update log User... followed... User 123 followed 567 User 890 followed 234 New connection log Fan out messages to followers 567 -> [123, 679, 789,...] 999 -> [156, 343,... ] Push notification log Refresh user 123's newsfeed Refresh user 679's newsfeed Refresh user...'s newsfeed
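The revisited example replaces the remote connection lookup with local state that the task builds itself from the new connection log. The sketch below assumes a hypothetical `NewsfeedFanOut` class, an in-memory map as the local store, and that both logs are partitioned by the id of the user being followed or posting, so one task sees both kinds of event for that user.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical task joining the status update log against locally held follower state.
class NewsfeedFanOut {
    // Local state: user id -> followers, e.g. 567 -> [123, 679, 789].
    private final Map<Integer, Set<Integer>> followers = new HashMap<>();

    // Handles a record from the new connection log: "User <follower> followed <followed>".
    void onFollow(int follower, int followed) {
        followers.computeIfAbsent(followed, k -> new LinkedHashSet<>()).add(follower);
    }

    // Handles a record from the status update log and returns the messages
    // that would be written to the push notification log.
    List<String> onStatusUpdate(int author, String text) {
        List<String> notifications = new ArrayList<>();
        Set<Integer> targets = followers.getOrDefault(author, Collections.<Integer>emptySet());
        for (int follower : targets) {
            // The status text is not needed to build the refresh message in this sketch.
            notifications.add("Refresh user " + follower + "'s newsfeed");
        }
        return notifications;
    }
}
```

No network call happens per status update; the lookup is a local map read, which is the point of bringing the data closer to the computation.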

41 Fault tolerance? Node manager Node manager Samza container 1 Samza container 2 Samza YARN AM Kafka Kafka Host 1 Host 2

42 Fault tolerance in Samza Samza task partition 0 Samza task partition 1 Local Local LevelDB/RocksDB LevelDB/RocksDB Durable change log
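The recovery idea here is that every write to the local store is also appended to a durable change log, and a restarted task rebuilds its store by replaying that log. A minimal sketch, with assumptions stated: `ChangelogBackedStore` is a hypothetical name, a `HashMap` stands in for LevelDB/RocksDB, and a shared in-memory list stands in for the durable change log stream.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local key-value store whose writes are mirrored to a change log.
class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>();
    private final List<String[]> changelog;  // entries are {key, value}; null value = delete

    // Restores local state by replaying every change log entry in order.
    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            apply(entry[0], entry[1]);
        }
    }

    private void apply(String key, String value) {
        if (value == null) {
            local.remove(key);
        } else {
            local.put(key, value);
        }
    }

    // Records the change in the log first, then applies it locally.
    void put(String key, String value) {
        changelog.add(new String[] {key, value});
        apply(key, value);
    }

    void delete(String key) {
        put(key, null);
    }

    String get(String key) {
        return local.get(key);
    }
}
```

Constructing a second store over the same change log simulates a task restarting on another host: it ends up with the same contents as the store that was lost.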

43 Slow jobs Log A Job 1 Log B Log C Drop data Backpressure Job 2 Queue In memory On disk (KAFKA) Log D Log E
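The first two options on this slide, dropping data and backpressure, can be contrasted with a bounded in-memory queue between two jobs. This is an illustrative sketch only: `SlowConsumerBuffer` is a hypothetical name, and the third option (queueing on disk in Kafka, so the upstream job is never blocked) is not modeled here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded buffer between a fast upstream job and a slow downstream job.
class SlowConsumerBuffer {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    SlowConsumerBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Option "drop data": when the buffer is full the record is discarded.
    boolean offerOrDrop(String record) {
        boolean accepted = queue.offer(record);
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    // Option "backpressure": the producer blocks until the consumer frees a slot.
    void putWithBackpressure(String record) throws InterruptedException {
        queue.put(record);
    }

    // Called by the downstream job; returns null when the buffer is empty.
    String poll() {
        return queue.poll();
    }

    long dropped() {
        return dropped;
    }
}
```

Dropping loses data and backpressure propagates the slowdown upstream; buffering in a durable log avoids both at the cost of disk space.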

44 Summary. Real-time data integration is crucial for the success and adoption of stream processing. Logs form the basis for real-time data integration. Stream processing = f(logs). Samza is designed from the ground up for scalability and provides fault-tolerant, persistent state.

45 Thank you! The Log Apache Kafka Apache Samza
