Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

1 Storm: Distributed and fault-tolerant realtime computation. Nathan Marz, Twitter

2 Basic info: Open sourced September 19th; implementation is 15,000 lines of code; used by over 25 companies; >2400 watchers on Github (most watched JVM project); very active mailing list (>1800 messages, >560 members)

3 Before Storm: Queues, Workers

4 Example (simplified)

5 Example Workers schemify tweets and append to Hadoop

6 Example Workers update statistics on URLs by incrementing counters in Cassandra

7 Scaling: Deploy; Reconfigure/redeploy

8 Problems: Scaling is painful; Poor fault-tolerance; Coding is tedious

9 What we want: Guaranteed data processing; Horizontal scalability; Fault-tolerance; No intermediate message brokers!; Higher level abstraction than message passing; Just works

10 Storm: Guaranteed data processing; Horizontal scalability; Fault-tolerance; No intermediate message brokers!; Higher level abstraction than message passing; Just works

11 Use cases: Stream processing; Distributed RPC; Continuous computation

12 Storm Cluster

13 Storm Cluster Master node (similar to Hadoop JobTracker)

14 Storm Cluster Used for cluster coordination

15 Storm Cluster Run worker processes

16 Starting a topology

17 Killing a topology

18 Concepts: Streams, Spouts, Bolts, Topologies

19 Streams: Unbounded sequence of tuples

20 Spouts Source of streams

21 Spout examples: Read from Kestrel queue; Read from Twitter streaming API

22 Bolts Processes input streams and produces new streams

23 Bolts: Functions; Filters; Aggregation; Joins; Talk to databases

24 Topology Network of spouts and bolts

25 Tasks Spouts and bolts execute as many tasks across the cluster

26 Task execution Tasks are spread across the cluster

27 Task execution Tasks are spread across the cluster

28 Stream grouping When a tuple is emitted, which task does it go to?

29 Stream grouping: Shuffle grouping: pick a random task. Fields grouping: mod hashing on a subset of tuple fields. All grouping: send to all tasks. Global grouping: pick task with lowest id.
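
The four groupings on this slide can be sketched as small routing functions. This is a minimal illustration, not Storm's implementation: the function names, the use of a dict as a tuple, and the choice of CRC32 as the hash are all assumptions made for the example.

```python
import random
import zlib


def shuffle_grouping(task_ids, rng=random):
    """Shuffle grouping: pick one random task."""
    return [rng.choice(task_ids)]


def fields_grouping(task_ids, tup, fields):
    """Fields grouping: mod-hash a subset of the tuple's fields.

    Tuples with equal values in `fields` always map to the same task.
    CRC32 is an arbitrary stable hash chosen for this sketch.
    """
    key = repr([tup[name] for name in fields])
    index = zlib.crc32(key.encode("utf-8")) % len(task_ids)
    return [task_ids[index]]


def all_grouping(task_ids):
    """All grouping: every task receives the tuple."""
    return list(task_ids)


def global_grouping(task_ids):
    """Global grouping: the task with the lowest id receives the tuple."""
    return [min(task_ids)]
```

With a fields grouping on `url`, two tuples that share a URL but differ in other fields land on the same task, which is what makes per-key state such as counters safe to keep in memory.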

30 Topology (diagram; edge labels: shuffle, [id1, id2], shuffle, [url], shuffle, all)

31 Streaming word count TopologyBuilder is used to construct topologies in Java

32 Streaming word count Define a spout in the topology with parallelism of 5 tasks

33 Streaming word count Split sentences into words with parallelism of 8 tasks

34 Streaming word count: Split sentences into words with parallelism of 8 tasks. Consumer decides what data it receives and how it gets grouped.

35 Streaming word count Create a word count stream

36 Streaming word count splitsentence.py
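
The slide's code is not part of this transcript. As a rough stand-in, the data flow of the word count can be simulated in plain Python without the Storm library: sentences are split into words, and each word is routed to a counting task by hashing the word, so every occurrence of a word is counted in one place. The function names, the CRC32 hash, and the default of 4 counting tasks are assumptions for illustration only.

```python
import zlib
from collections import Counter


def split_sentence(sentence):
    """Emit one word per whitespace-separated token."""
    for word in sentence.split():
        yield word


def task_for(word, num_tasks):
    """Fields-grouping stand-in: the same word always maps to the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_word_count(sentences, num_count_tasks=4):
    """Route every word to one counting task and return the per-task counters."""
    counters = [Counter() for _ in range(num_count_tasks)]
    for sentence in sentences:
        for word in split_sentence(sentence):
            counters[task_for(word, num_count_tasks)][word] += 1
    return counters


def merged(counters):
    """Combine the per-task counters into one view, for inspection only."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total
```

Because routing depends only on the word, no two counting tasks ever hold a count for the same word, and no coordination between them is needed.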

37 Streaming word count

38 Streaming word count Submitting topology to a cluster

39 Streaming word count Running topology in local mode

40 Demo

41 Distributed RPC Data flow for Distributed RPC

42 DRPC Example Computing reach of a URL on the fly

43 Reach Reach is the number of unique people exposed to a URL on Twitter

44 Computing reach (diagram): URL -> Tweeters -> Followers -> Distinct followers -> Count -> Reach
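
On a single machine the reach computation reduces to a set union followed by a count. The sketch below assumes two hypothetical in-memory lookup tables, `tweeters_of` and `followers_of`, standing in for whatever databases hold that data.

```python
def reach(url, tweeters_of, followers_of):
    """Count the distinct followers of everyone who tweeted `url`."""
    distinct_followers = set()
    for tweeter in tweeters_of.get(url, ()):
        # A follower shared by several tweeters is only counted once.
        distinct_followers.update(followers_of.get(tweeter, ()))
    return len(distinct_followers)


# Toy data, invented for the example.
tweeters_of = {"http://example.com": ["alice", "bob"]}
followers_of = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}
example_reach = reach("http://example.com", tweeters_of, followers_of)
```

The expensive parts are the follower lookups and the deduplication over a potentially very large set, which is why the computation is worth parallelizing.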

45 Reach topology

46 Reach topology

47 Reach topology

48 Reach topology Keep set of followers for each request id in memory

49 Reach topology Update the followers set when a new follower is received

50 Reach topology Emit partial count after receiving all followers for a request id
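
The three steps on slides 48 to 50 can be sketched as a small class plus a driver. This is a simulation under assumptions, not the code from the slides: `PartialUniquer`, `reach_for_request`, the CRC32 hash, and the default of 3 tasks are hypothetical. Followers are partitioned by hashing the follower name, so a given follower always reaches the same task and the partial counts can simply be summed.

```python
import zlib
from collections import defaultdict


class PartialUniquer:
    """Keeps the set of followers seen for each request id in memory."""

    def __init__(self):
        self._followers = defaultdict(set)

    def execute(self, request_id, follower):
        # Update the followers set when a new follower is received.
        self._followers[request_id].add(follower)

    def finish(self, request_id):
        # Emit the partial count once all followers for the request arrived,
        # and drop the state for that request.
        return len(self._followers.pop(request_id, set()))


def reach_for_request(request_id, follower_stream, num_tasks=3):
    """Partition followers across tasks, then sum the partial counts."""
    tasks = [PartialUniquer() for _ in range(num_tasks)]
    for follower in follower_stream:
        index = zlib.crc32(follower.encode("utf-8")) % num_tasks
        tasks[index].execute(request_id, follower)
    return sum(task.finish(request_id) for task in tasks)
```

Summing is only correct because the partitioning sends duplicates of a follower to the same task; with random routing the same follower could be counted by two tasks.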

51 Demo

52 Guaranteeing message processing Tuple tree

53 Guaranteeing message processing A spout tuple is not fully processed until all tuples in the tree have been completed

54 Guaranteeing message processing If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

55 Guaranteeing message processing Reliability API

56 Guaranteeing message processing Anchoring creates a new edge in the tuple tree

57 Guaranteeing message processing Marks a single node in the tree as complete

58 Guaranteeing message processing Storm tracks tuple trees for you in an extremely efficient way
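
The slide does not say how the tracking works. Storm's documentation describes an XOR-based scheme: each tuple in a tree gets a random 64-bit id, and a single running XOR per spout tuple reaches zero exactly when every id has been seen twice (once when created, once when acked). The class below is a simplified sketch of that idea; `AckTracker` and its method names are hypothetical and are not Storm's API.

```python
import random


def new_tuple_id(rng=random):
    """Random non-zero 64-bit id for a tuple in the tree."""
    return rng.getrandbits(64) or 1


class AckTracker:
    """Tracks each tuple tree with one integer, regardless of tree size."""

    def __init__(self):
        self._pending = {}

    def track(self, root_id, tuple_id):
        """A spout emitted a tuple: start its tree with the tuple's id."""
        self._pending[root_id] = tuple_id

    def ack(self, root_id, input_id, child_ids=()):
        """A bolt finished `input_id` after emitting anchored `child_ids`.

        The input id and the new child ids are XORed in together, so the
        value cannot hit zero while children are still outstanding.
        Returns True when the whole tree is complete.
        """
        value = self._pending[root_id] ^ input_id
        for child_id in child_ids:
            value ^= child_id
        if value == 0:
            del self._pending[root_id]
            return True
        self._pending[root_id] = value
        return False

    def is_pending(self, root_id):
        return root_id in self._pending
```

Memory use per spout tuple is constant in this scheme, which is one way to read the slide's claim about efficiency. With random 64-bit ids, an accidental zero before the tree is finished is extremely unlikely but not impossible.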

59 Transactional topologies How do you do idempotent counting with an at-least-once delivery guarantee?

60 Transactional topologies Won't you overcount?

61 Transactional topologies Transactional topologies solve this problem

62 Transactional topologies Built completely on top of Storm's primitives of streams, spouts, and bolts

63 Transactional topologies (Batch 1, Batch 2, Batch 3) Process small batches of tuples

64 Transactional topologies (Batch 1, Batch 2, Batch 3) If a batch fails, replay the whole batch

65 Transactional topologies (Batch 1, Batch 2, Batch 3) Once a batch is completed, commit the batch

66 Transactional topologies (Batch 1, Batch 2, Batch 3) Bolts can optionally be committers

67 Transactional topologies (Commit 1, Commit 1, Commit 2, Commit 3, Commit 4, Commit 4) Commits are ordered. If there's a failure during commit, the whole batch + commit is retried

68 Example

69 Example New instance of this object for every transaction attempt

70 Example Aggregate the count for this batch

71 Example Only update database if transaction ids differ

72 Example This enables idempotency since commits are ordered

73 Example (Credit goes to Kafka devs for this trick)
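
The trick described on slides 70 to 72 can be sketched with an in-memory stand-in for the database. `CountStore` is a hypothetical name, and the sketch relies on the two conditions the slides state: commits arrive in transaction id order, and a replayed batch is identical to the original.

```python
class CountStore:
    """Stands in for the database: one (count, last txid) pair per key."""

    def __init__(self):
        self._rows = {}

    def commit(self, key, batch_count, txid):
        """Add this batch's count unless this txid was already applied."""
        count, last_txid = self._rows.get(key, (0, None))
        if last_txid != txid:
            # Count and txid are written together, so a retry of the same
            # commit sees the stored txid and becomes a no-op.
            self._rows[key] = (count + batch_count, txid)

    def count(self, key):
        return self._rows.get(key, (0, None))[0]
```

Storing only the most recent transaction id is enough because commits are ordered: once transaction 2 has been committed, transaction 1 will never be committed again.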

74 Transactional topologies Multiple batches can be processed in parallel, but commits are guaranteed to be ordered
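
One way to picture parallel processing with ordered commits is a buffer that holds finished batches until every earlier batch has been committed. The class below is an illustrative sketch only; `CommitSequencer` is a hypothetical name and does not describe how Storm's coordinator is implemented.

```python
class CommitSequencer:
    """Lets batches finish in any order but commits them by ascending txid."""

    def __init__(self, first_txid=1):
        self._next_txid = first_txid
        self._finished = {}
        self.committed = []  # (txid, result) pairs in commit order

    def batch_finished(self, txid, result):
        """Record a finished batch and commit every batch that is now unblocked."""
        self._finished[txid] = result
        while self._next_txid in self._finished:
            result_to_commit = self._finished.pop(self._next_txid)
            self.committed.append((self._next_txid, result_to_commit))
            self._next_txid += 1
```

If batches 2 and 3 finish before batch 1, nothing is committed until batch 1 finishes, at which point all three are committed in order.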

75 Transactional topologies: Will be available in the next version of Storm (0.7.0). Requires a source queue that can replay identical batches of messages. storm-kafka has a transactional spout implementation for Kafka.

76 Storm UI

77 Storm on EC2 One-click deploy tool

78 Starter code Example topologies

79 Documentation

80 Ecosystem: Scala, JRuby, and Clojure DSLs; Kestrel, AMQP, JMS, and other spout adapters; Serializers; Multilang adapters; Cassandra, MongoDB integration

81 Questions?

82 Future work: State spout; Storm on Mesos; Swapping; Auto-scaling; Higher level abstractions

83 Implementation KafkaTransactionalSpout

84 Implementation (diagram; edges labeled "all")

85 Implementation TransactionalSpout is a subtopology consisting of a spout and a bolt

86 Implementation The spout consists of one task that coordinates the transactions

87 Implementation The bolt emits the batches of tuples

88 Implementation The coordinator emits a batch stream and a commit stream

89 Implementation Batch stream

90 Implementation Commit stream

91 Implementation Coordinator reuses tuple tree framework to detect success or failure of batches or commits and replays appropriately
