Real Time Analytics for Big Data: A Twitter Inspired Case Study. Nati Shalom, @natishalom
Big Data Predictions: "Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real time, analysis and processing. In the same way that Hadoop has been borne out of large scale web applications, these platforms will be driven by the needs of large scale location aware mobile, social and sensor use." Edd Dumbill, O'Reilly
The Two Vs of Big Data: Velocity and Volume. Copyright 2011 Gigaspaces Ltd. All Rights Reserved
We're Living in a Real Time World: Social, User Tracking & Engagement, Homeland Security, eCommerce, Financial Services, Real Time Search
The Flavors of Big Data Analytics: Counting, Correlating, Research
Analytics @ Twitter: Counting. How many signups, tweets, retweets for a topic? What's the average latency? Demographics: countries and cities, gender, age groups, device types.
Analytics @ Twitter: Correlating. What devices fail at the same time? What features get users hooked? What places on the globe are happening?
Analytics @ Twitter: Research. Sentiment analysis: "Obama is popular". Trends: "People like to tweet after watching American Idol". Spam patterns: How can you tell when a user spams?
It's All about Timing: Real time (< few seconds), Reasonably quick (seconds to minutes), Batch (hours/days)
It's All about Timing. Event driven / stream processing: high resolution, every tweet gets counted. Ad hoc querying: medium resolution (aggregations). This is what we're here to discuss. Long running batch jobs (ETL, map/reduce): low resolution (trends & patterns).
Challenge: Word Count. Tweets -> Count -> Word:Count. Hottest topics, URL mentions, etc.
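The word count challenge can be sketched in a few lines of Python. This is a minimal single-process illustration, not the implementation from the talk: the tokenizing rules, the `tokenize` and `count_words` helpers, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenizing rule: words are runs of letters, digits, '_', '#', '@' or "'".
WORD_RE = re.compile(r"[a-z0-9_#@']+")


def tokenize(tweet):
    """Split a tweet into tokens, keeping URLs whole so mentions can be counted."""
    tokens = []
    for raw in tweet.split():
        if raw.startswith(("http://", "https://")):
            tokens.append(raw)  # URL paths can be case sensitive, so keep as-is
        else:
            tokens.extend(WORD_RE.findall(raw.lower()))
    return tokens


def count_words(tweets):
    """Aggregate a Word:Count mapping over a stream of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


# Hypothetical sample data.
sample_tweets = [
    "Big data is big",
    "Real time analytics http://example.com/a",
    "big news http://example.com/a",
]
counts = count_words(sample_tweets)
hottest = counts.most_common(2)  # the two most frequent tokens
print(hottest)
```

The same `Counter` answers both the "hottest topics" and the "URL mentions" questions, because URLs are counted as ordinary tokens.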
URL Mentions: Here's One Use Case
Twitter in Numbers (March 2011): It takes a week for users to send 1 billion Tweets. Source: http://blog.twitter.com/2011/03/numbers.html
Twitter in Numbers (March 2011): On average, 140 million tweets get sent every day. Source: http://blog.twitter.com/2011/03/numbers.html
Twitter in Numbers (March 2011): The highest throughput to date is 6,939 tweets/sec. Source: http://blog.twitter.com/2011/03/numbers.html
Twitter in Numbers (March 2011): 460,000 new accounts are created daily. Source: http://blog.twitter.com/2011/03/numbers.html
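A quick back-of-the-envelope calculation relates the March 2011 figures to each other. The sketch below uses only the numbers quoted on these slides; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000   # "140 million tweets get sent every day"
peak_tweets_per_sec = 6_939    # "highest throughput to date"

avg_tweets_per_sec = tweets_per_day / SECONDS_PER_DAY
peak_to_avg = peak_tweets_per_sec / avg_tweets_per_sec
tweets_per_week = tweets_per_day * 7

print(f"average rate: {avg_tweets_per_sec:,.0f} tweets/sec")   # about 1,620
print(f"peak is {peak_to_avg:.1f}x the average")                # about 4.3x
print(f"weekly volume: {tweets_per_week:,} tweets")             # 980,000,000
```

The weekly volume of roughly 980 million agrees with the "1 billion Tweets a week" figure, and the peak rate is a little over four times the daily average, which is the burst a real time system has to absorb.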
Twitter in Numbers: 5% of the users generate 75% of the content. Source: http://www.sysomos.com/insidetwitter/
Analyze the Problem: (Tens of) thousands of tweets per second to process. Assumption: need to process in near real time. Aggregate counters for each word: a few tens of thousands of words (or hundreds of thousands if we include URLs). System needs to scale linearly. System needs to be fault tolerant.
Key Elements in Real Time Big Data Analytics
Sharding (Partitioning): Tokenizer 1 -> Filterer 1 -> Counter Updater 1; Tokenizer 2 -> Filterer 2 -> Counter Updater 2; Tokenizer 3 -> Filterer 3 -> Counter Updater 3; ... Tokenizer n -> Filterer n -> Counter Updater n
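One common way to shard counters is to route each word to a partition by hashing it, so that all updates for the same word land on the same partition. The sketch below is an in-process illustration under assumptions: the partition count, the use of `zlib.crc32` as the routing hash, and the helper names are not taken from the talk.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 3  # assumption: three partitions, as in "Counter Updater 1..3"


def partition_for(word, num_partitions=NUM_PARTITIONS):
    """Pick a partition with a stable hash.

    zlib.crc32 is used instead of the built-in hash(), which is randomized
    per process for strings and so cannot be used for routing across nodes.
    """
    return zlib.crc32(word.encode("utf-8")) % num_partitions


# Each Counter stands in for the state owned by one Counter Updater.
partitions = [Counter() for _ in range(NUM_PARTITIONS)]


def update_counter(word):
    """Route the increment to the partition that owns this word."""
    partitions[partition_for(word)][word] += 1


def total_count(word):
    """Read a count back from the single partition that owns the word."""
    return partitions[partition_for(word)][word]


for word in ["big", "data", "big", "storm", "data", "big"]:
    update_counter(word)

print(total_count("big"))
```

Because a word always maps to one partition, no cross-partition coordination is needed to update or read its counter, which is what lets the design scale by adding partitions.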
Keep Things In Memory: Facebook keeps 80% of its data in memory (Stanford research). RAM is 100-1000x faster than disk (random seek). Disk: 5-10 ms. RAM: ~0.001 ms.
Use EDA (Event Driven Architecture): Raw -> Tokenizer -> Tokenized -> Filterer -> Filtered -> Counter
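The Raw, Tokenized, Filtered flow can be sketched as stages that react to events arriving on queues. This is a minimal single-machine illustration using Python's standard `queue` and `threading` modules; the stage functions, the `STOP` sentinel, and the stop-word list are assumptions made for the example, not the design from the talk.

```python
import queue
import threading
from collections import Counter

STOP = object()                  # sentinel event that shuts the pipeline down
STOP_WORDS = {"a", "the", "is"}  # hypothetical filter list
counts = Counter()


def run_stage(inbox, outbox, handler):
    """Consume events from inbox, apply handler, emit results to outbox."""
    while True:
        event = inbox.get()
        if event is STOP:
            if outbox is not None:
                outbox.put(STOP)  # propagate shutdown to the next stage
            return
        for result in handler(event):
            if outbox is not None:
                outbox.put(result)


def tokenize(tweet):             # Raw -> Tokenized
    yield tweet.lower().split()


def filter_tokens(tokens):       # Tokenized -> Filtered
    yield [t for t in tokens if t not in STOP_WORDS]


def count(tokens):               # Filtered -> Counter
    counts.update(tokens)
    return ()                    # last stage emits nothing


raw, tokenized, filtered = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=run_stage, args=(raw, tokenized, tokenize)),
    threading.Thread(target=run_stage, args=(tokenized, filtered, filter_tokens)),
    threading.Thread(target=run_stage, args=(filtered, None, count)),
]
for stage in stages:
    stage.start()

for tweet in ["The data is big", "big data is a trend"]:
    raw.put(tweet)
raw.put(STOP)

for stage in stages:
    stage.join()
print(counts)
```

Each stage knows only its inbox and outbox, so a stage can be replaced or run in several copies without changing the others.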
Know Your Toolset
References. Learn and fork the code on github: https://github.com/gigaspaces/rt-analytics. Detailed blog post: http://bit.ly/gs-bigdata-analytics. Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html. Twitter Storm: http://bit.ly/twitter-storm. Apache S4: http://incubator.apache.org/s4/