Real Time Analytics for Big Data. Nati Shalom @natishalom



Transcription:

Real Time Analytics for Big Data: A Twitter Inspired Case Study. Nati Shalom @natishalom

Big Data Predictions: "Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use." Edd Dumbill, O'Reilly

The Two Vs of Big Data: Velocity, Volume
Copyright 2011 Gigaspaces Ltd. All Rights Reserved

We're Living in a Real Time World: Social, User Tracking & Engagement, Homeland Security, eCommerce, Financial Services, Real Time Search

The Flavors of Big Data Analytics: Counting, Correlating, Research

Analytics @ Twitter: Counting
- How many signups, tweets, retweets for a topic?
- What's the average latency?
- Demographics: countries and cities, gender, age groups, device types

Analytics @ Twitter: Correlating
- What devices fail at the same time?
- What features get users hooked?
- What places on the globe are happening?

Analytics @ Twitter: Research
- Sentiment analysis: "Obama is popular"
- Trends: "People like to tweet after watching American Idol"
- Spam patterns: How can you tell when a user spams?

It's All about Timing
- Real time (< few seconds)
- Reasonably quick (seconds to minutes)
- Batch (hours/days)

It's All about Timing
- Event driven / stream processing: high resolution, every tweet gets counted
- Ad hoc querying: medium resolution (aggregations)
- Long running batch jobs (ETL, map/reduce): low resolution (trends & patterns)
This is what we're here to discuss.

Challenge: Word Count
Tweets -> Count -> Word:Count
- Hottest topics
- URL mentions
- etc.

URL Mentions: Here's One Use Case

Twitter in Numbers (March 2011): It takes a week for users to send 1 billion Tweets. Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011): On average, 140 million tweets get sent every day. Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011): The highest throughput to date is 6,939 tweets/sec. Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011): 460,000 new accounts are created daily. Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers: 5% of the users generate 75% of the content. Source: http://www.sysomos.com/insidetwitter/
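
As a rough sanity check on the figures above, the daily and weekly totals can be converted to an average per-second rate and compared with the quoted peak. This is only a back-of-the-envelope sketch; the variable names are illustrative and not taken from the slides.

```python
# Convert the quoted Twitter totals (March 2011) into per-second rates.
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000      # "140 million tweets get sent every day"
tweets_per_week = 1_000_000_000   # "a week for users to send 1 billion Tweets"
peak_tweets_per_sec = 6_939       # "highest throughput to date"

avg_from_daily = tweets_per_day / SECONDS_PER_DAY
avg_from_weekly = tweets_per_week / (7 * SECONDS_PER_DAY)
peak_to_avg = peak_tweets_per_sec / avg_from_daily

print(f"average from daily figure:  {avg_from_daily:.0f} tweets/sec")
print(f"average from weekly figure: {avg_from_weekly:.0f} tweets/sec")
print(f"peak is {peak_to_avg:.1f}x the daily average")
```

The two averages agree closely (both in the range of 1,600 to 1,700 tweets/sec), and the peak is roughly four times the average, which suggests sizing a system for the peak rather than the mean.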

Analyze the Problem
- (Tens of) thousands of tweets per second to process. Assumption: need to process in near real time
- Aggregate counters for each word: a few tens of thousands of words (or hundreds of thousands if we include URLs)
- System needs to scale linearly
- System needs to be fault tolerant
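
The per-word aggregate counters described above can be sketched on a single node as follows. This is a minimal illustration, not the code from the talk: the stop-word list, the token pattern, and the sample tweets are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real deployment would use a larger one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

# A token is either a URL or a word, optionally prefixed with # or @.
TOKEN_RE = re.compile(r"https?://\S+|[#@]?\w+")


def tokenize(tweet: str) -> list[str]:
    """Split a tweet into tokens; lowercase words but leave URLs untouched."""
    return [t if "://" in t else t.lower() for t in TOKEN_RE.findall(tweet)]


def count_tweet(counts: Counter, tweet: str) -> None:
    """Update the running word counters with the tokens of one tweet."""
    counts.update(t for t in tokenize(tweet) if t not in STOP_WORDS)


counts = Counter()
for tweet in ["Big data is big http://example.com/a", "Real time big data"]:
    count_tweet(counts, tweet)

print(counts.most_common(3))
```

A single in-process `Counter` will not meet the scaling and fault-tolerance requirements listed above; it only shows the shape of the computation that has to be distributed.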

Key Elements in Real Time Big Data Analytics

Sharding (Partitioning)
Tokenizer 1, Filterer 1, Counter Updater 1
Tokenizer 2, Filterer 2, Counter Updater 2
Tokenizer 3, Filterer 3, Counter Updater 3
...
Tokenizer n, Filterer n, Counter Updater n
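
One common way to shard counters is to route each key to a partition with a stable hash, so that every update for a given word lands on the same counter updater. The sketch below assumes the word itself is the routing key and uses CRC32 as the hash; both choices, and the `ShardedCounters` class, are illustrative assumptions rather than details from the slides.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a hash that is stable across processes.

    Python's built-in hash() is salted per process for strings, so CRC32 is used.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ShardedCounters:
    """Hypothetical stand-in for n counter updaters, one Counter per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [Counter() for _ in range(num_partitions)]

    def increment(self, word: str, amount: int = 1) -> None:
        index = partition_for(word, len(self.partitions))
        self.partitions[index][word] += amount

    def get(self, word: str) -> int:
        index = partition_for(word, len(self.partitions))
        return self.partitions[index][word]


sharded = ShardedCounters(num_partitions=4)
for word in ["big", "data", "big", "storm"]:
    sharded.increment(word)

print(sharded.get("big"))
```

Because a word always maps to one partition, no cross-partition coordination is needed to update or read its counter. Note that simple modulo routing reshuffles most keys when the number of partitions changes.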

Keep Things In Memory
- Facebook keeps 80% of its data in memory (Stanford research)
- RAM is 100 to 1000x faster than disk (random seek)
- Disk: 5 to 10 ms
- RAM: ~0.001 ms

Use EDA (Event Driven Architecture)
Raw -> Tokenizer -> Tokenized -> Filterer -> Filtered -> Counter
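
The flow above, where each processor reacts to one event type and emits the next, can be sketched with a tiny in-process publish/subscribe bus. The `EventBus` class, the event names, and the stop-word list are hypothetical; in a real deployment the events would travel through a messaging layer or data grid rather than direct function calls.

```python
from collections import Counter, defaultdict


class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
counts = Counter()
STOP_WORDS = {"a", "is", "the"}  # hypothetical filter list


def tokenizer(raw_tweet: str) -> None:
    # Raw -> Tokenized
    bus.publish("tokenized", raw_tweet.lower().split())


def filterer(tokens: list) -> None:
    # Tokenized -> Filtered
    bus.publish("filtered", [t for t in tokens if t not in STOP_WORDS])


def counter(tokens: list) -> None:
    # Filtered -> counters updated
    counts.update(tokens)


bus.subscribe("raw", tokenizer)
bus.subscribe("tokenized", filterer)
bus.subscribe("filtered", counter)

bus.publish("raw", "The big data is big")
print(counts)
```

Each stage only knows the event type it consumes and the one it emits, so stages can be scaled or replaced independently.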

Know Your Toolset

References
- Learn and fork the code on GitHub: https://github.com/gigaspaces/rt analytics
- Detailed blog post: http://bit.ly/gs bigdata analytics
- Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
- Twitter Storm: http://bit.ly/twitter storm
- Apache S4: http://incubator.apache.org/s4/
