Fast Data in the Era of Big Data: Twitter s Real-



Similar documents
Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Fast Data in the Era of Big Data: Twitter s Real-Time Related Query Suggestion Architecture

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Cleveland State University

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

L1: Introduction to Hadoop

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

A Performance Analysis of Distributed Indexing using Terrier

Hadoop Ecosystem B Y R A H I M A.

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Big Data and Apache Hadoop s MapReduce

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

BIG DATA What it is and how to use?

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Distributed Computing and Big Data: Hadoop and MapReduce

Hadoop: Embracing future hardware

Implement Hadoop jobs to extract business value from large and varied data sets

Improve performance and availability of Banking Portal with HADOOP

How To Handle Big Data With A Data Scientist

Hadoop implementation of MapReduce computational model. Ján Vaňo

ITG Software Engineering

Big Data Course Highlights

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop Job Oriented Training Agenda

Large scale processing using Hadoop. Ján Vaňo

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

Hadoop. Sunday, November 25, 12

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

I/O Considerations in Big Data Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Data Mining in the Swamp

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Manifest for Big Data Pig, Hive & Jaql

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

Workshop on Hadoop with Big Data

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

How To Scale Out Of A Nosql Database

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Big Data and Analytics: Challenges and Opportunities

Are You Ready for Big Data?

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Introduction to Big Data Training

Architectures for Big Data Analytics A database perspective

Hadoop Scalability at Facebook. Dmytro Molkov YaC, Moscow, September 19, 2011

Large-Scale Test Mining

How Companies are! Using Spark

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Apache Hadoop. Alexandru Costan

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Using In-Memory Computing to Simplify Big Data Analytics

Big Data Processing with Google s MapReduce. Alexandru Costan

Massive Cloud Auditing using Data Mining on Hadoop

Peers Techno log ies Pv t. L td. HADOOP

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Testing Big data is one of the biggest

ITG Software Engineering

BIG DATA - HADOOP PROFESSIONAL amron

the missing log collector Treasure Data, Inc. Muga Nishizawa

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Design and Evolution of the Apache Hadoop File System(HDFS)

Complete Java Classes Hadoop Syllabus Contact No:

COMP9321 Web Application Engineering

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Qsoft Inc

A Distributed Network Security Analysis System Based on Apache Hadoop-Related Technologies. Jeff Springer, Mehmet Gunes, George Bebis

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Hadoop and Map-Reduce. Swati Gore

Moving From Hadoop to Spark

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Introduction To Hive

Policy-based Pre-Processing in Hadoop

Transcription:

Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 2

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 3

Motivation & Background Develop a real time query suggestion system The figures are taken from https://blog.twitter.com/2012/related-queries-and-spelling-corrections-search 4

Motivation & Background Develop a real time query suggestion system Example: When Marissa Mayer was in Google Marissa Mayer Query Yahoo Query 5

Motivation & Background Develop a real time query suggestion system Example: When Marissa Mayer was in Google Marissa Mayer Query Yahoo Query When Marissa Mayer scheduled for Yahoo CEO Marissa Mayer Query Yahoo Query 6

Motivation & Background Goal Provide Relevant Related Query Suggestions within 10 Minutes of Major Events 7

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 8

Contributions Introduce real time related query suggestion problem Explain two solutions: First Solution: using Hadoop Second Solution: using in memory processing engine Suggest future work to reduce the gap between big data and fast data 9

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 10

Real Time Query Suggestion Good related query suggestions provide: Topicality Temporality Topicality: capture same topic Temporality: capture temporal connection #SCOTUS suggestions: healthcare and #aca Marissa Mayer example 11

Real Time Query Suggestion Time constrain to include news breaks When is the best time to make suggestions? Too early: No enough evidences Too late: User experience 12

Real Time Query Suggestion Steve Jobs died: steve jobs becomes 15% stay foolish and apple Window size: 10 minutes The figure is taken from the paper Fast Data in the Era of Big Data: Twitter s Real-Time Related Query Suggestion Architecture 13

Real Time Query Suggestion Algorithm Query A and B are seen in same context A and B are related queries Context can be: User search session Tweet itself 14

Real Time Query Suggestion Algorithm User search session Submit Query #Oscars2015 15

Real Time Query Suggestion Algorithm User search session Submit Query #Oscars2015 Submit Query Interstellar Tweet: Terms in the tweet are related 16

Real Time Query Suggestion Algorithm A is before B in time B is interested to users who liked A A and B are similar and B has more results B is spelling correction of A Measures relatedness between query A and B Decays measurement with time 17

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 18

First Solution (Hadoop) First Why to use Hadoop?! 19

First Solution (Hadoop) Twitter has robust and production Hadoop cluster Twitter has incorporated components on top of Hadoop Pig, Hive, ZooKeeper and Vertica Use Oink pig flow manager The first version was developed in Pig and Java UDF 20

First Solution (Hadoop) Using pig script to: Aggregate user search session Compute term and co-occurrence statistics Rank related queries and spelling correction Frontend loads outputs and serves requests Unacceptable latency (several hours!) 21

First Solution (Hadoop) Two bottlenecks Log Import Hadoop The figure is taken from the paper Fast Data in the Era of Big Data: Twitter s Real-Time Related Query Suggestion Architecture 22

First Solution (Hadoop) Hadoop delay Resource contention Mapreduce jobs took 15-20 minutes Stragglers 23

First Solution (Hadoop) Hadoop delay Resource contention Mapreduce jobs took 15-20 minutes Stragglers Hadoop is not designed for latency sensitive jobs 24

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 25

Second Solution The figure is taken from the paper Fast Data in the Era of Big Data: Twitter s Real-Time Related Query Suggestion Architecture 26

Second Solution Every 5 minutes: Results are stored in HDFS Cold Restart: Read from HDFS Replication The figure is taken from the paper Fast Data in the Era of Big Data: Twitter s Real-Time Related Query Suggestion Architecture 27

Second Solution (In-Memory Stores) Session stores (sliding window): User session: Queries and co-occurrence queries Query statistics stores: Query statistics and decay weights Query co-occurrence statistics stores: Query pairs statistics Store query before\after in user session 28

Second Solution (Data Flow) When new query arrives (Query Path) Update query statistics Add query to sessions store For each previous query in user session & the new query Update query co-occurrence statistics store 29

Second Solution (Data Flow) When new tweet arrives (Tweet Path) Retrieve its n-grams Check if they occurred before as queries Repeat query Path for each query 30

Second Solution (Data Flow) Decay/Prune Cycles Decay all weights periodically Remove queries and co-occurrence queries <= threshold Remove users sessions with no recent activities 31

Second Solution (Data Flow) Ranking Cycles Periodic process to rank queries Uses queries statistics For each query: it generates suggestions 32

Second Solution (Scalability) CPU limitation One server needs to consume query hose and fire hose Turn out not a limitation Memory limitation Memory size vs. Coverage 33

Second Solution (Background Models) Previous model limited to temporal coverage Solution: Run background process over older data For spelling correction: Form pairwise edit distance between all queries Results are stored in HDFS Frontend cache combines real time & background results 34

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 35

Future Work Automatically perform pruning when memory is needed Single unified data platform to deal with real time and slower moving suggestions (fast data + big data) 36

AGENDA Motivation & Background Contributions Real-Time Query Suggestion First Solution Second Solution Future work Conclusion Discussion 37

Conclusion The paper proposed two solutions for real time related query suggestion The first solution was using Hadoop The second solution was using in memory approach 38

Thank you Any Questions 39

Discussion No experimental results? Memory size vs. coverage trade off, how to reduce the gap? A distributed in memory system? (challenges) How to decide automatically which data to prune? Would sampling help to solve log import bottleneck in first solution? How? How to use other information like click graph with in memory structures to enhance the ranking? 40