Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture




Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin
Presented by Adeniyi Abdul (2522715)

Agenda
Abstract
Introduction
First Solution
Second Solution (Deployed Solution)
Conclusion

Abstract
This presentation is based on the architecture of Twitter's real-time related query suggestion and spelling correction service. After significant breaking news events, Twitter aims to provide relevant results within minutes, with a target of roughly ten minutes. Two systems were built to achieve this target, but only one was eventually deployed. The first implementation was based on a typical Hadoop-based analytics stack; the second, which was eventually deployed, is a custom in-memory processing engine.

Introduction
The architecture behind Twitter's real-time related query suggestion and spelling correction service is collectively called "search assistance" at Twitter. Though a common concept, the Twitter context introduces a real-time "twist": after significant breaking news events, Twitter aims to provide relevant search results within minutes. At Twitter, search assistance must be provided in real time and must dynamically adapt to the rapidly evolving "global conversation". The architecture considers three aspects of data, volume, velocity, and variety, and it addresses the challenges of real-time data processing in the era of "big data". Twitter's search assistance workload exposed a flaw in the typical usage of Hadoop as a "big data" platform: it is not well suited to low-latency processing.

First Solution: Hadoop
The first solution sought to take advantage of Twitter's existing analytics platform, Hadoop. Incorporated into its Hadoop platform are components such as Pig, HBase, ZooKeeper, and Vertica. The first version of search assistance was built using this platform.
Hadoop Platform @ Twitter
Data is written to the Hadoop Distributed File System (HDFS) via a number of real-time and batch processes. Instead of directly writing Hadoop code in Java, analytics at Twitter is performed mostly using Pig, a high-level dataflow language that compiles into physical plans executed on Hadoop. Pig provides concise primitives for expressing operations such as projection, selection, group, and join.

Hadoop Platform @ Twitter

Production Pig analytics jobs are coordinated by a workflow manager called Oink. Oink schedules recurring jobs at fixed intervals (hourly or daily) and handles dataflow dependencies between jobs: if job B requires data generated by job A, Oink will schedule A, verify that A has completed successfully, and then schedule B. Oink also preserves execution traces for audit purposes: when a job began, how long it lasted, and whether it completed successfully. Oink schedules hundreds of Pig scripts on a daily basis, which translates into thousands of Hadoop jobs. The first version of search assistance was written in Pig, with custom Java UDFs for computation that could not be expressed with Pig primitives (an example sketch follows).
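To illustrate what such a UDF looks like, here is a minimal sketch using Pig's standard EvalFunc interface; the class name and its normalization logic are hypothetical, not Twitter's actual UDFs:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical Pig UDF: normalizes a raw query string (lowercase, trim,
// collapse whitespace) so that variants of the same query aggregate together.
public class NormalizeQuery extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String query = (String) input.get(0);
        return query.trim().toLowerCase().replaceAll("\\s+", " ");
    }
}
```

In a Pig script, such a class would be packaged in a jar, loaded with REGISTER, and then invoked like any built-in function.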

A Pig script that aggregates user search sessions, computes term and query co-occurrence statistics, and ranks related queries and spelling suggestions would run on this Hadoop stack. Although the system worked reasonably well in terms of output, latency was measured in hours, a far cry from the targeted ten minutes. The latency is primarily attributed to two sources: the data import pipeline, which moves data from tens of thousands of production hosts onto HDFS, and the MapReduce jobs themselves.
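For concreteness, the heart of that computation, counting how often pairs of queries co-occur within a user session, can be sketched in plain Java as follows. This is a single-process illustration under an assumed data shape (each session as a list of query strings), not the distributed Pig implementation:

```java
import java.util.*;

// Minimal sketch of query co-occurrence counting over search sessions.
// Queries that appear in the same session count as a co-occurring pair.
public class CooccurrenceCounter {
    public static Map<String, Integer> count(List<List<String>> sessions) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> session : sessions) {
            // Deduplicate within a session so repeats don't inflate counts.
            List<String> unique = new ArrayList<>(new TreeSet<>(session));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    String pair = unique.get(i) + "\t" + unique.get(j);
                    pairCounts.merge(pair, 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        List<List<String>> sessions = List.of(
            List.of("earthquake", "earthquake japan", "tsunami"),
            List.of("earthquake", "tsunami"));
        count(sessions).forEach((pair, n) -> System.out.println(pair + " -> " + n));
    }
}
```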

Second Solution: In-Memory Processing Engine
The Hadoop-based system was replaced due to its inability to meet the latency requirements of the search assistance application.
Overall Search Architecture

The overall search architecture consists of:
Blender: the frontend, which brokers all service requests to Twitter's family of search services. Blender has a complete record of users' search sessions, so there is no need for client-side Scribe event logs; however, it does not have access to clickthrough data.
EarlyBird: Twitter's inverted indexing engine. A fleet of these servers ingests tweets from the firehose (a streaming API providing access to all tweets as they are published) to update in-memory indexes.
The in-memory processing engine consumes two sources of data, the tweet firehose and the Blender query hose (tweets and search sessions, respectively), augmented by offline processing components.

Architecture of the Search Assistance Engine

Lightweight in-memory caches periodically read fresh results from HDFS and serve as the frontend nodes (operated as a Thrift service). Together, the frontend nodes form a single replicated, fault-tolerant service endpoint that can be arbitrarily scaled out to handle increased query load. Request routing to the replicas is handled by ServerSet, a Twitter abstraction that provides client-side load-balanced access to a replicated service coordinated by ZooKeeper. The actual computations are performed on the backend nodes. The backend processing is replicated for fault tolerance but not sharded. Every five minutes, computed results are persisted to HDFS: the instances perform leader election using ZooKeeper, and the winner proceeds to write its results. Every minute, the frontend caches poll a known HDFS location for updated results, ensuring freshness of query suggestions and spelling corrections.
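The presentation does not detail how the ZooKeeper election is implemented. As a purely illustrative sketch, a simplified variant can be written with Apache Curator's LeaderLatch recipe: all backend replicas register under the same latch path, and whichever instance currently holds leadership writes results on each five-minute tick. The ensemble address, latch path, and write routine below are all hypothetical:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative sketch: replicated backend instances coordinate through
// ZooKeeper so that only one of them persists results to HDFS each cycle.
public class ResultWriter {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",          // hypothetical ensemble
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // All instances compete on the same latch path; one holds leadership.
        LeaderLatch latch = new LeaderLatch(zk, "/search-assistance/writer");
        latch.start();

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            if (latch.hasLeadership()) {
                persistResultsToHdfs();   // hypothetical write of ranked suggestions
            }
        }, 5, 5, TimeUnit.MINUTES);
    }

    private static void persistResultsToHdfs() {
        // Placeholder: write the computed suggestions to a known HDFS path,
        // which the frontend caches poll every minute.
    }
}
```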

Each backend instance is a multi-threaded application that consists of three major components:
The Stats Collector, which reads the firehose and the query hose.
The In-memory Stores, which hold the most up-to-date statistics.
The Rankers, which periodically execute one or more ranking algorithms, consulting the in-memory stores for the raw features.
There are three separate in-memory stores that keep track of the relevant statistics (see the sketch after this list):
The Session Store keeps track of user sessions observed in the query hose and, for each session, the history of the queries issued, held in a linked list.
The Query Statistics Store retains up-to-date statistics about individual queries.
The Query Co-occurrence Statistics Store is similar to the Query Statistics Store, except that it holds data about pairs of co-occurring queries.
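A minimal Java sketch of what these three stores might look like follows; the class and field names are hypothetical, and production concerns such as time-based decay, eviction, and synchronization around the session lists are omitted (the presentation does not describe the implementation at this level of detail):

```java
import java.util.LinkedList;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative in-memory stores for a search-assistance backend.
public class InMemoryStores {

    // Session Store: per-user history of issued queries, kept as a linked list.
    static class SessionStore {
        private final ConcurrentHashMap<Long, LinkedList<String>> sessions =
                new ConcurrentHashMap<>();

        void recordQuery(long userId, String query) {
            // Note: per-list synchronization is omitted for brevity.
            sessions.computeIfAbsent(userId, id -> new LinkedList<>()).add(query);
        }
    }

    // Query Statistics Store: up-to-date counts for individual queries.
    static class QueryStatsStore {
        private final ConcurrentHashMap<String, AtomicLong> counts =
                new ConcurrentHashMap<>();

        void increment(String query) {
            counts.computeIfAbsent(query, q -> new AtomicLong()).incrementAndGet();
        }
    }

    // Query Co-occurrence Statistics Store: counts for pairs of queries
    // observed in the same session, keyed order-independently.
    static class CooccurrenceStore {
        private final ConcurrentHashMap<String, AtomicLong> pairCounts =
                new ConcurrentHashMap<>();

        void increment(String q1, String q2) {
            String key = q1.compareTo(q2) < 0 ? q1 + "\t" + q2 : q2 + "\t" + q1;
            pairCounts.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
        }
    }
}
```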

Rationale for the Search Assistance Engine Architecture
The architecture adopts a decoupled frontend/backend design because of their different scalability requirements: the frontend caches need to scale out with increased query load, whereas the backends face no such pressure. Persisting data to HDFS ensures that, upon a cold restart, the frontend caches can serve the most recently written results immediately, without waiting for the backend. Persisted results can also be analyzed retrospectively with Pig to understand how the service is being used; since the data resides on HDFS, it is easy to join it with log data for behavior analysis and with user data to slice-and-dice by demographic characteristics.

Conclusion
There is a growing recognition that volume, velocity, and variety require different models of computation and alternative processing platforms. An important future direction in data management is bridging the gap between platforms for big data and platforms for fast data.