Rainbird: Real-time Analytics @Twitter




Rainbird: Real-time Analytics @Twitter
Kevin Weil (@kevinweil)
Product Lead for Revenue, Twitter

Agenda
Why Real-time Analytics?
Rainbird and Cassandra
Production Uses at Twitter
Open Source

My Background
Mathematics and Physics at Harvard, Physics at Stanford
Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
Cooliris (web media): Hadoop and Pig for analytics, TBs of data
Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
Now revenue products!

Why Real-time Analytics
Twitter is real-time
... even in space

And My Personal Favorite

Real-time Reporting
Discussion around ad-based revenue model
Help shape the conversation in real-time with Promoted Tweets
Real-time reporting ties it all together

Requirements
Extremely high write volume
  Needs to scale to 100,000s of writes per second (WPS)
High read volume
  Needs to scale to 10,000s of reads per second (RPS)
Horizontally scalable (reads, storage, etc.)
  Needs to scale to 100+ TB
Low latency
  Most reads <100 ms (esp. recent data)

Cassandra
Pro: In-house expertise
Pro: Open source Apache project
Pro: Writes are extremely fast
Pro: Horizontally scalable, low latency
Pro: Other startup adoption (Digg, SimpleGeo)
Con: It was really young (0.3a)

Cassandra
Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
  Say hi to @kelvin
  And @lenn0x
  A dude from Sweden began helping: @skr
  Now all at Twitter :)

Rainbird
It counts things. Really quickly.
Layers on top of the distributed counters patch, CASSANDRA-1072
Relies on ZooKeeper, Cassandra, Scribe, Thrift
Written in Scala

Rainbird Design
Aggregators buffer for 1 minute
Intelligent flush to Cassandra
Query servers read once written
The 1-minute buffer is configurable
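
The buffering step can be pictured with a minimal Python sketch. This is not Rainbird's code (Rainbird is written in Scala), and the slides do not say what makes the flush "intelligent"; the sketch assumes a plain time-based flush. The class name `BufferingAggregator`, the `sink` callback (standing in for a write to Cassandra), and the injectable `clock` are all hypothetical names introduced for illustration.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple

Key = Tuple[str, ...]


class BufferingAggregator:
    """Hypothetical in-memory buffer that combines increments before flushing."""

    def __init__(self, sink: Callable[[Dict[Key, int]], None],
                 flush_interval_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink  # assumption: stands in for the write to Cassandra
        self._flush_interval_s = flush_interval_s
        self._clock = clock
        self._buffer: Dict[Key, int] = defaultdict(int)
        self._last_flush = clock()

    def record(self, key: Key, value: int = 1) -> None:
        # Many events for the same key collapse into one buffered delta.
        self._buffer[key] += value
        if self._clock() - self._last_flush >= self._flush_interval_s:
            self.flush()

    def flush(self) -> None:
        # Hand the combined deltas to the sink, then start a new window.
        if self._buffer:
            self._sink(dict(self._buffer))
            self._buffer.clear()
        self._last_flush = self._clock()
```

Buffering this way trades up to one flush interval of read freshness for far fewer writes to the store, since repeated increments to a hot key become a single delta.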

Rainbird Data Structures
struct Event {
  1: i32 timestamp,          // Unix timestamp of event
  2: string category,        // Stat category name
  3: list<string> key,       // Stat keys (hierarchical)
  4: i64 value,              // Actual count (diff)
  5: optional set<property> properties,                  // More later
  6: optional map<property, i64> propertieswithcounts    // More later
}
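
For readers less familiar with Thrift, the Event struct can be mirrored as a Python dataclass. This is only an illustrative sketch: the slides never define the `property` type, so plain strings are assumed here, and the sample timestamp and IDs are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Event:
    timestamp: int   # Unix timestamp of the event (i32 in the struct)
    category: str    # stat category name
    key: List[str]   # hierarchical stat keys, most general first
    value: int       # the count to add (a diff, not an absolute total)
    # Assumption: `property` is treated as a plain string.
    properties: Optional[Set[str]] = None
    propertieswithcounts: Optional[Dict[str, int]] = None


# A single Promoted Tweet impression, with made-up identifiers.
impression = Event(
    timestamp=1297728000,
    category="pti",
    key=["advertiser_1", "campaign_7", "tweet_42"],
    value=1,
)
```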

Hierarchical Aggregation
Say we're counting Promoted Tweet impressions
  category = pti
  keys = [advertiser_id, campaign_id, tweet_id]
  count = 1
Rainbird automatically increments the count for
  [advertiser_id, campaign_id, tweet_id]
  [advertiser_id, campaign_id]
  [advertiser_id]
Means fast queries over each level of hierarchy
Configurable in rainbird.conf, or dynamically via ZK
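
The prefix expansion described on this slide can be sketched in a few lines of Python. The function names `prefixes` and `increment` and the in-memory `Counter` are assumptions for illustration; in Rainbird the increments go to Cassandra counters instead.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def prefixes(key: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return the full key followed by each shorter prefix, longest first."""
    return [tuple(key[:n]) for n in range(len(key), 0, -1)]


def increment(counters: Counter, category: str,
              key: Sequence[str], count: int = 1) -> None:
    # One event bumps a counter at every level of the hierarchy.
    for prefix in prefixes(key):
        counters[(category, prefix)] += count


counters: Counter = Counter()
increment(counters, "pti", ["adv1", "camp1", "tweetA"])
increment(counters, "pti", ["adv1", "camp1", "tweetB"])
increment(counters, "pti", ["adv1", "camp2", "tweetC"])
```

After these three events, the advertiser-level counter holds 3, the `camp1` counter holds 2, and each tweet-level counter holds 1, so a query at any level is a single lookup rather than a scan.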

Hierarchical Aggregation
Another example: tracking URL shortener tweets/clicks
  full URL = http://music.amazon.com/some_really_long_path
  keys = [com, amazon, music, full URL]
  count = 1
Rainbird automatically increments the count for
  [com, amazon, music, full URL]    How many people tweeted full URL?
  [com, amazon, music]              How many people tweeted any music.amazon.com URL?
  [com, amazon]                     How many people tweeted any amazon.com URL?
  [com]                             How many people tweeted any .com URL?
Means we can count clicks on full URLs
And automatically aggregate over domains and subdomains!
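
One way to derive such a key from a URL is sketched below in Python. The helper name `url_to_key` is hypothetical, and the naive split on dots is an assumption: it treats every label as its own level, so multi-part suffixes such as `co.uk` would need extra handling.

```python
from typing import List
from urllib.parse import urlsplit


def url_to_key(url: str) -> List[str]:
    """Build a hierarchical key: reversed host labels, then the full URL."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    # Reverse so the most general level (the TLD) comes first.
    return list(reversed(labels)) + [url]


key = url_to_key("http://music.amazon.com/some_really_long_path")
```

Here `key` is `["com", "amazon", "music", "http://music.amazon.com/some_really_long_path"]`, matching the key layout on the slide.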

Temporal Aggregation
Rainbird also does (configurable) temporal aggregation
Each count is kept minutely, but also denormalized hourly, daily, and all time
Gives us quick counts at varying granularities with no large scans at read time
Trading storage for latency
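
The denormalization can be illustrated with a small Python sketch: each event is written into one bucket per granularity, so a read at any granularity is a direct lookup. The bucket labels, the function names, and the in-memory `Counter` are assumptions for illustration, not Rainbird's storage layout.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import List, Tuple


def time_buckets(timestamp: int) -> List[Tuple[str, str]]:
    """Return the (granularity, bucket label) pairs one event falls into."""
    t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return [
        ("minute", t.strftime("%Y-%m-%dT%H:%M")),
        ("hour", t.strftime("%Y-%m-%dT%H")),
        ("day", t.strftime("%Y-%m-%d")),
        ("all_time", ""),
    ]


def record(counters: Counter, key: str, timestamp: int, value: int = 1) -> None:
    # Four writes per event: this is the storage cost paid for fast reads.
    for bucket in time_buckets(timestamp):
        counters[(key, bucket)] += value


counters: Counter = Counter()
for ts in (0, 60, 3600):  # three events: same hour twice, then the next hour
    record(counters, "clicks", ts)
```

With these three events, the first minute bucket holds 1, the first hour bucket holds 2, and both the day and all-time buckets hold 3.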

Multiple Formulas
So far we have talked about sums
Could also store counts (1 for each event)
  ... which gives us a mean
And sums of squares (count * count for each event)
  ... which gives us a standard deviation
And min/max as well
Configure this per-category in rainbird.conf
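
The arithmetic behind these formulas is short enough to sketch in Python. The class name `Moments` is hypothetical, and the population form of the standard deviation, sqrt(E[x^2] - E[x]^2), is an assumption, since the slide does not say which form is used. The point is that sum, count, and sum of squares are each plain additive counters, so they fit the same increment-only storage.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value: float) -> None:
        self.count += 1                 # 1 for each event
        self.total += value             # the sum
        self.total_sq += value * value  # the sum of squares
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def mean(self) -> float:
        return self.total / self.count

    def stddev(self) -> float:
        # Population standard deviation; clamp tiny negative rounding errors.
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.count - m * m, 0.0))


stats = Moments()
for v in (2, 4, 4, 4, 5, 5, 7, 9):
    stats.add(v)
```

For these eight values the mean is 5 and the standard deviation is 2, computed from the three running totals without keeping the individual events.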

Rainbird
Write 100,000s of events per second, each with hierarchical structure
Query with minutely granularity over any level of the hierarchy, get back a time series
Or query all time values
Or query all time means, standard deviations
Latency < 100 ms

Production Uses
It turns out we need to count things all the time
As soon as we had this service, we started finding all sorts of use cases for it
  Promoted Products
  Tweeted URLs, by domain/subdomain
  Per-user Tweet interactions (fav, RT, follow)
  Arbitrary terms in Tweets
  Clicks on t.co URLs

Production Uses
Promoted Tweet Analytics
  Each different metric is part of the key hierarchy
  Uses the temporal aggregation to quickly show different levels of granularity
  Data can be historical, or from 60 seconds ago

Production Uses
Internal Monitoring and Alerting
  We require operational reporting on all internal services
  Needs to be real-time, but also want longer-term aggregates
  Hierarchical, too: [stat, datacenter, service, machine]

Production Uses
Tweet Button Counts
  Tweet Button counts are requested many many times each day from across the web
  Uses the all time field

Open Source?
Yes!
... but not yet
Relies on unreleased version of Cassandra
... but the counters patch is committed in trunk (0.8)
... also relies on some internal frameworks we need to open source
It will happen
See http://github.com/twitter for proof of how much Twitter open sources

Team
John Corwin (@johnxorz)
Adam Samet (@damnitsamet)
Johan Oskarsson (@skr)
Kelvin Kakugawa (@kelvin)
Chris Goffinet (@lenn0x)
Steve Jiang (@sjiang)
Kevin Weil (@kevinweil)

If You Only Remember One Slide...
Rainbird is a distributed, high-volume counting service built on top of Cassandra
Write 100,000s of events per second, query them by hierarchy and at multiple time granularities, and get results back in <100 ms
Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and Tweet Button
Will be open sourced so the community can use and improve it!

Questions?
Follow me: @kevinweil