Delivering Intelligence to Publishers Through Big Data




Delivering Intelligence to Publishers Through Big Data 2015-05-21 Jonathan Sharley Team Lead, Data Operations www.sovrn.com

Who is Sovrn? > An advertising network with direct relationships to 20,000+ publisher websites > An advocate for medium to long tail publishers > A real-time bidding (RTB) exchange for programmatic ad buying

Case Study: BI for Publishers Sovrn works with online publishers of all shapes and sizes to help them better understand and engage their audiences and grow their businesses with effective site monetization.

Case Study: BI for Publishers 1. Basic quantitative metrics: Requests, Impressions, Revenue, Unique counts, sliced by Time, Zone, Site, Category, Geo 2. Ratios: CPM, Fill 3. Comparative metrics: Category averages, Viewer segmentation, Trending up or down?

The Tech 1. Storm: near real-time aggregation 2. Hadoop: raw data, source of record 3. Cassandra: serving layer


Storm

Storm: Implementation Data is continuously processed in micro-batches, with updates submitted to Cassandra. Measures are summed along various dimensions. Column families are organized as materialized views to minimize computation on retrieval.

Storm: Implementation Data is ingested using the Storm Trident abstraction, which supports exactly-once delivery. Batch size is dynamic, depending on volume and the backlog of streamed data. We look at spout consumer offsets in ZooKeeper to know how far behind we are.
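
How such a backlog check might look: a minimal Java sketch that reads a spout's committed offset from ZooKeeper with Apache Curator and compares it to the head of the log. The znode path, its text payload, and the stubbed broker-side lookup are illustrative assumptions, not Sovrn's actual layout.

```java
// Minimal lag check sketch: committed offset from ZooKeeper vs. latest
// offset on the Kafka broker. Znode layout is hypothetical.
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SpoutLagCheck {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        // Assumed layout: /kafka-spout/<topic>/partition_<n> holds the
        // committed offset as plain text.
        byte[] data = zk.getData().forPath("/kafka-spout/adevents/partition_0");
        long committed = Long.parseLong(new String(data, StandardCharsets.UTF_8).trim());
        long latest = latestOffsetFromKafka();  // broker-side lookup, stubbed below
        System.out.println("partition 0 backlog: " + (latest - committed) + " messages");
        zk.close();
    }

    // The head-of-log offset would come from Kafka's offset request API.
    private static long latestOffsetFromKafka() { return 0L; }
}
```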

Storm: Implementation Throttling: logs are collected for 10 seconds, up to a maximum data size for each topic partition, to force a certain amount of reduction. Otherwise Storm may make many smaller updates and saturate Cassandra. A reduce on 100,000 vs. 10,000 records produces only 20% more keys in the Cassandra update.
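
A minimal sketch of that collect-then-reduce throttle: buffer records per topic partition until a time window or size cap is hit, then hand off one batch to be reduced before the Cassandra write. The class name and the 64 MB cap are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionBuffer {
    private static final long WINDOW_MS = 10_000;    // collect for 10 seconds...
    private static final long MAX_BYTES = 64 << 20;  // ...or until a ~64 MB cap (assumed)

    private final List<byte[]> records = new ArrayList<>();
    private long bytes = 0;
    private long windowStart = System.currentTimeMillis();

    /** Buffer one record; returns a batch to reduce when a limit is hit, else null. */
    public List<byte[]> add(byte[] record) {
        records.add(record);
        bytes += record.length;
        boolean windowElapsed = System.currentTimeMillis() - windowStart >= WINDOW_MS;
        if (windowElapsed || bytes >= MAX_BYTES) {
            List<byte[]> batch = new ArrayList<>(records);  // hand off for reduction
            records.clear();
            bytes = 0;
            windowStart = System.currentTimeMillis();
            return batch;  // caller sums this into few keys before one Cassandra update
        }
        return null;
    }
}
```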

Storm: Implementation Problem: a Storm batch may necessitate more than one batch update to Cassandra. Trident will rerun failed batches, but how do we ensure exactly-once semantics across multiple writes to Cassandra?

Storm: Implementation Batches are sorted and split into 2,000-key sub-batches for optimized writes to Cassandra. Each sub-batch = 1 Cassandra update. Batch processing logic is deterministic: we can reprocess a batch and produce the exact same sub-batches.
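
The determinism requirement is easy to satisfy: sort the batch's keys before chunking, so reprocessing the same batch always yields identical sub-batches that can be retried safely. A sketch (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SubBatcher {
    static final int SUB_BATCH_KEYS = 2000;  // one Cassandra update per sub-batch

    /** Deterministic: the same input batch always produces identical sub-batches. */
    public static List<List<String>> split(List<String> batchKeys) {
        List<String> sorted = new ArrayList<>(batchKeys);
        Collections.sort(sorted);  // fixed order makes reprocessing reproducible
        List<List<String>> subBatches = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += SUB_BATCH_KEYS) {
            subBatches.add(new ArrayList<>(
                    sorted.subList(i, Math.min(i + SUB_BATCH_KEYS, sorted.size()))));
        }
        return subBatches;
    }
}
```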

Storm: Implementation We use the transactional spout for Trident: in the case of a failure, the replayed batch will contain exactly the same messages. Storm processing would have gaps if the exact same messages weren't available. This method relies on Kafka's ability to reliably replay messages.

Storm: Implementation Trident transactions are aided by a key store. > Before each sub-batch is written, we check for a key of <Trident txid> + <sub-batch id>. If we find the key, we skip our Cassandra sub-batch update. > After the Cassandra update, we mark the sub-batch done. These steps are accomplished through the beginCommit() and commit() methods of the State interface.
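
One way those marker checks could sit alongside Trident's State interface (package storm.trident.state in Storm 0.9/0.10; org.apache.storm.trident.state from 1.0 on) is sketched below. The key-store helpers and the Cassandra update are hypothetical placeholders, not Sovrn's implementation.

```java
import java.util.List;
import storm.trident.state.State;  // org.apache.storm.trident.state in Storm 1.0+

public class IdempotentCassandraState implements State {
    private Long currentTxid;

    @Override
    public void beginCommit(Long txid) { this.currentTxid = txid; }  // new Trident txn

    @Override
    public void commit(Long txid) { this.currentTxid = null; }       // txn finished

    /** Write one sub-batch at most once per txid. */
    public void writeSubBatch(int subBatchId, List<String> subBatchKeys) {
        String marker = currentTxid + ":" + subBatchId;  // <Trident txid> + <sub-batch id>
        if (markerExists(marker)) {
            return;  // sub-batch already reached Cassandra on a prior attempt
        }
        updateCassandra(subBatchKeys);
        writeMarker(marker);  // mark the sub-batch done
    }

    private boolean markerExists(String key) { return false; }  // key-store lookup (assumed)
    private void updateCassandra(List<String> keys) { }          // counter updates (assumed)
    private void writeMarker(String key) { }                     // key-store write (assumed)
}
```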

Storm: BI Outputs Basic measures such as requests, impressions, and revenue are summed along dimensions of time, site, zone, and domestic/intl. A publisher's ad traffic is counted and available in their dashboard within about 15 seconds.


Hadoop

Hadoop: Implementation Raw log data from Kafka is ingested through Camus into an Avro schema-driven format. Camus runs as a map/reduce job where mappers provide ingestion concurrency -- one map per Kafka topic partition. We modified Camus to create metadata for Hive, available via HCatalog for Pig, etc.

Hadoop: Implementation [Diagram] Kafka topic partitions P1-P4 feed Camus Hadoop mappers M1-M4, which write Avro HDFS files (dt=2015052110, dt=2015052111) and create partitions in the Hive Metastore, making the data accessible in Hive & HCatalog.

Hadoop: Implementation Avro-formatted data is normalized and written into ORC format. ORC benefits: Columnar storage with compression => less I/O pressure, more retention. Projection => fetch just the column(s) you need (we have about 100 columns of data on our ad activity). Predicate pushdown => indexed row groups allow skipping of row sets that don't meet select criteria.
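
To see projection and predicate pushdown from the consumer side, a query over Hive JDBC against the ORC-backed table might look like the sketch below: selecting two of ~100 columns exercises projection, and the WHERE clause lets ORC's row-group indexes skip non-matching row sets. Table, column, and connection details are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrcQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Two columns of ~100 (projection); the dt/impressions filter can be
             // pushed down to skip whole ORC row groups (predicate pushdown).
             ResultSet rs = stmt.executeQuery(
                     "SELECT site_id, revenue FROM ad_activity_orc " +
                     "WHERE dt = '2015052110' AND impressions > 0")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```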

Hadoop: Implementation Ad activity is combined with user-supplied and 3rd-party data points to segment publisher traffic sources. Metric data is pushed to Cassandra via Pig jobs, leveraging the DataStax client driver. Parallelism is controlled as a crude method of rate limiting.

Hadoop: BI Outputs Segmented site audience info: > Who visits and how often? > What other sites do they browse? > How much is each audience worth to advertisers? Produce counts and averages by site category. Identify unique site viewers and assess their activity amongst sites in our network.

Hadoop: BI Outputs Which brands are buying site inventory? Which ad tags are serving? Has any ad traffic been denied?

Hadoop: Other Benefits Source of record: raw data supports analysts and drives algorithms that help better monetize unique viewers. Validate counts produced through Storm. Backfill new metrics for publishers or repair BI metrics computed through Storm.

Cassandra

Cassandra: Implementation Measures are Cassandra counters: requests, impressions, revenue, etc. Counters are integer-only, so we shift the decimal point to handle decimal values such as revenue. Each metric written by Storm is stored at the minute, hour, day, and month levels.
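
The decimal shift amounts to fixed-point arithmetic: multiply on write, divide on read. A minimal sketch, where the 10^6 scale factor is an assumed choice rather than Sovrn's actual precision:

```java
public class RevenueScale {
    static final long SCALE = 1_000_000L;  // 6 decimal places (assumption)

    /** Dollars -> integer counter increment, e.g. $1.2345 -> 1234500. */
    static long toCounter(double dollars) {
        return Math.round(dollars * SCALE);
    }

    /** Counter value -> dollars; the decimal point is shifted back on query. */
    static double fromCounter(long counterValue) {
        return counterValue / (double) SCALE;
    }
}
```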

Cassandra: Implementation As counters have no TTL, we must explicitly delete counters for retention purposes. We remove some minute-level metrics in detailed views after a week.
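
Such a retention pass might look like the sketch below, using the DataStax Java driver to delete expired minute rows one key at a time (Cassandra 2.x did not support range deletes on clustering columns). The keyspace, table, and column names are hypothetical.

```java
import java.util.Date;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CounterRetention {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("metrics")) {
            Date expired = new Date(System.currentTimeMillis() - 7L * 24 * 3600 * 1000);
            // Counters ignore TTL, so rows must be removed explicitly; a real
            // job would iterate over every expired (key, minute) pair.
            session.execute(
                "DELETE FROM minute_metrics WHERE metric_key = ? AND minute = ?",
                "site42|impressions", expired);
        }
    }
}
```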

Cassandra: Implementation On query: time ranges are summed across multiple keys via the Sovrn API. We limit the number of key fetches by using the largest possible time blocks within the date range. Fill rate and CPM are calculated on the fly. The revenue counter's decimal point is shifted back to produce a decimal value.
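
Since each metric exists at minute, hour, day, and month granularity, a range can be covered greedily with the coarsest blocks that fit. A sketch of that decomposition (row-key naming is illustrative; inputs are assumed minute-aligned):

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

public class TimeBlocks {
    /** Cover [from, to) with the coarsest aligned blocks to minimize key fetches. */
    public static List<String> cover(LocalDateTime from, LocalDateTime to) {
        List<String> keys = new ArrayList<>();
        LocalDateTime cur = from;
        while (cur.isBefore(to)) {
            if (monthAligned(cur) && !cur.plusMonths(1).isAfter(to)) {
                keys.add("month:" + cur);  cur = cur.plusMonths(1);
            } else if (dayAligned(cur) && !cur.plusDays(1).isAfter(to)) {
                keys.add("day:" + cur);    cur = cur.plusDays(1);
            } else if (hourAligned(cur) && !cur.plusHours(1).isAfter(to)) {
                keys.add("hour:" + cur);   cur = cur.plusHours(1);
            } else {
                keys.add("minute:" + cur); cur = cur.plusMinutes(1);
            }
        }
        return keys;  // the API sums the counters behind these keys
    }

    private static boolean hourAligned(LocalDateTime t)  { return t.getMinute() == 0; }
    private static boolean dayAligned(LocalDateTime t)   { return hourAligned(t) && t.getHour() == 0; }
    private static boolean monthAligned(LocalDateTime t) { return dayAligned(t) && t.getDayOfMonth() == 1; }
}
```

A range from May 3 14:07 to July 1, for example, collapses to some minute keys up to the next hour, hour keys up to the next day, day keys up to June 1, and a single June month key: under a hundred fetches instead of roughly 84,000 minute keys.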


Case Study: Lessons Validate metrics between batch & streaming paths. Monitoring is critical to catch discrepancies quickly. Build in a catch-up mechanism to recover from outages or processing incidents. Limit the number of Cassandra fetches when charting metrics to keep latency reasonable.

Thank You I'm happy to take questions. Email me at: jsharley@sovrn.com