Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011





What do we do?
Deliver the right CONTENT to the right USER at the right TIME
  o Effectively and pro-actively learn from user interactions with the content that is displayed, to maximize our objectives
A new scientific discipline at the interface of
  o Large scale Machine Learning and Statistics
  o Multi-objective optimization in the presence of uncertainty
  o User understanding
  o Content understanding

Relevance at Yahoo!
[Diagram. Labels: People, 10s of Items, Important, Editors, Popular, Personal / Social, Science, Millions of Items]

Ranking Problems
Most Popular: most engaging overall, based on objective metrics
Most Popular + Per User History: engaging overall, and aware of what I've already seen
Light Personalization: more relevant to me, based on my age, gender and property usage
Deep Personalization: most relevant to me, based on my deep interests
Related Items: behavioral affinity (people who did X also did Y)
Real-time Dashboard
Business Optimization
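The "people who did X, did Y" idea behind Related Items can be illustrated with a simple co-occurrence count. This is a minimal sketch, not the method used in the talk: the function names `co_occurrence` and `related`, and the toy click histories, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(histories):
    """Count, for every ordered item pair (x, y), how many users touched both."""
    counts = Counter()
    for items in histories.values():
        # De-duplicate so a user contributes at most once per pair.
        for x, y in combinations(sorted(set(items)), 2):
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    return counts


def related(counts, item, top_n=3):
    """Return up to top_n items most often seen together with `item`."""
    scored = [(y, n) for (x, y), n in counts.items() if x == item]
    # Highest count first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [y for y, _ in scored[:top_n]]


# Hypothetical user histories, for illustration only.
HISTORIES = {
    "u1": ["X", "Y"],
    "u2": ["X", "Y", "Z"],
    "u3": ["X", "Z"],
    "u4": ["X", "Y"],
}
```

Calling `related(co_occurrence(HISTORIES), "X")` ranks Y ahead of Z, because three users touched both X and Y while only two touched both X and Z.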

Recommendation: A Match-making Problem
Recommendation problems
  o Search: Web, Vertical
  o Online advertising
Item Inventory: articles, web pages, ads, ...
Opportunity: users, queries, pages, ...
Use an automated algorithm to select item(s) to show
Get feedback (click, time spent, ...)
Refine the models
Repeat (large number of times)
Measure metric(s) of interest (total clicks, total revenue, ...)

Recommendation: match-making problem
Content optimization example 1
  o I have an important module on my page. Content inventory is obtained from a third-party source and is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module.
Content optimization example 2
  o I got X% lift in CTR, but I have additional information on other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks?

Problem Characteristics: Today module
Traffic obtained from a controlled randomized experiment
Things to note:
  a) Short lifetimes
  b) Temporal effects
  c) Often breaking news stories

Bird's eye view

Flow
[Diagram. Labels: Content feed with biz rules, Optimization Engine, Rules Engine, Content Metadata, Real-time Feedback, Explore ~1%, Exploit ~99%, Real-time Insights, Dashboard, Optimized Module]
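The explore ~1% / exploit ~99% split in the Flow slide resembles an epsilon-greedy policy. The sketch below assumes that reading; the slides do not say how traffic is actually assigned, and `choose_item` and its parameters are hypothetical names.

```python
import random


def choose_item(ctr_estimates, explore_rate=0.01, rng=random):
    """Pick an item id from a dict of {item_id: estimated CTR}.

    With probability explore_rate a uniformly random item is shown (explore);
    otherwise the item with the highest estimated CTR is shown (exploit).
    """
    items = sorted(ctr_estimates)
    if rng.random() < explore_rate:
        return rng.choice(items), "explore"
    best = max(items, key=lambda item: ctr_estimates[item])
    return best, "exploit"
```

The small explore bucket keeps producing unbiased feedback on every item, which matters for items with short lifetimes whose estimates would otherwise never be refreshed.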

Technology Stack
[Diagram. Recoverable label: Analytics and Debugging]

Hadoop
Framework for running applications on large clusters built of commodity hardware
Lets one easily write and run applications that process vast amounts of data (petabytes)
Distributed File System
  o Modeled on GFS
Distributed Processing Framework
  o Using the map-reduce metaphor
Scheduler / Resource Management
Open source
Written in Java, with client apps in various languages
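The map-reduce metaphor can be shown in a few lines of plain Python. This is an in-process sketch of the programming model only; it does not use any Hadoop API, and the event format (one `(item_id, clicked)` record per impression) is an assumption made for illustration.

```python
from collections import defaultdict


def map_event(event):
    """Map step: emit (item_id, (views, clicks)) for one impression record."""
    item_id, clicked = event
    yield item_id, (1, 1 if clicked else 0)


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(item_id, values):
    """Reduce step: sum views and clicks for one item."""
    views = sum(v for v, _ in values)
    clicks = sum(c for _, c in values)
    return item_id, views, clicks


def run_job(events):
    """Run map, shuffle and reduce over an iterable of events."""
    pairs = (pair for event in events for pair in map_event(event))
    return sorted(reduce_counts(key, vals) for key, vals in shuffle(pairs).items())
```

On a real cluster the map and reduce functions run in parallel on different machines and the framework performs the shuffle; here everything runs in one process to keep the roles of the three steps visible.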

HBase
Supports random reads and writes
HBase is a storage system that is
  o Distributed
  o Column oriented
  o Multi-dimensional
  o Highly available
  o High-performance
You must be OK with RDBMS anti-schema
  o Denormalized data
  o Wide and sparsely populated tables
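A "wide and sparsely populated" table can be pictured as a map from row key to a map of only the columns that row actually has. The class below is a toy in-memory model of that idea, assuming nothing about the real HBase client; `SparseTable` and its methods are hypothetical names.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a wide, sparse table: only populated cells are stored."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Random write: set one cell."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Random read (point lookup): fetch one cell, or default if absent."""
        return self._rows.get(row_key, {}).get(column, default)

    def row(self, row_key):
        """Return a copy of all populated cells for one row."""
        return dict(self._rows.get(row_key, {}))
```

Two rows can carry completely different column sets without any schema change, which is why denormalized feature vectors fit this layout well.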

Hive
SQL-like query engine
  o Enables ad-hoc querying, summarization and analysis of large volumes of data
  o Allows MR programmers to plug in their custom logic
Hive is not
  o Comparable to real-time processing systems like Oracle; queries tend to have latency in minutes
  o Designed for online transaction processing
Extensively used at
  o Facebook, Yahoo!, Digg, CNET, Last.fm, Rocket Fuel, etc.

Grid Edge Services
Keep MR jobs lean and mean
Allow non-gridifiable solutions to be deployed and controlled easily
Have different scaling characteristics (e.g. memory, CPU)
Provide a gateway for accessing external data sources in M/R
Map and/or Reduce steps interact with Edge Services using a standard client
Examples
  o Categorization
  o Geo Tagging
  o Feature Transformation

How it happens?
[Diagram. Recoverable content follows.]
Additional Content & User Feature Generation
Feature Generation
User Events: at time t, user u (user attr: age, gen, loc) interacted with Content id at
  o Position - o
  o Property/Site - p
  o Section - s
  o Module - m
  o International - i
Content id has associated metadata meta, where meta = {entity, keyword, geo, topic, category}
Modeling
  o Item Metadata
  o ITEM Model: 5 min latency
  o USER Model: 5-30 min latency
  o STORE: PNUTS
Request, Ranking, B-Rules: SLA 50 ms, 200 ms

Item   BASE   M      F      ATTR   CAT_Sports
id1    0.8    +1.2   -1.5   -0.9   1.0
id2    -0.9   -0.9   +2.6   +0.3   1.0

Second table (header as printed: Item BASE M F ATTR CAT_Sports; column alignment of the values is not recoverable)
u1: 0.8, 1, 1, 0.2
u2: -0.9, 1, -1.2

Models
USER MODEL (USER x CONTENT FEATURES): tracks user interest in terms of content features
ITEM MODEL (ITEM x USER FEATURES): tracks behavior of an item across user features
PRIORS (USER FEATURES x CONTENT FEATURES): tracks interactions of user features with content features
CLUSTERING (USER x USER): looks at user-user affinity based on the feature vectors
CLUSTERING (ITEM x ITEM): looks at item-item affinity based on item feature vectors
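One way to read the ITEM x USER FEATURES matrix is as a per-item weight vector over user features. The sketch below borrows the BASE, M and F values for id1 and id2 from the example item table in the "How it happens?" slide and treats them as additive log-odds weights. That interpretation, the logistic link, and the names `score` and `rank_items` are assumptions; the slides do not state the scoring formula.

```python
import math

# BASE, M and F values copied from the example item table in the slides.
# Treating them as additive log-odds weights is an assumption.
ITEM_MODEL = {
    "id1": {"BASE": 0.8, "M": 1.2, "F": -1.5},
    "id2": {"BASE": -0.9, "M": -0.9, "F": 2.6},
}


def score(item_weights, user_features):
    """Estimated click probability of one item for a set of user features."""
    z = item_weights["BASE"] + sum(item_weights.get(f, 0.0) for f in user_features)
    return 1.0 / (1.0 + math.exp(-z))


def rank_items(user_features, item_model=ITEM_MODEL):
    """Return item ids ordered from highest to lowest score."""
    return sorted(
        item_model,
        key=lambda item_id: score(item_model[item_id], user_features),
        reverse=True,
    )
```

Under these assumed weights, `rank_items({"M"})` puts id1 first and `rank_items({"F"})` puts id2 first, which is the kind of light personalization by user attribute described in the Ranking Problems slide.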

Scale
Million events per second
Hundreds of GB per run
Millions of stories in pool
Tens of thousands of features (content and/or user)

Modeling Framework
Global state provided by HBase
A collection of PIG UDFs
Flow for modeling or stages assembled in PIG
  o OLR
  o Clustering
  o Affinity
  o Regression Models
Configuration-based behavioral changes for stages of modeling
  o Type of features to generate
  o Type of joins to perform: User / Item / Feature
Input: DFS and/or HBase
Output: DFS and/or HBase
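The slide lists OLR as a modeling stage without expanding the acronym. Assuming it stands for online logistic regression, a single-example update could look like the sketch below. The feature names, the learning rate and the functions `predict` and `update` are hypothetical, and the real stages are PIG UDFs rather than Python.

```python
import math


def predict(weights, features):
    """Click probability under a logistic model with sparse named features."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, features, clicked, learning_rate=0.1):
    """One stochastic gradient step on the log-loss for a single event."""
    error = (1.0 if clicked else 0.0) - predict(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights
```

Because each event touches only the weights of its own features, the state can live in a keyed store and be updated incrementally as new feedback arrives.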

HBase
ITEM Table
  o Stores item-related features
  o Stores the ITEM x USER FEATURES model
  o Stores parameters about the item, such as view count, click count and unique user count
  o 10s of millions of items
  o Updated every 5 minutes
USER Model
  o Stores the USER x CONTENT FEATURES model for each individual user, by unique ID
  o Stores summarized user history (essential for modeling in terms of item decay)
  o Millions of profiles
  o Updated every 5 to 30 minutes
TERM Model
  o Inverts the ITEM table and stores statistics for the terms
  o Used to find trending features and provide baselines for user features
  o Millions of terms and hundreds of parameters tracked
  o Updated every 5 minutes
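Inverting the ITEM table into per-term statistics can be sketched as a single pass that re-keys item counters by term. The row layout (`terms`, `views`, `clicks`) and the function names below are hypothetical; the slides only say that the TERM model inverts the ITEM table and stores statistics per term.

```python
from collections import defaultdict


def invert_item_table(item_table):
    """Aggregate item-level counters into term-level counters."""
    term_stats = defaultdict(lambda: {"items": 0, "views": 0, "clicks": 0})
    for row in item_table.values():
        for term in row["terms"]:
            stats = term_stats[term]
            stats["items"] += 1
            stats["views"] += row["views"]
            stats["clicks"] += row["clicks"]
    return dict(term_stats)


def term_ctr(stats):
    """Click-through rate for one term; 0.0 when the term has no views."""
    return stats["clicks"] / stats["views"] if stats["views"] else 0.0
```

A term whose CTR or view count rises sharply between two 5-minute runs is a natural candidate for a trending feature.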

Analytics and Debugging
Provides the ability to debug modeling issues in near-real time
Run complex queries for analysis
Easy-to-use interface
PMs, engineers and researchers use this cluster to get near-real-time insights
100s of modeling, monitoring and reporting queries every 5 minutes
Output fed to a (near) real-time dashboard
We use HIVE

Learnings
PIG & HBase have been the best combination so far
  o Made it simple to build different kinds of science models
  o Point lookup using HBase has proven to be very useful
  o Modeling = matrices; HBase provides a natural way to represent and access them
Edge Services
  o Have provided simplicity to the whole stack
  o Management (upgrades, outages) has been easy
HIVE has given us a great way to analyze the results
  o PIG was also considered

Q & A

Questions?