Online Content Optimization Using Hadoop
Jyoti Ahuja, Dec 20, 2011
What do we do?
Deliver the right CONTENT to the right USER at the right TIME
o Effectively and proactively learn from user interactions with displayed content to maximize our objectives
A new scientific discipline at the interface of:
o Large-scale machine learning and statistics
o Multi-objective optimization in the presence of uncertainty
o User understanding
o Content understanding
Relevance at Yahoo!
o Editors: 10s of items; important, popular content
o Science: millions of items; personal / social relevance
Ranking Problems
o Most Popular: most engaging overall, based on objective metrics
o Most Popular + Per-User History: engaging overall, and aware of what I've already seen
o Light Personalization: more relevant to me, based on my age, gender, and property usage
o Deep Personalization: most relevant to me, based on my deep interests
o Related Items (behavioral affinity): people who did X also did Y
o Real-time Dashboard: business optimization
Recommendation: A Match-Making Problem
Recommendation problems: search (web, vertical), online advertising, content modules
o Item inventory: articles, web pages, ads, ...
o Opportunity: users, queries, pages, ...
The loop:
o Use an automated algorithm to select item(s) to show
o Get feedback (click, time spent, ...)
o Refine the models
o Repeat (a large number of times)
o Measure metric(s) of interest (total clicks, total revenue, ...)
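The select/feedback/refine loop above can be sketched as an epsilon-greedy scheme over a small item pool. This is an illustrative toy, not the production algorithm; the class name, the smoothing, and the epsilon value are assumptions (the deck only says explore/exploit traffic is split roughly 1/99).

```python
import random

class EpsilonGreedyModule:
    """Toy content module: pick items to show, observe clicks, refine CTR estimates."""

    def __init__(self, item_ids, epsilon=0.01):
        self.epsilon = epsilon                    # explore fraction (~1% in the talk)
        self.clicks = {i: 0 for i in item_ids}
        self.views = {i: 0 for i in item_ids}

    def ctr(self, item):
        # Smoothed CTR estimate so unseen items are not stuck at 0
        return (self.clicks[item] + 1) / (self.views[item] + 2)

    def select(self):
        if random.random() < self.epsilon:        # explore: random item
            return random.choice(list(self.views))
        return max(self.views, key=self.ctr)      # exploit: current best CTR

    def feedback(self, item, clicked):
        # Refine the model from the observed interaction
        self.views[item] += 1
        self.clicks[item] += clicked
```

Run many rounds and the module concentrates traffic on the highest-CTR item while the explore slice keeps estimates fresh for the rest of the pool.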
Recommendation as a match-making problem
Content optimization example 1:
o I have an important module on my page; its content inventory comes from a third-party source and is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module.
Content optimization example 2:
o I got an X% lift in CTR, but I have additional information on other downstream utilities (e.g., dwell time). Can I increase downstream utility without losing too many clicks?
Problem Characteristics: the Today module
Traffic is obtained from a controlled, randomized experiment
Things to note:
a) Short content lifetimes
b) Temporal effects
c) Often breaking news stories
Bird's-Eye View
Flow
o Content feed with business rules feeds the Rules Engine and the Optimization Engine, which serve the optimized module
o Content metadata and real-time feedback drive the Optimization Engine
o Traffic split: Explore ~1%, Exploit ~99%
o Real-time insights dashboard
Technology Stack
o Analytics and Debugging
Hadoop
A framework for running applications on large clusters built from commodity hardware
o Lets one easily write and run applications that process vast amounts of data (petabytes)
o Distributed file system, modeled on GFS
o Distributed processing framework using the map-reduce metaphor
o Scheduler / resource management
o Open source; written in Java, with client apps in various languages
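The map-reduce metaphor mentioned above can be illustrated with the canonical word-count example. This is a single-process sketch of the programming model only (map, shuffle/sort, reduce), not of Hadoop's distributed runtime; the function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Each input record is mapped to a list of (key, value) pairs
    return [kv for rec in records for kv in mapper(rec)]

def reduce_phase(pairs, reducer):
    # Shuffle/sort: bring all values for a key together, then reduce
    pairs.sort(key=itemgetter(0))
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))}

# Canonical word count
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

lines = ["hadoop runs map reduce", "map reduce on hadoop"]
counts = reduce_phase(map_phase(lines, mapper), reducer)
```

In Hadoop the same mapper and reducer logic is distributed: mappers run in parallel over file blocks, the framework performs the shuffle/sort across the network, and reducers write results back to the distributed file system.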
HBase
HBase is a storage system that is:
o Distributed
o Column-oriented
o Multi-dimensional
o Highly available
o High-performance
o Supports random reads and writes
You must be OK with an RDBMS anti-schema:
o Denormalized data
o Wide and sparsely populated tables
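A wide, sparse, denormalized HBase row can be pictured as nested maps: a row key plus column families, each holding only the cells that are actually populated. The family and qualifier names below are illustrative, not the real schema from the talk.

```python
# One row per item: row key + column families, each a sparse map of
# qualifier -> value. Only populated cells are stored, so rows can be
# very wide yet cheap. (All names here are hypothetical.)
row = {
    "row_key": "item#id1",
    "meta":  {"category": "sports", "geo": "us"},     # metadata family
    "stats": {"views": 10500, "clicks": 320},         # counter family
    "feat":  {"BASE": 0.8, "CAT_Sports": 1.0},        # model-feature family
}

def get_cell(row, family, qualifier, default=None):
    """Random read of a single cell, HBase-style."""
    return row.get(family, {}).get(qualifier, default)
```

Denormalization means everything needed to score or serve an item lives in one row, so a single random read replaces what would be a multi-table join in an RDBMS.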
Hive
SQL-like query engine:
o Enables ad-hoc querying, summarization, and analysis of large volumes of data
o Allows MR programmers to plug in their custom logic
Hive is not:
o Comparable to real-time processing systems like Oracle; queries tend to have latencies in minutes
o Designed for online transaction processing
Extensively used at:
o Facebook, Yahoo!, Digg, CNET, Last.fm, Rocket Fuel, etc.
Grid Edge Services
o Keep MR jobs lean and mean
o Allow non-gridifyable components to be deployed and controlled easily
o Have different scaling characteristics (e.g., memory, CPU)
o Provide a gateway for accessing external data sources from M/R; map and/or reduce steps interact with edge services using a standard client
Examples:
o Categorization
o Geo-tagging
o Feature transformation
How It Happens: Feature Generation and Modeling
o User events: at time t, user u (user attributes: age, gender, location) interacted with content id, at position o, property/site p, section s, module m, international flag i
o Content id has associated metadata: meta = {entity, keyword, geo, topic, category}
Item feature generation (example):
  Item  BASE    M     F   ATTR  CAT_Sports
  id1    0.8  +1.2  -1.5  -0.9     1.0
  id2   -0.9  -0.9  +2.6  +0.3     1.0
User feature generation (example):
  User  BASE    M     F   ATTR  CAT_Sports
  u1     0.8    1     -     1      0.2
  u2    -0.9    -     1   -1.2      -
o Modeling latency: ITEM model ~5 min; USER model 5-30 min
o Request-time ranking with business rules; SLA 50 ms - 200 ms
o Store: PNUTS
Models
o USER x CONTENT FEATURES (USER MODEL): tracks user interest in terms of content features
o ITEM x USER FEATURES (ITEM MODEL): tracks the behavior of an item across user features
o USER FEATURES x CONTENT FEATURES (PRIORS): tracks interactions of user features with content features
o USER x USER CLUSTERING: looks at user-user affinity based on the feature vectors
o ITEM x ITEM CLUSTERING: looks at item-item affinity based on item feature vectors
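One common way such a USER x CONTENT FEATURES model is used at serving time is a sparse dot product: the user's learned weights over content features scored against each candidate item's feature vector. The deck does not spell out the scoring function, so treat this as an assumed sketch; the weights and feature names below are made up.

```python
def score(user_weights, item_features):
    """Sparse dot product of user weights (over content features)
    with an item's content-feature vector."""
    shared = set(user_weights) & set(item_features)
    return sum(user_weights[f] * item_features[f] for f in shared)

def rank(user_weights, items):
    """Order candidate items for this user, best first."""
    return sorted(items, key=lambda i: score(user_weights, items[i]), reverse=True)

# Hypothetical learned user model and two candidate items
user = {"BASE": 0.5, "CAT_Sports": 1.5, "CAT_Finance": -0.8}
items = {
    "id1": {"BASE": 1.0, "CAT_Sports": 1.0},
    "id2": {"BASE": 1.0, "CAT_Finance": 1.0},
}
```

Because both vectors are sparse, only the features the user and item share contribute, which keeps per-request scoring cheap even with tens of thousands of features.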
Scale
o Millions of events per second
o Hundreds of GB per run
o Millions of stories in the pool
o Tens of thousands of features (content and/or user)
Modeling Framework
o A collection of Pig UDFs; flows for the modeling stages are assembled in Pig:
  o OLR (online logistic regression)
  o Clustering
  o Affinity
  o Regression models
o Global state provided by HBase
o Configuration-based behavioral changes for the modeling stages:
  o Types of features to generate (user / item / feature)
  o Types of joins to perform
o Input: DFS and/or HBase; output: DFS and/or HBase
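The OLR stage listed above refers to online logistic regression: each observed impression updates the model with one stochastic-gradient step on the log-loss. A minimal sketch of that update, assuming sparse feature maps and a fixed learning rate (the production system's regularization and learning-rate schedule are not described in the deck):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def olr_update(weights, features, clicked, lr=0.1):
    """One SGD step of online logistic regression on a sparse example.

    weights:  {feature: weight}, mutated in place
    features: {feature: value} for this impression
    clicked:  1 if the user clicked, else 0
    """
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    err = clicked - sigmoid(z)          # gradient of the log-loss w.r.t. z
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights
```

Because the update touches only the features present in one event, it maps naturally onto a Pig/MR flow: events are partitioned, each partition applies updates, and HBase holds the shared weight state between runs.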
HBase Tables
ITEM table:
o Stores item-related features and the ITEM x USER FEATURES model
o Stores per-item parameters such as view count, click count, unique user count
o Tens of millions of items; updated every 5 minutes
USER model:
o Stores the USER x CONTENT FEATURES model for each individual user, keyed by unique ID
o Stores summarized user history; essential for modeling item decay
o Millions of profiles; updated every 5 to 30 minutes
TERM model:
o Inverts the item table and stores statistics for the terms
o Used to find trending features and provide baselines for user features
o Millions of terms, hundreds of parameters tracked; updated every 5 minutes
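The item decay mentioned for the user history can be sketched as exponential decay applied to interaction counts at each profile refresh, so stale history (and short-lived items) fades out. The half-life knob below is an assumption; the deck does not state the actual decay function.

```python
def decay_count(count, elapsed_minutes, half_life_minutes=360.0):
    """Exponentially decay one interaction count: after one half-life
    the count is worth half as much. (half_life is a hypothetical knob.)"""
    return count * 0.5 ** (elapsed_minutes / half_life_minutes)

def refresh_profile(profile, elapsed_minutes, half_life_minutes=360.0):
    """Apply decay to every feature count in a user-profile snapshot,
    as a periodic (e.g. every 5-30 min) update pass might."""
    return {f: decay_count(c, elapsed_minutes, half_life_minutes)
            for f, c in profile.items()}
```

Keeping decayed counts rather than raw history also keeps profile rows small, which matters when millions of profiles are rewritten every few minutes.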
Analytics and Debugging
o Provides the ability to debug modeling issues in near real time
o Run complex queries for analysis through an easy-to-use interface
o PMs, engineers, and researchers use this cluster for near-real-time insights
o Hundreds of modeling, monitoring, and reporting queries every 5 minutes
o Output fed to a (near) real-time dashboard
o We use Hive
Learnings
Pig & HBase has been the best combination so far:
o Made it simple to build different kinds of science models
o Point lookups via HBase have proven very useful
o Modeling = matrices; HBase provides a natural way to represent and access them
Edge services:
o Have kept the whole stack simple
o Management (upgrades, outages) has been easy
Hive has given us a great way to analyze the results (Pig was also considered)
Q & A