Multi-Label Learning with Millions of Labels for Query Recommendation




Multi-Label Learning with Millions of Labels for Query Recommendation. Rahul Agrawal (Microsoft AdCenter), Yashoteja Prabhu (Microsoft Research India), Archit Gupta (IIT Delhi), Manik Varma (Microsoft Research India)

Recommending Advertiser Bid Phrases geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

Query Rewriting geico auto insurance geico car insurance Absolutely cheapest car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

Ranking & Relevance Meta Stream geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida geico twitter

Recommending Advertiser Bid Phrases geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

Learning to Predict a Set of Queries. f : X → 2^Y, where X: Ads and Y: Queries (e.g. italian restaurant, need cheap auto insurance, geico online quote car insurance, iphone).

Learning to Predict a Set of Queries. f([ad]) = {need cheap auto insurance, geico car insurance}

Multi-Label Learning Challenges. f([ad]) = {need cheap auto insurance, geico car insurance}. Challenges: infinite number of labels (queries); training data acquisition; efficient training; cost of prediction.

Binary Classification & Ranking. h : (X, Y) → {+1, −1}, e.g. h([ad], geico), h([ad], iphone). Challenges: infinite number of labels (queries); training data acquisition; efficient training; cost of prediction.

Binary Classification. h : (X, Y) → {+1, −1}, over labels such as italian restaurant, need cheap auto insurance, geico online quote car insurance, iphone. Challenges: infinite number of labels (queries); training data acquisition; efficient training; cost of prediction.

Binary Classification: KEX. h : (X, Y) → {+1, −1}, over labels such as switching to geico, geico online quote, car insurance. Challenges: infinite number of labels (queries); training data acquisition; efficient training; cost of prediction.

Query Recommendations by KEX

Query Recommendations by KEX. h([ad], car insurance)? h([ad], iphone)?

Query Recommendations by KEX plastic ponies simone plastics clothing and accessories sylvia pony clothing couture playground plastic recycling children's clothing

Multi-Label Learning Formulation. f : X → 2^Y, where X: Ads and Y: Queries (e.g. italian restaurant, need cheap auto insurance, geico online quote car insurance, iphone).

Learning with Millions of Labels. f : X → 2^Y, where X: Ads and Y: 10 million queries.

Multi-Label Random Forests We develop Multi-Label Random Forests with logarithmic prediction costs that make predictions in a few milliseconds. We train on 200 M points, 100 M categories and 10 M features in 28 hours on a grid with 1000 compute nodes. We develop a tree growing criterion which learns from positive data alone. We generate training data automatically from click logs. We develop a sparse SSL formulation to infer beliefs about the state of missing and noisy labels.

Training Data: Missing Labels. No annotator can mark all the relevant labels for a data point. We have missing labels during training, validation and testing. Even fundamental ML techniques such as validation can go awry. One can't design error metrics invariant to missing labels.

Training Data and Features. [Ad image: iphone, color, material.] Ads are represented by TF-IDF bag-of-words features.
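The TF-IDF bag-of-words featurization above can be sketched in a few lines of pure Python. This is an illustrative toy, not the paper's pipeline: the sample ads, the whitespace tokenizer and the tf × log(N/df) weighting variant are all assumptions.

```python
import math
from collections import Counter

def tfidf_features(docs):
    """TF-IDF bag of words: weight(term, doc) = tf(term, doc) * log(N / df(term))."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    features = []
    for tokens in tokenized:
        tf = Counter(tokens)
        features.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return features

ads = ["iphone case black silicone",
       "iphone case waterproof",
       "geico car insurance quote"]
vecs = tfidf_features(ads)
```

Terms shared across many ads ("iphone") are down-weighted relative to rare, discriminative terms ("silicone"), which is what makes these sparse vectors useful as split features.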

Training Labels case for iphone best iphone case apple iphone 3g metallic slim fit case best iphone nn4 cases iphone cases best iphone cases apple iphone 4g cases best iphone nn4 case iphone 3gs cases iphone 4s case case iphone otterbox universal defender case iphone nn4 black silicone black plastic sena iphone cases apple iphone 4g premium soft silicone rubber black phone protector skin cover case apple iphone nn4 cases belkin grip vue tint case iphone nn4 clear black white premium bumper case apple iphone nn4 att bunny rabbit silicone case skin iphone nn4 stand tail holder iphone color material iphone case iphone 4g cases iphone case speck iphone case best case iphone 4s iphone 4gs cases iphone nn4 case switcheasy neo case iphone 3g black best case iphone nn4 iphone 4s defender series case 3g iphone cases waterproof iphone case best iphone 3g cases iphone case design TF-IDF Bag of Words Features iphone cases 4g apple iphone cases waterproof iphone cases best iphone 4s case iphone cases 3g best iphone 3g case amazonbasics protective tpu case screen protector att verizon iphone nn4 iphone 4s clear best iphone 4s cases

Training Labels

Missing and Noisy Labels best italian restaurants philadelphia italian restaurants italian restaurant italian restaurants arkansas italian restaurants connecticut italian restaurants idaho italian restaurants phoenix italian restaurant chains italian restaurant connecticut italian restaurant district columbia thai restaurant thai restaurants restaurants mexican restaurants

Missing and Noisy Labels

Frequency Biased Training Data. Most labels will have very few positive training examples (Zipf's law).
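The tail-heaviness this slide refers to is easy to see numerically: if the k-th most frequent query receives clicks proportional to 1/k, the vast majority of labels get only a handful of positives. A small illustrative computation (the label count, click total and "fewer than 5 positives" cutoff are hypothetical):

```python
def zipf_counts(n_labels, total_clicks):
    """Distribute clicks over labels with frequency proportional to 1/rank."""
    h = sum(1.0 / r for r in range(1, n_labels + 1))  # harmonic normalizer
    return [total_clicks / (r * h) for r in range(1, n_labels + 1)]

counts = zipf_counts(n_labels=100_000, total_clicks=1_000_000)
# Fraction of labels with fewer than 5 expected positive examples.
sparse_frac = sum(c < 5 for c in counts) / len(counts)
```

Even with a million clicks spread over only 100K labels, well over half the labels end up with fewer than five expected positives, so any per-label classifier would be starved of data in the tail.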

Multi-Label Prediction Costs. Linear prediction costs are infeasible. [Figure: 1-vs-All classification evaluates one classifier per label, e.g. geico car insurance, pizza, iphone cases.]

Label and Feature Space Compression. The 10M-dimensional label space (e.g. car, motor vehicle, auto; iphone cases, iphone case, cases iphone) and the 6M-dimensional feature space (car ads, iphone case ads) are compressed into a 1K-dimensional embedding space.

Hierarchical Prediction Prediction in logarithmic time

Gating Tree Based Prediction. Prediction in logarithmic time. Is the word insurance present in the ad? Yes/No. Is the word geico present in the ad? Yes/No. [Figure: each leaf stores a distribution over labels.]
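The gating idea above can be sketched directly: each internal node tests for a word and routes the ad left or right, and a leaf stores a label distribution, so prediction costs O(depth) instead of O(#labels). The tree, words and distributions below are made up for illustration.

```python
def make_node(word, yes, no):
    """Internal gating node: test for a word, route to the yes/no subtree."""
    return {"word": word, "yes": yes, "no": no}

def make_leaf(label_dist):
    """Leaf node: stores a distribution over labels (queries)."""
    return {"dist": label_dist}

def predict(node, ad_words):
    """Route the ad down the gating tree; cost is the tree depth, not #labels."""
    while "dist" not in node:
        node = node["yes"] if node["word"] in ad_words else node["no"]
    return node["dist"]

tree = make_node("insurance",
                 yes=make_node("geico",
                               yes=make_leaf({"geico car insurance": 0.7,
                                              "geico insurance": 0.3}),
                               no=make_leaf({"cheap car insurance": 0.6,
                                             "auto insurance quotes": 0.4})),
                 no=make_leaf({"iphone cases": 0.5, "pizza": 0.5}))

dist = predict(tree, {"geico", "insurance", "quote"})
```

Averaging such distributions over an ensemble of randomized trees, as the next slide shows, smooths out the hard routing decisions of any single tree.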

Ensemble of Randomized Gating Trees. [Figure: three randomized gating trees, each leaf storing its own label distribution; predictions are aggregated across the ensemble.]

Efficient Training. We seek classifiers and optimization algorithms that: are massively parallelizable; don't need to load the feature vectors (1 TB) into RAM; don't need to load the label matrix (100 GB) into RAM.
Number of training points: 200 million
Number of labels: 100 million
Dimensionality of feature vector: 10 million
Number of cores: 500–1000
RAM per core: 2 GB
Training time: 28 hours

Multi-Label Random Forests. The splitting cost needs to be calculated in a 2^10M space. Is the word insurance present? [Figure: label distributions in the two resulting child nodes.]

Learning from Positively Labeled Data
Split condition: x_f > t
(f*, t*) = argmin_{f,t} Σ_k [ n_l p^l_{l_k} (1 − p^l_{l_k}) + n_r p^r_{l_k} (1 − p^r_{l_k}) ]
where p_{l_k} = Σ_i p(l_k | ad_i) p(ad_i)
[Figure: label distributions over l1, l2, l3 in the left and right children of the split x_f > t.]
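The criterion above is a size-weighted Gini impurity over the children's label distributions, estimated from positive labels alone. Here is a deliberately simplified sketch: it takes p(ad_i) as uniform (so p_{l_k} is just the fraction of points carrying label l_k), uses word-presence tests rather than general thresholds, and the toy data is invented.

```python
from collections import Counter

def gini_cost(points):
    """n * sum_k p_k (1 - p_k), where p_k = fraction of points with label k."""
    n = len(points)
    if n == 0:
        return 0.0
    label_counts = Counter(l for _, labels in points for l in labels)
    return n * sum((c / n) * (1 - c / n) for c in label_counts.values())

def best_split(points, candidate_words):
    """Pick the word whose presence test minimizes total child impurity."""
    best = None
    for word in candidate_words:
        left = [p for p in points if word in p[0]]
        right = [p for p in points if word not in p[0]]
        cost = gini_cost(left) + gini_cost(right)
        if best is None or cost < best[1]:
            best = (word, cost)
    return best

# Each point is (set of ad words, set of positive labels).
points = [({"geico", "insurance"}, {"geico car insurance"}),
          ({"cheap", "insurance"}, {"cheap car insurance"}),
          ({"iphone", "case"}, {"iphone cases"}),
          ({"iphone", "cover"}, {"iphone cases"})]
word, cost = best_split(points, ["insurance", "iphone", "cheap"])
```

Splitting on "insurance" cleanly separates the insurance ads from the iphone-case ads, so it scores better than "cheap", which strands mixed labels in one child.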

Multi-Label Random Forests
(x_1, y_1 = {l_2, l_3}), (x_2, y_2 = {l_1, l_3}), (x_3, y_3 = {l_1, l_2, l_3})
[Figure: each training point contributes a 0/1 indicator vector over labels l1, l2, l3; the node's label distribution p(y) is their average.]

Query Recommendation Data Sets. Data set statistics:

Data Set    # Training Points (M)   # Test Points (M)   # Dimensions (M)   # Labels (M)
Wikipedia   1.53                    0.66                1.89               0.97
Ads1        8.00                    0.50                1.58               1.22
Web         40.00                   1.50                2.62               1.22
Ads2        90.00                   5.00                5.80               9.70

Performance Evaluation: Precision@k. We use loss functions where the penalty incurred for predicting the real (but unknown) ground truth is never more than that of predicting any other labelling: L(y, y_observed) ≤ L(y', y_observed) for all y' ∈ Y. Candidates: Hamming loss, Precision at k. We found Precision at 10 to be robust for our application.
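Precision@k simply counts how many of the top-k predicted queries appear in the observed (e.g. clicked) label set, which is why missing labels can only lower it uniformly rather than penalize the ground truth more than other predictions. A minimal sketch with invented queries:

```python
def precision_at_k(ranked_predictions, observed_labels, k=10):
    """Fraction of the top-k ranked predictions that are observed labels."""
    top_k = ranked_predictions[:k]
    return sum(p in observed_labels for p in top_k) / k

preds = ["geico car insurance", "cheap car insurance", "pizza",
         "auto insurance quotes", "iphone cases"]
clicked = {"geico car insurance", "auto insurance quotes"}
p5 = precision_at_k(preds, clicked, k=5)
```

Here 2 of the top 5 predictions were clicked, giving Precision@5 = 0.4; the slides report the analogous Precision@10 numbers.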

Query Recommendation Results. [Bar chart, 0–30%: percentage of top 10 predictions that were clicked queries, MLRF vs KEX, on Wikipedia, Ads1, Web and Ads2.]

Query Recommendation Results. [Bar chart, 0–60%: percentage of top 10 predictions that were relevant, MLRF vs KEX, on Wikipedia, Ads1, Web and Ads2.]

Geico Car Insurance KEX MLRF geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

Domino s Pizza KEX MLRF dominos dominos pizza domino pizza domino pasta bowls domino pizza coupons domino pizza deals domino pizza locations domino pizza menu domino pizza online

Simone & Sylvia Kid's Clothing
KEX: plastic ponies simone plastics clothing and accessories sylvia pony clothing couture playground plastic recycling children's clothing
MLRF: toddlers clothes toddlers clothing toddler costumes children clothes sale children clothes designer children clothes cute children clothes retro clothing retro baby clothes baby clothing

KCS Flowers
KEX: funeral flowers sympathy funeral flowers web home bleitz funeral home funeral flowers discount yarington's funeral home harvey funeral home green lake funeral home howden kennedy funeral home arranging flowers
MLRF: flowers delivery funeral arrangements birthday flowers funeral flowers funeral planning flowers valentines free delivery flowers cheap flowers florists cheap flowers funeral

Vistaprint Designer T-Shirts
KEX: embroidered apparel custom apparel readymade apparel customizable apparel customizable apparel leading print online business cards apparel and accessories own text
MLRF: custom t shirts funny t shirts hanes beefy t shirts hanes t shirts long sleeve t shirts personalized t shirts printed t shirts retro gamer t shirts t shirts buy custom t shirts

Metlife Auto Insurance
KEX: metlife auto home insurance auto home insurance auto insurance massachusetts metlife agent driver discount additional cost saving benefits car discount auto quote
MLRF: metlife auto insurance auto insurance car insurance automobile insurance geico insurance cheap car insurance metlife auto insurance broker insurance home insurance

Wanta Thai Restaurant
KEX: authentic thai restaurant delicious thai food thai cuisine thai restaurant thai food wanta best thai restaurant thai eateries thai contemporary thai
MLRF: thai restaurant thai restaurants mexican restaurants cheap hotels hotels fast food restaurants restaurants coupons best web hosting restaurants vegetarian foods new york restaurants

best italian restaurants philadelphia italian restaurants italian restaurant italian restaurants arkansas italian restaurants connecticut italian restaurants idaho italian restaurants phoenix italian restaurant chains italian restaurant connecticut italian restaurant district columbia thai restaurant thai restaurants restaurants mexican restaurants

Compensating for Missing Labels. [Figure: inferred label beliefs, e.g. 0.5 Case-mate phone cases; 0.7 Auto insurance quotes, Esurance; 0.8 American family insurance; 0.9 Progressive insurance, Allstate auto insurance; Maggiano's restaurant.]

Training on Belief Vectors
(x_1, y_1 = {l_2, l_3}, f_1), (x_2, y_2 = {l_1, l_3}, f_2), (x_3, y_3 = {l_1, l_2, l_3}, f_3)
[Figure: each training point now contributes a real-valued belief vector f_i over labels l1, l2, l3; the node's distribution p(f) is their average.]

Sparse Semi-Supervised Learning. Graph-based SSL optimizes label belief smoothness and fidelity to the original labels:

F* = argmin_F (1/2) Tr( F^T (I − D^(−1/2) W D^(−1/2)) F ) + (λ/2) ‖F − Y‖²   subject to ‖F‖₀ ≤ K

W (M×M): document–document similarity matrix
D (M×M): diagonal matrix of the row sums of W
Y (M×L): 0/1 label matrix
F (M×L): real-valued label belief matrix
λ: trade-off hyperparameter
M: number of documents
L: number of labels
K: sparsity constant

Sparse Semi-Supervised Learning. Written element-wise, the same objective is:

F* = argmin_F (1/2) Σ_{i=1..L} Σ_{j=1..M} Σ_{l=1..M} w_{jl} ( F_{ji}/√D_{jj} − F_{li}/√D_{ll} )² + (λ/2) Σ_{i=1..M} Σ_{j=1..L} (F_{ij} − Y_{ij})²   subject to ‖F‖₀ ≤ K

W (M×M): document–document similarity matrix
D (M×M): diagonal matrix of the row sums of W
Y (M×L): 0/1 label matrix
F (M×L): real-valued label belief matrix
λ: trade-off hyperparameter
M: number of documents
L: number of labels
K: sparsity constant

Iterative Hard Thresholding. Sparse SSL formulation:

F* = argmin_F J(F) = (1/2) Tr( F^T (I − D^(−1/2) W D^(−1/2)) F ) + (λ/2) ‖F − Y‖²   subject to ‖F‖₀ ≤ K

The iterative hard thresholding algorithm converges to a global/local optimum:

F^0 = Y
F^(t+1/2) = (1/(λ+1)) D^(−1/2) W D^(−1/2) F^t + (λ/(λ+1)) Y
F^(t+1) = Top_K( F^(t+1/2) )
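The update above alternates a smoothing/fidelity step with a hard projection onto the K-sparsest matrices. A dense pure-Python sketch on a two-document toy graph (the similarity matrix, λ, K and iteration count are illustrative assumptions; a real implementation would exploit sparsity rather than materialize S):

```python
import math

def iht(W, Y, lam, K, iters=50):
    """Iterative hard thresholding for the sparse graph-SSL objective."""
    M, L = len(W), len(Y[0])
    d = [sum(row) for row in W]                       # row sums of W
    # Normalized similarity S = D^(-1/2) W D^(-1/2).
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(M)] for i in range(M)]
    F = [row[:] for row in Y]                         # F^0 = Y
    for _ in range(iters):
        # Gradient step: F <- S F / (lam+1) + lam Y / (lam+1).
        G = [[sum(S[i][m] * F[m][j] for m in range(M)) / (lam + 1)
              + lam * Y[i][j] / (lam + 1) for j in range(L)] for i in range(M)]
        # Hard threshold: zero all but the K largest-magnitude entries.
        flat = sorted((abs(G[i][j]), i, j) for i in range(M) for j in range(L))
        for _, i, j in flat[:-K]:
            G[i][j] = 0.0
        F = G
    return F

# Two similar documents; only the first is labeled with label 0.
W = [[1.0, 0.9], [0.9, 1.0]]
Y = [[1.0, 0.0], [0.0, 0.0]]
F = iht(W, Y, lam=1.0, K=2, iters=50)
```

The belief for label 0 propagates from the labeled document to its unlabeled neighbor, while the sparsity constraint keeps all other entries at exactly zero.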

Iterative Hard Thresholding. If Y_ij ∈ {0, 1} and W is positive definite then:
The sequence F^0, F^1, … converges to a stationary point F^∞, with J(F^0) ≥ J(F^1) ≥ … ≥ J(F^∞).
If ‖F^∞‖₀ < K then F^∞ is a globally optimal solution.
If ‖F^∞‖₀ = K then F^∞ is a locally optimal solution, with
J(F^∞) ≤ J(F*) + min( (λ/(2(λ+1))) (K + ‖Y‖₀), (1/2) (ML − K) α_K(F^∞) ‖Y‖₀ )

Semi-Supervised Learning Results. Precision@10 as judged by automatically generated click labels as well as by human experts.

Data Set    Click Labels (%)            Human Verification (%)
            MLRF   MLRF+SSL   KEX       MLRF   MLRF+SSL   KEX
Wikipedia   15.72  18.53      11.63     24.46  27.17      17.51
Ads1        18.13  19.88      11.96     45.86  47.53      41.95
Bing        22.51  25.32      18.42     50.47  51.83      47.69
Ads2        15.91  17.12      12.45     41.28  43.78      36.69

Query Expansion Results. Query expansion techniques can help both KEX and MLRF.

Data Set    Click Labels (%)               Human Verification (%)
            MLRF+SSL+KSP   KEX+KSP         MLRF+SSL+KSP   KEX+KSP
Wikipedia   18.01          10.81           31.48          22.14
Ads1        21.54          12.38           51.08          43.27
Web         26.66          19.88           53.69          48.13
Ads2        19.24          14.35           46.77          40.07

Query Recommendation Results. Edit distance on click labels [Ravi et al. WSDM 2010].

Data Set    KEX    KEX+KSP   MLRF   MLRF+SSL   MLRF+SSL+KSP
Wikipedia   0.81   0.78      0.71   0.66       0.63
Ads1        0.83   0.76      0.71   0.65       0.61
Web         0.73   0.68      0.65   0.62       0.58
Ads2        0.77   0.73      0.69   0.63       0.59

Conclusions. Query recommendation can be posed as multi-label learning. Learning with millions of labels can be tractable and accurate. Other applications: query expansion; document and ad relevance and ranking; fine-grained query intent classification.

Acknowledgements: Deepak Bapna, Prateek Jain, A. Kumaran, Mehul Parsana, Krishna Leela Poola, Adarsh Prasad, Varun Singla.

Advantages of an ML Approach Can generalize to other domains such as images on Flickr or videos on YouTube.

System Architecture. We leverage the Map/Reduce framework. Trees are grown in parallel breadth-wise. Number of compute nodes: Evaluators 500, Combiners 100, Maximizers 25. Our objective is to balance the compute load across machines while minimizing data flow. [Diagram: training data shards (X_1..X_N, X_{N+1}..X_{2N}, X_{2N+1}..X_{3N}, X_{3N+1}..X_{4N}) feed Evaluators 1–4, which feed Combiners 1–3, which feed Maximizers 1–2, producing the optimal split (F*, T*).]
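The evaluator/combiner stages amount to computing partial label histograms per data shard and summing them before a candidate split is scored. A minimal sketch of that aggregation pattern (the shards, keys and word test are illustrative; the real system distributes this over the grid and scores many tree/node/feature/threshold keys at once):

```python
from collections import Counter

def evaluate_shard(shard, word):
    """Evaluator: partial label histograms for one candidate split on one shard."""
    left, right = Counter(), Counter()
    for words, labels in shard:
        (left if word in words else right).update(labels)
    return left, right

def combine(partials):
    """Combiner: sum the partial histograms produced by all evaluators."""
    left, right = Counter(), Counter()
    for l, r in partials:
        left.update(l)
        right.update(r)
    return left, right

# Two shards of (ad words, positive labels) pairs, as an evaluator would see them.
shards = [
    [({"geico", "insurance"}, ["geico car insurance"])],
    [({"iphone", "case"}, ["iphone cases"]),
     ({"cheap", "insurance"}, ["cheap car insurance"])],
]
partials = [evaluate_shard(s, "insurance") for s in shards]
left, right = combine(partials)
```

Because histogram addition is associative, combiners can run in any order and in parallel; a maximizer then only needs the summed histograms per key to pick the best feature and threshold.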

Evaluators. Input: N training instances; a set of keys (Tree ID, Node ID, Feature ID and threshold). Output: partial label distributions for the keys. Computation: N × # of keys.

Combiners. Input: partial label distributions for assigned keys. Output: objective function values for the keys. Computation: # of keys × avg # of evaluators per key × # of labels in the distribution for the key.

Maximizers. Input: objective function values for assigned keys. Output: optimal feature and threshold for assigned nodes in trees. Computation: # of keys × avg # of features per key × avg # of thresholds per feature.