Multi-Label Learning with Millions of Labels for Query Recommendation
Rahul Agrawal (Microsoft AdCenter), Yashoteja Prabhu (Microsoft Research India), Archit Gupta (IIT Delhi), Manik Varma (Microsoft Research India)
Recommending Advertiser Bid Phrases
geico auto insurance, geico car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, all state car insurance coupon code
Query Rewriting
geico auto insurance, geico car insurance, absolutely cheapest car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, all state car insurance coupon code
Ranking & Relevance Meta Stream
geico auto insurance, geico car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, geico twitter
Learning to Predict a Set of Queries
f : X → 2^Y, where X is the set of ads and Y the set of queries (e.g. italian restaurant, need cheap auto insurance, geico online quote, car insurance, iphone).
Learning to Predict a Set of Queries
f(ad) → {need cheap auto insurance, geico car insurance}
Multi-Label Learning Challenges
- Infinite number of labels (queries)
- Training data acquisition
- Efficient training
- Cost of prediction
Binary Classification & Ranking
h : X × Y → {−1, +1}, scoring individual (ad, query) pairs, e.g. h(ad, geico), h(ad, iphone).
Binary Classification
h : X × Y → {−1, +1} over queries such as italian restaurant, need cheap auto insurance, geico online quote, car insurance, iphone.
Binary Classification: KEX
h : X × Y → {−1, +1} over queries such as switching to geico, geico online quote, car insurance.
Query Recommendations by KEX
Query Recommendations by KEX
h(ad, car insurance)? h(ad, iphone)?
Query Recommendations by KEX
plastic ponies, simone plastics, clothing and accessories, sylvia, pony clothing, couture playground, plastic recycling, children's clothing
Multi-Label Learning Formulation
f : X → 2^Y, where X is the set of ads and Y the set of queries (e.g. italian restaurant, need cheap auto insurance, geico online quote, car insurance, iphone).
Learning with Millions of Labels
f : X → 2^Y, where X is the set of ads and Y is a set of 10 million queries.
Multi-Label Random Forests
- We develop Multi-Label Random Forests with logarithmic prediction cost, making predictions in a few milliseconds.
- We train on 200M points, 100M categories and 10M features in 28 hours on a grid with 1000 compute nodes.
- We develop a tree-growing criterion that learns from positive data alone.
- We generate training data automatically from click logs.
- We develop a sparse SSL formulation to infer beliefs about the state of missing and noisy labels.
Training Data: Missing Labels
- No annotator can mark all the relevant labels for a data point.
- We have missing labels during training, validation and testing.
- Even fundamental ML techniques such as validation can go awry.
- One can't design error metrics invariant to missing labels.
Training Data and Features
TF-IDF bag-of-words features extracted from the ad (e.g. iphone, color, material).
Training Labels
The labels for a single iPhone case ad are the many queries that led to clicks on it, e.g. case for iphone, best iphone case, apple iphone 3g metallic slim fit case, iphone 4s case, otterbox universal defender case, belkin grip vue tint case, switcheasy neo case, waterproof iphone case, amazonbasics protective tpu case screen protector, best iphone 4s cases. Features are the TF-IDF bag of words (iphone, color, material).
Training Labels
Missing and Noisy Labels
best italian restaurants philadelphia, italian restaurants, italian restaurant, italian restaurants arkansas, italian restaurants connecticut, italian restaurants idaho, italian restaurants phoenix, italian restaurant chains, italian restaurant connecticut, italian restaurant district columbia, thai restaurant, thai restaurants, restaurants, mexican restaurants
Missing and Noisy Labels
Frequency-Biased Training Data
Most labels will have very few positive training examples (Zipf's law).
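As a quick illustration of this skew, the sketch below counts label frequencies in a tiny, made-up click-log sample (the ad names, queries and counts are invented for illustration; real logs contain vastly more entries):

```python
from collections import Counter

# Hypothetical (ad, clicked-query) pairs; entirely invented for illustration.
click_pairs = [
    ("geico_ad", "geico car insurance"), ("geico_ad", "geico car insurance"),
    ("geico_ad", "geico car insurance"), ("geico_ad", "cheap car insurance"),
    ("geico_ad", "cheap car insurance"), ("geico_ad", "geico online quote"),
    ("iphone_ad", "iphone case"), ("iphone_ad", "iphone case"),
    ("iphone_ad", "best iphone 4s case"),
]

# Number of positive training examples per label (query).
label_counts = Counter(query for _, query in click_pairs)

# Sorted descending, the counts fall off quickly: a few head queries
# dominate while most labels have only one or two positive examples.
frequencies = sorted(label_counts.values(), reverse=True)
print(frequencies)  # [3, 2, 2, 1, 1]
```

Even this toy sample shows the pattern the slide describes: the tail labels have almost no positives to learn from.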
Multi-Label Prediction Costs
Linear prediction costs, as in 1-vs-All classification (one classifier per label, e.g. geico car insurance, pizza, iphone cases), are infeasible.
Label and Feature Space Compression
The 10M-dimensional label space (with near-synonymous labels such as car / motor vehicle / auto and iphone cases / iphone case / cases iphone) and the 6M-dimensional feature space (car ads, iPhone case ads) are mapped into a 1K-dimensional embedding space.
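One generic way to realize such a compression is a random sign projection of the sparse TF-IDF vector into a small dense vector. The sketch below is purely illustrative: the feature ids, the 8-dimensional target and the hashing scheme are assumptions, not the embedding actually used in this work.

```python
import random

def random_projection(sparse_features, dim_out, seed=0):
    """Project a sparse feature vector {feature_id: tf_idf} into a dense
    dim_out-dimensional vector using hashed random sign projections.
    (A generic compression trick for illustration only.)"""
    dense = [0.0] * dim_out
    for feature_id, value in sparse_features.items():
        rng = random.Random((seed << 32) ^ feature_id)  # per-feature seed
        bucket = rng.randrange(dim_out)                 # target coordinate
        sign = 1.0 if rng.random() < 0.5 else -1.0      # random sign
        dense[bucket] += sign * value
    return dense

# A 6M-dimensional TF-IDF vector is stored sparsely; feature ids are made up.
ad = {1234567: 0.8, 2345678: 0.3, 3456789: 0.5}
embedding = random_projection(ad, dim_out=8)
print(len(embedding))  # 8
```

Because the per-feature seed is deterministic, the same sparse vector always maps to the same dense embedding, which is what makes such projections usable as fixed feature maps.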
Hierarchical Prediction Prediction in logarithmic time
Gating Tree Based Prediction
Prediction in logarithmic time. Each internal node asks a gating question, e.g. "Is the word insurance present in the ad?" and, if yes, "Is the word geico present in the ad?"; each leaf stores a distribution over the labels.
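A toy version of such a gating tree can be written as a nested dictionary; traversal touches one node per level, hence the logarithmic prediction cost. The tree, words and probabilities below are hypothetical, loosely mirroring the slide's example.

```python
def predict(node, ad_words):
    """Walk a gating tree: each internal node tests word presence in the
    ad, each leaf stores a distribution over labels (queries). Depth is
    logarithmic in the number of leaves, so prediction is fast."""
    while "word" in node:                       # internal (gating) node
        branch = "yes" if node["word"] in ad_words else "no"
        node = node[branch]
    return node["distribution"]                 # leaf node

# A tiny hypothetical tree mirroring the slide's example.
tree = {
    "word": "insurance",
    "yes": {
        "word": "geico",
        "yes": {"distribution": {"geico car insurance": 0.4, "geico insurance": 0.3}},
        "no":  {"distribution": {"cheap car insurance": 0.4, "auto insurance": 0.2}},
    },
    "no": {"distribution": {"iphone case": 0.4, "pizza": 0.1}},
}

print(predict(tree, {"geico", "insurance", "quote"}))
# {'geico car insurance': 0.4, 'geico insurance': 0.3}
```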
Ensemble of Randomized Gating Trees
Several randomized gating trees are grown; each yields a label distribution, and the ensemble aggregates them.
Efficient Training
We seek classifiers and optimization algorithms that:
- are massively parallelizable
- don't need to load the feature vectors (1 TB) into RAM
- don't need to load the label matrix (100 GB) into RAM
Scale: 200 million training points, 100 million labels, 10 million dimensional feature vectors; 500–1000 cores with 2 GB RAM per core; 28 hours training time.
Multi-Label Random Forests
The splitting cost needs to be calculated in a 2^10M-sized label space, for gating questions such as "Is the word insurance present?".
Learning from Positively Labeled Data
Split condition: x_f > t
(f*, t*) = argmin_{f,t}  n_l Σ_k p_l(l_k)(1 − p_l(l_k)) + n_r Σ_k p_r(l_k)(1 − p_r(l_k))
where p(l_k) = Σ_i p(l_k | ad_i) p(ad_i), and n_l, n_r are the numbers of points sent left and right.
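The criterion above is a Gini-style impurity computed from positive labels only. A minimal sketch, assuming points are represented as sets of their positive labels (the helper names and the example split are made up):

```python
def gini_cost(label_probs):
    """Sum over labels of p(1-p): low when the node's label distribution
    is concentrated (pure), high when it is spread out."""
    return sum(p * (1.0 - p) for p in label_probs.values())

def split_cost(left_points, right_points):
    """Weighted impurity n_l * Gini(left) + n_r * Gini(right), where each
    point is the set of its positive labels; p(l) is the fraction of
    points in the node tagged with label l (positive data only)."""
    def node_probs(points):
        n = len(points)
        labels = set().union(*points) if points else set()
        return {l: sum(l in y for y in points) / n for l in labels}
    return (len(left_points) * gini_cost(node_probs(left_points))
            + len(right_points) * gini_cost(node_probs(right_points)))

# Hypothetical split: each point is a set of positive labels.
left = [{"l2", "l3"}, {"l2"}]          # concentrated on l2/l3
right = [{"l1", "l3"}, {"l1"}]         # concentrated on l1/l3
print(split_cost(left, right))  # 1.0
```

Growing a tree then amounts to picking, at each node, the (feature, threshold) pair whose induced left/right partition minimizes this cost.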
Multi-Label Random Forests
Given training points (x_1, y_1), (x_2, y_2), (x_3, y_3) with label sets y_1 = {l_2, l_3}, y_2 = {l_1, l_3}, y_3 = {l_1, l_2, l_3}, a node's distribution p(y) is the average of the points' binary label vectors.
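The averaging step can be sketched directly (function and label names are illustrative):

```python
def leaf_distribution(label_sets, all_labels):
    """A node's prediction p(y) is the average of the points' binary
    label vectors: the fraction of training points tagged with each label."""
    n = len(label_sets)
    return {l: sum(l in y for y in label_sets) / n for l in all_labels}

# The three points from the slide.
points = [{"l2", "l3"}, {"l1", "l3"}, {"l1", "l2", "l3"}]
print(leaf_distribution(points, ["l1", "l2", "l3"]))
# l1 and l2 each get belief 2/3, l3 gets 1.0
```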
Query Recommendation Data Sets
Data set statistics:
Data Set    # Training Points (M)   # Test Points (M)   # Dimensions (M)   # Labels (M)
Wikipedia   1.53                    0.66                1.89               0.97
Ads1        8.00                    0.50                1.58               1.22
Web         40.00                   1.50                2.62               1.22
Ads2        90.00                   5.00                5.80               9.70
Performance Evaluation: Precision@k
We use loss functions where the penalty incurred for predicting the real (but unknown) ground truth is never more than that of predicting any other labelling:
L(y*, y_observed) ≤ L(y, y_observed)  for all y ∈ Y.
Hamming loss and Precision at k have this property; we found Precision at 10 to be robust for our application.
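Precision@k itself is straightforward to compute; a sketch with invented queries:

```python
def precision_at_k(predicted_ranking, relevant, k=10):
    """Fraction of the top-k predicted labels that are relevant. Only the
    top of the ranking is scored, so relevant labels missing from the
    ground truth outside the top k cost nothing."""
    top_k = predicted_ranking[:k]
    return sum(label in relevant for label in top_k) / k

predictions = ["geico car insurance", "geico insurance", "cheap car insurance",
               "iphone case", "pizza"]
clicked = {"geico car insurance", "cheap car insurance"}
print(precision_at_k(predictions, clicked, k=5))  # 0.4
```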
Query Recommendation Results
Percentage of top 10 predictions that were clicked queries (bar chart: MLRF vs KEX on Wikipedia, Ads1, Web and Ads2).
Query Recommendation Results
Percentage of top 10 predictions that were relevant (bar chart: MLRF vs KEX on Wikipedia, Ads1, Web and Ads2).
Geico Car Insurance: KEX vs MLRF
geico auto insurance, geico car insurance, geico insurance, www geico com, care geicos, geico com, need cheap auto insurance wisconsin, cheap car insurance quotes, cheap auto insurance florida, all state car insurance coupon code
Domino's Pizza: KEX vs MLRF
dominos, dominos pizza, domino pizza, domino pasta bowls, domino pizza coupons, domino pizza deals, domino pizza locations, domino pizza menu, domino pizza online
Simone & Sylvia Kid's Clothing
KEX: plastic ponies, simone plastics, clothing and accessories, sylvia, pony clothing, couture playground, plastic recycling, children's clothing
MLRF: toddlers clothes, toddlers clothing, toddler costumes, children clothes sale, children clothes designer, children clothes cute, children clothes, retro clothing, retro baby clothes, baby clothing
KCS Flowers
KEX: funeral flowers, sympathy funeral flowers, web home, bleitz funeral home, funeral flowers discount, yarington's funeral home, harvey funeral home, green lake funeral home, howden kennedy funeral home, arranging flowers
MLRF: flowers delivery, funeral arrangements, birthday flowers, funeral flowers, funeral planning, flowers valentines, free delivery flowers, cheap flowers, florists, cheap flowers funeral
Vistaprint Designer T-Shirts
KEX: embroidered apparel, custom apparel, readymade apparel, customizable apparel, customizable apparel, leading print online, business cards, apparel and accessories, own text
MLRF: custom t shirts, funny t shirts, hanes beefy t shirts, hanes t shirts, long sleeve t shirts, personalized t shirts, printed t shirts, retro gamer t shirts, t shirts, buy custom t shirts
Metlife Auto Insurance
KEX: metlife auto home insurance, auto home insurance, auto insurance massachusetts, metlife agent, driver discount, additional cost saving benefits, car discount, auto quote
MLRF: metlife auto insurance, auto insurance, car insurance, automobile insurance, geico insurance, cheap car insurance, metlife auto insurance, broker insurance, home insurance
Wanta Thai Restaurant
KEX: authentic thai restaurant, delicious thai food, thai cuisine, thai restaurant, thai food, wanta, best thai restaurant, thai eateries, thai, contemporary thai
MLRF: thai restaurant, thai restaurants, mexican restaurants, cheap hotels, hotels, fast food restaurants, restaurants coupons, best web hosting, restaurants, vegetarian foods, new york restaurants
Compensating for Missing Labels
Inferred label beliefs (e.g. 0.5, 0.7, 0.8, 0.9) for labels such as Case-mate phone cases, Auto insurance quotes, Esurance, American family insurance, Progressive insurance, Allstate auto insurance, Maggiano's restaurant.
Training on Belief Vectors
Each training point now carries a real-valued belief vector: (x_1, y_1 = {l_2, l_3}, f_1), (x_2, y_2 = {l_1, l_3}, f_2), (x_3, y_3 = {l_1, l_2, l_3}, f_3). A node's distribution p(f) is the average of the belief vectors f_i.
Sparse Semi-Supervised Learning
Graph-based SSL optimizes label-belief smoothness and fidelity to the original labels:
F* = argmin_F ½ Tr(Fᵀ (I − D^(−1/2) W D^(−1/2)) F) + (λ/2) ‖F − Y‖²   s.t. ‖F‖₀ ≤ K
W (M×M): document-document similarity matrix
D (M×M): diagonal matrix of the row sums of W
Y (M×L): 0/1 label matrix
F (M×L): real-valued label-belief matrix
λ: trade-off hyperparameter
M: number of documents; L: number of labels; K: sparsity constant
Sparse Semi-Supervised Learning
Written element-wise, the same objective is:
F* = argmin_F ½ Σ_{l=1..L} Σ_{i=1..M} Σ_{j=1..M} w_ij (F_il/√D_ii − F_jl/√D_jj)² + (λ/2) Σ_{i=1..M} Σ_{l=1..L} (F_il − Y_il)²   s.t. ‖F‖₀ ≤ K
with W, D, Y, F, λ, M, L and K as above: W (M×M) document-document similarity, D (M×M) diagonal row sums of W, Y (M×L) 0/1 labels, F (M×L) real-valued beliefs, λ the trade-off hyperparameter, K the sparsity constant.
Iterative Hard Thresholding
Sparse SSL formulation:
F* = argmin_F J(F) = ½ Tr(Fᵀ (I − D^(−1/2) W D^(−1/2)) F) + (λ/2) ‖F − Y‖²   s.t. ‖F‖₀ ≤ K
The iterative hard thresholding algorithm converges to a global/local optimum:
F⁰ = Y
F^(t+½) = (1/(λ+1)) D^(−1/2) W D^(−1/2) F^t + (λ/(λ+1)) Y
F^(t+1) = Top_K(F^(t+½))
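A single IHT step can be sketched in pure Python on a toy problem (the 3-document similarity matrix, λ and K below are invented; real instances have millions of rows and require the distributed implementation):

```python
def iht_step(S, Y, F, lam, K):
    """One iterative-hard-thresholding step for the sparse SSL problem:
    smooth beliefs over the similarity graph, pull toward the observed
    labels Y, then keep only the K largest entries.
    S plays the role of the normalized similarity D^{-1/2} W D^{-1/2}."""
    M, L = len(Y), len(Y[0])
    # F_half = (1/(lam+1)) * S @ F + (lam/(lam+1)) * Y
    F_half = [[sum(S[i][j] * F[j][l] for j in range(M)) / (lam + 1)
               + lam / (lam + 1) * Y[i][l]
               for l in range(L)] for i in range(M)]
    # Hard threshold: zero out all but the K largest entries.
    entries = sorted(((F_half[i][l], i, l) for i in range(M) for l in range(L)),
                     reverse=True)
    kept = {(i, l) for _, i, l in entries[:K]}
    return [[F_half[i][l] if (i, l) in kept else 0.0
             for l in range(L)] for i in range(M)]

# Toy example: 3 documents, 2 labels; doc 2 has a missing label that the
# graph propagates from its similar neighbour, doc 1.
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
Y = [[1.0, 0.0],
     [0.0, 0.0],   # missing label for doc 2
     [0.0, 1.0]]
F = [row[:] for row in Y]                 # F_0 = Y
F = iht_step(S, Y, F, lam=1.0, K=3)
print(F)  # [[1.0, 0.0], [0.4, 0.0], [0.0, 1.0]]
```

After one step, doc 2 acquires a nonzero belief (0.4) in label 1 purely through its graph similarity to doc 1, which is exactly how the missing-label compensation works.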
Iterative Hard Thresholding
If Y_ij ∈ {0, 1} and W is positive definite then:
- The sequence F⁰, F¹, … converges to a stationary point F̄, with J(F⁰) ≥ J(F¹) ≥ … ≥ J(F̄).
- If ‖F̄‖₀ < K then F̄ is a globally optimal solution.
- If ‖F̄‖₀ = K then F̄ is a locally optimal solution, with
  J(F̄) ≤ J(F*) + min( (λ/2)(K + ‖Y‖₀), ((λ+1)/2)(ML − K) α_K(F̄) ‖Y‖₀ )
Semi-Supervised Learning Results
Precision@10 as judged by automatically generated click labels as well as by human experts.
                 Click Labels (%)             Human Verification (%)
Data Set     MLRF    MLRF+SSL   KEX       MLRF    MLRF+SSL   KEX
Wikipedia    15.72   18.53      11.63     24.46   27.17      17.51
Ads1         18.13   19.88      11.96     45.86   47.53      41.95
Bing         22.51   25.32      18.42     50.47   51.83      47.69
Ads2         15.91   17.12      12.45     41.28   43.78      36.69
Query Expansion Results
Query expansion techniques can help both KEX and MLRF.
                 Click Labels (%)                 Human Verification (%)
Data Set     MLRF+SSL+KSP   KEX+KSP           MLRF+SSL+KSP   KEX+KSP
Wikipedia    18.01          10.81             31.48          22.14
Ads1         21.54          12.38             51.08          43.27
Web          26.66          19.88             53.69          48.13
Ads2         19.24          14.35             46.77          40.07
Query Recommendation Results
Edit distance [Ravi et al., WSDM 2010] on click labels (lower is better).
Data Set     KEX    KEX+KSP   MLRF   MLRF+SSL   MLRF+SSL+KSP
Wikipedia    0.81   0.78      0.71   0.66       0.63
Ads1         0.83   0.76      0.71   0.65       0.61
Web          0.73   0.68      0.65   0.62       0.58
Ads2         0.77   0.73      0.69   0.63       0.59
Conclusions
- Query recommendation can be posed as multi-label learning.
- Learning with millions of labels can be tractable and accurate.
- Other applications: query expansion; document and ad relevance and ranking; fine-grained query intent classification.
Acknowledgements
Deepak Bapna, Prateek Jain, A. Kumaran, Mehul Parsana, Krishna Leela Poola, Adarsh Prasad, Varun Singla
Advantages of an ML Approach Can generalize to other domains such as images on Flickr or videos on YouTube.
System Architecture
We leverage the Map/Reduce framework; trees are grown in parallel, breadth-wise. Our objective is to balance the compute load across machines while minimizing data flow.
Number of compute nodes: Evaluators 500, Combiners 100, Maximizers 25.
Each evaluator processes one shard of N training points; combiners merge the evaluators' outputs and maximizers emit the optimal split (F*, T*).
Evaluators
Input: N training instances and a set of keys (Tree ID, Node ID, Feature ID and threshold).
Output: partial label distributions for the keys.
Computation: N × # of keys.
Combiners
Input: partial label distributions for assigned keys.
Output: objective function values for the keys.
Computation: # of keys × avg # of evaluators per key × # of labels in the distribution for the key.
Maximizers
Input: objective function values for assigned keys.
Output: optimal feature and threshold for assigned nodes in trees.
Computation: # of keys × avg # of features per key × avg # of thresholds per feature.
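The three roles compose into a map/combine/maximize pipeline. The single-process sketch below is only a toy: the shards, keys and the Gini-style objective are illustrative stand-ins for the distributed system described above.

```python
from collections import defaultdict

# Candidate split keys: (feature, threshold). Evaluators each see a shard
# of the data and emit partial label counts for every key; combiners sum
# the partials and score each key; the maximizer picks the best split.

def evaluate(shard, keys):
    """Evaluator: partial (count, per-label) statistics per key and side."""
    partials = defaultdict(lambda: defaultdict(int))
    for x, labels in shard:
        for feature, threshold in keys:
            side = "left" if x.get(feature, 0.0) > threshold else "right"
            partials[(feature, threshold)][(side, "n")] += 1
            for l in labels:
                partials[(feature, threshold)][(side, l)] += 1
    return partials

def combine(all_partials, key):
    """Combiner: merge partials for one key and compute the split cost."""
    merged = defaultdict(int)
    for partials in all_partials:
        for k, v in partials.get(key, {}).items():
            merged[k] += v
    cost = 0.0
    for side in ("left", "right"):
        n = merged[(side, "n")]
        if n == 0:
            continue
        probs = [v / n for (s, l), v in merged.items() if s == side and l != "n"]
        cost += n * sum(p * (1 - p) for p in probs)   # Gini-style impurity
    return cost

keys = [("insurance", 0.5), ("pizza", 0.5)]
shards = [  # two evaluator shards; ads are sparse feature dicts
    [({"insurance": 1.0}, {"geico"}), ({"insurance": 1.0}, {"geico"})],
    [({"pizza": 1.0}, {"dominos"}), ({}, {"iphone case"})],
]
partials = [evaluate(shard, keys) for shard in shards]     # map phase
costs = {key: combine(partials, key) for key in keys}      # combine phase
best = min(costs, key=costs.get)                           # maximize phase
print(best)  # ('insurance', 0.5)
```

Splitting on "insurance" wins here because it separates the insurance ads cleanly from the rest, which is the same load-balanced split search the evaluator/combiner/maximizer grid performs at scale.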