Multi-Label Learning with Millions of Labels for Query Recommendation

1 Multi-Label Learning with Millions of Labels for Query Recommendation. Rahul Agrawal (Microsoft AdCenter), Yashoteja Prabhu (Microsoft Research India), Archit Gupta (IIT Delhi), Manik Varma (Microsoft Research India)

2 Recommending Advertiser Bid Phrases geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

3 Query Rewriting geico auto insurance geico car insurance Absolutely cheapest car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

4 Ranking & Relevance Meta Stream geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida geico twitter

5 Recommending Advertiser Bid Phrases geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

6 Learning to Predict a Set of Queries. $f : X \to 2^Y$, with X: Ads and Y: Queries. Example queries: italian restaurant, need cheap auto insurance, geico online quote, car insurance, iphone

7 Learning to Predict a Set of Queries f ( ) need cheap auto insurance geico car insurance

8 Multi-Label Learning Challenges f ( ) need cheap auto insurance geico car insurance Infinite number of labels (queries) Training data acquisition Efficient training Cost of prediction

9 Binary Classification & Ranking h : (X, Y) {, } h(, geico) h(, iphone) Infinite number of labels (queries) Training data acquisition Efficient training Cost of prediction

10 Binary Classification h : (X, Y) {, } italian restaurant need cheap auto insurance geico online quote car insurance Infinite number of labels (queries) Training data acquisition Efficient training Cost of prediction iphone

11 Binary Classification KEX h : (X, Y) {, } switching to geico geico online quote car insurance Infinite number of labels (queries) Training data acquisition Efficient training Cost of prediction

12 Query Recommendations by KEX

13 Query Recommendations by KEX h(, car insurance)? h(, iphone)?

14 Query Recommendations by KEX plastic ponies simone plastics clothing and accessories sylvia pony clothing couture playground plastic recycling children's clothing

15 Multi-Label Learning Formulation. $f : X \to 2^Y$, with X: Ads and Y: Queries. Example queries: italian restaurant, need cheap auto insurance, geico online quote, car insurance, iphone

16 Learning with Millions of Labels. $f : X \to 2^Y$, with X: Ads and Y: 10 Million Queries. Example queries: italian restaurant, need cheap auto insurance, geico online quote, car insurance, iphone

17 Multi-Label Random Forests We develop Multi-Label Random Forests with logarithmic prediction costs that make predictions in a few milliseconds. We train on 200 M points, 100 M categories and 10 M features in 28 hours on a grid with 1000 compute nodes. We develop a tree growing criterion which learns from positive data alone. We generate training data automatically from click logs. We develop a sparse SSL formulation to infer beliefs about the state of missing and noisy labels.

18 Training Data: Missing Labels. No annotator can mark all the relevant labels for a data point. We have missing labels during training, validation and testing. Even fundamental ML techniques such as validation can go awry. One can't design error metrics invariant to missing labels.

19 Training Data and Features iphone color material TF-IDF Bag of Words Features
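
The TF-IDF bag-of-words features named on this slide can be illustrated with a minimal sketch. The exact weighting used by the authors is not given in the slides; the term-frequency normalization, the logarithmic IDF, the whitespace tokenizer, and the example ad texts below are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {word: weight} dict per document (assumed weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = relative term frequency * log inverse document frequency.
        vec = {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        vectors.append(vec)
    return vectors

# Hypothetical ad texts, not taken from the data sets in the talk.
ads = ["iphone case black silicone", "iphone case waterproof", "cheap car insurance"]
vectors = tfidf_vectors(ads)
```

Words shared by many ads (here "iphone" and "case") receive a lower weight than words specific to one ad (here "silicone"), which is the usual motivation for TF-IDF.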

20 Training Labels case for iphone best iphone case apple iphone 3g metallic slim fit case best iphone nn4 cases iphone cases best iphone cases apple iphone 4g cases best iphone nn4 case iphone 3gs cases iphone 4s case case iphone otterbox universal defender case iphone nn4 black silicone black plastic sena iphone cases apple iphone 4g premium soft silicone rubber black phone protector skin cover case apple iphone nn4 cases belkin grip vue tint case iphone nn4 clear black white premium bumper case apple iphone nn4 att bunny rabbit silicone case skin iphone nn4 stand tail holder iphone color material iphone case iphone 4g cases iphone case speck iphone case best case iphone 4s iphone 4gs cases iphone nn4 case switcheasy neo case iphone 3g black best case iphone nn4 iphone 4s defender series case 3g iphone cases waterproof iphone case best iphone 3g cases iphone case design TF-IDF Bag of Words Features iphone cases 4g apple iphone cases waterproof iphone cases best iphone 4s case iphone cases 3g best iphone 3g case amazonbasics protective tpu case screen protector att verizon iphone nn4 iphone 4s clear best iphone 4s cases

21 Training Labels

22 Missing and Noisy Labels best italian restaurants philadelphia italian restaurants italian restaurant italian restaurants arkansas italian restaurants connecticut italian restaurants idaho italian restaurants phoenix italian restaurant chains italian restaurant connecticut italian restaurant district columbia thai restaurant thai restaurants restaurants mexican restaurants

23 Missing and Noisy Labels

24 Frequency Biased Training Data. Most labels will have very few positive training examples (Zipf's Law).

25 Multi-Label Prediction Costs Linear prediction costs are infeasible geico car insurance pizza iphone cases 1-vs-All Classification

26 Label and Feature Space Compression 10M Dimensional Label Space car motor vehicle auto iphone cases iphone case cases iphone 6M Dimensional Feature Space Car Ads iphone Case Ads 1K Dimensional Embedding Space

27 Hierarchical Prediction Prediction in logarithmic time

28 Gating Tree Based Prediction. Prediction in logarithmic time. Is the word "insurance" present in the ad? Yes / No. Is the word "geico" present in the ad? Yes / No.
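
The word-presence questions on this slide can be read as a walk from the root of a tree to a leaf. The sketch below is a minimal illustration under assumptions: the `Node` class, the `predict` function, and the leaf scores are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

@dataclass
class Node:
    word: Optional[str] = None                       # word tested at an internal node
    yes: Optional["Node"] = None                     # child followed if the word is present
    no: Optional["Node"] = None                      # child followed if the word is absent
    label_scores: Optional[Dict[str, float]] = None  # set at leaves only

def predict(root: Node, ad_words: Set[str]) -> Dict[str, float]:
    """Follow one root-to-leaf path; the cost is the tree depth, not the label count."""
    node = root
    while node.label_scores is None:
        node = node.yes if node.word in ad_words else node.no
    return node.label_scores

# Hypothetical tree mirroring the two questions on the slide; scores are made up.
tree = Node(
    word="insurance",
    yes=Node(
        word="geico",
        yes=Node(label_scores={"geico car insurance": 0.6, "geico auto insurance": 0.4}),
        no=Node(label_scores={"cheap car insurance": 1.0}),
    ),
    no=Node(label_scores={"iphone cases": 1.0}),
)

scores = predict(tree, {"geico", "insurance", "quote"})
```

Only the nodes on one path are visited, so prediction never touches the labels stored in other leaves.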

29 Ensemble of Randomized Gating Trees

30 Efficient Training. We seek classifiers and optimization algorithms that are massively parallelizable, don't need to load the feature vectors (1 TB) into RAM, and don't need to load the label matrix (100 GB) into RAM. Number of training points: 200 million. Number of labels: 100 million. Dimensionality of feature vector: 10 million. RAM per core: 2 GB. Training time: 28 hours. Number of cores: value missing from the transcription.

31 Multi-Label Random Forests. The splitting cost needs to be calculated in a $2^{10M}$ space. Is the word "insurance" present?

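32 Learning from Positively Labeled Data. Split condition: $x_f > t$, with $(f^*, t^*) = \arg\min_{f,t} \; n_l \sum_k p_l(l_k)\,(1 - p_l(l_k)) + n_r \sum_k p_r(l_k)\,(1 - p_r(l_k))$, where $p(l_k) = \sum_i p(l_k \mid ad_i)\, p(ad_i)$ is computed over the ads in the left ($l$) and right ($r$) child respectively. [Figure: a node testing $x_f > t$, with label histograms over l1, l2, l3 for its two children.]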
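
The split criterion on this slide can be sketched as a Gini-style impurity computed from positive labels only. The sketch makes two assumptions that the slide does not state: $p(ad_i)$ is uniform over the ads in a child, and $p(l_k \mid ad_i)$ is uniform over the positive labels of ad $i$. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def label_distribution(label_sets):
    """p(l_k) = sum_i p(l_k | ad_i) p(ad_i), assuming uniform p(ad_i) and
    p(l_k | ad_i) = 1/|y_i| over the positive labels y_i of ad i."""
    dist = defaultdict(float)
    for labels in label_sets:
        for k in labels:
            dist[k] += 1.0 / (len(labels) * len(label_sets))
    return dist

def gini(dist):
    """sum_k p(l_k) (1 - p(l_k)) for one child."""
    return sum(p * (1.0 - p) for p in dist.values())

def split_cost(points, f, t):
    """n_l * Gini(left) + n_r * Gini(right) for the test x_f > t."""
    left = [y for x, y in points if x.get(f, 0.0) <= t]
    right = [y for x, y in points if x.get(f, 0.0) > t]
    return sum(len(side) * gini(label_distribution(side)) for side in (left, right) if side)

def best_split(points, candidates):
    """Pick the (feature, threshold) pair with the lowest cost."""
    return min(candidates, key=lambda ft: split_cost(points, *ft))

# Toy data: (sparse feature dict, set of positive labels). Values are made up.
points = [
    ({"insurance": 1.0}, {"car insurance", "geico"}),
    ({"insurance": 1.0}, {"car insurance"}),
    ({"iphone": 1.0}, {"iphone case"}),
    ({"iphone": 1.0, "case": 1.0}, {"iphone case"}),
]
chosen = best_split(points, [("insurance", 0.5), ("case", 0.5)])
```

No negative labels appear anywhere in the computation: a label that is absent from an ad simply contributes nothing to the child's distribution.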

33 Multi-Label Random Forests. Training points: $(x_1, y_1 = \{l_2, l_3\})$, $(x_2, y_2 = \{l_1, l_3\})$, $(x_3, y_3 = \{l_1, l_2, l_3\})$. [Figure: the label histograms over l1, l2, l3 of the three points are combined into the distribution p(y).]

34 Query Recommendation Data Sets Data set statistics Data Set # of Training Points (M) # of Test Points (M) # of Dimensions (M) # of Labels (M) Wikipedia Ads Web Ads

35 Performance Evaluation. We use loss functions where the penalty incurred for predicting the real (but unknown) ground truth is never more than that of predicting any other labelling: $L(y^*, y^{Observed}) \le L(y, y^{Observed}) \;\; \forall y \in Y$. Hamming Loss. Precision at k. We found Precision at 10 to be robust for our application.
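
Precision at k, the metric the slide settles on, can be written in a few lines. This is a minimal sketch; the function name and the example queries are hypothetical, and the observed labels stand in for the clicked queries used in the talk.

```python
def precision_at_k(ranked_predictions, observed_labels, k=10):
    """Fraction of the top-k predicted queries that appear among the observed labels."""
    top_k = ranked_predictions[:k]
    hits = sum(1 for query in top_k if query in observed_labels)
    return hits / k

# Hypothetical ranked predictions and observed (clicked) queries.
predicted = ["geico car insurance", "cheap car insurance", "iphone cases", "geico insurance"]
observed = {"geico car insurance", "geico insurance"}
score = precision_at_k(predicted, observed, k=4)
```

Because the metric only rewards predictions that were observed, a relevant but unobserved query counts as a miss, so the measured value can only understate the true precision.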

36 Query Recommendation Results MLRF KEX Wikipedia Ads1 Web Ads2 Percentage of top 10 predictions that were clicked queries

37 Query Recommendation Results MLRF KEX Wikipedia Ads1 Web Ads2 Percentage of top 10 predictions that were relevant

38

39 Geico Car Insurance KEX MLRF geico auto insurance geico car insurance geico insurance www geico com care geicos geico com need cheap auto insurance wisconsin cheap car insurance quotes cheap auto insurance florida all state car insurance coupon code

40

41 Domino's Pizza KEX MLRF dominos dominos pizza domino pizza domino pasta bowls domino pizza coupons domino pizza deals domino pizza locations domino pizza menu domino pizza online

42

43 Simone & Sylvia Kid's Clothing KEX: plastic ponies simone plastics clothing and accessories sylvia pony clothing couture playground Plastic recycling children's clothing MLRF: toddlers clothes toddlers clothing toddler costumes children clothes sale children clothes designer children clothes cute children clothes retro clothing retro baby clothes baby clothing

44

45 KCS Flowers KEX funeral flowers sympathy funeral flowers web home bleitz funeral home funeral flowers discount yarington's funeral home harvey funeral home green lake funeral home howden kennedy funeral home arranging flowers MLRF flowers delivery funeral arrangements birthday flowers funeral flowers funeral planning flowers valentines free delivery flowers cheap flowers florists cheap flowers funeral

46

47 Vistaprint Designer T-Shirts KEX embroidered apparel custom apparel readymade apparel customizable apparel customizable apparel leading print online business cards apparel and accessories own text MLRF custom t shirts funny t shirts hanes beefy t shirts hanes t shirts long sleeve t shirts personalized t shirts printed t shirts retro gamer t shirts t shirts buy custom t shirts

48

49 Metlife Auto Insurance KEX metlife auto home insurance auto home insurance auto insurance massachusetts metlife agent driver discount additional cost saving benefits car discount auto quote MLRF metlife auto insurance auto Insurance car Insurance automobile Insurance geico insurance cheap car insurance metlife auto insurance broker insurance home insurance

50

51 Wanta Thai Restaurant KEX authentic thai restaurant delicious thai food thai cuisine thai restaurant thai food wanta best thai restaurant thai eateries thai contemporary thai MLRF thai restaurant thai restaurants mexican restaurants cheap hotels hotels fast food restaurants restaurants coupons best web hosting restaurants vegetarian foods new york restaurants

52

53 best italian restaurants philadelphia italian restaurants italian restaurant italian restaurants arkansas italian restaurants connecticut italian restaurants idaho italian restaurants phoenix italian restaurant chains italian restaurant connecticut italian restaurant district columbia thai restaurant thai restaurants restaurants mexican restaurants

54

55 Compensating for Missing Labels 0.5 Case-mate phone cases 0.7 Auto insurance quotes Esurance 0.8 American family insurance 0.9 Progressive insurance Allstate auto insurance Maggiano's restaurant

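56 Training on Belief Vectors. Training points: $(x_1, y_1 = \{l_2, l_3\}, f_1)$, $(x_2, y_2 = \{l_1, l_3\}, f_2)$, $(x_3, y_3 = \{l_1, l_2, l_3\}, f_3)$. [Figure: the belief vectors $f_1, f_2, f_3$ over l1, l2, l3 of the points $(x_1, f_1)$, $(x_2, f_2)$, $(x_3, f_3)$ are combined into the distribution p(f).]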

57 Sparse Semi-Supervised Learning. Graph-based SSL optimizes label belief smoothness and fidelity to original labels: $F^* = \arg\min_F \; \frac{1}{2} \mathrm{Tr}\left(F^{\top} (I - D^{-1/2} W D^{-1/2}) F\right) + \frac{\lambda}{2} \|F - Y\|^2 \;\; \text{s.t.} \;\; \|F\|_0 \le K$. Symbols: $W_{M \times M}$: document-document similarity matrix. $D_{M \times M}$: diagonal matrix representing the row sums of $W$. $Y_{M \times L}$: 0/1 label matrix. $F_{M \times L}$: real valued label belief matrix. $\lambda$: trade-off hyperparameter. $M$: number of documents. $L$: number of labels. $K$: sparsity constant.

58 Sparse Semi-Supervised Learning. Graph-based SSL optimizes label belief smoothness and fidelity to original labels: $F^* = \arg\min_F \; \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{M} \sum_{l=1}^{M} w_{jl} \left( \frac{F_{ji}}{\sqrt{D_{jj}}} - \frac{F_{li}}{\sqrt{D_{ll}}} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{M} \sum_{j=1}^{L} (F_{ij} - Y_{ij})^2 \;\; \text{s.t.} \;\; \|F\|_0 \le K$. Symbols: $W_{M \times M}$: document-document similarity matrix. $D_{M \times M}$: diagonal matrix representing the row sums of $W$. $Y_{M \times L}$: 0/1 label matrix. $F_{M \times L}$: real valued label belief matrix. $\lambda$: trade-off hyperparameter. $M$: number of documents. $L$: number of labels. $K$: sparsity constant.

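59 Iterative Hard Thresholding. Sparse SSL formulation: $F^* = \arg\min_F J(F)$, with $J(F) = \frac{1}{2} \mathrm{Tr}\left(F^{\top} (I - D^{-1/2} W D^{-1/2}) F\right) + \frac{\lambda}{2} \|F - Y\|^2 \;\; \text{s.t.} \;\; \|F\|_0 \le K$. The iterative hard thresholding algorithm converges to a global/local optimum: $F^0 = Y$; $F^{t+1/2} = \frac{1}{\lambda+1} D^{-1/2} W D^{-1/2} F^t + \frac{\lambda}{\lambda+1} Y$; $F^{t+1} = \mathrm{Top}_K(F^{t+1/2})$.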
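
The two-step update on this slide (a smoothing step followed by keeping the top K entries) can be sketched in plain Python on a toy problem. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, Top K is taken over all entries of the matrix by magnitude, a fixed iteration count replaces a convergence test, and every row of W is assumed to have a positive sum.

```python
import math

def normalized_similarity(W):
    """S = D^{-1/2} W D^{-1/2}, where D holds the row sums of W (assumed positive)."""
    d = [sum(row) for row in W]
    m = len(W)
    return [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(m)] for i in range(m)]

def top_k(F, K):
    """Keep the K largest-magnitude entries of F and zero out the rest."""
    entries = sorted(((abs(v), i, j) for i, row in enumerate(F) for j, v in enumerate(row)),
                     reverse=True)
    keep = {(i, j) for _, i, j in entries[:K]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(F)]

def iterative_hard_thresholding(W, Y, lam, K, n_iters=50):
    S = normalized_similarity(W)
    m, n_labels = len(Y), len(Y[0])
    F = [[float(v) for v in row] for row in Y]  # F^0 = Y
    for _ in range(n_iters):
        # F^{t+1/2} = (S F^t + lam * Y) / (lam + 1)
        half = [[(sum(S[i][j] * F[j][l] for j in range(m)) + lam * Y[i][l]) / (lam + 1.0)
                 for l in range(n_labels)] for i in range(m)]
        F = top_k(half, K)                      # F^{t+1} = Top_K(F^{t+1/2})
    return F

# Toy example: documents 0 and 1 are similar; document 1 has no observed labels.
W = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]
Y = [[1, 0], [0, 0], [0, 1]]
F = iterative_hard_thresholding(W, Y, lam=1.0, K=3)
```

In this toy run, document 1 picks up a positive belief for the label of its similar neighbour, document 0, while the total number of non-zero beliefs stays within K.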

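60 Iterative Hard Thresholding. If $Y_{ij} \in \{0, 1\}$ and $W$ is positive definite then: the sequence $F^0, F^1, \ldots$ converges to a stationary point $F^{\infty}$, with $J(F^0) \ge J(F^1) \ge \ldots \ge J(F^{\infty})$. If $\|F^{\infty}\|_0 < K$ then $F^{\infty}$ is a globally optimal solution. If $\|F^{\infty}\|_0 = K$ then $F^{\infty}$ is a locally optimal solution. A bound of the form $J(F^{\infty}) \le J(F^*) + \mathrm{Min}(\cdot, \cdot)$ is also stated; its two arguments involve $\lambda$, $K$, $\|Y\|_0$, $ML - K$ and $\alpha_K(F^{\infty})$ but cannot be recovered from this transcription.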

61 Semi-Supervised Learning Results as judged by automatically generated click labels as well as by human experts. Data Set MLRF Click Labels (%) Human Verification (%) MLRF+ SSL KEX MLRF MLRF+ SSL Wikipedia KEX Ads Bing Ads

62 Query Expansion Results Query expansion techniques can help both KEX and MLRF Data Set Click Labels (%) Human Verification (%) MLRF+ SSL+KSP KEX+KSP MLRF+ SSL+KSP KEX+KSP Wikipedia Ads Web Ads

63 Query Recommendation Results Edit distance [Ravi et al. WSDM 2010] Data Set Click Labels (%) KEX KEX+KSP MLRF MLRF+SSL MLRF+SSL+ KSP Wikipedia Ads Web Ads

64 Conclusions. Query recommendation can be posed as multi-label learning. Learning with millions of labels can be tractable and accurate. Other applications: query expansion; document and ad relevance and ranking; fine-grained query intent classification.

65 Acknowledgements: Deepak Bapna, Prateek Jain, A. Kumaran, Mehul Parsana, Krishna Leela Poola, Adarsh Prasad, Varun Singla

66 Advantages of an ML Approach Can generalize to other domains such as images on Flickr or videos on YouTube.

67 System Architecture. We leverage the Map/Reduce framework. Trees are grown in parallel, breadth-wise. Our objective is to balance the compute load across machines while minimizing data flow. Number of compute nodes: Evaluators 500, Combiners 100, Maximizers 25. [Diagram: Evaluators 1 to 4 each hold one block of N training points ($X_1, Y_1$ to $X_N, Y_N$; $X_{N+1}, Y_{N+1}$ to $X_{2N}, Y_{2N}$; $X_{2N+1}, Y_{2N+1}$ to $X_{3N}, Y_{3N}$; $X_{3N+1}, Y_{3N+1}$ to $X_{4N}, Y_{4N}$) and feed Combiners 1 to 3, which feed Maximizers 1 and 2, which output F*, T*.]

68 Evaluators. Input: N training instances; set of keys (Tree ID, Node ID, Feature ID and threshold). Output: partial label distributions for the keys. Computation: N * # of keys. [Diagram as on slide 67.]

69 Combiners. Input: partial label distributions for assigned keys. Output: objective function values for the keys. Computation: # of keys * avg # of Evaluators per key * # of labels in the distribution for the key. [Diagram as on slide 67.]

70 Maximizers. Input: objective function values for assigned keys. Output: optimal feature and threshold for assigned nodes in trees. Computation: # of keys * avg # of features per key * avg # of thresholds per feature. [Diagram as on slide 67.]
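
The three stages described on slides 68 to 70 can be mimicked in a single process to show how data flows between them. This is a rough sketch under assumptions, not the production system: the function names, the record layout, the label-count-based objective, and the toy shards are hypothetical, and a lower objective is treated as better, in line with the argmin on slide 32.

```python
from collections import Counter

def evaluator(shard, keys):
    """Partial left/right label counts per (tree, node, feature, threshold) key on one shard."""
    partial = {}
    for tree, node, f, t in keys:
        left, right = Counter(), Counter()
        for x, labels, node_of_tree in shard:
            if node_of_tree.get(tree) != node:
                continue  # this point currently sits in a different node of the tree
            (right if x.get(f, 0.0) > t else left).update(labels)
        partial[(tree, node, f, t)] = (left, right)
    return partial

def impurity_mass(counts):
    """n * sum_k p_k (1 - p_k), with n the total label count (a simplification)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return n * sum((c / n) * (1.0 - c / n) for c in counts.values())

def combiner(partials_for_key):
    """Merge the partial counts from all evaluators and return the split objective."""
    left, right = Counter(), Counter()
    for l, r in partials_for_key:
        left.update(l)
        right.update(r)
    return impurity_mass(left) + impurity_mass(right)

def maximizer(objectives):
    """Choose the best (feature, threshold) for every (tree, node)."""
    best = {}
    for (tree, node, f, t), value in objectives.items():
        if (tree, node) not in best or value < best[(tree, node)][0]:
            best[(tree, node)] = (value, f, t)
    return {tn: (f, t) for tn, (_, f, t) in best.items()}

# Two toy shards; each record is (features, labels, {tree_id: current_node_id}).
shards = [
    [({"insurance": 1.0}, ["car insurance"], {0: 0}),
     ({"iphone": 1.0}, ["iphone case"], {0: 0})],
    [({"insurance": 1.0, "geico": 1.0}, ["car insurance", "geico"], {0: 0}),
     ({"iphone": 1.0}, ["iphone case"], {0: 0})],
]
keys = [(0, 0, "insurance", 0.5), (0, 0, "geico", 0.5)]
partials = [evaluator(shard, keys) for shard in shards]                     # map over shards
objectives = {k: combiner([p[k] for p in partials]) for k in keys}          # reduce per key
best = maximizer(objectives)                                                # reduce per node
```

Only label counts and scalar objective values cross stage boundaries in this sketch; the feature vectors stay with the evaluator that owns the shard, which matches the stated goal of minimizing data flow.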