Introduction to Machine Learning Lecture 16. Mehryar Mohri Courant Institute and Google Research
2 Ranking 2
3 Motivation Very large data sets: too large to display or process; limited resources, need priorities; ranking more desirable than classification. Applications: search engines, information extraction; decision making, auctions, fraud detection. Can we learn to predict ranking accurately? 3
4 Score-Based Setting Single stage: learning algorithm receives a labeled sample of pairwise preferences; returns a scoring function h: U → R. Drawbacks: h induces a linear ordering of the full set U; does not match a query-based scenario. Advantages: efficient algorithms; good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07). 4
5 Preference-Based Setting Definitions: U: universe, full set of objects. V: finite query subset to rank, V ⊆ U. τ*: target ranking for V (random variable). Two stages (can be viewed as a reduction): learn a preference function h: U × U → [0, 1]; given V, use h to determine a ranking σ of V. Running time: measured in terms of calls to h. 5
6 Related Problem Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these rankings; closeness measured in number of pairwise misrankings. The problem is NP-hard even for k = 4 (Dwork et al., 2001). 6
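The pairwise-misranking closeness measure can be made concrete with a brute-force Kemeny-style aggregation sketch (exponential in the number of candidates, consistent with the NP-hardness above); the candidate set and voter profiles below are made up for illustration:

```python
from itertools import combinations, permutations

def pairwise_misrankings(order, voter):
    """Count pairs (a, b) ordered one way by `order` and the other by `voter`."""
    pos_o = {c: i for i, c in enumerate(order)}
    pos_v = {c: i for i, c in enumerate(voter)}
    return sum(1 for a, b in combinations(order, 2)
               if (pos_o[a] - pos_o[b]) * (pos_v[a] - pos_v[b]) < 0)

def best_aggregate(candidates, voters):
    """Brute-force optimal ordering: minimize total pairwise misrankings
    over all k voters (exponential search; the general problem is NP-hard)."""
    return min(permutations(candidates),
               key=lambda order: sum(pairwise_misrankings(order, v) for v in voters))

voters = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
print(best_aggregate("abc", voters))  # ('a', 'b', 'c')
```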
7 This Talk Score-based ranking. Preference-based ranking. 7
8 Score-Based Ranking Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D: S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) ∈ (U × U × {-1, 0, +1})^m, with y_i = +1 if x_i' >_pref x_i; y_i = 0 if x_i' =_pref x_i or no information; y_i = -1 if x_i' <_pref x_i. Problem: find hypothesis h: U → R in H with small generalization error R_D(h) = Pr_{(x,x')~D}[ f(x, x') (h(x') - h(x)) < 0 ]. 8
9 Notes Empirical error: R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i (h(x_i') - h(x_i)) < 0}. The relation x R x' ⇔ f(x, x') = 1 may be non-transitive (it need not even be antisymmetric). Problem different from classification. 9
10 Distributional Assumptions Distribution over points: labels for pairs of points (literature); squared number of examples O(m^2). Distribution over pairs: m pairs; label for each pair received; independence assumption; same (linear) number of examples. 10
11 Boosting for Ranking Use a weak ranking algorithm to create a stronger ranking algorithm. Ensemble method: combine base rankers returned by the weak ranking algorithm. Finding simple, relatively accurate base rankers is often not hard. How should base rankers be combined? 11
12 CD RankBoost (Freund et al., 2003; Rudin et al., 2005) H ⊆ {0, 1}^X. For s ∈ {-1, 0, +1}, let ε_t^s(h) = Pr_{(x,x')~D_t}[ sgn(f(x, x')(h(x') - h(x))) = s ], so that ε_t^0 + ε_t^+ + ε_t^- = 1.

RankBoost(S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)))
1  for i ← 1 to m do
2    D_1(x_i, x_i') ← 1/m
3  for t ← 1 to T do
4    h_t ← base ranker in H with smallest ε_t^- - ε_t^+, i.e., largest ε_t^+ - ε_t^- = E_{i~D_t}[ y_i (h_t(x_i') - h_t(x_i)) ]
5    α_t ← (1/2) log(ε_t^+ / ε_t^-)
6    Z_t ← ε_t^0 + 2 [ε_t^+ ε_t^-]^(1/2)   (normalization factor)
7    for i ← 1 to m do
8      D_{t+1}(x_i, x_i') ← D_t(x_i, x_i') exp(-α_t y_i (h_t(x_i') - h_t(x_i))) / Z_t
9  φ_T ← Σ_{t=1}^T α_t h_t
10 return h = sgn(φ_T)
12
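A minimal Python sketch of the pseudocode above, under simplifying assumptions not in the slide: inputs are scalars, the base-ranker class H is a made-up family of thresholds h(x) = 1[x ≥ θ], and the infinite weight of a perfectly accurate base ranker is capped at 1.0:

```python
import math

def rankboost(pairs, thresholds, T=10):
    """pairs: list of (x, x2, y) with y in {-1, 0, +1} (y=+1 means x2 preferred).
    Base rankers h(x) = 1[x >= theta]. Returns [(alpha_t, theta_t), ...]
    defining phi(x) = sum_t alpha_t * 1[x >= theta_t]."""
    m = len(pairs)
    D = [1.0 / m] * m                        # D_1 uniform over pairs
    model = []
    for _ in range(T):
        # pick the base ranker with largest |eps_plus - eps_minus|,
        # where eps_plus - eps_minus = E_D[ y (h(x2) - h(x)) ]
        def edge(th):
            return sum(d * y * ((x2 >= th) - (x >= th))
                       for d, (x, x2, y) in zip(D, pairs))
        theta = max(thresholds, key=lambda th: abs(edge(th)))
        eps_p = sum(d for d, (x, x2, y) in zip(D, pairs)
                    if y * ((x2 >= theta) - (x >= theta)) > 0)
        eps_m = sum(d for d, (x, x2, y) in zip(D, pairs)
                    if y * ((x2 >= theta) - (x >= theta)) < 0)
        if eps_m == 0 or eps_p == 0:         # perfect separation: cap the
            alpha = 1.0 if eps_p > 0 else 0.0  # (infinite) weight for this sketch
            model.append((alpha, theta))
            break
        alpha = 0.5 * math.log(eps_p / eps_m)
        # reweight: increase the weight of misranked pairs, then normalize
        D = [d * math.exp(-alpha * y * ((x2 >= theta) - (x >= theta)))
             for d, (x, x2, y) in zip(D, pairs)]
        Z = sum(D)
        D = [d / Z for d in D]
        model.append((alpha, theta))
    return model

def phi(model, x):
    return sum(a * (x >= th) for a, th in model)

pairs = [(1.0, 3.0, +1), (2.0, 5.0, +1), (4.0, 0.5, -1)]
model = rankboost(pairs, thresholds=[1.5, 2.5, 4.5])
# check that every training pair is ranked consistently with its label
print(all(y * (phi(model, x2) - phi(model, x)) > 0 for x, x2, y in pairs))  # True
```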
13 Notes Distributions D_t over pairs of sample points: originally uniform; at each round, the weight of a misranked pair is increased. Observation: since D_{t+1}(x, x') = D_t(x, x') e^{-y α_t [h_t(x') - h_t(x)]} / Z_t, we have D_{t+1}(x, x') = e^{-y [φ_t(x') - φ_t(x)]} / (m ∏_{s=1}^t Z_s) = e^{-y Σ_{s=1}^t α_s [h_s(x') - h_s(x)]} / (m ∏_{s=1}^t Z_s). The weight α_t assigned to base classifier h_t directly depends on the accuracy of h_t at round t. 13
14 Coordinate Descent RankBoost Objective function (convex and differentiable): F(α) = Σ_{(x,x',y)∈S} e^{-y [φ_T(x') - φ_T(x)]} = Σ_{(x,x',y)∈S} exp(-y Σ_{t=1}^T α_t [h_t(x') - h_t(x)]). (Figure: e^{-x} upper-bounds the 0/1 pairwise loss.) 14
15 Direction: unit vector e_t with e_t = argmin_t dF(α + η e_t)/dη |_{η=0}. Since F(α + η e_t) = Σ_{(x,x',y)∈S} e^{-y Σ_{s=1}^T α_s [h_s(x') - h_s(x)]} e^{-y η [h_t(x') - h_t(x)]},

dF(α + η e_t)/dη |_{η=0}
= - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] exp(-y Σ_{s=1}^T α_s [h_s(x') - h_s(x)])
= - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] D_{T+1}(x, x') m ∏_{s=1}^T Z_s
= - [ε_t^+ - ε_t^-] m ∏_{s=1}^T Z_s.

Thus, the direction corresponds to the base classifier selected by the algorithm. 15
16 Step size: obtained via dF(α + η e_t)/dη = 0:
- Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] exp(-y Σ_{s=1}^T α_s [h_s(x') - h_s(x)]) e^{-y [h_t(x') - h_t(x)] η} = 0
⇔ - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] D_{T+1}(x, x') (m ∏_{s=1}^T Z_s) e^{-y [h_t(x') - h_t(x)] η} = 0
⇔ - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] D_{T+1}(x, x') e^{-y [h_t(x') - h_t(x)] η} = 0
⇔ - [ε_t^+ e^{-η} - ε_t^- e^{η}] = 0
⇔ η = (1/2) log(ε_t^+ / ε_t^-).

Thus, the step size matches the base classifier weight used in the algorithm. 16
17 L1 Margin Definitions Definition: the margin of a pair (x, x') with label y ≠ 0 is ρ(x, x') = y (φ(x') - φ(x)) / ||α||_1 = y Σ_{t=1}^T α_t [h_t(x') - h_t(x)] / ||α||_1. Definition: the margin of the hypothesis h for a sample S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) is the minimum margin over pairs in S with non-zero labels: ρ = min_{(x,x',y)∈S, y≠0} y (φ(x') - φ(x)) / ||α||_1. 17
18 Ranking Margin Bound (Cortes and MM, 2011) Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 - δ over the choice of a sample of size m, the following holds for all h ∈ H: R(h) ≤ R̂_ρ(h) + (2/ρ) (R_m^{D_1}(H) + R_m^{D_2}(H)) + sqrt(log(1/δ) / (2m)). 18
19 RankBoost Margin But RankBoost does not maximize the margin. Smooth-margin RankBoost (Rudin et al., 2005): G(α) = -log F(α) / ||α||_1. Empirical performance not reported. 19
20 Ranking with SVMs Optimization problem: application of SVMs (see for example (Joachims, 2002)): min_{w,ξ} (1/2)||w||^2 + C Σ_{i=1}^m ξ_i subject to: y_i w·(Φ(x_i') - Φ(x_i)) ≥ 1 - ξ_i, ξ_i ≥ 0, i ∈ [1, m]. Decision function: h: x ↦ w·Φ(x) + b. 20
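A toy sketch of the pairwise reduction behind ranking SVMs: each labeled pair contributes a difference vector Φ(x') - Φ(x), and the hinge-loss objective above is minimized by plain subgradient descent. The identity feature map, data, and hyperparameters are made up for illustration; this is not the author's implementation:

```python
def rank_svm(pairs, dim, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i hinge(1 - y_i w.(Phi(x2)-Phi(x))).
    pairs: list of (x, x2, y) with x, x2 feature vectors and y in {-1, +1}."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)                            # gradient of 0.5*||w||^2
        for x, x2, y in pairs:
            diff = [a - b for a, b in zip(x2, x)]  # Psi(x, x') = Phi(x') - Phi(x)
            if y * sum(wi * di for wi, di in zip(w, diff)) < 1:
                grad = [g - C * y * d for g, d in zip(grad, diff)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def score(w, x):
    """Decision function h(x) = w.Phi(x) (bias omitted in this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))

# made-up data: the first coordinate drives the preference
pairs = [([0.0, 1.0], [1.0, 0.0], +1),
         ([0.2, 0.5], [0.9, 0.5], +1),
         ([0.8, 0.1], [0.1, 0.9], -1)]
w = rank_svm(pairs, dim=2)
print(all(y * (score(w, x2) - score(w, x)) > 0 for x, x2, y in pairs))  # True
```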
21 Notes The algorithm coincides with SVMs using the feature mapping (x, x') ↦ Ψ(x, x') = Φ(x') - Φ(x). Can be used with kernels. Algorithm directly based on the margin bound. 21
22 Bipartite Ranking Training data: sample of negative points drawn according to D-: S- = (x_1, ..., x_m) ⊆ U; sample of positive points drawn according to D+: S+ = (x_1', ..., x_{m'}') ⊆ U. Problem: find hypothesis h: U → R in H with small generalization error R_D(h) = Pr_{x~D-, x'~D+}[ h(x') < h(x) ]. 22
23 Notes More efficient algorithm in this special case (Freund et al., 2003). Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05). if constant base ranker used. relationship between objective functions. Bipartite ranking results typically reported in terms of AUC. 23
24 ROC Curve (Egan, 1975) Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP). TP: % of positive points correctly labeled positive. FP: % of negative points incorrectly labeled positive. (Figure: ROC curve traced by sweeping a threshold θ over the sorted scores; x-axis: false positive rate, y-axis: true positive rate.) 24
25 Area under the ROC Curve (AUC) (Hanley and McNeil, 1982) Definition: the AUC is the area under the ROC curve; a measure of ranking quality. Equivalently, AUC(h) = (1/(m m')) Σ_{i=1}^{m'} Σ_{j=1}^m 1_{h(x_i') > h(x_j)} = Pr_{x~D̂-, x'~D̂+}[ h(x') > h(x) ] = 1 - R̂(h). 25
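The pairwise formula above can be computed directly, without constructing the ROC curve; a small sketch over made-up scores (ties counted as 1/2, a common convention not stated on the slide):

```python
def auc(neg_scores, pos_scores):
    """AUC(h) = (1/(m*m')) * sum_i sum_j 1[h(x_i') > h(x_j)],
    over all (negative, positive) score pairs; ties count 1/2."""
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

neg = [0.1, 0.4, 0.35]   # made-up scores of negative points
pos = [0.8, 0.65, 0.3]   # made-up scores of positive points
# 7 of the 9 (negative, positive) pairs are correctly ordered
print(round(auc(neg, pos), 3))  # 0.778
```

Equivalently, 1 - auc(neg, pos) is the empirical pairwise misranking error R̂(h) restricted to mixed pairs.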
26 AdaBoost and CD RankBoost Objective functions: comparison.
F_Ada(α) = Σ_{x_i ∈ S- ∪ S+} exp(-y_i f(x_i)) = Σ_{x_i ∈ S-} exp(+f(x_i)) + Σ_{x_i ∈ S+} exp(-f(x_i)) = F-(α) + F+(α).
F_Rank(α) = Σ_{(i,j) ∈ S- × S+} exp(-[f(x_j) - f(x_i)]) = Σ_{x_i ∈ S-} exp(+f(x_i)) · Σ_{x_j ∈ S+} exp(-f(x_j)) = F-(α) F+(α). 26
27 AdaBoost and CD RankBoost Property: AdaBoost (non-separable case) with the constant base learner h = 1 gives equal contribution of positive and negative points (in the limit). Consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective (Rudin et al., 2005). Observation: if F+(α) = F-(α), then d(F_Rank) = F+ d(F-) + F- d(F+) = F+ [d(F-) + d(F+)] = F+ d(F_Ada). 27
28 Bipartite RankBoost - Efficiency Decomposition of distribution: for (x, x') ∈ (S-, S+), D(x, x') = D-(x) D+(x'). Thus,
D_{t+1}(x, x') = D_t(x, x') e^{-α_t [h_t(x') - h_t(x)]} / Z_t = [D_{t,-}(x) e^{α_t h_t(x)} / Z_{t,-}] · [D_{t,+}(x') e^{-α_t h_t(x')} / Z_{t,+}],
with Z_{t,-} = Σ_{x ∈ S-} D_{t,-}(x) e^{α_t h_t(x)} and Z_{t,+} = Σ_{x' ∈ S+} D_{t,+}(x') e^{-α_t h_t(x')}. 28
29 Ranking vs. Classification Bipartite case: can we learn to rank by training a classifier on the positive and negative sets? Different objective functions: AUC vs. 0/1 loss. Preliminary analysis (Cortes and MM, 2004): different results for imbalanced data sets, on average over all classifications. Example, stochastic case: A | BC +; B | AC -; C | AB. 29
30 This Talk Score-based ranking. Preference-based ranking. 30
31 Preference-Based Setting Definitions: U: universe, full set of objects. V: finite query subset to rank, V ⊆ U. τ*: target ranking for V (random variable). Two stages (can be viewed as a reduction): learn a preference function h: U × U → [0, 1]; given V, use h to determine a ranking σ of V. Running time: measured in terms of calls to h. 31
32 Preference-Based Ranking Problem Training data: pairs (V, τ*) sampled i.i.d. according to D: (V_1, τ_1*), (V_2, τ_2*), ..., (V_m, τ_m*), V_i ⊆ U; subsets ranked by different labelers. Learn a classifier / preference function h: U × U → [0, 1]. Problem: for any query set V ⊆ U, use h to return a ranking σ_{h,V} close to the target τ*, with small average error R(h, σ) = E_{(V,τ*)~D}[ L(σ_{h,V}, τ*) ]. 32
33 Preference Function h(u, v) close to 1 when u preferred to v, close to 0 otherwise. For the analysis, h(u, v) ∈ {0, 1}. Assumed pairwise consistent: h(u, v) + h(v, u) = 1. May be non-transitive, e.g., h(u, v) = h(v, w) = h(w, u) = 1. Output of a classifier or a black box. 33
34 Loss Functions (for fixed (V, τ*)) Preference loss: L(h, τ*) = 2/(n(n-1)) Σ_{u≠v} h(u, v) τ*(v, u). Ranking loss: L(σ, τ*) = 2/(n(n-1)) Σ_{u≠v} σ(u, v) τ*(v, u). 34
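For total orders (every pair labeled), the ranking loss above is just the fraction of discordant pairs between σ and τ*; a small sketch, with rankings represented as sequences from most to least preferred:

```python
from itertools import combinations

def ranking_loss(sigma, tau):
    """L(sigma, tau) = 2/(n(n-1)) * sum_{u != v} sigma(u, v) tau(v, u):
    the fraction of pairs ordered one way by sigma and the other by tau.
    sigma, tau: sequences listing the items of V from best to worst."""
    n = len(sigma)
    pos_s = {u: i for i, u in enumerate(sigma)}
    pos_t = {u: i for i, u in enumerate(tau)}
    disagreements = sum(1 for u, v in combinations(sigma, 2)
                        if (pos_s[u] - pos_s[v]) * (pos_t[u] - pos_t[v]) < 0)
    return 2.0 * disagreements / (n * (n - 1))

# one swapped pair (c, d) out of 6 pairs -> loss 1/6
print(round(ranking_loss("abcd", "abdc"), 4))  # 0.1667
```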
35 (Weak) Regret Preference regret: R'_class(h) = E_{V,τ*}[L(h|_V, τ*)] - E_V min_{h̃} E_{τ*|V}[L(h̃, τ*)]. Ranking regret: R'_rank(A) = E_{V,τ*,s}[L(A_s(V), τ*)] - E_V min_{σ∈S(V)} E_{τ*|V}[L(σ, τ*)]. 35
36 Deterministic Algorithm (Balcan et al., 07) Stage one: standard classification; learn a preference function h: U × U → [0, 1]. Stage two: sort-by-degree using the comparison function h: sort by number of points ranked below; quadratic time complexity O(n^2). 36
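Stage two of the deterministic reduction fits in a few lines; the numeric items and the preference function below are made up for illustration:

```python
def sort_by_degree(V, h):
    """Rank the elements of V by degree(u) = number of points ranked below u,
    i.e. sum_v h(u, v); requires O(n^2) calls to h."""
    return sorted(V, key=lambda u: sum(h(u, v) for v in V if v != u),
                  reverse=True)

# hypothetical preference function: prefer larger numbers
h = lambda u, v: 1 if u > v else 0
print(sort_by_degree([3, 1, 4, 1.5], h))  # [4, 3, 1.5, 1]
```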
37 Randomized Algorithm (Ailon & MM, 08) Stage one: standard classification; learn a preference function h: U × U → [0, 1]. Stage two: randomized QuickSort (Hoare, 61) using h as the comparison function. The comparison function is non-transitive, unlike in the textbook description; but the time complexity is shown to be O(n log n) in general. 37
38 Randomized QS (Figure: randomized QuickSort step with random pivot u; elements v with h(v, u) = 1 go to the left recursion, elements with h(u, v) = 1 to the right recursion.) 38
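The QuickSort step in the figure can be sketched as follows; because h is assumed pairwise consistent (h(u, v) + h(v, u) = 1), every non-pivot element lands in exactly one side even when h is non-transitive. The example preference function is a made-up transitive one so the output is deterministic:

```python
import random

def quicksort_rank(V, h, rng=random):
    """Randomized QuickSort with a (possibly non-transitive) preference
    function h: u is placed before v when h(u, v) = 1.
    Expected O(n log n) calls to h."""
    if len(V) <= 1:
        return list(V)
    pivot = rng.choice(V)
    left = [v for v in V if v != pivot and h(v, pivot) == 1]   # preferred to pivot
    right = [v for v in V if v != pivot and h(pivot, v) == 1]  # pivot preferred
    return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)

# hypothetical transitive preference: smaller numbers ranked first
h = lambda u, v: 1 if u < v else 0
print(quicksort_rank([3, 1, 4, 2], h))  # [1, 2, 3, 4]
```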
39 Deterministic Algo. - Bipartite Case (V = V+ ∪ V-) (Balcan et al., 07) Bounds for the deterministic sort-by-degree algorithm: expected loss: E_{V,τ*}[L(A(V), τ*)] ≤ 2 E_{V,τ*}[L(h, τ*)]; regret: R'_rank(A(V)) ≤ 2 R'_class(h). Time complexity: Ω(|V|^2). 39
40 Randomized Algo. - Bipartite Case (V = V+ ∪ V-) (Ailon & MM, 08) Bounds for randomized QuickSort (Hoare, 61): expected loss (equality): E_{V,τ*,s}[L(Q_s^h(V), τ*)] = E_{V,τ*}[L(h, τ*)]; regret: R'_rank(Q_s^h(·)) ≤ R'_class(h). Time complexity: full set: O(n log n); top k: O(n + k log k). 40
41 Proof Ideas QuickSort decomposition: p_uv = Σ_{w∉{u,v}} p_uvw [h(u, w) h(w, v) + h(v, w) h(w, u)]. Bipartite property: τ*(u, v) + τ*(v, w) + τ*(w, u) = τ*(v, u) + τ*(w, v) + τ*(u, w). (Figure: triangle over u, v, w.) 41
42 Lower Bound Theorem: for any deterministic algorithm A, there is a bipartite distribution for which R'_rank(A) ≥ 2 R'_class(h); thus, the factor of 2 is tight, and randomization is necessary for a better bound. Proof: take the simple case U = V = {u, v, w} and assume that h induces a cycle. Up to symmetry, A returns u, v, w or w, v, u in the deterministic case. 42
43 Lower Bound If A returns u, v, w, then choose τ* with u, v negative and w positive. If A returns w, v, u, then choose τ* with v, w negative and u positive. In either case, L(h, τ*) = 1/3 while L(A, τ*) = 2/3. 43
44 Guarantees - General Case Loss bound for QuickSort: E_{V,τ*,s}[L(Q_s^h(V), τ*)] ≤ 2 E_{V,τ*}[L(h, τ*)]. Comparison with the optimal ranking (see (CSS 99)): E_s[L(Q_s^h(V), σ_optimal)] ≤ 2 L(h, σ_optimal) and E_s[L(h, Q_s^h(V))] ≤ 3 L(h, σ_optimal), where σ_optimal = argmin_σ L(h, σ). 44
45 Generalization: Weight Function τ*(u, v) = σ*(u, v) ω(σ*(u), σ*(v)). Properties needed for all previous results to hold: symmetry: ω(i, j) = ω(j, i) for all i, j; monotonicity: ω(i, j), ω(j, k) ≤ ω(i, k) for i < j < k; triangle inequality: ω(i, j) ≤ ω(i, k) + ω(k, j) for all triplets i, j, k. 45
46 Weight Function - Examples Kemeny: ω(i, j) = 1 for all i, j. Top-k: ω(i, j) = 1 if i ≤ k or j ≤ k, 0 otherwise. Bipartite: ω(i, j) = 1 if i ≤ k and j > k, 0 otherwise. k-partite: can be defined similarly. 46
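The three example weight functions translate directly into code; positions are taken 1-based, and for the bipartite case "i ≤ k and j > k" is applied to the pair in either order, since ω is symmetric:

```python
def omega_kemeny(i, j, k=None):
    """Kemeny: every pair of positions has weight 1."""
    return 1

def omega_top_k(i, j, k):
    """Top-k: weight 1 iff at least one of the two positions is in the top k."""
    return 1 if i <= k or j <= k else 0

def omega_bipartite(i, j, k):
    """Bipartite: weight 1 iff exactly one of the two positions is in the top k."""
    lo, hi = min(i, j), max(i, j)
    return 1 if lo <= k < hi else 0

print([omega_top_k(1, 5, 3), omega_top_k(4, 5, 3),
       omega_bipartite(2, 6, 3), omega_bipartite(1, 2, 3)])  # [1, 0, 1, 0]
```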
47 (Strong) Regret Definitions Ranking regret: R_rank(A) = E_{V,τ*,s}[L(A_s(V), τ*)] - min_σ E_{V,τ*}[L(σ̃_V, τ*)]. Preference regret: R_class(h) = E_{V,τ*}[L(h|_V, τ*)] - min_{h̃} E_{V,τ*}[L(h̃|_V, τ*)]. All previous regret results hold if for any V_1, V_2 ⊇ {u, v}, E_{τ*|V_1}[τ*(u, v)] = E_{τ*|V_2}[τ*(u, v)] for all u, v (pairwise independence on irrelevant alternatives). 47
48 References
Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6.
Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT 2005.
Ailon, N., and Mohri, M. (2008). An efficient reduction of ranking to classification. COLT 2008, Helsinki, Finland. Omnipress.
Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. COLT 2007. Springer.
Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. JAIR 10.
Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT 2006.
Mehryar Mohri - Courant Seminar, Courant Institute, NYU

49 References
Cortes, C., and Mohri, M. (2003). AUC optimization vs. error rate minimization. NIPS 2003. MIT Press.
Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. WEA 2007 (pp. 1-21). Rome, Italy: Springer.
Cortes, C., and Mohri, M. (2004). Confidence intervals for the area under the ROC curve. NIPS 2004. MIT Press.
Crammer, K., and Singer, Y. (2001). Pranking with ranking. NIPS 2001. MIT Press.
Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the Web. WWW 10. ACM Press.
Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. JMLR 4.

50 References
Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.
Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers.
Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project.
Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.
Rudin, C., Cortes, C., Mohri, M., and Schapire, R. E. (2005). Margin-based ranking meets boosting in the middle. COLT 2005 (pp. 63-78).
Joachims, T. (2002). Optimizing search engines using clickthrough data. KDD 2002.
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationEnsemble Approach for the Classification of Imbalanced Data
Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au
More informationLearning to Rank Revisited: Our Progresses in New Algorithms and Tasks
The 4 th China-Australia Database Workshop Melbourne, Australia Oct. 19, 2015 Learning to Rank Revisited: Our Progresses in New Algorithms and Tasks Jun Xu Institute of Computing Technology, Chinese Academy
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationOn Adaboost and Optimal Betting Strategies
On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of
More informationAUC Optimization vs. Error Rate Minimization
AUC Optiization vs. Error Rate Miniization Corinna Cortes and Mehryar Mohri AT&T Labs Research 180 Park Avenue, Florha Park, NJ 0793, USA {corinna, ohri}@research.att.co Abstract The area under an ROC
More informationLeast-Squares Intersection of Lines
Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationAn Unsupervised Learning Algorithm for Rank Aggregation
ECML 07 An Unsupervised Learning Algorithm for Rank Aggregation Alexandre Klementiev, Dan Roth, and Kevin Small Department of Computer Science University of Illinois at Urbana-Champaign 201 N. Goodwin
More informationChapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling
Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one
More informationTraining Methods for Adaptive Boosting of Neural Networks for Character Recognition
Submission to NIPS*97, Category: Algorithms & Architectures, Preferred: Oral Training Methods for Adaptive Boosting of Neural Networks for Character Recognition Holger Schwenk Dept. IRO Université de Montréal
More informationOptimizing Search Engines using Clickthrough Data
Optimizing Search Engines using Clickthrough Data Thorsten Joachims Cornell University Department of Computer Science Ithaca, NY 14853 USA tj@cs.cornell.edu ABSTRACT This paper presents an approach to
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationBusiness Analytics for Non-profit Marketing and Online Advertising. Wei Chang. Bachelor of Science, Tsinghua University, 2003
Business Analytics for Non-profit Marketing and Online Advertising by Wei Chang Bachelor of Science, Tsinghua University, 2003 Master of Science, University of Arizona, 2006 Submitted to the Graduate Faculty
More informationDirect Loss Minimization for Structured Prediction
Direct Loss Minimization for Structured Prediction David McAllester TTI-Chicago mcallester@ttic.edu Tamir Hazan TTI-Chicago tamir@ttic.edu Joseph Keshet TTI-Chicago jkeshet@ttic.edu Abstract In discriminative
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationE-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationSparse Online Learning via Truncated Gradient
Sparse Online Learning via Truncated Gradient John Langford Yahoo! Research jl@yahoo-inc.com Lihong Li Department of Computer Science Rutgers University lihong@cs.rutgers.edu Tong Zhang Department of Statistics
More informationCS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:
More informationHow To Rank With Ancient.Org And A Random Forest
JMLR: Workshop and Conference Proceedings 14 (2011) 77 89 Yahoo! Learning to Rank Challenge Web-Search Ranking with Initialized Gradient Boosted Regression Trees Ananth Mohan Zheng Chen Kilian Weinberger
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationTree based ensemble models regularization by convex optimization
Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000
More informationConvex analysis and profit/cost/support functions
CALIFORNIA INSTITUTE OF TECHNOLOGY Division of the Humanities and Social Sciences Convex analysis and profit/cost/support functions KC Border October 2004 Revised January 2009 Let A be a subset of R m
More informationTable 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationSupport Vector Machine. Tutorial. (and Statistical Learning Theory)
Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationlarge-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationEMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA
EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA Andreas Christmann Department of Mathematics homepages.vub.ac.be/ achristm Talk: ULB, Sciences Actuarielles, 17/NOV/2006 Contents 1. Project: Motor vehicle
More informationKnowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
More informationBeating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
More information