Introduction to Machine Learning Lecture 16. Mehryar Mohri Courant Institute and Google Research
2 Ranking 2
3 Motivation Very large data sets: too large to display or process; limited resources, need priorities; ranking more desirable than classification. Applications: search engines, information extraction; decision making, auctions, fraud detection. Can we learn to predict ranking accurately? 3
4 Score-Based Setting Single stage: learning algorithm receives a labeled sample of pairwise preferences; returns a scoring function h: U → R. Drawbacks: h induces a linear ordering of the full set U; does not match a query-based scenario. Advantages: efficient algorithms; good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07). 4
5 Preference-Based Setting Definitions: U: universe, full set of objects. V: finite query subset to rank, V ⊆ U. τ*: target ranking for V (random variable). Two stages (can be viewed as a reduction): learn a preference function h: U × U → [0, 1]; given V, use h to determine a ranking σ of V. Running time: measured in terms of calls to h. 5
6 Related Problem Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these rankings; closeness measured in number of pairwise misrankings. The problem is NP-hard even for k = 4 (Dwork et al., 2001). 6
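The pairwise-misranking closeness measure can be made concrete with a brute-force Kemeny-style aggregation sketch (exponential in the number of candidates, consistent with the NP-hardness above); the candidate set and voter profiles below are made up for illustration:

```python
from itertools import combinations, permutations

def pairwise_misrankings(order, voter):
    """Count pairs (a, b) ordered one way by `order` and the other by `voter`."""
    pos_o = {c: i for i, c in enumerate(order)}
    pos_v = {c: i for i, c in enumerate(voter)}
    return sum(1 for a, b in combinations(order, 2)
               if (pos_o[a] - pos_o[b]) * (pos_v[a] - pos_v[b]) < 0)

def best_aggregate(candidates, voters):
    """Brute-force optimal ordering: minimize total pairwise misrankings
    over all k voters (exponential search; the general problem is NP-hard)."""
    return min(permutations(candidates),
               key=lambda order: sum(pairwise_misrankings(order, v) for v in voters))

voters = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]
print(best_aggregate("abc", voters))  # ('a', 'b', 'c')
```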
7 This Talk Score-based ranking. Preference-based ranking. 7
8 Score-Based Ranking Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D: S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) ∈ (U × U × {-1, 0, +1})^m, with y_i = +1 if x_i' >_pref x_i; y_i = 0 if x_i' =_pref x_i or no information; y_i = -1 if x_i' <_pref x_i. Problem: find hypothesis h: U → R in H with small generalization error R_D(h) = Pr_{(x,x')~D}[ f(x, x') (h(x') - h(x)) < 0 ]. 8
9 Notes Empirical error: R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i (h(x_i') - h(x_i)) < 0}. The relation x R x' ⇔ f(x, x') = 1 may be non-transitive (it need not even be antisymmetric). Problem different from classification. 9
10 Distributional Assumptions Distribution over points: labels for pairs of points (literature); squared number of examples O(m^2). Distribution over pairs: m pairs; label for each pair received; independence assumption; same (linear) number of examples. 10
11 Boosting for Ranking Use a weak ranking algorithm to create a stronger ranking algorithm. Ensemble method: combine base rankers returned by the weak ranking algorithm. Finding simple, relatively accurate base rankers is often not hard. How should base rankers be combined? 11
12 CD RankBoost (Freund et al., 2003; Rudin et al., 2005) H ⊆ {0, 1}^X. For s ∈ {-1, 0, +1}, let ε_t^s(h) = Pr_{(x,x')~D_t}[ sgn(f(x, x')(h(x') - h(x))) = s ], so that ε_t^0 + ε_t^+ + ε_t^- = 1.

RankBoost(S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)))
1  for i ← 1 to m do
2    D_1(x_i, x_i') ← 1/m
3  for t ← 1 to T do
4    h_t ← base ranker in H with smallest ε_t^- - ε_t^+, i.e., largest ε_t^+ - ε_t^- = E_{i~D_t}[ y_i (h_t(x_i') - h_t(x_i)) ]
5    α_t ← (1/2) log(ε_t^+ / ε_t^-)
6    Z_t ← ε_t^0 + 2 [ε_t^+ ε_t^-]^(1/2)   (normalization factor)
7    for i ← 1 to m do
8      D_{t+1}(x_i, x_i') ← D_t(x_i, x_i') exp(-α_t y_i (h_t(x_i') - h_t(x_i))) / Z_t
9  φ_T ← Σ_{t=1}^T α_t h_t
10 return h = sgn(φ_T)
12
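A minimal Python sketch of the pseudocode above, under simplifying assumptions not in the slide: inputs are scalars, the base-ranker class H is a made-up family of thresholds h(x) = 1[x ≥ θ], and the infinite weight of a perfectly accurate base ranker is capped at 1.0:

```python
import math

def rankboost(pairs, thresholds, T=10):
    """pairs: list of (x, x2, y) with y in {-1, 0, +1} (y=+1 means x2 preferred).
    Base rankers h(x) = 1[x >= theta]. Returns [(alpha_t, theta_t), ...]
    defining phi(x) = sum_t alpha_t * 1[x >= theta_t]."""
    m = len(pairs)
    D = [1.0 / m] * m                        # D_1 uniform over pairs
    model = []
    for _ in range(T):
        # pick the base ranker with largest |eps_plus - eps_minus|,
        # where eps_plus - eps_minus = E_D[ y (h(x2) - h(x)) ]
        def edge(th):
            return sum(d * y * ((x2 >= th) - (x >= th))
                       for d, (x, x2, y) in zip(D, pairs))
        theta = max(thresholds, key=lambda th: abs(edge(th)))
        eps_p = sum(d for d, (x, x2, y) in zip(D, pairs)
                    if y * ((x2 >= theta) - (x >= theta)) > 0)
        eps_m = sum(d for d, (x, x2, y) in zip(D, pairs)
                    if y * ((x2 >= theta) - (x >= theta)) < 0)
        if eps_m == 0 or eps_p == 0:         # perfect separation: cap the
            alpha = 1.0 if eps_p > 0 else 0.0  # (infinite) weight for this sketch
            model.append((alpha, theta))
            break
        alpha = 0.5 * math.log(eps_p / eps_m)
        # reweight: increase the weight of misranked pairs, then normalize
        D = [d * math.exp(-alpha * y * ((x2 >= theta) - (x >= theta)))
             for d, (x, x2, y) in zip(D, pairs)]
        Z = sum(D)
        D = [d / Z for d in D]
        model.append((alpha, theta))
    return model

def phi(model, x):
    return sum(a * (x >= th) for a, th in model)

pairs = [(1.0, 3.0, +1), (2.0, 5.0, +1), (4.0, 0.5, -1)]
model = rankboost(pairs, thresholds=[1.5, 2.5, 4.5])
# check that every training pair is ranked consistently with its label
print(all(y * (phi(model, x2) - phi(model, x)) > 0 for x, x2, y in pairs))  # True
```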
13 Notes Distributions D_t over pairs of sample points: originally uniform; at each round, the weight of a misranked pair is increased. Observation: since D_{t+1}(x, x') = D_t(x, x') e^{-y α_t [h_t(x') - h_t(x)]} / Z_t, we have D_{t+1}(x, x') = e^{-y [φ_t(x') - φ_t(x)]} / (m ∏_{s=1}^t Z_s) = e^{-y Σ_{s=1}^t α_s [h_s(x') - h_s(x)]} / (m ∏_{s=1}^t Z_s). The weight α_t assigned to base classifier h_t directly depends on the accuracy of h_t at round t. 13
14 Coordinate Descent RankBoost Objective function (convex and differentiable): F(α) = Σ_{(x,x',y)∈S} e^{-y [φ_T(x') - φ_T(x)]} = Σ_{(x,x',y)∈S} exp(-y Σ_{t=1}^T α_t [h_t(x') - h_t(x)]). (Figure: e^{-x} upper-bounds the 0/1 pairwise loss.) 14
15 Direction: unit vector e_t with e_t = argmin_t dF(α + η e_t)/dη |_{η=0}. Since F(α + η e_t) = Σ_{(x,x',y)∈S} e^{-y Σ_{s=1}^T α_s [h_s(x') - h_s(x)]} e^{-y η [h_t(x') - h_t(x)]},

dF(α + η e_t)/dη |_{η=0}
= - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] exp(-y Σ_{s=1}^T α_s [h_s(x') - h_s(x)])
= - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] D_{T+1}(x, x') m ∏_{s=1}^T Z_s
= - [ε_t^+ - ε_t^-] m ∏_{s=1}^T Z_s.

Thus, the direction corresponds to the base classifier selected by the algorithm. 15
16 Step size: obtained via dF(α + η e_t)/dη = 0:
- Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] exp(-y Σ_{s=1}^T α_s [h_s(x') - h_s(x)]) e^{-y [h_t(x') - h_t(x)] η} = 0
⇔ - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] D_{T+1}(x, x') (m ∏_{s=1}^T Z_s) e^{-y [h_t(x') - h_t(x)] η} = 0
⇔ - Σ_{(x,x',y)∈S} y [h_t(x') - h_t(x)] D_{T+1}(x, x') e^{-y [h_t(x') - h_t(x)] η} = 0
⇔ - [ε_t^+ e^{-η} - ε_t^- e^{η}] = 0
⇔ η = (1/2) log(ε_t^+ / ε_t^-).

Thus, the step size matches the base classifier weight used in the algorithm. 16
17 L1 Margin Definitions Definition: the margin of a pair (x, x') with label y ≠ 0 is ρ(x, x') = y (φ(x') - φ(x)) / ||α||_1 = y Σ_{t=1}^T α_t [h_t(x') - h_t(x)] / ||α||_1. Definition: the margin of the hypothesis h for a sample S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) is the minimum margin over pairs in S with non-zero labels: ρ = min_{(x,x',y)∈S, y≠0} y (φ(x') - φ(x)) / ||α||_1. 17
18 Ranking Margin Bound (Cortes and MM, 2011) Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 - δ over the choice of a sample of size m, the following holds for all h ∈ H: R(h) ≤ R̂_ρ(h) + (2/ρ) (R_m^{D_1}(H) + R_m^{D_2}(H)) + sqrt(log(1/δ) / (2m)). 18
19 RankBoost Margin But RankBoost does not maximize the margin. Smooth-margin RankBoost (Rudin et al., 2005): G(α) = -log F(α) / ||α||_1. Empirical performance not reported. 19
20 Ranking with SVMs Optimization problem: application of SVMs (see for example (Joachims, 2002)): min_{w,ξ} (1/2)||w||^2 + C Σ_{i=1}^m ξ_i subject to: y_i w·(Φ(x_i') - Φ(x_i)) ≥ 1 - ξ_i, ξ_i ≥ 0, i ∈ [1, m]. Decision function: h: x ↦ w·Φ(x) + b. 20
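A toy sketch of the pairwise reduction behind ranking SVMs: each labeled pair contributes a difference vector Φ(x') - Φ(x), and the hinge-loss objective above is minimized by plain subgradient descent. The identity feature map, data, and hyperparameters are made up for illustration; this is not the author's implementation:

```python
def rank_svm(pairs, dim, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i hinge(1 - y_i w.(Phi(x2)-Phi(x))).
    pairs: list of (x, x2, y) with x, x2 feature vectors and y in {-1, +1}."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)                            # gradient of 0.5*||w||^2
        for x, x2, y in pairs:
            diff = [a - b for a, b in zip(x2, x)]  # Psi(x, x') = Phi(x') - Phi(x)
            if y * sum(wi * di for wi, di in zip(w, diff)) < 1:
                grad = [g - C * y * d for g, d in zip(grad, diff)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def score(w, x):
    """Decision function h(x) = w.Phi(x) (bias omitted in this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))

# made-up data: the first coordinate drives the preference
pairs = [([0.0, 1.0], [1.0, 0.0], +1),
         ([0.2, 0.5], [0.9, 0.5], +1),
         ([0.8, 0.1], [0.1, 0.9], -1)]
w = rank_svm(pairs, dim=2)
print(all(y * (score(w, x2) - score(w, x)) > 0 for x, x2, y in pairs))  # True
```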
21 Notes The algorithm coincides with SVMs using the feature mapping (x, x') ↦ Ψ(x, x') = Φ(x') - Φ(x). Can be used with kernels. Algorithm directly based on the margin bound. 21
22 Bipartite Ranking Training data: sample of negative points drawn according to D-: S- = (x_1, ..., x_m) ⊆ U; sample of positive points drawn according to D+: S+ = (x_1', ..., x_{m'}') ⊆ U. Problem: find hypothesis h: U → R in H with small generalization error R_D(h) = Pr_{x~D-, x'~D+}[ h(x') < h(x) ]. 22
23 Notes More efficient algorithm in this special case (Freund et al., 2003). Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05). if constant base ranker used. relationship between objective functions. Bipartite ranking results typically reported in terms of AUC. 23
24 ROC Curve (Egan, 1975) Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP). TP: % of positive points correctly labeled positive. FP: % of negative points incorrectly labeled positive. (Figure: ROC curve traced by sweeping a threshold θ over the sorted scores; x-axis: false positive rate, y-axis: true positive rate.) 24
25 Area under the ROC Curve (AUC) (Hanley and McNeil, 1982) Definition: the AUC is the area under the ROC curve; a measure of ranking quality. Equivalently, AUC(h) = (1/(m m')) Σ_{i=1}^{m'} Σ_{j=1}^m 1_{h(x_i') > h(x_j)} = Pr_{x~D̂-, x'~D̂+}[ h(x') > h(x) ] = 1 - R̂(h). 25
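The pairwise formula above can be computed directly, without constructing the ROC curve; a small sketch over made-up scores (ties counted as 1/2, a common convention not stated on the slide):

```python
def auc(neg_scores, pos_scores):
    """AUC(h) = (1/(m*m')) * sum_i sum_j 1[h(x_i') > h(x_j)],
    over all (negative, positive) score pairs; ties count 1/2."""
    total = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

neg = [0.1, 0.4, 0.35]   # made-up scores of negative points
pos = [0.8, 0.65, 0.3]   # made-up scores of positive points
# 7 of the 9 (negative, positive) pairs are correctly ordered
print(round(auc(neg, pos), 3))  # 0.778
```

Equivalently, 1 - auc(neg, pos) is the empirical pairwise misranking error R̂(h) restricted to mixed pairs.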
26 AdaBoost and CD RankBoost Objective functions: comparison.
F_Ada(α) = Σ_{x_i ∈ S- ∪ S+} exp(-y_i f(x_i)) = Σ_{x_i ∈ S-} exp(+f(x_i)) + Σ_{x_i ∈ S+} exp(-f(x_i)) = F-(α) + F+(α).
F_Rank(α) = Σ_{(i,j) ∈ S- × S+} exp(-[f(x_j) - f(x_i)]) = Σ_{x_i ∈ S-} exp(+f(x_i)) · Σ_{x_j ∈ S+} exp(-f(x_j)) = F-(α) F+(α). 26
27 AdaBoost and CD RankBoost Property: AdaBoost (non-separable case) with the constant base learner h = 1 gives equal contribution of positive and negative points (in the limit). Consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective (Rudin et al., 2005). Observation: if F+(α) = F-(α), then d(F_Rank) = F+ d(F-) + F- d(F+) = F+ [d(F-) + d(F+)] = F+ d(F_Ada). 27
28 Bipartite RankBoost - Efficiency Decomposition of distribution: for (x, x') ∈ (S-, S+), D(x, x') = D-(x) D+(x'). Thus,
D_{t+1}(x, x') = D_t(x, x') e^{-α_t [h_t(x') - h_t(x)]} / Z_t = [D_{t,-}(x) e^{α_t h_t(x)} / Z_{t,-}] · [D_{t,+}(x') e^{-α_t h_t(x')} / Z_{t,+}],
with Z_{t,-} = Σ_{x ∈ S-} D_{t,-}(x) e^{α_t h_t(x)} and Z_{t,+} = Σ_{x' ∈ S+} D_{t,+}(x') e^{-α_t h_t(x')}. 28
29 Ranking vs. Classification Bipartite case: can we learn to rank by training a classifier on the positive and negative sets? Different objective functions: AUC vs. 0/1 loss. Preliminary analysis (Cortes and MM, 2004): different results for imbalanced data sets, on average over all classifications. Example, stochastic case: A | BC +; B | AC -; C | AB. 29
30 This Talk Score-based ranking. Preference-based ranking. 30
31 Preference-Based Setting Definitions: U: universe, full set of objects. V: finite query subset to rank, V ⊆ U. τ*: target ranking for V (random variable). Two stages (can be viewed as a reduction): learn a preference function h: U × U → [0, 1]; given V, use h to determine a ranking σ of V. Running time: measured in terms of calls to h. 31
32 Preference-Based Ranking Problem Training data: pairs (V, τ*) sampled i.i.d. according to D: (V_1, τ_1*), (V_2, τ_2*), ..., (V_m, τ_m*), V_i ⊆ U; subsets ranked by different labelers. Learn a classifier / preference function h: U × U → [0, 1]. Problem: for any query set V ⊆ U, use h to return a ranking σ_{h,V} close to the target τ*, with small average error R(h, σ) = E_{(V,τ*)~D}[ L(σ_{h,V}, τ*) ]. 32
33 Preference Function h(u, v) close to 1 when u preferred to v, close to 0 otherwise. For the analysis, h(u, v) ∈ {0, 1}. Assumed pairwise consistent: h(u, v) + h(v, u) = 1. May be non-transitive, e.g., h(u, v) = h(v, w) = h(w, u) = 1. Output of a classifier or a black box. 33
34 Loss Functions (for fixed (V, τ*)) Preference loss: L(h, τ*) = 2/(n(n-1)) Σ_{u≠v} h(u, v) τ*(v, u). Ranking loss: L(σ, τ*) = 2/(n(n-1)) Σ_{u≠v} σ(u, v) τ*(v, u). 34
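For total orders (every pair labeled), the ranking loss above is just the fraction of discordant pairs between σ and τ*; a small sketch, with rankings represented as sequences from most to least preferred:

```python
from itertools import combinations

def ranking_loss(sigma, tau):
    """L(sigma, tau) = 2/(n(n-1)) * sum_{u != v} sigma(u, v) tau(v, u):
    the fraction of pairs ordered one way by sigma and the other by tau.
    sigma, tau: sequences listing the items of V from best to worst."""
    n = len(sigma)
    pos_s = {u: i for i, u in enumerate(sigma)}
    pos_t = {u: i for i, u in enumerate(tau)}
    disagreements = sum(1 for u, v in combinations(sigma, 2)
                        if (pos_s[u] - pos_s[v]) * (pos_t[u] - pos_t[v]) < 0)
    return 2.0 * disagreements / (n * (n - 1))

# one swapped pair (c, d) out of 6 pairs -> loss 1/6
print(round(ranking_loss("abcd", "abdc"), 4))  # 0.1667
```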
35 (Weak) Regret Preference regret: R'_class(h) = E_{V,τ*}[L(h|_V, τ*)] - E_V min_{h̃} E_{τ*|V}[L(h̃, τ*)]. Ranking regret: R'_rank(A) = E_{V,τ*,s}[L(A_s(V), τ*)] - E_V min_{σ∈S(V)} E_{τ*|V}[L(σ, τ*)]. 35
36 Deterministic Algorithm (Balcan et al., 07) Stage one: standard classification; learn a preference function h: U × U → [0, 1]. Stage two: sort-by-degree using the comparison function h: sort by number of points ranked below; quadratic time complexity O(n^2). 36
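Stage two of the deterministic reduction fits in a few lines; the numeric items and the preference function below are made up for illustration:

```python
def sort_by_degree(V, h):
    """Rank the elements of V by degree(u) = number of points ranked below u,
    i.e. sum_v h(u, v); requires O(n^2) calls to h."""
    return sorted(V, key=lambda u: sum(h(u, v) for v in V if v != u),
                  reverse=True)

# hypothetical preference function: prefer larger numbers
h = lambda u, v: 1 if u > v else 0
print(sort_by_degree([3, 1, 4, 1.5], h))  # [4, 3, 1.5, 1]
```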
37 Randomized Algorithm (Ailon & MM, 08) Stage one: standard classification; learn a preference function h: U × U → [0, 1]. Stage two: randomized QuickSort (Hoare, 61) using h as the comparison function. The comparison function is non-transitive, unlike in the textbook description; but the time complexity is shown to be O(n log n) in general. 37
38 Randomized QS (Figure: randomized QuickSort step with random pivot u; elements v with h(v, u) = 1 go to the left recursion, elements with h(u, v) = 1 to the right recursion.) 38
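The QuickSort step in the figure can be sketched as follows; because h is assumed pairwise consistent (h(u, v) + h(v, u) = 1), every non-pivot element lands in exactly one side even when h is non-transitive. The example preference function is a made-up transitive one so the output is deterministic:

```python
import random

def quicksort_rank(V, h, rng=random):
    """Randomized QuickSort with a (possibly non-transitive) preference
    function h: u is placed before v when h(u, v) = 1.
    Expected O(n log n) calls to h."""
    if len(V) <= 1:
        return list(V)
    pivot = rng.choice(V)
    left = [v for v in V if v != pivot and h(v, pivot) == 1]   # preferred to pivot
    right = [v for v in V if v != pivot and h(pivot, v) == 1]  # pivot preferred
    return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)

# hypothetical transitive preference: smaller numbers ranked first
h = lambda u, v: 1 if u < v else 0
print(quicksort_rank([3, 1, 4, 2], h))  # [1, 2, 3, 4]
```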
39 Deterministic Algo. - Bipartite Case (V = V+ ∪ V-) (Balcan et al., 07) Bounds for the deterministic sort-by-degree algorithm: expected loss: E_{V,τ*}[L(A(V), τ*)] ≤ 2 E_{V,τ*}[L(h, τ*)]; regret: R'_rank(A(V)) ≤ 2 R'_class(h). Time complexity: Ω(|V|^2). 39
40 Randomized Algo. - Bipartite Case (V = V+ ∪ V-) (Ailon & MM, 08) Bounds for randomized QuickSort (Hoare, 61): expected loss (equality): E_{V,τ*,s}[L(Q_s^h(V), τ*)] = E_{V,τ*}[L(h, τ*)]; regret: R'_rank(Q_s^h(·)) ≤ R'_class(h). Time complexity: full set: O(n log n); top k: O(n + k log k). 40
41 Proof Ideas QuickSort decomposition: p_uv = Σ_{w∉{u,v}} p_uvw [h(u, w) h(w, v) + h(v, w) h(w, u)]. Bipartite property: τ*(u, v) + τ*(v, w) + τ*(w, u) = τ*(v, u) + τ*(w, v) + τ*(u, w). (Figure: triangle over u, v, w.) 41
42 Lower Bound Theorem: for any deterministic algorithm A, there is a bipartite distribution for which R'_rank(A) ≥ 2 R'_class(h); thus, the factor of 2 is tight, and randomization is necessary for a better bound. Proof: take the simple case U = V = {u, v, w} and assume that h induces a cycle. Up to symmetry, A returns u, v, w or w, v, u in the deterministic case. 42
43 Lower Bound If A returns u, v, w, then choose τ* with u, v negative and w positive. If A returns w, v, u, then choose τ* with v, w negative and u positive. In either case, L(h, τ*) = 1/3 while L(A, τ*) = 2/3. 43
44 Guarantees - General Case Loss bound for QuickSort: E_{V,τ*,s}[L(Q_s^h(V), τ*)] ≤ 2 E_{V,τ*}[L(h, τ*)]. Comparison with the optimal ranking (see (CSS 99)): E_s[L(Q_s^h(V), σ_optimal)] ≤ 2 L(h, σ_optimal) and E_s[L(h, Q_s^h(V))] ≤ 3 L(h, σ_optimal), where σ_optimal = argmin_σ L(h, σ). 44
45 Generalization: Weight Function τ*(u, v) = σ*(u, v) ω(σ*(u), σ*(v)). Properties needed for all previous results to hold: symmetry: ω(i, j) = ω(j, i) for all i, j; monotonicity: ω(i, j), ω(j, k) ≤ ω(i, k) for i < j < k; triangle inequality: ω(i, j) ≤ ω(i, k) + ω(k, j) for all triplets i, j, k. 45
46 Weight Function - Examples Kemeny: ω(i, j) = 1 for all i, j. Top-k: ω(i, j) = 1 if i ≤ k or j ≤ k, 0 otherwise. Bipartite: ω(i, j) = 1 if i ≤ k and j > k, 0 otherwise. k-partite: can be defined similarly. 46
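The three example weight functions translate directly into code; positions are taken 1-based, and for the bipartite case "i ≤ k and j > k" is applied to the pair in either order, since ω is symmetric:

```python
def omega_kemeny(i, j, k=None):
    """Kemeny: every pair of positions has weight 1."""
    return 1

def omega_top_k(i, j, k):
    """Top-k: weight 1 iff at least one of the two positions is in the top k."""
    return 1 if i <= k or j <= k else 0

def omega_bipartite(i, j, k):
    """Bipartite: weight 1 iff exactly one of the two positions is in the top k."""
    lo, hi = min(i, j), max(i, j)
    return 1 if lo <= k < hi else 0

print([omega_top_k(1, 5, 3), omega_top_k(4, 5, 3),
       omega_bipartite(2, 6, 3), omega_bipartite(1, 2, 3)])  # [1, 0, 1, 0]
```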
47 (Strong) Regret Definitions Ranking regret: R_rank(A) = E_{V,τ*,s}[L(A_s(V), τ*)] - min_σ E_{V,τ*}[L(σ̃_V, τ*)]. Preference regret: R_class(h) = E_{V,τ*}[L(h|_V, τ*)] - min_{h̃} E_{V,τ*}[L(h̃|_V, τ*)]. All previous regret results hold if for any V_1, V_2 ⊇ {u, v}, E_{τ*|V_1}[τ*(u, v)] = E_{τ*|V_2}[τ*(u, v)] for all u, v (pairwise independence on irrelevant alternatives). 47
48 References
Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6.
Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT 2005.
Ailon, N., and Mohri, M. (2008). An efficient reduction of ranking to classification. COLT 2008, Helsinki, Finland. Omnipress.
Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. COLT 2007. Springer.
Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. JAIR 10.
Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT 2006.
Mehryar Mohri - Courant Seminar, Courant Institute, NYU

49 References
Cortes, C., and Mohri, M. (2003). AUC optimization vs. error rate minimization. NIPS 2003. MIT Press.
Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. WEA 2007 (pp. 1-21). Rome, Italy: Springer.
Cortes, C., and Mohri, M. (2004). Confidence intervals for the area under the ROC curve. NIPS 2004. MIT Press.
Crammer, K., and Singer, Y. (2001). Pranking with ranking. NIPS 2001. MIT Press.
Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the Web. WWW 10. ACM Press.
Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. JMLR 4.

50 References
Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.
Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers.
Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project.
Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.
Rudin, C., Cortes, C., Mohri, M., and Schapire, R. E. (2005). Margin-based ranking meets boosting in the middle. COLT 2005 (pp. 63-78).
Joachims, T. (2002). Optimizing search engines using clickthrough data. KDD 2002.
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationEnsemble Approach for the Classification of Imbalanced Data
Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au
More informationLearning to Rank Revisited: Our Progresses in New Algorithms and Tasks
The 4 th China-Australia Database Workshop Melbourne, Australia Oct. 19, 2015 Learning to Rank Revisited: Our Progresses in New Algorithms and Tasks Jun Xu Institute of Computing Technology, Chinese Academy
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationOn Adaboost and Optimal Betting Strategies
On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of
More informationAUC Optimization vs. Error Rate Minimization
AUC Optiization vs. Error Rate Miniization Corinna Cortes and Mehryar Mohri AT&T Labs Research 180 Park Avenue, Florha Park, NJ 0793, USA {corinna, ohri}@research.att.co Abstract The area under an ROC
More informationLeast-Squares Intersection of Lines
Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationAn Unsupervised Learning Algorithm for Rank Aggregation
ECML 07 An Unsupervised Learning Algorithm for Rank Aggregation Alexandre Klementiev, Dan Roth, and Kevin Small Department of Computer Science University of Illinois at Urbana-Champaign 201 N. Goodwin
More informationChapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling
Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one
More informationTraining Methods for Adaptive Boosting of Neural Networks for Character Recognition
Submission to NIPS*97, Category: Algorithms & Architectures, Preferred: Oral Training Methods for Adaptive Boosting of Neural Networks for Character Recognition Holger Schwenk Dept. IRO Université de Montréal
More informationOptimizing Search Engines using Clickthrough Data
Optimizing Search Engines using Clickthrough Data Thorsten Joachims Cornell University Department of Computer Science Ithaca, NY 14853 USA tj@cs.cornell.edu ABSTRACT This paper presents an approach to
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationBusiness Analytics for Non-profit Marketing and Online Advertising. Wei Chang. Bachelor of Science, Tsinghua University, 2003
Business Analytics for Non-profit Marketing and Online Advertising by Wei Chang Bachelor of Science, Tsinghua University, 2003 Master of Science, University of Arizona, 2006 Submitted to the Graduate Faculty
More informationDirect Loss Minimization for Structured Prediction
Direct Loss Minimization for Structured Prediction David McAllester TTI-Chicago mcallester@ttic.edu Tamir Hazan TTI-Chicago tamir@ttic.edu Joseph Keshet TTI-Chicago jkeshet@ttic.edu Abstract In discriminative
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationE-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationSparse Online Learning via Truncated Gradient
Sparse Online Learning via Truncated Gradient John Langford Yahoo! Research jl@yahoo-inc.com Lihong Li Department of Computer Science Rutgers University lihong@cs.rutgers.edu Tong Zhang Department of Statistics
More informationCS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:
More informationHow To Rank With Ancient.Org And A Random Forest
JMLR: Workshop and Conference Proceedings 14 (2011) 77 89 Yahoo! Learning to Rank Challenge Web-Search Ranking with Initialized Gradient Boosted Regression Trees Ananth Mohan Zheng Chen Kilian Weinberger
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationTree based ensemble models regularization by convex optimization
Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000
More informationConvex analysis and profit/cost/support functions
CALIFORNIA INSTITUTE OF TECHNOLOGY Division of the Humanities and Social Sciences Convex analysis and profit/cost/support functions KC Border October 2004 Revised January 2009 Let A be a subset of R m
More informationTable 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationSupport Vector Machine. Tutorial. (and Statistical Learning Theory)
Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationlarge-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationEMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA
EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA Andreas Christmann Department of Mathematics homepages.vub.ac.be/ achristm Talk: ULB, Sciences Actuarielles, 17/NOV/2006 Contents 1. Project: Motor vehicle
More informationKnowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
More informationBeating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
More information