JMLR: Workshop and Conference Proceedings 14 (2011), Yahoo! Learning to Rank Challenge

Web-Search Ranking with Initialized Gradient Boosted Regression Trees

Ananth Mohan, Zheng Chen, Kilian Weinberger
Department of Computer Science & Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA

Editors: Olivier Chapelle, Yi Chang, Tie-Yan Liu

(c) 2011 A. Mohan, Z. Chen & K. Weinberger.

Abstract

In May 2010 Yahoo! Inc. hosted the Learning to Rank Challenge. This paper summarizes the approach of the highly placed team Washington University in St. Louis. We investigate Random Forests (RF) as a low-cost alternative algorithm to Gradient Boosted Regression Trees (GBRT) (the de facto standard of web-search ranking). We demonstrate that it yields surprisingly accurate ranking results, comparable to or better than GBRT. We combine the two algorithms by first learning a ranking function with RF and using it as initialization for GBRT. We refer to this setting as igbrt. Following a recent discussion by Li et al. (2007), we show that the results of igbrt can be improved upon even further when the web-search ranking task is cast as classification instead of regression. We provide an upper bound of the Expected Reciprocal Rank (Chapelle et al., 2009) in terms of classification error and demonstrate that igbrt outperforms GBRT and RF on the Microsoft Learning to Rank and Yahoo Ranking Competition data sets with surprising consistency.

Keywords: Ranking, Decision Trees, Boosting, Random Forests

1. Introduction

The success of search engines such as Google, Yahoo! and Bing has led to an increased interest in algorithms for automated web-search ranking. Web-search ranking is often treated as a supervised machine learning problem (Burges et al., 2005; Li et al., 2007; Zheng et al., 2007b): each query-document pair is represented by a high-dimensional feature vector, and its label indicates the document's degree of relevance to the query. A machine learning algorithm is trained to predict the relevance from the feature vector, and during test time the documents are ranked according to these predictions.

The past years have seen many different approaches to web-search ranking, including adaptations of support vector machines (Joachims, 2002; Chapelle and Keerthi, 2010), neural networks (Burges et al., 2005) and gradient boosted regression trees (GBRT) (Zheng et al., 2007b). The latter has arguably established itself as the current state-of-the-art learning paradigm (Li et al., 2007; Gao et al., 2009; Burges, 2010).

Irrespective of which learning algorithm is used, the various ranking settings fall into three categories: point-wise, pair-wise, and list-wise.
Point-wise algorithms predict the relevance of a document to a query by minimizing a regression loss (e.g. the squared loss). Pair-wise approaches learn a classifier that predicts whether one document is more relevant than another; these approaches include RankBoost (Freund et al., 2003), FRank (Tsai et al., 1999), and GBRank (Zheng et al., 2007a). List-wise approaches, such as AdaRank (Xu, 2007) and PermuRank (Xu et al., 2008), tend to iteratively optimize a specialized ranking performance measure, for example NDCG.

In this paper we describe the point-wise ranking approach of the team Washington University in St. Louis for the Yahoo Learning To Rank Challenge in May 2010. Most of the decisions we made throughout the competition were heavily influenced by our limited computational resources. We focused on developing a light-weight algorithm that can be trained fully on a single multi-core desktop within a reasonable amount of time.

We investigate the use of Random Forests (RF) (Breiman, 2001) for web-search ranking and demonstrate that it can be a powerful low-cost alternative to GBRT. Although RF is also based on tree averaging, it has several clear advantages over GBRT: 1. it is particularly insensitive to parameter choices; 2. it is known to be very resistant to overfitting; 3. it is embarrassingly parallelizable. In addition, we demonstrate on several real-world benchmark data sets that RF can match (or outperform) GBRT with surprising consistency.

As a second contribution, we address a particular weakness of GBRT. In gradient boosting, there exists an inherent trade-off between the step-size and early stopping. To obtain the true global minimum, the step-size needs to be infinitesimally small and the number of iterations very large. Of course, this is unrealistic, and common practice is to use a reasonably small step-size (around 0.1) with roughly 1000 iterations. We show that, as RF often outperforms GBRT and is very resistant to overfitting, its predictions can be used as a starting point for the gradient boosting. In this setting GBRT starts off at a point that is very close to the global minimum and merely refines the already good predictions. We refer to the resulting algorithm as initialized Gradient Boosted Regression Trees (igbrt). igbrt is insensitive to parameters, and when web-search ranking is cast as a classification problem, as suggested by Li et al. (2007), it outperforms both GBRT and Random Forests on all Yahoo and Microsoft benchmark datasets. As a final contribution, in addition to the empirical evaluation, we provide an upper bound of the ERR metric with respect to the classification error of a ranker.

This paper is organized as follows: In section 2 we briefly review the web-search ranking setting and define necessary notation. In section 3 we introduce RF. In section 4 we introduce GBRT. Both algorithms are combined in section 5 as initialized gradient boosted regression trees (igbrt). In section 6 we review the classification setting introduced by Li et al. (2007), adapt igbrt to this framework and provide an upper bound of the ERR in terms of the classification error. Finally, we provide an extensive evaluation of all algorithms in section 7.

2. Notation and Setup

We assume that we are provided with data of triples D = {(x_1, q_1, y_1), ..., (x_n, q_n, y_n)}, consisting of documents (x_i ∈ R^f), queries (q_i ∈ {1, ..., n_q}) and labels (y_i ∈ {0, ..., 4}).
The label y_i indicates to what degree document x_i is relevant to the query q_i and ranges from y_i = 0 ("irrelevant") to y_i = 4 ("perfect match"). There are fewer queries than samples (n_q < n).
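For concreteness, such a data set of (document, query, label) triples could be held in memory as parallel arrays, as in the small sketch below. This is purely illustrative; the array names X, qid and y and the sizes are not from the paper.

import numpy as np

n, f = 1000, 700                      # number of query-document pairs and of features (illustrative)
X = np.random.rand(n, f)              # x_i in R^f: one feature vector per query-document pair
qid = np.random.randint(0, 30, n)     # q_i: the query id of each document (here 0-29)
y = np.random.randint(0, 5, n)        # y_i in {0, ..., 4}: graded relevance label

# Documents of a single query, e.g. the one with id 7, are ranked by sorting
# the predictions of a point-wise model over the rows with qid == 7.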
In this paper, our algorithmic setup is not affected by the number of queries. We assume that a document vector x_i is an f-dimensional feature vector that incorporates all the sufficient statistics about the query and the document as features. For example, one feature could be the number of occurrences of the query in the document. To simplify notation we will assume that all documents belong to the same query, i.e. n_q = 1, and with a slight abuse of notation let D = {(x_1, y_1), ..., (x_n, y_n)}. However, the techniques work for sets with multiple queries, and in fact the data we use for experiments does contain many queries.

Point-wise machine-learned ranking trains a predictor T(·) such that T(x_i) ≈ y_i. If T(·) is accurate, the ordering π_h of all documents according to the values of T(x_i) should be close to the desired ordering π_y according to y_i. We can evaluate the quality of the ranking function either with the root mean squared error (RMSE) or with ranking-specific metrics such as normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002) or expected reciprocal rank (ERR) (Chapelle et al., 2009). (Please note that for RMSE lower values are better, whereas for NDCG and ERR higher values indicate better performance.)

All the algorithms in this paper are based on Classification and Regression Trees (CART) (Breiman, 1984). We assume that we have an efficient implementation of a slightly modified version of CART that greedily builds a regression tree to minimize the squared loss, but at each split uniformly samples k features and only evaluates those as candidates for splitting:

$$\mathrm{Cart}(S, k, d) \approx \operatorname*{argmin}_{h \in \mathcal{T}_d} \sum_{(\mathbf{z}_i, r_i) \in S} \big(h(\mathbf{z}_i) - r_i\big)^2. \qquad (1)$$

The three parameters of our CART algorithm (1) are: 1. a set S ⊆ D; 2. an integer k ≤ f which determines how many uniformly picked features are considered at each split; 3. an integer d > 0 that defines the maximum depth of the resulting tree (in (1), T_d denotes the set of all CART trees of maximum depth d).

The Yahoo Learning to Rank Challenge was based on two data sets of unequal size, Set 1 and Set 2. We use the smaller Set 2 for illustration throughout the paper. In section 7 we report a thorough evaluation on both Yahoo data sets and the five folds of the Microsoft MSLR data set.
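Since ERR is the primary metric in everything that follows, a small self-contained sketch of how it can be computed for a single query may be helpful. It assumes the standard definition of Chapelle et al. (2009) with satisfaction probability R(y) = (2^y - 1)/16, spelled out in Appendix A; it is illustrative code, not the authors' implementation.

def expected_reciprocal_rank(labels_in_ranked_order):
    """ERR for one query; the argument lists the graded labels y in {0,...,4}
    of the returned documents from rank 1 downwards."""
    p_not_satisfied = 1.0                    # probability the user is still looking
    err = 0.0
    for rank, y in enumerate(labels_in_ranked_order, start=1):
        r = (2.0 ** y - 1.0) / 16.0          # satisfaction probability R(y)
        err += p_not_satisfied * r / rank    # payoff 1/rank, weighted by the stopping model
        p_not_satisfied *= 1.0 - r
    return err

# A ranking that places a perfect match (y = 4) first scores much higher:
print(expected_reciprocal_rank([4, 0, 1]))   # about 0.94
print(expected_reciprocal_rank([0, 1, 4]))   # about 0.32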
3. Random Forests

In this section we briefly introduce Random Forests (Breiman, 2001). The fundamental concept underlying Random Forests is bagging (Breiman, 1984). In bagging, a learning algorithm is applied multiple times to a subset of D and the results are averaged. Each time the algorithm is trained, n = |D| data points are sub-sampled with replacement from D, so that the individual classifiers vary slightly. This process reduces overfitting by averaging classifiers that are trained on different data sets from the same underlying distribution. Random Forests is essentially bagging applied to CART with full depth (d = ∞), where at each split only K uniformly chosen features are evaluated to find the best splitting point. The construction of a single tree is independent of earlier trees, thus making Random Forests an inherently parallel algorithm. Algorithm 1 implements the regression version of the Random Forests algorithm.

Algorithm 1: Random Forests
  Input: D = {(x_1, y_1), ..., (x_n, y_n)}; Parameters: K (0 < K ≤ f), M_RF (M_RF > 0)
  for t = 1 to M_RF do
    D_t ← sample with replacement from D, with |D_t| = |D|
    h_t ← Cart(D_t, K, ∞)    # build a full (d = ∞) CART with K ≤ f randomly chosen features at each split
  end for
  return T(·) = (1 / M_RF) Σ_{t=1}^{M_RF} h_t(·)

Only two parameters need to be tuned. M_RF specifies the number of trees in the forest and K determines the number of features that each node considers to find the best split. As Random Forests is based on bagging, it does not overfit with increasing M_RF, so we set it to be very large (M_RF = 10000). The algorithm is only sensitive to one parameter, K. A common rule of thumb is to set K to 10% of the number of features (i.e. K = 0.1f). We follow this rule throughout the paper.
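Algorithm 1 can be sketched in a few lines of Python with scikit-learn, where DecisionTreeRegressor(max_features=K, max_depth=None) plays the role of Cart(D_t, K, ∞). This is a minimal illustration, not the authors' parallel C++ implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_fit(X, y, K, M_RF, seed=0):
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(M_RF):
        idx = rng.randint(0, n, n)                      # D_t: sample n points with replacement
        tree = DecisionTreeRegressor(max_depth=None,    # d = infinity: grow the tree fully
                                     max_features=K,    # evaluate only K random features per split
                                     random_state=rng.randint(2**31 - 1))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    # T(x) = (1 / M_RF) * sum_t h_t(x)
    return np.mean([t.predict(X) for t in trees], axis=0)

In practice one would simply use sklearn.ensemble.RandomForestRegressor(n_estimators=M_RF, max_features=K), which implements the same bagging-plus-random-splits idea.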
4. Gradient Boosted Regression Trees

Similar to Random Forests, Gradient Boosted Regression Trees (GBRT) (Friedman, 2001) is a machine learning technique that is also based on tree averaging. However, instead of training many full (d = ∞) high-variance trees that are averaged to avoid overfitting, GBRT sequentially adds small trees (d ≈ 4), each with high bias. In each iteration, the new tree to be added focuses explicitly on the documents that are responsible for the current remaining regression error. Empirical results have shown that GBRT is especially well suited for web-search ranking (Zheng et al., 2007b; Burges, 2010). In fact, all the winning teams of the Yahoo Learning to Rank Challenge used some variation of GBRT.¹

¹ See the official ICML workshop homepage.

Let T(x_i) denote the current prediction of sample x_i. Furthermore, assume we have a continuous, convex and differentiable loss function L(T(x_1), ..., T(x_n)) which reaches its minimum if T(x_i) = y_i for all x_i. Throughout the paper we use the squared loss L = (1/2) Σ_i (T(x_i) − y_i)². In each iteration, a new tree h(·) is added to the existing classifier T(·). The best h(·) is found with a first-order Taylor expansion of L(T + h_t), which is minimized with respect to h_t(·). See Zheng et al. (2007b) for a detailed derivation.

Algorithm 2: Gradient Boosted Regression Trees (Squared Loss)
  Input: data set D = {(x_1, y_1), ..., (x_n, y_n)}; Parameters: α, M_B, d
  Initialization: r_i = y_i for i = 1 to n
  for t = 1 to M_B do
    T_t ← Cart({(x_1, r_1), ..., (x_n, r_n)}, f, d)    # build a CART of depth d, with all f features and targets {r_i}
    for i = 1 to n do
      r_i ← r_i − α T_t(x_i)    # update the residual of each sample x_i
    end for
  end for
  return T(·) = α Σ_{t=1}^{M_B} T_t(·)    # combine the regression trees T_1, ..., T_{M_B}

Intuitively, GBRT performs gradient descent in the instance space x_1, ..., x_n, i.e. during each iteration the current prediction T(x_i) is updated with a gradient step

$$T(\mathbf{x}_i) \leftarrow T(\mathbf{x}_i) - \alpha \frac{\partial L}{\partial T(\mathbf{x}_i)}, \qquad (2)$$

where α > 0 denotes the learning rate. The negative gradient −∂L/∂T(x_i) is approximated with the prediction of the regression tree h_t(x_i), which satisfies

$$h_t \approx \operatorname*{argmin}_{h \in \mathcal{T}_d} \sum_{i=1}^{n} \big(h(\mathbf{x}_i) - r_i\big)^2, \quad \text{where } r_i = -\frac{\partial L}{\partial T(\mathbf{x}_i)}. \qquad (3)$$

In the case where L is the squared loss, the gradient for a document x_i becomes the residual from the previous iteration, i.e. r_i = y_i − T(x_i). In each iteration t, we use the standard CART algorithm (1), with K = f, to find a solution to (3).

GBRT depends on three parameters: the learning rate α > 0, the tree depth d, and the number of iterations M_B. Based on experiments from previous research (Friedman, 2001; Zheng et al., 2007b), we set d = 4 and pick M_B and α with a validation data set. Smaller learning rates tend to result in better accuracy but require more iterations. If M_B is too large, the algorithm starts overfitting.
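Algorithm 2 with the squared loss translates almost line for line into Python; the residual update is exactly the gradient step (2)-(3). Again this is a hedged sketch using scikit-learn trees (with all f features per split, the library default), not the authors' implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, alpha=0.1, M_B=1000, d=4, init=None):
    """init: optional array of starting predictions F(x_i); all zeros gives standard GBRT."""
    pred = np.zeros(len(y), dtype=float) if init is None else np.asarray(init, dtype=float)
    residual = y - pred                              # r_i = y_i - T(x_i): negative gradient of the squared loss
    trees = []
    for _ in range(M_B):
        tree = DecisionTreeRegressor(max_depth=d)    # small (d ~ 4), high-bias tree
        tree.fit(X, residual)
        residual = residual - alpha * tree.predict(X)   # r_i <- r_i - alpha * T_t(x_i)
        trees.append(tree)
    return trees

def gbrt_predict(trees, X, alpha=0.1, init=None):
    pred = np.zeros(X.shape[0]) if init is None else np.asarray(init, dtype=float)
    for tree in trees:
        pred = pred + alpha * tree.predict(X)
    return pred

The init arguments are not needed for plain GBRT, but they make the connection to the initialized variant of the next section explicit.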
Figure 1 shows how boosting compares to RF on the Yahoo Learning to Rank Challenge data Set 2 for various settings of α and M_B. The figure shows a clear trend that RF outperforms all settings of GBRT according to all three metrics (ERR, NDCG, RMSE). Although not shown here, boosting for even more iterations, up to M_B = 5000, did not change that result at any time; on the contrary, GBRT started to overfit, widening the performance gap between RF and GBRT. Similar results were obtained on the much larger Yahoo Set 1, whereas the results on the Microsoft Learning to Rank data set were mixed with no clear winner (see Table 3 in section 7). The RF baseline was trained with the M_RF and K settings described in section 3.

[Figure 1: Results of GBRT (with step-sizes α = 0.02 to 0.1) compared to RF (bold black line) on Yahoo Set 2, shown as RMSE, NDCG and ERR over the number of iterations M. RF outperforms boosted regression trees with respect to RMSE, ERR and NDCG.]

5. Initialized Gradient Boosted Regression Trees

GBRT, as described in the previous section, is traditionally initialized with the all-zero function T_0(x_i) = 0. Consequently, the initial residual is r_i = y_i. As the loss function L is convex, the gradient descent approach from GBRT should converge to its global minimum irrespective of its initialization (with an appropriate learning rate α) (Shor, 1970). However, these theoretical results tend not to hold in practice for two reasons: 1. in each iteration the gradient is only approximated; 2. for true convergence, the learning rate α needs to be infinitesimally small (α → 0), requiring an unrealistically large number of iterations M_B.

We therefore propose to initialize GBRT with the predictions of RF from section 3. RF is a good initialization for several reasons: 1. RF is known to be very resistant towards overfitting and therefore makes a good optimization starting point; 2. RF is insensitive to parameter settings and does not require additional parameter tuning. Algorithm 3 shows the pseudo-code implementation of igbrt.

Algorithm 3: Initialized Gradient Boosted Regression Trees (Squared Loss)
  Input: data set D = {(x_1, y_1), ..., (x_n, y_n)}; Parameters: α, M_B, d, K_RF, M_RF
  F ← RandomForests(D, K_RF, M_RF)
  Initialization: r_i = y_i − F(x_i) for i = 1 to n
  for t = 1 to M_B do
    T_t ← Cart({(x_1, r_1), ..., (x_n, r_n)}, f, d)    # build a CART of depth d, with all f features and targets {r_i}
    for i = 1 to n do
      r_i ← r_i − α T_t(x_i)    # update the residual of each sample x_i
    end for
  end for
  return T(·) = F(·) + α Σ_{t=1}^{M_B} T_t(·)    # combine the regression trees T_1, ..., T_{M_B} with the RF F

The main differences to GBRT (Algorithm 2) lie in the initial settings of the r_i, which are set to the residual of the RF predictions, r_i ← y_i − F(x_i), and in the final boosted classifier, which is added to the initial results of RF.
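In scikit-learn terms, Algorithm 3 corresponds to passing a Random Forest as the init estimator of gradient boosting: the forest supplies F, and the boosting stages are then fitted to the residuals y_i − F(x_i). A hedged, compact sketch (hyperparameters follow the paper's stated defaults where given; this is not the authors' implementation):

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

igbrt = GradientBoostingRegressor(
    learning_rate=0.1,              # alpha
    n_estimators=500,               # M_B (the paper used M_B = 500 in its timing experiments)
    max_depth=4,                    # d = 4
    init=RandomForestRegressor(     # F: the Random Forests initialization of Algorithm 3
        n_estimators=1000,          # M_RF (the paper goes up to 10000)
        max_features=0.1))          # K = 0.1 f

# igbrt.fit(X_train, y_train); scores = igbrt.predict(X_test)
# Documents of each query are then ranked by decreasing score.

The default loss of GradientBoostingRegressor is the squared loss, matching Algorithm 3; the init estimator is fitted on the training data first, and the subsequent boosting stages are trained on its residuals.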
[Figure 2: Results of igbrt on the Yahoo test Set 2 (RMSE, NDCG and ERR over the number of iterations M, for step-sizes α = 0.02 to 0.1 and RF). The predictions were initialized with the predictions of Random Forests (RF). igbrt improves over RF on all metrics, with some overfitting in the settings with large learning rates (α = 0.1).]

Figure 2 shows the traces of igbrt under various step-sizes (on the test partition of Yahoo Set 2). In contrast to standard GBRT, igbrt improves consistently over Random Forests. In fact, igbrt outperforms RF and GBRT for all settings of α and M_B ≤ 1000 in RMSE, NDCG, and ERR (except for a very brief period in ERR with α = 0.1).

[Table 1: Performance of igbrt with varying initializations (number of RF trees M_RF) on Yahoo Set 2, evaluated in RMSE, ERR and NDCG.]

Table 1 shows ranking results of igbrt under varying amounts of trees M_RF for the Random Forests initialization. The column with M_RF = 0 is identical to standard GBRT. The number of boosting iterations M_B was chosen on a validation data set (up to M_B = 1000). The table shows clearly that the boosted results were heavily influenced by their initialization. Strikingly, even with only 100 averaged trees as initialization, igbrt outperforms standard GBRT (M_RF = 0) with respect to all three metrics.

6. Classification vs. Regression

So far, all our algorithms used regression to approximate the relevance of a document. Recently, Li et al. (2007) proposed a learning-to-rank paradigm that is based on classification instead of regression. Instead of learning a function T(x_i) ≈ y_i, the authors utilize the fact that the original relevance scores are discrete, y_i ∈ {0, 1, 2, 3, 4}, and generate four binary classification problems indexed by c = 1, ..., 4.
The c-th classification problem predicts whether the document is less relevant than c. More formally, we denote the binary label of document x_i for problem c ∈ {1, ..., 4} as b_i^c = 1(y_i < c). For each of these binary classification problems we train a classifier T^c(·). We carefully choose the classifiers T^c(·) to return well-defined probabilities (i.e. 0 ≤ T^c(·) ≤ 1), so that T^c(x) can be interpreted as the probability of document x being less relevant than c. More formally, T^c(x) = P(rel(x) < c). If we define the constant functions T^0(·) = 0 and T^5(·) = 1 (by definition relevance is non-negative and all documents are less relevant than 5), we can combine all classifiers T^0, ..., T^5 to compute the probability that a document x_i has a relevance of r ∈ {0, ..., 4}:

$$P(\mathrm{rel}(\mathbf{x}_i) = r) = P(\mathrm{rel}(\mathbf{x}_i) < r + 1) - P(\mathrm{rel}(\mathbf{x}_i) < r) = T^{r+1}(\mathbf{x}_i) - T^{r}(\mathbf{x}_i).$$

Li et al. (2007) show that GBRT is well suited for this setup. Regression trees, minimizing the squared loss, predict the average label of the documents at a leaf. If these contain only binary labels {0, 1}, the predictions are within the interval [0, 1]. The same holds for Random Forests, which are essentially averaged regression trees. We can therefore use RF and igbrt as binary classifiers for this framework.²

² Depending on the step-size, there might be small violations of the probability assumptions in the case of boosting. However, in our experience this does not seem to hurt the performance.

We evaluate the classification paradigm in the following section. In an attempt to explain our empirical results, which clearly favor classification over regression, we show that the ERR error is bounded by the classification error in Appendix A. Our current bound shows a clear relationship between ERR and classification performance; however, it is probably too loose to be of considerable practical value.

Theorem 1  Given n documents indexed by {1, ..., n}, suppose a classifier assigns a relevance score to each document, denoted by ŷ_1, ..., ŷ_n ∈ {0, 1, 2, 3, 4}. A ranker π ranks documents according to ŷ_i such that π(i) < π(j) if ŷ_i > ŷ_j (ties are broken arbitrarily). Let g be a perfect ranker and let the ERR scores of g and π be ERR_g and ERR_π, respectively. The ERR error of π, ERR_g − ERR_π, is bounded by the square root of the classification error:

$$\mathrm{ERR}_g - \mathrm{ERR}_\pi \;\le\; \frac{15\,\pi}{16\sqrt{6}} \sqrt{\sum_{i=1}^{n} \mathbf{1}_{\,y_i \neq \hat{y}_i}}.$$
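The classification setting just described can be sketched as follows. Training four regressors on the binary targets 1(y_i < c) gives estimates of T^c(x) = P(rel(x) < c), which are combined into P(rel = r); turning that distribution into a single ranking score via the expected relevance is one common choice (used by Li et al. (2007)) and is an assumption here, since the combination rule is not restated above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_binary_rankers(X, y, make_regressor):
    """Train T^c for c = 1..4 by regressing on the binary target 1(y < c)."""
    return {c: make_regressor().fit(X, (y < c).astype(float)) for c in range(1, 5)}

def expected_relevance(models, X):
    n = X.shape[0]
    T = {0: np.zeros(n), 5: np.ones(n)}                       # T^0 = 0 and T^5 = 1 by definition
    T.update({c: np.clip(m.predict(X), 0.0, 1.0) for c, m in models.items()})
    probs = np.stack([T[r + 1] - T[r] for r in range(5)])     # P(rel = r) = T^{r+1} - T^r, shape (5, n)
    return (np.arange(5)[:, None] * probs).sum(axis=0)        # assumed scoring rule: expected relevance

# models = fit_binary_rankers(X_train, y_train,
#                             lambda: RandomForestRegressor(n_estimators=1000, max_features=0.1))
# scores = expected_relevance(models, X_test)   # rank each query's documents by decreasing score

The np.clip mirrors footnote 2: boosted predictions may slightly leave [0, 1], which in practice does not seem to matter.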
7. Results

We evaluate all algorithms on several data sets from the Yahoo Ranking Competition (two sets from different countries) and the five splits of the Microsoft MSLR data sets. Both data sets contain pre-defined train/validation/test splits. Table 2 summarizes various statistics about all the data sets.

[Table 2: Statistics of the Yahoo Competition and Microsoft Learning to Rank data sets: number of features, documents and queries, average number of documents per query, and percentage of missing features, for the train and test portions of Yahoo LTRC Set 1 and Set 2 and MSLR folds F1-F5.]

We experimented with several ways to deal with missing features (which were present in all data sets): splitting three ways during tree construction, and substituting missing features with zeros, infinity, the mean or the median feature value. Results on validation splits showed that substituting zeros for missing values tends to outperform the alternative approaches across most data sets.

We trained the respective rankers on the training set, then ranked the validation and test sets. The performance of the learners is judged by ERR and NDCG. For boosting, we used learning rates α ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. The reported numbers are the performances from the test sets, with the parameter settings that best performed on the validation sets. For RF, we used the rule of thumb and fixed K = 0.1f and M_RF = 10000.
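The model-selection protocol just described (train on the training split, pick the learning rate by average validation ERR, report test performance) might look as follows; err_per_query can be any ERR implementation, e.g. the sketch given earlier, and scikit-learn's GradientBoostingRegressor is an illustrative stand-in for the paper's GBRT.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def select_learning_rate(X_tr, y_tr, X_va, y_va, qid_va, err_per_query,
                         alphas=(0.1, 0.2, 0.3, 0.4, 0.5), M_B=1000, d=4):
    best = (None, -np.inf, None)                       # (alpha, validation ERR, model)
    for alpha in alphas:
        model = GradientBoostingRegressor(learning_rate=alpha, n_estimators=M_B,
                                          max_depth=d).fit(X_tr, y_tr)
        scores = model.predict(X_va)
        errs = []
        for q in np.unique(qid_va):
            mask = qid_va == q
            order = np.argsort(-scores[mask])          # rank the query's documents by decreasing score
            errs.append(err_per_query(y_va[mask][order]))
        if np.mean(errs) > best[1]:
            best = (alpha, np.mean(errs), model)
    return best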
[Table 4: Run times (in minutes) for RF, GBRT and igbrt on the Yahoo Ranking Competition Set 2.]

                    RF (M_RF = 10000)   RF (M_RF = 1000)   GBRT    igbrt
  Regression             130m                 -            103m    181m
  Classification         768m                77m           388m    842m

We performed all experiments on a standard workstation with 8 cores.⁴ Table 4 summarizes the training times. All implementations were parallelized. The RF implementation distributed the construction of the M_RF trees onto the various cores; during boosting, the splits were performed by different threads in parallel. With a comparable number of trees (M_RF = M_B), RF was by far the fastest algorithm, despite the fact that its trees are much larger. This can be attributed to the fact that RF scales linearly with the number of cores and resulted in much better CPU utilization. For the best possible ranking results we used RF with M_RF = 10000, which was a bit slower than GBRT. Although the differences between M_RF = 10000 and M_RF = 1000 were not necessarily significant, we included them as they did matter for the competition leaderboard. igbrt was the slowest algorithm as it involves both RF and GBRT sequentially (given our limited resources, we set M_B = 500). In the classification setting, boosting became faster than RF as the different classification problems could be learned in parallel and the 8 cores were utilized more effectively. Both boosting algorithms required additional time for the parameter search by cross validation. Our own open-source C++ implementation of all algorithms and links to the data sets are available online.

⁴ Intel Xeon.

In addition to regression, we also used the classification setting from section 6 on all three algorithms (igbrt, RF and GBRT). Table 3 summarizes the results under ERR and NDCG.

[Table 3: Performance of Gradient Boosted Regression Trees (GBRT), Random Forests (RF) and Initialized Gradient Boosted Regression Trees (igbrt) on Yahoo LTRC Set 1 and Set 2 and MSLR folds F1-F5. All results are evaluated in ERR (upper table) and NDCG (lower table). We investigated both the regression setting (R, in the second column) and classification (C). The parameters were set by cross validation on the pre-defined validation data.]

We observed the following trends: 1. Classification reliably outperforms regression across all methods. 2. igbrt (with classification) outperforms RF and GBRT on most of the data sets according to both ERR and NDCG. 3. RF (classification) yielded better results than GBRT on all data sets except MSLR Fold 3. This is particularly impressive, as no parameter sweep was performed for RF.

In the Yahoo Competition, igbrt(C) would have achieved 11th place on Set 1 and igbrt(R) 4th place on Set 2 (with differences in ERR from the winning solution on the order of 10^-3 and 10^-4, respectively). Please note that the higher-ranked submissions tended to average over millions of trees and were trained for weeks on over a hundred computers (Burges, 2010), whereas our models could be trained on a single desktop in one day.
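Because each tree in a forest is grown independently, the tree-level parallelism discussed above is easy to reproduce; a small sketch with joblib (again illustrative, not the authors' multi-threaded C++ code):

import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

def _one_tree(X, y, K, seed):
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, len(y), len(y))                 # bootstrap sample D_t
    return DecisionTreeRegressor(max_features=K, random_state=seed).fit(X[idx], y[idx])

def parallel_random_forest(X, y, K, M_RF=1000, n_jobs=8):
    # Tree construction distributes cleanly over the cores; predictions are averaged as before.
    return Parallel(n_jobs=n_jobs)(delayed(_one_tree)(X, y, K, s) for s in range(M_RF))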
8. Conclusion

In this paper we compared three algorithms for machine-learned web-search ranking in a regression and a classification setting. We showed that Random Forests, with parameters picked according to simple rules of thumb, is a very competitive algorithm and reliably outperforms tuned Gradient Boosted Regression Trees. We introduced initialized gradient boosted regression trees (igbrt), which uses GBRT to further refine the results of Random Forests. Finally, we demonstrated that classification tends to be a better paradigm for web-search ranking than regression. In fact, igbrt in a classification setting consistently achieves state-of-the-art performance on all publicly available web-search data sets that we are aware of.

Appendix A.

Let us define a ranker as a bijective function π : {1, ..., n} → {1, ..., n}, which maps a given document's index i to its ordering position π_i. Also, let us denote the inverse mapping of π by σ_i = σ(i) = π^{-1}(i). Expected Reciprocal Rank (ERR) (Chapelle et al., 2009) was designed to measure the performance of web-search ranking algorithms. It simulates a person who looks through the output of a web-search engine. The model assumes that the web surfer goes through the documents in order of increasing rank and stops the moment he/she is satisfied. R(y_i) denotes the probability that the person is satisfied with the i-th result. The ranker obtains a payoff of 1/i if the person is satisfied with the document in the i-th position. ERR computes the expected payoff under this model. The formal definition is as follows:

$$\mathrm{ERR}_\pi = \sum_{i=1}^{n} C_\pi(i)\, R(y_{\sigma_i}) = \sum_{i=1}^{n} C_\pi(\pi_i)\, R(y_i), \qquad (4)$$

where

$$C_\pi(i) = \frac{1}{i} \prod_{j=1}^{i-1} \big(1 - R(y_{\pi^{-1}(j)})\big) = \frac{1}{i} \prod_{j=1}^{i-1} \big(1 - R(y_{\sigma_j})\big) \qquad (5)$$

and R(y) = (2^y − 1)/16. Here C_π(i) is the product of the payoff 1/i and the probability that the surfer was not satisfied with all previous documents, ∏_{j=1}^{i−1} (1 − R(y_{σ_j})).

Definition 1  Perfect ranker: a ranker g is a perfect ranker if it ranks documents according to their original label y_i, i.e. g(i) < g(j) if y_i > y_j. When y_i = y_j, documents i and j are arbitrarily ranked. We also denote the inverse mapping of g by φ_i = φ(i) = g^{-1}(i).

Now we denote the corresponding ERR score of the perfect ranker g as ERR_g. Then, for a given ranker π, the ERR error is ERR_g − ERR_π. Theorem 1 states that the ERR error can be bounded by the classification error of the documents. The proof is below.

Proof  This proof follows a line of reasoning inspired by Li et al. (2007). By the definition of ERR we have

$$\mathrm{ERR}_\pi = \sum_{i=1}^{n} C_\pi(\pi_i)\, \frac{2^{y_i}-1}{16} = \sum_{i=1}^{n} C_\pi(\pi_i)\, \frac{2^{\hat{y}_i}-1}{16} + \sum_{i=1}^{n} C_\pi(\pi_i)\, \frac{2^{y_i}-2^{\hat{y}_i}}{16}. \qquad (6)$$

According to the rearrangement inequality (because π is the ideal ranking for ŷ_i) we can show that
$$\sum_{i=1}^{n} C_\pi(i)\, R(\hat{y}_{\sigma_i}) \;\ge\; \sum_{i=1}^{n} C_g(i)\, R(\hat{y}_{\phi_i}), \quad\text{i.e.}\quad \sum_{i=1}^{n} C_\pi(\pi_i)\, \frac{2^{\hat{y}_i}-1}{16} \;\ge\; \sum_{i=1}^{n} C_g(g_i)\, \frac{2^{\hat{y}_i}-1}{16}. \qquad (7)$$

Combining (6) and (7) leads to

$$\mathrm{ERR}_\pi \;\ge\; \sum_{i=1}^{n} C_g(g_i)\, \frac{2^{\hat{y}_i}-1}{16} + \sum_{i=1}^{n} C_\pi(\pi_i)\, \frac{2^{y_i}-2^{\hat{y}_i}}{16} = \sum_{i=1}^{n} C_g(g_i)\, \frac{2^{y_i}-1}{16} + \frac{1}{16}\sum_{i=1}^{n} \big(C_\pi(\pi_i) - C_g(g_i)\big)\big(2^{y_i}-2^{\hat{y}_i}\big) = \mathrm{ERR}_g + \frac{1}{16}\sum_{i=1}^{n} \big(C_\pi(\pi_i) - C_g(g_i)\big)\big(2^{y_i}-2^{\hat{y}_i}\big).$$

Consequently we obtain

$$\mathrm{ERR}_g - \mathrm{ERR}_\pi \;\le\; \frac{1}{16}\sum_{i=1}^{n} \big(C_g(g_i) - C_\pi(\pi_i)\big)\big(2^{y_i}-2^{\hat{y}_i}\big).$$

Applying the Cauchy-Schwarz inequality leads to

$$\mathrm{ERR}_g - \mathrm{ERR}_\pi \;\le\; \frac{1}{16}\Big(\sum_{i=1}^{n} \big(C_\pi(\pi_i) - C_g(g_i)\big)^2\Big)^{1/2} \Big(\sum_{i=1}^{n} \big(2^{y_i}-2^{\hat{y}_i}\big)^2\Big)^{1/2} \;\le\; \frac{15\,\pi}{16\sqrt{6}} \sqrt{\sum_{i=1}^{n} \mathbf{1}_{\,y_i \neq \hat{y}_i}},$$

where we use the facts that $\sum_{i=1}^{n} \big(C_\pi(\pi_i) - C_g(g_i)\big)^2 \le \sum_{i=1}^{n} \frac{1}{i^2} \le \frac{\pi^2}{6}$ and $\max_i \big|2^{y_i}-2^{\hat{y}_i}\big| \le 2^4 - 2^0 = 15$.

References

L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.

C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pages 89-96, New York, NY, USA, 2005. ACM.
O. Chapelle and S. S. Keerthi. Efficient algorithms for ranking with SVMs. Information Retrieval, 13, June 2010.

O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), New York, NY, USA, 2009. ACM.

Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 2003.

J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001.

J. Gao, Q. Wu, C. Burges, K. Svore, Y. Su, N. Khan, S. Shah, and H. Zhou. Model adaptation via model interpolation and boosting for web search ranking. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 2009.

J. Xu. A boosting algorithm for information retrieval. In Proceedings of the 30th Annual ACM Conference on Research and Development in Information Retrieval, 2007.

K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4), 2002.

T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), page 142. ACM, 2002.

P. Li, C. Burges, and Q. Wu. Learning to rank using classification and gradient boosting. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS), 2007.

N. Shor. Convergence rate of the gradient descent method with dilatation of the space. Cybernetics and Systems Analysis, 6(2), 1970.

M. Tsai, T. Liu, H. Chen, and W. Ma. FRank: A ranking method with fidelity loss. Technical Report MSR-TR, Microsoft Research, November.

J. Xu, T. Liu, M. Lu, H. Li, and W. Ma. Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st Annual ACM Conference on Research and Development in Information Retrieval. ACM Press, 2008.

Z. Zheng, H. Zha, K. Chen, and G. Sun. A regression framework for learning ranking functions using relative relevance judgements. In Proceedings of the 30th Annual ACM Conference on Research and Development in Information Retrieval, 2007a.

Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. A general boosting method and its application to learning ranking functions for web search. In Advances in Neural Information Processing Systems 19, 2007b.
