Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters
Yasser Ganjisaffar, University of California, Irvine, Irvine, CA, USA
Rich Caruana, Microsoft Research, Redmond, WA, USA
Thomas Debeauvais, University of California, Irvine, Irvine, CA, USA
Cristina Videira Lopes, University of California, Irvine, Irvine, CA, USA
Sara Javanmardi, University of California, Irvine, Irvine, CA, USA

ABSTRACT

Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous - Machine Learning

General Terms
Algorithms, Experimentation

Keywords
Machine learning, Tuning, MapReduce, Hyperparameter, Optimization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LDMTA '11, August 21-24, 2011, San Diego, CA. Copyright 2011 ACM.

1. INTRODUCTION

The ultimate goal of machine learning is to automatically build models from data without requiring tedious and time-consuming human involvement. This goal has not yet been achieved. One of the difficulties is that learning algorithms require parameter tuning in order to adapt them to the particulars of a training set [14, 7]. The number of nearest neighbors in KNN and the step size in stochastic gradient descent are examples of such parameters. Researchers are expected to maximize the performance of their algorithms by optimizing over parameter values, yet there is little literature in the machine learning community about how to optimize hyperparameters efficiently and without overfitting. As just one example, support vector machine classification requires an initial learning phase in which the training data are used to adjust the classification parameters. There are many papers about SVM algorithms and kernel methods, but few of them address the parameter tuning phase of learning, which is critical for achieving high-quality results. Because of the difficulty of tuning parameters optimally, researchers sometimes use complex learning algorithms before experimenting adequately with simpler alternatives with better-tuned parameters [14], and reported results are more difficult to reproduce because of the influence of parameter settings [16].
Despite decades of research into global optimization and the publication of several parameter optimization algorithms [13, 18, 5], most machine learning researchers still prefer to carry out this optimization by hand or by grid search [12]. Grid search uses a predefined set of values for each parameter and determines which combination of these values yields the best results. Grid search is computationally expensive and takes substantial time when performed on a single machine. However, because the individual experiments are independent of each other, they can easily be performed in parallel; parallel grid search is thus easy and scalable. With the increasing availability of inexpensive cloud computing environments, grid search parameter tuning on large data sets has become more practical. Cloud computing makes massive parameter exploration so easy that a new problem arises: the final accuracy of the model on test data can be reduced by running too many experiments and overfitting to the validation data. In this work, we use the MapReduce framework [6] to efficiently perform parallel grid search, exploring thousands of combinations of parameter values.
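Grid search itself is straightforward to express. The sketch below is a minimal single-machine version; the parameter names and the train_and_evaluate function are illustrative placeholders, not the implementation used in our experiments:

```python
import itertools

# Candidate values for each hyperparameter (names are illustrative only).
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2, 0.3],
    "max_leaves": [2, 4, 7, 10, 15, 20, 25],
}

def grid_search(train_and_evaluate):
    """Return the best parameter combination and its validation score.

    train_and_evaluate(params) is assumed to train a model with the given
    parameter dict and return a validation-set score (higher is better).
    """
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Each point of the product grid is an independent experiment, which is what makes the parallelization described below trivial.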
We propose a general approach that can be used for tuning parameters in different machine learning algorithms. The approach is easiest to understand when each tuning task (evaluating the performance of the algorithm for a single combination of parameter values) can be handled by a single node in the cluster, but it also works when multiple processors are needed for each training run. For single-node training runs, the node loads the training and validation data from the distributed file system of the cluster, learns a model on the training data, and reports the model's performance on the validation data. The next section describes the details of this method. The main advantage of the MapReduce framework for parameter tuning is that it can be used for very large optimizations at low cost. In Section 2, we describe how easy it is to perform massive parallel grid search on MapReduce clusters. The MapReduce framework allows us to focus on the learning task and not worry about parallelization issues such as communication between processes and fault tolerance. We report the results of our experiments on tuning the parameters of two machine learning algorithms used in two different kinds of tasks. The first task learns ranking models for information retrieval in search engines; the goal is to train an accurate model that, given a feature vector extracted from a query-document pair, assigns a score to the document, which is then used to order documents by their relevance to the query. The second task is a binary classification task that uses Random Forests [2] to detect vandalistic edits in Wikipedia. Wikipedia contributors can edit pages to add, update or remove content; an edit is considered vandalistic if valuable content is removed or if erroneous or spam content is added. In the first part of the paper we describe how to use MapReduce for parameter optimization. The second part describes the application of MapReduce to optimize the parameters of boosted tree and random forest models on the four text problems. The third part discusses lessons learned from performing massive parameter optimization on these problems. We show the increase in accuracy that results from large-scale parameter optimization, the loss in accuracy that can result from searching parameter space too thoroughly, and how massive parameter optimization uncovers interactions between parameters that one might not expect.

2. MAPREDUCE FOR PARAMETER TUNING

We use cross-fold validation: we evaluate the performance of the algorithm for the same set of parameters on different folds of the data and use the average over these folds to decide which combination of parameters should be used. Some learning algorithms involve random processes that may produce significantly different results for different random seeds. In these cases, we repeat the process several times with different random seeds and use the average results to pick the best parameter values. In summary, if there are N combinations of parameters that need to be evaluated, we perform K × S experiments for each combination, where K is the number of data folds and S is the number of random seeds. Using this approach we can report the average and variance of our evaluation metrics for each of the N combinations. However, this setup requires a total of N × K × S experiments, which may take a long time on relatively large data sets.
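Concretely, the N × K × S experiments can be materialized as a flat list of (Θ, k, s) task descriptions, one per line, which later serves as the input to the cluster job. A minimal sketch, reusing the hypothetical param_grid from above:

```python
import itertools
import json

K_FOLDS = 5   # number of cross-validation folds (K)
N_SEEDS = 3   # number of random seeds per fold (S)

def write_task_list(param_grid, path="tasks.txt"):
    """Write one JSON task description per line: (Theta, fold, seed)."""
    names = sorted(param_grid)
    with open(path, "w") as f:
        for values in itertools.product(*(param_grid[n] for n in names)):
            theta = dict(zip(names, values))
            for fold in range(K_FOLDS):
                for seed in range(N_SEEDS):
                    f.write(json.dumps(
                        {"params": theta, "fold": fold, "seed": seed}) + "\n")
```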
Given that all the experiments are independent of each other, a cluster of machines can be used to perform them in parallel and then aggregate the results. MapReduce clusters are particularly well suited for this purpose. The master node initiates a map task for each of the N × K × S experiments. Each map task uses the parameter values assigned to it to learn a model on the training data assigned to it and then measures the performance of the model on the corresponding validation data. Algorithm 1 describes a map task. The MapReduce framework automatically groups the K × S results that are computed for the same set of parameter values over the different folds of the data and the different random seeds, and passes them to reducer tasks. Reducers simply compute the mean and standard deviation of the list of measurements and emit them as their output. Algorithm 2 describes a reducer task.

Algorithm 1: Map: (Θ, k, s) → (Θ, ρ)
  input: Θ, list of parameter values
  input: k, fold number
  input: s, random seed
  Train learning model M on the training data of fold k using the parameter values Θ and the random seed s.
  Compute ρ, the performance of model M on the validation data of fold k.
  output: (Θ, ρ)

Algorithm 2: Reduce: (Θ, (ρ_1, ..., ρ_m)) → (Θ, ρ̄, σ)
  input: Θ, parameter values
  input: (ρ_1, ..., ρ_m), performance values computed for parameter values Θ by mapper tasks over different folds and random seeds
  Compute the average (ρ̄) and standard deviation (σ) of the input performance values.
  output: (Θ, ρ̄, σ)

In the next two sections, we show how we used this architecture for two different learning tasks.
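Rendered as, for example, Hadoop Streaming scripts, Algorithms 1 and 2 take roughly the following form. This is an illustrative sketch rather than the code used in our experiments; train_model and evaluate are hypothetical stand-ins for the learner and the evaluation metric:

```python
# mapper.py: one input line = one task {"params": ..., "fold": ..., "seed": ...}
import json
import sys

def train_model(params, fold, seed):
    raise NotImplementedError  # hypothetical: fit the learner on fold's training data

def evaluate(model, fold):
    raise NotImplementedError  # hypothetical: score the model on fold's validation data

for line in sys.stdin:
    task = json.loads(line)
    model = train_model(task["params"], task["fold"], task["seed"])
    rho = evaluate(model, task["fold"])
    # Key = serialized parameter combination Theta, value = performance rho.
    print("%s\t%f" % (json.dumps(task["params"], sort_keys=True), rho))
```

```python
# reducer.py: aggregates all performance values per parameter combination
import statistics
import sys

scores = {}
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    scores.setdefault(key, []).append(float(value))

for key, values in scores.items():
    mean = statistics.mean(values)
    sigma = statistics.stdev(values) if len(values) > 1 else 0.0
    print("%s\t%f\t%f" % (key, mean, sigma))
```

Hadoop's shuffle phase routes all K × S performance values for the same parameter combination to the same reducer, which emits the mean and standard deviation per combination.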
3. TASK 1: LEARNING TO RANK

We use LambdaMART [20] to learn a ranking model for Task 1. LambdaMART is a ranking algorithm that uses gradient boosting [9] to optimize a ranking cost function. Gradient boosting produces an ensemble of weak models (typically regression trees) that together form a strong model. The ensemble is built in a stage-wise process by performing gradient descent in function space. The final model maps an input feature vector x ∈ R^d to a score F(x) ∈ R:

  F_m(x) = F_{m-1}(x) + γ_m h_m(x),

where each h_i is a function modeled by a single regression tree and γ_i ∈ R is the weight associated with the i-th regression tree. Both the h_i and the γ_i are learned during training. A given tree h_i maps a feature vector x to a real value by passing x down the tree, where the path (left or right) at a given node is determined by the value of a particular feature in the feature vector, and the output is the fixed value associated with the leaf reached by following the path.

Gradient boosting usually requires regularization to avoid overfitting, and different kinds of regularization techniques can be used to reduce overfitting in boosted trees. One common regularization parameter is the number of trees in the model, M. Increasing M reduces the error on the training set, but setting it too high often leads to overfitting; an optimal value of M is often selected by monitoring prediction error on a separate validation data set. Another regularization approach is to control the complexity of the individual trees via a number of user-chosen parameters. For example, the maximum number of leaves per tree limits the size of individual trees, preventing them from overfitting to the training data. Another user-set parameter for controlling tree size is the minimum number of observations allowed in a leaf: the tree-building process ignores splits that lead to nodes containing fewer training observations than this threshold, which prevents adding leaves that contain statistically small samples of the training data. Another important regularization technique is shrinkage, which modifies the boosting update rule as follows:

  F_m(x) = F_{m-1}(x) + η γ_m h_m(x),  0 < η ≤ 1,

where the parameter η is called the learning rate. Small learning rates can dramatically improve a model's generalization ability over gradient boosting without shrinkage (η = 1); however, they result in slower convergence and more boosting iterations, and therefore larger models.

In [8], Friedman proposed a modification of the gradient boosting algorithm motivated by Breiman's bagging method: at each iteration, the base learner is fit on a sub-sample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification. The sub-sample size is a constant fraction s of the size of the training set. When s = 1, the algorithm is deterministic; smaller values of s introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees are fit to smaller data sets at each iteration. Similarly to Random Forests [2], more randomness can be introduced by sampling the features available to the algorithm at each tree split: on each split, the algorithm selects the best feature from a random subset of the features instead of the best feature overall. In our experiments, we add both observation sub-sampling and feature sampling as two new parameters to be tuned for the LambdaMART algorithm. These parameters can take values between 0 and 1, where 1 means no sampling and values less than 1 introduce sampling randomness. Taken together, all of these parameters present a large hyperparameter search space over which to optimize learning performance.
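To make the roles of these regularization knobs concrete, the following schematic shows where shrinkage, observation sub-sampling, and feature sampling enter the boosting loop. It is a generic sketch of Friedman-style stochastic gradient boosting, not our LambdaMART implementation; fit_regression_tree and negative_gradient are hypothetical helpers, and the per-tree weight γ_m is assumed to be folded into the leaf values:

```python
import random

def boost(X, y, n_trees, learning_rate, subsample, feature_fraction,
          fit_regression_tree, negative_gradient):
    """Schematic stochastic gradient boosting: F_m = F_{m-1} + eta * h_m."""
    n, d = len(X), len(X[0])
    F = [0.0] * n                      # current ensemble predictions F_{m-1}(x_i)
    trees = []
    for m in range(n_trees):
        # Observation sub-sampling: a random fraction of rows, no replacement.
        rows = random.sample(range(n), max(1, int(subsample * n)))
        # Feature sampling: the tree may only split on this random subset.
        feats = random.sample(range(d), max(1, int(feature_fraction * d)))
        # Fit h_m to the pseudo-residuals (negative gradient of the loss);
        # the tree's leaf values absorb the weight gamma_m.
        residuals = [negative_gradient(y[i], F[i]) for i in rows]
        h = fit_regression_tree([X[i] for i in rows], residuals, feats)
        # Shrinkage: scale each update by the learning rate eta.
        for i in range(n):
            F[i] += learning_rate * h(X[i])
        trees.append(h)
    return trees
```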
3.1 Data sets

For our experiments we work with three public data sets: TD2004 and MQ2007 from the LETOR data sets [17] and the recently published MSLR-WEB10K data set from Microsoft Research [1]. Table 1 summarizes the properties of these data sets. The three data sets contain different numbers of queries and have diverse properties; we therefore expect different parameter values after tuning the same models on each of them.

3.2 Evaluation Metric

For model comparison we use NDCG@k, a popular information retrieval metric [11]. NDCG@k evaluates the top k positions of a ranked list using multiple levels of relevance judgment. It is defined as follows:

  NDCG@k = N_1 Σ_{j=1}^{k} (2^{r_j} - 1) / log_2(1 + j),

where N_1 is a normalization factor chosen so that a perfect ordering of the results receives a score of one, and r_j denotes the relevance level of the document ranked at the j-th position.
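The definition above translates directly into code; a minimal sketch for a single query's ranked relevance list (a production evaluator would also aggregate across queries):

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k = sum over the top k of (2^r_j - 1) / log2(1 + j), j from 1."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking over the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: relevance levels of documents in ranked order for one query.
print(ndcg_at_k([2, 0, 1, 2, 0], k=5))
```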
3.3 Parameter Tuning

We use grid search to test 1,008 different combinations of parameters on the smaller data sets and 162 combinations on the larger data set. Table 2 shows the values we tried for each of the parameters. Since MSLR-WEB10K contains more features for each query-URL pair, we need more complex trees (trees with more leaves) on this data set. We do not directly control the best number of trees for the LambdaMART models via a user-set parameter. Instead, as the boosting iterations continue, the prediction accuracy of the model is checked on a separate validation set; boosting continues until there has been no improvement in accuracy for 250 iterations, and the algorithm then returns the number of iterations that yielded maximum accuracy on the validation set. Each combination of parameters is tested on 5 folds of each data set, and on each fold we use 3 different random seeds to get more accurate results. This requires 1,008 × 5 × 3 = 15,120 experiments on each of the smaller data sets and 162 × 5 × 3 = 2,430 experiments on MSLR-WEB10K. We used a MapReduce cluster of 40 nodes for these experiments; it takes only 8 hours to run all of them on this cluster.

Table 3 shows the best configurations found on each data set. The best-performing configurations on all three data sets use feature sampling. The smaller data sets also get better results by sub-sampling the training queries on each iteration. We conjecture that sub-sampling queries helps when the training data is small because it helps avoid overfitting by not allowing trees to see all queries on each iteration of boosting. This adds diversity to the individual trees, which is then reduced when boosting averages the tree predictions. With very large data sets this is less critical because individual trees cannot themselves significantly overfit a large data set when tree size is limited.

Table 1: Properties of data sets used for experiments in Task 1

  Data set      Queries   Query-URL Pairs   Features   Relevance Labels
  TD2004        75        74,146            64         {0, 1}
  MQ2007        1,692     69,623            46         {0, 1, 2}
  MSLR-WEB10K   10,000    1,200,192         136        {0, 1, 2, 3, 4}

Table 3: Best three combinations of parameters found after parameter tuning for the ranking tasks, one sub-table per data set: (a) TD2004, (b) MQ2007, (c) MSLR-WEB10K. Columns: Max Leaves, Min Obs. per Leaf, Learning Rate, Sub-sampling, Feature Sampling. [Numeric entries not recoverable.]

4. TASK 2: DETECTING VANDALISTIC EDITS IN WIKIPEDIA

The goal of this task is to detect vandalistic edits in Wikipedia articles. Deletion of valuable content and insertion of obviously erroneous content or spam are examples of vandalism in Wikipedia. We consider vandalism detection as a binary classification problem. We use the PAN 2010 corpus [15] for training the classifier and evaluating its performance. This data set comprises 32,452 edits on 28,468 different articles. It was annotated by 753 annotators recruited from Amazon's Mechanical Turk, who cast more than 190,000 votes, so that each edit was reviewed by at least three of them. The corpus is split into a training set of 15,000 edits and a test set of 18,000 edits. To learn and predict vandalistic edits we extract 66 features for each sample in this data set. The ratio of vandalistic edits in this data set is 7.4%; we therefore have about 13 times more negative samples than positive (vandalistic) samples. Hence, we need learning algorithms that are robust to imbalanced data, or we must transform the data to make it less imbalanced. Random forests [2] are known to be reasonably robust to imbalanced data sets, and we therefore use this algorithm for our classification task. For evaluation, we use AUC, which has been reported to be a robust metric for imbalance problems [19].

4.1 Algorithms

Several different methods have been proposed to further improve the effectiveness of random forests on imbalanced data. Oversampling the minority class, undersampling the majority class, or a combination of the two have been used for this purpose. We use N+ to represent the number of vandalistic samples and N- the number of legitimate samples; in each of the 3 folds of the PAN training set, negatives outnumber positives by roughly 13 to 1. We use N_b+ and N_b- to refer to the number of positives and negatives in a bag. Chen et al. [4] introduced Balanced Random Forests (BRF), which undersample the majority class in each bag while including all of the minority cases in every bag: N_b+ = N_b- = N+. Based on Chen's method, Hido et al. [10] proposed Roughly Balanced Random Forests (RBRF), which are similar to BRF except that the number of negative samples in a roughly balanced bag, N_b-, is drawn from a negative binomial distribution centered on N+ instead of being exactly N+. Bags can also be forced to become balanced by oversampling the minority samples; in our data set, we would need to oversample the minority by 1300% to reach balance. Again, bags can be made roughly balanced by using the above approach. In our experiments, to increase the diversity of the bags, we pick both minority and majority cases with replacement. Mixing undersampling of the majority and oversampling of the minority can also result in balanced bags; in our data set, this would be achieved by undersampling the majority by approximately 50% and oversampling the minority by approximately 700%.
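The balanced and roughly balanced bag constructions can be sketched as follows. This illustrates the BRF/RBRF idea rather than reproducing the original implementations; numpy supplies the negative binomial draw, whose mean with p = 0.5 equals the number of positives:

```python
import random
import numpy as np

def balanced_bag(pos, neg, rough=False):
    """Build one bag with (roughly) equal positives and negatives.

    pos, neg: lists of minority (vandalistic) and majority (legitimate)
    samples. rough=False gives an exactly balanced bag (BRF-style);
    rough=True draws the negative count from a negative binomial
    distribution centered on len(pos) (RBRF-style).
    """
    n_pos = len(pos)
    if rough:
        # Negative binomial with p = 0.5 has mean n_pos, so the bag is
        # balanced only on average.
        n_neg = np.random.negative_binomial(n_pos, 0.5)
    else:
        n_neg = n_pos
    # Sampling with replacement increases diversity across bags.
    return ([random.choice(pos) for _ in range(n_pos)] +
            [random.choice(neg) for _ in range(max(1, n_neg))])
```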
Table 2: Values used in grid search for parameter tuning of the ranking task

  (a) TD2004 and MQ2007 data sets
  Parameter                         Values
  Max Number of Leaves              2, 4, 7, 10, 15, 20, 25
  Min Percentage of Obs. per Leaf   0.12, 0.25, 0.50
  Learning rate                     0.05, 0.1, 0.2, 0.3
  Sub-sampling rate                 0.3, 0.5, 1.0
  Feature Sampling rate             0.1, 0.3, 0.5, 1.0

  (b) MSLR-WEB10K data set
  Parameter                         Values
  Max Number of Leaves              10, 40, 70
  Min Percentage of Obs. per Leaf   0.12, 0.25, 0.50
  Learning rate                     0.05, 0.1, 0.2
  Sub-sampling rate                 0.5, 1.0
  Feature Sampling rate             0.3, 0.5, 1.0

Oversampling the minority can also be achieved by generating synthetic data. We used SMOTE [3] for this purpose. SMOTE randomly picks a positive sample p and creates a synthetic positive sample p' between p and one of the nearest neighbors of p. SMOTE has two hyperparameters: k, the number of nearest neighbors to consider, and r, the oversampling rate of minority cases (e.g., r = 100% doubles the minority size). The combination of the various undersampling and oversampling parameters with the SMOTE parameters yields a large configuration space.

4.2 Parameter tuning

To train a random forest classifier, we need to tune two free parameters: the number of trees in the model and the number of features considered at each node split. Our experiments show that on this problem classification performance is sensitive to the number of trees but less sensitive to the number of features per split. This result is consistent with Breiman's observation [2] on the insensitivity of random forests to the number of features selected at each split. To tune the number of trees, we partition the training set into three folds and use 3-fold cross validation. To find the minimum number of trees consistent with excellent performance, we need to sweep a large range of model sizes, and hence we need an efficient process for this purpose. For each fold, we create a pool of N = 10,000 trees, each trained on a random sample of the training data in that fold. We then use this pool to create random forests of different sizes. For example, to create a random forest with 20 trees, we randomly select 20 trees from this pool of N trees. Since this random selection can be done in C(N, 20) different ways, each combination may result in a different AUC; we therefore repeat the random selection of trees r = 20 times and report the mean and variance of the F × r results (where F is the number of folds). The advantage of this approach is that we can calculate the mean and variance of AUC very efficiently for forests of different sizes without needing to train a huge number of trees independently. Otherwise, to report the mean and variance of AUC for random forests of sizes k = 1 to T, we would need to train r + 2r + 3r + ... + T·r = r·T(T + 1)/2 trees for each fold, which is 40 million trees in our case. Using this approach we only need to train N trees per fold, which is 30 thousand trees. Table 4 shows the values that we tried for each of the parameters for both oversampling/undersampling and SMOTE.

Table 4: Values used in grid search for parameter tuning of Task 2

  (a) Bagging strategies
  Parameter            Values
  Oversampling rate    none, 700%, 1300%
  Undersampling rate   none, 50%, 8%
  Balance type         exact, rough

  (b) SMOTE configurations
  Parameter            Values
  Nearest neighbors    1, 3, 5, 7, 9, 11, 13, 19, 25, 31, 37
  Oversampling rate    none, 100%, 200%, ..., 1300%
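The tree-pool procedure described above can be sketched as follows; compute_auc and the trees' predict_proba interface are hypothetical placeholders:

```python
import random
import statistics

def forest_size_sweep(tree_pool, X_val, y_val, sizes, r=20, compute_auc=None):
    """Estimate AUC mean/stdev for random forests of each size in `sizes`.

    tree_pool: list of N pre-trained trees, each exposing predict_proba(x).
    For each size k, draw k trees at random r times and score the averaged
    predictions, instead of training r separate forests per size.
    """
    results = {}
    for k in sizes:
        aucs = []
        for _ in range(r):
            trees = random.sample(tree_pool, k)
            # Forest score for a sample = average of the sampled trees' scores.
            scores = [sum(t.predict_proba(x) for t in trees) / k for x in X_val]
            aucs.append(compute_auc(y_val, scores))
        results[k] = (statistics.mean(aucs),
                      statistics.stdev(aucs) if len(aucs) > 1 else 0.0)
    return results
```

Training the pool once and resampling subsets of it replaces the r·T(T + 1)/2 tree trainings per fold with N, which is how 40 million trees shrink to 30 thousand.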
The rates for oversampling and undersampling are picked based on the strategies described in the previous section. For the SMOTE experiments, we tried 11 values for the number of nearest neighbors and 14 values for the oversampling rate. We used a MapReduce cluster of 40 nodes for these experiments; it takes about 75 minutes to run all of them once on this cluster. Table 5 shows the best strategy in terms of validation AUC as we vary the number of trees in the forest. For very small forests, undersampling the majority class works best; for medium-size forests, oversampling the minority class works best; and for large forests, a normal random forest (no oversampling and no undersampling) is best. We had not expected the best method for handling the imbalanced data problem to vary so strongly with the number of trees in the random forest. In particular, we had not expected that undersampling the majority class with no oversampling of the minority class would be best for small random forests, but that as the forests become larger, oversampling of the minority class would become beneficial. We also did not expect that the best way to handle class imbalance would be to do nothing as the number of trees in the forest grew large. The best AUC on this problem results from using a large (≥ 500) number of trees (expected) and taking no steps to correct the class imbalance (unexpected).

5. DISCUSSION

MapReduce clusters allowed us to tune model parameters very thoroughly. It is interesting to ask whether we really needed that many experiments to achieve high accuracy, and whether it is possible to perform too many experiments. We use the results of our parameter tuning experiments to study the effect of the number of experiments on improvement in the evaluation metrics.
Table 5: Best combinations of parameters found after parameter tuning for Task 2. The best strategy in terms of AUC depends on the number of trees in the forest: for very small forests, undersampling the majority works best; for medium-size forests, oversampling the minority works best; and for large forests, a normal random forest (no oversampling and no undersampling) is the winner. The rows progress from a single tree (no oversampling, 8% undersampling, roughly balanced bags) through SMOTE oversampling with 19 neighbors and mixed oversampling/undersampling at medium forest sizes, to normal random forests at the four largest sizes. [Tree counts, sampling rates, and AUC values not recoverable.]

For each data set, we create a pool of the configurations that we evaluated during the parameter tuning experiments. We then randomly select different numbers of these configurations; for each random selection, we pick the configuration with the best validation performance and record the validation and test performance of that configuration. To get more accurate results, we repeat this random process 10,000 times and report average numbers. Figure 1 shows the results for the three ranking tasks. As expected, for all three data sets, validation NDCG improves monotonically as we perform more experiments. On MSLR-WEB10K, the largest data set, there is less discrepancy between validation and test scores. The discrepancy between validation and test is largest on TD2004, the smallest data set, because the validation sets, which are held aside from the training data, must also be small. On TD2004, the data set with the smallest validation sets, accuracy on the test set peaks after only about 100 parameter configurations and then slowly drops. Accuracy on the validation set is still rising significantly at 100 iterations, suggesting that hyperparameter optimization is overfitting to the validation sets. A similar effect is observed on MQ2007, but overfitting does not begin on this problem until about 400 parameter configurations have been tried. And on MSLR-WEB10K, we again observe overfitting to the validation sets after fewer than 25 configurations have been tested.

Figure 2 shows the effect of validation set size on the overfitting of the hyperparameters on the MQ2007 problem. The leftmost graph is for validation sets that contain only 10% of the number of points used for this problem in Figure 1; the middle graph is for 50%, and the rightmost graph is for 100%. When the validation set is very small (left graph), the gap between validation and test set performance is very large, overfitting to the validation set occurs after only a few dozen hyperparameter configurations have been tested, and the loss in accuracy on the test set is considerable if the full configuration space is explored. On problems like this, exploiting the full power of a MapReduce cluster for parameter optimization can significantly hurt generalization instead of helping it. As the size of the validation set grows to 50% of maximum size (middle graph), the discrepancy between the validation and test set performance drops significantly, overfitting to the validation set does not begin until 100 configurations have been explored and is modest at first, and the loss in generalization that results from exploring the full configuration space is smaller. As the validation set becomes larger still (right graph), the discrepancy between validation and test is further reduced, overfitting to the validation set does not occur until 400 or more parameter configurations have been tested, and the drop in generalization accuracy on the test set as more of the parameter space is explored is small. Surprisingly, there is still a drop in accuracy of a few tenths of a point if the full space is explored, even with full-size validation sets from a 5-fold cross-validation averaged across the five folds on a problem with 10k training examples, suggesting that care must still be exercised when peak accuracy is required.
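The analysis behind Figures 1 and 2 amounts to a simple resampling procedure over the completed tuning runs. A sketch, assuming each configuration record pairs a validation score with a test score:

```python
import random

def expected_scores(configs, n_tried, repeats=10_000):
    """Average validation/test score of the best-on-validation configuration
    when only `n_tried` randomly chosen configurations are examined.

    configs: list of (validation_score, test_score) pairs from tuning runs.
    """
    val_sum = test_sum = 0.0
    for _ in range(repeats):
        sample = random.sample(configs, n_tried)
        best_val, best_test = max(sample)  # tuples compare by validation score first
        val_sum += best_val
        test_sum += best_test
    return val_sum / repeats, test_sum / repeats

# Sweeping n_tried from 1 to len(configs) traces out curves like Figure 1:
# validation score rises monotonically in expectation, while test score
# can peak and then decline as the search overfits the validation sets.
```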
When there is randomness in the algorithm and the validation sets are not infinite, as more parameter combinations are tried, the search begins to find parameter combinations that look better on the validation set only because of this randomness. If one is not careful, the computational power provided by MapReduce clusters is so great that it is possible to overdo parameter tuning and find parameter combinations that work no better than the hyperparameters that would have been found by a less thorough search. One way to avoid overfitting at the hyperparameter learning stage is to use a second held-out validation set to detect when parameter tuning begins to overfit and to early-stop the parameter optimization. Holding out a second validation set reduces the size of the primary hyperparameter tuning validation sets, making overfitting more likely. But as we have seen, even large cross-validated validation sets do not completely protect against overfitting when hyperparameter optimization is exhaustive, so care must be exercised to prevent hyperparameter optimization from becoming counterproductive.
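A sketch of this early-stopping safeguard is given below. It is a hypothetical illustration of the idea rather than a procedure evaluated in this paper; configurations are examined in some order, and the search stops once new validation-best configurations repeatedly fail to improve on the second held-out set:

```python
def early_stopped_search(configs, score_on_second_validation, patience=50):
    """Stop hyperparameter search when improvements on the primary
    validation set stop carrying over to a second held-out set.

    configs: iterable of (primary_validation_score, config) pairs.
    score_on_second_validation(config): score on the second held-out set.
    """
    best_primary = float("-inf")
    best_second = float("-inf")
    best_config, stale = None, 0
    for primary_score, config in configs:
        if primary_score > best_primary:
            best_primary = primary_score
            second = score_on_second_validation(config)
            if second > best_second:
                best_second, best_config, stale = second, config, 0
            else:
                stale += 1
                if stale >= patience:
                    break  # tuning appears to be overfitting; early-stop
    return best_config
```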
Figure 1 (three panels: TD2004, MQ2007, and MSLR-WEB10K data sets): As more combinations of parameters are tested during parameter tuning of LambdaMART, NDCG improves on the validation sets, but the test set curves show that overfitting of the hyperparameters eventually occurs.

Figure 2 (three panels: validation sets at 10%, 50%, and 100% of full size): The effect of validation set size on hyperparameter overfitting on the MQ2007 problem.

6. CONCLUSION

MapReduce clusters provide a remarkably convenient and affordable resource for tuning machine learning hyperparameters. Using these resources one can quickly run massive parameter tuning experiments that would have been infeasible just a few years ago. In this paper we show how to map parameter optimization experiments to the MapReduce environment. We then demonstrate the kinds of results that can be obtained with this framework by applying it to four text learning problems. Using the framework we were able to uncover interactions between hyperparameters that we had not expected. For example, the best method for dealing with imbalanced classes in the classification problem depends on how many trees will be included in the random forest model. When there will be relatively few trees in the forest (less than X), it is important to reduce class imbalance by oversampling the rare class, undersampling the majority class, or using a method such as SMOTE to induce new rare-class samples. But when the random forest grows large and contains 500 or more trees, the best results are obtained by not modifying the natural statistics of the raw data. Another surprising result is that even when performing cross validation using large validation sets, MapReduce makes it easy to explore hyperparameter space too thoroughly and overfit to the validation data. Our experiments suggest that the risk is overwhelming with small validation sets, and that some risk of overfitting remains even when validation sets grow large. While we expected this result for small validation sets, we had not expected the effect to remain with larger validation sets. We conclude that while MapReduce clusters provide an incredibly convenient resource for machine learning hyperparameter optimization, one must proceed with caution or risk selecting parameters inferior to those that would have been found when the parameter space could not be explored as thoroughly.

Acknowledgments

The authors would like to thank Amazon for a research grant that allowed us to use their MapReduce cluster. This work has also been partially supported by NSF grant OCI.

REFERENCES

[1] Microsoft learning to rank datasets.
[2] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, June 2002.
[4] C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced data. Technical report, University of California, Berkeley, 2004.
[5] I. Czogiel, K. Luebke, and C. Weihs. Response surface methodology for optimizing hyper parameters. Technical report, Universität Dortmund, Fachbereich Statistik.
[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51:107-113, January 2008.
[7] T. Eitrich and B. Lang. Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of Computational and Applied Mathematics, 196:425-436, November 2006.
[8] J. H. Friedman. Stochastic gradient boosting. Technical report, Dept. of Statistics, Stanford University, 1999.
[9] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.
[10] S. Hido and H. Kashima. Roughly balanced bagging for imbalanced data. In Proceedings of the SIAM International Conference on Data Mining, 2008.
[11] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-48, New York, NY, USA, 2000. ACM.
[12] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 473-480, New York, NY, USA, 2007. ACM.
[13] A. Nareyek. Choosing search heuristics by non-stationary reinforcement learning. Applied Optimization, 86.
[14] M. Postema, T. Menzies, and X. Wu. A decision support tool for tuning parameters in a machine learning algorithm. In PACES/SPICIS '97 Proceedings, 1997.
[15] M. Potthast, A. Barrón-Cedeño, A. Eiselt, B. Stein, and P. Rosso. Overview of the 2nd international competition on plagiarism detection. In Proceedings of the CLEF '10 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, 2010.
[16] F. Poulet. Multi-way distributed SVM algorithms. In Parallel and Distributed Computing for Machine Learning, in conjunction with the 14th European Conference on Machine Learning (ECML '03) and 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '03), Cavtat-Dubrovnik, Croatia, September 2003.
[17] T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13:346-374, 2010.
[18] C. C. Skiścim and B. L. Golden. Optimization by simulated annealing: A preliminary computational study for the TSP. In Proceedings of the 15th Conference on Winter Simulation - Volume 2, WSC '83, Piscataway, NJ, USA, 1983. IEEE Press.
[19] G. M. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations Newsletter, 6:7-19, June 2004.
[20] Q. Wu, C. Burges, K. Svore, and J. Gao. Ranking, boosting and model adaptation. Technical Report MSR-TR-2008-109, Microsoft Research, 2008.
