Environmental Modelling & Software

Environmental Modelling & Software 25 (2010)

Predicting the potential habitat of oaks with data mining models and the R system

Rafael Pino-Mejías a,*, María Dolores Cubiles-de-la-Vega a, María Anaya-Romero b, Antonio Pascual-Acosta c, Antonio Jordán-López b, Nicolás Bellinfante-Crocci b

a Department of Statistics and Operational Research, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain
b Department of Crystallography, Mineralogy and Agricultural Chemistry, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain
c Andalusian Prospective Center, Avda. Reina Mercedes s/n, Seville, Spain

Article history: Received 24 June 2009; received in revised form 12 January 2010; accepted 17 January 2010; available online 12 February 2010.

Keywords: Habitat modelling; Supervised classification; R system; Data mining models; Ensemble models; Classification trees; Neural networks; Oaks; Support vector machines

Abstract

Oak forests are essential for the ecosystems of many countries, particularly when they are used in vegetal restoration. Models for predicting the potential habitat of oaks can therefore be a valuable tool for environmental work. With this objective, the building and comparison of data mining models are presented for the prediction of potential habitats of the oak forest type in Mediterranean areas (southern Spain), with conclusions applicable to other regions. Thirty-one environmental input variables were measured and six base models for supervised classification problems were selected: linear and quadratic discriminant analysis, logistic regression, classification trees, neural networks and support vector machines.
Three ensemble methods, based on the combination of classification tree models fitted from samples and sets of variables generated from the original data set, were also evaluated: bagging, random forests and boosting. The available data set was randomly split into three parts: training set (50%), validation set (25%), and test set (25%). The analysis of the accuracy, the sensitivity and the specificity, together with the area under the ROC curve for the test set, reveals that the best models for our oak data set are those of bagging and random forests. All of these models can be fitted by free R programs which use the libraries and functions described in this paper. Furthermore, the methodology used in this study will allow researchers to determine the potential distribution of oaks in other kinds of areas. © 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail address: rafaelp@us.es (R. Pino-Mejías).

1. Introduction

Sustainable land management is crucial for the prevention of land degradation, for the reclamation of degraded land for productive use, for the reaping of the benefits of crucial ecosystem services, and for the protection of biodiversity. The use of local knowledge should be encouraged, since it can provide insights into proven adaptation techniques and can contribute towards the design of early-warning systems for extreme events. The sustainable management of natural resources requires an exhaustive knowledge of the physical environmental components, with particular focus on the relations between these elements and the various plant communities. In this way, human activity in natural areas, and especially in protected areas, must be carefully planned in order to assure their conservation and to allow the socio-economic development of the populations located therein (Halpern and Spies, 1995; Bullock et al., 1999; Anaya-Romero, 2004; Kirilenko et al., 2007). In this sense, the sustainable management of the forests (trees, shrubs, etc.) must be considered as a fundamental aspect of soil protection based on land-ecological principles (Kimmins, 1987). Taking this into account, the Mediterranean oak is one of the most important woody species in the forest communities of the western Mediterranean basin. This species is a highly common Mediterranean sclerophyll which grows throughout the entire Mediterranean basin region and can be found in the form of pure or mixed stands. Therefore the use of oaks in vegetal restoration ensures the survival of reforestation, even in situations of prolonged drought. In fact, the use of oaks in Spanish reforestation programs has greatly increased in the last 10 years, to exceed that of the Pinus species widely used in the past (MAPA, 2006). A suitable methodology for the determination of an appropriate model to predict the potential distribution of oak forest would be welcome, and could help towards the improvement of the process of reforestation under Mediterranean conditions. As regards land cover data, the European Environment Agency (EEA) aims to provide those responsible for and interested in European policy on the environment with quantitative data on land cover which is consistent and comparable across the continent. Compiling a single CORINE Land Cover database for all European countries, and registering all changes to this land cover
(LC), is crucial, since the impact of many environmental problems exceeds national borders, and hence solutions often involve more than one sovereign territory. In support of environmental assessment, the need for regularly updated information on land cover has become important at both European and national levels. The growing interest in such information can be ascribed to the important role of LC in processes taking place on the Earth's surface, such as the absorption of solar radiation, the utilization of carbon dioxide by plant associations, and evaporation. Landscape changes at national and global levels are becoming even more topical, and their influence acquires new dimensions not only in research, but also in environmental management. That is why harmonized and standardized spatial reference data is considered mandatory for the support of environmental management in European Union policies (Feranec et al., 2007). The CLC database provides information about land cover that can contribute towards new approaches for the assessment of the European landscape, for instance in the context of environmental and economic accounting, diversity, and the modelling of its properties. These approaches are made possible by the fact that land cover reflects the biophysical state of the real landscape (Feranec et al., 2009). In the present research, data from CLC is considered in order to develop a new tool for the prediction of the potential habitat of oaks across Europe using an exhaustive environmental database. Relationships between variables in ecology are almost always extremely complicated and highly non-linear.
Although a large number of evaluation models have already been described and compared by several authors (Franklin, 1995; Guisan and Zimmermann, 2000), it remains to be seen which models perform best under particular circumstances (Guisan and Zimmermann, 2000; Hirzel et al., 2001; Zaniewski et al., 2002). On the other hand, tools and technologies do exist for efficient land management and administration, but their adoption needs to be promoted and their application expanded. A large number of human livelihoods and ecosystems can benefit from these tools and techniques, since they yield multiple benefits. In the present research these issues are tackled empirically. This paper reports the development and comparison of data mining models for the prediction of the potential habitat of the oak forest type (Quercus rotundifolia and Quercus suber) in the Mediterranean pilot area comprising Sierra de Aracena Natural Park and part of the Western Andevalo nature area, located in southern Spain. The term data mining actually denotes part of a wider process, termed Knowledge Discovery from Data (KDD) by Fayyad et al. (1996), oriented towards identifying patterns in data sets. KDD includes several steps: collecting and cleaning the data, preprocessing, data reduction, and the application of specific algorithms to search for patterns in the data. This last part is usually known as data mining. Other steps in KDD are the interpretation of the patterns discovered and the reporting of these findings. KDD is an interdisciplinary research field where the collaboration of different areas, such as statistics, artificial intelligence, information systems, machine learning, computational learning theory, and other related sciences, has made many different tools available for the data mining process.
Another related term, machine learning, is concerned with the design and development of algorithms and techniques capable of learning from experience, and therefore the data mining models used in our study also lie within the machine learning framework. Data mining models have been successfully applied in many fields, such as medicine, econometric analysis, and image analysis, where they offer new and valuable tools for classification and regression problems and for the clustering of a set of objects. These models are also useful for the environmental sciences, where some particular problems, such as the mixed nature of the data (quantitative and qualitative) and/or high non-linearity, complicate the task of modelling the environmental system. There are several references in this journal where data mining techniques have been used. Ekasingh and Ngamsomsuke (2009) used the C4.5 data mining algorithm to model farmers' crop choice in two watersheds in Thailand. Neural networks, rule induction and clustering techniques are used in Dixon et al. (2007) to improve the reliability and efficiency of monitoring and control of anaerobic wastewater treatment plants. Gibert et al. (2006) present a software tool for intelligent data analysis and implicit knowledge management, including several data mining algorithms. Belanche-Muñoz and Blanch (2008) develop statistical and machine learning models to obtain predictive models for the determination of faecal sources in waters. May et al. (2008) formulate a non-linear procedure to select input variables for artificial neural networks. For each problem, a great family of methods is available, ranging from classic and simple statistical methods to sophisticated and computer-intensive methods, and therefore a careful process in which these methods are correctly fitted may improve the construction of environmental models. However, it must be remarked that the more complex data mining methods are not necessarily superior.
Each particular problem has its own most appropriate data mining model, and it is possible that simple models yield a better performance for certain data sets. Our problem lies within the two-class supervised learning framework: given a set of multivariate vectors of measurements taken at n geographical points, each belonging to one known class (presence/absence of oaks), the task is to build a classification rule to assign new points to their correct class. Classic statistical models, such as Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR), are potential tools for this task. LDA for two groups is based on the early approach of Fisher (1936), which computes a linear discriminant function defining a classification rule that is optimal in homoscedastic Gaussian populations. For heteroscedastic Gaussian populations, the decision boundary is described by a quadratic equation (QDA). LR is one of the most widely used statistical models for predicting binary outcomes and presents interesting properties concerning the interpretation of the coefficients (Hosmer and Lemeshow, 1989), and hence has been extensively used in the field of medical statistics. Other data mining techniques, such as Classification Trees (CTs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), have been successfully applied to data from different fields. The CT has connections with the social sciences (Morgan and Sonquist, 1963) and the Artificial Intelligence (AI) community (Quinlan, 1993), although methods with a more statistical foundation have also been developed. The CT stands out for the ease of interpretation of the model obtained.
The AI community developed a powerful computational paradigm, the ANN (Hertz et al., 1991), which was progressively incorporated into statistical practice (Cheng and Titterington, 1994; Bishop, 1995), although its black-box nature makes the interpretation of the resulting models very difficult. The SVM emerged from Statistical Learning Theory, otherwise known as Vapnik-Chervonenkis theory (Cristianini and Shawe-Taylor, 2000; Vapnik, 1998), and currently commands great interest, accompanied by enthusiasm similar to that previously experienced for the ANN. These models are freely available in the R system (R Development Core Team, 2008), which also provides the user with a powerful statistical programming language. Ihaka and Gentleman (1996) present an introduction to the main characteristics of the R system. The programming resources of this system are highly suitable for programming ensemble methods, where
a certain number of models are constructed by resampling the set of cases and/or the set of predictors, and the models are aggregated by majority voting. Three ensemble methods, bagging, random forests and boosting, are also considered in this paper. There is no universal best learning method. As explained in Witten and Frank (2005), the various data mining methods correspond to different concept description spaces searched with different schemes. Thus, certain description languages and search procedures serve some problems well and other problems badly, thereby making it necessary to perform a careful comparison of the many data mining techniques. Section 2 describes the data set used in our study. Data mining models are presented from the point of view of the currently available R implementations in Section 3, where several practical questions associated with their use are also answered. For a wider discussion of these and other learning techniques, an excellent reference is Hastie et al. (2001), where the different topics of statistical learning are described both theoretically and practically. The results obtained are presented in Section 4 and, finally, the main conclusions are discussed in Section 5.

2. Data description

2.1. Area under study

The area under study is situated in the north of Huelva, in the southwest of Spain. This zone belongs to the Natural Protected Spaces of Andalusia and comprises Sierra de Aracena Natural Park and part of the Western Andevalo natural area (Fig. 1). The total area spans approximately 4770 km². Andevalo is located to the SW of Sierra de Aracena; its elevation and relief are lower and less extreme than in Sierra de Aracena, and the elevation scarcely exceeds 1000 m. The climate is Mediterranean, with great seasonal variation, characterized by rainy cold winters and dry warm summers (Draín, 1979).
The soils are shallow (Núñez, 1998; Martínez-Zavala, 2001), with an A-C or A-R profile. Generally, Leptosols are the most frequent soils in the area of Andevalo, while Cambisols and Regosols occupy more extensive areas in Sierra de Aracena. Cork oak (Q. suber) and oak (Q. rotundifolia) constitute the majority of the vegetation. There are also widely reforested areas, usually with Eucalyptus and Pines, in the central and southern parts of the area of study. Occasionally, there are other cultivated broadleaved species, such as Castanea sativa, and/or riverside species (Moreira and Fernández Palacios, 1997). The main land use in Sierra de Aracena is forestry, whilst in Andevalo it is farming.

2.2. Predictor variables

The map of the distribution of the oak forest type in the area under study was extracted from the CORINE Land Cover database (scale 1:50,000) for the year 1990 (CLC1990). A high number of classes of oak forest type, related to other vegetation associations, were considered in the original legend of CLC1990, and hence all of these classes were grouped together as oak forest type. The map of oak forest type (Fig. 2) was then obtained, and from this map the dependent variable of our data set, namely the presence/absence of oak forest type, was extracted. In this way, data on the current distribution of forest type can be used to predict the potential distribution of oaks, and additionally this information can be used to validate results from each evaluation model used. Predictor variables were selected according to diverse environmental assessment studies and expert knowledge (e.g., Walter, 1977; MOPT, 1991; De la Rosa et al., 1999; Del Toro, 1996), which suggest several environmental variables suspected of being of great physiological importance to plants. These variables were grouped into several thematic categories: lithology, geomorphology, physiography, relief, soil, climate, and geographic location.
The data was collected from different sources, including data directly processed by the authors, as described below. Lithological information was extracted from the Geological Map of Spain, published by the Geological and Mining Institute of Spain (IGME) on a scale of 1:50,000. The maps were digitized and vectorized using Arc/Info software (ESRI, ). The lithological variables considered were: rock origin, acidity, and consolidation (Table 1).

Fig. 1. Area under study.

Fig. 2. Distribution of oak forest type referred to the year 1990 (CLC1990).
Soil data were extracted from the maps of geomorphoedaphic units of Sierra de Aracena Natural Park (Núñez, 1998) and western Andevalo (Martínez-Zavala, 2001). The edaphic variables selected were: soil acidity (pH), available chemical elements (Fe, Mn, Cu, Mg, and K), organic matter, cation exchange capacity (CEC), saturation of the exchange complex (S), coarse element fraction, and clay fraction. Geomorphological variables and physiography were also extracted from the data in Núñez (1998) and Martínez-Zavala (2001). The five measured geomorphological variables were: dominant and secondary erosive processes (sheet and rill erosion, rill and gully erosion, gully erosion, water drop erosion, or vertical erosion by rivers), mass movements (presence/absence), sedimentation (by gravity or by flooding), and morphogenesis (fluvial, denudative, endogenous or karstic). Physiography was represented by one variable with the same name, covering the following categories: river bed, stable plain, erosion surface, plateau, raised strip, karst, hill, butte, and mountain. The variables concerning relief were extracted from a digital terrain model (DTM, resolution m²). The four selected variables, elevation, slope, curvature and orientation, were derived from the digital elevation model (DEM). Climate data were processed by Anaya-Romero (2004) using a statistical interpolation of data points obtained from a network of weather stations. The five selected variables were: annual precipitation, summer precipitation, annual average temperature, average temperature of the hottest month, and average temperature of the coolest month. Five temperature and precipitation maps were obtained by using multiple linear regression models. Dependent variables were collected from 55 weather stations located in the province of Huelva, comprising more than 20 consecutive years of weather data.
Independent variables were the Digital Elevation Model (DEM), distance to the sea, and a Digital Insolation Model (DIM). The DIM was calculated following Felicísimo (1994) for 21st June, and was obtained as the sum of the DIM calculated for every sunlight hour. The fit of the five models was assessed using numeric and graphic procedures. It is well known (Storch, 1999) that the interpolated variables present a smaller variance than that of the dependent variables. For multiple linear regression, the difference between the two variances is precisely the variance of the error, σ². Hence, the mean square error (MSE), usually presented in ANOVA tables for linear regression, which is an unbiased estimator of σ², can be employed to judge the uncertainty associated with the interpolated variables. Thus, the multiple correlation coefficient (R) and the MSE for each climate variable model were: R = and MSE = for the logarithm of the annual precipitation; R = and MSE = for the logarithm of summer precipitation; R = and MSE = for the square of annual average temperature; R = 0.625 and MSE = for the average temperature of the hottest month; and R = and MSE = for the average temperature of the coolest month. Therefore, a reasonable fit was obtained for the climate variables, while a small degree of uncertainty can be expected.

Table 1
Lithological variables and description.

Variable        Class             Description
Rock origin     Volcanic          Extrusive igneous rock solidified near or on the surface of the Earth
                Plutonic          Large mass of intrusive igneous rock believed to have solidified deep within the earth
                Metamorphic       Rock altered by pressure and heat
                Sedimentary       Rock formed from consolidated clay sediments
Acidity         Acid              Siliceous sandstones, slates, quartzites, etc.
                Basic             Limestones and ultrabasic plutonic rocks
Consolidation   Consolidated      Hard and continuous rocks
                Not consolidated  Loose rocks (clay, sand, marl) or colluvial substrates
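As a quick sanity check on the variance argument above, the sketch below (plain Python with made-up toy values, not the weather-station data) fits a one-predictor least-squares line and verifies that the variance of the response decomposes exactly into the variance of the fitted values plus the mean squared residual, so interpolated surfaces are always smoother than the observations:

```python
# Variance decomposition for ordinary least squares with an intercept:
# var(y) = var(fitted) + mean squared residual.
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up predictor (e.g. elevation)
y = [1.2, 1.9, 3.2, 3.8, 5.1]   # made-up response (e.g. temperature)
b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]
mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)
assert abs(var(y) - (var(fitted) + mse)) < 1e-9
```

The identity holds exactly because the residuals of an intercept model have zero mean and are orthogonal to the fitted values.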
Regression equations were then extrapolated in order to obtain maps of temperature and precipitation for the whole area, as well as for the other predictor variables. Finally, the geographic location variables were the longitude and latitude of each sample point. In short, 31 variables are considered as potential predictor variables, with a sampling density of approximately 3 points/km². Given the large size of the available data set, as explained in the following subsection, there is no risk of numeric failure in the fitting procedures, and therefore a previous reduction of the number of variables is unnecessary. Furthermore, some of the data mining methods are able to detect the importance of each variable.

2.3. Data preprocessing

The grid corresponding to the oak forest type contains dichotomous information (1/0); value 1 indicates presence, and value 0 absence. Moreover, the coverage of each predictor variable was also transformed into grids, which were built with the same cell size, m². Due to the high number of data points, a systematic sampling of the area of study was made. Additionally, the spatial distribution of missing and erroneous data was visually verified on a DTM in order to analyze the effect of dropping the absent data. Missing data was found to be distributed in a regular way, without any trend, and hence was eliminated from the analysis. Thus, 13,840 uniformly distributed points were finally selected, where 41.9% of these points presented the oak forest type. Given the large size of our data set, and following the suggestions of Hastie et al. (2001), this data set was randomly split into three disjoint parts: training set (50%), validation set (25%), and test set (25%). The training set was used to fit the models; the validation set was used to determine the parameter configuration of each model; and the test set was employed to estimate the generalization error of the various final models.

3. Data mining models

3.1. Linear and quadratic discriminant analysis

Given two multivariate independent samples where p quantitative predictor variables have been observed for n_i cases, i = 1, 2, n = n_1 + n_2, the LDA model assumes that both populations are multivariate normal with means μ_1 and μ_2 and a common covariance matrix Σ. The LDA rule classifies a p-dimensional vector x to class 2 if

$x^{t}\hat{\Sigma}^{-1}(\hat{\mu}_{2}-\hat{\mu}_{1}) > \tfrac{1}{2}\hat{\mu}_{2}^{t}\hat{\Sigma}^{-1}\hat{\mu}_{2} - \tfrac{1}{2}\hat{\mu}_{1}^{t}\hat{\Sigma}^{-1}\hat{\mu}_{1} + \log\hat{p}_{1} - \log\hat{p}_{2}$   (1)

where the prior probabilities of class membership, p_1 and p_2, are usually estimated by the class proportions in the training set. LDA provides the minimum misclassification rate, and is therefore optimal, under the previously described hypotheses. Environmental data is usually non-Gaussian; for example, qualitative variables must be coded as dummy 0-1 variables. It therefore makes sense to instead choose the cut point that empirically minimizes the classification error, as suggested by Hastie et al. (2001). This also corresponds to the use of prior class probabilities not necessarily equal to n_i/n. We have used the R function lda (Venables and Ripley, 2002), which is available in the MASS library. This function computes the estimated probability for each class, and hence the classification rule could also be formulated by predicting class 1 if the
estimated probability for class 1 is greater than a threshold probability p_c. This last value can be selected by empirical optimization of the classification error. Thus, 99 possible values for p_c (0.01, 0.02, …, 0.99) are considered in our study, and the value minimizing the classification error over the validation set is selected. When the covariance matrices are not assumed to be equal, quadratic discriminant functions are computed, and the QDA rule yields arg max_i d_i(x), with

$d_{i}(x) = -\tfrac{1}{2}\log\left|\hat{\Sigma}_{i}\right| - \tfrac{1}{2}(x-\hat{\mu}_{i})^{t}\hat{\Sigma}_{i}^{-1}(x-\hat{\mu}_{i}) + \log\hat{p}_{i}$   (2)

The R function qda (Venables and Ripley, 2002) in the MASS library has been used in our case study. A similar search for the cut point was also carried out for the QDA model through the same set of 99 threshold probabilities. Both techniques are widely used, and they perform well on diverse classification tasks, as described in the STATLOG project (Michie et al., 1994). They can be efficiently computed and provide interpretable classification rules. The main drawback of LDA is also its simplicity, which can fail to capture complex structures in the data set. However, the quadratic version, which requires a larger number of parameters, tends to overfit on data sets of at most a few hundred points, whereas in this situation LDA is usually better, since it is robust to departures from the assumptions.

3.2. Logistic regression

For a binary response and p quantitative predictors x_1, …, x_p (some of which may be dummy variables coding qualitative variables), the LR model assumes that the probability of the target response is

$p(x_{1},\ldots,x_{p}) = \frac{e^{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}}}{1+e^{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}}}$   (3)

The glm function in R (Venables and Ripley, 2002) computes the maximum likelihood estimators of the p + 1 parameters by means of an iterative weighted least-squares (IWLS) algorithm.
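The cut-point search described above for LDA, and reused for QDA and logistic regression, can be sketched as follows; the probabilities and labels here are hypothetical toy values, not output from the fitted models:

```python
# Scan the 99 thresholds 0.01, ..., 0.99 and keep the first one that
# minimizes the classification error on a validation set.
def best_threshold(probs, labels):
    best_t, best_err = None, float("inf")
    for k in range(1, 100):
        t = k / 100.0
        # a case is misclassified when (probability > t) disagrees with its label
        err = sum((p > t) != bool(y) for p, y in zip(probs, labels))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

probs  = [0.10, 0.40, 0.60, 0.90]   # hypothetical class-1 probabilities
labels = [0, 0, 1, 1]
t, err = best_threshold(probs, labels)
assert err == 0 and 0.40 <= t <= 0.59
```

Any threshold between the two middle probabilities separates this toy sample perfectly; the scan returns the first such value.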
There are several inferential procedures to test the statistical significance of the whole model and the individual significance of each variable. The model may also be interpreted, and a great family of diagnostics and criteria are available to identify influential and outlying observations. LR can be fully embedded into a formal decision framework, but in order to carry out a comparison with the other models, a threshold probability needs to be specified, which, in fact, corresponds to varying the prior class probabilities. Thus, 99 possible values for this threshold probability (0.01, 0.02, …, 0.99) are also considered, and the value which minimizes the classification error over the validation set is selected. In the same way as LDA, LR is also optimal under the assumption of multivariate normal distributions with equal covariance matrices, although LR remains optimal in a wider variety of situations. However, LR requires larger data sets to obtain stable results, and a greater variety of statistical concepts, such as odds ratios and interactions, are needed. In addition, complex non-linear relations between the dependent and independent variables can only be incorporated through appropriate but non-evident transformations.

3.3. Classification trees

A classification tree (CT) is a set of logical if-then conditions which drive each case to a final decision. These conditions can easily be plotted in order to aid the understanding of the model. A binary CT is grown by binary recursive partitioning, using the response in the specified formula and choosing splits from the set of predictor variables. The split which maximizes the reduction in impurity (a measure of diversity for the outcome in a specific set of nodes) is chosen, the data set is then split, and the process is repeated. Splitting continues until the terminal nodes are too small to be split. The classification for a vector is computed by a majority class vote in its terminal node.
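The impurity-driven split choice can be illustrated with a minimal sketch using the Gini index as the impurity measure; the single-predictor sample below is a made-up toy example, not the oak data:

```python
# Gini impurity for 0/1 labels: 2 * p1 * (1 - p1).
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)

def best_split(xs, ys):
    """Return the threshold on a single predictor that maximizes the
    reduction in size-weighted Gini impurity."""
    best_t, best_gain = None, 0.0
    parent = gini(ys)
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# A perfectly separable toy sample: the split at x <= 2 removes all impurity.
t, gain = best_split([1, 2, 3, 4], [0, 0, 1, 1])
assert t == 2 and abs(gain - 0.5) < 1e-9
```

Growing a tree repeats this search recursively over each resulting node until the terminal nodes are too small to split.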
We have used the rpart package of R (Therneau and Atkinson, 2006), which implements the CART methodology proposed by Breiman et al. (1984). One advantage of this model is that qualitative input predictor variables may be used directly, without the aid of dummy variables. CART also includes a fully automatic missing-value handling mechanism. The Gini index (the default impurity measure) has been taken as the splitting criterion. Given that large trees can lead to overfitting the data and can mean a loss in the generalization capability for new data, the user must tune a fundamental parameter: the number of terminal nodes, called the size of the tree. Several strategies exist for the selection of the optimal CT, but the availability of a large validation set in our case study led us to initially grow a tree as big as possible, and then to select the sub-tree which minimized the misclassification error over the validation set. The main drawback of classification trees is their high variance; that is, a small change in the data can result in a very different series of splits. This problem can be alleviated with the aid of bagging techniques. The final tree could also include weak interactions between variables while the strongest interactions remain undetected.

3.4. Multilayer perceptron

The Artificial Neural Network (ANN) is a computational paradigm which provides a great variety of non-linear mathematical models, useful for tackling different statistical problems. Several theoretical results support a particular architecture, namely the multilayer perceptron (MLP), for example the universal approximation property, as in Bishop (1995). To this end, we have considered a three-layered perceptron with the logistic activation function g(u) = e^u/(e^u + 1) in the hidden layer, and the identity function as the activation function for the output layer.
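A minimal forward pass for the perceptron just described (logistic hidden layer, identity output layer) might look as follows; the weights and layer sizes are illustrative values, not coefficients fitted by nnet:

```python
import math

# Logistic activation used in the hidden layer: g(u) = e^u / (e^u + 1).
def logistic(u):
    return math.exp(u) / (math.exp(u) + 1.0)

def mlp_forward(x, v, w):
    # v[h] = [v_0h, v_1h, ..., v_ph]: bias-first weights into hidden unit h.
    # w[j] = [w_0j, w_1j, ..., w_Hj]: bias-first weights into output unit j.
    hidden = [logistic(vh[0] + sum(vi * xi for vi, xi in zip(vh[1:], x)))
              for vh in v]
    # Identity activation at the output layer: a plain affine combination.
    return [wj[0] + sum(wh * hh for wh, hh in zip(wj[1:], hidden))
            for wj in w]

# One input, one hidden unit, two outputs (one dummy output per class);
# the predicted class is the output achieving the maximum.
out = mlp_forward([0.0], v=[[0.0, 1.0]], w=[[0.0, 1.0], [1.0, -1.0]])
assert abs(out[0] - 0.5) < 1e-9 and abs(out[1] - 0.5) < 1e-9
```

With a zero input the hidden unit outputs g(0) = 0.5, so both toy outputs equal 0.5 here; in general the class whose dummy output is largest is predicted.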
By denoting H as the size of the hidden layer, {v_ih, i = 0, 1, …, p, h = 1, …, H} as the synaptic weights for the connections between the p-sized input layer and the hidden layer, and {w_hj, h = 0, 1, …, H, j = 1, …, q} as the synaptic weights for the connections between the hidden layer and the q-sized output layer, the outputs of the neural network for a vector of inputs (x_1, …, x_p) become

$o_{j} = w_{0j} + \sum_{h=1}^{H} w_{hj}\, g\!\left(v_{0h} + \sum_{i=1}^{p} v_{ih} x_{i}\right), \quad j = 1, 2, \ldots, q$   (4)

In our classification problem, the response was coded by a vector z = (z_1, z_2) formed by two 0-1 dummy variables, one for each class, and hence the number of outputs was q = 2. For an input vector to the net, the classification is the class corresponding to the dummy variable which achieves the maximum of the two predictions of the net. Input qualitative variables must also be coded as 0-1 dummy variables. The nnet R function (Venables and Ripley, 2002) fits single-hidden-layer neural networks by using the BFGS procedure, a quasi-Newton method also known as a variable metric algorithm, which attempts to minimize a least-squares criterion that introduces a decay term λ in an effort to prevent overfitting. The BFGS algorithm can be found in Bishop (1995). Defining W = (W_1, …, W_M) as the vector of all M coefficients of the net, and given n q-sized target vectors y_1, …, y_n, the BFGS
method can be applied to the following non-linear least-squares problem:

Min_W  Σ_{i=1..n} ||y_i − ŷ_i||² + λ Σ_{m=1..M} W_m²   (5)

A major disadvantage of the MLP is the fact that there is no known procedure which guarantees a global solution; usually only one of the many possible local minima is obtained. Another drawback is its black-box nature, which renders the resulting model very difficult to interpret. The performance of the final model can be improved by normalizing the quantitative input variables, and therefore for each quantitative variable X_i the following value has been computed for each record j: z_ij = (x_ij − x_i,min)/(x_i,max − x_i,min). The R implementation of an MLP model requires the specification of two parameters, the size of the hidden layer (H) and the decay parameter (λ), and therefore a search was carried out over the grid {15, 20, 25} × {0, 0.01, 0.05, 0.1, 0.2, ..., 1.5}. This grid was defined according to the suggestions of Hastie et al. (2001); it must be remarked that greater values for H could not be attempted due to the limited memory resources of our personal computers.

Support vector machines

The Support Vector Machine (SVM) is a family of supervised machine learning techniques. They were originally introduced by Vapnik and co-authors (Boser et al., 1992), and several extensions were successively proposed. When used for a two-class classification problem where the set of binary labelled training patterns is linearly separable, the SVM method separates the two classes with the hyper-plane that is maximally distant from the patterns (the "maximal margin hyper-plane"). If linear separation is not possible, the feature space is enlarged using basis expansions such as polynomials or splines.
However, explicit specification of this transformation is not necessary; only a kernel function that computes inner products in the transformed space is required. We have fitted the SVM models with the svm function available in the library e1071 of the R system (Dimitriadou et al., 2006), which offers an interface to the award-winning C++ implementation LIBSVM, by Chang and Lin. The data set is described by n training vectors {x_i, y_i}, i = 1, 2, ..., n, where the p-dimensional vectors x_i contain the predictor features and the n labels y_i ∈ {−1, 1} identify the class of each vector. From among the several variants of SVM existing in the library e1071, and following Meyer (2004), we have used C-classification with the radial basis Gaussian kernel function:

K(u, v) = exp(−γ ||u − v||²)   (6)

The primal quadratic programming problem to be solved is:

Min_{w,b,ξ}  (1/2) wᵀw + C Σ_{i=1..n} ξ_i
subject to  y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, ..., n   (7)

C > 0 is a parameter controlling the trade-off between margin and error, and Σ_{i=1..n} ξ_i is an upper bound on the sum of the distances of the wrongly classified cases to their correct plane. The dual problem is

Max_α  −(1/2) αᵀQα + eᵀα
subject to  0 ≤ α_i ≤ C,  i = 1, 2, ..., n,  yᵀα = 0   (8)

where e is the n-vector of all ones, and Q is a positive semidefinite matrix defined by Q_ij = y_i y_j K(x_i, x_j), i, j = 1, 2, ..., n, where K(x_i, x_j) = φ(x_i)ᵀφ(x_j) is the kernel function. No explicit construction of the non-linear mapping φ(x) is needed, which is known as the kernel trick. A vector x is classified by the decision function

sign( Σ_{i=1..n} y_i α_i K(x_i, x) + b )   (9)

which depends on the margins m_i = Σ_{j=1..n} y_j α_j K(x_j, x_i) + b, i = 1, 2, ..., n. The greater the

Table 2
Characteristics of the search method for each data mining model.
Model | Advantages | Drawbacks | Parameter search results
LDA | Interpretability; simplicity; robust model | | p_c = cut point = 0.58
QDA | Interpretability | Big sample size | p_c = cut point = 0.04
LR | Direct probability estimation; interpretability; statistical concepts | Big sample size; transformations | p_c = cut point = 0.47
CT | Interpretability; handling of missing values | Unstable; weak interactions possibly revealed | M = number of rules or terminal nodes = 87
MLP | Universal approximation property | Less formal statistical training; prone to overfitting; black box; high computational cost | H = number of hidden neurons = 25; λ = decay parameter = 1.5
SVM | Global solution for the optimization problem; theoretically well founded | Selection of parameters; not clearly interpretable | C = penalization error coefficient = 10; γ = parameter of the kernel function
CTBag | Lower variance than CART; out-of-bag estimates | High computational cost; not interpretable | B = number of trees = 100
RF | Efficient in large data sets; out-of-bag estimates; variable importance measures | Prone to overfitting for some data sets | mtry = number of variables to choose the best split = 10
CTBoost | Theoretical properties; identification of outliers | High computational cost; not interpretable | M = number of trees = 3

absolute value of the margin, the more reliable the computed classification. An exploratory analysis of the margins could help to throw light on the SVM model (Furey et al., 2000). Note that the solution to the quadratic programming problem is global, which avoids the non-optimality of the neural network training algorithms. Two parameters must be tuned: C and γ. We have adopted the suggestions of Meyer (2004) in performing the selection of the parameters of the SVM model, and a grid search for C and γ over the set {1, 10, 20, 30, 40, 50, 100, 150, ..., 1000} × {0.015, 0.016, ..., 0.020} was conducted on the validation set. The explored values for γ were selected around the default value of the svm function in the R library e1071, defined as 1/p, where p is the number of predictors (some of which may be dummy variables coding qualitative variables), while the grid for C includes both small and large values. One drawback of the SVM model is the necessity of correctly identifying appropriate values for the required parameters. Another drawback is that although several schemes exist which attempt to interpret the final model, none of them is sufficiently clear.

Several methods to combine different classification models have been proposed in recent years. All are based on the combination of models fitted from samples and sets of variables generated from the original data set. Three important ensemble methods are considered here: bagging, random forests and boosting.

Fig. 3. Test sensitivity vs. test false positive rate (100 − specificity) for the nine classification models.

Bagging

Bagging (Bootstrap Aggregating) is a method proposed by Breiman (1996) to improve the performance of prediction models.
Given a classification model, bagging draws B independent samples with replacement from the available training set (bootstrap samples), fits a model to each bootstrap sample, and finally aggregates the B models by majority voting. Bagging tends to be a very effective procedure when applied to unstable learning algorithms, i.e., those where small changes in the data can cause large changes in the predicted values (Breiman, 1996), such as classification and regression trees and neural networks. The empirical success of the first publications has been confirmed by theoretical results such as those in Bühlman and Yu (2002), where bagging is shown to smooth hard decision problems, yielding a smaller variance and mean squared error. The R package ipred (Peters and Hothorn, 2004) computes bagged tree models (CTBag) and was therefore used in our study; two values for B, 50 and 100, were considered, and the value which minimized the validation classification error was selected. Bagging provides out-of-bag (OOB) estimates of the misclassification error rate without requiring a test set. Thus, for each case, the OOB aggregated classification is computed over the models which were trained on bootstrap samples not containing that case. The out-of-bag estimate is defined as the error rate of this OOB classification.

Table 3
Accuracy, sensitivity, specificity and 100·AUC for each of the classification methods, computed on the test set.

Model | Accuracy | Sensitivity | Specificity | 100·AUC
LDA | | | |
QDA | | | |
LR | | | |
CT | | | |
MLP | | | |
SVM | | | |
CTBag | 83.93 | 81.10 | 86.02 | 90.96
RF | 83.73 | 86.02 | 80.64 | 91.07
CTBoost | | | |

Table 4
Importance measure for the random forest model. Values between 0 and 60 are marked as –.
Geographic location: Longitude; Latitude
Climate: Summer average precipitation; Annual average precipitation; Average temperature for the coldest month; Annual average temperature; Average temperature for the hottest month
Relief: Dominant erosion; Slope; Physiography; Morphogenesis; Sedimentation; Secondary erosion; Mass movements; Elevation; Curvature; Orientation
Edaphic variables: Saturation of the exchange complex; Coarse element fraction; Available iron; Available manganese; Available potassium; Organic matter; pH; Cation exchange capacity; Clay fraction; Available copper; Available magnesium
Lithologic variables: Rock origin; Rock acidity; Rock consolidation
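The importance values in Table 4 accumulate, over all trees, the decrease in node impurity (Gini index) produced by splits on each variable. For a binary presence/absence response, the impurity of a node and the decrease achieved by one split can be computed as follows (a minimal illustration with made-up labels, not the internals of the randomForest package):

```python
def gini(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n          # proportion of class 1 (presence of oaks)
    return 1.0 - p1 * p1 - (1.0 - p1) * (1.0 - p1)

def gini_decrease(parent, left, right):
    # Impurity reduction achieved by splitting `parent` into `left` + `right`
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A pure split removes all the impurity of a balanced node
parent = [0, 0, 1, 1]
print(gini_decrease(parent, [0, 0], [1, 1]))  # 0.5: from impurity 0.5 down to 0
```

Summing such decreases over every split on a given variable, and averaging over all trees of the forest, yields the measure tabulated above.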

Random forests

The Random Forests (RF) approach was proposed by Breiman (2001) as a way to combine many different trees. A number of trees are constructed: each tree is grown over a bootstrap sample of the training data set, and a random selection of variables is considered for the choice of the split in each node. As in bagging, the trees are combined by majority voting, and out-of-bag estimates can also be computed. One important feature of this ensemble method is the availability of measures to assess the importance of each variable and to identify outlier observations. Breiman (2001) claims that RF rarely overfits, and has shown that Bayes consistency is achieved with a simple version of RF. Moreover, this method runs efficiently on large databases, since it is able to handle thousands of input variables. We have used the R package randomForest (Liaw and Wiener, 2002), which builds 500 trees by default. However, the number of variables to select randomly has been chosen by means of a search around the default value (mtry = square root of the number of predictors), namely from mtry − 5 to mtry + 5. One drawback is that RF has suffered from overfitting on some data sets used in machine learning benchmarking.

Boosting

The idea of boosting appeared in the machine learning literature in the 1980s, although the first algorithm of this type was proposed by Schapire in 1990, with a second algorithm proposed by Freund soon afterwards. Boosting was proposed in order to combine the outputs of many weak classifiers to produce a powerful committee, in an attempt to improve the generalization performance of weak algorithms. The various models are fitted to differently reweighted samples. At each step, those observations that were misclassified by the previous classifier have their weights increased, whereas the weights of those correctly classified are decreased.
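This reweighting loop can be sketched in a toy, self-contained form (an illustrative sketch with made-up predictions, not the code of the R boost package):

```python
import math

def weighted_error(preds, labels, weights):
    # Weighted error rate of a weak classifier
    return sum(w for p, y, w in zip(preds, labels, weights) if p != y)

def reweight(preds, labels, weights):
    """One boosting step: upweight misclassified cases, then renormalize."""
    e = weighted_error(preds, labels, weights)
    c = math.log((1.0 - e) / e)        # classifier weight c_m = ln((1 - e_m)/e_m)
    new_w = [w * math.exp(c) if p != y else w
             for p, y, w in zip(preds, labels, weights)]
    total = sum(new_w)
    return c, [w / total for w in new_w]

labels = [1, 1, -1, -1]
weights = [0.25] * 4                   # initial weights D_1(i) = 1/n
preds = [1, -1, -1, -1]                # this weak classifier misclassifies case 2
c, weights = reweight(preds, labels, weights)
# The misclassified case now carries the largest weight
```

After this step the misclassified observation holds half of the total weight, so the next weak classifier is forced to concentrate on it.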
One of the most popular boosting algorithms, due to Freund and Schapire (1997), is AdaBoost.M1, which arose from the original boosting algorithms they had discovered, and is presented in the following. Suppose n training vectors {x_i, y_i}, i = 1, 2, ..., n, where the p-dimensional vectors x_i contain the predictor features and the n labels y_i ∈ {−1, 1} identify the class of each vector. The algorithm is:

1. Initialize D_1(i) = 1/n, i = 1, 2, ..., n.
2. For m = 1, 2, ..., M, repeat:
   a) Fit the model g_m from the training set, reweighted by D_m.
   b) Compute the weighted error rate
      e_m = Σ_{i: g_m(x_i) ≠ y_i} D_m(i)   (10)
   c) Compute c_m = ln((1 − e_m)/e_m).
   d) Recompute D_{m+1}(i) = D_m(i) exp[c_m I(y_i ≠ g_m(x_i))], where I(u) = 1 if u is true and 0 otherwise.
   e) Normalize the D_{m+1}(i), i = 1, 2, ..., n, to sum 1.
3. The boosted model is:

G(x) = sign( Σ_{m=1..M} c_m g_m(x) )   (11)

Fig. 4. Test predictions for the nine prediction models.

The consistency of boosting is of major interest, as evidenced by Breiman (2004), which is accompanied by three other papers and several discussions. Like SVM, boosting tends to maximize the margins. The R package boost (Dettling, 2006) contains a function for boosting trees (adaboost) and was adopted in our case study. The function adaboost considers M = 20 by default; however, the boosting models for M = 1, 2, 3, ..., 20 have been empirically compared on the validation set. The examination of the weight distribution can help to identify outlier observations in the data set, since difficult cases receive greater weights.

4. Results

Table 2 summarizes the characteristics of the search method and offers the parameter configuration resulting from the search over the validation set. It should be borne in mind that the final parameter configuration can be very different for other problems. Moreover, a different split into training, validation and test sets can produce other final parameters, although, as in our problem, the larger the sample size, the more similar the corresponding performances. CT and MLP are particularly complex, as the former needs a large number of terminal nodes, and the latter a large hidden layer (25 neurons). CTBoost only requires three classification trees; the decision rule is therefore based on the weighted aggregation of three base trees.
Table 3 contains the test accuracy (percentage of correctly classified points), the test sensitivity (percentage of correctly classified points presenting oaks) and the test specificity (percentage of correctly classified points not presenting oaks). The area under the ROC curve (AUC) was also computed, with the aid of the ROCR library available in R (Sing et al., 2005). Two ensemble methods based on classification trees show the best results. The bagged classification tree (CTBag) provides the greatest test accuracy (83.93%), while Random Forests (RF) has a slightly lower accuracy (83.73%). Moreover, the two values of 100·AUC are also very similar (90.96% for CTBag and 91.07% for RF). However, the sensitivity and specificity values are reversed: while RF has a test sensitivity (86.02%) greater than its test specificity (80.64%), CTBag exhibits a test sensitivity (81.10%) lower than its test specificity (86.02%). Classic statistical models, such as LDA, QDA and LR, are clearly superseded by these ensemble methods. Note that the classification tree model also outperforms these classic models on the accuracy criterion. Table 3 also shows that boosting is not effective in this study; moreover, the validation set suggested halting the boosting algorithm after only three iterations. The sensitivity and the false positive rates (100 − specificity) have been displayed for the nine models. Fig. 3 shows the resulting graph, in which the bagged classification tree and the Random Forests method both stand out, since these models are those nearest to the upper left-hand corner. However, a distinctive feature of RF is the availability of measures of the importance of the predictor variables. One of these measures computes, for each variable, the total decrease in node impurities (Gini index) given by splitting on the variable, averaged over all trees. Table 4 displays these importance values for the predictor variables, where values lower than 60 are represented by –.
This table reveals that the limiting factors for the potential distribution of the oak forest type in the area under study would be determined mainly by climate, erosion processes and physiography. Secondarily, edaphic variables (saturation of the exchange complex, coarse element fraction, and available iron) would also influence the distribution of these species. Finally, it seems that lithological variables display no repercussions for the potential distribution of this group of species. A more in-depth review of the performances of the compared methods can be obtained through Fig. 4. Each part of this figure exhibits the map of test predictions of the corresponding model, where black denotes predicted presence of oaks, and grey denotes the real presence of oaks. These maps reinforce the previous conclusions, and confirm that CTBag and RF present good agreement between the spatial distribution of the test oaks and the spatial distribution of the predictions. Fig. 4 also shows that, in the lower left-hand region, the remaining models fail to adequately predict the presence of oaks. Fig. 5 presents a measure of the relative agreement between the nine models considered: black denotes five or more predictions of the presence of oaks, dark grey represents three or four predictions, and light grey denotes at most two predictions. This figure confirms that the lower left-hand region is the most difficult to predict, evidencing a clear spatial bias for the majority of the models analyzed, with the exception of CTBag and RF.

Fig. 5. Test relative agreement. Black: five or more predictions of the presence of oaks; dark grey: three or four predictions; light grey: at most two predictions.
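The agreement map of Fig. 5 reduces, for each test point, the predictions of the nine models to a count of "presence of oaks" votes, which is then binned into three grey levels. A minimal sketch of this per-point reduction, with made-up model outputs:

```python
def agreement_level(predictions):
    """Bin the number of 'presence of oaks' predictions across models:
    'black' for five or more, 'dark grey' for three or four,
    'light grey' for at most two."""
    votes = sum(predictions)
    if votes >= 5:
        return "black"
    if votes >= 3:
        return "dark grey"
    return "light grey"

# One row per test point, one column per model (1 = oaks predicted)
per_point = [
    [1, 1, 1, 1, 1, 0, 1, 1, 1],   # strong agreement  -> black
    [1, 0, 1, 0, 1, 0, 0, 0, 0],   # partial agreement -> dark grey
    [0, 0, 0, 0, 1, 0, 0, 1, 0],   # weak agreement    -> light grey
]
levels = [agreement_level(p) for p in per_point]
```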

5. Conclusions

We have studied a suitable methodology to determine an appropriate model for predicting the potential distribution of oaks, in an effort to improve the process of reforestation under Mediterranean conditions. Although the diversity of classification models currently available enriches statistical practice in the environmental framework by offering different alternatives to the analyst, the number of decisions to be made is clearly increased. The present research is designed to provide practical experience in order to facilitate the selection of methods. Given the obvious need for a powerful statistical programming language for implementing this computing task, a good and inexpensive choice is the R system, which offers free implementations of both classic and modern classification models. A first conclusion is that none of the classification models studied should be used blindly, and therefore none of them can be named a fully automatic classification model. Each has a certain number of parameters which must be carefully tuned, which requires a search procedure to be employed. However, R programs are able to search for the best configuration of values for the required parameters. Moreover, a correct model comparison also needs strategies for the estimation of the generalization errors. For large data sets, such as that used in this study, the random split into training, validation and test sets is recommended. For data sets of a reduced size, the use of cross-validation schemes for the selection of suitable parameters in the tuning process could be necessary. The analysis of the results for our case study has revealed that bagged classification trees and random forests offer very good performance. However, we cannot claim the universal superiority of these two models.
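The random split into training (50%), validation (25%) and test (25%) sets recommended above can be sketched as follows (a generic illustration of the partition used in this study; the function name and seed are ours, and the split itself was performed in R):

```python
import random

def split_indices(n, seed=0):
    """Randomly partition record indices 0..n-1 into 50/25/25 subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # fixed seed for reproducibility
    n_train = n // 2
    n_valid = n // 4
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

# 65,535 sampled points, as in this study
train, valid, test = split_indices(65535)
# Fit on `train`, tune parameters on `valid`, report final errors on `test`
```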
The relative performance depends on the data set and computer resources, together with other factors, such as experience in using the methods. However, it is worth considering different alternative classification models, as shown in this case study. Concerning the quality of the present research, there are several factors which improved the comparison of the prediction models. Firstly, results may depend on the quality of the input data (Guisan and Zimmermann, 2000). To this end, biophysical maps of high accuracy and resolution were selected as input data. This input data was collected from different published sources or was extracted by the authors. The data belong to several environmental fields of independent variables: lithology, geomorphology, physiography, relief, soil, and climate. There were both qualitative and quantitative variables, all of which were coded and pre-processed in order to obtain a homogeneous digital database. In agreement with other authors (Guisan and Zimmermann, 2000), the variability in predictive capacity obtained by the nine methods used in this paper may suggest that the best predictive model strongly depends upon the type and quality of the input data. Another strength of the present work is the large sample size of the input data, which contains 65,535 points uniformly distributed in the area under study; this represents 14% of the total area, and approximately 3 points/km² of sampling density. The main benefit of this sample size was that it enabled the data to be split into a training set (50%), validation set (25%), and test set (25%), whose sample sizes provided reliable estimations of the generalization error of the different final models. On the other hand, the main weakness of this study could be that no competition between different species has been taken into account. Despite the great relevance that biotic interactions hold in the current distributions of every community, their measurement presents a tough nut to crack.
Furthermore, historical factors, such as human disturbance of the original ecosystem, have also been omitted. This is perhaps a problem without solution, except for undisturbed systems. Classic statistical models offered poor performance in our case study. The other three base models improved the results, although the fitted classification trees and multilayer perceptrons remained too complex. Support vector machines performed slightly better than classification trees. Nevertheless, two ensemble methods, random forests and bagged trees, were found to be the best models, while the performance of boosting remained similar to that of quadratic discriminant analysis. Although bagging and random forests exhibited similar results, the great advantage of RF is the availability of measures of the importance of the variables, which suggest that the potential distribution of the oak forest type in the area under study would be determined mainly by climate, erosion processes, slope and physiography, and secondarily by edaphic variables. It seems that lithological variables have no influence on the potential distribution of oak forest. Not only can the methodology followed in this study help to determine the potential distribution of oaks in similar Mediterranean areas, but it can also be extended to different areas and classification problems.

Acknowledgments

This work was supported by the Spanish Ministry of Education and Science (MTM), the Institute of Statistics of Andalusia (OG-154/07) and the Andalusia Environment Government (OG-096/01). We thank Dr. De la Rosa for advice on the manuscript. The authors wish to thank the anonymous reviewers for their valuable comments.

References

Anaya-Romero, M., 2004. Modelo de distribución potencial de usos forestales basado en parámetros edáficos, geomorfológicos, climáticos y topográficos. PhD thesis, University of Seville, Seville, Spain.
Belanche-Muñoz, L., Blanch, A.R., Machine learning methods for microbial source tracking.
Environmental Modelling & Software 23 (6).
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, New York.
Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, Pittsburgh.
Bullock, P., Jones, R.J.A., Montanarella, L. (Eds.), 1999. Soil Resources of Europe. The European Soil Bureau, Joint Research Centre, ISPRA, Italy, 202 pp.
Breiman, L., 1996. Bagging predictors. Machine Learning 24.
Breiman, L., 2001. Random forests. Machine Learning 45 (1).
Breiman, L., 2004. Population theory for boosting ensembles. The Annals of Statistics 32 (1).
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth and Brooks, Belmont.
Bühlman, P., Yu, B., 2002. Analyzing bagging. The Annals of Statistics 30 (4).
Cheng, B., Titterington, D.M., Neural networks: a review from a statistical perspective. Statistical Science 9.
Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
De la Rosa, D., Mayol, F., Moreno, J.A., Bonsón, T., Lozano, S., An expert system/neural network model (ImpelERO) for evaluating agricultural soil erosion in Andalucia region, southern Spain. Agriculture, Ecosystems and Environment 73.
Del Toro, M., Capacidad de uso forestal de los suelos del Parque Natural Sierra de Grazalema en base a sus propiedades químicas. PhD thesis, University of Seville, Seville, Spain.
Dettling, M., 2006. boost: Boosting Methods for Real and Simulated Data. R package.
Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, D., 2006. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package.
Dixon, M., Gallop, J.R., Lambert, S.C., Healy, J.V., Experience with data mining for the anaerobic wastewater treatment process. Environmental Modelling & Software 22 (3).
Draín, M., Geografía de la Península Ibérica. Oikos-Tau, Barcelona.

Ekasingh, B., Ngamsomsuke, K., Searching for simplified farmers' crop choice models for integrated watershed management in Thailand: a data mining approach. Environmental Modelling & Software 24 (12).
ESRI, ArcInfo. ESRI, Redlands.
European Environment Agency. Data Corine Land Cover (CLC1990) 250 m. Version 06/1999.
Fayad, U., Piatetsky-Shapiro, G., Smith, P., From data mining to knowledge discovery in databases (a survey). AI Magazine 3 (17).
Felicísimo, A.M., Digital Terrain Models. Introduction and Applications in Environmental Sciences. Pentalfa, Oviedo (in Spanish).
Feranec, J., Hazeu, G., Christensen, S., Jaffrain, G., CORINE land cover change detection in Europe (case studies of the Netherlands and Slovakia). Land Use Policy 24.
Feranec, J., et al., Determining changes and flows in European landscapes using CORINE land cover data. Applied Geography, doi:10.1016/j.apgeog.
Fisher, R.A., The use of multiple measurements in taxonomic problems. Annals of Eugenics 7.
Franklin, J., Predictive vegetation mapping: geographic modelling of biospatial patterns in relation to environmental gradients. Progress in Physical Geography 19.
Freund, Y., Schapire, R., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55.
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D., 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16.
Gibert, K., Sánchez-Marré, M., Rodríguez-Roda, I., GESCONDA: an intelligent data analysis system for knowledge discovery and management in environmental databases. Environmental Modelling & Software 26 (1).
Guisan, A., Zimmermann, E., 2000. Predictive habitat distribution models in ecology.
Ecological Modelling 135.
Halpern, C.B., Spies, T.A., 1995. Plant species diversity in natural and managed forests of the Pacific Northwest. Ecological Applications 5.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.
Hertz, J., Krogh, A., Palmer, R., Introduction to the Theory of Neural Computation. Addison-Wesley, Reading.
Hirzel, A.H., Helfer, V., Métral, F., Assessing habitat-suitability models with a virtual species. Ecological Modelling 145.
Hosmer, D.W., Lemeshow, S., Applied Logistic Regression. Wiley, New York.
Ihaka, R., Gentleman, R., R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5.
Kimmins, J.P., Forest Ecology. Macmillan, New York, USA, 531 pp.
Kirilenko, A., Chivoiu, B., Crick, J., Ross-Davis, A., Schaaf, K., Shao, G., Singhania, V., Swihart, R., 2007. An Internet-based decision support tool for non-industrial private forest landowners. Environmental Modelling & Software 22.
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2 (3).
MAPA, Forestación de tierras agrícolas. Ministerio de Agricultura y Pesca, Madrid, Spain.
Martínez-Zavala, L., Análisis Territorial de la Comarca del Andévalo Occidental: una Aproximación desde el Medio Físico. PhD thesis, University of Seville, Seville, Spain.
May, R.J., Maier, H.R., Dandy, G.C., Gayani Fernando, T.M.K., Non-linear variable selection for artificial neural networks using partial mutual information. Environmental Modelling & Software 23 (11).
Meyer, D., 2004. Support Vector Machines. The Interface to libsvm in Package e1071.
Michie, D., Spiegelhalter, D., Taylor, C. (Eds.), Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence. Ellis Horwood.
MOPT, Guía para la elaboración de estudios del medio físico: Contenidos y metodología, 3a edición. Ministerio de Obras Públicas y Turismo, Madrid.
Moreira, J.M., Fernández Palacios, A., Cartografía y estadísticas de usos y coberturas vegetales del suelo en Andalucía. Junta de Andalucía, Consejería de Medio Ambiente, Sevilla, Spain.
Morgan, J., Sonquist, J., Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58.
Núñez, M.A., El Medio Físico del Parque Natural de la Sierra de Aracena-Picos de Aroche y su entorno. Paleoalteraciones, edafogénesis actual y unidades ambientales. PhD thesis, University of Cordoba, Cordoba, Spain.
Peters, A., Hothorn, T., 2004. ipred: Improved Predictors. R package.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
R Development Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T., 2005. ROCR: Visualizing the Performance of Scoring Classifiers. R package.
Storch, H.V., On the use of inflation in statistical downscaling. Journal of Climate 12.
Therneau, T.M., Atkinson, B.R., 2006. rpart: Recursive Partitioning. R package (port by Brian Ripley).
Vapnik, V., Statistical Learning Theory. Wiley, New York.
Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with S-PLUS. Springer, New York.
Walter, H., Zonas de vegetación y clima. Ediciones Omega, Barcelona.
Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
Zaniewski, A.E., Lehmann, A., Overton, J.McC., Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecological Modelling 157.


More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Classification and Regression by randomforest

Classification and Regression by randomforest Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya

More information

Support Vector Machines Explained

Support Vector Machines Explained March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev 86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three

More information

Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Lecture 6. Artificial Neural Networks

Lecture 6. Artificial Neural Networks Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

6 Classification and Regression Trees, 7 Bagging, and Boosting

6 Classification and Regression Trees, 7 Bagging, and Boosting hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Data mining on the EPIRARE survey data

Data mining on the EPIRARE survey data Deliverable D1.5 Data mining on the EPIRARE survey data A. Coi 1, M. Santoro 2, M. Lipucci 2, A.M. Bianucci 1, F. Bianchi 2 1 Department of Pharmacy, Unit of Research of Bioinformatic and Computational

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

More information

Journal of Asian Scientific Research COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM2.5 IN HONG KONG RURAL AREA.

Journal of Asian Scientific Research COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM2.5 IN HONG KONG RURAL AREA. Journal of Asian Scientific Research journal homepage: http://aesswebcom/journal-detailphp?id=5003 COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM25 IN HONG KONG RURAL AREA Yin Zhao School

More information

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS B.K. Mohan and S. N. Ladha Centre for Studies in Resources Engineering IIT

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Multi-scale upscaling approaches of soil properties from soil monitoring data

Multi-scale upscaling approaches of soil properties from soil monitoring data local scale landscape scale forest stand/ site level (management unit) Multi-scale upscaling approaches of soil properties from soil monitoring data sampling plot level Motivation: The Need for Regionalization

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

INTELLIGENT ENERGY MANAGEMENT OF ELECTRICAL POWER SYSTEMS WITH DISTRIBUTED FEEDING ON THE BASIS OF FORECASTS OF DEMAND AND GENERATION Chr.

INTELLIGENT ENERGY MANAGEMENT OF ELECTRICAL POWER SYSTEMS WITH DISTRIBUTED FEEDING ON THE BASIS OF FORECASTS OF DEMAND AND GENERATION Chr. INTELLIGENT ENERGY MANAGEMENT OF ELECTRICAL POWER SYSTEMS WITH DISTRIBUTED FEEDING ON THE BASIS OF FORECASTS OF DEMAND AND GENERATION Chr. Meisenbach M. Hable G. Winkler P. Meier Technology, Laboratory

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney Combining GLM and datamining techniques for modelling accident compensation data Peter Mulquiney Introduction Accident compensation data exhibit features which complicate loss reserving and premium rate

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

Neural Networks and Support Vector Machines

Neural Networks and Support Vector Machines INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines

More information

Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

More information

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business

More information

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

How To Do Data Mining In R

How To Do Data Mining In R Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

USE OF REMOTE SENSING FOR MONITORING WETLAND PARAMETERS RELEVANT TO BIRD CONSERVATION

USE OF REMOTE SENSING FOR MONITORING WETLAND PARAMETERS RELEVANT TO BIRD CONSERVATION USE OF REMOTE SENSING FOR MONITORING WETLAND PARAMETERS RELEVANT TO BIRD CONSERVATION AURELIE DAVRANCHE TOUR DU VALAT ONCFS UNIVERSITY OF PROVENCE AIX-MARSEILLE 1 UFR «Sciences géographiques et de l aménagement»

More information

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass

More information

Data Mining and Neural Networks in Stata

Data Mining and Neural Networks in Stata Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Numerical Algorithms Group

Numerical Algorithms Group Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Chapter 3 Communities, Biomes, and Ecosystems

Chapter 3 Communities, Biomes, and Ecosystems Communities, Biomes, and Ecosystems Section 1: Community Ecology Section 2: Terrestrial Biomes Section 3: Aquatic Ecosystems Click on a lesson name to select. 3.1 Community Ecology Communities A biological

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

A fast multi-class SVM learning method for huge databases

A fast multi-class SVM learning method for huge databases www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information