Environmental Modelling & Software

Environmental Modelling & Software 25 (2010)

Predicting the potential habitat of oaks with data mining models and the R system

Rafael Pino-Mejías a,*, María Dolores Cubiles-de-la-Vega a, María Anaya-Romero b, Antonio Pascual-Acosta c, Antonio Jordán-López b, Nicolás Bellinfante-Crocci b

a Department of Statistics and Operational Research, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain
b Department of Crystallography, Mineralogy and Agricultural Chemistry, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain
c Andalusian Prospective Center, Avda. Reina Mercedes s/n, Seville, Spain

Article history: Received 24 June 2009; received in revised form 12 January 2010; accepted 17 January 2010; available online 12 February 2010.

Keywords: Habitat modelling; Supervised classification; R system; Data mining models; Ensemble models; Classification trees; Neural networks; Oaks; Support vector machines

Abstract

Oak forests are essential for the ecosystems of many countries, particularly when they are used in vegetal restoration. Models for predicting the potential habitat of oaks can therefore be a valuable tool for environmental work. With this objective, the building and comparison of data mining models are presented for the prediction of potential habitats of the oak forest type in Mediterranean areas (southern Spain), with conclusions applicable to other regions. Thirty-one environmental input variables were measured and six base models for supervised classification problems were selected: linear and quadratic discriminant analysis, logistic regression, classification trees, neural networks and support vector machines.
Three ensemble methods, based on the combination of classification tree models fitted from samples and sets of variables generated from the original data set, were also evaluated: bagging, random forests and boosting. The available data set was randomly split into three parts: training set (50%), validation set (25%), and test set (25%). The analysis of the accuracy, the sensitivity and the specificity, together with the area under the ROC curve for the test set, reveals that the best models for our oak data set are those of bagging and random forests. All of these models can be fitted by free R programs which use the libraries and functions described in this paper. Furthermore, the methodology used in this study will allow researchers to determine the potential distribution of oaks in other kinds of areas. © 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail address: rafaelp@us.es (R. Pino-Mejías).

1. Introduction

Sustainable land management is crucial for the prevention of land degradation, for the reclamation of degraded land for productive use, for the reaping of the benefits of crucial ecosystem services, and for the protection of biodiversity. The use of local knowledge should be encouraged, since it can provide insights into proven adaptation techniques and can contribute towards the design of early-warning systems for extreme events. The sustainable management of natural resources requires an exhaustive knowledge of the physical environmental components, with particular focus on the relations between these elements and the various plant communities. In this way, human activity in natural areas, and especially in protected areas, must be carefully planned in order to assure their conservation and to allow the socio-economic development of the populations located therein (Halpern and Spies, 1995; Bullock et al., 1999; Anaya-Romero, 2004; Kirilenko et al., 2007). In this sense, the sustainable management of the forests (trees, shrubs, etc.) must be considered as a fundamental aspect of soil protection based on land-ecological principles (Kimmins, 1987). Taking this into account, the Mediterranean oak is one of the most important woody species in the forest communities of the western Mediterranean basin. This species is a highly common Mediterranean sclerophyll which grows throughout the entire Mediterranean basin region and can be found in the form of pure or mixed stands. Therefore the use of oaks in vegetal restoration ensures the survival of reforestation, even in situations of prolonged drought. In fact, the use of oaks in Spanish reforestation programs has greatly increased in the last 10 years, to exceed that of the Pinus species widely used in the past (MAPA, 2006). A suitable methodology for the determination of an appropriate model to predict the potential distribution of oak forest would be welcome, and could help towards the improvement of the process of reforestation under Mediterranean conditions. As regards land cover data, the European Environment Agency (EEA) aims to provide those responsible for and interested in European policy on the environment with quantitative data on land cover which is consistent and comparable across the continent. Compiling a single CORINE Land Cover database for all European countries, and registering all changes to this land cover
(LC), is crucial, since the impact of many environmental problems exceeds national borders, and hence solutions often involve more than one sovereign territory. In support of environmental assessment, the need for regularly updated information on land cover has become important at both European and national levels. The growing interest in such information can be ascribed to the important role of LC in processes taking place on the Earth's surface, such as the absorption of solar radiation, the utilization of carbon dioxide by plant associations, and evaporation. Landscape changes at national and global levels are becoming even more topical, and their influence acquires new dimensions not only in research, but also in environmental management. That is why harmonized and standardized spatial reference data is considered mandatory for the support of environmental management in European Union policies (Feranec et al., 2007). The CLC database provides information about land cover that can contribute towards new approaches for the assessment of the European landscape, for instance in the context of environmental and economic accounting, diversity, and the modelling of its properties. These approaches are made possible by the fact that land cover reflects the biophysical state of the real landscape (Feranec et al., 2009). In the present research, data from CLC is considered in order to develop a new tool for the prediction of the potential habitat of oaks across Europe using an exhaustive environmental database. Relationships between variables in ecology are almost always extremely complicated and highly non-linear.
Although a large number of evaluation models have already been described and compared by several authors (Franklin, 1995; Guisan and Zimmermann, 2000), it remains to be seen which models perform best under particular circumstances (Guisan and Zimmermann, 2000; Hirzel et al., 2001; Zaniewski et al., 2002). On the other hand, tools and technologies do exist for efficient land management and administration, but their adoption needs to be promoted and their application expanded. A large number of human livelihoods and ecosystems can benefit from these tools and techniques, since they yield multiple benefits. In the present research these issues are tackled empirically. This paper reports the development and comparison of data mining models for the prediction of the potential habitat of the oak forest type (Quercus rotundifolia and Quercus suber) in the Mediterranean pilot area comprising Sierra de Aracena Natural Park and part of the Western Andevalo nature area, located in southern Spain. The term data mining actually denotes part of a wider process, termed Knowledge Discovery from Data (KDD) by Fayyad et al. (1996), oriented towards identifying patterns in data sets. KDD includes several steps: collecting and cleaning the data, preprocessing, data reduction, and the application of specific algorithms to search for patterns in the data. This last part is usually known as data mining. Other steps in KDD are the interpretation of the patterns discovered and the reporting of these findings. KDD is an interdisciplinary research field where the collaboration of different areas, such as statistics, artificial intelligence, information systems, machine learning, computational learning theory, and other related sciences, has made many different tools available for the data mining process.
Another related term, machine learning, is concerned with the design and development of algorithms and techniques capable of learning from experience, and therefore the data mining models used in our study also lie within the machine learning framework. Data mining models have been successfully applied in many fields, such as medicine, econometric analysis, and image analysis, where they offer new and valuable tools for classification and regression problems and for the clustering of a set of objects. These models are also useful for the environmental sciences, where some particular problems, such as the mixed nature of the data (quantitative and qualitative) and/or high non-linearity, complicate the task of modelling the environmental system. There are several references in this journal where data mining techniques have been used. Ekasingh and Ngamsomsuke (2009) used the C4.5 data mining algorithm to model farmers' crop choice in two watersheds in Thailand. Neural networks, rule induction and clustering techniques are used in Dixon et al. (2007) to improve the reliability and efficiency of monitoring and control of anaerobic wastewater treatment plants. Gibert et al. (2006) present a software tool for intelligent data analysis and implicit knowledge management, including several data mining algorithms. Belanche-Muñoz and Blanch (2008) develop statistical and machine learning models to obtain predictive models for the determination of faecal sources in waters. May et al. (2008) formulate a non-linear procedure to select input variables for artificial neural networks. For each problem, a great family of methods is available, ranging from classic and simple statistical methods to sophisticated and computer-intensive methods, and therefore a careful process in which these methods are correctly fitted may improve the construction of environmental models. However, it must be remarked that the more complex data mining methods are not necessarily superior.
Each particular problem has its own most appropriate data mining model, and it is possible that simple models yield a better performance for certain data sets. Our problem lies within the two-class supervised learning framework: given a set of multivariate vectors of measurements taken at n geographical points, each belonging to one known class (presence/absence of oaks), the task is to build a classification rule to assign new points to their correct class. Classic statistical models, such as Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR), are potential tools for this task. LDA for two groups is based on the early approach of Fisher (1936), which computes a linear discriminant function defining a classification rule that is optimal in homoscedastic Gaussian populations. For heteroscedastic Gaussian populations, the decision boundary is described by a quadratic equation (QDA). LR is one of the most widely used statistical models for predicting binary outcomes and presents interesting properties concerning the interpretation of the coefficients (Hosmer and Lemeshow, 1989), and hence has been extensively used in the field of medical statistics. Other data mining techniques, such as Classification Trees (CTs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), have been successfully applied to data from different fields. The CT has connections with the social sciences (Morgan and Sonquist, 1963) and the Artificial Intelligence (AI) community (Quinlan, 1993), although methods with a more statistical foundation have also been developed. The CT stands out for the ease of interpretation of the model obtained.
The AI community developed a powerful computational paradigm, the ANN (Hertz et al., 1991), which was progressively incorporated into statistical practice (Cheng and Titterington, 1994; Bishop, 1995), although its black-box nature makes the interpretation of the resulting models very difficult. The SVM emerged from Statistical Learning Theory, otherwise known as Vapnik-Chervonenkis theory (Cristianini and Shawe-Taylor, 2000; Vapnik, 1998), and currently commands great interest, accompanied by enthusiasm similar to that previously experienced for the ANN. These models are freely available in the R system (R Development Core Team, 2008), which also provides the user with a powerful statistical programming language. Ihaka and Gentleman (1996) present an introduction to the main characteristics of the R system. The programming resources of this system are highly suitable for programming ensemble methods, where
a certain number of models are constructed by resampling the set of cases and/or the set of predictors, and the models are aggregated by majority voting. Three ensemble methods, bagging, random forests and boosting, are also considered in this paper. There is no universal best learning method. As explained in Witten and Frank (2005), the various data mining methods correspond to different concept description spaces searched with different schemes. Thus, certain description languages and search procedures serve some problems well and other problems badly, thereby making it necessary to perform a careful comparison of the many data mining techniques. Section 2 describes the data set used in our study. Data mining models are presented from the point of view of the currently available R implementations in Section 3, where several practical questions associated with their use are also answered. For a wider discussion of these and other learning techniques, an excellent reference is Hastie et al. (2001), where the different topics of statistical learning are described both theoretically and practically. The results obtained are presented in Section 4 and, finally, the main conclusions are discussed in Section 5.

2. Data description

2.1. Area under study

The area under study is situated in the north of Huelva, in the southwest of Spain. This zone belongs to the Natural Protected Spaces of Andalusia and comprises Sierra de Aracena Natural Park and part of the Western Andevalo natural area (Fig. 1). The total area spans approximately 4770 km². Andevalo is located to the SW of Sierra de Aracena; its elevation and relief are lower and less extreme than in Sierra de Aracena, and the elevation scarcely exceeds 1000 m. The climate is Mediterranean, with great seasonal variation, characterized by rainy cold winters and dry warm summers (Draín, 1979).
The soils are shallow (Núñez, 1998; Martínez-Zavala, 2001), with an A-C or A-R profile. Generally, Leptosols are the most frequent soils in the area of Andevalo, while Cambisols and Regosols occupy more extensive areas in Sierra de Aracena. Cork oak (Q. suber) and oak (Q. rotundifolia) constitute the majority of the vegetation. There are also widely reforested areas, usually with Eucalyptus and Pines, in the central and southern parts of the area of study. Occasionally, there are other cultivated broadleaved species, such as Castanea sativa, and/or riverside species (Moreira and Fernández Palacios, 1997). The main land use in Sierra de Aracena is forestry, whilst in Andevalo it is farming.

2.2. Predictor variables

The map of the distribution of the oak forest type in the area under study was extracted from the CORINE Land Cover database (scale 1:50,000) for the year 1990 (CLC1990). A high number of classes of oak forest type, related to other vegetation associations, were considered in the original legend of CLC1990, and hence all of these classes were grouped together as oak forest type. The map of oak forest type (Fig. 2) was then obtained, and from this map the dependent variable of our data set, namely the presence/absence of oak forest type, was extracted. In this way, data on the current distribution of forest type can be used to predict the potential distribution of oaks, and additionally this information can be used to validate results from each evaluation model used. Predictor variables were selected according to diverse environmental assessment studies and expert knowledge (e.g., Walter, 1977; MOPT, 1991; De la Rosa et al., 1999; Del Toro, 1996), which suggest several environmental variables suspected of being of great physiological importance to plants. These variables were grouped into several thematic categories: lithology, geomorphology, physiography, relief, soil, climate, and geographic location.
The data was collected from different sources, including data directly processed by the authors, as described below. Lithological information was extracted from the Geological Map of Spain, published by the Geological and Mining Institute of Spain (IGME) on a scale of 1:50,000. The maps were digitized and vectorized using Arc/Info software (ESRI, ). The lithological variables considered were: rock origin, acidity, and consolidation (Table 1).

Fig. 1. Area under study.

Fig. 2. Distribution of oak forest type referred to the year 1990 (CLC1990).
Soil data were extracted from the maps of geomorphoedaphic units of Sierra de Aracena Natural Park (Núñez, 1998) and western Andevalo (Martínez-Zavala, 2001). The edaphic variables selected were: soil acidity (pH), available chemical elements (Fe, Mn, Cu, Mg, and K), organic matter, cation exchange capacity (CEC), saturation of the exchange complex (S), coarse element fraction, and clay fraction. Geomorphological variables and physiography were also extracted from the data in Núñez (1998) and Martínez-Zavala (2001). The five measured geomorphological variables were: dominant and secondary erosive processes (sheet and rill erosion, rill and gully erosion, gully erosion, water drop erosion, or vertical erosion by rivers), mass movements (presence/absence), sedimentation (by gravity or by flooding), and morphogenesis (fluvial, denudative, endogenous or karstic). Physiography was represented by one variable with the same name, covering the following categories: river bed, stable plain, erosion surface, plateau, raised strip, karst, hill, butte, and mountain. The variables concerning relief were extracted from a digital terrain model (DTM, resolution m²). The four selected variables, elevation, slope, curvature and orientation, were derived from the digital elevation model (DEM). Climate data were processed by Anaya-Romero (2004) using a statistical interpolation of data points obtained from a network of weather stations. The five selected variables were: annual precipitation, summer precipitation, annual average temperature, average temperature of the hottest month, and average temperature of the coolest month. Five temperature and precipitation maps were obtained by using multiple linear regression models. Dependent variables were collected from 55 weather stations located in the province of Huelva, comprising more than 20 consecutive years of weather data.
Independent variables were the Digital Elevation Model (DEM), distance to the sea, and a Digital Insolation Model (DIM). The DIM was calculated following Felicísimo (1994) for 21st June, and was obtained as the sum of the DIM calculated for every sunlight hour. The fit of the five models was assessed using numeric and graphic procedures. It is well known (Storch, 1999) that the interpolated variables present a smaller variance than that of the dependent variables. For multiple linear regression, the difference between the two variances is precisely the variance of the error, σ². Hence, the mean square error (MSE), usually presented in ANOVA tables for linear regression, which is an unbiased estimator of σ², can be employed to judge the uncertainty associated with the interpolated variables. Thus, the multiple correlation coefficient (R) and the MSE for each climate variable model were: R = and MSE = for the logarithm of the annual precipitation; R = and MSE = for the logarithm of summer precipitation; R = and MSE = for the square of annual average temperature; R = 0.625 and MSE = for the average temperature of the hottest month; and R = and MSE = for the average temperature of the coolest month. Therefore, a reasonable fit was obtained for the climate variables, while a small degree of uncertainty can be expected.

Table 1
Lithological variables and description.

Variable        Class             Description
Rock origin     Volcanic          Extrusive igneous rock solidified near or on the surface of the Earth
                Plutonic          Large mass of intrusive igneous rock believed to have solidified deep within the earth
                Metamorphic       Rock altered by pressure and heat
                Sedimentary       Rock formed from consolidated clay sediments
Acidity         Acid              Siliceous sandstones, slates, quartzites, etc.
                Basic             Limestones and ultrabasic plutonic rocks
Consolidation   Consolidated      Hard and continuous rocks
                Not consolidated  Loose rocks (clay, sand, marl) or colluvial substrates
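As a quick sanity check on the variance argument above, the sketch below (plain Python with made-up toy values, not the weather-station data) fits a one-predictor least-squares line and verifies that the variance of the response decomposes exactly into the variance of the fitted values plus the mean squared residual, so interpolated surfaces are always smoother than the observations:

```python
# Variance decomposition for ordinary least squares with an intercept:
# var(y) = var(fitted) + mean squared residual.
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up predictor (e.g. elevation)
y = [1.2, 1.9, 3.2, 3.8, 5.1]   # made-up response (e.g. temperature)
b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]
mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)
assert abs(var(y) - (var(fitted) + mse)) < 1e-9
```

The identity holds exactly because the residuals of an intercept model have zero mean and are orthogonal to the fitted values.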
Regression equations were then extrapolated in order to obtain maps of temperature and precipitation for the whole area, as well as for the other predictor variables. Finally, the geographic location variables were the longitude and latitude of each sample point. In short, 31 variables are considered as potential predictor variables, with a sampling density of approximately 3 points/km². Given the large size of the available data set, as explained in the following subsection, there is no risk of numeric failure in the fitting procedures, and therefore a previous reduction of the number of variables is unnecessary. Furthermore, some of the data mining methods are able to detect the importance of each variable.

2.3. Data preprocessing

The grid corresponding to the oak forest type contains dichotomous information (1/0); value 1 indicates presence, and value 0 absence. Moreover, the coverage of each predictor variable was also transformed into grids, which were built with the same cell size, m². Due to the high number of data points, a systematic sampling of the area of study was made. Additionally, the spatial distribution of missing and erroneous data was visually verified on a DTM in order to analyze the effect of dropping the absent data. Missing data was found to be distributed in a regular way, without any trend, and hence was eliminated from the analysis. Thus, 13,840 uniformly distributed points were finally selected, where 41.9% of these points presented the oak forest type. Given the large size of our data set, and following the suggestions of Hastie et al. (2001), this data set was randomly split into three disjoint parts: training set (50%), validation set (25%), and test set (25%). The training set was used to fit the models; the validation set was used to determine the parameter configuration of each model; and the test set was employed to estimate the generalization error of the various final models.

3. Data mining models

3.1. Linear and quadratic discriminant analysis

Given two multivariate independent samples where p quantitative predictor variables have been observed for n_i cases, i = 1, 2, n = n_1 + n_2, the LDA model assumes that both populations are multivariate normal with means μ_1 and μ_2 and a common covariance matrix Σ. The LDA rule classifies a p-dimensional vector x to class 2 if

$x^{t}\hat{\Sigma}^{-1}(\hat{\mu}_{2}-\hat{\mu}_{1}) > \tfrac{1}{2}\hat{\mu}_{2}^{t}\hat{\Sigma}^{-1}\hat{\mu}_{2} - \tfrac{1}{2}\hat{\mu}_{1}^{t}\hat{\Sigma}^{-1}\hat{\mu}_{1} + \log\hat{p}_{1} - \log\hat{p}_{2}$   (1)

where the prior probabilities of class membership, p_1 and p_2, are usually estimated by the class proportions in the training set. LDA provides the minimum misclassification rate, and is therefore optimal, under the previously described hypotheses. Environmental data is usually non-Gaussian; for example, qualitative variables must be coded as dummy 0-1 variables. It therefore makes sense to instead choose the cut point that empirically minimizes the classification error, as suggested by Hastie et al. (2001). This also corresponds to the use of prior class probabilities not necessarily equal to n_i/n. We have used the R function lda (Venables and Ripley, 2002), which is available in the MASS library. This function computes the estimated probability for each class, and hence the classification rule could also be formulated by predicting class 1 if the
estimated probability for class 1 is greater than a threshold probability p_c. This last value can be selected by empirical optimization of the classification error. Thus, 99 possible values for p_c (0.01, 0.02, …, 0.99) are considered in our study, and the value minimizing the classification error over the validation set is selected. When the covariance matrices are not assumed to be equal, quadratic discriminant functions are computed, and the QDA rule yields arg max_i d_i(x), with

$d_{i}(x) = -\tfrac{1}{2}\log\left|\hat{\Sigma}_{i}\right| - \tfrac{1}{2}(x-\hat{\mu}_{i})^{t}\hat{\Sigma}_{i}^{-1}(x-\hat{\mu}_{i}) + \log\hat{p}_{i}$   (2)

The R function qda (Venables and Ripley, 2002) in the MASS library has been used in our case study. A similar search for the cut point was also carried out for the QDA model through the same set of 99 threshold probabilities. Both techniques are widely used, and they perform well on diverse classification tasks, as described in the STATLOG project (Michie et al., 1994). They can be efficiently computed and provide interpretable classification rules. The main drawback of LDA is also its simplicity, which can fail to capture complex structures in the data set. However, the quadratic version, which requires a larger number of parameters, tends to overfit on data sets of at most a few hundred points, whereas in this situation LDA is usually better, since it is robust to departures from the assumptions.

3.2. Logistic regression

For a binary response and p quantitative predictors x_1, …, x_p (some of which may be dummy variables coding qualitative variables), the LR model assumes that the probability of the target response is

$p(x_{1},\ldots,x_{p}) = \frac{e^{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}}}{1+e^{\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}}}$   (3)

The glm function in R (Venables and Ripley, 2002) computes the maximum likelihood estimators of the p + 1 parameters by means of an iterative weighted least-squares (IWLS) algorithm.
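The cut-point search described above for LDA, and reused for QDA and logistic regression, can be sketched as follows; the probabilities and labels here are hypothetical toy values, not output from the fitted models:

```python
# Scan the 99 thresholds 0.01, ..., 0.99 and keep the first one that
# minimizes the classification error on a validation set.
def best_threshold(probs, labels):
    best_t, best_err = None, float("inf")
    for k in range(1, 100):
        t = k / 100.0
        # a case is misclassified when (probability > t) disagrees with its label
        err = sum((p > t) != bool(y) for p, y in zip(probs, labels))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

probs  = [0.10, 0.40, 0.60, 0.90]   # hypothetical class-1 probabilities
labels = [0, 0, 1, 1]
t, err = best_threshold(probs, labels)
assert err == 0 and 0.40 <= t <= 0.59
```

Any threshold between the two middle probabilities separates this toy sample perfectly; the scan returns the first such value.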
There are several inferential procedures to test the statistical significance of the whole model and the individual significance of each variable. The model may also be interpreted, and a great family of diagnostics and criteria are available to identify influential and outlying observations. LR can be fully embedded into a formal decision framework, but in order to carry out a comparison with the other models, a threshold probability needs to be specified, which, in fact, corresponds to varying the prior class probabilities. Thus, 99 possible values for this threshold probability (0.01, 0.02, …, 0.99) are also considered, and the value which minimizes the classification error over the validation set is selected. In the same way as LDA, LR is also optimal under the assumption of multivariate normal distributions with equal covariance matrices, although LR remains optimal in a wider variety of situations. However, LR requires larger data sets to obtain stable results, and a greater variety of statistical concepts, such as odds ratios and interactions, are needed. In addition, complex non-linear relations between the dependent and independent variables can only be incorporated through appropriate but non-evident transformations.

3.3. Classification trees

A classification tree (CT) is a set of logical if-then conditions which drive each case to a final decision. These conditions can easily be plotted in order to aid the understanding of the model. A binary CT is grown by binary recursive partitioning, using the response in the specified formula and choosing splits from the set of predictor variables. The split which maximizes the reduction in impurity (a measure of diversity for the outcome in a specific set of nodes) is chosen, the data set is then split, and the process is repeated. Splitting continues until the terminal nodes are too small to be split. The classification for a vector is computed by a majority class vote in its terminal node.
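The impurity-driven split choice can be illustrated with a minimal sketch using the Gini index as the impurity measure; the single-predictor sample below is a made-up toy example, not the oak data:

```python
# Gini impurity for 0/1 labels: 2 * p1 * (1 - p1).
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)

def best_split(xs, ys):
    """Return the threshold on a single predictor that maximizes the
    reduction in size-weighted Gini impurity."""
    best_t, best_gain = None, 0.0
    parent = gini(ys)
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# A perfectly separable toy sample: the split at x <= 2 removes all impurity.
t, gain = best_split([1, 2, 3, 4], [0, 0, 1, 1])
assert t == 2 and abs(gain - 0.5) < 1e-9
```

Growing a tree repeats this search recursively over each resulting node until the terminal nodes are too small to split.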
We have used the rpart package of R (Therneau and Atkinson, 2006), which implements the CART methodology proposed by Breiman et al. (1984). One advantage of this model is that qualitative input predictor variables may be used directly, without the aid of dummy variables. CART also includes a fully automatic missing-value handling mechanism. The Gini index (the default impurity measure) has been taken as the splitting criterion. Given that large trees can lead to overfitting the data and can mean a loss in the generalization capability for new data, the user must tune a fundamental parameter: the number of terminal nodes, called the size of the tree. Several strategies exist for the selection of the optimal CT, but the availability of a large validation set in our case study led us to initially grow a tree as big as possible, and then to select the sub-tree which minimized the misclassification error over the validation set. The main drawback of classification trees is their high variance; that is, a small change in the data can result in a very different series of splits. This problem can be alleviated with the aid of bagging techniques. The final tree could also include weak interactions between variables while the strongest interactions remain undetected.

3.4. Multilayer perceptron

The Artificial Neural Network (ANN) is a computational paradigm which provides a great variety of non-linear mathematical models, useful for tackling different statistical problems. Several theoretical results support a particular architecture, namely the multilayer perceptron (MLP), for example the universal approximation property, as in Bishop (1995). To this end, we have considered a three-layered perceptron with the logistic activation function g(u) = e^u/(e^u + 1) in the hidden layer, and the identity function as the activation function for the output layer.
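A minimal forward pass for the perceptron just described (logistic hidden layer, identity output layer) might look as follows; the weights and layer sizes are illustrative values, not coefficients fitted by nnet:

```python
import math

# Logistic activation used in the hidden layer: g(u) = e^u / (e^u + 1).
def logistic(u):
    return math.exp(u) / (math.exp(u) + 1.0)

def mlp_forward(x, v, w):
    # v[h] = [v_0h, v_1h, ..., v_ph]: bias-first weights into hidden unit h.
    # w[j] = [w_0j, w_1j, ..., w_Hj]: bias-first weights into output unit j.
    hidden = [logistic(vh[0] + sum(vi * xi for vi, xi in zip(vh[1:], x)))
              for vh in v]
    # Identity activation at the output layer: a plain affine combination.
    return [wj[0] + sum(wh * hh for wh, hh in zip(wj[1:], hidden))
            for wj in w]

# One input, one hidden unit, two outputs (one dummy output per class);
# the predicted class is the output achieving the maximum.
out = mlp_forward([0.0], v=[[0.0, 1.0]], w=[[0.0, 1.0], [1.0, -1.0]])
assert abs(out[0] - 0.5) < 1e-9 and abs(out[1] - 0.5) < 1e-9
```

With a zero input the hidden unit outputs g(0) = 0.5, so both toy outputs equal 0.5 here; in general the class whose dummy output is largest is predicted.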
By denoting H as the size of the hidden layer, {v_ih, i = 0, 1, …, p, h = 1, …, H} as the synaptic weights for the connections between the p-sized input layer and the hidden layer, and {w_hj, h = 0, 1, …, H, j = 1, …, q} as the synaptic weights for the connections between the hidden layer and the q-sized output layer, the outputs of the neural network for a vector of inputs (x_1, …, x_p) become

$o_{j} = w_{0j} + \sum_{h=1}^{H} w_{hj}\, g\!\left(v_{0h} + \sum_{i=1}^{p} v_{ih} x_{i}\right), \quad j = 1, 2, \ldots, q$   (4)

In our classification problem, the response was coded by a vector z = (z_1, z_2) formed by two 0-1 dummy variables, one for each class, and hence the number of outputs was q = 2. For an input vector to the net, the classification is the class corresponding to the dummy variable which achieves the maximum of the two predictions of the net. Input qualitative variables must also be coded as 0-1 dummy variables. The nnet R function (Venables and Ripley, 2002) fits single-hidden-layer neural networks by using the BFGS procedure, a quasi-Newton method also known as a variable metric algorithm, which attempts to minimize a least-squares criterion that introduces a decay term λ in an effort to prevent overfitting. The BFGS algorithm can be found in Bishop (1995). Defining W = (W_1, …, W_M) as the vector of all M coefficients of the net, and given n q-sized target vectors y_1, …, y_n, the BFGS
method can be applied to the following non-linear least-squares problem:

Min_W  Σ_{i=1..n} ||y_i − ŷ_i||² + λ Σ_{m=1..M} W_m²   (5)

A major disadvantage of the MLP is the fact that there is no known procedure which guarantees a global solution; usually only one of the many possible local minima is obtained. Another drawback is its black-box nature, which renders the resulting model very difficult to interpret. The performance of the final model can be improved by normalizing the quantitative input variables, and therefore for each quantitative variable X_i the following value has been computed for each record j: z_ij = (x_ij − x_i,min)/(x_i,max − x_i,min). The R implementation of an MLP model requires the specification of two parameters, the size of the hidden layer (H) and the decay parameter (λ), and therefore a search was carried out over the grid {15, 20, 25} × {0, 0.01, 0.05, 0.1, 0.2, ..., 1.5}. This grid was defined according to the suggestions of Hastie et al. (2001); it must be remarked that greater values for H could not be attempted due to the limited memory resources of our personal computers.

Support vector machines

The Support Vector Machine (SVM) is a family of supervised machine learning techniques. They were originally introduced by Vapnik and co-authors (Boser et al., 1992), and several extensions were successively proposed. When used for a two-class classification problem where the set of binary labelled training patterns is linearly separable, the SVM method separates the two classes with the hyper-plane that is maximally distant from the patterns (the "maximal margin hyper-plane"). If linear separation is not possible, the feature space is enlarged using basis expansions such as polynomials or splines.
However, explicit specification of this transformation is not necessary; only a kernel function that computes inner products in the transformed space is required. We have fitted the SVM models with the svm function available in the library e1071 of the R system (Dimitriadou et al., 2006), which offers an interface to the award-winning C++ implementation LIBSVM, by Chang and Lin. The data set is described by n training vectors {x_i, y_i}, i = 1, 2, ..., n, where the p-dimensional vectors x_i contain the predictor features and the n labels y_i ∈ {−1, 1} identify the class of each vector. From among the several variants of SVM existing in the library e1071, and following Meyer (2004), we have used C-classification with the radial basis Gaussian kernel function:

K(u, v) = exp(−γ ||u − v||²)   (6)

The primal quadratic programming problem to be solved is:

Min_{w,b,ξ}  (1/2) wᵀw + C Σ_{i=1..n} ξ_i
subject to  y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, ..., n   (7)

C > 0 is a parameter controlling the trade-off between margin and error, and Σ_{i=1..n} ξ_i is an upper bound on the sum of the distances of the wrongly classified cases to their correct plane. The dual problem is

Max_α  −(1/2) αᵀQα + eᵀα
subject to  0 ≤ α_i ≤ C,  i = 1, 2, ..., n,  yᵀα = 0   (8)

where e is the n-vector of all ones, and Q is a positive semidefinite matrix defined by Q_ij = y_i y_j K(x_i, x_j), i, j = 1, 2, ..., n, where K(x_i, x_j) = φ(x_i)ᵀφ(x_j) is the kernel function. No explicit construction of the non-linear mapping φ(x) is needed, which is known as the kernel trick. A vector x is classified by the decision function

sign( Σ_{i=1..n} y_i α_i K(x_i, x) + b )   (9)

which depends on the margins m_i = Σ_{j=1..n} y_j α_j K(x_j, x_i) + b, i = 1, 2, ..., n. The greater the

Table 2
Characteristics of the search method for each data mining model.
Model | Advantages | Drawbacks | Parameter search results
LDA | Interpretability; simplicity; robust model | | p_c = cut point = 0.58
QDA | Interpretability | Big sample size | p_c = cut point = 0.04
LR | Direct probability estimation; interpretability; statistical concepts | Big sample size; transformations | p_c = cut point = 0.47
CT | Interpretability; handling of missing values | Unstable; weak interactions possibly revealed | M = number of rules or terminal nodes = 87
MLP | Universal approximation property | Less formal statistical training; prone to overfitting; black box; high computational cost | H = number of hidden neurons = 25; λ = decay parameter = 1.5
SVM | Global solution for the optimization problem; theoretically well founded | Selection of parameters; not clearly interpretable | C = penalization error coefficient = 10; γ = parameter of the kernel function
CTBag | Lower variance than CART; out-of-bag estimates | High computational cost; not interpretable | B = number of trees = 100
RF | Efficient in large data sets; out-of-bag estimates; variable importance measures | Prone to overfitting for some data sets | mtry = number of variables to choose the best split = 10
CTBoost | Theoretical properties; identification of outliers | High computational cost; not interpretable | M = number of trees = 3

absolute value of the margin, the more reliable the computed classification. An exploratory analysis of the margins could help to throw light on the SVM model (Furey et al., 2000). Note that the solution to the quadratic programming problem is global, which avoids the non-optimality of the neural network training algorithms. Two parameters must be tuned: C and γ. We have adopted the suggestions of Meyer (2004) in performing the selection of the parameters of the SVM model, and a grid search for C and γ over the set {1, 10, 20, 30, 40, 50, 100, 150, ..., 1000} × {0.015, 0.016, ..., 0.020} was conducted on the validation set. The explored values for γ were selected around the default value of the svm function in the R library e1071, defined as 1/p, where p is the number of predictors (some of which may be dummy variables coding qualitative variables), while the grid for C includes both small and large values. One drawback of the SVM model is the necessity of correctly identifying appropriate values for the required parameters. Another drawback is that although several schemes exist which attempt to interpret the final model, none of them is sufficiently clear.

Several methods to combine different classification models have been proposed in recent years. All are based on the combination of models fitted from samples and sets of variables generated from the original data set. Three important ensemble methods are considered here: bagging, random forests and boosting.

Fig. 3. Test sensitivity vs. test false positive rate (100 − specificity) for the nine classification models.

Bagging

Bagging (Bootstrap Aggregating) is a method proposed by Breiman (1996) to improve the performance of prediction models.
Given a classification model, bagging draws B independent samples with replacement from the available training set (bootstrap samples), fits a model to each bootstrap sample, and finally aggregates the B models by majority voting. Bagging tends to be a very effective procedure when applied to unstable learning algorithms, i.e., those where small changes in the data can cause large changes in the predicted values (Breiman, 1996), such as classification and regression trees and neural networks. The empirical success of the first publications has been confirmed by theoretical results such as those in Bühlman and Yu (2002), where bagging is shown to smooth hard decision problems, yielding a smaller variance and mean squared error. The R package ipred (Peters and Hothorn, 2004) computes bagged tree models (CTBag) and was therefore used in our study; two values for B, 50 and 100, were considered, and the value which minimized the validation classification error was selected. Bagging provides out-of-bag (OOB) estimates of the misclassification error rate without requiring a test set. Thus, for each case, the OOB aggregated classification is computed over the models which were trained on bootstrap samples not containing that case. The out-of-bag estimate is defined as the error rate of this OOB classification.

Table 3
Accuracy, sensitivity, specificity and 100·AUC for each of the classification methods, computed on the test set.

Model | Accuracy | Sensitivity | Specificity | 100·AUC
LDA | | | |
QDA | | | |
LR | | | |
CT | | | |
MLP | | | |
SVM | | | |
CTBag | 83.93 | 81.10 | 86.02 | 90.96
RF | 83.73 | 86.02 | 80.64 | 91.07
CTBoost | | | |

Table 4
Importance measure for the random forest model. Values between 0 and 60 are marked as –.
Geographic location: Longitude; Latitude
Climate: Summer average precipitation; Annual average precipitation; Average temperature for the coldest month; Annual average temperature; Average temperature for the hottest month
Relief: Dominant erosion; Slope; Physiography; Morphogenesis; Sedimentation; Secondary erosion; Mass movements; Elevation; Curvature; Orientation
Edaphic variables: Saturation of the exchange complex; Coarse element fraction; Available iron; Available manganese; Available potassium; Organic matter; pH; Cation exchange capacity; Clay fraction; Available copper; Available magnesium
Lithologic variables: Rock origin; Rock acidity; Rock consolidation
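The importance values in Table 4 accumulate, over all trees, the decrease in node impurity (Gini index) produced by splits on each variable. For a binary presence/absence response, the impurity of a node and the decrease achieved by one split can be computed as follows (a minimal illustration with made-up labels, not the internals of the randomForest package):

```python
def gini(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n          # proportion of class 1 (presence of oaks)
    return 1.0 - p1 * p1 - (1.0 - p1) * (1.0 - p1)

def gini_decrease(parent, left, right):
    # Impurity reduction achieved by splitting `parent` into `left` + `right`
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A pure split removes all the impurity of a balanced node
parent = [0, 0, 1, 1]
print(gini_decrease(parent, [0, 0], [1, 1]))  # 0.5: from impurity 0.5 down to 0
```

Summing such decreases over every split on a given variable, and averaging over all trees of the forest, yields the measure tabulated above.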

Random forests

The Random Forests (RF) approach was proposed by Breiman (2001) as a way to combine many different trees. A number of trees are constructed: each tree is grown over a bootstrap sample of the training data set, and a random selection of variables is considered for the choice of the split in each node. As in bagging, the trees are combined by majority voting, and out-of-bag estimates can also be computed. One important feature of this ensemble method is the availability of measures to assess the importance of each variable and to identify outlier observations. Breiman (2001) claims that RF rarely overfits, and has shown that Bayes consistency is achieved with a simple version of RF. Moreover, this method runs efficiently on large databases, since it is able to handle thousands of input variables. We have used the R package randomForest (Liaw and Wiener, 2002), which builds 500 trees by default. However, the number of variables to select randomly has been chosen by means of a search around the default value (mtry = square root of the number of predictors), namely from mtry − 5 to mtry + 5. One drawback is that RF has suffered from overfitting on some data sets used in machine learning benchmarking.

Boosting

The idea of boosting appeared in the machine learning literature in the 1980s, although the first algorithm of this type was proposed by Schapire in 1990, with a second algorithm proposed by Freund soon afterwards. Boosting was proposed in order to combine the outputs of many weak classifiers to produce a powerful committee, in an attempt to improve the generalization performance of weak algorithms. The various models are fitted to differently reweighted samples. At each step, those observations that were misclassified by the previous classifier have their weights increased, whereas the weights of those correctly classified are decreased.
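This reweighting loop can be sketched in a toy, self-contained form (an illustrative sketch with made-up predictions, not the code of the R boost package):

```python
import math

def weighted_error(preds, labels, weights):
    # Weighted error rate of a weak classifier
    return sum(w for p, y, w in zip(preds, labels, weights) if p != y)

def reweight(preds, labels, weights):
    """One boosting step: upweight misclassified cases, then renormalize."""
    e = weighted_error(preds, labels, weights)
    c = math.log((1.0 - e) / e)        # classifier weight c_m = ln((1 - e_m)/e_m)
    new_w = [w * math.exp(c) if p != y else w
             for p, y, w in zip(preds, labels, weights)]
    total = sum(new_w)
    return c, [w / total for w in new_w]

labels = [1, 1, -1, -1]
weights = [0.25] * 4                   # initial weights D_1(i) = 1/n
preds = [1, -1, -1, -1]                # this weak classifier misclassifies case 2
c, weights = reweight(preds, labels, weights)
# The misclassified case now carries the largest weight
```

After this step the misclassified observation holds half of the total weight, so the next weak classifier is forced to concentrate on it.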
One of the most popular boosting algorithms, due to Freund and Schapire (1997), is AdaBoost.M1, which arose from the original boosting algorithms they had discovered, and is presented in the following. Suppose n training vectors {x_i, y_i}, i = 1, 2, ..., n, where the p-dimensional vectors x_i contain the predictor features and the n labels y_i ∈ {−1, 1} identify the class of each vector. The algorithm is:

1. Initialize D_1(i) = 1/n, i = 1, 2, ..., n.
2. For m = 1, 2, ..., M, repeat:
   a) Fit the model g_m from the training set, reweighted by D_m.
   b) Compute the weighted error rate
      e_m = Σ_{i: g_m(x_i) ≠ y_i} D_m(i)   (10)
   c) Compute c_m = ln((1 − e_m)/e_m).
   d) Recompute D_{m+1}(i) = D_m(i) exp[c_m I(y_i ≠ g_m(x_i))], where I(u) = 1 if u is true and 0 otherwise.
   e) Normalize the D_{m+1}(i), i = 1, 2, ..., n, to sum 1.
3. The boosted model is:

G(x) = sign( Σ_{m=1..M} c_m g_m(x) )   (11)

Fig. 4. Test predictions for the nine prediction models.

The consistency of boosting is of major interest, as evidenced by Breiman (2004), which is accompanied by three other papers and several discussions. Like SVM, boosting tends to maximize the margins. The R package boost (Dettling, 2006) contains a function for boosting trees (adaboost) and was adopted in our case study. The function adaboost considers M = 20 by default; however, the boosting models for M = 1, 2, 3, ..., 20 have been empirically compared on the validation set. The examination of the weight distribution can help to identify outlier observations in the data set, since difficult cases receive greater weights.

4. Results

Table 2 summarizes the characteristics of the search method and offers the parameter configuration resulting from the search over the validation set. It should be borne in mind that the final parameter configuration can be very different for other problems. Moreover, a different split into training, validation and test sets can produce other final parameters, although, as in our problem, the larger the sample size, the more similar the corresponding performances. CT and MLP are particularly complex, as the former needs a large number of terminal nodes, and the latter a large hidden layer (25 neurons). CTBoost only requires three classification trees; the decision rule is therefore based on the weighted aggregation of three base trees.
Table 3 contains the test accuracy (percentage of correctly classified points), the test sensitivity (percentage of correctly classified points presenting oaks) and the test specificity (percentage of correctly classified points not presenting oaks). The area under the ROC curve (AUC) was also computed, with the aid of the ROCR library available in R (Sing et al., 2005). Two ensemble methods based on classification trees show the best results. The bagged classification tree (CTBag) provides the greatest test accuracy (83.93%), while Random Forests (RF) has a slightly lower accuracy (83.73%). Moreover, the two values of 100·AUC are also very similar (90.96% for CTBag and 91.07% for RF). However, the sensitivity and specificity values are reversed: while RF has a test sensitivity (86.02%) greater than its test specificity (80.64%), CTBag exhibits a test sensitivity (81.10%) lower than its test specificity (86.02%). Classic statistical models, such as LDA, QDA and LR, are clearly superseded by these ensemble methods. Note that the classification tree model also outperforms these classic models on the accuracy criterion. Table 3 also shows that boosting is not effective in this study; moreover, the validation set suggested halting the boosting algorithm after only three iterations. The sensitivity and the false positive rates (100 − specificity) have been displayed for the nine models. Fig. 3 shows the resulting graph, in which the bagged classification tree and the Random Forests method both stand out, since these models are those nearest to the upper left-hand corner. However, a distinctive feature of RF is the availability of measures of the importance of the predictor variables. One of these measures computes, for each variable, the total decrease in node impurities (Gini index) given by splitting on the variable, averaged over all trees. Table 4 displays these importance values for the predictor variables, where values lower than 60 are represented by –.
This table reveals that the limiting factors for the potential distribution of the oak forest type in the area under study would be determined mainly by climate, erosion processes and physiography. Secondarily, edaphic variables (saturation of the exchange complex, coarse element fraction, and available iron) would also influence the distribution of these species. Finally, it seems that lithological variables display no repercussions for the potential distribution of this group of species. A more in-depth review of the performances of the compared methods can be obtained through Fig. 4. Each part of this figure exhibits the map of test predictions of the corresponding model, where black denotes predicted presence of oaks, and grey denotes the real presence of oaks. These maps reinforce the previous conclusions, and confirm that CTBag and RF present good agreement between the spatial distribution of the test oaks and the spatial distribution of the predictions. Fig. 4 also shows that, in the lower left-hand region, the remaining models fail to adequately predict the presence of oaks. Fig. 5 presents a measure of the relative agreement between the nine models considered: black denotes five or more predictions of the presence of oaks, dark grey represents three or four predictions, and light grey denotes at most two predictions. This figure confirms that the lower left-hand region is the most difficult to predict, evidencing a clear spatial bias for the majority of the models analyzed, with the exception of CTBag and RF.

Fig. 5. Test relative agreement. Black: five or more predictions of the presence of oaks; dark grey: three or four predictions; light grey: at most two predictions.
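The agreement map of Fig. 5 reduces, for each test point, the predictions of the nine models to a count of "presence of oaks" votes, which is then binned into three grey levels. A minimal sketch of this per-point reduction, with made-up model outputs:

```python
def agreement_level(predictions):
    """Bin the number of 'presence of oaks' predictions across models:
    'black' for five or more, 'dark grey' for three or four,
    'light grey' for at most two."""
    votes = sum(predictions)
    if votes >= 5:
        return "black"
    if votes >= 3:
        return "dark grey"
    return "light grey"

# One row per test point, one column per model (1 = oaks predicted)
per_point = [
    [1, 1, 1, 1, 1, 0, 1, 1, 1],   # strong agreement  -> black
    [1, 0, 1, 0, 1, 0, 0, 0, 0],   # partial agreement -> dark grey
    [0, 0, 0, 0, 1, 0, 0, 1, 0],   # weak agreement    -> light grey
]
levels = [agreement_level(p) for p in per_point]
```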

5. Conclusions

We have studied a suitable methodology to determine an appropriate model for predicting the potential distribution of oaks, in an effort to improve the process of reforestation under Mediterranean conditions. Although the diversity of classification models currently available enriches statistical practice in the environmental framework by offering different alternatives to the analyst, the number of decisions to be made is clearly increased. The present research is designed to provide practical experience in order to facilitate the selection of methods. Given the obvious need for a powerful statistical programming language for implementing this computing task, a good and inexpensive choice is the R system, which offers free implementations of both classic and modern classification models. A first conclusion is that none of the classification models studied should be used blindly, and therefore none of them can be named a fully automatic classification model. Each has a certain number of parameters which must be carefully tuned, which requires a search procedure to be employed. However, R programs are able to search for the best configuration of values for the required parameters. Moreover, a correct model comparison also needs strategies for the estimation of the generalization errors. For large data sets, such as that used in this study, the random split into training, validation and test sets is recommended. For data sets of a reduced size, the use of cross-validation schemes for the selection of suitable parameters in the tuning process could be necessary. The analysis of the results for our case study has revealed that bagged classification trees and random forests offer very good performance. However, we cannot claim the universal superiority of these two models.
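The random split into training (50%), validation (25%) and test (25%) sets recommended above can be sketched as follows (a generic illustration of the partition used in this study; the function name and seed are ours, and the split itself was performed in R):

```python
import random

def split_indices(n, seed=0):
    """Randomly partition record indices 0..n-1 into 50/25/25 subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # fixed seed for reproducibility
    n_train = n // 2
    n_valid = n // 4
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

# 65,535 sampled points, as in this study
train, valid, test = split_indices(65535)
# Fit on `train`, tune parameters on `valid`, report final errors on `test`
```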
The relative performance depends on the data set and computer resources, together with other factors, such as experience in using the methods. However, it is worth considering different alternative classification models, as shown in this case study. Concerning the quality of the present research, there are several factors which improved the comparison of the prediction models. Firstly, results may depend on the quality of the input data (Guisan and Zimmermann, 2000). To this end, biophysical maps of high accuracy and resolution were selected as input data. This input data was collected from different published sources or was extracted by the authors. The data belong to several environmental fields of independent variables: lithology, geomorphology, physiography, relief, soil, and climate. There were both qualitative and quantitative variables, all of which were coded and pre-processed in order to obtain a homogeneous digital database. In agreement with other authors (Guisan and Zimmermann, 2000), the variability in predictive capacity obtained by the nine methods used in this paper may suggest that the best predictive model strongly depends upon the type and quality of the input data. Another strength of the present work is the large sample size of the input data, which contains 65,535 points uniformly distributed in the area under study; this represents 14% of the total area, and approximately 3 points/km² of sampling density. The main benefit of this sample size was that it enabled the data to be split into a training set (50%), validation set (25%), and test set (25%), whose sample sizes provided reliable estimations of the generalization error of the different final models. On the other hand, the main weakness of this study could be that no competition between different species has been taken into account. Despite the great relevance that biotic interactions hold in the current distributions of every community, their measurement presents a tough nut to crack.
Furthermore, historical factors, such as human disturbance of the original ecosystem, have also been omitted. This is perhaps a problem without solution, except for undisturbed systems. Classic statistical models offered poor performance in our case study. The other three base models improved the results, although the fitted classification trees and multilayer perceptrons remained too complex. Support vector machines performed slightly better than classification trees. Nevertheless, two ensemble methods, random forests and bagged trees, were found to be the best models, while the performance of boosting remained similar to that of quadratic discriminant analysis. Although bagging and random forests exhibited similar results, the great advantage of RF is the availability of measures of the importance of the variables, which suggest that the potential distribution of the oak forest type in the area under study would be determined mainly by climate, erosion processes, slope and physiography, and secondarily by edaphic variables. It seems that lithological variables have no influence on the potential distribution of oak forest. Not only can the methodology followed in this study help to determine the potential distribution of oaks in similar Mediterranean areas, but it can also be extended to different areas and classification problems.

Acknowledgments

This work was supported by the Spanish Ministry of Education and Science (MTM), the Institute of Statistics of Andalusia (OG-154/07) and the Andalusia Environment Government (OG-096/01). We thank Dr. De la Rosa for advice on the manuscript. The authors wish to thank the anonymous reviewers for their valuable comments.

References

Anaya-Romero, M., 2004. Modelo de distribución potencial de usos forestales basado en parámetros edáficos, geomorfológicos, climáticos y topográficos. PhD thesis, University of Seville, Seville, Spain.
Belanche-Muñoz, L., Blanch, A.R., Machine learning methods for microbial source tracking.
Environmental Modelling & Software 23 (6).
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, New York.
Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, Pittsburgh.
Bullock, P., Jones, R.J.A., Montanarella, L. (Eds.), 1999. Soil Resources of Europe. The European Soil Bureau, Joint Research Centre, ISPRA, Italy, 202 pp.
Breiman, L., 1996. Bagging predictors. Machine Learning 24.
Breiman, L., 2001. Random forests. Machine Learning 45 (1).
Breiman, L., 2004. Population theory for boosting ensembles. The Annals of Statistics 32 (1).
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth and Brooks, Belmont.
Bühlman, P., Yu, B., 2002. Analyzing bagging. The Annals of Statistics 30 (4).
Cheng, B., Titterington, D.M., Neural networks: a review from a statistical perspective. Statistical Science 9.
Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
De la Rosa, D., Mayol, F., Moreno, J.A., Bonsón, T., Lozano, S., An expert system/neural network model (ImpelERO) for evaluating agricultural soil erosion in Andalucia region, southern Spain. Agriculture, Ecosystems and Environment 73.
Del Toro, M., Capacidad de uso forestal de los suelos del Parque Natural Sierra de Grazalema en base a sus propiedades químicas. PhD thesis, University of Seville, Seville, Spain.
Dettling, M., 2006. boost: Boosting Methods for Real and Simulated Data. R package.
Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, D., 2006. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package.
Dixon, M., Gallop, J.R., Lambert, S.C., Healy, J.V., Experience with data mining for the anaerobic wastewater treatment process. Environmental Modelling & Software 22 (3).
Draín, M., Geografía de la Península Ibérica. Oikos-Tau, Barcelona.

Ekasingh, B., Ngamsomsuke, K., Searching for simplified farmers' crop choice models for integrated watershed management in Thailand: a data mining approach. Environmental Modelling & Software 24 (12).
ESRI, ArcInfo. ESRI, Redlands.
European Environment Agency. Data Corine Land Cover (CLC1990) 250 m. Version 06/1999.
Fayad, U., Piatetsky-Shapiro, G., Smith, P., From data mining to knowledge discovery in databases (a survey). AI Magazine 3 (17).
Felicísimo, A.M., Digital Terrain Models. Introduction and Applications in Environmental Sciences. Pentalfa, Oviedo (in Spanish).
Feranec, J., Hazeu, G., Christensen, S., Jaffrain, G., CORINE land cover change detection in Europe (case studies of the Netherlands and Slovakia). Land Use Policy 24.
Feranec, J., et al., Determining changes and flows in European landscapes using CORINE land cover data. Applied Geography, doi:10.1016/j.apgeog.
Fisher, R.A., The use of multiple measurements in taxonomic problems. Annals of Eugenics 7.
Franklin, J., Predictive vegetation mapping: geographic modelling of biospatial patterns in relation to environmental gradients. Progress in Physical Geography 19.
Freund, Y., Schapire, R., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55.
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D., 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16.
Gibert, K., Sánchez-Marré, M., Rodríguez-Roda, I., GESCONDA: an intelligent data analysis system for knowledge discovery and management in environmental databases. Environmental Modelling & Software 26 (1).
Guisan, A., Zimmermann, E., 2000. Predictive habitat distribution models in ecology.
Ecological Modelling 135.
Halpern, C.B., Spies, T.A., 1995. Plant species diversity in natural and managed forests of the Pacific Northwest. Ecological Applications 5.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.
Hertz, J., Krogh, A., Palmer, R., Introduction to the Theory of Neural Computation. Addison-Wesley, Reading.
Hirzel, A.H., Helfer, V., Métral, F., Assessing habitat-suitability models with a virtual species. Ecological Modelling 145.
Hosmer, D.W., Lemeshow, S., Applied Logistic Regression. Wiley, New York.
Ihaka, R., Gentleman, R., R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5.
Kimmins, J.P., Forest Ecology. Macmillan, New York, USA, 531 pp.
Kirilenko, A., Chivoiu, B., Crick, J., Ross-Davis, A., Schaaf, K., Shao, G., Singhania, V., Swihart, R., 2007. An Internet-based decision support tool for non-industrial private forest landowners. Environmental Modelling & Software 22.
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2 (3).
MAPA, Forestación de tierras agrícolas. Ministerio de Agricultura y Pesca, Madrid, Spain.
Martínez-Zavala, L., Análisis Territorial de la Comarca del Andévalo Occidental: una Aproximación desde el Medio Físico. PhD thesis, University of Seville, Seville, Spain.
May, R.J., Maier, H.R., Dandy, G.C., Gayani Fernando, T.M.K., Non-linear variable selection for artificial neural networks using partial mutual information. Environmental Modelling & Software 23 (11).
Meyer, D., 2004. Support Vector Machines. The Interface to libsvm in Package e1071.
Michie, D., Spiegelhalter, D., Taylor, C. (Eds.), Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence. Ellis Horwood.
MOPT, Guía para la elaboración de estudios del medio físico: Contenidos y metodología, 3a edición. Ministerio de Obras Públicas y Turismo, Madrid.
Moreira, J.M., Fernández Palacios, A., Cartografía y estadísticas de usos y coberturas vegetales del suelo en Andalucía. Junta de Andalucía, Consejería de Medio Ambiente, Sevilla, Spain.
Morgan, J., Sonquist, J., Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58.
Núñez, M.A., El Medio Físico del Parque Natural de la Sierra de Aracena-Picos de Aroche y su entorno. Paleoalteraciones, edafogénesis actual y unidades ambientales. PhD thesis, University of Cordoba, Cordoba, Spain.
Peters, A., Hothorn, T., 2004. ipred: Improved Predictors. R package.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
R Development Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T., 2005. ROCR: Visualizing the Performance of Scoring Classifiers. R package.
Storch, H.V., On the use of inflation in statistical downscaling. Journal of Climate 12.
Therneau, T.M., Atkinson, B.R., 2006. rpart: Recursive Partitioning. R package (port by Brian Ripley).
Vapnik, V., Statistical Learning Theory. Wiley, New York.
Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with S-PLUS. Springer, New York.
Walter, H., Zonas de vegetación y clima. Ediciones Omega, Barcelona.
Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
Zaniewski, A.E., Lehmann, A., Overton, J.McC., Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecological Modelling 157.


More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Classification and Regression by randomforest

Classification and Regression by randomforest Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya

More information

Support Vector Machines Explained

Support Vector Machines Explained March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev 86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three

More information

Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Lecture 6. Artificial Neural Networks

Lecture 6. Artificial Neural Networks Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

6 Classification and Regression Trees, 7 Bagging, and Boosting

6 Classification and Regression Trees, 7 Bagging, and Boosting hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Data mining on the EPIRARE survey data

Data mining on the EPIRARE survey data Deliverable D1.5 Data mining on the EPIRARE survey data A. Coi 1, M. Santoro 2, M. Lipucci 2, A.M. Bianucci 1, F. Bianchi 2 1 Department of Pharmacy, Unit of Research of Bioinformatic and Computational

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

More information

Journal of Asian Scientific Research COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM2.5 IN HONG KONG RURAL AREA.

Journal of Asian Scientific Research COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM2.5 IN HONG KONG RURAL AREA. Journal of Asian Scientific Research journal homepage: http://aesswebcom/journal-detailphp?id=5003 COMPARISON OF THREE CLASSIFICATION ALGORITHMS FOR PREDICTING PM25 IN HONG KONG RURAL AREA Yin Zhao School

More information

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS B.K. Mohan and S. N. Ladha Centre for Studies in Resources Engineering IIT

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Multi-scale upscaling approaches of soil properties from soil monitoring data

Multi-scale upscaling approaches of soil properties from soil monitoring data local scale landscape scale forest stand/ site level (management unit) Multi-scale upscaling approaches of soil properties from soil monitoring data sampling plot level Motivation: The Need for Regionalization

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

INTELLIGENT ENERGY MANAGEMENT OF ELECTRICAL POWER SYSTEMS WITH DISTRIBUTED FEEDING ON THE BASIS OF FORECASTS OF DEMAND AND GENERATION Chr.

INTELLIGENT ENERGY MANAGEMENT OF ELECTRICAL POWER SYSTEMS WITH DISTRIBUTED FEEDING ON THE BASIS OF FORECASTS OF DEMAND AND GENERATION Chr. INTELLIGENT ENERGY MANAGEMENT OF ELECTRICAL POWER SYSTEMS WITH DISTRIBUTED FEEDING ON THE BASIS OF FORECASTS OF DEMAND AND GENERATION Chr. Meisenbach M. Hable G. Winkler P. Meier Technology, Laboratory

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney Combining GLM and datamining techniques for modelling accident compensation data Peter Mulquiney Introduction Accident compensation data exhibit features which complicate loss reserving and premium rate

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

Neural Networks and Support Vector Machines

Neural Networks and Support Vector Machines INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines

More information

Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

More information

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business

More information

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

How To Do Data Mining In R

How To Do Data Mining In R Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

USE OF REMOTE SENSING FOR MONITORING WETLAND PARAMETERS RELEVANT TO BIRD CONSERVATION

USE OF REMOTE SENSING FOR MONITORING WETLAND PARAMETERS RELEVANT TO BIRD CONSERVATION USE OF REMOTE SENSING FOR MONITORING WETLAND PARAMETERS RELEVANT TO BIRD CONSERVATION AURELIE DAVRANCHE TOUR DU VALAT ONCFS UNIVERSITY OF PROVENCE AIX-MARSEILLE 1 UFR «Sciences géographiques et de l aménagement»

More information

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass

More information

Data Mining and Neural Networks in Stata

Data Mining and Neural Networks in Stata Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Numerical Algorithms Group

Numerical Algorithms Group Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Chapter 3 Communities, Biomes, and Ecosystems

Chapter 3 Communities, Biomes, and Ecosystems Communities, Biomes, and Ecosystems Section 1: Community Ecology Section 2: Terrestrial Biomes Section 3: Aquatic Ecosystems Click on a lesson name to select. 3.1 Community Ecology Communities A biological

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

A fast multi-class SVM learning method for huge databases

A fast multi-class SVM learning method for huge databases www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information