STAT 503X Case Study 2: Italian Olive Oils


1 Description

This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. (An analysis of this data is given in Forina, Armanino, Lanteri & Tiscornia (1983).) There are 9 collection areas: 4 from southern Italy (North and South Apulia, Calabria, Sicily), 2 from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria). The variables available are:

Region: South, North or Sardinia
Area: sub-regions within the larger regions (North and South Apulia, Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria)
Palmitic, Palmitoleic, Stearic, Oleic, Linoleic, Linolenic, Arachidic, Eicosenoic Acid: percentage × 100 in sample

The primary question is: How do we distinguish the oils from different regions and areas in Italy based on their combinations of the fatty acids?
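Throughout what follows the data are assumed to be in a data frame d.olive, with region (1 = South, 2 = Sardinia, 3 = North) in column 1, area (coded 1-9) in column 2, and the eight fatty acids in columns 3-10, matching the indexing used in the R code later. A minimal setup sketch (the file name here is hypothetical):

# Hypothetical file name; any source with this column layout will do.
d.olive <- read.csv("olive.csv")
dim(d.olive)            # expect 572 rows and 10 columns
table(d.olive[, 1])     # samples per region
table(d.olive[, 2])     # samples per area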


2 Suggested Approaches

Approach: Data Restructuring
Reason: This is very clean data, so I don't see any need to restructure.

Approach: Summary Statistics
Reason: To get at location and scale information for each variable, and by groups.
Questions addressed: What is the average percent composition of eicosenoic acid overall? Is there a difference in the average percentage of eicosenoic acid for olives from different growing regions?

Approach: Visual Inspection (univariate plots, bivariate plots, touring plots)
Reason: The aim of visual methods is to understand separations, help decide the best classification method, and hence interpret solutions.
Questions addressed: Are there differences in the fatty acid composition between the olives from different growing regions?

Approach: Numerical Analysis (linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), classification and regression trees (CART), forests, feed-forward neural networks and support vector machines)
Reason: The aim of numerical solutions is to get the best predictive results.
Questions addressed: Can we define the percentage composition of fatty acids that distinguishes olives from different growing regions?

3 Actual Approaches

3.1 Summary Statistics

The following tables contain information on the minimum, maximum, median, mean and standard deviation for the fatty acids, in total and broken down by growing region.

All 572 observations (values reported as percentages):

          palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
Min
Median
Max
Mean
Std. Dev

From the summary statistics we notice that olive oils are mostly constituted of oleic acid. Palmitic acid accounts for about 12% on average, and the rest for less than 10% on average.

By region and area:

            n  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
South
Sardinia
North
Nth Apul
Calabria
Sth Apul
Sicily
Inla Sard
Coas Sard
East Lig
West Lig
Umbria

Southern oils have much higher eicosenoic acid content on average, and slightly higher palmitic and palmitoleic acid content. The northern and Sardinian oils show some difference in the average oleic, linoleic and arachidic acids. Among the southern oils there is some difference in most of the averages. Among the northern oils there is also some difference in most of the averages.
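As a sketch, these summaries can be computed in R from the d.olive data frame (acids in columns 3-10, as assumed above):

acids <- d.olive[, -c(1, 2)]                 # the eight fatty acids
# Overall location and scale summaries for each acid
rbind(Min     = apply(acids, 2, min),
      Median  = apply(acids, 2, median),
      Max     = apply(acids, 2, max),
      Mean    = apply(acids, 2, mean),
      Std.Dev = apply(acids, 2, sd))
# Mean composition by region and by area
aggregate(acids, list(Region = d.olive[, 1]), mean)
aggregate(acids, list(Area = d.olive[, 2]), mean)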

3.2 Visual Inspection

The objective here is to find differences in the measured variables amongst the classes. Differences here might be actual separations between classes on a variable or on a linear combination of variables. The approach is relatively simple:

1. Use color and/or symbol to code the categorical class information into the plot.
2. Begin with low-dimensional plots (histogram, density plot, dot plot, scatterplot) of the measured variables and work up to high-dimensional plots (parallel coordinate plot, tours), exploring class structure in relation to data space.

Regions - Univariate Plots

Using 1D Plot mode, work sequentially through the variables, either by manually selecting variables or by cycling through automatically, to examine separations between regions. It is possible to neatly separate the oils from southern Italy from the other two regions using just one variable, eicosenoic acid. Figure 1 displays a textured dotplot and an ASH plot of this variable. The oils from southern Italy are then removed, and we concentrate on separating the oils from northern Italy and Sardinia. Although a clear separation between these two regions cannot be found using one variable, two of the variables, oleic and linoleic acid, appear to be important for the separation (Figure 1).

Regions - Bivariate Plots

Starting from the two variables identified by the univariate plots as important for separating northern Italian oils from Sardinian oils, the remaining variables are explored in relation to these two using scatterplots. Oleic acid and linoleic acid show some, but not cleanly, separated regions. Arachidic acid and linoleic acid display a clear separation between the regions, but with a very non-linear boundary.

Regions - Multivariate Plots

Starting from the three variables found from the bivariate plots to be important for separating northern Italian and Sardinian oils, we use a higher-dimensional technique to explore them. Using either Rotation, Tour1D or Tour2D, examine the separation between the two regions in the 3-dimensional space. Figure 2 shows the results of using Tour1D on the three variables. The two regions can be separated cleanly by a linear combination of linoleic and arachidic acid.

Areas - Northern Italy

There are three areas in the region: Umbria, East and West Liguria. From univariate plots there are no clear separations between areas, although several variables, for example linoleic acid (Figure 3), show some differences. In bivariate plots, two variables, stearic and linoleic, show some differences between the areas. Examining combinations of variables in Tour1D shows that West Liguria is almost separable from the other two areas using palmitoleic, stearic, linoleic and arachidic acids. Umbria and East Liguria are separable, except for one sample, in a combination of most of the variables. To get this result, the projection pursuit controls were used, followed by manual manipulation, to assess the importance of each variable in the separation of the areas.
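The GGobi plots are interactive; as a rough static sketch of the key univariate view (eicosenoic acid is column 10 of d.olive, as in the plotting code later):

# Density of eicosenoic acid by region (1=South, 2=Sardinia, 3=North)
dens <- lapply(1:3, function(r) density(d.olive[d.olive[, 1] == r, 10]))
plot(dens[[1]], col = 2, xlim = range(d.olive[, 10]),
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
     main = "Eicosenoic acid by region", xlab = "eicosenoic")
lines(dens[[2]], col = 3)
lines(dens[[3]], col = 6)
legend("topright", c("South", "Sardinia", "North"), col = c(2, 3, 6), lty = 1)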

Figure 1: Separation between the 3 regions of Italian olive oil data in univariate plots.

Figure 2: Separation between the northern Italian and Sardinian oils in bivariate plots and multivariate plots.

Areas - Sardinia

Separating the oils from the coastal and inland areas of Sardinia can be done simply, using two variables: oleic and linoleic acid.

Areas - Southern Italy

There are four areas in the southern Italy region: North and South Apulia, Calabria, and Sicily. From univariate plots and bivariate plots there are no clear separations between all four areas. Simplify the work by working with fewer groups: take two of the areas at a time, or three at a time, and plot the data in univariate and bivariate plots. The best results are obtained by removing Sicily. If the oils from this area are removed, then the remaining 3 areas are almost separable using the variables palmitic, palmitoleic, stearic and oleic acids. Figure 4 shows plots taken from the Tour2D of these 4 variables revealing the separations between the areas North and South Apulia, and Calabria. However, the fatty acid content of oils from Sicily is confounded with those of the other areas.

Taking stock

What we have learned from this data is that the olive oils from different geographic regions have dramatically different fatty acid content. The three larger geographic regions, North, South, Sardinia, are well-separated based on eicosenoic, linoleic and arachidic acids. The oils from areas in northern Italy are mostly separable from each other using all the variables. The oils from the inland and coastal areas of Sardinia have different amounts of oleic and linoleic acids. The oils from three of the areas in southern Italy are almost separable. And one is left with the curious content of the oils from Sicily. Why are these oils indistinguishable from the oils of all the other areas in the south? Is there a problem with the quality of these samples?
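A static sketch of that two-variable Sardinian separation (oleic is column 6 and linoleic column 7 of d.olive; treating area code 6 as one of the two Sardinian areas follows the classification code later, but which of codes 5 and 6 is inland versus coastal is an assumption here):

sard <- d.olive[d.olive[, 1] == 2, ]      # the Sardinian samples
plot(sard[, 6], sard[, 7], pch = 16,
     col = ifelse(sard[, 2] == 6, 4, 2),  # color by area code
     xlab = "oleic", ylab = "linoleic")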

Figure 3: Separation in the oils from areas of northern Italy, in univariate, bivariate and multivariate plots.

Figure 4: The areas of southern Italy are mostly separable, except for Sicily.

3.3 Numerical Analysis

The data is broken into training and test samples for this part of the analysis. Approximately 25% of cases within each group are reserved for the test samples. These are the counts used for the test sample, where the data has been sorted by region and area:

Region  Area       # Train  # Test
South   N. Apul
South   Calabria
South   S. Apul
South   Sicily     27       9
Sard    Inland
Sard    Coast      25       8
North   E. Lig
North   W. Lig
North   Umbria

We will use what we learned from the visual analysis to guide the numerical analysis. Some methods restrict classification to just two groups, so we will use this restriction with all the classification methods. We will also drop the Sicilian oils from the study, because there is some question about the validity of these oils. Here are the binary classification steps we will follow:

1. Separate the southern oils from the rest, then the northern oils from the Sardinian oils.
2. Within Sardinia, separate the inland from the coastal oils.
3. Within northern Italy, separate East Liguria from the rest, then West Liguria from Umbria.
4. Within southern Italy (without Sicily), separate Calabria from the rest, then North from South Apulia.
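The indices indx.tr and indx.tst used in all the code below are not constructed in the text; one plausible reconstruction, a stratified roughly-75/25 split within each area (the seed and the rounding are assumptions), is:

set.seed(503)  # arbitrary seed, an assumption
indx.tst <- unlist(lapply(split(seq_len(nrow(d.olive)), d.olive[, 2]),
                          function(i) sample(i, round(length(i)/4))))
indx.tr <- setdiff(seq_len(nrow(d.olive)), indx.tst)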

3.3.1 Methods Used

Classification trees generate a classification tree by sequentially doing binary splits on variables. Splits are decided according to the variable and split value which produce the lower measure of impurity. There are several common measures of impurity, such as Gini and deviance. For example, for a two-class problem with 400 observations in each class, denoted as (400,400), suppose variable 1 produces a split of (300,100) and (100,300), and variable 2 produces a split of (200,400) and (200,0). The split on variable 2 will produce a lower impurity value because one branch is completely pure: with the Gini index, split 1 has weighted impurity 0.375, while split 2 has weighted impurity (600/800) × 4/9 ≈ 0.33 (checked numerically in a short sketch below). There are numerous implementations: R packages such as rpart, and standalone packages such as C4.5 and C5.0.

Classical linear discriminant analysis (LDA) assumes that the populations come from multivariate normal distributions with different means but equal variance-covariance matrices. Linear boundaries result, and it is possible to reduce the dimension of the data into the space of maximum separation using canonical coordinates. The MASS package in R has functions for LDA.

Quadratic discriminant analysis (QDA) assumes that the populations come from multivariate normal distributions with different means and different variance-covariance matrices. Non-linear boundaries are the result. The MASS package in R has functions for QDA as well.

Feed-forward neural networks provide a flexible way to generalize linear regression functions. A simple network model as produced by the nnet code in R (Venables & Ripley 1994) may be represented by the equation:

y = φ(α + Σ_{h=1}^{s} w_h φ(α_h + Σ_{i=1}^{p} w_{ih} x_i))

where x is the vector of explanatory variable values, y is the target value, p is the number of variables, s is the number of nodes in the single hidden layer, and φ is a fixed function, usually a linear or logistic function. This model has a single hidden layer, and univariate output values.
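A short sketch (not part of the original analysis) checking the impurity arithmetic from the classification trees paragraph above:

# Weighted Gini impurity of a two-class split; each row of `split`
# holds the class counts (class 1, class 2) for one branch.
gini <- function(split) {
  n <- rowSums(split)
  p <- split / n                         # class proportions per branch
  sum(n / sum(n) * (1 - rowSums(p^2)))   # size-weighted branch impurity
}
gini(rbind(c(300, 100), c(100, 300)))    # variable 1: 0.375
gini(rbind(c(200, 400), c(200, 0)))      # variable 2: ~0.333, the lower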

SVMs have recently gained widespread attention due to their success at prediction in classification problems. The subject started in the late seventies (Vapnik 1979). The definitive reference is Vapnik (1999), and Burges (1998) gives a simpler tutorial on the subject. The main difference between this and the previously described classification techniques is that SVMs really only work for separating 2 groups. There are some modifications for multiple groups, but these are little more than one might do manually by working pairwise through the groups. SVM algorithms work by finding a subset of points in the two groups that can be separated, which are then known as the support vectors. These support vectors can be used to define a separating hyperplane, w·x + b = 0, where w = Σ_{i=1}^{N_S} α_i y_i x_i, N_S is the number of support vectors, and the α_i satisfy the constraint Σ_{i=1}^{N_S} α_i y_i = 0. In fact we search for the support vectors that give the hyperplane with the biggest separation between the two classes. Note that it is possible to incorporate non-linear kernel functions, which allows for defining non-linear separations, and also that modifications allow SVMs to perform well when the classes are not separable. We use the software SVM Light; the binary and documentation are available from the SVM Light web site.

The Classification

Separating Regions

In separating the southern oils from the other two regions, and then the northern oils from the Sardinian oils, here is the classification tree solution:

If eicosenoic acid > 7 then the region is South (1)
Else if linoleic is above the split value then the region is Sardinia (2)
Else the region is North (3)

There are no misclassifications on the first split, but the second split is problematic: it is a perfect separation for this data, but there is no gap between the two groups. This is easily seen from the plot of the two variables used (Figure 5). So at this point we will shift to using a different method to build a classification rule separating the northern oils from the Sardinian oils. Here is the R code:

# South vs others
d.olive.train<-d.olive[indx.tr,-c(1,2)]
d.olive.test<-d.olive[indx.tst,-c(1,2)]
c1<-rep(-1,572)
c1[d.olive[,1]!=1]<-1
c1.train<-c1[indx.tr]
c1.test<-c1[indx.tst]

# North vs Sardinia
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,1]==3]<-1
c1.train<-c1.train[d.olive.train[,1]!=1]
d.olive.train<-d.olive.train[d.olive.train[,1]!=1,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]

Figure 5: (Left) Plot illustrating the results of the tree classifier. (Right) Second boundary from the LDA solution; the dashed boundary is the tree solution on the LDA predicted values.

c1.test[d.olive.test[,1]==3]<-1
c1.test<-c1.test[d.olive.test[,1]!=1]
d.olive.test<-d.olive.test[d.olive.test[,1]!=1,-c(1,2)]

# Trees
# Load the rpart library
library(rpart)
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

Here is the output from R:

# South vs others
n= 436

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root ( )
  2) eicosenoic>= ( ) *
  3) eicosenoic< ( ) *

c1.train
c1.test

# Sardinia vs North
n= 190

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root ( )
  2) linoleic>= ( ) *
  3) linoleic< ( ) *

c1.train
c1.test

LDA results in one error in the training sample (error rate = 1/190 = 0.005) and no errors in the test sample. The predicted values are well-separated, but the boundary is in the wrong place! (See Figure 5, right plot.)

Table 1: Correlations between the LDA predicted values and the predictors (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic).

The variables oleic, linoleic and arachidic are highly correlated with the predicted values (Table 1). The predicted values show a big gap between the northern oils and the Sardinian oils, but the boundary is set too close to the Sardinian group because LDA assumes equal variances.

# Load the MASS library for lda and qda
library(MASS)
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)

# plot the predictions of the current sample
olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x
olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x
olive.lda.proj<-predict(olive.lda1,d.olive[,-c(1,2)],dimen=1)$x
par(pty="s")
plot(d.olive[,10],olive.lda.proj,type="n",
     xlab="eicosenoic",ylab="LDA Discrim 1")

points(d.olive[d.olive[,1]==1,10],olive.lda.proj[d.olive[,1]==1],pch="x",col=2)
points(d.olive[d.olive[,1]==2,10],olive.lda.proj[d.olive[,1]==2],pch="s",col=3)
points(d.olive[d.olive[,1]==3,10],olive.lda.proj[d.olive[,1]==3],pch=16,col=6)
abline(v=7)
lines(c(0,7),c(-0.49,-0.49))
text(50,7,"South")
text(-2,8,"Sardinia",pos=4)
text(-2,-6,"North",pos=4)

# Predicted values are generated by
#   x' Sigma^(-1) (mean1-mean2) - ((mean1+mean2)/2)' Sigma^(-1) (mean1-mean2)
# So these lines should recreate the predicted values.
xmn<-apply(d.olive.train,2,mean)
x<-as.matrix(d.olive.train-matrix(rep(xmn,190),nrow=190,byrow=T))%*%
  as.matrix(olive.lda1$scaling)

# compute correlations between variables and the predicted values
cor(olive.x,d.olive.train[,1])
cor(olive.x,d.olive.train[,2])
cor(olive.x,d.olive.train[,3])
cor(olive.x,d.olive.train[,4])
cor(olive.x,d.olive.train[,5])
cor(olive.x,d.olive.train[,6])
cor(olive.x,d.olive.train[,7])
cor(olive.x,d.olive.train[,8])

These are the results:

Prior probabilities of groups:

Group means:
palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic

Coefficients of linear discriminants:
            LD1
palmitic
palmitoleic
stearic
oleic
linoleic
linolenic
arachidic

eicosenoic

c1.train
c1.test

Quadratic discriminant analysis provides a perfect classification, but it is more difficult to determine the location of the boundary between the two groups.

# Quadratic discriminant analysis
olive.qda1<-qda(d.olive.train,c1.train)
olive.qda1
table(c1.train,predict(olive.qda1,d.olive.train)$class)
table(c1.test,predict(olive.qda1,d.olive.test)$class)

Trees calculated on the LDA predicted values provide a perfect classification, although the boundary seems a little too close to the northern oils (Figure 5).

# Check if trees will give a better boundary
rpart(c1.train~olive.x,method="class")
par(lty=2)
lines(c(0,7),c(-1.49,-1.49))
par(lty=1)

This gives the classification of the regions as:

If eicosenoic acid > 7 then the region is South (1)
Else if the LDA discriminant (a linear combination of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic) is above the tree's split value then the region is Sardinia (2)
Else the region is North (3)

There is no error associated with this rule.

Sardinian Oils

Split the samples into training and test sets.

# Sardinian oils
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,2]==6]<-1
c1.train<-c1.train[d.olive.train[,1]==2]
d.olive.train<-d.olive.train[d.olive.train[,1]==2,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]

c1.test[d.olive.test[,2]==6]<-1
c1.test<-c1.test[d.olive.test[,1]==2]
d.olive.test<-d.olive.test[d.olive.test[,1]==2,-c(1,2)]

Figure 6: Separation of the inland and coastal Sardinian oils using trees (horizontal axis) and LDA (vertical axis).

A classification tree produces a perfect classification of both training and test samples, but the gap between the groups at the tree's split is rather small compared with the gap for the slightly more complex solution provided by LDA (Figure 6). Here is the code:

# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)

# plot the predictions of the current sample
olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x
olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x
par(mfrow=c(1,1),pty="s")

plot(c(d.olive.train[,5],d.olive.test[,5]),c(olive.x,olive.x2),type="n",
     xlab="Tree",ylab="LDA Discrim")
points(c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),
       c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),
       c(olive.x[c1.train==1],olive.x2[c1.test==1]),pch="x",col=4)
abline(v=1246.5)
abline(h=0.73)
text(1150,-4,"Inland",pos=4)
text(1260,7,"Coastal",pos=4)

The solution for the Sardinian oils then is:

If the LDA discriminant (a linear combination of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic, with coefficients from the fit above) > 0.73, then the sample is from inland Sardinia.

Table 2: Correlations between the variables (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) and the predicted values for the Sardinian oils.

Based on correlations with the predicted values (Table 2), the important variables for separating inland and coastal Sardinia are oleic, linoleic and stearic acids, with palmitic and linolenic being somewhat important.

Northern Oils

Separating the northern oils is more difficult, based on our visual analysis.

East Liguria vs West Liguria, Umbria

The classification difficulty is reflected in the results of the tree classifier: the error in the training set is 8/116 = 0.07 and the error in the test set is 8/35 = 0.23. But it is a simple solution, using just two variables, palmitic and arachidic. (See Figure 7, left plot.) The LDA solution is a little better, with 6/116 = 0.05 errors in the training set and 3/35 = 0.09 in the test set (Figure 7).

# East Liguria vs others
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

plot(c(d.olive.train[,7],d.olive.test[,7]),
     c(d.olive.train[,1],d.olive.test[,1]),type="n",
     xlab="arachidic",ylab="palmitic")
points(c(d.olive.train[c1.train==-1,7],d.olive.test[c1.test==-1,7]),
       c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,7],d.olive.test[c1.test==1,7]),
       c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),pch="x",col=4)

abline(v=59.5)
lines(c(0,59.5),c(1165,1165))
text(80,1400,"East Liguria",pos=4)
text(0,800,"West Liguria, Umbria",pos=4)

# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
olive.x<-predict(olive.lda1,d.olive.train)$x
olive.x2<-predict(olive.lda1,d.olive.test)$x
xmn<-apply(d.olive.train,2,mean)
x<-as.matrix(d.olive.train-matrix(rep(xmn,79),nrow=79,byrow=T))%*%
  as.matrix(olive.lda1$scaling)
xmn%*%as.matrix(olive.lda1$scaling)

plot(c(olive.x,olive.x2),c(c1.train,c1.test),type="n",
     xlab="LDA Discrim",ylab="Area")
points(c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),
       c(c1.train[c1.train==-1],c1.test[c1.test==-1]),pch=16,col=2)
points(c(olive.x[c1.train==1],olive.x2[c1.test==1]),
       c(c1.train[c1.train==1],c1.test[c1.test==1]),pch="x",col=4)
abline(v=0.76)
text(2,0.5,"East Liguria",pos=4)
text(-4,-0.5,"West Liguria, Umbria",pos=4)

# compute correlations between variables and the predicted values
cor(olive.x,d.olive.train[,1])
cor(olive.x,d.olive.train[,2])
cor(olive.x,d.olive.train[,3])
cor(olive.x,d.olive.train[,4])
cor(olive.x,d.olive.train[,5])
cor(olive.x,d.olive.train[,6])
cor(olive.x,d.olive.train[,7])
cor(olive.x,d.olive.train[,8])

Turning to feed-forward neural networks: after several runs the net converges to a good result, a solution that has no errors in the training set but two errors in the test set. The solution is as follows:

Area = (-0.22 palmitic -0.01 palmitoleic -0.13 stearic  oleic  linoleic  linolenic -0.05 arachidic  eicosenoic)
       (0.07 palmitic +0.03 palmitoleic +0.06 stearic  oleic  linoleic  linolenic +0.04 arachidic)
       (-0.10 palmitic +0.10 palmitoleic +0.11 stearic  oleic  linoleic  linolenic -0.26 arachidic  eicosenoic)

The code is as follows:

Figure 7: (Left) Tree solution for East Liguria vs West Liguria, Umbria. (Middle) LDA solution for East Liguria vs West Liguria, Umbria. (Right) LDA solution for West Liguria vs Umbria.

# Load the nnet library
library(nnet)
olive.nn<-nnet(c1.train~.,d.olive.train,size=3,linout=T,decay=5e-4,
               rang=0.6,maxit=1000)
# weights: 31
initial value
iter 10 value
iter 20 value
iter 560 value
final value
converged

> table(c1.train,round(predict(olive.nn,d.olive.train)))
c1.train
> table(c1.test,round(predict(olive.nn,d.olive.test)))
c1.test

Support vector machines (SVM) produce 2 errors in the training set and 6 errors in the test set. The Windows version of SVM Light is used. To run SVM the data need to be written to a file in a special format; the svm_learn and svm_classify executables are then run in a command window. The parameters need some adjustment to get the best results. Linear kernels are used.

The neural network solution is the best, but it is not substantially different from the LDA solution in test set error, so we will use the LDA solution, which is a little simpler.

West Liguria vs Umbria

Trees result in no errors in the training set and 2/22 = 0.09 errors in the test set. LDA results in no errors in either the training or the test set (Figure 7).

# West Liguria vs Umbria
d.olive.train2<-d.olive[indx.tr,]
c2.train<-c(rep(-1,572))[indx.tr]
c2.train[d.olive.train2[,2]==8]<-1
c2.train<-c2.train[d.olive.train2[,1]==3&d.olive.train2[,2]!=7]
d.olive.train2<-d.olive.train2[d.olive.train2[,1]==3&d.olive.train2[,2]!=7,
                               -c(1,2)]
d.olive.test2<-d.olive[indx.tst,]
c2.test<-c(rep(-1,572))[indx.tst]
c2.test[d.olive.test2[,2]==8]<-1
c2.test<-c2.test[d.olive.test2[,1]==3&d.olive.test2[,2]!=7]
d.olive.test2<-d.olive.test2[d.olive.test2[,1]==3&d.olive.test2[,2]!=7,-c(1,2)]

The solution for the northern oils then is:

If the East Liguria LDA discriminant (a linear combination of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic) > 0.76 then the sample is from East Liguria
Else if the West Liguria vs Umbria LDA discriminant (another linear combination of the same variables) exceeds its cutoff then the sample is from West Liguria
Else it is from Umbria.

Table 3: Correlations between the variables (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) and the predicted values for the northern oils: one row for East Liguria vs the others, one for West Liguria vs Umbria.

Based on correlations with the predicted values (Table 3), the important variables for separating East Liguria are palmitic and arachidic, and for separating West Liguria from Umbria the important variables are palmitoleic, stearic, oleic, linoleic, linolenic and arachidic.

Southern Oils

The areas of the southern region promised to be the most difficult to classify. We ignore Sicily to begin.

Calabria vs Others

With trees there are 4/219 = 0.018 errors in the training set, and 2/68 = 0.029 errors in the test set. The plots in Figure 8 illustrate the splits. This is about the best solution of all the methods, surprisingly! Trees make more training errors than some of the other methods, but the most important error to guard against is the test error, and trees give the smallest test error. LDA results in 4/219 = 0.018 errors in the training set and 5/68 = 0.074 errors in the test set. QDA results in only 1/219 = 0.005 errors in the training set but 3/68 = 0.044 in the test set. The best results for the FFNN were 0 errors in the training set but 3/68 = 0.044 errors in the test set. SVM had 1 error in the training set and 4 errors in the test set. This is the code:

Figure 8: The boundaries resulting from a tree classifier on Calabria vs the other southern areas.

# Southern oils - without Sicily
# Calabria vs Others
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,2]==2]<-1
c1.train<-c1.train[d.olive.train[,1]==1&d.olive.train[,2]!=4]
d.olive.train<-d.olive.train[d.olive.train[,1]==1&d.olive.train[,2]!=4,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]
c1.test[d.olive.test[,2]==2]<-1
c1.test<-c1.test[d.olive.test[,1]==1&d.olive.test[,2]!=4]
d.olive.test<-d.olive.test[d.olive.test[,1]==1&d.olive.test[,2]!=4,-c(1,2)]

# Calabria vs Others
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

par(pty="s",mfrow=c(1,2))
plot(c(d.olive.train[,1],d.olive.test[,1]),
     c(d.olive.train[,5],d.olive.test[,5]),
     type="n",xlab="palmitic",ylab="linoleic",main="first split")
points(c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),

       c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),
       c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),pch="x",col=4)
abline(h=946)
text(1000,1400,"others",pos=4)
plot(c(d.olive.train[d.olive.train[,5]<946,1],
       d.olive.test[d.olive.test[,5]<946,1]),
     c(d.olive.train[d.olive.train[,5]<946,6],
       d.olive.test[d.olive.test[,5]<946,6]),
     type="n",xlab="palmitic",ylab="linolenic",main="next two splits")
points(c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,1],
         d.olive.test[c1.test==-1&d.olive.test[,5]<946,1]),
       c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,6],
         d.olive.test[c1.test==-1&d.olive.test[,5]<946,6]),pch=16,col=2)
points(c(d.olive.train[c1.train==1&d.olive.train[,5]<946,1],
         d.olive.test[c1.test==1&d.olive.test[,5]<946,1]),
       c(d.olive.train[c1.train==1&d.olive.train[,5]<946,6],
         d.olive.test[c1.test==1&d.olive.test[,5]<946,6]),pch="x",col=4)
abline(v=1116)
lines(c(1116,1500),c(37,37))
text(900,70,"others",pos=4)
text(1300,65,"Calabria",pos=4)

# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
olive.x<-predict(olive.lda1,d.olive.train)$x
olive.x2<-predict(olive.lda1,d.olive.test)$x

# Quadratic discriminant analysis
olive.qda1<-qda(d.olive.train,c1.train)
olive.qda1
table(c1.train,predict(olive.qda1,d.olive.train)$class)
table(c1.test,predict(olive.qda1,d.olive.test)$class)

# FFNN
# Load the nnet library
library(nnet)
olive.nn<-nnet(c1.train~.,d.olive.train,size=5,linout=T,decay=5e-4,
               rang=0.6,maxit=1000)
table(c1.train,round(predict(olive.nn,d.olive.train)))
table(c1.test,round(predict(olive.nn,d.olive.test)))
olive.x<-predict(olive.nn,d.olive.train)
olive.x2<-predict(olive.nn,d.olive.test)

Here are the results:

# trees
n= 219

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root ( )
   2) linoleic>= ( ) *
   3) linoleic< ( )
     6) palmitic< ( ) *
     7) palmitic>= ( )
      14) linolenic< ( ) *
      15) linolenic>= ( ) *

c1.train
c1.test

# LDA
Call: lda(d.olive.train, c1.train)

Prior probabilities of groups:

Group means:
palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic

Coefficients of linear discriminants:
            LD1
palmitic
palmitoleic
stearic
oleic
linoleic
linolenic
arachidic
eicosenoic

c1.train
c1.test

# QDA
c1.train
c1.test

# FFNN
# weights: 51
initial value
iter 10 value
iter 250 value
final value
converged

c1.train
c1.test

North vs South Apulia

These two areas turn out to be very easy to separate using trees: the training and test sets are both perfectly classified using palmitoleic acid (Figure 9). Here is the code:

# North Apulia vs South Apulia
d.olive.train2<-d.olive[indx.tr,]
c2.train<-c(rep(-1,572))[indx.tr]
c2.train[d.olive.train2[,2]==1]<-1
c2.train<-c2.train[d.olive.train2[,1]==1&d.olive.train2[,2]!=2&
                   d.olive.train2[,2]!=4]
d.olive.train2<-d.olive.train2[d.olive.train2[,1]==1&d.olive.train2[,2]!=2&
                               d.olive.train2[,2]!=4,-c(1,2)]
d.olive.test2<-d.olive[indx.tst,]
c2.test<-c(rep(-1,572))[indx.tst]

Figure 9: North and South Apulia can be separated very cleanly using palmitoleic acid.

c2.test[d.olive.test2[,2]==1]<-1
c2.test<-c2.test[d.olive.test2[,1]==1&d.olive.test2[,2]!=2&
                 d.olive.test2[,2]!=4]
d.olive.test2<-d.olive.test2[d.olive.test2[,1]==1&d.olive.test2[,2]!=2&
                             d.olive.test2[,2]!=4,-c(1,2)]

# Trees
olive.rp<-rpart(c2.train~.,data.frame(d.olive.train2),method="class")
olive.rp
table(c2.train,predict(olive.rp,type="class"))
table(c2.test,predict(olive.rp,data.frame(d.olive.test2),type="class"))

par(mfrow=c(1,1))
hist(c(d.olive.train2[,2],d.olive.test2[,2]),30,col=2,
     xlab="palmitoleic",main="North vs South Apulia",xlim=c(0,300))
text(50,30,"North",pos=4)
text(250,30,"South",pos=4)
abline(v=111.5)

Here are the results:

n= 177

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root ( )
  2) palmitoleic>= ( ) *
  3) palmitoleic< ( ) *

c2.train
c2.test

Now let's review Sicily. Sicily, the problem area, cannot be cleanly separated from the other areas using trees. The error is 15/246 = 0.061 in the training set, and 7/77 = 0.091 in the test set. What about FFNN or SVM? Here is the code:

olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

Here are the results:

n= 246

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root ( )
   2) eicosenoic< ( ) *
   3) eicosenoic>= ( )
     6) stearic< ( )
      12) arachidic< ( ) *
      13) arachidic>= ( ) *
     7) stearic>= ( ) *

c1.train
c1.test

Thus the solution for the areas of the southern region (without Sicily) is:

If linoleic >= 946:
    If palmitoleic >= 111.5 then the area is South Apulia
    Else the area is North Apulia
Else (linoleic < 946):
    If palmitic < 1116:
        If palmitoleic >= 111.5 then the area is South Apulia

        Else the area is North Apulia
    Else (palmitic >= 1116):
        If linolenic < 37:
            If palmitoleic >= 111.5 then the area is South Apulia
            Else the area is North Apulia
        Else (linolenic > 37) the area is Calabria

The test error associated with this classification is 3%.
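The combined rule can be sketched as an R function. The palmitoleic cutoff of 111.5 is an assumption read off the split line drawn in the Figure 9 code (abline(v=111.5)); the other cutoffs are quoted above.

south.area <- function(palmitic, palmitoleic, linoleic, linolenic) {
  # North vs South Apulia decision, applied at three of the four leaves
  apulia <- ifelse(palmitoleic >= 111.5, "South Apulia", "North Apulia")
  ifelse(linoleic >= 946, apulia,
         ifelse(palmitic < 1116, apulia,
                ifelse(linolenic < 37, apulia, "Calabria")))
}
south.area(palmitic = 1200, palmitoleic = 150, linoleic = 900, linolenic = 50)
# expected: "Calabria"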

4 Summary

The classification rule (with values given as percentage × 100) is then:

If eicosenoic acid > 7 then the region is South:
    If eicosenoic acid < 34.5 then the area is Sicily
    Else if stearic < 261 and arachidic < 84.5 then the area is Sicily
    Else (stearic >= 261, or stearic < 261 and arachidic >= 84.5):
        If linoleic >= 946:
            If palmitoleic >= 111.5 then the area is South Apulia
            Else the area is North Apulia
        Else (linoleic < 946):
            If palmitic < 1116:
                If palmitoleic >= 111.5 then the area is South Apulia
                Else the area is North Apulia
            Else (palmitic >= 1116):
                If linolenic < 37:
                    If palmitoleic >= 111.5 then the area is South Apulia
                    Else the area is North Apulia
                Else (linolenic > 37) the area is Calabria
Else if the Sardinia/North LDA discriminant (a linear combination of all eight fatty acids) exceeds its cutoff then the region is Sardinia:
    If the Sardinian LDA discriminant > 0.73 then the sample is from inland Sardinia
    Else the sample is from coastal Sardinia
Else the region is North:
    If the East Liguria LDA discriminant > 0.76 then the sample is from East Liguria
    Else if the West Liguria vs Umbria LDA discriminant exceeds its cutoff then the sample is from West Liguria
    Else it is from Umbria

The error on the test set overall is 12/136 = 0.09 including Sicily and 5/127 = 0.04 without Sicily; the errors for the separate classifications are given in Table 4. The important variables in each of the separations are given in Table 5.

There are several qualifications about the study. We are assuming that the data samples have been collected appropriately, and that the samples are representative of the oils of the areas they are purported to be produced in. There is also a big difference in the sample sizes for each class: South Apulia is the most abundantly represented, and from an internet search this is one of the most prolific olive oil producing areas, while North Apulia has the fewest samples, only 25. Perhaps the sample size is indicative of productivity and should be incorporated into the classifiers.

Classes                          Test Error
Overall (with Sicily)            12/136 = 0.09
Overall (without Sicily)         5/127 = 0.04
South vs Others                  0/136 = 0
North vs Sardinia                0/59 = 0
Inland vs Coastal Sardinia       0/24 = 0
East Liguria vs Others           3/35 = 0.09
Umbria vs West Liguria           0/23 = 0
Sicily vs Others                 7/77 = 0.09
Calabria vs Others               2/68 = 0.03
North vs South Apulia            0/54 = 0

Table 4: Prediction errors.

Separation                       Important variables
Regions: South vs Others         eicosenoic
Regions: Sardinia vs North       oleic, linoleic, arachidic
Sardinia: Coast vs Inland        stearic, oleic, linoleic
North: E. Liguria vs Others      palmitic, arachidic
North: W. Liguria vs Umbria      palmitoleic, stearic, oleic, linoleic, linolenic, arachidic
South: Sicily vs Others          stearic, arachidic, eicosenoic
South: Calabria vs Others        palmitic, linoleic, linolenic
South: N. Apulia vs S. Apulia    palmitoleic

Table 5: The important variables for each separation.

References

Burges, C. J. C. (1998), 'A Tutorial on Support Vector Machines for Pattern Recognition', Data Mining and Knowledge Discovery 2, 121-167.

Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), 'Classification of olive oils from their fatty acid composition', in H. Martens & H. Russwurm Jr., eds, Food Research and Data Analysis, Applied Science Publishers, London.

Vapnik, V. (1979), Estimation of Dependences Based on Empirical Data (in Russian), Nauka, Moscow.

Vapnik, V. (1999), The Nature of Statistical Learning Theory (Statistics for Engineering and Information Science), Springer-Verlag, New York, NY.

Venables, W. N. & Ripley, B. D. (1994), Modern Applied Statistics with S-Plus, Springer-Verlag, New York.


More information

ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS

ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS COMPSTAT 2004 Symposium c Physica-Verlag/Springer 2004 ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS Salvatore Ingrassia and Isabella Morlini Key words: Richly parameterised models, small data

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

Introduction. Chapter 1. 1.1 Before you start. 1.1.1 Formulation

Introduction. Chapter 1. 1.1 Before you start. 1.1.1 Formulation Chapter 1 Introduction 1.1 Before you start Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

Multiple Kernel Learning on the Limit Order Book

Multiple Kernel Learning on the Limit Order Book JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

What is Data mining?

What is Data mining? STAT : DATA MIIG Javier Cabrera Fall Business Question Answer Business Question What is Data mining? Find Data Data Processing Extract Information Data Analysis Internal Databases Data Warehouses Internet

More information

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]

More information

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

3 An Illustrative Example

3 An Illustrative Example Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent

More information

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

More information

Package MDM. February 19, 2015

Package MDM. February 19, 2015 Type Package Title Multinomial Diversity Model Version 1.3 Date 2013-06-28 Package MDM February 19, 2015 Author Glenn De'ath ; Code for mdm was adapted from multinom in the nnet package

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

Large Margin DAGs for Multiclass Classification

Large Margin DAGs for Multiclass Classification S.A. Solla, T.K. Leen and K.-R. Müller (eds.), 57 55, MIT Press (000) Large Margin DAGs for Multiclass Classification John C. Platt Microsoft Research Microsoft Way Redmond, WA 9805 [email protected]

More information

The Value of Visualization 2

The Value of Visualization 2 The Value of Visualization 2 G Janacek -0.69 1.11-3.1 4.0 GJJ () Visualization 1 / 21 Parallel coordinates Parallel coordinates is a common way of visualising high-dimensional geometry and analysing multivariate

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information