STAT 503X Case Study 2: Italian Olive Oils


1 Description

This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. (An analysis of this data is given in Forina, Armanino, Lanteri & Tiscornia (1983).) There are 9 collection areas: 4 from southern Italy (North and South Apulia, Calabria, Sicily), 2 from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria). The variables available are:

Region: South, North or Sardinia
Area: sub-regions within the larger regions (North and South Apulia, Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria)
Palmitic, Palmitoleic, Stearic, Oleic, Linoleic, Linolenic, Arachidic, Eicosenoic Acid: percentage × 100 in sample

The primary question is: How do we distinguish the oils from different regions and areas in Italy based on their combinations of the fatty acids?
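Throughout what follows the data are assumed to be in a data frame d.olive, with region (1 = South, 2 = Sardinia, 3 = North) in column 1, area (coded 1-9) in column 2, and the eight fatty acids in columns 3-10, matching the indexing used in the R code later. A minimal setup sketch (the file name here is hypothetical):

# Hypothetical file name; any source with this column layout will do.
d.olive <- read.csv("olive.csv")
dim(d.olive)            # expect 572 rows and 10 columns
table(d.olive[, 1])     # samples per region
table(d.olive[, 2])     # samples per area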


2 Suggested Approaches

Approach: Data Restructuring
Reason: This is very clean data, so I don't see any need to restructure.

Approach: Summary Statistics
Reason: To get at location and scale information for each variable, and by groups.
Questions addressed: What is the average percent composition of eicosenoic acid overall? Is there a difference in the average percentage of eicosenoic acid for olives from different growing regions?

Approach: Visual Inspection (univariate plots, bivariate plots, touring plots)
Reason: The aim of visual methods is to understand separations, help decide the best classification method, and hence interpret solutions.
Questions addressed: Are there differences in the fatty acid composition between the olives from different growing regions?

Approach: Numerical Analysis (linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), classification and regression trees (CART), forests, feed-forward neural networks and support vector machines)
Reason: The aim of numerical solutions is to get the best predictive results.
Questions addressed: Can we define the percentage composition of fatty acids that distinguishes olives from different growing regions?

3 Actual Approaches

3.1 Summary Statistics

The following tables contain information on the minimum, maximum, median, mean and standard deviation for the fatty acids, in total and broken down by growing region.

All 572 observations (values reported as percentages):

          palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
Min
Median
Max
Mean
Std. Dev

From the summary statistics we notice that olive oils are mostly constituted of oleic acid. Palmitic acid accounts for about 12% on average, and the rest for less than 10% on average.

By region and area:

            n  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
South
Sardinia
North
Nth Apul
Calabria
Sth Apul
Sicily
Inla Sard
Coas Sard
East Lig
West Lig
Umbria

Southern oils have much higher eicosenoic acid content on average, and slightly higher palmitic and palmitoleic acid content. The northern and Sardinian oils show some difference in the average oleic, linoleic and arachidic acids. Among the southern oils there is some difference in most of the averages. Among the northern oils there is also some difference in most of the averages.
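As a sketch, these summaries can be computed in R from the d.olive data frame (acids in columns 3-10, as assumed above):

acids <- d.olive[, -c(1, 2)]                 # the eight fatty acids
# Overall location and scale summaries for each acid
rbind(Min     = apply(acids, 2, min),
      Median  = apply(acids, 2, median),
      Max     = apply(acids, 2, max),
      Mean    = apply(acids, 2, mean),
      Std.Dev = apply(acids, 2, sd))
# Mean composition by region and by area
aggregate(acids, list(Region = d.olive[, 1]), mean)
aggregate(acids, list(Area = d.olive[, 2]), mean)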

3.2 Visual Inspection

The objective here is to find differences in the measured variables amongst the classes. Differences here might be actual separations between classes on a variable or on a linear combination of variables. The approach is relatively simple:

1. Use color and/or symbol to code the categorical class information into the plot.
2. Begin with low-dimensional plots (histogram, density plot, dot plot, scatterplot) of the measured variables and work up to high-dimensional plots (parallel coordinate plot, tours), exploring class structure in relation to data space.

Regions - Univariate Plots

Using 1D Plot mode, work sequentially through the variables, either by manually selecting variables or by cycling through automatically, to examine separations between regions. It is possible to neatly separate the oils from southern Italy from the other two regions using just one variable, eicosenoic acid. Figure 1 displays a textured dotplot and an ASH plot of this variable. The oils from southern Italy are then removed, and we concentrate on separating the oils from northern Italy and Sardinia. Although a clear separation between these two regions cannot be found using one variable, two of the variables, oleic and linoleic acid, appear to be important for the separation (Figure 1).

Regions - Bivariate Plots

Starting from the two variables identified by the univariate plots as important for separating northern Italian oils from Sardinian oils, the remaining variables are explored in relation to these two using scatterplots. Oleic acid and linoleic acid show some, but not cleanly, separated regions. Arachidic acid and linoleic acid display a clear separation between the regions, but with a very non-linear boundary.

Regions - Multivariate Plots

Starting from the three variables found from the bivariate plots to be important for separating northern Italian and Sardinian oils, we use a higher-dimensional technique to explore them. Using either Rotation, Tour1D or Tour2D, examine the separation between the two regions in the 3-dimensional space. Figure 2 shows the results of using Tour1D on the three variables. The two regions can be separated cleanly by a linear combination of linoleic and arachidic acid.

Areas - Northern Italy

There are three areas in the region: Umbria, East and West Liguria. From univariate plots there are no clear separations between areas, although several variables, for example linoleic acid (Figure 3), show some differences. In bivariate plots, two variables, stearic and linoleic, show some differences between the areas. Examining combinations of variables in Tour1D shows that West Liguria is almost separable from the other two areas using palmitoleic, stearic, linoleic and arachidic acids. Umbria and East Liguria are separable, except for one sample, in a combination of most of the variables. To get this result, the projection pursuit controls were used, followed by manual manipulation, to assess the importance of each variable in the separation of the areas.
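The GGobi plots are interactive; as a rough static sketch of the key univariate view (eicosenoic acid is column 10 of d.olive, as in the plotting code later):

# Density of eicosenoic acid by region (1=South, 2=Sardinia, 3=North)
dens <- lapply(1:3, function(r) density(d.olive[d.olive[, 1] == r, 10]))
plot(dens[[1]], col = 2, xlim = range(d.olive[, 10]),
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
     main = "Eicosenoic acid by region", xlab = "eicosenoic")
lines(dens[[2]], col = 3)
lines(dens[[3]], col = 6)
legend("topright", c("South", "Sardinia", "North"), col = c(2, 3, 6), lty = 1)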

Figure 1: Separation between the 3 regions of Italian olive oil data in univariate plots.

Figure 2: Separation between the northern Italian and Sardinian oils in bivariate plots and multivariate plots.

Areas - Sardinia

Separating the oils from the coastal and inland areas of Sardinia can be done simply, using two variables: oleic and linoleic acid.

Areas - Southern Italy

There are four areas in the southern Italy region: North and South Apulia, Calabria, and Sicily. From univariate plots and bivariate plots there are no clear separations between all four areas. Simplify the work by working with fewer groups: take two of the areas at a time, or three at a time, and plot the data in univariate and bivariate plots. The best results are obtained by removing Sicily. If the oils from this area are removed, then the remaining 3 areas are almost separable using the variables palmitic, palmitoleic, stearic and oleic acids. Figure 4 shows plots taken from the Tour2D of these 4 variables revealing the separations between the areas North and South Apulia, and Calabria. However, the fatty acid content of oils from Sicily is confounded with those of the other areas.

Taking stock

What we have learned from this data is that the olive oils from different geographic regions have dramatically different fatty acid content. The three larger geographic regions, North, South, Sardinia, are well-separated based on eicosenoic, linoleic and arachidic acids. The oils from areas in northern Italy are mostly separable from each other using all the variables. The oils from the inland and coastal areas of Sardinia have different amounts of oleic and linoleic acids. The oils from three of the areas in southern Italy are almost separable. And one is left with the curious content of the oils from Sicily. Why are these oils indistinguishable from the oils of all the other areas in the south? Is there a problem with the quality of these samples?
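A static sketch of that two-variable Sardinian separation (oleic is column 6 and linoleic column 7 of d.olive; treating area code 6 as one of the two Sardinian areas follows the classification code later, but which of codes 5 and 6 is inland versus coastal is an assumption here):

sard <- d.olive[d.olive[, 1] == 2, ]      # the Sardinian samples
plot(sard[, 6], sard[, 7], pch = 16,
     col = ifelse(sard[, 2] == 6, 4, 2),  # color by area code
     xlab = "oleic", ylab = "linoleic")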

Figure 3: Separation in the oils from areas of northern Italy, in univariate, bivariate and multivariate plots.

Figure 4: The areas of southern Italy are mostly separable, except for Sicily.

3.3 Numerical Analysis

The data is broken into training and test samples for this part of the analysis. Approximately 25% of cases within each group are reserved for the test samples. These are the counts used for the test sample, where the data has been sorted by region and area:

Region  Area       # Train  # Test
South   N. Apul
South   Calabria
South   S. Apul
South   Sicily     27       9
Sard    Inland
Sard    Coast      25       8
North   E. Lig
North   W. Lig
North   Umbria

We will use what we learned from the visual analysis to guide the numerical analysis. Some methods restrict classification to just two groups, so we will use this restriction with all the classification methods. We will also drop the Sicilian oils from the study, because there is some question about the validity of these oils. Here are the binary classification steps we will follow:

1. Separate the southern oils from the rest, then the northern oils from the Sardinian oils.
2. Within Sardinia, separate the inland from the coastal oils.
3. Within northern Italy, separate East Liguria from the rest, then West Liguria from Umbria.
4. Within southern Italy (without Sicily), separate Calabria from the rest, then North from South Apulia.
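The indices indx.tr and indx.tst used in all the code below are not constructed in the text; one plausible reconstruction, a stratified roughly-75/25 split within each area (the seed and the rounding are assumptions), is:

set.seed(503)  # arbitrary seed, an assumption
indx.tst <- unlist(lapply(split(seq_len(nrow(d.olive)), d.olive[, 2]),
                          function(i) sample(i, round(length(i)/4))))
indx.tr <- setdiff(seq_len(nrow(d.olive)), indx.tst)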

3.3.1 Methods Used

Classification trees generate a classification tree by sequentially doing binary splits on variables. Splits are decided according to the variable and split value which produce the lower measure of impurity. There are several common measures of impurity, such as Gini and deviance. For example, for a two-class problem with 400 observations in each class, denoted as (400,400), suppose variable 1 produces a split of (300,100) and (100,300), and variable 2 produces a split of (200,400) and (200,0). The split on variable 2 will produce a lower impurity value because one branch is completely pure: with the Gini index, split 1 has weighted impurity 0.375, while split 2 has weighted impurity (600/800) × 4/9 ≈ 0.33 (checked numerically in a short sketch below). There are numerous implementations: R packages such as rpart, and standalone packages such as C4.5 and C5.0.

Classical linear discriminant analysis (LDA) assumes that the populations come from multivariate normal distributions with different means but equal variance-covariance matrices. Linear boundaries result, and it is possible to reduce the dimension of the data into the space of maximum separation using canonical coordinates. The MASS package in R has functions for LDA.

Quadratic discriminant analysis (QDA) assumes that the populations come from multivariate normal distributions with different means and different variance-covariance matrices. Non-linear boundaries are the result. The MASS package in R has functions for QDA as well.

Feed-forward neural networks provide a flexible way to generalize linear regression functions. A simple network model as produced by the nnet code in R (Venables & Ripley 1994) may be represented by the equation:

y = φ(α + Σ_{h=1}^{s} w_h φ(α_h + Σ_{i=1}^{p} w_{ih} x_i))

where x is the vector of explanatory variable values, y is the target value, p is the number of variables, s is the number of nodes in the single hidden layer, and φ is a fixed function, usually a linear or logistic function. This model has a single hidden layer, and univariate output values.
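A short sketch (not part of the original analysis) checking the impurity arithmetic from the classification trees paragraph above:

# Weighted Gini impurity of a two-class split; each row of `split`
# holds the class counts (class 1, class 2) for one branch.
gini <- function(split) {
  n <- rowSums(split)
  p <- split / n                         # class proportions per branch
  sum(n / sum(n) * (1 - rowSums(p^2)))   # size-weighted branch impurity
}
gini(rbind(c(300, 100), c(100, 300)))    # variable 1: 0.375
gini(rbind(c(200, 400), c(200, 0)))      # variable 2: ~0.333, the lower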

SVMs have recently gained widespread attention due to their success at prediction in classification problems. The subject started in the late seventies (Vapnik 1979). The definitive reference is Vapnik (1999), and Burges (1998) gives a simpler tutorial on the subject. The main difference between this and the previously described classification techniques is that SVMs really only work for separating 2 groups. There are some modifications for multiple groups, but these are little more than one might do manually by working pairwise through the groups. SVM algorithms work by finding a subset of points in the two groups that can be separated, which are then known as the support vectors. These support vectors can be used to define a separating hyperplane, w·x + b = 0, where w = Σ_{i=1}^{N_S} α_i y_i x_i, N_S is the number of support vectors, and the α_i satisfy the constraint Σ_{i=1}^{N_S} α_i y_i = 0. In fact we search for the support vectors that give the hyperplane with the biggest separation between the two classes. Note that it is possible to incorporate non-linear kernel functions, which allows for defining non-linear separations, and also that modifications allow SVMs to perform well when the classes are not separable. We use the software SVM Light; the binary and documentation are available from the SVM Light web site.

The Classification

Separating Regions

In separating the southern oils from the other two regions, and then the northern oils from the Sardinian oils, here is the classification tree solution:

If eicosenoic acid > 7 then the region is South (1)
Else if linoleic is above the split value then the region is Sardinia (2)
Else the region is North (3)

There are no misclassifications on the first split, but the second split is problematic: it is a perfect separation for this data, but there is no gap between the two groups. This is easily seen from the plot of the two variables used (Figure 5). So at this point we will shift to using a different method to build a classification rule separating the northern oils from the Sardinian oils. Here is the R code:

# South vs others
d.olive.train<-d.olive[indx.tr,-c(1,2)]
d.olive.test<-d.olive[indx.tst,-c(1,2)]
c1<-rep(-1,572)
c1[d.olive[,1]!=1]<-1
c1.train<-c1[indx.tr]
c1.test<-c1[indx.tst]

# North vs Sardinia
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,1]==3]<-1
c1.train<-c1.train[d.olive.train[,1]!=1]
d.olive.train<-d.olive.train[d.olive.train[,1]!=1,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]

Figure 5: (Left) Plot illustrating the results of the tree classifier. (Right) Second boundary from the LDA solution; the dashed boundary is the tree solution on the LDA predicted values.

c1.test[d.olive.test[,1]==3]<-1
c1.test<-c1.test[d.olive.test[,1]!=1]
d.olive.test<-d.olive.test[d.olive.test[,1]!=1,-c(1,2)]

# Trees
# Load the rpart library
library(rpart)
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

Here is the output from R:

# South vs others
n= 436

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root ( )
  2) eicosenoic>= ( ) *
  3) eicosenoic< ( ) *

c1.train
c1.test

# Sardinia vs North
n= 190

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root ( )
  2) linoleic>= ( ) *
  3) linoleic< ( ) *

c1.train
c1.test

LDA results in one error in the training sample (error rate = 1/190 = 0.005) and no errors in the test sample. The predicted values are well-separated, but the boundary is in the wrong place! (See Figure 5, right plot.)

Table 1: Correlations between the LDA predicted values and the predictors (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic).

The variables oleic, linoleic and arachidic are highly correlated with the predicted values (Table 1). The predicted values show a big gap between the northern oils and the Sardinian oils, but the boundary is set too close to the Sardinian group because LDA assumes equal variances.

# Load the MASS library for lda and qda
library(MASS)
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)

# plot the predictions of the current sample
olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x
olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x
olive.lda.proj<-predict(olive.lda1,d.olive[,-c(1,2)],dimen=1)$x
par(pty="s")
plot(d.olive[,10],olive.lda.proj,type="n",
     xlab="eicosenoic",ylab="LDA Discrim 1")

points(d.olive[d.olive[,1]==1,10],olive.lda.proj[d.olive[,1]==1],pch="x",col=2)
points(d.olive[d.olive[,1]==2,10],olive.lda.proj[d.olive[,1]==2],pch="s",col=3)
points(d.olive[d.olive[,1]==3,10],olive.lda.proj[d.olive[,1]==3],pch=16,col=6)
abline(v=7)
lines(c(0,7),c(-0.49,-0.49))
text(50,7,"South")
text(-2,8,"Sardinia",pos=4)
text(-2,-6,"North",pos=4)

# Predicted values are generated by
#   x' Sigma^(-1) (mean1-mean2) - ((mean1+mean2)/2)' Sigma^(-1) (mean1-mean2)
# So these lines should recreate the predicted values.
xmn<-apply(d.olive.train,2,mean)
x<-as.matrix(d.olive.train-matrix(rep(xmn,190),nrow=190,byrow=T))%*%
  as.matrix(olive.lda1$scaling)

# compute correlations between variables and the predicted values
cor(olive.x,d.olive.train[,1])
cor(olive.x,d.olive.train[,2])
cor(olive.x,d.olive.train[,3])
cor(olive.x,d.olive.train[,4])
cor(olive.x,d.olive.train[,5])
cor(olive.x,d.olive.train[,6])
cor(olive.x,d.olive.train[,7])
cor(olive.x,d.olive.train[,8])

These are the results:

Prior probabilities of groups:

Group means:
palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic

Coefficients of linear discriminants:
            LD1
palmitic
palmitoleic
stearic
oleic
linoleic
linolenic
arachidic

eicosenoic

c1.train
c1.test

Quadratic discriminant analysis provides a perfect classification, but it is more difficult to determine the location of the boundary between the two groups.

# Quadratic discriminant analysis
olive.qda1<-qda(d.olive.train,c1.train)
olive.qda1
table(c1.train,predict(olive.qda1,d.olive.train)$class)
table(c1.test,predict(olive.qda1,d.olive.test)$class)

Trees calculated on the LDA predicted values provide a perfect classification, although the boundary seems a little too close to the northern oils (Figure 5).

# Check if trees will give a better boundary
rpart(c1.train~olive.x,method="class")
par(lty=2)
lines(c(0,7),c(-1.49,-1.49))
par(lty=1)

This gives the classification of the regions as:

If eicosenoic acid > 7 then the region is South (1)
Else if the LDA discriminant (a linear combination of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic) is above the tree's split value then the region is Sardinia (2)
Else the region is North (3)

There is no error associated with this rule.

Sardinian Oils

Split the samples into training and test sets.

# Sardinian oils
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,2]==6]<-1
c1.train<-c1.train[d.olive.train[,1]==2]
d.olive.train<-d.olive.train[d.olive.train[,1]==2,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]

c1.test[d.olive.test[,2]==6]<-1
c1.test<-c1.test[d.olive.test[,1]==2]
d.olive.test<-d.olive.test[d.olive.test[,1]==2,-c(1,2)]

Figure 6: Separation of the inland and coastal Sardinian oils using trees (horizontal axis) and LDA (vertical axis).

A classification tree produces a perfect classification of both training and test samples, but the gap between the groups at the tree's split is rather small compared with the gap for the slightly more complex solution provided by LDA (Figure 6). Here is the code:

# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)

# plot the predictions of the current sample
olive.x<-predict(olive.lda1,d.olive.train, dimen=1)$x
olive.x2<-predict(olive.lda1,d.olive.test, dimen=1)$x
par(mfrow=c(1,1),pty="s")

plot(c(d.olive.train[,5],d.olive.test[,5]),c(olive.x,olive.x2),type="n",
     xlab="Tree",ylab="LDA Discrim")
points(c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),
       c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),
       c(olive.x[c1.train==1],olive.x2[c1.test==1]),pch="x",col=4)
abline(v=1246.5)
abline(h=0.73)
text(1150,-4,"Inland",pos=4)
text(1260,7,"Coastal",pos=4)

The solution for the Sardinian oils then is:

If the LDA discriminant (a linear combination of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic, with coefficients from the fit above) > 0.73, then the sample is from inland Sardinia.

Table 2: Correlations between the variables (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) and the predicted values for the Sardinian oils.

Based on correlations with the predicted values (Table 2), the important variables for separating inland and coastal Sardinia are oleic, linoleic and stearic acids, with palmitic and linolenic being somewhat important.

Northern Oils

Separating the northern oils is more difficult, based on our visual analysis.

East Liguria vs West Liguria, Umbria

The classification difficulty is reflected in the results of the tree classifier: the error in the training set is 8/116 = 0.07 and the error in the test set is 8/35 = 0.23. But it is a simple solution, using just two variables, palmitic and arachidic. (See Figure 7, left plot.) The LDA solution is a little better, with 6/116 = 0.05 errors in the training set and 3/35 = 0.09 in the test set (Figure 7).

# East Liguria vs others
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

plot(c(d.olive.train[,7],d.olive.test[,7]),
     c(d.olive.train[,1],d.olive.test[,1]),type="n",
     xlab="arachidic",ylab="palmitic")
points(c(d.olive.train[c1.train==-1,7],d.olive.test[c1.test==-1,7]),
       c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,7],d.olive.test[c1.test==1,7]),
       c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),pch="x",col=4)

abline(v=59.5)
lines(c(0,59.5),c(1165,1165))
text(80,1400,"East Liguria",pos=4)
text(0,800,"West Liguria, Umbria",pos=4)

# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
olive.x<-predict(olive.lda1,d.olive.train)$x
olive.x2<-predict(olive.lda1,d.olive.test)$x
xmn<-apply(d.olive.train,2,mean)
x<-as.matrix(d.olive.train-matrix(rep(xmn,79),nrow=79,byrow=T))%*%
  as.matrix(olive.lda1$scaling)
xmn%*%as.matrix(olive.lda1$scaling)

plot(c(olive.x,olive.x2),c(c1.train,c1.test),type="n",
     xlab="LDA Discrim",ylab="Area")
points(c(olive.x[c1.train==-1],olive.x2[c1.test==-1]),
       c(c1.train[c1.train==-1],c1.test[c1.test==-1]),pch=16,col=2)
points(c(olive.x[c1.train==1],olive.x2[c1.test==1]),
       c(c1.train[c1.train==1],c1.test[c1.test==1]),pch="x",col=4)
abline(v=0.76)
text(2,0.5,"East Liguria",pos=4)
text(-4,-0.5,"West Liguria, Umbria",pos=4)

# compute correlations between variables and the predicted values
cor(olive.x,d.olive.train[,1])
cor(olive.x,d.olive.train[,2])
cor(olive.x,d.olive.train[,3])
cor(olive.x,d.olive.train[,4])
cor(olive.x,d.olive.train[,5])
cor(olive.x,d.olive.train[,6])
cor(olive.x,d.olive.train[,7])
cor(olive.x,d.olive.train[,8])

Turning to feed-forward neural networks: after several runs the net converges to a good result, a solution that has no errors in the training set but two errors in the test set. The solution is as follows:

Area = (-0.22 palmitic -0.01 palmitoleic -0.13 stearic  oleic  linoleic  linolenic -0.05 arachidic  eicosenoic)
       (0.07 palmitic +0.03 palmitoleic +0.06 stearic  oleic  linoleic  linolenic +0.04 arachidic)
       (-0.10 palmitic +0.10 palmitoleic +0.11 stearic  oleic  linoleic  linolenic -0.26 arachidic  eicosenoic)

The code is as follows:

Figure 7: (Left) Tree solution for East Liguria vs West Liguria, Umbria. (Middle) LDA solution for East Liguria vs West Liguria, Umbria. (Right) LDA solution for West Liguria vs Umbria.

# Load the nnet library
library(nnet)
olive.nn<-nnet(c1.train~.,d.olive.train,size=3,linout=T,decay=5e-4,
               rang=0.6,maxit=1000)
# weights: 31
initial value
iter 10 value
iter 20 value
iter 560 value
final value
converged

> table(c1.train,round(predict(olive.nn,d.olive.train)))
c1.train
> table(c1.test,round(predict(olive.nn,d.olive.test)))
c1.test

Support vector machines (SVM) produce 2 errors in the training set and 6 errors in the test set. The Windows version of SVM Light is used. To run SVM the data need to be written to a file in a special format; the svm_learn and svm_classify executables are then run in a command window. The parameters need some adjustment to get the best results. Linear kernels are used.

The neural network solution is the best, but it is not substantially different from the LDA solution in test set error, so we will use the LDA solution, which is a little simpler.

West Liguria vs Umbria

Trees result in no errors in the training set and 2/22 = 0.09 errors in the test set. LDA results in no errors in either the training or the test set (Figure 7).

# West Liguria vs Umbria
d.olive.train2<-d.olive[indx.tr,]
c2.train<-c(rep(-1,572))[indx.tr]
c2.train[d.olive.train2[,2]==8]<-1
c2.train<-c2.train[d.olive.train2[,1]==3&d.olive.train2[,2]!=7]
d.olive.train2<-d.olive.train2[d.olive.train2[,1]==3&d.olive.train2[,2]!=7,
                               -c(1,2)]
d.olive.test2<-d.olive[indx.tst,]
c2.test<-c(rep(-1,572))[indx.tst]
c2.test[d.olive.test2[,2]==8]<-1
c2.test<-c2.test[d.olive.test2[,1]==3&d.olive.test2[,2]!=7]
d.olive.test2<-d.olive.test2[d.olive.test2[,1]==3&d.olive.test2[,2]!=7,-c(1,2)]

The solution for the northern oils then is:

If the East Liguria LDA discriminant (a linear combination of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic) > 0.76 then the sample is from East Liguria
Else if the West Liguria vs Umbria LDA discriminant (another linear combination of the same variables) exceeds its cutoff then the sample is from West Liguria
Else it is from Umbria.

Table 3: Correlations between the variables (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) and the predicted values for the northern oils: one row for East Liguria vs the others, one for West Liguria vs Umbria.

Based on correlations with the predicted values (Table 3), the important variables for separating East Liguria are palmitic and arachidic, and for separating West Liguria from Umbria the important variables are palmitoleic, stearic, oleic, linoleic, linolenic and arachidic.

Southern Oils

The areas of the southern region promised to be the most difficult to classify. We ignore Sicily to begin.

Calabria vs Others

With trees there are 4/219 = 0.018 errors in the training set, and 2/68 = 0.029 errors in the test set. The plots in Figure 8 illustrate the splits. This is about the best solution of all the methods, surprisingly! Trees make more training errors than some of the other methods, but the most important error to guard against is the test error, and trees give the smallest test error. LDA results in 4/219 = 0.018 errors in the training set and 5/68 = 0.074 errors in the test set. QDA results in only 1/219 = 0.005 errors in the training set but 3/68 = 0.044 in the test set. The best results for the FFNN were 0 errors in the training set but 3/68 = 0.044 errors in the test set. SVM had 1 error in the training set and 4 errors in the test set. This is the code:

Figure 8: The boundaries resulting from a tree classifier on Calabria vs the other southern areas.

# Southern oils - without Sicily
# Calabria vs Others
d.olive.train<-d.olive[indx.tr,]
c1.train<-c(rep(-1,572))[indx.tr]
c1.train[d.olive.train[,2]==2]<-1
c1.train<-c1.train[d.olive.train[,1]==1&d.olive.train[,2]!=4]
d.olive.train<-d.olive.train[d.olive.train[,1]==1&d.olive.train[,2]!=4,-c(1,2)]
d.olive.test<-d.olive[indx.tst,]
c1.test<-c(rep(-1,572))[indx.tst]
c1.test[d.olive.test[,2]==2]<-1
c1.test<-c1.test[d.olive.test[,1]==1&d.olive.test[,2]!=4]
d.olive.test<-d.olive.test[d.olive.test[,1]==1&d.olive.test[,2]!=4,-c(1,2)]

# Calabria vs Others
# Trees
olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

par(pty="s",mfrow=c(1,2))
plot(c(d.olive.train[,1],d.olive.test[,1]),
     c(d.olive.train[,5],d.olive.test[,5]),
     type="n",xlab="palmitic",ylab="linoleic",main="first split")
points(c(d.olive.train[c1.train==-1,1],d.olive.test[c1.test==-1,1]),

       c(d.olive.train[c1.train==-1,5],d.olive.test[c1.test==-1,5]),pch=16,col=2)
points(c(d.olive.train[c1.train==1,1],d.olive.test[c1.test==1,1]),
       c(d.olive.train[c1.train==1,5],d.olive.test[c1.test==1,5]),pch="x",col=4)
abline(h=946)
text(1000,1400,"others",pos=4)
plot(c(d.olive.train[d.olive.train[,5]<946,1],
       d.olive.test[d.olive.test[,5]<946,1]),
     c(d.olive.train[d.olive.train[,5]<946,6],
       d.olive.test[d.olive.test[,5]<946,6]),
     type="n",xlab="palmitic",ylab="linolenic",main="next two splits")
points(c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,1],
         d.olive.test[c1.test==-1&d.olive.test[,5]<946,1]),
       c(d.olive.train[c1.train==-1&d.olive.train[,5]<946,6],
         d.olive.test[c1.test==-1&d.olive.test[,5]<946,6]),pch=16,col=2)
points(c(d.olive.train[c1.train==1&d.olive.train[,5]<946,1],
         d.olive.test[c1.test==1&d.olive.test[,5]<946,1]),
       c(d.olive.train[c1.train==1&d.olive.train[,5]<946,6],
         d.olive.test[c1.test==1&d.olive.test[,5]<946,6]),pch="x",col=4)
abline(v=1116)
lines(c(1116,1500),c(37,37))
text(900,70,"others",pos=4)
text(1300,65,"Calabria",pos=4)

# Linear discriminant analysis
olive.lda1<-lda(d.olive.train,c1.train)
olive.lda1
table(c1.train,predict(olive.lda1,d.olive.train)$class)
table(c1.test,predict(olive.lda1,d.olive.test)$class)
olive.x<-predict(olive.lda1,d.olive.train)$x
olive.x2<-predict(olive.lda1,d.olive.test)$x

# Quadratic discriminant analysis
olive.qda1<-qda(d.olive.train,c1.train)
olive.qda1
table(c1.train,predict(olive.qda1,d.olive.train)$class)
table(c1.test,predict(olive.qda1,d.olive.test)$class)

# FFNN
# Load the nnet library
library(nnet)
olive.nn<-nnet(c1.train~.,d.olive.train,size=5,linout=T,decay=5e-4,
               rang=0.6,maxit=1000)
table(c1.train,round(predict(olive.nn,d.olive.train)))
table(c1.test,round(predict(olive.nn,d.olive.test)))
olive.x<-predict(olive.nn,d.olive.train)
olive.x2<-predict(olive.nn,d.olive.test)

Here are the results:

# trees
n= 219

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root ( )
   2) linoleic>= ( ) *
   3) linoleic< ( )
     6) palmitic< ( ) *
     7) palmitic>= ( )
      14) linolenic< ( ) *
      15) linolenic>= ( ) *

c1.train
c1.test

# LDA
Call: lda(d.olive.train, c1.train)

Prior probabilities of groups:

Group means:
palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic

Coefficients of linear discriminants:
            LD1
palmitic
palmitoleic
stearic
oleic
linoleic
linolenic
arachidic
eicosenoic

c1.train
c1.test

# QDA
c1.train
c1.test

# FFNN
# weights: 51
initial value
iter 10 value
iter 250 value
final value
converged

c1.train
c1.test

North vs South Apulia

These two areas turn out to be very easy to separate using trees: the training and test sets are both perfectly classified using palmitoleic acid (Figure 9). Here is the code:

# North Apulia vs South Apulia
d.olive.train2<-d.olive[indx.tr,]
c2.train<-c(rep(-1,572))[indx.tr]
c2.train[d.olive.train2[,2]==1]<-1
c2.train<-c2.train[d.olive.train2[,1]==1&d.olive.train2[,2]!=2&
                   d.olive.train2[,2]!=4]
d.olive.train2<-d.olive.train2[d.olive.train2[,1]==1&d.olive.train2[,2]!=2&
                               d.olive.train2[,2]!=4,-c(1,2)]
d.olive.test2<-d.olive[indx.tst,]
c2.test<-c(rep(-1,572))[indx.tst]

Figure 9: North and South Apulia can be separated very cleanly using palmitoleic acid.

c2.test[d.olive.test2[,2]==1]<-1
c2.test<-c2.test[d.olive.test2[,1]==1&d.olive.test2[,2]!=2&
                 d.olive.test2[,2]!=4]
d.olive.test2<-d.olive.test2[d.olive.test2[,1]==1&d.olive.test2[,2]!=2&
                             d.olive.test2[,2]!=4,-c(1,2)]

# Trees
olive.rp<-rpart(c2.train~.,data.frame(d.olive.train2),method="class")
olive.rp
table(c2.train,predict(olive.rp,type="class"))
table(c2.test,predict(olive.rp,data.frame(d.olive.test2),type="class"))

par(mfrow=c(1,1))
hist(c(d.olive.train2[,2],d.olive.test2[,2]),30,col=2,
     xlab="palmitoleic",main="North vs South Apulia",xlim=c(0,300))
text(50,30,"North",pos=4)
text(250,30,"South",pos=4)
abline(v=111.5)

Here are the results:

n= 177

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root ( )
  2) palmitoleic>= ( ) *
  3) palmitoleic< ( ) *

c2.train
c2.test

Now let's review Sicily. Sicily, the problem area, cannot be cleanly separated from the other areas using trees. The error is 15/246 = 0.061 in the training set, and 7/77 = 0.091 in the test set. What about FFNN or SVM? Here is the code:

olive.rp<-rpart(c1.train~.,data.frame(d.olive.train),method="class")
olive.rp
table(c1.train,predict(olive.rp,type="class"))
table(c1.test,predict(olive.rp,data.frame(d.olive.test),type="class"))

Here are the results:

n= 246

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root ( )
   2) eicosenoic< ( ) *
   3) eicosenoic>= ( )
     6) stearic< ( )
      12) arachidic< ( ) *
      13) arachidic>= ( ) *
     7) stearic>= ( ) *

c1.train
c1.test

Thus the solution for the areas of the southern region (without Sicily) is:

If linoleic >= 946:
    If palmitoleic >= 111.5 then the area is South Apulia
    Else the area is North Apulia
Else (linoleic < 946):
    If palmitic < 1116:
        If palmitoleic >= 111.5 then the area is South Apulia

        Else the area is North Apulia
    Else (palmitic >= 1116):
        If linolenic < 37:
            If palmitoleic >= 111.5 then the area is South Apulia
            Else the area is North Apulia
        Else (linolenic > 37) the area is Calabria

The test error associated with this classification is 3%.
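The combined rule can be sketched as an R function. The palmitoleic cutoff of 111.5 is an assumption read off the split line drawn in the Figure 9 code (abline(v=111.5)); the other cutoffs are quoted above.

south.area <- function(palmitic, palmitoleic, linoleic, linolenic) {
  # North vs South Apulia decision, applied at three of the four leaves
  apulia <- ifelse(palmitoleic >= 111.5, "South Apulia", "North Apulia")
  ifelse(linoleic >= 946, apulia,
         ifelse(palmitic < 1116, apulia,
                ifelse(linolenic < 37, apulia, "Calabria")))
}
south.area(palmitic = 1200, palmitoleic = 150, linoleic = 900, linolenic = 50)
# expected: "Calabria"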

4 Summary

The classification rule (with values given as percentage × 100) is then:

If eicosenoic acid > 7 then the region is South:
    If eicosenoic acid < 34.5 then the area is Sicily
    Else if stearic < 261 and arachidic < 84.5 then the area is Sicily
    Else (stearic >= 261, or stearic < 261 and arachidic >= 84.5):
        If linoleic >= 946:
            If palmitoleic >= 111.5 then the area is South Apulia
            Else the area is North Apulia
        Else (linoleic < 946):
            If palmitic < 1116:
                If palmitoleic >= 111.5 then the area is South Apulia
                Else the area is North Apulia
            Else (palmitic >= 1116):
                If linolenic < 37:
                    If palmitoleic >= 111.5 then the area is South Apulia
                    Else the area is North Apulia
                Else (linolenic > 37) the area is Calabria
Else if the Sardinia/North LDA discriminant (a linear combination of all eight fatty acids) exceeds its cutoff then the region is Sardinia:
    If the Sardinian LDA discriminant > 0.73 then the sample is from inland Sardinia
    Else the sample is from coastal Sardinia
Else the region is North:
    If the East Liguria LDA discriminant > 0.76 then the sample is from East Liguria
    Else if the West Liguria vs Umbria LDA discriminant exceeds its cutoff then the sample is from West Liguria
    Else it is from Umbria

The error on the test set overall is 12/136 = 0.09 including Sicily and 5/127 = 0.04 without Sicily; the errors for the separate classifications are given in Table 4. The important variables in each of the separations are given in Table 5.

There are several qualifications about the study. We are assuming that the data samples have been collected appropriately, and that the samples are representative of the oils of the areas they are purported to be produced in. There is also a big difference in the sample sizes for each class: South Apulia is the most abundantly represented, and from an internet search this is one of the most prolific olive oil producing areas, while North Apulia has the fewest samples, only 25. Perhaps the sample size is indicative of productivity and should be incorporated into the classifiers.

Classes                          Test Error
Overall (with Sicily)            12/136 = 0.09
Overall (without Sicily)         5/127 = 0.04
South vs Others                  0/136 = 0
North vs Sardinia                0/59 = 0
Inland vs Coastal Sardinia       0/24 = 0
East Liguria vs Others           3/35 = 0.09
Umbria vs West Liguria           0/23 = 0
Sicily vs Others                 7/77 = 0.09
Calabria vs Others               2/68 = 0.03
North vs South Apulia            0/54 = 0

Table 4: Prediction errors.

Separation                       Important variables
Regions: South vs Others         eicosenoic
Regions: Sardinia vs North       oleic, linoleic, arachidic
Sardinia: Coast vs Inland        stearic, oleic, linoleic
North: E. Liguria vs Others      palmitic, arachidic
North: W. Liguria vs Umbria      palmitoleic, stearic, oleic, linoleic, linolenic, arachidic
South: Sicily vs Others          stearic, arachidic, eicosenoic
South: Calabria vs Others        palmitic, linoleic, linolenic
South: N. Apulia vs S. Apulia    palmitoleic

Table 5: The important variables for each separation.

References

Burges, C. J. C. (1998), 'A Tutorial on Support Vector Machines for Pattern Recognition', Data Mining and Knowledge Discovery 2, 121-167.

Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), 'Classification of olive oils from their fatty acid composition', in H. Martens & H. Russwurm Jr., eds, Food Research and Data Analysis, Applied Science Publishers, London.

Vapnik, V. (1979), Estimation of Dependences Based on Empirical Data (in Russian), Nauka, Moscow.

Vapnik, V. (1999), The Nature of Statistical Learning Theory (Statistics for Engineering and Information Science), Springer-Verlag, New York, NY.

Venables, W. N. & Ripley, B. D. (1994), Modern Applied Statistics with S-Plus, Springer-Verlag, New York.


More information

ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS

ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS COMPSTAT 2004 Symposium c Physica-Verlag/Springer 2004 ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS Salvatore Ingrassia and Isabella Morlini Key words: Richly parameterised models, small data

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

Introduction. Chapter 1. 1.1 Before you start. 1.1.1 Formulation

Introduction. Chapter 1. 1.1 Before you start. 1.1.1 Formulation Chapter 1 Introduction 1.1 Before you start Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

Multiple Kernel Learning on the Limit Order Book

Multiple Kernel Learning on the Limit Order Book JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

What is Data mining?

What is Data mining? STAT : DATA MIIG Javier Cabrera Fall Business Question Answer Business Question What is Data mining? Find Data Data Processing Extract Information Data Analysis Internal Databases Data Warehouses Internet

More information

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]

More information

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

3 An Illustrative Example

3 An Illustrative Example Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent

More information

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

More information

Package MDM. February 19, 2015

Package MDM. February 19, 2015 Type Package Title Multinomial Diversity Model Version 1.3 Date 2013-06-28 Package MDM February 19, 2015 Author Glenn De'ath ; Code for mdm was adapted from multinom in the nnet package

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

Large Margin DAGs for Multiclass Classification

Large Margin DAGs for Multiclass Classification S.A. Solla, T.K. Leen and K.-R. Müller (eds.), 57 55, MIT Press (000) Large Margin DAGs for Multiclass Classification John C. Platt Microsoft Research Microsoft Way Redmond, WA 9805 [email protected]

More information

The Value of Visualization 2

The Value of Visualization 2 The Value of Visualization 2 G Janacek -0.69 1.11-3.1 4.0 GJJ () Visualization 1 / 21 Parallel coordinates Parallel coordinates is a common way of visualising high-dimensional geometry and analysing multivariate

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information