
LAPPEENRANNAN TEKNILLINEN YLIOPISTO
Faculty of Technology
Bachelor's degree program in chemical engineering

Tuomas Sihvonen

Support vector machines in spectral data classification

Examiner: Satu-Pia Reinikainen

Contents

1 Introduction
2 Theory
  2.1 Hard margin support vector machines
  2.2 Soft-margin SVMs
  2.3 Kernel methods
    2.3.1 Linear kernel
    2.3.2 Polynomial kernel
    2.3.3 Radial basis function kernel
3 Experimental
  3.1 Datasets used
  3.2 Data pretreatment
4 Results and discussion
  4.1 Effect of kernel parameters
    4.1.1 Linear kernel
    4.1.2 Polynomial kernel
    4.1.3 RBF kernel
5 Conclusion
References

1 Introduction

In chemistry and chemical engineering, as in many other branches of science and engineering, advances in computer and measurement technologies have made it possible to gather much more information about different experiments and processes. The problem is then how to extract useful information from all of the data gathered. One way of doing this is to use pattern recognition to classify the data. This is useful when the outcome is not known and we want to explore the dataset and find patterns. [1]

In classification the aim is to separate samples that differ from one another by assigning different labels to them. The samples can be virtually anything, from separating healthy people from the sick to monitoring whether the quality of a product is satisfactory. Support vector machines (SVM) are a powerful classification tool that has been gaining popularity in chemometrics in recent years. SVMs are computationally light but provide good generalization properties and can be applied to complex datasets. In chemistry, SVMs have been used especially on the bio and medical side, where datasets are usually quite big [2, 3]. Some uses in process monitoring have also been reported [4-6]. In the theory part of this study we take a look at SVM theory and formulation, and in the experimental section we see how different SVM parameters affect the results obtained from an SVM.

2 Theory

In support vector machines the idea is to find a hyperplane that separates the points of different labels such that the margin between the plane and the points is as large as possible. This principle is visualized in Figure 1.

Figure 1: Two separating hyperplanes fitted to a dataset. The plane with the larger margin is the optimal one and the filled symbols are support vectors. [7]

2.1 Hard margin support vector machines

We are going to use the hard margin support vector machine as the basis when we go through the theory of support vector machines. This form of SVM can be considered the simplest. Later we will see that even the more advanced SVM variants are only small modifications of this initial theory, so it is enough to look at the equations derived here and point out how they have been modified.

Hard-margin SVMs work when we have data points x_i (i = 1, ..., M) that are linearly separable and have a class label of y_i = 1 or y_i = -1. Then we can place a line or a hyperplane between these points to separate them according to their labels (Fig. 1). This plane has the form w · x + b = 0 and it is used as the decision function

D(x) = w · x + b    (1)

where w is the normal of the plane and b is the bias term. The distance between the plane and the nearest data point is called the margin. A sample is then classified according to the following rule:

When D(x) < 0, the sample belongs to the class -1.
When D(x) > 0, the sample belongs to the class +1.
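As a small illustration of the decision function (1) and the sign rule above, the following Python sketch (not part of the thesis' Matlab code) classifies a couple of points with a hand-picked, purely hypothetical plane.

```python
import numpy as np

def decision_function(X, w, b):
    """D(x) = w . x + b for each row of X (equation (1))."""
    return X @ w + b

def classify(X, w, b):
    """Assign label +1 or -1 according to the sign of D(x)."""
    return np.where(decision_function(X, w, b) >= 0, 1, -1)

# Hand-picked plane for illustration only; in the SVM, w and b are optimized.
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0],
              [-1.0, 3.0]])
print(classify(X, w, b))   # -> [ 1 -1]
```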

There are infinitely many planes that can separate the data points. To get the optimal separating hyperplane we need to maximize the plane's distance from the data points, i.e. maximize the margin. To do this we must first calculate the distance of a point x_n from the plane w · x + b = 0. The plane is normalized so that |w · x_n + b| = 1 for the point closest to the plane.

distance = |ŵ · (x_n - x)|    (2)

where ŵ is the unit vector ŵ = w / ||w|| and x is a point on the plane. Then, by substituting this and adding and subtracting b in equation (2), we get equation (3):

distance = (1/||w||) |w · x_n + b - (w · x + b)| = (1/||w||) |w · x_n + b| = 1/||w||    (3)

Here the point x lies on the plane, so the term w · x + b is zero. To get a plane that maximizes the margin we want to maximize this distance,

maximize 1/||w||    (4)

subject to

min_n |w · x_n + b| = 1    (5)

This is not an easy optimization task. We notice that we can get rid of the absolute values by multiplying by the class labels:

|w · x_n + b| = y_n (w · x_n + b)    (6)

To make the optimization easier we move from maximizing to minimizing and get the following problem:

minimize (1/2) w · w    (7)
subject to y_n (w · x_n + b) ≥ 1 for n = 1, ..., N    (8)

We transform this constrained optimization problem into an easier one by using the method of Lagrange multipliers:

L(w, b, α) = (1/2) w · w - Σ_n α_n [y_n (w · x_n + b) - 1]    (9)

where the α_n are the Lagrange multipliers. We want to minimize this equation with respect to w and b and maximize it with respect to each α_n. To achieve the minimization w.r.t. w we take the gradient of equation (9) and set it to zero,

∇_w L(w, b, α) = w - Σ_n α_n y_n x_n = 0    (10)

from which we can solve w:

w = Σ_n α_n y_n x_n    (11)

To minimize equation (9) w.r.t. b we differentiate it:

∂L/∂b = -Σ_n α_n y_n = 0    (12)

Equations (11) and (12) are called the Karush-Kuhn-Tucker (KKT) conditions, and by substituting them into (9) we get

L(α) = Σ_n α_n - (1/2) Σ_n Σ_m y_n y_m α_n α_m (x_n · x_m)    (13)

Now the equation is free of w and b and can be maximized w.r.t. the α_n using quadratic programming.

The products y_n y_m (x_n · x_m) are just the training-set data points multiplied by their labels, and they can be collected into a matrix Q, so that in the end we have the following optimization task:

minimize (1/2) α^T Q α - 1^T α    (14)
subject to y^T α = 0 and α_n ≥ 0 for n = 1, ..., N    (15)

After α has been solved by quadratic programming, w can be solved from equation (11). Those x_n for which α_n > 0 are called support vectors (SV). These are the points that define the separating hyperplane, so it is enough to use just the x_n that are support vectors when calculating w:

w = Σ_{x_n is SV} α_n y_n x_n    (16)

After w is solved, the bias b can be solved from any support vector:

y_n (w · x_n + b) = 1    (17)

2.2 Soft-margin SVMs

Hard-margin SVMs work only when the data is linearly separable, because otherwise the constraint (8) is violated. To allow some violation of the margin we add a so-called slack variable ξ_n to the constraint equation. This variable tells us how far into the margin a point is allowed to reach while the plane is still accepted as the optimal one:

y_n (w · x_n + b) ≥ 1 - ξ_n    (18)

Now our minimization task (7) is altered:

minimize (1/2) w · w + C Σ_n ξ_n    (19)
subject to y_n (w · x_n + b) ≥ 1 - ξ_n for n = 1, ..., N    (20)
ξ_n ≥ 0 for n = 1, ..., N    (21)
ξ ∈ R^N, b ∈ R, w ∈ R^d    (22)

Now we can write the Lagrangian again for the modified target function:

L(w, b, ξ, α, β) = (1/2) w · w + C Σ_n ξ_n - Σ_n α_n [y_n (w · x_n + b) - 1 + ξ_n] - Σ_n β_n ξ_n    (23)

This equation is very similar to the equation for the linearly separable case (9); we have just added another Lagrange multiplier β_n for the new variable ξ_n. This function is also minimized with respect to w, b and ξ and maximized w.r.t. each α_n and β_n. When equation (23) is minimized w.r.t. w and b we get exactly the same results as previously in equations (11) and (12). When it is minimized w.r.t. ξ_n we get the following equation:

∂L/∂ξ_n = C - α_n - β_n = 0    (24)

From this equation we get the condition α_n ≤ C, because β_n ≥ 0, and if α_n were greater than C we could not find a β_n that makes equation (24) hold. When equations (11), (12) and (24) are substituted back into equation (23) we again get equation (13). So the only thing that changed in the non-separable case is the added condition α_n ≤ C, and the final target function for the soft-margin case is the following:

minimize (1/2) α^T Q α - 1^T α    (25)
subject to y^T α = 0 and 0 ≤ α_n ≤ C for n = 1, ..., N    (26)
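The dual (25)-(26) is an ordinary box-constrained quadratic program, so any general-purpose QP solver can be used to find α. The sketch below is a minimal Python version built on the cvxopt package; it is an illustration under these assumptions, not the author's Matlab implementation. It builds Q for a linear kernel and recovers w and b from the support vectors via equations (16) and (17).

```python
import numpy as np
from cvxopt import matrix, solvers

def train_soft_margin_svm(X, y, C=1.0):
    """Solve the soft-margin dual (25)-(26) with cvxopt and return alpha, w, b.

    X : (N, d) array of training points, y : (N,) array of labels in {-1, +1}.
    """
    N = X.shape[0]
    K = X @ X.T                      # Gram matrix of the linear kernel (31)
    Q = np.outer(y, y) * K           # Q_nm = y_n y_m (x_n . x_m)

    P = matrix(Q)
    q = matrix(-np.ones(N))                               # the -1^T alpha term of (25)
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))            # y^T alpha = 0
    b_eq = matrix(0.0)

    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b_eq)
    alpha = np.ravel(sol["x"])

    sv = alpha > 1e-6                               # support vectors, alpha_n > 0
    w = (alpha * y)[sv] @ X[sv]                     # equation (16)
    margin_sv = sv & (alpha < C - 1e-6)             # SVs lying exactly on the margin
    b = np.mean(y[margin_sv] - X[margin_sv] @ w)    # equation (17), averaged over SVs
    return alpha, w, b
```

For the hard-margin case it is enough to set C to a very large value, which makes the upper bound on α_n inactive in practice.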

2.3 Kernel methods

When the data being classified is not linearly separable and a separating hyperplane cannot be found, the data can be mapped to a higher-dimensional space where it is linearly separable with a hyperplane. This can be done by moving from the X space to a Z space and solving the Lagrangian there:

L(α) = Σ_n α_n - (1/2) Σ_n Σ_m y_n y_m α_n α_m (z_n · z_m)    (27)

In fact we only need to know the inner product z_n · z_m in the Z space, not the vector z itself. To obtain the inner product we can form the following function, which we call the kernel:

z · z' = K(x, x')    (28)

If this kernel corresponds to an inner product in some space Z, it can be used as the inner product without calculating the transformation explicitly. This makes the computations much more economical. When calculating the Lagrangian we get

L(α) = Σ_n α_n - (1/2) Σ_n Σ_m y_n y_m α_n α_m K(x_n, x_m)    (29)

When using a kernel the decision function is also altered, as the data being classified must be moved to the same space our separating plane is in. Computationally this is not much heavier than the normal case, where just the inner product of data points in the original space is calculated:

D(x) = Σ_n y_n α_n K(x_n, x) + b    (30)

The kernel function can be selected according to the classification task at hand. With it we can move to very high dimensional spaces to find the best separating plane and use it to classify our data in that space according to (30).

2.3.1 Linear kernel

When the data being classified is linearly separable or nearly linearly separable in the input space, mapping to a higher-dimensional space is not needed. Then we use the so-called linear kernel, which is just the inner product of the input data:

K(x, x') = x · x'    (31)

2.3.2 Polynomial kernel

Polynomial kernels have the following form, where d is the degree of the polynomial:

K(x, x') = (1 + x · x')^d    (32)

2.3.3 Radial basis function kernel

The radial basis function kernel has the following form:

K(x, x') = exp(-γ ||x - x'||²)    (33)
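For reference, the three kernels (31)-(33) and the kernelized decision function (30) can be written directly as small functions. The Python sketch below is illustrative only; the parameter names d and gamma mirror the text, and the helper names are not from the thesis code.

```python
import numpy as np

def linear_kernel(x, z):
    """Equation (31): plain inner product, no mapping to a higher dimension."""
    return x @ z

def polynomial_kernel(x, z, d=3):
    """Equation (32): (1 + x . z)^d, with d the degree of the polynomial."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    """Equation (33): exp(-gamma * ||x - z||^2)."""
    diff = x - z
    return np.exp(-gamma * (diff @ diff))

def kernel_decision(x, X_sv, y_sv, alpha_sv, b, kernel, **params):
    """Kernelized decision function, equation (30): a sum over the support vectors."""
    return sum(a * y * kernel(x_n, x, **params)
               for a, y, x_n in zip(alpha_sv, y_sv, X_sv)) + b
```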

In the RBF kernel, γ is a positive parameter controlling the radius of the function. The value of γ determines how far from a data point the separating plane is placed: with large γ values the plane is near a point, and with small values the plane is farther away. This can be understood by noting that when the distance between points increases, the kernel value approaches zero very rapidly; γ as the multiplier affects how fast the value decreases and thus how far the influence of a data point reaches. This parameter can be optimized for the classification task. γ can also be expressed through a width parameter σ as γ = 1/(2σ²).

3 Experimental

The SVM algorithm was implemented in Matlab. Two published algorithms were used as the basis of the Matlab code [8, 9].

3.1 Datasets used

The algorithm was tested on an NIR dataset. The dataset consisted of 175 spectra that had been classified into two classes, and each spectrum had 168 data points. In Figure 2 all of the 175 spectra have been plotted in a single image. From this image the differences between the spectra are hard to see, and the only area where a clear difference can be seen is between wavenumbers 4 000 and 5 000. The difference does not become clear even when the classes of spectra are separated and plotted side by side, as in Figure 3.

3.2 Data pretreatment

The data presented in Figures 2 and 3 is clearly linearly inseparable, which is why some form of data pretreatment is needed. One often used multivariate method in the study of spectral data is principal component analysis (PCA). The idea is to transform the data into a set of linearly uncorrelated variables. The transformation is done so that the first principal component (PC) has the largest possible variance, i.e. it explains most of the original data. The subsequent PCs have lower variances and they are chosen so that each PC is orthogonal to all the others. Because of this property the first PCs should explain most of the data, and as they have most of the variance they also have most of the differences between the two classes.

Figure 2: Unmodified spectral dataset (transmittance as a function of wavenumber, 1/cm).

Figure 3: The two classes of spectra separated and plotted side by side (transmittance as a function of wavenumber, 1/cm).

Thus principal components one and two were chosen when representing the data in plots. In this way the different classes can be observed more easily, as can be seen in Figure 4. After the PCA transformation we have 175 PCs, the same number as the original spectra, each with the length of 88. As the dataset was fairly small to begin with, all of the PCs were used, so that no features of the data would be left out. The dataset was split in two: half of the data was used for training the SVM and the other half for testing the classification.
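A minimal sketch of the pretreatment described above: mean-centre the spectra, compute all principal component scores with an SVD, and split the scores into equal training and test halves. The arrays below are random placeholders of the stated size, not the actual NIR data.

```python
import numpy as np

def pca_scores(X):
    """Return the full PCA score matrix (all PCs kept) and the loadings."""
    Xc = X - X.mean(axis=0)                      # mean-centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * s, Vt                             # scores = Xc @ Vt.T

rng = np.random.default_rng(0)
X = rng.normal(size=(175, 168))                  # placeholder spectra
y = np.where(rng.normal(size=175) > 0, 1, -1)    # placeholder class labels

scores, loadings = pca_scores(X)

# 50/50 split into training and test halves, as described in the text.
idx = rng.permutation(len(y))
train, test = idx[:len(y) // 2], idx[len(y) // 2:]
X_train, y_train = scores[train], y[train]
X_test, y_test = scores[test], y[test]
```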

Figure 4: PCA-transformed training data plotted as a function of principal components 1 and 2 (legend: class +1, class -1).

4 Results and discussion

4.1 Effect of kernel parameters

All three kernels described earlier were tested on classification of the spectral dataset, and the effect of the different parameters was studied for each kernel. The parameter C was common to all of the kernels, although it is more a parameter of the soft-margin SVM than of the kernels. This parameter determines the trade-off between maximizing the margin and penalizing misclassified training samples. When C is high the SVM aims to classify all the training samples correctly and the resulting decision border becomes more complex. Similarly, when C is small the border becomes smoother, as more error is allowed. This can also be seen from equation (23), where C is the multiplier of the slack variables.

4.1.1 Linear kernel

This kernel is the simplest one to use in the classification. It is just an inner product of the data points in the input space, so when using this kernel the only parameter adjusting the SVM's behaviour is the parameter C.

The dataset was classified by changing the parameter C from 0 to 1. The performance of the classification was evaluated by calculating how well the test set was classified. In Figure 5 the classification error is presented as a function of C. We can see that after a C value of about 0.5 the error is at its minimum of 6.7 %.

Figure 5: Error of classification as a function of C when using the linear kernel (amount of misclassified samples, %, versus the value of parameter C).

In Figure 6 the separating line is drawn for the test data for various C values. It should be noted that the line in Figure 6 is a projection of the separating plane fitted to the dataset, which is, after the PCA, 88-dimensional. Still, this image can give us a visual indication of how the classification is working.
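The C sweep behind Figure 5 can be outlined as follows. Here scikit-learn's SVC with a linear kernel is used as a stand-in for the thesis' own Matlab SVM, and the training and test arrays are assumed to come from the pretreatment sketch in Section 3.2.

```python
import numpy as np
from sklearn.svm import SVC

def error_vs_C(X_train, y_train, X_test, y_test, C_values):
    """Per cent of misclassified test samples for each value of C (cf. Figure 5)."""
    errors = []
    for C in C_values:
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        errors.append(100.0 * np.mean(clf.predict(X_test) != y_test))
    return np.array(errors)

# Using the split from the pretreatment sketch (the exact grid of the thesis is not known):
# errors = error_vs_C(X_train, y_train, X_test, y_test, np.linspace(0.01, 1.0, 20))
```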

Figure 6: Test dataset with the separating lines for different C values (principal components 1 and 2; legend: class +1, class -1, classified +1, classified -1).

4.1.2 Polynomial kernel

In a sense the polynomial kernel has two parameters, the parameter C and the power of the polynomial d (eq. (32)). Of course it could also be interpreted that when the power of the polynomial is changed, the kernel itself changes. Polynomial kernels were unable to classify data that was not first transformed through PCA. In Figure 7 the classification error is presented as a function of C and d. We can see that the lowest error (6.9 %) is achieved when the power of the polynomial is one; for this power the polynomial kernel actually reduces back to the linear kernel. Another thing noticed from Figure 7 is that the odd-numbered powers give smaller errors.
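The degree-versus-C grid behind Figure 7 can be sketched in the same way. Note that scikit-learn's polynomial kernel is (gamma * x·x' + coef0)^d, so gamma = 1 and coef0 = 1 reproduce equation (32); again this stands in for the author's Matlab code rather than reproducing it.

```python
import numpy as np
from sklearn.svm import SVC

def poly_error_grid(X_train, y_train, X_test, y_test, degrees, C_values):
    """Test-set error (%) over a grid of polynomial degree d and C (cf. Figure 7)."""
    err = np.zeros((len(degrees), len(C_values)))
    for i, d in enumerate(degrees):
        for j, C in enumerate(C_values):
            clf = SVC(kernel="poly", degree=d, coef0=1.0, gamma=1.0, C=C)
            clf.fit(X_train, y_train)
            # percentage of misclassified test samples
            err[i, j] = 100.0 * np.mean(clf.predict(X_test) != y_test)
    return err
```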

Figure 7: Error of classification as a function of the power of the polynomial kernel and the value of C (amount of misclassified samples, %).

4.1.3 RBF kernel

The RBF kernel has a parameter σ that controls the radius of the function, that is, how close to the data points the margins are set. The lowest classification errors were achieved with the RBF kernel; at its lowest the error was 0.3 %. Such a low error is most likely caused by overfitting during SVM training. For example, when comparing Figures 6 and 9 we see that the RBF kernel has found the data points in the middle of what seems to be another class.

Figure 8: Error of classification as a function of the values of σ and C.

The class borders for different values of C are presented in Figure 9. The behaviour of parameter C can be seen better in this figure than in the case of the linear kernel. Here we can see how the border stretches further, covering more points, when the value of C increases. This illustrates that the RBF kernel can produce very complex borders, to the point that even single samples are given their own borders. With this kind of borders, just looking at the classification error can be dangerous, as there is a great chance of overfitting.
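A corresponding sketch for the σ-C grid of Figure 8, again with scikit-learn's SVC standing in for the thesis implementation. The conversion γ = 1/(2σ²) follows the relation assumed in Section 2.3.3.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_error_grid(X_train, y_train, X_test, y_test, sigmas, C_values):
    """Test-set error (%) over a grid of sigma and C for the RBF kernel (cf. Figure 8)."""
    err = np.zeros((len(sigmas), len(C_values)))
    for i, sigma in enumerate(sigmas):
        gamma = 1.0 / (2.0 * sigma ** 2)    # assumed sigma-to-gamma conversion
        for j, C in enumerate(C_values):
            clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_train, y_train)
            err[i, j] = 100.0 * np.mean(clf.predict(X_test) != y_test)
    return err
```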

Figure 9: Decision boundaries for different values of C when σ = 0.5 (principal components 1 and 2; legend: class +1, class -1, classified +1, classified -1).

The effect of parameter σ can be seen in Figure 10. For lower σ values the border is very close to the data points; in some cases the border goes individually around a single data point. This is clearly overfitting the data, as it seems unlikely that a new data point would fall into such a tightly confined space. As the σ values increase, so does the distance from the data points to the border. Further increase of the parameter would most likely cause the border to move so that points would begin to be misclassified. In fact this can be seen in Figure 8, where the error starts to increase at higher σ values. The RBF kernel was also able to classify the data even without the PCA pretreatment, but getting any kind of visual interpretation from that is quite impossible because of the high dimensionality.

Figure 10: Decision boundaries for different values of σ when C = 0.5 (principal components 1 and 2; legend: class +1, class -1, classified +1, classified -1).

5 Conclusion

Support vector machines are a classification tool that has been gaining popularity in the field of chemometrics. The theoretical background given in this work helps users to understand how SVMs work, what their limitations are and what their strengths are. The functionality of SVMs was further explored in the experimental section, where it was shown how data derived from the chemical industry could be classified. Different kernels were tested on the data, and for each kernel the effect of the different parameters on the performance of the classification was studied. The linear and RBF kernels were the most successful in classifying the data used. Of these two, the linear kernel would be the better choice: it has only one parameter to optimize and, in the case of this data, it seems to ignore the outliers in the training data. The RBF kernel gave the lowest classification errors, but this was due to overfitting. The RBF kernel was also the only one capable of classifying the data when it was not transformed to principal components.

References

[1] Richard G. Brereton. Chemometrics for Pattern Recognition. Wiley, 2009.

[2] R. Burbidge, M. Trotter, B. Buxton, and S. Holden. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers & Chemistry, 26(1):5-14, 2001.

[3] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.

[4] Olivier Devos, Gerard Downey, and Ludovic Duponchel. Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils. Food Chemistry, 148:124-130, 2014.

[5] Manabu Kano and Yoshiaki Nakagawa. Data-based process monitoring, process control, and quality improvement: Recent developments and applications in steel industry. Computers & Chemical Engineering, 32(1-2):12-24, 2008.

[6] Yingwei Zhang. Enhanced statistical analysis of nonlinear processes using KPCA, KICA and SVM. Chemical Engineering Science, 64(5):801-811, 2009.

[7] Shigeo Abe. Support Vector Machines for Pattern Classification. Springer, 2010.

[8] Anton Schwaighofer. Support vector machine toolbox for Matlab.

[9] Simon Rogers and Mark Girolami. A First Course in Machine Learning: accompanying material. Website.
