Machine Learning - Spring 2012 Problem Set 3

Out: February 29th, 1:30pm
In: March 19th, 1:30pm
TA: Hai-Son Le (hple@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University

Homework will be done individually: each student must hand in their own answers. It is acceptable for students to collaborate in figuring out answers and helping each other solve the problems. We will be assuming that, as participants in a graduate course, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration. You also must indicate on each homework with whom you collaborated.

Homework is due at the beginning of class on the due date. For programming questions, please submit your code and any plots/figures to the AFS submission folder:

    /afs/cs.cmu.edu/academic/class/10701-s12-users/yourandrewid/hw3

where yourandrewid is your Andrew ID. To copy the files to the folder, you can follow these steps:

1. Copy the homework file (PDF format) and other code and data files as needed to your home directory on an Andrew machine, e.g.

    scp probset3.pdf yourandrewid@linux.andrew.cmu.edu:yourhomepath
    password : *********

2. Start an Andrew session, e.g.

    ssh -l yourandrewid linux.andrew.cmu.edu
    password : *********

3. Invoke aklog and copy the files over to your dedicated course folder. You have insert and listing rights for this folder, e.g.

    aklog cs.cmu.edu
    mkdir /afs/cs.cmu.edu/academic/class/10701-s12-users/yourandrewid/hw3
    cp probset3.pdf /afs/cs.cmu.edu/academic/class/10701-s12-users/yourandrewid/hw3/

Please also include a README file that describes your submission.

You may not overwrite your submission. If you want to update a file, please submit a file with a new filename, keeping in mind a 20 MB space limit per student. The file with the newest timestamp prior to the due date and time will be evaluated.

1 Decision trees [25 points]

Short decision trees with few levels are more desirable in practice because they generalize well to new data and are less prone to overfitting.

1.1 Information Gain [10 points]

Most algorithms for learning the structure of decision trees rely on a heuristic to pick the attribute on which to split the tree at each sequential step. In this question, we are particularly interested in discrete-valued attributes. We explore a popular measure, Information Gain (IG), and relate it to other concepts in probability. We start with these definitions. Assume we have two discrete random variables $X$ and $Y$ taking values in $\{1, \ldots, m\}$ and $\{1, \ldots, n\}$ respectively.

Entropy:

$$H(X) = -\sum_{x=1}^{m} p(X = x) \log p(X = x) \tag{1}$$

Conditional entropy of $Y$ given $X$, written $H(Y \mid X)$. Note that this is different from the entropy of $Y$ conditioned on $X = x$, written $H(Y \mid X = x)$:

$$H(Y \mid X) = \sum_{x=1}^{m} p(X = x)\, H(Y \mid X = x) \tag{2}$$

Information Gain:

$$IG(X; Y) = H(X) - H(X \mid Y) \tag{3}$$

Kullback-Leibler divergence (K-L divergence) between two distributions:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{4}$$

Mutual Information:

$$I(X; Y) = \sum_{x=1}^{m} \sum_{y=1}^{n} p(X = x, Y = y) \log \frac{p(X = x, Y = y)}{p(X = x)\, p(Y = y)} \tag{5}$$

1. [4 points] Show that $IG(X; Y) = IG(Y; X)$.
2. [3 points] Show that $IG(X; Y) = H(X) + H(Y) - H(X, Y)$.

3. [3 points] Write the Information Gain in terms of the K-L divergence of two distributions. In other words, find $P$ and $Q$ such that $IG(X; Y) = D_{KL}(P \,\|\, Q)$.

1.2 Tree Construction [15 points]

It is that time of the year to file tax returns with the IRS. Bob is hired by a famous tax software company to assess the risk of audit for customers. Unfortunately, being a poor grad student, he has paid no attention to taxes. He decides to build a decision tree to classify customers into high and low risk groups. The data is shown in Table 1.

    Having a Car   Student   Owning a house   Risk   Count
    Yes            Yes       Yes              Yes    5
    Yes            Yes       No               Yes    3
    Yes            No        Yes              Yes    15
    Yes            No        No               Yes    2
    No             Yes       Yes              Yes    5
    No             Yes       No               Yes    0
    No             No        Yes              Yes    4
    No             No        No               Yes    1
    Yes            Yes       Yes              No     3
    Yes            Yes       No               No     3
    Yes            No        Yes              No     2
    Yes            No        No               No     2
    No             Yes       Yes              No     2
    No             Yes       No               No     10
    No             No        Yes              No     3
    No             No        No               No     10

    Table 1: Data for the tree construction problem.

1. [7 points] Show the steps in constructing the decision tree (without post-pruning) and the final tree, using the algorithm we described in class. What is the classification error?
2. [3 points] Bob would like to add two new attributes: Income and Age. Assume that the decision tree splits on an attribute $X$ by dividing the examples into two sets: $X < a$ and $X \ge a$. What happens when an attribute has $K$ distinct values? How should we determine the best splitting value $a$?
3. [5 points] Real datasets may not be perfect, e.g. some may contain systematic errors. For each of the following situations, describe how you would detect the systematic error and what we should do about it.
   - An attribute $I$ has only a single value.

   - Attributes $T_1$ and $T_2$ are duplicates, that is, every example has the same value for both attributes.

2 Neural Networks [30 points]

2.1 Binary classification [3 points]

Figure 1: Plots for 2.1.

Which of the following classifiers can learn the decision boundary shown in Figure 1: kNN, logistic regression, neural networks, naive Bayes?

2.2 Boolean functions and neural networks [9 points]

For this problem, we only consider neural networks with a hidden layer. Each unit has a weight vector $w$ and uses one of the following activation functions:

1. Hard-threshold function: for a constant $a$,

$$h(x) = \begin{cases} 1 & \text{if } w^T x \ge a \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

2. Linear:

$$h(x) = w^T x \tag{7}$$

In class, we showed that networks of this type can compute the Boolean functions OR, AND, and XOR.
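As an illustration of the hard-threshold unit in equation (6), the sketch below wires up OR, AND, and XOR. The specific weights and thresholds are one possible choice, picked here for illustration, and are not necessarily the construction shown in class; the function names are hypothetical.

```python
from itertools import product


def threshold_unit(weights, inputs, a):
    """Hard-threshold unit from equation (6): 1 if w^T x >= a, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= a else 0


def and_gate(x1, x2):
    # Fires only when both inputs are 1: x1 + x2 >= 2.
    return threshold_unit([1, 1], [x1, x2], a=2)


def or_gate(x1, x2):
    # Fires when at least one input is 1: x1 + x2 >= 1.
    return threshold_unit([1, 1], [x1, x2], a=1)


def xor_network(x1, x2):
    # Hidden layer: one OR unit and one AND unit (assumed wiring).
    h_or = or_gate(x1, x2)
    h_and = and_gate(x1, x2)
    # Output unit fires when OR is on and AND is off: h_or - h_and >= 1.
    return threshold_unit([1, -1], [h_or, h_and], a=1)


# Map each input pair to the tuple (AND, OR, XOR).
truth_table = {
    (x1, x2): (and_gate(x1, x2), or_gate(x1, x2), xor_network(x1, x2))
    for x1, x2 in product([0, 1], repeat=2)
}
for inputs, outputs in truth_table.items():
    print(inputs, outputs)
```

XOR needs the hidden layer here because a single threshold unit can only separate the inputs with one line, whereas OR and AND each fit in a single unit.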

1. [2 points] How many Boolean functions $f(x_1, x_2)$, where $x_1, x_2 \in \{0, 1\}$, are there?
2. [7 points] For any Boolean function $f(x_1, x_2)$, can we build a neural network to compute $f$? If yes, prove it by construction. If no, explain in detail.

2.3 Two layer neural networks [6 points]

We now consider neural networks with a hidden layer, only one input signal $x$, and one output signal $y$, using hard-threshold and/or linear activation functions. Which of the following cases can be exactly represented by such a neural network? For each case, if yes, draw the network with a choice of activations for both levels and briefly explain how the function is represented. If no, explain why.

1. [2 points] Polynomials of degree one.
2. [2 points] Polynomials of degree two.
3. [2 points] Hinge loss: $L(x) = \max(1 - x, 0)$.

2.4 Network Training [12 points]

We train a neural network on a set of input vectors $\{x_n\}$ where $n = 1, \ldots, N$, together with a corresponding set of target $K$-dimensional vectors $\{t_n\}$. We assume that $t$ has a Gaussian distribution with an $x$-dependent mean, so that:

$$p(t \mid x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1} I) \tag{8}$$

where $\beta$ is the precision of the Gaussian noise. We would like to find the parameters $w$ using the MLE principle:

$$w_{MLE} = \arg\max_{w} \prod_{n=1}^{N} p(t_n \mid x_n, w) \tag{9}$$

1. [2 points] Show that solving (9) is equivalent to minimizing the sum-of-squares error function:

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|_2^2 \tag{10}$$

2. [10 points] Consider a network with one hidden layer as shown in Figure 2. Assume that each hidden unit $j$ computes a weighted sum of its inputs $x$ of the form:

$$a_j = \sum_{i} w_{ji} x_i \tag{11}$$

where $w_j$ and $x$ are the weights and input of this hidden unit, and that it uses an activation function $h(a_j) = z_j$. The output unit $k$ likewise computes a weighted sum of its inputs $z$ of the form

$$b_k = \sum_{j} w_{kj} z_j \tag{12}$$

where $w_k$ are the weights of this output unit. This unit outputs $y_k = b_k$. Together, the output of this network is:

$$y(x_n, w) = [y_1, y_2, \ldots, y_K]^T \tag{13}$$

Figure 2: Plots for 2.3.

To train a neural network, we need to compute the derivatives of $E(w)$. We can do so by considering a simpler quantity $E_n = \frac{1}{2} \| y(x_n, w) - t_n \|_2^2$. Show that

$$\frac{\partial E_n}{\partial w_{ji}} = h'(a_j) \left( \sum_{k} w_{kj} \delta_k \right) x_i \tag{14}$$

where $w_k$ are the weights of the output units and $\delta_k = y_k - t_k$.

Hints: This is covered in the section of PRML on error backpropagation. You should try to follow the derivation there.

3 SVM [40 points]

In this problem, we consider training an SVM on a set of inputs $\{x_n\}$ where $n = 1, \ldots, N$, together with a corresponding set of target values $\{t_n\}$ ($t_n \in \{-1, 1\}$). The goal is to maximize the margin while allowing a small number of

misclassifications. An example is classified as follows:

$$y(x_n) = w^T \phi(x_n) + b; \qquad \hat{t}_n = \begin{cases} +1, & \text{if } y(x_n) \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{15}$$

We therefore minimize:

$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \| w \|^2 \tag{16}$$

$$\text{s.t.} \quad t_n y(x_n) \ge 1 - \xi_n, \quad n = 1, \ldots, N \tag{17}$$

$$\xi_n \ge 0, \quad n = 1, \ldots, N \tag{18}$$

where $C \ge 0$ is a parameter that controls the trade-off between the slack variable penalty and the margin.

3.1 Slackness and Kernel [4 points]

Figure 3 shows the decision boundaries obtained by training four SVMs using different values of $C$ and different kernels. For each of the examples, specify which setting was used to get the result.

Figure 3: Plots (a), (b), (c), and (d).

1. $C = 1$ and no kernel is used.
2. $C = 0.1$ and no kernel is used.
3. $C = 0.1$ and the kernel is $K(x_i, x_j) = \exp\{-\| x_i - x_j \|_2^2\}$.
4. $C = 0.1$ and the kernel is $K(x_i, x_j) = x_i^T x_j + (x_i^T x_j)^2$.

3.2 Dual formalization [16 points]

1. [2 points] Using Lagrange multipliers, show that the corresponding Lagrangian is

$$L(w, b, a) = \frac{1}{2} \| w \|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \{ t_n y(x_n) - 1 + \xi_n \} - \sum_{n=1}^{N} \mu_n \xi_n$$

Be explicit in specifying which multiplier corresponds to which constraint. Hints: A brief introduction to Lagrange multipliers is in Appendix E of the PRML book.

2. The Karush-Kuhn-Tucker (KKT) conditions for the optimal solutions are:

$$a_n \ge 0 \tag{19}$$
$$t_n y(x_n) - 1 + \xi_n \ge 0 \tag{20}$$
$$a_n \{ t_n y(x_n) - 1 + \xi_n \} = 0 \tag{21}$$
$$\mu_n \ge 0 \tag{22}$$
$$\xi_n \ge 0 \tag{23}$$
$$\mu_n \xi_n = 0 \tag{24}$$

3. [5 points] By taking the derivatives of $L(w, b, a)$, show that the dual problem is:

$$\max_{a} \; \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m) \tag{25}$$

$$\text{s.t.} \quad 0 \le a_n \le C, \quad n = 1, \ldots, N \tag{26}$$

$$\sum_{n=1}^{N} a_n t_n = 0 \tag{27}$$

Please show all the steps in detail.

4. [2 points] Given a solution to the dual problem, which inputs are support vectors?
5. [3 points] Can you compute $w$ from the dual solution? If yes, give the formula. If no, explain why. Hints: Do we always have $\phi(x_n)$?
6. [4 points] How can we classify a new example $x$?
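To make the dual problem in (25) to (27) concrete, here is a minimal sketch that evaluates the dual objective and checks the constraints for a candidate vector a. It does not solve the optimization. The function names, the linear kernel, and the two-point toy dataset are all assumptions chosen for illustration.

```python
def linear_kernel(u, v):
    """k(u, v) = u^T v, i.e. phi(x) = x (the 'no kernel' case)."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gram_matrix(xs, kernel):
    """K[n][m] = kernel(x_n, x_m) for all pairs of training inputs."""
    return [[kernel(xn, xm) for xm in xs] for xn in xs]


def dual_objective(a, t, K):
    """Objective (25): sum_n a_n - 1/2 sum_n sum_m a_n a_m t_n t_m K[n][m]."""
    n_points = len(a)
    quadratic = sum(
        a[n] * a[m] * t[n] * t[m] * K[n][m]
        for n in range(n_points)
        for m in range(n_points)
    )
    return sum(a) - 0.5 * quadratic


def is_feasible(a, t, C, tol=1e-9):
    """Constraints (26) and (27): 0 <= a_n <= C and sum_n a_n t_n = 0."""
    in_box = all(-tol <= an <= C + tol for an in a)
    balanced = abs(sum(an * tn for an, tn in zip(a, t))) <= tol
    return in_box and balanced


# Hypothetical toy data: two 1-D points with opposite labels.
xs = [[1.0], [-1.0]]
t = [1, -1]
K = gram_matrix(xs, linear_kernel)
a = [0.5, 0.5]  # a candidate dual vector, chosen by hand
print(dual_objective(a, t, K), is_feasible(a, t, C=1.0))
```

Because the objective only touches the inputs through K, swapping `linear_kernel` for another kernel function changes nothing else in the code, which is the point of computing the kernel in its own subroutine.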

3.3 Implementation [20 points]

1. [10 points] Implement SVM in MATLAB.

    function [y_new, test_error, training_error] = train_svm(x, t, x_new, C)

This function trains an SVM with inputs x, t, C, classifies the new examples x_new, and outputs the predictions in y_new. Your implementation should satisfy these requirements:

   - You cannot use any existing implementations of SVM, including MATLAB's built-in functions. However, you can use a quadratic programming (QP) solver (e.g. MATLAB's quadprog).
   - You should use a subroutine to compute the kernel between two input examples so that it is easy to use different kernels with your implementation.

2. A dangerous mutant of bird flu viruses was discovered and a widespread pandemic is imminent. A vaccine was developed for improving immunity against this potent virus. However, this new vaccine is very expensive and only effective for a subset of patients. The expression levels of two genes, X and Y, have been shown to be predictors of the vaccine's effectiveness. You are asked to train an SVM to predict the vaccine's effectiveness on patients using gene expression measurements. Download the dataset from the course website. The .mat file contains two matrices: train and test. There are 160 training examples and 40 testing examples.

(a) [4 points] Using no kernel (that is, $\phi(x_n) = x_n$), train an SVM using different values of $C$ = 0, 0.1, 0.3, 0.5, 1, 2, 5, 8, 10. Use 4-fold cross validation and report the training and testing errors. Produce a plot of 2 curves: training and testing error. The x-axis shows different values of $C$ and the y-axis shows the error rate. Which value of $C$ yields the best training and testing error rate? Which value of $C$ should you use? Finally, plot the training and testing examples along with the decision boundary. What do you observe? Hints: To plot the decision boundary, you can use the code in svmplot.m.

(b) [4 points] Now use the Gaussian kernel $K(x_i, x_j) = \exp\{-\| x_i - x_j \|_2^2\}$ and repeat the previous experiment.
(c) [2 points] Summarize the results of the two experiments. Should we use Gaussian kernels at all?

4 Hierarchical clustering vs. K-means [5 points]

Single linkage and K-means can give very different clustering results, and one may be preferable to the other. For each of the examples in Figure 4, write down whether the clusters are produced by K-means (KM) or agglomerative clustering with single linkage (SL).

Figure 4: Plots for 4.
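The claim in Section 4 that single linkage and K-means can give very different clusterings can be seen on a small synthetic example. The sketch below is a plain-Python illustration under stated assumptions: the data (two long parallel rows of points) is hypothetical and unrelated to Figure 4, the function names are made up, and the K-means result depends on the hand-picked initial centers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    closest pair of points is nearest, until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(sq_dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


def k_means(points, centers, iterations=100):
    """Lloyd's algorithm started from the given initial centers."""
    centers = [list(c) for c in centers]  # copy so the caller's data is untouched
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, label in zip(points, assignment) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(coord) / len(members) for coord in zip(*members)]
    return [sorted(i for i, label in enumerate(assignment) if label == c)
            for c in range(len(centers))]


# Hypothetical data: two parallel rows of 20 points, spacing 1, rows 3 apart.
row_a = [[float(x), 0.0] for x in range(20)]  # indices 0..19
row_b = [[float(x), 3.0] for x in range(20)]  # indices 20..39
points = row_a + row_b

sl = single_linkage(points, k=2)
km = k_means(points, centers=[points[0], points[-1]])
print("single linkage:", sl)
print("k-means:       ", km)
```

On this data single linkage follows the chains of close neighbors and recovers the two rows, while K-means, which favors compact clusters around a center, cuts both rows in half.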