Solutions to Selected Problems In: Pattern Classification by Duda, Hart, and Stork

John L. Weatherwax (wax@alum.mit.edu)

February 4, 2008

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem (randomized rules)

Part (a): Let $R(x)$ be the average risk given the measurement $x$. Since this risk depends on the action taken, and correspondingly on the probability of taking that action, it can be expressed as
$$R(x) = \sum_{i=1}^{m} R(a_i|x) P(a_i|x).$$
The total risk over all possible $x$ is then given by the average of $R(x)$, or
$$R = \int_{\Omega} \sum_{i=1}^{m} R(a_i|x) P(a_i|x) \, p(x) \, dx.$$

Part (b): One would pick
$$P(a_i|x) = \begin{cases} 1 & i = \operatorname{argmin}_{j} R(a_j|x) \\ 0 & \text{otherwise}. \end{cases}$$
Thus for every $x$ we pick the action that minimizes the local (conditional) risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.
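As a quick numerical illustration of Part (b) (an addition, not part of the original solution; the conditional risks below are made-up values), the following Matlab sketch shows that the risk of any randomized rule is a convex combination of the conditional risks, and so can never beat the deterministic minimum-risk choice.

% Hypothetical conditional risks R(a_i|x) for m = 3 actions at a fixed x.
R_cond = [0.3, 0.1, 0.6];

% An arbitrary randomized rule: a probability distribution over actions.
P_rand = [0.2, 0.5, 0.3];

% Risk of the randomized rule at x: sum_i R(a_i|x) * P(a_i|x).
R_randomized = dot(R_cond, P_rand);

% The deterministic rule puts all its probability on the best action.
R_deterministic = min(R_cond);

fprintf('randomized risk = %g, deterministic risk = %g\n', ...
        R_randomized, R_deterministic);
% Prints 0.29 versus 0.1: the convex combination cannot go below the min.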
Part (c): Yes, since starting from a suboptimal deterministic decision rule, randomizing some decisions might further improve performance. But how this would be done would be difficult to say.

Problem 13

Let
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j, \quad i,j = 1,2,\ldots,c \\ \lambda_r & i = c+1 \\ \lambda_s & \text{otherwise}. \end{cases}$$
Then since the definition of risk is expected loss, we have
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j) P(\omega_j|x) \quad i = 1,2,\ldots,c$$
$$R(\alpha_{c+1}|x) = \lambda_r.$$
Now for $i = 1,2,\ldots,c$ the risk can be simplified as
$$R(\alpha_i|x) = \lambda_s \sum_{j=1, j \neq i}^{c} P(\omega_j|x) = \lambda_s \left( 1 - P(\omega_i|x) \right).$$
Our decision rule is to choose the action with the smallest risk. This means that we pick the reject option if and only if
$$\lambda_r < \lambda_s \left( 1 - P(\omega_i|x) \right) \quad i = 1,2,\ldots,c.$$
This is equivalent to
$$P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad i = 1,2,\ldots,c.$$
This can be interpreted as: if all posterior probabilities are below the threshold $1 - \frac{\lambda_r}{\lambda_s}$, we should reject. Otherwise we pick the action $i$ such that
$$\lambda_s \left( 1 - P(\omega_i|x) \right) < \lambda_s \left( 1 - P(\omega_j|x) \right) \quad j = 1,2,\ldots,c, \; j \neq i.$$
This is equivalent to
$$P(\omega_i|x) > P(\omega_j|x) \quad \forall j \neq i,$$
which can be interpreted as selecting the class with the largest posterior probability. In summary, we should reject if and only if
$$P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad i = 1,2,\ldots,c. \qquad (3)$$
Note that if $\lambda_r = 0$ there is no loss for rejecting (an equivalent benefit/reward to correctly classifying) and we always reject. If $\lambda_r > \lambda_s$, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above since then $\frac{\lambda_r}{\lambda_s} > 1$, so $1 - \frac{\lambda_r}{\lambda_s} < 0$, and equation (3) will never be satisfied.
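To make the reject rule concrete, here is a small Matlab sketch (an addition, not from the original text; the posterior values and losses are hypothetical) that applies the threshold $1 - \lambda_r/\lambda_s$ to a single observation.

% Hypothetical posterior probabilities P(omega_i|x) for c = 3 classes.
posteriors = [0.40, 0.35, 0.25];

lambda_s = 1.0;   % loss for a substitution (misclassification) error
lambda_r = 0.3;   % loss for choosing the reject option

threshold = 1 - lambda_r/lambda_s;   % reject if all posteriors fall below this

[p_max, i_max] = max(posteriors);
if p_max < threshold
    fprintf('All posteriors below %g: reject.\n', threshold);
else
    fprintf('Classify as class %d (posterior %g).\n', i_max, p_max);
end
% With these numbers the threshold is 0.7 and the rule rejects.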
Problem 31 (the probability of error)

Part (a): We can assume without loss of generality that $\mu_1 < \mu_2$. If this is not the case initially, change the labels of class one and class two so that it is. We begin by recalling the probability of error $P_e$:
$$P_e = P(\hat{H} = H_2|H_1) P(H_1) + P(\hat{H} = H_1|H_2) P(H_2) = \int_{\xi}^{\infty} p(x|H_1) \, dx \, P(H_1) + \int_{-\infty}^{\xi} p(x|H_2) \, dx \, P(H_2).$$
Here $\xi$ is the decision boundary between the two classes, i.e. we classify $x$ as class $H_1$ when $x < \xi$ and as class $H_2$ when $x > \xi$, and $P(H_i)$ are the priors for each of the two classes.

We begin by finding the Bayes decision boundary $\xi$ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test. That is, we decide class $H_1$ if
$$\frac{p(x|H_1)}{p(x|H_2)} \geq \frac{P(H_2)}{P(H_1)} \left( \frac{C_{12} - C_{22}}{C_{21} - C_{11}} \right),$$
and classify the point as a member of $H_2$ otherwise. The decision boundary $\xi$ is the point at which this inequality is an equality. If we use a minimum probability of error criterion, the costs $C_{ij}$ are given by $C_{ij} = 1 - \delta_{ij}$, and when $P(H_1) = P(H_2)$ the boundary condition reduces to
$$p(\xi|H_1) = p(\xi|H_2).$$
If each conditional distribution is Gaussian, $p(x|H_i) = N(\mu_i, \sigma_i^2)$, then the above expression becomes
$$\frac{1}{\sqrt{2\pi}\sigma_1} \exp\left\{ -\frac{(\xi - \mu_1)^2}{2\sigma_1^2} \right\} = \frac{1}{\sqrt{2\pi}\sigma_2} \exp\left\{ -\frac{(\xi - \mu_2)^2}{2\sigma_2^2} \right\}.$$
If we assume for simplicity that $\sigma_1 = \sigma_2 = \sigma$, the above expression simplifies to $(\xi - \mu_1)^2 = (\xi - \mu_2)^2$, or on taking the square root of both sides, $\xi - \mu_1 = \pm(\xi - \mu_2)$. If we take the plus sign in this equation we get the contradiction $\mu_1 = \mu_2$. Taking the minus sign and solving for $\xi$ we find our decision threshold given by
$$\xi = \frac{\mu_1 + \mu_2}{2}.$$
Again since we have uniform priors $P(H_1) = P(H_2) = 1/2$, we have an error probability given by
$$P_e = \frac{1}{2} \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{(x-\mu_1)^2}{2\sigma^2} \right\} dx + \frac{1}{2} \int_{-\infty}^{\xi} \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{(x-\mu_2)^2}{2\sigma^2} \right\} dx,$$
or
$$2\sqrt{2\pi}\sigma P_e = \int_{\xi}^{\infty} \exp\left\{ -\frac{(x-\mu_1)^2}{2\sigma^2} \right\} dx + \int_{-\infty}^{\xi} \exp\left\{ -\frac{(x-\mu_2)^2}{2\sigma^2} \right\} dx.$$
To evaluate the first integral use the substitution $v = \frac{x-\mu_1}{\sigma}$ (so $dv = \frac{dx}{\sigma}$), and in the second use $v = \frac{x-\mu_2}{\sigma}$ (so again $dv = \frac{dx}{\sigma}$). Since $\xi = \frac{\mu_1+\mu_2}{2}$, this then gives
$$2\sqrt{2\pi}\sigma P_e = \sigma \int_{\frac{\mu_2-\mu_1}{2\sigma}}^{\infty} e^{-v^2/2} \, dv + \sigma \int_{-\infty}^{\frac{\mu_1-\mu_2}{2\sigma}} e^{-v^2/2} \, dv,$$
or
$$P_e = \frac{1}{2\sqrt{2\pi}} \int_{\frac{\mu_2-\mu_1}{2\sigma}}^{\infty} e^{-v^2/2} \, dv + \frac{1}{2\sqrt{2\pi}} \int_{-\infty}^{-\frac{\mu_2-\mu_1}{2\sigma}} e^{-v^2/2} \, dv.$$
Now since $e^{-v^2/2}$ is an even function, we can make the substitution $u = -v$ and transform the second integral into an integral identical to the first. Doing so, the two terms combine and we find
$$P_e = \frac{1}{\sqrt{2\pi}} \int_{\frac{\mu_2-\mu_1}{2\sigma}}^{\infty} e^{-v^2/2} \, dv.$$
Since we have $\mu_1 < \mu_2$, the lower limit on this integral is positive. In the case when $\mu_1 > \mu_2$ we would switch the names of the two classes and still arrive at the above. In the general case our probability of error is given by
$$P_e = \frac{1}{\sqrt{2\pi}} \int_{a}^{\infty} e^{-v^2/2} \, dv,$$
with $a$ defined as $a = \frac{|\mu_2 - \mu_1|}{2\sigma}$. Renaming the integration variable $v$ to $u$, this matches exactly the form given in the book,
$$P_e = \frac{1}{\sqrt{2\pi}} \int_{a}^{\infty} e^{-u^2/2} \, du.$$
In terms of the error function, which is defined as $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt$, we have that
$$P_e = \frac{1}{\sqrt{2\pi}} \int_{a}^{\infty} e^{-v^2/2} \, dv = \frac{1}{2}\left[ \sqrt{\frac{2}{\pi}} \int_0^{\infty} e^{-v^2/2} \, dv - \sqrt{\frac{2}{\pi}} \int_0^{a} e^{-v^2/2} \, dv \right] = \frac{1}{2}\left[ 1 - \operatorname{erf}\left( \frac{\mu_2 - \mu_1}{2\sqrt{2}\,\sigma} \right) \right].$$
Since the error function is programmed into many mathematical libraries, this form is easily evaluated.

Part (b): From the given inequality
$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-t^2/2} \, dt \leq \frac{1}{\sqrt{2\pi}\, a} e^{-a^2/2},$$
and the definition of $a$ as the lower limit of the integration in Part (a) above, we have that
$$P_e \leq \frac{1}{\sqrt{2\pi}} \frac{2\sigma}{|\mu_2 - \mu_1|} \exp\left\{ -\frac{(\mu_2-\mu_1)^2}{8\sigma^2} \right\} \to 0,$$
as $\frac{|\mu_2 - \mu_1|}{\sigma} \to +\infty$. This can happen if $\mu_1$ and $\mu_2$ get far apart or if $\sigma$ shrinks to zero. In each case the classification problem gets easier.
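As a quick numerical check (an addition, not part of the original solution; the values of $\mu_1$, $\mu_2$, and $\sigma$ are made up), the Matlab snippet below evaluates the closed form above and compares it against direct numerical integration.

% Hypothetical parameters for the two class-conditional Gaussians.
mu1 = 0; mu2 = 2; sigma = 1;

a = abs(mu2 - mu1) / (2*sigma);

% Closed form in terms of the error function.
Pe_closed = 0.5 * (1 - erf(a/sqrt(2)));

% Direct numerical integration of (1/sqrt(2*pi)) exp(-v^2/2) from a upward.
Pe_quad = integral(@(v) exp(-v.^2/2)/sqrt(2*pi), a, Inf);

fprintf('closed form: %.6f, quadrature: %.6f\n', Pe_closed, Pe_quad);
% Both print 0.158655 for these values (a = 1).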
Problem 32 (the probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points $x$ that satisfy
$$\|x - \mu_1\|^2 = \|x - \mu_2\|^2.$$
Expanding each side of this expression we find
$$\|x\|^2 - 2 x^T \mu_1 + \|\mu_1\|^2 = \|x\|^2 - 2 x^T \mu_2 + \|\mu_2\|^2.$$
Canceling the common $\|x\|^2$ from each side and grouping the terms linear in $x$, we have
$$x^T (\mu_1 - \mu_2) = \frac{1}{2}\left( \|\mu_1\|^2 - \|\mu_2\|^2 \right),$$
which is the equation of a hyperplane.
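A small numerical illustration (an addition, not from the original text; the mean vectors are made up): the midpoint of the two means satisfies the hyperplane equation identically, as the following Matlab lines confirm.

% Hypothetical means in d = 3 dimensions.
mu1 = [1; 0; 2];
mu2 = [3; 1; 0];

% The boundary is x'*(mu1 - mu2) = (||mu1||^2 - ||mu2||^2)/2.
w = mu1 - mu2;
b = (norm(mu1)^2 - norm(mu2)^2) / 2;

x_mid = (mu1 + mu2) / 2;                 % midpoint of the two means
fprintf('x_mid on boundary: %g = %g\n', x_mid' * w, b);
% Both sides equal -2.5 here; algebraically the midpoint always satisfies
% ((mu1+mu2)/2)'*(mu1-mu2) = (||mu1||^2 - ||mu2||^2)/2.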
Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound
$$\min(a, b) \leq a^{\beta} b^{1-\beta}, \quad \text{for } a > 0,\; b > 0,\; \text{and } 0 \leq \beta \leq 1,$$
one can show that
$$P(\text{error}) \leq P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} e^{-k(\beta)} \quad \text{for } 0 \leq \beta \leq 1.$$
Here the expression $k(\beta)$ is given by
$$k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^T \left[ \beta \Sigma_1 + (1-\beta) \Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\left( \frac{|\beta \Sigma_1 + (1-\beta) \Sigma_2|}{|\Sigma_1|^{\beta} |\Sigma_2|^{1-\beta}} \right).$$
When we desire to compute the Bhattacharyya bound we assign $\beta = 1/2$. For a two-class scalar problem with symmetric means and equal variances, that is where $\mu_1 = -\mu$, $\mu_2 = +\mu$, and $\sigma_1^2 = \sigma_2^2 = \sigma^2$, the expression for $k(\beta)$ above becomes
$$k(\beta) = \frac{\beta(1-\beta)}{2} (2\mu) \left[ \beta\sigma^2 + (1-\beta)\sigma^2 \right]^{-1} (2\mu) + \frac{1}{2} \ln\left( \frac{\beta\sigma^2 + (1-\beta)\sigma^2}{\sigma^{2\beta} \sigma^{2(1-\beta)}} \right) = \frac{2\beta(1-\beta)\mu^2}{\sigma^2}.$$
Since the above expression is valid for all $\beta$ in the range $0 \leq \beta \leq 1$, we can obtain the tightest possible bound by computing the maximum of $k(\beta)$, since the dominant $\beta$-dependent expression is $e^{-k(\beta)}$. To evaluate this maximum we take the derivative with respect to $\beta$ and set the result equal to zero, obtaining
$$\frac{d}{d\beta}\left[ \frac{2\beta(1-\beta)\mu^2}{\sigma^2} \right] = \frac{2\mu^2}{\sigma^2} (1 - 2\beta) = 0,$$
which gives $\beta = 1/2$. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When $\beta = 1/2$ we have the following bound on our error probability:
$$P(\text{error}) \leq \sqrt{P(\omega_1) P(\omega_2)} \, e^{-\mu^2/2\sigma^2}.$$
To further simplify things, if we assume that our variance is such that $\sigma^2 = \mu^2$, and that our priors are taken to be uniform, $P(\omega_1) = P(\omega_2) = 1/2$, the above becomes
$$P(\text{error}) \leq \frac{1}{2} e^{-1/2} \approx 0.3033.$$
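The Matlab check below (an addition, not in the original; $\mu = \sigma = 1$ is an assumed example) confirms numerically that $\beta = 1/2$ maximizes $k(\beta)$ and reproduces the bound $0.3033$.

% Symmetric scalar case: k(beta) = 2*beta*(1-beta)*mu^2/sigma^2.
mu = 1; sigma = 1;                      % assumed example values
k = @(beta) 2 * beta .* (1 - beta) * mu^2 / sigma^2;

beta = linspace(0, 1, 1001);
[k_max, idx] = max(k(beta));
fprintf('k(beta) is maximized at beta = %.2f (k = %.2f)\n', beta(idx), k_max);

% The resulting Bhattacharyya bound with uniform priors:
bound = sqrt(0.5 * 0.5) * exp(-k(0.5));
fprintf('P(error) <= %.4f\n', bound);   % prints 0.3033 when mu = sigma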
Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that operating on a data set of size $n$ requires $f(n)$ numeric calculations. Assuming that each calculation requires $10^{-9}$ seconds of time to compute, we can process a data structure of size $n$ in $10^{-9} f(n)$ seconds. If we want our results finished by a time $T$, then the largest data structure that we can process must satisfy
$$10^{-9} f(n) \leq T, \quad \text{or} \quad f(n) \leq 10^{9} T.$$
Converting the ranges of times into seconds we find that
$$T \in \{1 \text{ sec}, 1 \text{ hour}, 1 \text{ day}, 1 \text{ year}\} = \{1 \text{ sec}, 3600 \text{ sec}, 86400 \text{ sec}, 31536000 \text{ sec}\}.$$
Using a very naive algorithm we can compute the largest $n$ such that $f(n) \leq 10^{9} T$ by simply incrementing $n$ repeatedly. This is done in the Matlab script prob_34_chap_3.m and the results are displayed in Table XXX.
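The original script is not reproduced here; the following is a minimal Matlab sketch of the same computation (the particular complexity functions $f(n)$ chosen are assumptions for illustration, and a bracket-and-bisect search replaces the naive incrementing so the largest $n$ is found quickly).

% For each time budget T and each assumed complexity f, find the largest
% n with f(n) <= 10^9 * T.
T_list = [1, 3600, 86400, 31536000];            % 1 sec, 1 hour, 1 day, 1 year
f_list = {@(n) n.*log2(n), @(n) n.^2, @(n) 2.^n};
names  = {'n log2(n)', 'n^2', '2^n'};

for i = 1:numel(f_list)
    for j = 1:numel(T_list)
        budget = 1e9 * T_list(j);
        lo = 1; hi = 2;
        while f_list{i}(hi) <= budget           % grow an upper bracket
            hi = 2 * hi;
        end
        while hi - lo > 1                       % bisect down to the answer
            mid = floor((lo + hi) / 2);
            if f_list{i}(mid) <= budget
                lo = mid;
            else
                hi = mid;
            end
        end
        fprintf('f(n) = %-9s  T = %8d sec  ->  largest n = %d\n', ...
                names{i}, T_list(j), lo);
    end
end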
Problem 35 (recursive vs. direct calculation of means and covariances)

Part (a): To compute $\hat{\mu}_n$ we have to add $n$ $d$-dimensional vectors, resulting in $nd$ computations. We then have to divide each component by the scalar $n$, resulting in another $d$ computations. Thus in total we have
$$nd + d = (n+1)d$$
calculations.

To compute the sample covariance matrix $C_n$, we have to first compute the differences $x_k - \hat{\mu}_n$, requiring $d$ subtractions for each vector $x_k$. Since we must do this for each of the $n$ vectors $x_k$, we have a total of $nd$ calculations required to compute all of the $x_k - \hat{\mu}_n$ difference vectors. We then have to compute the outer products $(x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)^T$, each of which requires $d^2$ calculations in a naive implementation. Since we have $n$ such outer products to compute, computing them all requires $nd^2$ operations. We next have to add all of these outer products together, resulting in another $(n-1)d^2$ operations. Finally, dividing by the scalar $n-1$ requires another $d^2$ operations. Thus in total we have
$$nd + nd^2 + (n-1)d^2 + d^2 = nd + 2nd^2$$
calculations.

Part (b): We can derive recursive formulas for $\hat{\mu}_n$ and $C_n$ in the following way. First, for $\hat{\mu}_{n+1}$ we note that
$$\hat{\mu}_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{1}{n+1}\left( x_{n+1} + \sum_{k=1}^{n} x_k \right) = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \hat{\mu}_n.$$
Writing $\frac{n}{n+1}$ as $\frac{(n+1)-1}{n+1} = 1 - \frac{1}{n+1}$, the above becomes
$$\hat{\mu}_{n+1} = \hat{\mu}_n + \frac{1}{n+1}\left( x_{n+1} - \hat{\mu}_n \right),$$
as expected.

To derive the recursive update formula for the covariance matrix we begin by recalling the definition of $C_{n+1}$:
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_{n+1})(x_k - \hat{\mu}_{n+1})^T.$$
Now introducing the recursive definition of the mean $\hat{\mu}_{n+1}$ computed above, each summand involves
$$x_k - \hat{\mu}_{n+1} = (x_k - \hat{\mu}_n) - \frac{1}{n+1}\left( x_{n+1} - \hat{\mu}_n \right),$$
so that
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)^T - \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n) \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)^T - \frac{1}{n(n+1)} \left[ \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) \right] (x_{n+1} - \hat{\mu}_n)^T + \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T,$$
where in the fourth term we used the fact that its summand does not depend on $k$, so the sum over $k$ contributes a factor of $n+1$. Since $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$, we see that $\sum_{k=1}^{n} (x_k - \hat{\mu}_n) = 0$, or
$$\sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) = x_{n+1} - \hat{\mu}_n.$$
Using this fact, the second and third terms above can be greatly simplified: each equals $-\frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T$. In the same way the first term splits as
$$\sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)^T = \sum_{k=1}^{n} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)^T + (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T = (n-1) C_n + (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T.$$
Using these identities we have $C_{n+1}$ given by
$$C_{n+1} = \frac{n-1}{n} C_n + \frac{1}{n}\left( 1 - \frac{2}{n+1} + \frac{1}{n+1} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T = \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T,$$
which is the result in the book.

Part (c): To compute $\hat{\mu}_{n+1}$ using the recurrence formulation, we have $d$ subtractions to compute $x_{n+1} - \hat{\mu}_n$, and the scalar division by $n+1$ requires another $d$ operations. Finally, the addition of $\hat{\mu}_n$ requires another $d$ operations, giving in total $3d$ operations, to be compared with the $(n+1)d$ operations in the non-recursive case.

To count the number of operations required to compute $C_{n+1}$ recursively, we have $d$ computations for the difference $x_{n+1} - \hat{\mu}_n$, then $d^2$ operations to compute the outer product $(x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)^T$, another $d^2$ operations to scale this outer product by $\frac{1}{n+1}$, another $d^2$ operations to multiply $C_n$ by $\frac{n-1}{n}$, and finally $d^2$ operations to add the two products together. Thus in total we have
$$d + d^2 + d^2 + d^2 + d^2 = 4d^2 + d$$
operations, to be compared with $nd + 2nd^2$ in the non-recursive case. For large $n$ the non-recursive calculations are much more computationally intensive.
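A small Matlab check of both recursions against direct computation (an addition, not part of the original solution; random data is assumed):

% Verify the recursive mean/covariance updates against direct computation.
data = randn(5, 3);               % n+1 = 5 samples in d = 3 dimensions
n  = 4;
X  = data(1:n, :);                % the first n samples
xn = data(n+1, :)';               % the (n+1)-st sample, as a column vector

mu_n = mean(X)';                  % direct mean of the first n samples
C_n  = cov(X);                    % direct covariance (1/(n-1) normalization)

% The recursions derived above.
mu_rec = mu_n + (xn - mu_n) / (n + 1);
C_rec  = ((n-1)/n) * C_n + (xn - mu_n) * (xn - mu_n)' / (n + 1);

% Direct computation on all n+1 samples for comparison.
fprintf('mean error: %g, covariance error: %g\n', ...
        norm(mu_rec - mean(data)'), norm(C_rec - cov(data), 'fro'));
% Both errors should be at the level of machine precision.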
Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method to prove this here. We desire the gradient (derivative) of the function $\phi = x^T K x$. The $i$-th component of this derivative is given by
$$\frac{\partial \phi}{\partial x_i} = \frac{\partial}{\partial x_i}\left( x^T K x \right) = e_i^T K x + x^T K e_i,$$
where $e_i$ is the $i$-th elementary basis vector for $\mathbb{R}^d$, i.e. it has a 1 in the $i$-th position and zeros everywhere else. Now since a scalar equals its own transpose,
$$x^T K e_i = \left( x^T K e_i \right)^T = e_i^T K^T x,$$
the above becomes
$$\frac{\partial \phi}{\partial x_i} = e_i^T K x + e_i^T K^T x = e_i^T (K + K^T) x.$$
Since multiplying by $e_i^T$ on the left selects the $i$-th row from the expression to its right, we see that the full gradient expression is given by
$$\nabla \phi = (K + K^T) x,$$
as requested in the text. Note that this expression can also be proved easily by writing each term in component notation, as suggested in the text.

Part (b): If $K$ is symmetric, then since $K^T = K$ the above expression simplifies to
$$\nabla \phi = 2 K x,$$
as claimed in the book.
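A quick finite-difference check of this identity in Matlab (an addition, not from the original text; $K$ and $x$ are random):

% Check grad(x'*K*x) = (K + K')*x by central finite differences.
d = 4;
K = randn(d); x = randn(d, 1);
phi = @(x) x' * K * x;

g_analytic = (K + K') * x;
g_numeric  = zeros(d, 1);
h = 1e-6;
for i = 1:d
    e = zeros(d, 1); e(i) = 1;                  % i-th elementary basis vector
    g_numeric(i) = (phi(x + h*e) - phi(x - h*e)) / (2*h);
end
fprintf('max gradient difference: %g\n', max(abs(g_analytic - g_numeric)));
% Should be on the order of 1e-8 or smaller.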
Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem, which is
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
If we let $x = 1$ and $y = 1$ then $(x + y)^n = 2^n$ and the sum above becomes
$$2^n = \sum_{k=0}^{n} \binom{n}{k},$$
which is the stated identity. As an aside, note also that if we let $x = 1$ and $y = -1$ then $(x + y)^n = 0$ and we obtain another interesting identity involving the binomial coefficients:
$$0 = \sum_{k=0}^{n} \binom{n}{k} (-1)^k.$$

Problem 23 (the average of the leave one out means)

We define the leave one out mean $\mu_{(i)}$ as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j.$$
This can obviously be computed for every $i$. Now we want to consider the mean of the leave one out means. To do so, first notice that $\mu_{(i)}$ can be written as
$$\mu_{(i)} = \frac{1}{n-1} \left( \sum_{j=1}^{n} x_j - x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} x_i,$$
where $\hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j$ is the usual sample mean. With this we see that the mean of all of the leave one out means is given by
$$\frac{1}{n} \sum_{i=1}^{n} \mu_{(i)} = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \hat{\mu} = \hat{\mu},$$
as expected.
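A one-line numerical confirmation in Matlab (an addition, not in the original; random data is assumed):

% The average of the n leave-one-out means equals the overall sample mean.
x = randn(10, 1);
n = numel(x);
loo_means = (sum(x) - x) / (n - 1);   % mu_(i) for every i at once
fprintf('mean of LOO means: %.12f, sample mean: %.12f\n', ...
        mean(loo_means), mean(x));
% The two printed values agree to machine precision.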
Problem 40 (maximum likelihood estimation with a binomial random variable)

Since $k$ has a binomial distribution with parameters $(n, p)$, we have
$$P(k) = \binom{n}{k} p^k (1-p)^{n-k}.$$
The $p$ that maximizes this expression is found by taking the derivative of the above with respect to $p$, setting the resulting expression equal to zero, and solving for $p$. This derivative is given by
$$\frac{d}{dp} P(k) = \binom{n}{k} \left[ k p^{k-1} (1-p)^{n-k} - p^k (n-k) (1-p)^{n-k-1} \right].$$
Setting this equal to zero and solving for $p$ gives $k(1-p) = (n-k)p$, so that
$$p = \frac{k}{n},$$
the empirical counting estimate of the probability of success, as we were asked to show.
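Equivalently (an added remark, not in the original), one can differentiate the log-likelihood instead, which avoids the product rule:
$$\ln P(k) = \ln \binom{n}{k} + k \ln p + (n-k) \ln(1-p),$$
$$\frac{d}{dp} \ln P(k) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Longrightarrow\; k(1-p) = (n-k)p \;\Longrightarrow\; p = \frac{k}{n}.$$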