In The Name of God, The Compassionate, The Merciful

Name: Problems' Keys
Student ID#:

Statistical Pattern Recognition (CE-725)
Department of Computer Engineering
Sharif University of Technology
Final Exam Solution - Spring 2012 (150 minutes, 100+5 points)

1) Basic Concepts (15 points)

a) True or false questions: For each of the following parts, specify whether the given statement is true or false. If true, provide a brief explanation; otherwise, propose a counterexample.
- i. The kernel K(x_i, x_j) is symmetric, where x_i and x_j are the feature vectors for the i-th and j-th examples.
- ii. Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel.
- iii. After training an SVM, we can discard all examples which are not support vectors and can still classify new examples.

b) What would happen if the activation functions at the hidden and output layers of an MLP were linear? Explain why this simpler activation function is not normally used in MLPs, although it would simplify and accelerate the calculations for the back-propagation algorithm.

Solution:
a)
- i. True. K(x_1, x_2) = φ(x_1)·φ(x_2) = φ(x_2)·φ(x_1) = K(x_2, x_1).
- ii. True. Since class-conditional Gaussians always yield (at most) quadratic decision boundaries, they can be reproduced with an SVM with a polynomial kernel of degree less than or equal to two.
- iii. True. Only the support vectors affect the boundary.

b) If we use a linear activation function, the MLP becomes like a Perceptron (a linear classifier). To be more specific, we can write the weights from the input to the hidden layer as a matrix W_HI, the weights from the hidden to the output layer as W_OH, and the biases at the hidden and output layers as vectors b_H and b_O. Using matrix-vector multiplication, the hidden activations can be written as H = b_H + W_HI·I, and the output activations can be written as

O = b_O + W_OH·H = b_O + W_OH·(b_H + W_HI·I) = (b_O + W_OH·b_H) + (W_OH·W_HI)·I = b_OI + W_OI·I,

where b_OI = b_O + W_OH·b_H and W_OI = W_OH·W_HI. Therefore, the same function can be computed with a simpler network, with no hidden layer, using the weights W_OI and bias b_OI.
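The collapse argued in 1(b) is easy to verify numerically. A minimal sketch (the layer sizes are arbitrary; W_HI, W_OH, b_H, b_O follow the notation above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer "MLP" with identity activations (illustrative sizes, not from the exam).
n_in, n_hid, n_out = 4, 8, 3
W_HI = rng.normal(size=(n_hid, n_in));  b_H = rng.normal(size=n_hid)
W_OH = rng.normal(size=(n_out, n_hid)); b_O = rng.normal(size=n_out)

def mlp_linear(I):
    H = b_H + W_HI @ I          # hidden layer, linear activation
    return b_O + W_OH @ H       # output layer, linear activation

# Collapsed single-layer equivalent, as derived in the solution.
W_OI = W_OH @ W_HI
b_OI = b_O + W_OH @ b_H

I = rng.normal(size=n_in)
print(np.allclose(mlp_linear(I), b_OI + W_OI @ I))  # True
```

Because the composition of affine maps is affine, no amount of extra linear layers adds expressive power, which is why nonlinear activations are used in practice.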
2) Support Vector Machines (20 points)

Consider the following data points and labels:

Data point      Label
x1 = (1,1)       +1
x2 = (2,1)       +1
x3 = (2,0)       +1
x4 = (1,2)       -1
x5 = (2,2)       -1
x6 = (1,-3)      -1

Suppose that we use the following embedding function to separate the two classes by a large-margin classifier.
φ(x) = (x1^2 + x2^2, x1 - x2)

a) Find the support vectors, visually.
b) Find the parameters of the SVM classifier (w, w0, λ_i).
c) Introduce an embedding function from 2-D to 1-D that separates the original data points linearly.

Solution:
a) The transformed data points are:

Original          Transformed
x1 = (1,1)        x'1 = (2,0)
x2 = (2,1)        x'2 = (5,1)
x3 = (2,0)        x'3 = (4,2)
x4 = (1,2)        x'4 = (5,-1)
x5 = (2,2)        x'5 = (8,0)
x6 = (1,-3)       x'6 = (10,4)

Then x2, x4 and x6 are the support vectors.

b) Since y_i(w^T x'_i + w0) = 1 for all support vectors, w = (-1, 1) and w0 = 5. Writing λ1, λ2, λ3 for the multipliers of the support vectors x2, x4, x6 (the multipliers of all other points are zero), the constraint Σ_i λ_i y_i = 0 gives λ1 = λ2 + λ3. In addition, w = Σ_i λ_i y_i x'_i, so λ1(5,1) − λ2(5,−1) − λ3(10,4) = (−1,1), and (λ1, λ2, λ3) = (1, 4/5, 1/5).

c) A 1-D embedding that works is φ(x1,x2) = (x1^2 + x2^2) − (x1 − x2): it maps the positive points to {2, 4, 2} and the negative points to {6, 8, 6}, which a threshold at 5 separates linearly.

3) Graphical Methods (25 points)

a) Consider an HMM with three nodes {S1, S2, S3}, outputs {A, B}, initial state probabilities {1, 0, 0}, state transition probability matrix A, and output probability matrix B, where

A = [0.5  0.25  0.25]      B = [0.5  0.5]
    [0    1     0   ]          [0.5  0.5]
    [0    0     1   ]          [1    0  ]

(the rows of B give the probabilities of emitting A and B in each state). Compute P(O1=B, O2=B, ..., O200=B) in the given HMM.

b) Given the following graphical model, which of the following statements are true, regardless of the conditional probability distributions?

[Figure: a directed graphical model over the nodes A, B, C, D, E, F, G, H, I, J, K, L, M, N.]

b1) P(D,H) = P(D)P(H)
b2) P(A,I) = P(A)P(I)
b3) P(A,I|G) = P(A|G)P(I|G)
b4) P(J,G|F) = P(J|F)P(G|F)
b5) P(J,K|L) = P(J|L)P(K|L)
b6) P(E,C|A,G) = P(E|A,G)P(C|A,G)

Solution:
a) We can write the probability as a sum over the state at time 200:

P(O1=B, ..., O200=B) = Σ_i P(O1=B, ..., O200=B, q200=S_i)

Since S2 and S3 are absorbing and play symmetric roles in the transitions,

P_200(S1) = 2^(-199),   P_200(S2) = P_200(S3) = (1 − P_200(S1))/2

- q200 = S1: the chain stayed in S1 the whole time, and each of the 200 emissions is B with probability 1/2, contributing P_200(S1)·2^(-200) = 2^(-399).
- q200 = S2: such a path never visits S3, and both S1 and S2 emit B with probability 1/2, contributing P_200(S2)·2^(-200) = 2^(-201)(1 − 2^(-199)).
- q200 = S3: such a path spends at least one step in S3, which always emits A, contributing 0.

Then P(O1=B, O2=B, ..., O200=B) = 2^(-399) + 2^(-201)(1 − 2^(-199)).

b)
b1) True. There is no active trail on any possible path from D to H (D-C-G-F-E-I-J-H, D-C-G-F-E-I-J-L-H, D-C-B-A-E-I-J-H and D-C-B-A-E-I-J-L-H).
b2) True. There is no active trail on any possible path from A to I (A-E-I and A-B-C-G-F-E-I).
b3) False. There is an active trail on the path A-B-C-G-F-E-I.
b4) False. There is an active trail on the path G-C-B-A-E-I-J (F is a descendant of E, so conditioning on F activates the v-structure at E).
b5) True. There is no active trail on any possible path from J to K.
b6) False. There is an active trail on the path E-F-G-C.

4) Expectation and Maximization (20 points)

Consider a random variable x that is categorical with M possible values {1, 2, ..., M}. Suppose that x is represented as a vector such that x(i) = 1 if x takes the i-th value, and Σ_i x(i) = 1. The distribution of x is represented by a mixture of K discrete multinomial distributions such that

p(x) = Σ_k π_k p(x|k)   and   p(x|k) = Π_j μ_k(j)^x(j)

π_k denotes the mixing coefficient for the k-th component (or the prior probability that the hidden variable takes the value k), and μ_k(j) represents the probability P(x(j)=1 | k). Observing N data points {x_n}, n = 1, ..., N, derive the E and M steps to estimate π_k and μ_k(j) for all values of k and j.

Solution:
The hidden variables are the z_nk's: z_nk is a binary variable which is 1 if x_n is drawn from the k-th distribution.

E step: compute the responsibilities using Bayes' rule:

γ(z_nk) = P(z_nk=1 | x_n, θ) = P(x_n | z_nk=1, θ) P(z_nk=1 | θ) / P(x_n | θ)
        = π_k Π_j μ_k(j)^x_n(j) / Σ_l π_l Π_j μ_l(j)^x_n(j)

M step: the complete-data likelihood is

P(X, Z | θ) = Π_n P(x_n, z_n | θ) = Π_n P(x_n | z_n, θ) P(z_n | θ) = Π_n Π_k [ π_k Π_j μ_k(j)^x_n(j) ]^z_nk

Substituting the above two factors into the likelihood, the log-likelihood is

L(X, Z | θ) = ln P(X, Z | θ) = Σ_n Σ_k z_nk [ ln π_k + Σ_j x_n(j) ln μ_k(j) ]

To estimate the π_k's, fixing the z_nk's and considering the constraint Σ_k π_k = 1, we use a Lagrange multiplier and must optimize the following objective function:

L(π) = L(X, Z | θ) + λ (Σ_k π_k − 1)
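The E step above, together with the M-step updates that this Lagrangian yields (π_k = Σ_n γ(z_nk)/N and μ_k(j) = Σ_n γ(z_nk) x_n(j) / Σ_n γ(z_nk)), can be sketched as code. A minimal sketch, with soft responsibilities standing in for the hard z_nk's; the function name, initialization, and toy data are illustrative, not from the exam:

```python
import numpy as np

rng = np.random.default_rng(1)

def em_multinomial_mixture(X, K, n_iter=50):
    """EM for a mixture of discrete (one-of-M) multinomials.

    X: (N, M) array of one-hot rows. Returns mixing weights pi (K,)
    and component parameters mu (K, M), rows of mu summing to 1.
    """
    N, M = X.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.dirichlet(np.ones(M), size=K)           # random init, rows sum to 1

    for _ in range(n_iter):
        # E step: gamma(z_nk) proportional to pi_k * prod_j mu_k(j)^x_n(j)
        log_p = np.log(pi)[None, :] + X @ np.log(mu).T    # (N, K)
        log_p -= log_p.max(axis=1, keepdims=True)         # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: pi_k = sum_n gamma_nk / N, mu_k(j) = sum_n gamma_nk x_n(j) / sum_n gamma_nk
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        mu = np.clip(mu, 1e-12, None)                     # guard against log(0)
        mu /= mu.sum(axis=1, keepdims=True)
    return pi, mu

# Toy data: one-hot samples from a single skewed multinomial (illustrative only).
X = np.eye(4)[rng.choice(4, size=300, p=[0.7, 0.1, 0.1, 0.1])]
pi, mu = em_multinomial_mixture(X, K=2)
print(np.isclose(pi.sum(), 1), np.allclose(mu.sum(axis=1), 1))  # True True
```

The log-domain E step avoids underflow when the product over j of μ_k(j)^x_n(j) is tiny.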
Setting the differentiation to zero we have

∂L/∂π_k = Σ_n z_nk / π_k + λ = 0,  so  λ π_k = −Σ_n z_nk

To calculate the value of λ we use the constraint Σ_k π_k = 1: summing the previous equation over k results in λ = −Σ_k Σ_n z_nk = −N, since Σ_k z_nk = 1 for each n. We complete the estimation by substituting this value of λ into the previously obtained equation for π_k. The final result is:

π_k = Σ_n z_nk / Σ_l Σ_n z_nl = (1/N) Σ_n z_nk

In the same way, to estimate the μ_k's, fixing the z_nk's and considering the constraint Σ_l μ_k(l) = 1, we use a Lagrange multiplier and must optimize the following objective function:

L(μ_k) = L(X, Z | θ) + λ (Σ_l μ_k(l) − 1)

Setting the differentiation to zero we have

∂L/∂μ_k(j) = Σ_n z_nk x_n(j) / μ_k(j) + λ = 0,  so  λ μ_k(j) = −Σ_n z_nk x_n(j)

To calculate the value of λ we use the constraint Σ_l μ_k(l) = 1, which gives λ = −Σ_l Σ_n z_nk x_n(l) = −Σ_n z_nk. We complete the estimation by substituting this value of λ into the previously obtained equation for μ_k. The final result is:

μ_k(j) = Σ_n z_nk x_n(j) / Σ_l Σ_n z_nk x_n(l) = Σ_n z_nk x_n(j) / Σ_n z_nk

(In the EM iterations, the hard assignments z_nk in these updates are replaced by their expectations, the responsibilities γ(z_nk) computed in the E step.)

5) Clustering (15 points)

a) Assume we are trying to cluster the points 2^0, 2^1, 2^2, ..., 2^n (a total of n+1 points, where n+1 = 2^m) using hierarchical clustering. We break ties by combining the two clusters in which the lowest number resides. For example, if the distance between clusters A and B is the same as the distance between clusters C and D, we would choose A and B as the next two clusters to combine if min{A,B} < min{C,D}, where {A,B} are the sets of numbers assigned to A and B.

a1) If we are using Euclidean distance, draw a sketch of the hierarchical clustering tree we would obtain for each of the single/complete linkage methods.
a2) Now assume we are using the distance function d(p,q) = max(p,q)/min(p,q). Which of the single/complete linkage methods will result in a different tree from the one obtained in (a1) when using this distance function? If you think that one or more of these methods will result in a different tree, sketch the new tree as well.

b) Consider the following algorithm to partition the data points into k clusters:
1. Calculate the pairwise distance d(P_i, P_j) between every two data points P_i and P_j in the set of data points to be clustered, and build a complete graph on the set of data points with edge weights corresponding to the distances.
2. Generate the Minimum Spanning Tree of the graph, i.e. choose the subset of edges E' with minimum sum of weights such that G' = (P, E') is a single connected tree.
3. Throw out the k−1 edges with the heaviest weights to generate k disconnected trees corresponding to the k clusters.
Identify which of the clustering algorithms you saw in the class corresponds to the mentioned algorithm.

Solution:
a1) All linkage methods lead to the same tree, shown in Fig. a.

[Fig. a: the "caterpillar" tree that merges 2^0 with 2^1, then that cluster with 2^2, and so on up to 2^n. Fig. b: the balanced tree that first merges the pairs (2^0, 2^1), (2^2, 2^3), ..., then merges adjacent pairs of pairs, and so on.]

a2) Single link does not change. Complete link changes to the tree shown in Fig. b.

b) The algorithm corresponds to single-link bottom-up (agglomerative) clustering. The edges used to calculate the cluster distances for single-link bottom-up clustering correspond to the edges of the MST, since all points must be clustered, and the single-link cluster distance chooses the minimum-weight edge joining two so-far unconnected clusters. Thus, the heaviest edge in the tree corresponds to the topmost split into clusters, and so on.

6) Semi-Supervised Learning (10 points)

Consider the following figure (a), which contains labeled data (class 1: filled black circles; class 2: hollow circles) and unlabeled data (squares). We would like to use two methods (re-weighting and co-training) in order to utilize the unlabeled data when training a Gaussian classifier.

[Fig. a and Fig. b: scatter plots of the data in the (x1, x2) plane; Fig. b additionally shows a dashed circle centered on a labeled point.]

a) How can we use co-training in this case (what are the two classifiers)?
b) We would like to use re-weighting of unlabeled data to improve the classification performance. The re-weighting method is done by placing the dashed circle (shown in figure b) on each of the labeled data points and counting the number of unlabeled data points in that circle. Next, a Gaussian classifier is run with the new weights computed.
b1) To what class (hollow circles or filled circles) would we assign the unlabeled point A if we were training a Gaussian classifier using only the labeled data points (with no re-weighting)?
b2) To what class (hollow circles or filled circles) would we assign the unlabeled point A if we were training a classifier using the re-weighting procedure described above?

Solution:
a) Co-training partitions the feature space into two separate sets and uses these sets to construct independent classifiers. Here, the most natural way is to use one classifier (a Gaussian) for the x1 axis and a second (another Gaussian) for the x2 axis.

b1) The hollow class.
Note that the hollow points are much more spread out, so the Gaussian learned for them will have a higher variance.

b2) Again, the hollow class. Re-weighting will not change the result, since it is done independently for each of the two classes and will produce class centers very similar to the ones in (b1) above.

Good Luck!
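Several numerical answers above lend themselves to a quick spot-check: the SVM parameters of problem 2, the HMM probability of problem 3(a), and the MST/single-link equivalence of problem 5(b). A minimal sketch, assuming the embedding φ(x) = (x1^2 + x2^2, x1 − x2) and the A and B matrices of problem 3(a); the 1-D point set for the clustering check is illustrative, and tie-breaking is omitted since its inter-cluster distances are distinct:

```python
import numpy as np
from itertools import combinations

# Problem 2: margins y_i (w . phi(x_i) + w0) in the embedded space.
X = np.array([(1, 1), (2, 1), (2, 0), (1, 2), (2, 2), (1, -3)], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])
phi = lambda x: np.array([x[0] ** 2 + x[1] ** 2, x[0] - x[1]])
w, w0 = np.array([-1.0, 1.0]), 5.0
margins = np.array([yi * (w @ phi(xi) + w0) for xi, yi in zip(X, y)])
print(margins)                 # support vectors are the points with margin 1: x2, x4, x6
lam = {2: 1.0, 4: 4 / 5, 6: 1 / 5}
w_rec = sum(l * y[i - 1] * phi(X[i - 1]) for i, l in lam.items())
print(np.allclose(w_rec, w))   # w = sum_i lambda_i y_i phi(x_i)

# Problem 3(a): forward algorithm for P(O_1 = B, ..., O_200 = B).
pi0 = np.array([1.0, 0.0, 0.0])
A = np.array([[0.5, 0.25, 0.25], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
B = np.array([[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]])   # rows: P(A), P(B) per state
alpha = pi0 * B[:, 1]                                # emit B at t = 1
for _ in range(199):
    alpha = (alpha @ A) * B[:, 1]
closed_form = 2.0 ** -399 + 2.0 ** -201 * (1 - 2.0 ** -199)
print(np.isclose(alpha.sum(), closed_form))

# Problem 5(b): cutting the k-1 heaviest MST edges = single-link stopped at k clusters.
pts, k = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0], 2
n = len(pts)
parent = list(range(n))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]; i = parent[i]
    return i
mst = []
for d, i, j in sorted((abs(pts[i] - pts[j]), i, j) for i, j in combinations(range(n), 2)):
    if find(i) != find(j):                # Kruskal's algorithm
        parent[find(i)] = find(j); mst.append((d, i, j))
parent = list(range(n))
for d, i, j in sorted(mst)[: n - k]:      # keep all but the k-1 heaviest MST edges
    parent[find(i)] = find(j)
mst_clusters = {frozenset(i for i in range(n) if find(i) == r) for r in {find(i) for i in range(n)}}

clusters = [{i} for i in range(n)]        # single-link agglomerative, stop at k clusters
while len(clusters) > k:
    a, b = min(combinations(range(len(clusters)), 2),
               key=lambda ab: min(abs(pts[i] - pts[j])
                                  for i in clusters[ab[0]] for j in clusters[ab[1]]))
    clusters[a] |= clusters.pop(b)
print({frozenset(c) for c in clusters} == mst_clusters)   # True
```

The forward recursion works in plain float64 here because the probabilities (around 2^-201) stay well above the smallest representable double.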