Linear methods for regression and classification with functional data

Gilbert Saporta
Chaire de Statistique Appliquée & CEDRIC
Conservatoire National des Arts et Métiers
292 rue Saint Martin, case 441
75141 Paris cedex 03, France
saporta@cnam.fr

G. Damiana Costanzo
Dipartimento di Economia e Statistica
Università della Calabria
Via P. Bucci, Cubo C
87036 Arcavacata di Rende (CS) - Italy
dm.costanzo@unical.it

Cristian Preda
Département de Statistique-CERIM, Faculté de Médecine
Université de Lille 2, 1, Place de Verdun
59045 Lille Cedex, France
cristian.preda@univ-lille2.fr

Functional data occur when we observe curves or paths from a stochastic process $X_t$. If for each curve or path we have a single response variable $Y$, we have a regression problem when $Y$ is numerical and a classification problem when $Y$ is categorical. We assume here that all trajectories are observed continuously on a time interval $[0;T]$ and that the variables $Y$ (when numerical) and $X_t$ have zero mean.

1. Regression with a functional predictor

The functional linear model considers a predictor which may be expressed as an integral sum:

$$\hat{Y} = \int_0^T X_t\,\beta(t)\,dt$$

The problem is not new and dates back to Fisher (1924), who used the expression "integral regression". It is well known that this regression model yields an ill-posed problem: the least squares criterion leads to the Wiener-Hopf equation, which in general does not have a unique solution:

$$E(X_t Y) = \int_0^T E(X_t X_s)\,\beta(s)\,ds$$

The problem is even worse when we try to estimate the regression coefficient function $\beta(t)$ with a finite number of observations. Since the works of Ramsay & Silverman (1997), many techniques have been applied to solve this kind of problem, mostly by using explicit regularization techniques. High dimensionality and multicollinearity also call for some smoothing. In the functional linear approach, the functional data (the predictor) and the functional parameter can be modelled as linear combinations of basis functions from a given functional family. The literature on that subject essentially differs in the choice of the basis and in the way parameters are estimated.
Basis functions should be chosen to reflect the characteristics of the data: for example, Fourier bases are usually used to model periodic data, while B-spline basis functions are chosen because they have the advantage of finite support. We will focus here on linear methods based on an orthogonal decomposition of the predictors.
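The non-uniqueness of the Wiener-Hopf solution can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the paths are simulated Brownian-like curves, the integral is replaced by a Riemann sum on a regular grid, and all names (`n_curves`, `n_grid`, `beta_true`, `beta_alt`) are hypothetical. With fewer curves than grid points, the empirical covariance matrix is rank deficient, so two very different coefficient functions satisfy the discretized equation equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_grid, T = 20, 100, 1.0
dt = T / n_grid
t = np.arange(1, n_grid + 1) * dt

# Simulated Brownian-like sample paths, centred so that X_t has zero mean.
X = np.cumsum(rng.normal(scale=np.sqrt(dt), size=(n_curves, n_grid)), axis=1)
X -= X.mean(axis=0)

# Response built from a known coefficient function; the integral is a Riemann sum.
beta_true = np.sin(2 * np.pi * t / T)
y = X @ beta_true * dt

# Empirical counterparts of E(X_t X_s) and E(X_t Y).
C = X.T @ X / n_curves
r = X.T @ y / n_curves

# Discretized Wiener-Hopf equation: r = (C * dt) @ beta.
# After centring, C has rank at most n_curves - 1, far below n_grid.
rank = np.linalg.matrix_rank(C)

# Minimum-norm solution of the discretized equation.
beta_min, *_ = np.linalg.lstsq(C * dt, r, rcond=1e-10)

# Adding any direction orthogonal to all observed curves gives another solution.
null_dir = np.linalg.svd(X)[2][-1]
beta_alt = beta_min + 5.0 * null_dir

print("rank of empirical covariance:", rank, "for", n_grid, "grid points")
print("residual (min-norm):", np.linalg.norm((C * dt) @ beta_min - r))
print("residual (alternative):", np.linalg.norm((C * dt) @ beta_alt - r))
```

Both residuals are numerically zero although the two coefficient vectors differ by a vector of norm 5, which is why some form of regularization or dimension reduction is needed before estimating $\beta(t)$.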
1.1 Linear regression on principal components (Preda & Saporta, 2005a)

The use of components derived from the Karhunen-Loeve expansion is, for functional data, the equivalent of principal components regression (PCR). The principal component analysis (PCA) of the stochastic process $(X_t)$ consists in representing $X_t$ as:

$$X_t = \sum_{i=1}^{\infty} f_i(t)\,\xi_i$$

where the principal components $\xi_i = \int_0^T f_i(t)\,X_t\,dt$ are obtained through the eigenfunctions of the covariance operator:

$$\int_0^T C(t,s)\,f_i(s)\,ds = \lambda_i f_i(t)$$

In practice we need to choose an approximation of order $q$:

$$\hat{Y}_q = \sum_{i=1}^{q} \frac{\operatorname{cov}(Y;\xi_i)}{\lambda_i}\,\xi_i$$

But the use of principal components for prediction is heuristic because they are computed independently of the response: the components corresponding to the $q$ largest eigenvalues are not necessarily the $q$ most predictive, but it is difficult to rank an infinite number of components according to $R^2$.

1.2 Functional PLS regression

PLS regression offers a good alternative to the PCR method by replacing the least squares criterion with that of maximal covariance between $(X_t)$ and $Y$:

$$\max_{w}\ \operatorname{cov}^2\!\left(Y, \int_0^T w(t)\,X_t\,dt\right) \quad \text{with } \|w\| = 1$$

The first PLS component is given by $t_1 = \int_0^T w(t)\,X_t\,dt$. The PLS regression is iterative: further PLS components are obtained by maximizing the covariance criterion between the residuals of both $Y$ and $(X_t)$ on the previous components. The PLS approximation is given by:

$$\hat{Y}_{PLS(q)} = c_1 t_1 + \dots + c_q t_q = \int_0^T \hat{\beta}_{PLS(q)}(t)\,X_t\,dt$$

and for functional data the same property as in finite dimension holds: PLS fits closer than PCR,

$$R^2\big(Y;\hat{Y}_{PLS(q)}\big) \ge R^2\big(Y;\hat{Y}_{PCR(q)}\big)$$

since PCR components are obtained irrespective of the response. In Preda & Saporta () we show the convergence of the PLS approximation to the approximation given by the classical linear regression:

$$\lim_{q\to\infty} E\left(\big|\hat{Y}_{PLS(q)} - \hat{Y}\big|^2\right) = 0$$

In practice, the number of PLS components used for regression is determined by cross-validation.

2. Clusterwise PLS regression

Clusterwise regression may be used when heterogeneity is present in the data. It corresponds to a mixture of several regression models, that is, there exists a latent categorical variable $G$ with $k$ categories defining the clusters such that:

$$E(Y \mid X = x, G = g) = \alpha_g + \int_0^T x_t\,\beta_g(t)\,dt, \qquad V(Y \mid X = x, G = g) = \sigma_g^2$$
$k$ is supposed to be known, but not the clusters. Let us recall the classical case of a finite number of predictors: for $n$ observations, the clusterwise linear algorithm finds an optimal partition of the $n$ points, together with the regression model for each cluster (element of the partition), which minimize the criterion:

$$\sum_{g=1}^{k}\ \sum_{i \in g} \left(y_i - \big(\hat{\alpha}_g + \hat{\beta}_g' x_i\big)\right)^2$$

The minimization is achieved by an alternating least squares algorithm of the k-means family, which alternates an OLS fit for each group (supposed known) and an allocation of each unit to the closest regression surface, i.e. the model for which its residual is minimal. Under the hypothesis that residuals within each cluster are independent and normally distributed, this criterion is equivalent to maximization of the likelihood function (Hennig, 2000). For functional regression the previous model is not adequate, and we have proposed to estimate the local models in each cluster by PLS regression in order to overcome this problem. The convergence of this algorithm has been discussed in Preda & Saporta (2005b), and clusterwise PLS functional regression has been applied to predict the behavior of shares of the Paris stock market over a certain lapse of time.

3. Binary classification with a functional predictor

3.1 Fisher's linear discriminant analysis

The previous methods are easily generalized to binary classification, since Fisher's linear discriminant function is equivalent to a multiple regression where the response variable $Y$ is coded with two values $a$ and $b$: most frequently $\pm 1$, but also conveniently $\sqrt{p_1/p_0}$ and $-\sqrt{p_0/p_1}$, with $(p_0, p_1)$ the probability distribution of $Y$.

Costanzo D. et al. (2006) and Preda C. et al. (2007) have applied PLS functional classification to predict the quality of cookies from curves representing the resistance (density) of dough observed during the kneading process. For a given flour, the resistance of dough is recorded during the first 480 s of the kneading process. We have 115 curves which can be considered as sample paths of an $L^2$-continuous stochastic process. Each curve is observed at 241 equispaced time points of the time interval [0, 480]. After kneading, the dough is processed to obtain cookies. For each flour we have the quality $Y$ of the cookies, which can be Good, Adjustable or Bad. Our sample contains 50 observations for Y = Good, 25 for Y = Adjustable and 40 for Y = Bad.
Due to measuring errors, each curve is smoothed using cubic B-spline functions with 6 knots.

Figure 1: Smoothed kneading curves
Figure 2: Discriminant coefficient function
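The smoothing step can be sketched as a least-squares fit on a cubic B-spline basis. This is a minimal illustration and not the authors' implementation: the curve is simulated, the grid and noise level are illustrative, the interior knots are assumed to be equally spaced, and the helper names `bspline_design` and `smooth_curve` are hypothetical. The basis is built with the Cox-de Boor recursion using NumPy only.

```python
import numpy as np

def bspline_design(x, knots, degree=3):
    """Design matrix B[i, j] = B_j(x[i]) computed with the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicator functions of the half-open knot intervals [t_j, t_{j+1}).
    B = np.array([(t[j] <= x) & (x < t[j + 1]) for j in range(len(t) - 1)], float).T
    # Close the last non-empty interval on the right so that x == t[-1] is covered.
    last = np.max(np.nonzero(t[:-1] < t[1:])[0])
    B[x == t[-1], last] = 1.0
    for d in range(1, degree + 1):
        n_basis = len(t) - d - 1
        B_next = np.zeros((len(x), n_basis))
        for j in range(n_basis):
            left = t[j + d] - t[j]
            right = t[j + d + 1] - t[j + 1]
            if left > 0:
                B_next[:, j] += (x - t[j]) / left * B[:, j]
            if right > 0:
                B_next[:, j] += (t[j + d + 1] - x) / right * B[:, j + 1]
        B = B_next
    return B

def smooth_curve(x, y, n_interior=6, degree=3):
    """Least-squares B-spline fit with equally spaced interior knots (an assumption)."""
    a, b = x[0], x[-1]
    interior = np.linspace(a, b, n_interior + 2)[1:-1]
    # Clamped knot vector: boundary knots repeated degree + 1 times.
    knots = np.concatenate([[a] * (degree + 1), interior, [b] * (degree + 1)])
    B = bspline_design(x, knots, degree)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef

# Simulated noisy curve on an illustrative regular grid.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 480.0, 241)
truth = 300.0 + 200.0 * np.sin(np.pi * grid / 480.0)
noisy = truth + rng.normal(scale=10.0, size=grid.size)
smoothed = smooth_curve(grid, noisy)

print("RMSE noisy   :", np.sqrt(np.mean((noisy - truth) ** 2)))
print("RMSE smoothed:", np.sqrt(np.mean((smoothed - truth) ** 2)))
```

Because each B-spline has finite support, every observation only involves a few coefficients, which keeps the fit local; the number of interior knots controls how much of the measurement noise is removed.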
Some of these flours could be adjusted to become Good. Therefore, we have considered the set of Adjustable flours as the test sample and predicted for each one the group membership, Y = {Good, Bad}, using the discriminant coefficient function (Fig. 2) given by the PLS approach on the 90 flours. PLS functional discriminant analysis gave an average error rate of % which is better than discrimination based on principal components.

3.2 Functional logistic regression

Let $Y$ be a binary random variable and $y_1, \dots, y_n$ the corresponding random sample associated with the sample paths $x_i(t)$, $i = 1, \dots, n$. A natural extension of logistic regression (Ramsay et al., 1997) is to define the functional logistic regression model by:

$$\ln\frac{\pi_i}{1-\pi_i} = \alpha + \int_0^T x_i(t)\,\beta(t)\,dt; \quad i = 1, \dots, n \tag{1}$$

where $\pi_i = P(Y = 1 \mid X = x_i(t);\ t \in [0,T])$. It may be assumed (Ramsay et al., 1997) that the parameter function and the sample paths $x_i(t)$ are in the same finite dimensional space:

$$\beta(t) = \sum_{q=1}^{p} b_q\,\psi_q(t) = b'\psi, \qquad x_i(t) = \sum_{q=1}^{p} c_{iq}\,\psi_q(t) = c_i'\psi$$

where $\psi_1(t), \dots, \psi_p(t)$ are the elements of a basis of the finite dimensional space. Such an approximation transforms the functional model (1) into a form similar to the standard multiple logistic regression model, whose design matrix is the matrix containing the coefficients of the expansion of the sample paths in terms of the basis, $C = (c_{iq})$, multiplied by the matrix $\Phi = \big(\phi_{kq} = \int_0^T \psi_k(t)\,\psi_q(t)\,dt\big)$, whose elements are the inner products of the basis functions:

$$\ln\frac{\pi}{1-\pi} = \alpha \mathbf{1} + C\,\Phi\,b$$

with $b = (b_1, \dots, b_p)'$, $\pi = (\pi_1, \dots, \pi_n)'$ and $\mathbf{1}$ being the n-dimensional unity vector. Finally, in order to estimate the parameters, a further approximation by truncating the basis expansion could be considered. Alternatively, regularization or smoothing may be obtained by some roughness penalty approach.

In a similar way as we defined functional PCR earlier, Leng and Müller (2006) use functional logistic regression based on functional principal components with the aim of classifying gene expression curves into known gene groups. With the explicit aim of avoiding multicollinearity and reducing dimensionality, Escabias et al. (2004) and Aguilera et al.
(2006) propose an estimation procedure for functional logistic regression, based on taking as covariates a reduced set of functional principal components of the predictor sample curves, whose approximation is obtained in a finite space of not necessarily orthonormal functions. Two different forms of functional principal components analysis are then considered, and two
different criteria for including the covariates in the model are also considered. Müller and Stadtmüller (2005) consider a functional quasi-likelihood and an approximation of the predictor process with a truncated Karhunen-Loeve expansion. The latter also developed asymptotic distribution theory using functional principal scores. Comparisons with functional LDA are in progress, but it is likely that the differences will be small.

3.3 Anticipated prediction

In many real time applications, such as industrial processes, it is of the highest interest to make anticipated predictions. Let $d_t$ denote the approximation of a discriminant score considered on the time interval $[0, t]$, with $t < T$. For functional PLS or logistic regression the score is

$$d_t = \int_0^t X_s\,\hat{\beta}(s)\,ds$$

but any method leading to an estimation of the posterior probability of belonging to one group gives a score. The objective here is to find $t^* < T$ such that the discriminant function $d_{t^*}$ performs quite as well as $d_T$.

For a binary target $Y$, the ROC curve and the AUC (Area Under Curve) are generally accepted as efficient measures of the discriminating power of a discriminant score. Let $d_t(x)$ be the score value for some unit $x$. Given a threshold $r$, $x$ is classified into $Y = 1$ if $d_t(x) > r$. The true positive rate, or sensitivity, is $P(d_t > r \mid Y = 1)$ and the false positive rate, or 1 - specificity, is $P(d_t > r \mid Y = 0)$. The ROC curve gives the true positive rate as a function of the false positive rate and is invariant under any monotonic increasing transformation of the score. In the case of an inefficient score, both conditional distributions of $d_t$ given $Y = 1$ and $Y = 0$ are identical and the ROC curve is the diagonal line. In the case of perfect discrimination, the ROC curve coincides with the edges of the unit square.

The Area Under the ROC Curve is then a global measure of discrimination. It can easily be proved that $AUC(t) = P(X_1 > X_0)$, where $X_1$ is a random variable distributed as $d_t$ when $Y = 1$ and $X_0$ is independently distributed as $d_t$ when $Y = 0$. Taking all pairs of observations, one in each group, $AUC(t)$ is thus estimated by the percentage of concordant pairs (Wilcoxon-Mann-Whitney statistic).
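The concordant-pairs estimate of the AUC can be written in a few lines. The following is a minimal sketch with made-up scores; the function name `auc_concordant_pairs` is hypothetical, and counting tied pairs as one half is the usual Mann-Whitney convention, assumed here because the text only states the continuous case $P(X_1 > X_0)$.

```python
import math

def auc_concordant_pairs(scores_pos, scores_neg):
    """Estimate AUC = P(X1 > X0) by the share of concordant (positive, negative) pairs.

    Ties are counted as 1/2 (an assumed convention, irrelevant for continuous scores).
    """
    concordant = 0.0
    for s1 in scores_pos:
        for s0 in scores_neg:
            if s1 > s0:
                concordant += 1.0
            elif s1 == s0:
                concordant += 0.5
    return concordant / (len(scores_pos) * len(scores_neg))

# Made-up discriminant scores for units with Y = 1 and Y = 0.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

auc = auc_concordant_pairs(pos, neg)  # 8 concordant pairs out of 9

# Invariance under a monotonic increasing transformation of the score.
auc_transformed = auc_concordant_pairs([math.exp(s) for s in pos],
                                       [math.exp(s) for s in neg])

print(auc, auc_transformed)
```

In this toy example only the pair (0.4, 0.7) is discordant, so the estimate is 8/9, and applying the exponential to all scores leaves it unchanged, as expected from the invariance property of the ROC curve.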
A solution is to define $t^*$ as the first value of $s$ where $AUC(s)$ is not significantly different from $AUC(T)$. Since $AUC(s)$ and $AUC(T)$ are two dependent random variables, we use a bootstrap test for comparing areas under ROC curves: we resample the data M times, according to a stratified scheme in order to keep the number of observations of each group invariant. Let $AUC_m(s)$ and $AUC_m(T)$ be the resampled values of AUC for m = 1 to M, and $\delta_m$ their difference. Testing whether $AUC(s) = AUC(T)$ is performed by using a paired t-test, or a Wilcoxon paired test, on the M values $\delta_m$.

The previous methodology has been applied to the kneading data: the sample of 90 flours is randomly divided into a learning sample of size 60 and a test sample of size 30. In the test sample the two classes have the same number of observations. The functional PLS discriminant analysis gives, with the whole interval [0, 480], an average test error rate of about ., for an average $AUC(T)$ = 0.746. The anticipated prediction procedure gives, for M = 5 and test sample size n = 30 (same number of observations in each class), $t^*$ = 86. Thus, one can reduce the recording period of the resistance of dough to less than half of the current one.
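The bootstrap comparison of two dependent AUCs can be sketched as follows. This is a simplified illustration under explicit assumptions: the scores are simulated, the sketch resamples fixed scores instead of refitting the discriminant function on each resample, and the names `bootstrap_auc_differences` and `paired_t_statistic` are hypothetical. Only the test statistic is computed; its p-value would come from a Student distribution with M - 1 degrees of freedom (for instance with `scipy.stats.ttest_1samp` applied to the differences).

```python
import numpy as np

def auc(scores, labels):
    """Share of concordant (Y = 1, Y = 0) pairs, ties counted as 1/2."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def bootstrap_auc_differences(score_s, score_T, labels, M=200, seed=0):
    """Stratified bootstrap of delta_m = AUC_m(T) - AUC_m(s).

    Units are resampled with replacement within each class, so the class sizes
    stay invariant; the same resampled units are used for both scores (pairing).
    """
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(labels == 1)
    idx_neg = np.flatnonzero(labels == 0)
    deltas = np.empty(M)
    for m in range(M):
        idx = np.concatenate([
            rng.choice(idx_pos, size=idx_pos.size, replace=True),
            rng.choice(idx_neg, size=idx_neg.size, replace=True),
        ])
        deltas[m] = auc(score_T[idx], labels[idx]) - auc(score_s[idx], labels[idx])
    return deltas

def paired_t_statistic(deltas):
    """t statistic for H0: mean(delta) = 0, computed on the M bootstrap differences."""
    return deltas.mean() / (deltas.std(ddof=1) / np.sqrt(deltas.size))

# Simulated example: the score on the full interval separates the classes well,
# the early score (time s) hardly at all.
rng = np.random.default_rng(42)
labels = np.repeat([1, 0], 15)
score_T = 3.0 * labels + rng.normal(size=labels.size)
score_s = 0.2 * labels + rng.normal(size=labels.size)

deltas = bootstrap_auc_differences(score_s, score_T, labels, M=200, seed=0)
t_stat = paired_t_statistic(deltas)
print("mean difference:", deltas.mean(), "t statistic:", t_stat)
```

Scanning increasing values of s and keeping the first one for which the test no longer rejects equality of the two AUCs would give a candidate for $t^*$, in the spirit of the procedure described above.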
4. Conclusion and perspectives

In this paper we addressed the problem of predicting a categorical or numerical variable $Y$ with an infinite set of predictors $X_t$. We advocated linear models, which are easy to use and interpret; multicollinearity between predictors is better handled by PLS than by PCR. A clusterwise generalization is a way to take into account latent heterogeneity as well as some kind of non-linearity. For binary classification we proposed an anticipated prediction technique based on bootstrap comparisons of ROC curves.

Work in progress comprises the extension of clusterwise functional regression to binary classification, comparison with functional logistic regression, as well as on-line forecasting: instead of using the same anticipated decision time $t^*$ for all data, we will try to adapt $t^*$ to each new trajectory given its incoming measurements.

References

Aguilera A.M., Escabias M. & Valderrama M.J. (2006) Using principal components for estimating logistic regression with high-dimensional multicollinear data. Computational Statistics & Data Analysis, 50, 1905-1924.
Costanzo D., Preda C. & Saporta G. (2006) Anticipated prediction in discriminant analysis on functional data for binary response. In COMPSTAT2006, 821-828, Physica-Verlag.
Escabias M., Aguilera A.M. & Valderrama M.J. (2004) Principal component estimation of functional logistic regression: discussion of two different approaches. Nonparametric Statistics, 16, 365-384.
Fisher R.A. (1924) The influence of rainfall on the yield of wheat at Rothamsted. Philosophical Transactions of the Royal Society, B, 213, 89-142.
Hennig C. (2000) Identifiability of models for clusterwise linear regression. J. Classification, 17, 273-296.
Leng X. & Müller H.G. (2006) Classification using functional data analysis for temporal gene expression data. Bioinformatics, 22, 68-76.
Müller H.G. & Stadtmüller U. (2005) Generalized functional linear models. The Annals of Statistics, 33, 774-805.
Preda C. & Saporta G. (2005a) PLS regression on a stochastic process. Computational Statistics and Data Analysis, 48, 149-158.
Preda C. & Saporta G. (2005b) Clusterwise PLS regression on a stochastic process. Computational Statistics and Data Analysis, 49, 99-108.
Preda C., Saporta G. & Lévéder C. (2007) PLS classification of functional data. Computational Statistics.
Ramsay & Silverman (1997) Functional data analysis, Springer.