How To Know The Components Of Mean Squared Error Of Hierarchical Estimator


SCHEDAE INFORMATICAE
VOLUME 20, 2011

On Mean Squared Error of Hierarchical Estimator

Stanisław Brodowski
Faculty of Physics, Astronomy, and Applied Computer Science, Jagiellonian University, Reymonta 4, Kraków, Poland
e-mail: stanislaw.brodowski@uj.edu.pl

Abstract. In this paper a new theorem about the components of the mean squared error of Hierarchical Estimator is presented. Hierarchical Estimator is a machine learning meta-algorithm that attempts to build, in an incremental and hierarchical manner, a tree of relatively simple function estimators and to combine their results to achieve better accuracy than any of the individual ones. The components of the error of a node of such a tree are: a weighted mean of the error of the estimator in the node and the errors of its children, a non-positive term that decreases below 0 if the children's responses on any example differ, and a term representing the relative quality of an internal weighting function, which can be conservatively kept at 0 if needed. Guidelines for achieving good results based on the theorem are briefly discussed.

Keywords: Hierarchical Estimator, hierarchical model, regression, function approximation, error, theorem.

1. Introduction

Machine learning is one of the classical topics in computer science [1, 2]. This paper presents some theoretical findings about a machine learning solution concerned with supervised learning, called Hierarchical Estimator. That meta-algorithm, presented in [3], arranges many simple, possibly relatively inaccurate, function estimators (approximators) into a tree structure and combines their results in an attempt to obtain a more accurate one. The basic general task of the mentioned technique is to predict values of a random variable $Y$ with possible values in $\mathcal{Y} \subset \mathbb{R}^r$, being presented with values of another

variable $X$ (with possible values in $\mathcal{X} \subset \mathbb{R}^p$) and knowing some set (or series) of values of $X$ paired with values of $Y$, called the training set $D = \{(x^{(k)}, y^{(k)}),\ k \in \{1, \dots, |D|\},\ \forall k: x^{(k)} \in \mathcal{X},\ y^{(k)} \in \mathcal{Y}\}$. This may be done by approximating a function $f: \mathcal{X} \to \mathcal{Y}$ such that

$$Y = f(X) + \varepsilon, \tag{1}$$

where $\varepsilon$ is some error variable with certain properties (e.g. having 0 mean). Because the joint probability $P_{X,Y}$ (and so also $\varepsilon$) is not available, usually minimization of a loss function, e.g. the squared loss over $D$, is attempted instead [4].

As mentioned above, the main task is prediction, so working on examples that are not in the training set is required. If a solution works well only on the training set, but poorly on unseen examples, it is described as having low generalization. If the technique that is used is parametric, i.e. first some model is selected and then its parameters are optimized, low generalization is often a result of selecting too complicated a model [5, 6, 7].

1.1. Similar solutions

Hierarchical Estimator attempts to combine many less accurate estimators into a more accurate one, so it is loosely related to the Theory of Weak Learnability [8]. Its execution may be seen as building a problem model in an incremental manner, starting from a simple one and increasing complexity. Because some parts of it guide the creation and working of others, it may be considered hierarchical. It creates a tree structure that is automatically adapted to the problem being learned, so its similarity to the well-known AdaBoost [9] is at most moderate. Another difference is that while the original AdaBoost sets the weight of component models for all examples, Hierarchical Estimator assigns different weights to the experts based on the example being evaluated. This makes it more similar to Hierarchical Mixture of Experts (see [10]), but its operation differs significantly, even when constructive algorithms like [11] are considered. For example, HME has expert nodes in leaf nodes, while Hierarchical Estimator has them in all nodes and they all solve some subproblem of the original problem.

The outputs of internal nodes can be used both for evaluating the result and, after additional processing and possibly including other variables, for weighting the results of component estimators. This also constitutes the most significant of many differences between Hierarchical Estimator and regression trees M5 [12]. Probably the most similar solution to Hierarchical Estimator is Hierarchical Classifier [13], based on similar premises. Its details are strongly connected to the classification task, though, and that forced many differences [3]. Hierarchical Estimator is designed for predicting continuously-valued number or vector outputs, so its scope is different from that of Hierarchical Classifier. The meta-algorithm nature of Hierarchical Estimator is also more explicit than in the case of Hierarchical Classifier.

1.2. Hierarchical Estimator

1.2.1. Basic definitions

In a very general sense, Hierarchical Estimator is a function $HE: \mathcal{X} \to \mathcal{Y}$ that uses a tree structure where node indices are from some index set $I$. Let $|i|$ be the number of children (possibly 0) of the valid tree node with index $i$ (called, for the sake of brevity, node $i$). Two functions are assigned to each valid node [3]:

1. a function estimator (approximator) $g_i: \mathcal{X} \to \mathcal{Y}$ that solves some subproblem of the original problem; in [3] simple neural networks are used for this task,

2. a competence function $C_i: \{0, \dots, |i|\} \times \mathcal{X} \to [0, 1]$ whose values are used as weights for the results of the children nodes and the result of the estimator in node $i$ when the value of the estimator for a given example is calculated, as described in Eq. (2).

We assign each child a number among its siblings. $P: I \times \mathbb{N} \to I$ is the function that returns the global node index of a child based on the parent's index and that child number, i.e. $P(i, j)$ gives the index of the $j$-th child of node $i$.

Definition 1 (Hierarchical Estimator node response). The recursive formula for retrieving the response $\bar g_i$ of some node $i$ on the $k$-th example is [3]:

$$\bar g_i(x^{(k)}) = \sum_{j=1}^{|i|} \bar g_{P(i,j)}(x^{(k)})\, C_i(j, x^{(k)}) + C_i(0, x^{(k)})\, g_i(x^{(k)}), \tag{2}$$

where

$$\sum_{j=0}^{|i|} C_i(j, x^{(k)}) = 1. \tag{3}$$

Definition 2 (Hierarchical Estimator response). The Hierarchical Estimator response for a given example is the response of the root (its index denoted here as $r$):

$$HE(x) = \bar g_r(x). \tag{4}$$

For a leaf, $C_i(0, x^{(k)}) = 1$ and $\bar g_i(x^{(k)}) = g_i(x^{(k)})$. A more compact version of the definition arises when we identify the result of the estimator in a given node, $g_i(x^{(k)})$, with the result of a virtual zeroth child $\bar g_{P(i,0)}$:

$$\bar g_i(x^{(k)}) = \sum_{j=0}^{|i|} \bar g_{P(i,j)}(x^{(k)})\, C_i(j, x^{(k)}). \tag{5}$$

When aggregating the result, an example is first propagated down the tree starting from the root. Weights are proposed for a given example and each child node by

function $C_i$ and only those children that achieved a non-zero value are used. This means that an example is not propagated through the whole tree, but only along certain paths and branches. The propagation along a given path stops if it reaches a node in which $C_i(0, x^{(k)}) = 1$, usually, but not necessarily, a leaf node. It is very important that function $C_i$ depends on the example being evaluated. Therefore, although for any given example the response of the Hierarchical Estimator is a weighted mean of the responses of some nodes, the whole Hierarchical Estimator is not a linear combination of the estimators in the nodes.

1.2.2. Useful terms: competence

In the discussion about Hierarchical Estimator two more definitions will be very useful [3]:

Definition 3 (Competence area). Competence area is the set of all feature vectors that a given node may possibly be required to evaluate.

Definition 4 (Competence set). Competence set contains all examples from a given set (also if that set is only known from the context in which the term is used) that fall into the competence area of the node.

An example can fall into the competence set or competence area of a node if it is a valid feature vector and the node is the root, or if, for some given set $S$ (a set of all possible vectors in the case of competence area) and the given node being the $j$-th child of node $i$, the competence function from node $i$ is non-zero. In the latter case the competence set is designated as $S_{P(i,j)}$ and follows:

$$S_{P(i,j)} = \{(x^{(k)}, y^{(k)}) \mid (x^{(k)}, y^{(k)}) \in S \wedge C_i(j, x^{(k)}) > 0\}. \tag{6}$$

This can also be applied to the virtual child.

1.2.3. Learning

The whole structure of Hierarchical Estimator is found while learning from examples, so at least a brief description of the learning algorithm is needed to fully understand the consequences of the theoretical findings described in this article. The procedure of learning Hierarchical Estimator on a training set $D$ is:

1. Create the root node and make $D$ its training set.

2. Build a function estimator (possibly simple) in the processed node (later called node $i$).

3. Compute $E(S, g_i)$, the mean squared error or some other error measure, for the given node and its competence set (which is not necessarily identical to the training set $D_i$). If it is smaller than some preset value ("the goal"), stop the algorithm for this branch. If, on the other hand, this error is greater than that of its parent (on the same set), stop the algorithm for this branch, but also delete this node.

4. If the solution is becoming too complex with respect to some preset parameter (usually maximum tree depth), stop the algorithm (for this branch). This condition is placed to limit the learning time.

5. Build
(a) Training sets for the children nodes $\{D_{P(i,1)}, \dots, D_{P(i,|i|)}\}$ ($|i|$ also needs to be found). This is usually done by creating a function $U_i$ such that $(x^{(k)}, y^{(k)}) \in D_{P(i,j)} \iff U_i(j, x^{(k)}, y^{(k)}) > 0$. Because competence sets generally should overlap (as indicated in [3]), training sets usually will overlap as well.
(b) Competence function $C_i$.
As these tasks are closely related, they are usually performed together [3].

6. Run this algorithm for the children of the given node, starting from point 2.

In [3], the creation of the competence function $C_i$ and the division of the training set are based primarily on the responses of the estimator in the node. Usually this involves some form of fuzzy clustering, e.g. Fuzzy C-Means [14] with the cluster number selection technique described in [15]. For example, in the simplest, but not very effective form, the outputs of the estimator in the node are fuzzy-clustered, each cluster constitutes one training set for a child and the competence function value is the membership of the given example in the given cluster. In one of the more sophisticated methods, both the outputs of the estimator in the node and the true values are clustered by means of fuzzy clustering. Then a correlation matrix is made between clusters in estimator outputs and clusters in true values. Finally, the rows of such a matrix are clustered. The competence function is based on finding the memberships of a given example in each row, by using the memberships of the example in response clusters, the correlation matrix and a chosen set of fuzzy operators, and then combining this information, again using fuzzy operators, with the memberships of each row in the final clusters. The training sets are found in a similar way, but information about true values is also used. Paper [3] presents this method in detail and in two variants, as well as one other method.

It should be mentioned that because the solution presented in [3] uses Artificial Neural Networks as estimators in the nodes, the data given as their input should be adequately prepared, among other things normalized. This is sometimes not a trivial task [16, 17], though usually standard normalization procedures are used.
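Once a tree has been learned, evaluating it follows Definition 1. The sketch below is only an illustration of that recursion: the `Node` class, the callable interfaces for the estimator and the competence function, and the toy tree at the end are assumptions made for this example, not part of the solution described in [3]. Children with zero weight are skipped, which mirrors the remark that an example is propagated only along some branches.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = Sequence[float]


def leaf_competence(x: Vector) -> List[float]:
    """Competence of a leaf: the node's own estimator gets the whole weight."""
    return [1.0]


@dataclass
class Node:
    # g_i: the node's own estimator (hypothetical callable interface).
    estimator: Callable[[Vector], List[float]]
    # C_i: returns [C_i(0, x), C_i(1, x), ...]; non-negative, summing to 1.
    competence: Callable[[Vector], List[float]] = leaf_competence
    children: List["Node"] = field(default_factory=list)


def respond(node: Node, x: Vector) -> List[float]:
    """Node response: weighted sum of the own estimator and the used children."""
    weights = node.competence(x)
    if len(weights) != len(node.children) + 1:
        raise ValueError("one weight per child plus one for the node itself")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("competence values must sum to 1 (Eq. (3))")
    # Weight 0 belongs to the node's own estimator (the virtual zeroth child).
    out = [weights[0] * v for v in node.estimator(x)]
    for w, child in zip(weights[1:], node.children):
        if w > 0.0:  # zero-weight children are not evaluated at all
            child_out = respond(child, x)
            out = [o + w * c for o, c in zip(out, child_out)]
    return out


# Toy tree: the root uses itself and its first child with equal weights.
child_a = Node(estimator=lambda x: [2.0 * x[0]])
child_b = Node(estimator=lambda x: [0.0])
root = Node(
    estimator=lambda x: [x[0]],
    competence=lambda x: [0.5, 0.5, 0.0],
    children=[child_a, child_b],
)
response = respond(root, [3.0])  # 0.5 * 3.0 + 0.5 * 6.0
```

In this toy tree the competence function is constant; in the actual solution it depends on the example, which is why the whole estimator is not a linear combination of the node estimators.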

1.2.4. Details

As can be seen from the definitions above, certain important details have to be determined separately. This concerns not only the selection of estimators in nodes (as in many other solutions, e.g. AdaBoost) but also the exact form of the competence function and the creation of training sets for successor nodes. Several versions of such details are described in [3] and their performance is evaluated on several datasets. They are inspired by the theorems cited in Section 2.2 and their proofs.

2. Error Structure of Hierarchical Estimator

2.1. Preliminary Notions

In this section, several notions will be used that were not explained above, because their scope is more limited. For convenience they are grouped here. Most of them appear in a similar form in [3].

$S = \{(x^{(k)}, y^{(k)}) \mid k = 1, \dots, |S|\}$ is used for a set of examples on which the estimator (or a given node) is evaluated. $|S|$ is the size of that set. Please recall that each $x^{(k)} \in \mathcal{X} \subset \mathbb{R}^p$, $y^{(k)} \in \mathcal{Y} \subset \mathbb{R}^r$.

$|S_{P(i,j)}|$ is the size of the set $S_{P(i,j)}$, the competence set of the $j$-th child of node $i$ within set $S$.

$e((x^{(k)}, y^{(k)}), g)$ is the squared error of estimator $g$ on example $(x^{(k)}, y^{(k)})$:

$$e((x^{(k)}, y^{(k)}), g) = \sum_{l=1}^{r} \left(y^{(k)}_l - g(x^{(k)})_l\right)^2. \tag{7}$$

This notation can be shortened as in:

$$e_{(i,j)}(k) = e((x^{(k)}, y^{(k)}), g_{P(i,j)}), \tag{8}$$

$$\tilde e_{(i,j)}(k) = e((x^{(k)}, y^{(k)}), \bar g_{P(i,j)}), \tag{9}$$

$$e_{(i)}(k) = e((x^{(k)}, y^{(k)}), g_i). \tag{10}$$

$\eta_{i,j}(k)$ is a short way of denoting the difference between the target function value on the $k$-th example of a given set and the result of the $j$-th child of node $i$ for this example:

$$\eta_{i,j}(k) = \bar g_{P(i,j)}(x^{(k)}) - y^{(k)}, \quad x^{(k)} \in S_{P(i,j)}. \tag{11}$$

The error function can be easily created from $\eta$:

$$\tilde e_{(i,j)}(k) = \sum_{l=1}^{r} \eta_{i,j}(k)_l^2. \tag{12}$$

$E(S, g)$ is the mean squared error of estimator $g$ on the set $S$:

$$E(S, g) = \frac{1}{|S|} \sum_{k=1}^{|S|} e((x^{(k)}, y^{(k)}), g). \tag{13}$$

$\hat C_i$ is the characteristic (indicator) function of the competence set (or area):

$$\hat C_i(j, x^{(k)}) = \begin{cases} 1, & C_i(j, x^{(k)}) > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

Note that $C_i$ multiplied by $\hat C_i$ is still $C_i$.

$n_k$ is the number of such $j$ for which $C_i(j, x^{(k)}) > 0$, so it is the number of children actually used on a given example (possibly including the virtual child 0). $n_{max}$ is its maximum on the whole set $S$:

$$n_{max} = \max_{k: (x^{(k)}, y^{(k)}) \in S} n_k.$$

$n$ is used if $n_k$ is constant over all examples that are considered, so the index $k$ can be omitted.

2.2. Existing theorems about error of Hierarchical Estimator

In [3] several facts were proved about the squared error of Hierarchical Estimator. For the purpose of this article the first of them is of most interest.

Theorem 1. For any node $i$ in Hierarchical Estimator suppose that:

$S$ is a competence set of node $i$,

for each example in set $S$, $n_k$ is constant:

$$\forall k: (x^{(k)}, y^{(k)}) \in S, \quad \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)}) = n, \tag{15}$$

where $n > 0$,

$C_i$ fulfills

$$\sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j: C_i(j, x^{(k)}) > 0} \eta_{i,j}(k)_l\, C_i(j, x^{(k)}) \Bigg)^2 \le \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \frac{1}{n} \sum_{j: C_i(j, x^{(k)}) > 0} \eta_{i,j}(k)_l \Bigg)^2. \tag{16}$$

Then

$$E(S, \bar g_i) \le E(S, g_i) - \sum_{j=1}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right). \tag{17}$$

In other words, if we always use $n$ children for an example (or one less, but together with the estimator in the given node) and the errors achieved when the given competence function is used are no greater than if the same children were used but weighted equally, then the error is no greater than the error of the estimator diminished by the differences between its error on the competence sets of the children and the children's errors on those sets. It is not a surprising result, but one of the corollaries proved in [3] states that the final inequality can easily be made strict: it is enough that the used children nodes have different errors on one example. This theorem and its proof brought some more detailed information on what is needed for the solution to work properly.

The assumption (15) about a constant number of used children is inconvenient (though necessary for the given form of the theorem), so a modified version of the theorem was proved, exchanging it for another one (considered weaker by the author) [3]:

Theorem 2. Consider node $i$ and example set $S$. Here points (15) and (16) from Theorem 1 are replaced by:

$$\sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j: C_i(j, x^{(k)}) > 0} \eta_{i,j}(k)_l\, C_i(j, x^{(k)}) \Bigg)^2 \le \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{\substack{j: C_i(j, x^{(k)}) > 0 \\ j > 0}} \frac{\eta_{i,j}(k)_l}{n_{max}} + \frac{n_{max} - n_k + 1}{n_{max}}\, \eta_{i,0}(k)_l \Bigg)^2 \tag{18}$$

and

$$\forall k: (x^{(k)}, y^{(k)}) \in S, \quad C_i(0, x^{(k)}) > 0. \tag{19}$$

The conclusion is then

$$E(S, \bar g_i) \le E(S, g_i) - \sum_{j=1}^{|i|} \frac{|S_{P(i,j)}|}{n_{max} |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right). \tag{20}$$

The assumption (15) about a constant number of used children is replaced by a requirement that the estimator in a given (parent) node is always used (19). It may be in many cases less restricting than that of the first theorem and possibly also more technical, as we can use arbitrarily small values of the competence function for that node. The thesis changed accordingly and can be called somewhat weaker. The sketch of the proof is also in [3]; the technical details are in [18].

These two theorems laid the foundation for several corollaries, also proved in [3]. One of the most important (apart from the one mentioned above, about strict inequality) states that if the conditions of this theorem are met and each child node gives better average results than its parent on the child's competence set, then adding nodes to the tree decreases the error on the respective set (on which the conditions are met).

Unfortunately, strictly meeting the assumptions of those theorems is not easy on examples that were not used for training. However, it was not established that those are necessary conditions, just sufficient ones, so, for example, it is not perfectly clear what really happens if one or more of them are not met. That is why a somewhat more detailed analysis is attempted in this paper.

2.3. The new theorem concerning error components

The theorems from [3] mentioned several conditions sufficient for the solution to work well and pointed at several places in which the inequality in Theorem 1 might be made strict, but did not formally answer the question about the performance of the solution when not all conditions are perfectly met, or how large the difference can be. The theorem presented below tries to formally shed some light on this. As Theorem 1 was the basic one, the new theorem is a modification of that one.

Below is additional notation that was not needed for the previous theorems, but is necessary now.

$\tau$ is the notation corresponding to the assumption given in (16), the relative quality condition on function $C_i$:

$$\tau = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \tau_{kl},$$

where $\tau_{kl}$ is just the difference between the error on example $k$ on the coordinate $l$ when the actual competence function $C_i$ is used and the error in the case when the same estimators would be used for the evaluation of that example (as $\hat C_i$ is nonzero if and only if $C_i$ is nonzero), but the results were weighted equally:

$$\tau_{kl} = \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, C_i(j, x^{(k)}) \Bigg)^2 - \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, \frac{\hat C_i(j, x^{(k)})}{n} \Bigg)^2. \tag{21}$$
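As a small worked illustration of Eq. (21) (the numbers are chosen arbitrarily for this example and do not come from any experiment), take one example and one coordinate, with two used estimators ($n = 2$) whose differences from the target are 1 and 3, and a competence function that gives the better one the weight 0.75:

```latex
\begin{align*}
\eta_{i,0}(k)_l &= 1, \quad \eta_{i,1}(k)_l = 3, \quad
C_i(0, x^{(k)}) = 0.75, \quad C_i(1, x^{(k)}) = 0.25, \\
\tau_{kl} &= (1 \cdot 0.75 + 3 \cdot 0.25)^2 - \left(\frac{1 + 3}{2}\right)^2
           = 1.5^2 - 2^2 = -1.75 .
\end{align*}
```

Here $\tau_{kl}$ is negative because the larger weight went to the estimator with the smaller error. With weights 0.5 and 0.5 both terms would be equal and $\tau_{kl}$ would be 0.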

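The notation $\delta$ is used to describe one of the error components; its full meaning will be best explained during the proof:

$$\delta = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \delta_{kl},$$

$$\delta_{kl} = \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)})\, \hat C_i(j, x^{(k)}) \Bigg)^2 - \sum_{j=0}^{|i|} \left( \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)}) \right)^2 \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)})^2. \tag{22}$$

According to the Cauchy-Schwarz inequality, $\delta_{kl}$ is never positive.

The new theorem is:

Theorem 3. For any node $i$ in Hierarchical Estimator suppose that:

$S$ is a competence set of node $i$.

As in Theorem 1, for each example in set $S$, $n_k$ is constant:

$$\forall k: (x^{(k)}, y^{(k)}) \in S, \quad \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)}) = n, \tag{23}$$

where $n > 0$.

Then

$$E(S, \bar g_i) = \frac{1}{n} \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{|S|}\, E(S_{P(i,j)}, \bar g_{P(i,j)}) + \tau + \frac{1}{n^2} \delta. \tag{24}$$

The first term is not surprising: it is the mean of the errors of the children weighted by the sizes of their competence sets within the main set. But there are two more. One is never positive ($\delta$, it is usually negative) and the other one corresponds to the quality of the competence function $C_i$ relative to a function that chooses the same estimators for each example, but weights them equally ($\tau$). This one can quite easily be kept at 0.

Proof. The proof is analogous to that of Theorem 1. First, we take the squared error definitions (including Eq. (8) and (13)):

$$E(S, \bar g_i) = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \left( \bar g_i(x^{(k)})_l - y^{(k)}_l \right)^2,$$

and apply the main equation for the response (2), in its compact form (5), to them:

$$E(S, \bar g_i) = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j=0}^{|i|} \bar g_{P(i,j)}(x^{(k)})_l\, C_i(j, x^{(k)}) - y^{(k)}_l \Bigg)^2.$$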

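As the sum of $C_i(j, x^{(k)})$ by definition (Eq. (3)) equals 1 for each example, we can expand the equation and then collapse it with the convenient notation of $\eta$ (see Eq. (11)):

$$\begin{aligned}
E(S, \bar g_i) &= \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j=0}^{|i|} \bar g_{P(i,j)}(x^{(k)})_l\, C_i(j, x^{(k)}) - \sum_{j=0}^{|i|} C_i(j, x^{(k)})\, y^{(k)}_l \Bigg)^2 \\
&= \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j=0}^{|i|} \left( \bar g_{P(i,j)}(x^{(k)})_l - y^{(k)}_l \right) C_i(j, x^{(k)}) \Bigg)^2 \\
&= \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, C_i(j, x^{(k)}) \Bigg)^2.
\end{aligned}$$

We can extract the term $\tau$ using its definition, Eq. (21):

$$\begin{aligned}
E(S, \bar g_i) &= \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, C_i(j, x^{(k)}) \Bigg)^2 \\
&= \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg[ \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, \frac{\hat C_i(j, x^{(k)})}{n} \Bigg)^2 + \tau_{kl} \Bigg] \\
&= \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, \frac{\hat C_i(j, x^{(k)})}{n} \Bigg)^2 + \tau.
\end{aligned} \tag{25}$$

Because the values of $\hat C_i$ are 0 or 1, raising them to any power greater than 0 does not change them:

$$E(S, \bar g_i) = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \frac{1}{n^2} \Bigg( \sum_{j=0}^{|i|} \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)})\, \hat C_i(j, x^{(k)}) \Bigg)^2 + \tau. \tag{26}$$

At this point we can apply the notation $\delta$ (22):

$$E(S, \bar g_i) = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \frac{1}{n^2} \Bigg[ \sum_{j=0}^{|i|} \left( \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)}) \right)^2 \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)})^2 + \delta_{kl} \Bigg] + \tau. \tag{27}$$

The fact that, according to the Cauchy-Schwarz inequality, $\delta_{kl}$ is never positive is quite important here. Assumption (23) requires that $\sum_{j=0}^{|i|} \hat C_i(j, x^{(k)})^2 = \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)}) = n$. So we can write:

$$E(S, \bar g_i) = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \frac{1}{n^2} \Bigg[ n \sum_{j=0}^{|i|} \left( \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)}) \right)^2 + \delta_{kl} \Bigg] + \tau, \tag{28}$$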

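then extract $\delta$, concurrently simplifying $\frac{1}{n^2} \cdot n$ to $\frac{1}{n}$:

$$E(S, \bar g_i) = \frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{l=1}^{r} \frac{1}{n} \sum_{j=0}^{|i|} \left( \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)}) \right)^2 + \tau + \frac{1}{n^2} \delta, \tag{29}$$

then reorder sums and factors:

$$E(S, \bar g_i) = \frac{1}{n} \sum_{j=0}^{|i|} \frac{1}{|S|} \sum_{k=1}^{|S|} \hat C_i(j, x^{(k)})^2 \sum_{l=1}^{r} \eta_{i,j}(k)_l^2 + \tau + \frac{1}{n^2} \delta.$$

In this form it is easy to apply the definition of squared error (9) and the observation about $\eta$ (12), remembering that raising $\hat C_i$ to a positive power does not change it:

$$E(S, \bar g_i) = \frac{1}{n} \sum_{j=0}^{|i|} \frac{1}{|S|} \sum_{k=1}^{|S|} \hat C_i(j, x^{(k)})\, \tilde e_{(i,j)}(k) + \tau + \frac{1}{n^2} \delta \tag{30}$$

and use the fact that $\hat C_i(j, \cdot)$ is a characteristic function of $S_{P(i,j)}$ to apply Eq. (13):

$$E(S, \bar g_i) = \frac{1}{n} \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{|S|}\, E(S_{P(i,j)}, \bar g_{P(i,j)}) + \tau + \frac{1}{n^2} \delta,$$

which ends the proof.

Analogously to the corresponding corollary in [3], we can show that $\delta = 0$ is in fact a rather special case, so in most cases it is negative.

Corollary 1 (Of $\delta$). $\delta_{kl}$ is zero only if all errors of the used approximators are the same:

$$\delta_{kl} = 0 \implies \forall j, o: \left( C_i(j, x^{(k)}) > 0 \wedge C_i(o, x^{(k)}) > 0 \right) \implies \eta_{i,o}(k)_l = \eta_{i,j}(k)_l.$$

Proof. Because the non-positiveness of $\delta_{kl}$ (22) comes from the Cauchy-Schwarz theorem, it could only be 0 if the two vectors to which it is applied were linearly dependent. In the case of two real, non-null vectors, one of them would have to be identical to the second one, just scaled by some number. This should apply to the vectors $\left( \hat C_i(j, x^{(k)}) \right)_{j=0}^{|i|}$ and $\left( \eta_{i,j}(k)_l\, \hat C_i(j, x^{(k)}) \right)_{j=0}^{|i|}$, so each $\eta_{i,j}(k)_l$ with $\hat C_i(j, x^{(k)}) = 1$ should have the same value, which is the thesis of the corollary.

Change of error in the node during adding a subtree. For assessing the plausibility of the Hierarchical Estimator, the following observation, based on Theorem 3, may be of use. If the assumptions of Theorem 3 hold, then:

$$E(S, \bar g_i) - E(S, g_i) = - \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right) + \tau + \frac{1}{n^2} \delta. \tag{31}$$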
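The decomposition of Theorem 3 can be checked numerically on made-up data. The sketch below is only an illustration under simplifying assumptions: outputs are scalar ($r = 1$), the function name `error_components` and the toy numbers are invented for this example, and the responses of the children are given directly as numbers instead of being computed by a tree. Column 0 plays the role of the virtual zeroth child, i.e. the estimator in the node.

```python
from typing import List, Tuple


def error_components(
    preds: List[List[float]],
    weights: List[List[float]],
    targets: List[float],
) -> Tuple[float, float, float, float, int]:
    """Return (E(S, node response), children term, tau, delta, n) for r = 1.

    preds[k][j]   response of child j on example k (j = 0: own estimator)
    weights[k][j] competence value C_i(j, x^(k))
    targets[k]    true value y^(k)
    """
    size = len(targets)
    used = [[1 if w > 0 else 0 for w in row] for row in weights]
    n = sum(used[0])
    if n == 0 or any(sum(row) != n for row in used):
        raise ValueError("the number of used children must be a positive constant")
    if any(abs(sum(row) - 1.0) > 1e-9 for row in weights):
        raise ValueError("weights of each example must sum to 1")

    node_error = children_term = tau = delta = 0.0
    for k in range(size):
        eta = [p - targets[k] for p in preds[k]]
        weighted = sum(e * w for e, w in zip(eta, weights[k]))
        plain = sum(e * u for e, u in zip(eta, used[k]))
        squares = sum(u * e * e for e, u in zip(eta, used[k]))
        node_error += weighted ** 2
        tau += weighted ** 2 - (plain / n) ** 2   # Eq. (21)
        delta += plain ** 2 - squares * n         # Eq. (22), sum of indicators = n
        children_term += squares / n              # first term of Eq. (24)
    return (node_error / size, children_term / size, tau / size, delta / size, n)


# Toy data: three examples, the node estimator plus two children, n = 2.
preds = [[1.0, 2.0, 4.0], [0.0, 3.0, 1.0], [2.0, 2.0, 5.0]]
weights = [[0.5, 0.5, 0.0], [0.0, 0.25, 0.75], [0.2, 0.0, 0.8]]
targets = [1.5, 2.0, 3.0]

error, children, tau, delta, n = error_components(preds, weights, targets)
gap = error - (children + tau + delta / n ** 2)  # should vanish up to rounding
```

Because all three components are computed from quantities available during training and validation, the same kind of bookkeeping could be used to see which component dominates the error of a given node.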

That equation may be used to describe the difference between the error in the node with ($E(S, \bar g_i)$) and without ($E(S, g_i)$) the subtree rooted in it, in a manner similar to the thesis of Theorem 3. One of the components of that difference ($\frac{1}{n^2} \delta$) is never positive (see Eq. (22)), and is negative as soon as the children's errors differ on some examples, as indicated by Corollary 1. Another one ($\tau$, the relative quality of the competence function, Eq. (21)) can be kept at 0 if needed. If the whole difference is negative, the existence of the subtree decreases the solution error on the given set. Of course, for this to happen, the remaining component, a mean of differences between the mean errors of the estimator in the node and the mean errors of the children of the node (with their subtrees, if they have them) on the children's competence sets, should not cause an increase greater than the decrease caused by $\tau + \frac{1}{n^2} \delta$. On the training set, this increase is guaranteed to be non-positive (see pt. 3 of the learning algorithm in Sect. 1.2.3). Keeping it low on unknown examples is one of the main concerns when creating competence functions and dividing the training set [3].

The proof begins with the addition of the term $E(S, g_i)$ to both sides:

$$E(S, \bar g_i) = E(S, g_i) - \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right) + \tau + \frac{1}{n^2} \delta.$$

Then, we can transform the left side according to Theorem 3 (Eq. (24)), achieving:

$$\frac{1}{n} \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{|S|}\, E(S_{P(i,j)}, \bar g_{P(i,j)}) + \tau + \frac{1}{n^2} \delta = E(S, g_i) - \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right) + \tau + \frac{1}{n^2} \delta. \tag{32}$$

Next we can subtract the term $\tau + \frac{1}{n^2} \delta$ from both sides and rewrite the remaining sum on the left side:

$$\begin{aligned}
\frac{1}{n} \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{|S|}\, E(S_{P(i,j)}, \bar g_{P(i,j)}) &= \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right) \right) \\
&= \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|}\, E(S_{P(i,j)}, g_i) - \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right).
\end{aligned}$$

We have just extracted another term that is on the right side of (32), $- \sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|} \left( E(S_{P(i,j)}, g_i) - E(S_{P(i,j)}, \bar g_{P(i,j)}) \right)$, so we can cancel it out of the equation, which then takes the form:

$$\sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|}\, E(S_{P(i,j)}, g_i) = E(S, g_i).$$

Again, we will transform the left side. As $\hat C_i(j, \cdot)$ is a characteristic function of $S_{P(i,j)}$, we can expand the mean squared error using definitions (10) and (13), and then rearrange the sums again:

$$\sum_{j=0}^{|i|} \frac{|S_{P(i,j)}|}{n |S|}\, E(S_{P(i,j)}, g_i) = \frac{1}{n |S|} \sum_{j=0}^{|i|} \sum_{k=1}^{|S|} e_{(i)}(k)\, \hat C_i(j, x^{(k)}) = \frac{1}{n |S|} \sum_{k=1}^{|S|} e_{(i)}(k) \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)}).$$

Because assumption (23), $\sum_{j=0}^{|i|} \hat C_i(j, x^{(k)}) = n$, still holds, we may use the definition of mean squared error (13) and get

$$\frac{1}{n |S|} \sum_{k=1}^{|S|} e_{(i)}(k) \sum_{j=0}^{|i|} \hat C_i(j, x^{(k)}) = \frac{1}{|S|} \sum_{k=1}^{|S|} e_{(i)}(k) = E(S, g_i).$$

So Equation (31) is true.

The change of error during tree growing. The last observation will be described informally here, but an analogous corollary with a more formal proof can be found in [3] (Corollaries 5 and 6). It concerns the change of the error of the whole tree when a new subtree is added for a given node. Obviously, in such a case $E(S, \bar g_i)$ changes from $E(S, g_i)$ to a different value, as described by Eq. (31). This causes a change in one of the $E(S_{P(u,j)}, \bar g_{P(u,j)})$ of its parent $u$, proportionally to the size of the competence set. The same thing happens one level up, and the change is propagated to the root and the whole estimator.

3. Discussion

The theorem proved in this article specifically gives the components of the mean squared error for the Hierarchical Estimator:

1. The error of the estimators in nodes, $E(S, g_i)$, both in leaves (where they are $E(S, \bar g_i)$) and in internal nodes.

2. The relative quality of the competence function, $\tau$. This quality is measured with respect to the reference function that selects the same children as the assessed one, but weights them equally (and has, by definition, $\tau = 0$).

3. $\delta$, which is never positive and is negative as soon as the children's results differ on an example, so it usually reduces the error.

The theorem requires the number of used estimators in a given node to be constant. This can easily be forced by always using the $n$ estimators that are considered best and possibly giving some of them very low weights. However, this can influence the term $\tau$, so developing a theorem lifting the requirement seems to be urgent. A possible way to do that may be to reuse the technique from Theorem 2 presented in [3].

Perhaps the most important conclusion that could be drawn from the theoretical considerations above, especially Theorem 3 and Corollary 1, is that the mean squared error of the whole will be lower than the weighted mean of the errors of the involved child nodes and the estimator in the node if more than one of them is used and they have non-identical errors. Though a similar conclusion may be drawn from the theorems in [3], here it is described a bit more precisely. This decrease in error can be reinforced if the competence function is able to assign greater weights to the children that give lower errors, but this is not necessary.

An improvement over the theoretical basis from [3] allows us to draw the following conclusion, stronger than before. According to the observation from the end of the previous section and other theorems, adding a subtree to a node in an existing tree can lower the mean squared error of the whole Hierarchical Estimator even if we are not able to assure that all children nodes have lower errors on their competence sets than their parent, or that the competence function offers a gain over the reference function ($\tau$ close to 0). It is just enough that the loss on them does not exceed the gain from $\delta$. The theorems proved in [3] did not allow this to be stated so clearly.
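A minimal worked example of the gain from $\delta$, with arbitrarily chosen numbers: on one example and one coordinate, two estimators are used ($n = 2$) with equal weights, so that $\tau_{kl} = 0$, and their differences from the target are 1 and 3.

```latex
\begin{align*}
\delta_{kl} &= (1 + 3)^2 - (1^2 + 3^2)(1^2 + 1^2) = 16 - 20 = -4, \\
\text{mean of the squared errors of the children} &= \tfrac{1}{2}(1^2 + 3^2) = 5, \\
\text{squared error of the combined response} &= \left(\tfrac{1 + 3}{2}\right)^2 = 4
  = 5 + 0 + \tfrac{1}{2^2} \cdot (-4).
\end{align*}
```

The combined response is better than the mean of the children's squared errors by exactly $\frac{1}{n^2} |\delta_{kl}| = 1$, even though the competence function does nothing more than select the two estimators. If both differences were equal, $\delta_{kl}$ would be 0 and the gain would disappear, in line with Corollary 1.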
Such a conclusion is significant because it is generally not easy to guarantee that a child node has a lower error on examples that were not available during training, mostly because it is a difficult task for a competence function to assign the examples to the right estimators, i.e. the ones that would make low errors on them. Failing to do that increases the errors of the approximators that actually received the example. Another problem, though maybe easier to avoid, is that in a given node there may not be any function estimators (neither in the children nor the approximator in the node) that would perform well on a given example, because of e.g. generalization problems.

Based on these conclusions, one may try to formulate practical guidelines for constructing detailed solutions, in a manner similar to [3]. For example:

1. It would be good if the competence area represented a truly smaller and somewhat separated problem, i.e. if the child was able to achieve greater accuracy without a significant threat of overfitting, increased learning time or the competence function assigning wrong examples.

2. An example should be evaluated by more than one child (possibly including the virtual one, the estimator in the given node) so that $\delta$ could be negative.

3. It is better if the children have different errors from each other on the given example, rather than similar ones, to make $\delta$ even lower.

4. Choosing the right children by the competence function seems to be a more important task than assigning them exact weights, because the solution can also work well if $\tau$ is 0, i.e. all chosen children are weighted equally. Still, a negative average $\tau$ can decrease the error.

Unsurprisingly, those guidelines are very similar to those from [3]. Some of them are approximated in [3] as a requirement that examples within one competence area should be similar (guideline 1) while training sets should be rather dissimilar (2 and 3), and further considerations about what similarity measure to use follow.

An important trait of all error components found in the theorem described in this article is that they can be directly measured during training and validation, so it is possible to measure where the error comes from, at least to some degree. Refinements of the solution could even automatically use such measures to improve the solution's performance.

4. References

[1] Bishop C.; Pattern recognition and machine learning, Springer, Berlin, Heidelberg, New York, 2006.
[2] Hand D., Mannila H., Smyth P.; Principles of Data Mining, MIT Press, 2001.
[3] Brodowski S., Podolak I. T.; Hierarchical Estimator, Expert Systems with Applications, 38(10), 2011.
[4] Hastie T., Tibshirani R., Friedman J.; The Elements of Statistical Learning, Springer, Berlin, Heidelberg, New York, 2001.
[5] Russell S. J., Norvig P.; Artificial Intelligence: A Modern Approach, Pearson Education, 2003.
[6] Cristianini N., Shawe-Taylor J.; Support Vector Machines and other kernel based learning methods, Cambridge University Press, 2000.
[7] Scholkopf B., Smola A.; Learning with kernels, MIT Press, Cambridge, 2002.
[8] Schapire R. E.; The Strength of Weak Learnability, Machine Learning, 5(2), 1990.
[9] Freund Y., Schapire R.; A decision theoretic generalization of online learning and an application to boosting, Journal of Computer and System Sciences, 55, 1997.
[10] Jordan M. I., Jacobs R. A.; Hierarchical mixtures of experts and the EM algorithm, Neural Computation, 1994.
[11] Saito K., Nakano R.; A constructive learning algorithm for an HME, IEEE International Conference on Neural Networks, 3, 1996.
[12] Quinlan J. R.; Learning with continuous classes, Proceedings of the 5-th Australian Conference on Artificial Intelligence, 1992.
[13] Podolak I. T.; Hierarchical classifier with overlapping class groups, Expert Systems with Applications, 34(1), 2008.
[14] Pal N., Bezdek J.; On cluster validity for the fuzzy c-means model, IEEE Transactions on Fuzzy Systems, 3(3), 1995.
[15] Brodowski S.; A Validity Criterion for Fuzzy Clustering, in: Jedrzejowicz P., Nguyen N. T., Hoang K. (ed.), Computational Collective Intelligence ICCCI 2011, Springer, Berlin, Heidelberg, 2011.
[16] Bielecki A., Bielecka M., Chmielowiec A.; Input Signals Normalization in Kohonen Neural Networks, in: Rutkowski L., Tadeusiewicz R., Zadeh L., Zurada J. (ed.), Artificial Intelligence and Soft Computing ICAISC 2008, Springer, Berlin, Heidelberg, 2008.
[17] Barszcz T., Bielecka M., Bielecki A., Wójcik M.; Wind turbines states classification by a fuzzy-ART neural network with a stereographic projection as a signal normalization, Proceedings of the 10th international conference on Adaptive and natural computing algorithms, 2011.
[18] Brodowski S.; Adaptujący się hierarchiczny aproksymator, Master's thesis, Jagiellonian University, 2007.

Received May 9, 00
