> PNN05-P762 < 1

Reduced Pattern Training Based on Task Decomposition Using Pattern Distributor

Sheng-Uei Guan, Chunyu Bao, and TseNgee Neo

Abstract: Task Decomposition with Pattern Distributor (PD) is a new task decomposition method for multilayered feedforward neural networks. A pattern distributor network is proposed that implements this new task decomposition method. We propose a theoretical model to analyze the performance of the pattern distributor network. A method named Reduced Pattern Training is also introduced, aiming to improve the performance of pattern distribution. Our analysis and the experimental results show that reduced pattern training improves the performance of the pattern distributor network significantly. The distributor module's classification accuracy dominates the whole network's performance. Two combination methods, namely Cross-talk based combination and Genetic Algorithm based combination, are presented to find a suitable grouping for the distributor module. Experimental results show that this new method can reduce training time and improve network generalization accuracy when compared to a conventional method such as constructive backpropagation or a task decomposition method such as Output Parallelism.

Index Terms: Cross-talk Based Combination, Full Pattern Training, Genetic Algorithm Based Combination, Pattern Distributor, Reduced Pattern Training, Task Decomposition

I. INTRODUCTION

MULTILAYERED feedforward neural networks have been widely used in solving classification problems. However, they still exhibit some drawbacks when applied to large-scale real-world problems. One common drawback is that large networks tend to introduce high internal interference because of the strong coupling among the hidden-layer weights [1]. The influences from two or more output units could cause the hidden-layer weights to compromise to non-optimal values due to the interference in their weight-updating directions during the weight-updating process [2]. Various task decomposition methods have been proposed to overcome this drawback [2-10, 18-21, 23-26].
Manuscript received November 23, 2005. Sheng-Uei Guan is with the School of Engineering and Design, Brunel University, Uxbridge, Middlesex, UB8 3PH, UK (e-mail: steven.guan@brunel.ac.uk). Chunyu Bao and TseNgee Neo are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119260, Singapore (e-mail: chunyubao@gmail.com).

Instead of using a single, large feedforward network (classic network), task decomposition methods divide a problem into a set of smaller and simpler sub-problems based on divide-and-conquer. The results obtained from solving these sub-problems are integrated together and the solution for the original problem is obtained. Anand et al. proposed a method that splits a K-class problem into K two-class sub-problems [3]. Each sub-network is trained to learn one sub-problem only. Therefore, each sub-network is used to discriminate one class of patterns from patterns belonging to the remaining classes, thereby resulting in K modules in the overall structure. Another method divides the K-class problem into K(K-1)/2 two-class sub-problems [4]. A module is designated to learn each sub-problem, while training patterns belonging to the other K-2 classes are ignored. The final overall solution is obtained by integrating all the trained modules into a min-max modular network. A powerful extension to the above class decomposition method, Output Parallelism, is proposed by Guan [2, 5, 6, 7, 8]. Using output parallelism, a problem can be divided into several sub-problems as chosen, each of which is composed of the whole input vector and a fraction of the output vector. Each module (for one sub-problem) is responsible for producing a fraction of the output vector of the original problem. These modules can be grown and trained in parallel. Instead of decomposing the problem with a high-dimensional output space into several sub-problems, each with a low-dimensional output space, Lu decomposes the problem into several smaller-size sub-problems [10].
Patterns are classified by a rough sieve module (non-modular network) at the beginning, and those patterns that are not classified successfully are presented to another sieve module. This process continues until all the patterns are classified correctly. The sieve modules are added to the network adaptively as training progresses. Although these methods are efficient, there are still some drawbacks associated with them. Firstly, the methods proposed in [3] and [4] split the problem into a set of two-class sub-problems. If the original K-class problem is complex (K is large), a large number of modules will be needed to learn the sub-problems, thus resulting in excessive computational cost. Secondly, although the dimension (number of output classes) of each sub-problem in [2] and [3] is smaller than that of the original problem, the size of each sub-problem's training pattern set is still as large as that of the original problem. Therefore, each module will have a long training time and ineffective learning, especially when the original problem is large with many training patterns.
Fig. 1. A typical Pattern Distributor network. The input pattern is presented to the distributor module first, which dispatches it to one of the modules; each module covers a fraction of the K classes.

Lastly, the method proposed in [10] only reduces the size of the problem but not the dimension of the problem. The internal interference (that exists within each module due to the coupling of output units) is not reduced. In this paper, we propose a new task decomposition method called Task Decomposition with Pattern Distributor (PD) to overcome the drawbacks mentioned above. A special module called the distributor module is introduced in order to improve the performance of the whole network. The distributor module and the other modules in the PD network are arranged in a hierarchical structure. The distributor module has a higher position as compared to the other modules in the network. This means an unseen input pattern will be recognized by the distributor module first. The structure of a typical PD network is shown in Figure 1. Each output of the distributor module consists of a fraction of the overall output classes in the original problem. The PD method could shorten the training time and improve the generalization accuracy of a network compared with ordinary task decomposition methods. In this paper, the PD method will be discussed in detail. In Section 2, a theoretical model is presented to compare the performance of a PD network with a typical task decomposition network, the Output Parallelism network. In Section 3, we introduce the Reduced Pattern Training method to improve the PD network's performance. Because of the importance of the distributor module, we present in Section 4 two combination methods, Cross-talk based combination and GA based combination, to find a good grouping for the distributor module. In Section 5, the experimental results are shown and analyzed. Conclusions are presented in Section 6.

II.
A THEORETICAL MODEL FOR THE PATTERN DISTRIBUTOR NETWORK

There are two types of modules in a Pattern Distributor network: the distributor module and non-distributor modules (for simplicity, non-distributor modules are just called modules). Normally, a PD network consists of one distributor module and several non-distributor modules.

Fig. 2. The PD network used to solve a K-class problem.

Fig. 3. The OP network used for the same K-class problem.

Class decomposition is often used in solving classification problems. Compared with ordinary methods, in which only a single neural network is constructed to solve the problem, class decomposition divides the problem into several sub-problems and trains a neural network module for each sub-problem. Then the results from these modules are integrated to obtain the solution for the original problem. Output Parallelism (OP) is a typical class decomposition method. Here we present a model to show that the PD method has better performance than the OP method when the recognition rate of the distributor module is guaranteed.

Consider a classification problem with K output classes. To solve the problem, a PD network with one distributor module and r non-distributor modules is constructed. See Figure 2 for details. There are r outputs in the distributor module and each non-distributor module is connected to an output of the distributor module. Each output of the distributor module consists of a combination of several classes. For an unknown pattern, the distributor module recognizes it and dispatches it to only one of the outputs. Then the connected non-distributor module continues the classification process to specify which class the pattern belongs to. In other words, a non-distributor module needs to recognize the pattern among several classes. Assume Module j is a non-distributor module that needs to recognize K^{(j)} classes. Different non-distributor modules are
> PNN05-P762 < 3 assumed to have no ovelapping classes, we have the following: K = K () = ( ) Figue 3 shows the coesponding OP netwok used to solve the K-class poblem above. Fo the convenience of compaison, we assume that the OP netwok has the same output gouping as the PD netwok. Thee ae also modules in the OP netwok and Module needs to ecognize K () classes among all the pattens. When an unknown input patten is pesented to the OP netwok, it is pocessed by each module (Module to Module ), and the final esult is obtained by integating the esults fom Module to Module. In the PD netwok, a non-distibuto module only ecognizes the pattens dispatched to it by the distibuto module. These pattens mostly likely belong to one of the classes coveed by that module. Of couse, the distibuto module may make wong decisions and send wong pattens to that module. The OP netwok is diffeent. Each module needs to ecognize all the pattens. In othe wods, Module in the OP netwok needs to diffeentiate the pattens belonging to it fom those pattens which do not. Now we denote the pobability of eo incued by Module pocessing the pattens that belong to one of the classes of Module i by p i. If we do not implement winne-take-all abitation, a patten can be egaded as wongly classified if one o moe modules give wong decisions. When a test patten belonging to one of the classes of module entes the netwok, the pobability of eo in the OP netwok can be witten in the following fom: ( OP) = i + 2 i + + i i= i= i= i i 2 i p p ( p ) p ( p ) p ( p ) + p p ( p ) + p p ( p ) + + p p ( p ) + 2 i 3 i ( ) i i= i= i= i,2 i,3 i, + p p p p 2 ( ) The fist tems epesent the pobability of a test patten being classified wongly by one module. The following ( ) 2 tems epesent the pobability of the test patten being classified wongly by two modules, and so on. 
Equation (2) can be rewritten as:

p_j^{(OP)} = (p_{1j} + p_{2j} + \cdots + p_{rj}) - (p_{1j}p_{2j} + p_{1j}p_{3j} + \cdots + p_{(r-1)j}p_{rj}) + (p_{1j}p_{2j}p_{3j} + \cdots) - \cdots + (-1)^{r-1} p_{1j}p_{2j}\cdots p_{rj}    (3)

Here r is the number of modules, and r is normally not a large number, so the number of terms in the above equation is not large. Each p_{ij} is a small positive real number; in other words, a product p_{ij} p_{kj} is much smaller than p_{ij}. We can therefore ignore the terms containing products of two or more p's. Thus,

p_j^{(OP)} \approx p_{1j} + p_{2j} + \cdots + p_{rj}    (4)

The number of test patterns classified wrongly by the OP network is:

N^{(OP)} = \sum_{j=1}^{r} N_j p_j^{(OP)} = \sum_{j=1}^{r} N_j (p_{1j} + p_{2j} + \cdots + p_{rj}) = \sum_{j=1}^{r} \sum_{k=1}^{r} N_j p_{kj}    (5)

where N_j is the number of patterns belonging to the classes of Module j. It can also be written as:

N^{(OP)} = \sum_{k=1}^{r} ( N_1 p_{k1} + N_2 p_{k2} + \cdots + N_k p_{kk} + \cdots + N_r p_{kr} )    (6)

Now we define p_{k*} as the probability of error when Module k processes the patterns not belonging to the classes of Module k. Equation (6) can be revised as:

N^{(OP)} = \sum_{k=1}^{r} [ N_k p_{kk} + (N - N_k) p_{k*} ]    (7)

where p_{kk} is the probability of error when Module k processes the patterns belonging to it, N is the total number of test patterns, and N_k is the number of test patterns belonging to the classes of Module k.

It should be mentioned that in the above OP network each module can be trained separately using all the training patterns; for the PD network, we can also train these modules separately. If we use all the training patterns to train these modules, then the weights and hidden units of the non-distributor modules will be the same as those of the corresponding modules in the OP network. After the training of the PD network is completed, the distributor module will be the first to classify any unseen input pattern. The corresponding output unit in the pattern distributor will have the largest output value among all the output units, and only the corresponding module will be activated. The input pattern is then presented to this module only, and this module completes the classification process. Only the distributor module and the corresponding module are used in the classification process.
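The two-stage classification procedure just described can be sketched as follows; the callables and the group layout here are hypothetical stand-ins for trained networks, not the paper's implementation.

```python
import numpy as np

def pd_classify(x, distributor, modules, groups):
    """Two-stage PD inference: the distributor output with the largest value
    selects one group of classes; only the connected module is activated, and
    its largest output selects the class within that group."""
    g = int(np.argmax(distributor(x)))   # distributor picks a group
    j = int(np.argmax(modules[g](x)))    # one module refines within the group
    return groups[g][j]

# Toy stand-ins for trained networks (hypothetical fixed outputs):
distributor = lambda x: [0.2, 0.8]            # picks the second group
modules = [lambda x: [1.0, 0.0, 0.0],
           lambda x: [0.1, 0.9, 0.0]]
groups = [[1, 3, 5], [2, 4, 6]]               # class labels per group
print(pd_classify(None, distributor, modules, groups))  # 4
```

Note that only two of the networks run per pattern, which is the source of the PD network's test-time economy compared with OP, where every module processes every pattern.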
Let p_0 be the probability of error of the distributor module. Then the number of test patterns classified wrongly by the distributor module is

M_0 = N p_0    (8)

Assume the distributor module classifies patterns wrongly in a uniform manner; in other words, the number of patterns wrongly dispatched by the distributor module to each non-distributor module is proportional to the number of patterns entering that module. The number of correct patterns that enter Module j is N_j (1 - p_0). Then the number of patterns classified wrongly by Module j is:

M_j = N_j (1 - p_0) p_{jj}    (9)

Thus, the number of patterns classified wrongly by the PD network can be expressed as:

M^{(PD)} = \sum_{j=0}^{r} M_j = N p_0 + (1 - p_0) \sum_{j=1}^{r} N_j p_{jj}    (10)

Comparing the OP network with the PD network, we have

N^{(OP)} - M^{(PD)} = \sum_{j=1}^{r} [ N_j p_{jj} + (N - N_j) p_{j*} ] - N p_0 - (1 - p_0) \sum_{j=1}^{r} N_j p_{jj} = \sum_{j=1}^{r} (N - N_j) p_{j*} - N p_0 + p_0 \sum_{j=1}^{r} N_j p_{jj}    (11)

Similar to the analysis made earlier, the product p_0 p_{jj} is much smaller than p_{j*} and p_0, so

N^{(OP)} - M^{(PD)} \approx \sum_{j=1}^{r} (N - N_j) p_{j*} - N p_0    (12)

Now we have derived the condition under which the PD network can achieve better classification accuracy than the OP network:

\sum_{j=1}^{r} (N - N_j) p_{j*} > N p_0    (13)

We know that each module needs to process all the test patterns in an OP network, while in a PD network each non-distributor module only needs to process a subset of the test patterns. Intuitively, if the number of patterns wrongly classified by the distributor module in the PD network is smaller than the total number of patterns wrongly classified by the modules of the corresponding OP network when processing patterns not belonging to them, the PD network will perform better.

Discussions:

1. Class decomposition can still be applied to the modules of the OP network and the PD network, so that these modules are further decomposed into sub-modules. If each sub-module is used to recognize one class from all the patterns, then there will be K sub-modules in the whole OP network. Of course, these sub-modules may belong to different modules. Figure 4(a) shows an example of a 6-class OP network. There are two modules that are further partitioned into 6 sub-modules. Figure 4(b) shows a fully decomposed OP network for this 6-class problem. In both OP networks, all the training patterns are used to train these sub-modules, so the sub-modules in Figure 4(a) are the same as their counterparts in Figure 4(b). In Figure 4(a), the sub-modules are grouped into two modules. For an unknown pattern, the outputs from Sub-modules 1, 2 and 3 are considered together to give the result of Module 1, and similarly for Module 2. Then the results from Module 1 and Module 2 are considered together to give the final output. In the OP network of Figure 4(b), the outputs from all the sub-modules are considered together to give the final output. In fact, there is little difference between the OP networks in Figures 4(a) and 4(b).
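Equations (7), (10) and condition (13) are easy to experiment with numerically. The sketch below is our illustration of the model, not the authors' code.

```python
def op_error_count(N_k, p_kk, p_k_star):
    """Equation (7): expected number of test patterns misclassified by the OP
    network. N_k[i]: patterns of module i's classes; p_kk[i]: module i's error
    rate on its own patterns; p_k_star[i]: its error rate on the rest."""
    N = sum(N_k)
    return sum(n * own + (N - n) * other
               for n, own, other in zip(N_k, p_kk, p_k_star))

def pd_error_count(N_k, p_kk, p0):
    """Equation (10): N*p0 errors at the distributor, plus the correctly
    dispatched patterns misclassified by their own modules."""
    N = sum(N_k)
    return N * p0 + (1 - p0) * sum(n * own for n, own in zip(N_k, p_kk))

def pd_beats_op(N_k, p_k_star, p0):
    """Condition (13): sum_j (N - N_j) p_j* > N p0."""
    N = sum(N_k)
    return sum((N - n) * other for n, other in zip(N_k, p_k_star)) > N * p0
```

For instance, with two modules holding 67 and 94 test patterns and hypothetical error rates of 8% on their own patterns, 4% and 2% on foreign patterns, and a 2.4% distributor, `op_error_count` gives about 18.0 expected errors, `pd_error_count` about 16.4, and `pd_beats_op` returns True, matching the intuition stated above.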
Note that the non-distributor modules in the PD network are the same as their counterparts in the OP network. Thus, by decomposing the modules into sub-modules, we can compare the performance of the PD network with that of the fully decomposed OP network. In most of our experiments, we used such networks.

2. In Equation (4), we have ignored the situation in which two or more modules make wrong decisions at the same time, because this situation appears much less frequently than the situation in which only one module makes a wrong decision. If we do consider that situation, p_j^{(OP)} will be a little smaller than p_{1j} + p_{2j} + \cdots + p_{rj}.

Fig. 4. Two OP networks for a 6-class problem: (a) six sub-modules grouped into two modules; (b) a fully decomposed network of six sub-modules.

3. In the above model, we do not consider the implementation of winner-take-all for the OP network. In reality, winner-take-all is used for selecting a unit among several candidate units to produce the final output. The purpose of a conventional winner-take-all network is to select the unit with the highest activation strength from a set of candidates. Using winner-take-all, the network may still choose the correct output even if some modules make wrong decisions. For example, consider a test pattern that belongs to Class A in Module 1. When the pattern enters Module 1 of the OP network, Module 1 produces the correct answer, Class A. However, when the pattern enters Module 2 of the OP network, Module 2 gives an incorrect answer and decides that it belongs to Class B. If the output corresponding to Class A is larger than that of Class B, the OP network can still give a correct decision. Using winner-take-all will slightly reduce the final classification error of the OP
network compared with not using it.

III. MOTIVATION FOR REDUCED PATTERN TRAINING

In a PD network, an unseen pattern is first classified by the distributor module to decide which module will continue to process it. Then the corresponding module is activated. Thus only two modules are used to process that pattern. Now consider all the test patterns. Each non-distributor module only processes a subset of the test patterns. In other words, each non-distributor module only needs to recognize the patterns belonging to it, provided the distributor module classifies all the test patterns correctly. Also, if the distributor module classifies some patterns wrongly, the mistake cannot be corrected by the later modules. This motivates us to train a non-distributor module using only the patterns belonging to it. Such a method is called Reduced Pattern Training (RPT). Similarly, the method of using the whole training set to train each non-distributor module is called Full Pattern Training (FPT).

When we train Module j using FPT, the module will carry information about instances that do not belong to its own classes. Such information does not contribute to the classification accuracy of Module j, so it is useless. Also, training time is reduced when training with RPT compared with FPT. Moreover, training Module j together with unnecessary patterns may reduce the ability of Module j to classify the patterns belonging to it correctly. There are two aspects. Firstly, the objective of training is to let each module reach its best classification accuracy when processing the patterns dispatched to it. Using FPT, a module may attain its best performance when it needs to process all the test instances; however, it may not attain its best performance when processing only a subset of the test instances. Secondly, for patterns not belonging to Module j, the module is trained with target outputs of 0 during the learning process. (In our experiments, if a pattern belongs to some class, the corresponding output is 1; otherwise, it is 0.)
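Reduced Pattern Training amounts to filtering each module's training set down to its own classes. A minimal sketch (function and variable names are ours):

```python
def reduced_training_set(patterns, labels, module_classes):
    """RPT: keep only the patterns whose class belongs to this module.
    Full Pattern Training would instead hand every pattern to every module."""
    return [(x, y) for x, y in zip(patterns, labels) if y in module_classes]

# A module covering classes {1, 3, 5} trains on a fraction of the data:
patterns = [[0.1], [0.4], [0.7], [0.9]]
labels = [1, 2, 3, 6]
print(reduced_training_set(patterns, labels, {1, 3, 5}))
# [([0.1], 1), ([0.7], 3)]
```

The reduced set both shortens training and removes the flood of 0-labeled targets discussed next.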
With the introduction of those patterns not belonging to Module j, there are many more patterns with an output label of 0 than patterns with an output label of 1 in the learning process. So the patterns with an output label of 0 will be more influential in updating the weights and therefore in computing the training error function. In contrast, the patterns with an output label of 1 will become less influential in the decision of weight updates. After the training process is over, it is likely that the trained network will mislabel some test patterns, in particular those patterns with an output label of 1. From the above observations, we conclude that those unnecessary patterns are harmful to module training. Our experimental results confirmed that RPT is crucial for a PD network to obtain good performance.

Reduced pattern training might not be applicable to OP, because the modules in an OP network operate in parallel and each module must deal with all the test patterns in the test process. Training these modules using reduced patterns may lead to information loss, and it may lead to poor accuracy when the test patterns are presented.

IV. COMBINATION OF CLASSES IN THE DISTRIBUTOR MODULE

From the analysis in Section 2, it can be seen that the performance of a PD network depends greatly on the accuracy of the distributor module. How to group the classes and combine them becomes a key issue in designing a PD network. We define two concepts: combination and combination set. If some classes are grouped together, we call them a combination. The combination of class A, class B and class C is denoted as {A, B, C}. Once some classes are combined, they form a new class. A combination set is an aggregation of combinations in which each class of the original problem appears exactly once. For example, in a 6-class problem, {{1,2,3},{4,5},{6}} is a combination set. Here we present two methods to find an appropriate combination set for the distributor module.

A. Cross-talk Based Combination

The basic idea is to find classes which are close in the feature space and combine them together.
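One concrete way to realize this idea is to project the data onto a leading Fisher discriminant direction and tabulate pairwise distances between the projected class centers. The sketch below is our own illustration of that scheme, not the paper's implementation.

```python
import numpy as np

def crosstalk_table(X, y):
    """Project onto the leading Fisher direction, then tabulate pairwise
    distances between the projected class centers (a Cross-talk table)."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # leading eigenvector of pinv(Sw) @ Sb gives the 1-D Fisher direction
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    w = np.real(evecs[:, np.argmax(np.real(evals))])
    centers = {c: float(X[y == c].mean(axis=0) @ w) for c in classes}
    return {(a, b): abs(centers[a] - centers[b])
            for i, a in enumerate(classes) for b in classes[i + 1:]}
```

Classes whose entries in the returned table are small are close along the most discriminative direction, making them natural candidates to merge into one combination.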
We first project the d-dimensional input space (d is the number of inputs) to a one-dimensional space using Fisher's linear discriminant (FLD) method [12]. The distances between the centers of the different classes are calculated. Then these distances are arranged to form a table, which is called the Cross-talk table. If the distance between two classes in the Cross-talk table is relatively small, then the two classes are likely to be close in the feature space. Thus, we choose and combine those classes that have relatively small distances from each other in the Cross-talk table.

B. Genetic Algorithm Based Combination

The basic idea of this method is to find an optimal or near-optimal combination set through evolution. First, we define our chromosome encoding. A binary string of specific length is often used to encode a chromosome in canonical genetic algorithms, but it is not suitable here. Thus, we define chromosomes according to the following principles. A chromosome consists of a sequence of combination numbers, wherein each class is encoded with its combination number. The length of a chromosome is equal to the number of classes. Assume chromosome encoding always starts with the smallest combination number and increases in order of first appearance. For example, 122333 is a chromosome for a 6-class problem. The 1 in the first place means class 1 belongs to combination 1. Similarly, the number 3 in the fourth place means class 4 belongs to combination 3. The corresponding combination set of this chromosome is {{1}, {2,3}, {4,5,6}}. There is a need for normalization, however. Let us look at another example, chromosome 233111. It is obvious that chromosomes 233111 and 122333 represent the same combination set (though the numbering differs). Therefore, chromosome 233111 can be normalized as 122333. For convenience, we convert all the chromosomes into a form like 122333. This process is called standardizing the
> PNN05-P762 < 6 chomosomes. The pocedue of standadizing a chomosome is shown in the Appendix. Then we ceate an initial population of chomosomes. Afte geneating the initial population, each chomosome is evaluated and assigned a fitness value. Hee we use a simple neual netwok fo the evaluation and use the classification eo f of the validation data set to calculate the fitness: f fitness = 2 (4) f ave whee f ave is the aveage of classification eos based on the validation data set fo all the chomosomes in the population. f is also called evaluation value. If f 2 is smalle than 0, fave fitness = 0. The execution of ou genetic algoithm can be viewed as a two-stage pocess. It stats with the cuent population. Then selection is applied to the cuent population to geneate an intemediate population. Afte that, mutation and cossove ae applied to the intemediate population to ceate the next population. We use stochastic univesal sampling to fom the intemediate population []. Assume that the population is laid out in andom ode as in a pie gaph in which each individual is assigned space on the pie gaph in popotion to fitness. Next an oute oulette wheel is placed aound the pie with N equally spaced pointes (N is the numbe of the population). A single spin of the oulette wheel will now simultaneously pick all N membes of the intemediate population. Afte the constuction of the intemediate population, cossove and mutation ae used to geneate the next population. Cossove is applied to andomly paied chomosomes with a pobability p c. Conside two chomosomes: 2233 and 2223. The andom cossove point is chosen, fo example, afte the 4 th place. Then the numbes in the 5 th and 6 th places ae exchanged and new chomosomes ae fomed. Hee the new chomosomes ae 2223 and 2233. Afte cossove, mutation is applied to andom chomosomes with a pobability p m. Afte a chomosome is selected fo mutation, a place is andomly selected fo mutation and the numbe in that place is andomly chosen. 
After crossover and mutation are complete, the chromosomes are standardized. Then the next population is evaluated and becomes the current population, and the above process is repeated.

There is another important parameter, N_max, the maximum number of classes allowed in a combination. Using this parameter, we eliminate some chromosomes directly. For a 6-class problem, if we choose N_max = 3, then chromosome 121112 will be eliminated, because combination {1,3,4,5} has four classes. The purpose of setting N_max is to avoid the existence of non-distributor modules with many classes. If there are many classes in a non-distributor module, it is unlikely that this module can recognize patterns with a high classification rate. With a chromosome like 111111, there is only one grouping and the job of the distributor would be trivial. For this extreme case, the classification error of the distributor module is obviously 0, because the distributor module combines all the classes together. Such an extreme case can be avoided using the parameter N_max.

V. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Setup

The Constructive Backpropagation (CBP) algorithm was used to train the networks in the experiments; please refer to [13] for details. CBP can reduce the excessive computational cost significantly, and it does not require any prior knowledge concerning decomposition. In this paper, RPROP is used with the following parameters: η+ = 1.2, η− = 0.5, Δ0 = 0.1, Δmax = 50, Δmin = 1.0e-6 (η+/η− is the increase/decrease parameter, Δ0 is the initial update-value, and Δmax/Δmin stands for the upper/lower limit of the update-value), with initial weights selected randomly from −0.25 to 0.25. Please refer to [14] for details. In order to avoid large computational cost and overfitting, a method called early stopping based on a validation set is used as the stopping criterion. Please refer to [22] for details.
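The RPROP parameters quoted above plug into the standard update-value adaptation rule. The following is our sketch of the simple RPROP variant, shown only to make the role of each parameter concrete; the paper's exact variant is in [14].

```python
def rprop_step(grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5, delta_max=50.0, delta_min=1.0e-6):
    """One RPROP adjustment for a single weight: grow the update-value while
    the gradient keeps its sign, shrink it after a sign change, and step
    against the current gradient's sign."""
    if grad * prev_grad > 0:
        delta = min(delta * eta_plus, delta_max)     # same sign: accelerate
    elif grad * prev_grad < 0:
        delta = max(delta * eta_minus, delta_min)    # sign flip: back off
    if grad > 0:
        step = -delta
    elif grad < 0:
        step = delta
    else:
        step = 0.0
    return delta, step
```

Only the sign of the gradient matters, which is what makes RPROP robust to the widely varying gradient magnitudes of constructively grown networks.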
The set of available patterns is divided into three sets: a training set is used to train the network, a validation set is used to evaluate the quality of the network during training and to measure overfitting, and a test set is used at the end of training to evaluate the resultant network. The sizes of the training, validation, and test sets are 50%, 25% and 25% of the problem's total available patterns.

Four benchmark classification problems, namely Vowel, Glass, Segmentation, and Letter Recognition, were used to evaluate the performance of the new modular network, Task Decomposition with Pattern Distributor. These classification problems were taken from the PROBEN1 benchmark collection [15] and the University of California at Irvine (UCI) repository of machine learning databases [16]. In the set of experiments undertaken, the first three classification problems were each run 20 times and the Letter Recognition problem was run 8 times (due to the long training time). All the hidden units and output units use the sigmoid activation function, and E_th is set at 0.1. When a hidden unit addition was required, 8 candidates were trained and the best one selected. All the experiments were simulated on a Pentium IV 2.4 GHz PC. The sub-problems were solved sequentially and the CPU time expended was recorded respectively.

B. Experiments for the PD network based on full and reduced pattern training

1) Glass: This data set is used to classify glass types. It consists of 9 inputs, 6 outputs, and 643 patterns (divided into 321 training patterns, 161 validation patterns, and 161 test patterns). These patterns were normalized and scaled so that each component lies within [0, 1]. Figure 5 shows the OP network structure used for this problem. The OP network is composed of 6 sub-modules and each sub-module recognizes one class from all the patterns. As described in Discussion 1 in Section 2, these sub-modules are combined into 2 modules in the OP network. The sub-modules
Fig. 5. The OP network used for the Glass problem. Module 1 contains Sub-modules 1-3 (classes 1, 3, 5); Module 2 contains Sub-modules 4-6 (classes 2, 4, 6).

Fig. 6. The PD network used for the Glass problem. A distributor module precedes the same two modules.

which recognize class 1, class 3 and class 5 are combined into Module 1, and the remaining sub-modules are grouped into Module 2.

Table I lists the data used in Expression (13). Here N_i represents the number of patterns in the test data set belonging to the classes of Module i, while N denotes the overall number of patterns. p_ii is the probability of error when Module i processes the patterns belonging to Module i, and p_i* is the probability of error when Module i processes the patterns not belonging to Module i.

TABLE I
CLASSIFICATION ERROR (%) IN DIFFERENT OP MODULES FOR THE GLASS DATA

Output Parallelism | N_i  | p_ii  | N-N_i | p_i*
Module 1           | 67   | 8.442 | 94    | 4.7340
Module 2           | 94   | 8.383 | 67    | 2.1642

Now we show that Discussion 2 in Section 2 is reasonable. There are two modules in the OP network. From Table I, we have p_11 = 8.442% and p_12 = p_1* = 4.7340%, so p_11 p_12 = 0.40%, which is much smaller than p_11 and p_12. Similarly, p_22 p_21 is only about 0.18%, much smaller than p_22 and p_21. Ignoring these product terms has little effect on the final results. In other words, the situation in which two or more modules make wrong decisions at the same time can be ignored.

Now we follow up Discussion 3 in Section 2, the effect of winner-take-all. From Table I, we can compute the classification error before the implementation of winner-take-all, which is (N_1 p_11 + N_1 p_2* + N_2 p_22 + N_2 p_1*) / N. The result is 17.7562%, slightly larger than the result using winner-take-all, which is 14.2547% (see Table II). This also matches our analysis in Discussion 3, Section 2.

The PD network structure for this problem is shown in Figure 6. The distributor module has two outputs: one is the combination {1,3,5} and the other is {2,4,6}. Module 1 consists of 3 sub-modules, identical to its counterpart in the OP network, and the same holds for Module 2.

Table II shows the experimental results of the ordinary method, the OP method, the PD method with Full Pattern Training (FPT) and the PD method with Reduced Pattern Training (RPT). The ordinary method is one in which a single-module neural network is constructed to solve the problem; the Constructive Backpropagation (CBP) algorithm is still used in the ordinary method. "Indep. Param." stands for the total number of independent parameters (i.e., the number of weights and biases in the network). "C. error" stands for classification error. Training time (in parallel) is the maximum training time among all the modules (all modules trained in parallel). Training time (in series) stands for the sum of the training times of all the modules (all modules trained in series).

TABLE II
RESULTS FOR THE GLASS DATA

Method                                        | Training time (s), parallel / series | Hidden Units | Indep. Param. | C. error (%)
Ordinary method (no task decomposition)       | 68.1                                 | 46           | 796           | 16.0870
Output Parallelism (2 modules, 6 sub-modules) | 63.7 / 197.7                         | 253.5        | 2848.5        | 14.2547
Pattern Distributor: distributor module       | 82.9                                 | 30.6         | 387.2         | 2.4224
Pattern Distributor network (FPT)             | 85.2 / 298.7                         | 280.9        | 3200.5        | 10.0
Pattern Distributor network (RPT)             | 82.9 / 194.3                         | 239.2        | 2443.8        | 7.1826

Using the ordinary and OP methods, the classification errors were 16.0870% and 14.2547% respectively, while using the PD method, the classification errors were 10.0% for FPT and 7.1826% for RPT. Comparing with the classification errors
from the former two approaches, the classification errors obtained by the PD network are much smaller. It can also be noted that the classification error is further reduced when using RPT instead of FPT. Now we explain why the PD network achieves a smaller classification error than the other two methods. According to our analysis, if Expression (3) is satisfied, the PD network will have better classification accuracy. Using the data in Table I, we have sum_i (N - N_i) p_i* = 5.9. From the classification error p_0 of the distributor module in Table II, we find N p_0 = 3.9. Thus, Condition (3) is satisfied, which means that the PD network will achieve a smaller classification error. From Table II, we can see that the numbers of hidden units and independent parameters in the PD network are larger than those in the ordinary network and the OP network. This can be attributed to the fact that the PD network has more modules than the other two. From Table II, we can also note the changes in training time across the three methods. With series training, the training time of FPT (1298.7s) is larger than those of the ordinary network (168.1s) and the OP network (197.7s) due to the larger number of modules in the PD method. However, the training time of RPT (194.3s) is much reduced compared to that of FPT and is thus comparable to the training times of the other two networks. The reason is that the number of training instances used in RPT is smaller than that in FPT. With parallel training, the training time of the PD network (RPT or FPT) is similar to those of the other two methods, and it is even shorter than that of the ordinary method. From the above analysis, we see that the PD method, especially RPT, performs better than the other methods.

2) Vowel: The input patterns of this data set are 10-element real vectors representing vowel sounds that belong to one of 11 classes. The set has 990 patterns in total (divided into 495 training patterns, 248 validation patterns, and 247 test patterns).
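Before turning to the Vowel results, the per-module error aggregation used in the Glass discussion above (the classification error before winner-take-all) can be sketched in code. This is a minimal sketch under our reading of that formula; the function name and the placeholder figures are ours, and error rates are passed as fractions rather than percentages:

```python
# Sketch: estimate the overall classification error of an OP network
# before winner-take-all is applied. Module i errs on its own patterns
# with rate p_ii and on the other modules' patterns with rate p_i*, so
# the weighted error is sum_i (N_i * p_ii + (N - N_i) * p_i*) / N.
def error_before_wta(N_i, p_ii, p_istar):
    """N_i: per-module pattern counts; rates given as fractions."""
    N = sum(N_i)
    wrong = 0.0
    for n_i, pii, pistar in zip(N_i, p_ii, p_istar):
        wrong += n_i * pii            # module i errs on its own patterns
        wrong += (N - n_i) * pistar   # module i errs on foreign patterns
    return wrong / N

# Illustrative two-module call with placeholder figures:
rate = error_before_wta([70, 90], [0.10, 0.08], [0.05, 0.02])
```

Winner-take-all then resolves many of the double-claim cases, which is why the post-WTA error reported in Table II is lower than this weighted sum.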
The patterns were normalized and scaled so that each component lies within [0, 1]. The distributor module has 3 outputs, corresponding to the combinations {1,2,3}, {4,5,6,7} and {8,9,10,11}. Module 1 recognizes classes 1, 2, 3 and consists of 3 sub-modules. Module 2 recognizes classes 4, 5, 6, 7 and consists of 4 sub-modules, while Module 3 recognizes classes 8, 9, 10, 11 and consists of 4 sub-modules. The OP network has the same Module 1, Module 2 and Module 3 as the PD network.

TABLE III
CLASSIFICATION ERROR IN DIFFERENT OP MODULES FOR THE VOWEL DATA

  Output Parallelism | N_i | p_ii (%) | N - N_i | p_i* (%)
  Module 1           | 69  |  9.855   |   178   | 2.9775
  Module 2           | 96  | 32.0833  |   151   | 3.2450
  Module 3           | 82  | 34.756   |   165   | 5.8788

The experimental results of the ordinary method, the OP method and the PD method for the Vowel data are listed in Table IV.

TABLE IV
RESULTS FOR THE VOWEL DATA

  Method                                          | Training time (s)   | Hidden units | Indep. param. | C. error (%)
                                                  | parallel / series   |              |               |
  Ordinary method (no task decomposition)         | 1237.9              | 23.6         | 640.2         | 37.660
  Output Parallelism (3 modules, 11 sub-modules)  | 58.7 / 418.9        | 184.4        | 2333.8        | 25.5466
  Pattern Distributor (1 distributor module, 3 modules and 11 sub-modules):
    the distributor module                        | -                   | 24.5         | 376           | 6.6802
    network (FPT)                                 | - / 1534.3          | 210.6        | 2730.2        | 24.8987
    network (RPT)                                 | - / 245.6           | 229.4        | 2955.8        | 8.7045

Using the ordinary method and the OP method, the classification errors were 37.660% for the ordinary method and 25.5466% for the OP method respectively, while using the PD method, the classification errors were 24.8987% for FPT and 8.7045% for RPT. The classification error obtained by FPT is much smaller than that of the ordinary method and resembles that of the OP method. For RPT, the classification error decreases to 8.7045%, which is much smaller than those of FPT and the other two methods. We can compute the number of wrongly-classified patterns using the data in Table III to explain why the PD method achieves smaller classification errors than the other two methods. We have sum_i (N - N_i) p_i* = 19.9 while N p_0 = 16.5, so Expression (3) is satisfied. Thus the PD network has smaller classification errors.
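The Expression (3) check just performed can be made concrete in code. A small sketch (the helper name is ours) that plugs in the Vowel figures from Tables III and IV:

```python
# Sketch: Expression (3) compares the patterns the OP modules would
# misclassify on foreign-class patterns, sum_i (N - N_i) * p_i*,
# against the patterns the distributor module would misroute, N * p_0.
# If the first quantity is larger, the PD network is expected to win.
def pd_beats_op(N_i, p_istar_pct, p0_pct):
    N = sum(N_i)
    lhs = sum((N - n) * p / 100.0 for n, p in zip(N_i, p_istar_pct))
    rhs = N * p0_pct / 100.0
    return lhs, rhs, lhs > rhs

# Vowel figures from Tables III and IV:
lhs, rhs, ok = pd_beats_op([69, 96, 82], [2.9775, 3.2450, 5.8788], 6.6802)
# lhs is about 19.9 patterns, rhs is about 16.5, so the condition holds
```

The same helper reproduces the Glass check (5.9 against 3.9, condition satisfied) and the Segmentation check (2.3 against 6.0, condition not satisfied).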
From Table IV, we can see that the numbers of hidden units and independent parameters in the PD network (RPT or FPT) are larger than those in the ordinary and OP networks. Table IV also shows the training time of these methods. Using series training, the training time of FPT (1534.3s) is longer than those of the ordinary network (1237.9s) and the OP network (418.9s). The training time of RPT (245.6s) is much reduced compared to that of FPT and is also smaller than those of the former two networks. If parallel training is used, the training process of the PD network saves even more time. From the above analysis, we see that RPT outperforms the others.

3) Segmentation: This data set consists of 18 inputs, 7 outputs, and 2310 patterns (1155 training patterns, 578 validation patterns, and 577 test patterns). The patterns were normalized and scaled so that each component lies within [0, 1]. The distributor module has 2 outputs, corresponding to the combinations {3,4,5} and {1,2,6,7}. Module 1 recognizes classes 3, 4, 5 and consists of 3 sub-modules. Module 2 recognizes classes 1, 2, 6, 7 and consists of 4 sub-modules. The OP network has the same module composition as the PD network. Table VI shows the simulation results of the ordinary method, the OP method, and the PD method (FPT and RPT). Using the ordinary method and the OP method, the classification errors
were 5.7366% and 5.1820% respectively, while using the PD method, the classification errors were 5.449% for FPT and 4.160% for RPT.

TABLE V
CLASSIFICATION ERROR IN DIFFERENT OP MODULES FOR THE SEGMENTATION DATA

  Output Parallelism | N_i | p_ii (%) | N - N_i | p_i* (%)
  Module 1           | 246 | 10.429   |   331   | 0.247
  Module 2           | 331 |  0.925   |   246   | 0.6098

From Table V, we have sum_i (N - N_i) p_i* = 2.3. From Table VI, we find N p_0 = 6.0. So Expression (3) is not satisfied, and FPT has a larger classification error than the OP network. It is also noted that the classification error decreases when RPT replaces FPT.

TABLE VI
RESULTS FOR THE SEGMENTATION DATA

  Method                                          | Training time (s)   | Hidden units | Indep. param. | C. error (%)
                                                  | parallel / series   |              |               |
  Ordinary method (no task decomposition)         | 693.8               | 29           | 887           | 5.7366
  Output Parallelism (7 sub-modules)              | 60.2 / 719.6        | 52.1         | 1375          | 5.1820
  Pattern Distributor (1 distributor module, 2 modules and 7 sub-modules):
    the distributor module                        | 23.4                | 13.9         | 329.9         | 1.0399
    network (FPT)                                 | 1002.2 / 1229.2     | 128.5        | 2754.9        | 5.449
    network (RPT)                                 | 23.4 / 706.9        | 128.9        | 2762.9        | 4.160

From Table VI, we can see that the numbers of hidden units and independent parameters in the PD networks are larger than those in the ordinary and OP networks. Table VI also shows the changes in training time across the three methods. Under series training, the training time of FPT (1229.2s) is larger than the training times of the ordinary network (693.8s) and the OP network (719.6s) due to the larger number of modules in the PD network. However, the training time of RPT (706.9s) is reduced compared to those of FPT and the OP network, and is thus comparable to the training time of the ordinary method. With parallel training, the training time of RPT is the smallest of all. From the above analysis, we see that RPT performs better than the other methods.

4) Letter recognition: The goal of this data set is to recognize digitized letters. Each element of the input vector is a numerical attribute computed from a pixel array containing the letters. The data set consists of 16 inputs, 26 outputs, and 20000 patterns (10000 training patterns, 5000 validation patterns, and 5000 test patterns). All the patterns were normalized and scaled so that each component lies within [0, 1]. The distributor module has 4 outputs, corresponding to the combinations {1,2,3,4,5,6,7}, {8,9,10,11,12,13,14}, {15,16,17,18,19,20} and {21,22,23,24,25,26}. Module 1 recognizes classes 1-7. Due to the long training time for this problem, Module 1 is not further divided into sub-modules. Module 2 recognizes classes 8-14, Module 3 recognizes classes 15-20, and Module 4 recognizes classes 21-26. The OP network has the same module composition as the PD network; for a fair comparison with the PD network, sub-modules are not used in the OP network either.

TABLE VII
CLASSIFICATION ERROR IN DIFFERENT OP MODULES FOR THE LETTER DATA

  Output Parallelism | N_i  | p_ii (%) | N - N_i | p_i* (%)
  Module 1           | 1359 | 20.833   |  3641   | 2.856
  Module 2           | 1333 | 24.82    |  3667   | 1.084
  Module 3           | 1195 | 25.09    |  3805   | 3.035
  Module 4           | 1111 |  3.05    |  3889   | 1.826

The experimental results of the ordinary method, the OP method and the PD method for the Letter data are listed in Table VIII. Using the ordinary method and the OP method, the classification errors were 21.672% for the ordinary method and 19.260% for the OP method respectively. Using the PD method, the classification errors were 20.155% for FPT and 15.855% for RPT. The classification error obtained by FPT resembles those of the ordinary method and the OP method. Using RPT, the classification error is much smaller than the classification errors of the other three networks. From Table VII, we have sum_i (N - N_i) p_i* = 330.3. From Table VIII, we find N p_0 = 609.8. So Expression (3) is not satisfied, which means that FPT has a larger classification error.

TABLE VIII
RESULTS FOR THE LETTER DATA

  Method                                          | Training time (s)   | Hidden units | Indep. param. | C. error (%)
                                                  | parallel / series   |              |               |
  Ordinary method (no task decomposition)         | 20845.05            | 73.6         | 3607          | 21.672
  Output Parallelism (4 modules)                  | 1559 / 6586.8       | 250          | 8497          | 19.260
  Pattern Distributor (1 distributor module, 4 modules):
    the distributor module                        | -                   | 29.5         | 409.0         | 12.195
    network (FPT)                                 | - / 26723.8         | 386.25       | 8384.5        | 20.155
    network (RPT)                                 | - / 4094.5          | 344.25       | 7139.0        | 15.855
From Table VIII, we see that the numbers of hidden units and independent parameters in the PD network are larger than those in the ordinary and OP networks. Table VIII also shows the training time of these methods. Under series training, the training time of FPT (26723.8s) is larger than those of the ordinary network (20845.05s) and the OP network (6586.8s). The training time of RPT (4094.5s) is greatly decreased compared to that of FPT and is also smaller than those of the former two networks. If parallel training were used, the training process of RPT could save even more time. From the above analysis, we can see that RPT performs better than the others.

C. Cross-talk based combination for distributor modules

Two classification problems, namely Segmentation and Glass, were used in the experiments.
1) Segmentation: The Cross-talk table for this problem, obtained using the method described in Section 4.1, is shown in Table IX. As mentioned in Section 4.1, if the distance between two classes in the Cross-talk table is relatively small, then these two classes are likely to be close in the feature space. From Table IX, we want to find classes which are close to each other and combine them. For simplicity, we use d(i,j) to denote the distance between class i and class j. From the table, d(3,5) is 1.6476, d(4,5) is 3.1062 and d(3,4) is 6.7673. These distances are relatively small compared with the other distance figures. Thus, we combine classes 3, 4, 5 together. Among the remaining four classes, class 1 is relatively close to classes 3, 4, 5, so we combine 1, 3, 4, 5 together. Now look at classes 2, 6, 7. Class 2 and class 6 are relatively close, and we combine them together. The final combination set is {{1,3,4,5},{2,6},{7}}. For comparison with the above set, we use another combination set, {{1,2,7},{3,4},{5,6}}, in which classes with relatively large distances are combined together. The experimental results for these two partitions are shown in Table X. Table X shows that the distributor module's classification error as well as the overall classification error are reduced when the classes close to each other are combined together.
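The grouping step just described can also be mechanized. The sketch below greedily merges the closest pair of groups under average linkage; the automation, the linkage choice, and the function name are our own illustration, since the paper itself selects the combinations by inspecting the Cross-talk table:

```python
# Sketch: greedy grouping from a cross-talk (distance) table.
# Repeatedly merge the two closest groups (average linkage between
# groups is our assumption) until n_groups modules remain.
def group_classes(dist, n_groups):
    groups = [[c] for c in range(len(dist))]
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = sum(dist[i][j] for i in groups[a] for j in groups[b])
                d /= len(groups[a]) * len(groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)   # merge the closest pair of groups
    return [sorted(g) for g in groups]

# Toy 4-class table: classes 0 and 1 are close, classes 2 and 3 are close.
toy = [[0, 1, 9, 9], [1, 0, 9, 9], [9, 9, 0, 2], [9, 9, 2, 0]]
# group_classes(toy, 2) returns [[0, 1], [2, 3]]
```

On a real Cross-talk table the greedy result can differ from a grouping chosen by inspection, which is one reason the paper also explores a GA-based search.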
TABLE IX
CROSS-TALK TABLE FOR THE SEGMENTATION DATA

       1       2       3       4       5       6       7
  1    0       233.0   9.948   7.66    8.876   40.34   53.9
  2    233.0   0       72.53   7.2     53.0    42.95   58.8
  3    9.948   72.53   0       6.767   1.648   18.55   40.66
  4    7.66    7.2     6.767   0       3.1062  14.458  54.70
  5    8.876   53.0    1.648   3.1062  0       16.34   29.02
  6    40.34   42.95   18.55   14.458  16.34   0       60.95
  7    53.9    58.8    40.66   54.70   29.02   60.95   0

TABLE X
RESULTS FOR THE SEGMENTATION PROBLEM USING CROSS-TALK BASED COMBINATION

  Grouping of output classes                          | Distributor module's classification error (%) | Classification error (%)
  Module 1 {1,3,4,5}, Module 2 {2,6}, Module 3 {7}    | 0.1040 | 4.687
  Module 1 {1,2,7}, Module 2 {3,4}, Module 3 {5,6}    | 4.7834 | 5.3900

2) Glass: The Cross-talk table for this data set is shown in Table XI. We can see that the distances among class 1, class 2 and class 3 (d(1,2) = 0.4467, d(1,3) = 0.1348 and d(2,3) = 0.150) are much smaller than the other distances, so classes 1, 2, 3 are combined. Among the remaining classes, class 4 seems close to class 2 (d(4,2) = 1.2606). However, d(4,1) and d(4,3) are very large, so class 4 is not added to the combination {1,2,3}. Note that class 4, class 5 and class 6 have relatively small distances among themselves. Thus, classes 4, 5, 6 are combined, and the final combination set is {{1,2,3},{4,5,6}}. For comparison with the above set, we use another combination set, {{3,4,6},{1,2,5}}, in which classes with relatively large distances are combined together. The experimental results for the two different partitions are shown in Table XII. From Table XII, it is confirmed that the distributor module's classification error as well as the overall classification error are reduced when the classes close to each other are combined together.

TABLE XI
CROSS-TALK TABLE FOR THE GLASS DATA

       1        2        3        4        5       6
  1    0        0.4467   0.1348   13.0807  7.773   10.8269
  2    0.4467   0        0.150    1.2606   1.963   5.8246
  3    0.1348   0.150    0        26.32    9.2086  7.6053
  4    13.0807  1.2606   26.32    0        4.6975  6.8534
  5    7.773    1.963    9.2086   4.6975   0       2.657
  6    10.8269  5.8246   7.6053   6.8534   2.657   0

TABLE XII
RESULTS FOR THE GLASS PROBLEM USING CROSS-TALK BASED COMBINATION

  Grouping of output classes            | Distributor module's classification error (%) | Classification error (%)
  Module 1 {1,2,3}, Module 2 {4,5,6}    | 2.4224 | 7.5776
  Module 1 {3,4,6}, Module 2 {1,2,5}    | 4.5963 | 8.5093

D. Genetic Algorithm based combination for distributor modules
TABLE XIII
RESULTS OF THE SEGMENTATION PROBLEM USING GA BASED COMBINATION

  Grouping of output classes                          | Distributor module's classification error (%) | Classification error (%)
  Module 1 {1,3,4,5}, Module 2 {6,7}, Module 3 {2}    | 0.073 | 4.532

TABLE XIV
RESULTS OF THE GLASS PROBLEM USING GA BASED COMBINATION

  Grouping of output classes            | Distributor module's classification error (%) | Classification error (%)
  Module 1 {1,3,5}, Module 2 {2,4,6}    | 2.4224 | 7.826

To compare with the results in Section 5.3, the same classification problems, namely Segmentation and Glass, were used in the experiments. We chose the probability of crossover p_c = 0.2 and the probability of mutation p_m = 0.2. For each chromosome, the classification error on the validation set is computed 5 times, and the evaluation value is the average of the classification errors from the 5 runs.
1) Segmentation: We set the maximum number of combinations N_max = 4 and the population size to 20. Due to the long computation time, only 30 generations were bred in our experiments. We identified the best chromosome, 1211133, i.e., {{1,3,4,5},{6,7},{2}}. The experimental results are shown in Table XIII. Comparing the results in Table XIII with those in Table X, it can be seen that using GA based combination, the classification error of the distributor module is decreased from 0.1040% to 0.073%. The classification error of the whole network is slightly better than that using Cross-talk based combination.
2) Glass: We set the maximum number of combinations N_max = 3 and the population size to 12. Due to the long computation time, only 30 generations were bred in our experiments. We identified the best chromosome, 121212, i.e., {{1,3,5},{2,4,6}}. The experimental results are shown in Table XIV. Comparing the results in Table XIV with those in Table XII, it can be seen that using GA based combination, the classification error of the distributor module is equal to that using Cross-talk based combination.
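The chromosome encoding used in these experiments assigns one module index per output class, and the Appendix gives the procedure for standardizing a chromosome so that equivalent groupings share one canonical form. A small sketch of both operations (the function names are ours):

```python
# Sketch: decode a grouping chromosome (one module digit per class,
# classes numbered from 1) and standardize it, following the Appendix:
# modules are relabeled 1, 2, 3, ... in order of first appearance.
def decode(chromosome):
    """E.g. '1211133' -> [[1, 3, 4, 5], [2], [6, 7]]."""
    groups = {}
    for cls, module in enumerate(chromosome, start=1):
        groups.setdefault(module, []).append(cls)
    return [groups[m] for m in sorted(groups)]

def standardize(chromosome):
    """Relabel modules by order of first appearance: '233111' -> '122333'."""
    mapping, out, t = {}, [], 0
    for module in chromosome:
        if module not in mapping:
            t += 1
            mapping[module] = str(t)
        out.append(mapping[module])
    return "".join(out)
```

For instance, decode('1211133') yields the grouping {{1,3,4,5},{2},{6,7}} found for the Segmentation problem, and standardize('233111') returns '122333', matching the worked example in the Appendix.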
The classification error of the whole network is slightly larger than that using Cross-talk based combination. In the above two sets of experiments, it took 11 epochs for the Glass problem and 14 epochs for the Segmentation problem to locate the best chromosome. With an increasing number of classes, the number of epochs required to locate the best chromosome will also increase. In the above two examples, we see that the classification error of the distributor module using GA based combination is better than or equal to that using Cross-talk based combination. However, the whole network's performance using GA based combination is not always better than that using Cross-talk based combination; it also depends on the recognition rates of the non-distributor modules. GA based combination may thus not be a good choice compared with Cross-talk based combination in these two examples, because the improvement in classification rate is trivial while much more computation is needed. For problems with a large number of classes, whose Cross-talk computation is more costly and harder to analyze, GA based combination may be the better choice. On the other hand, we may consider generating some initial chromosomes based on the Cross-talk analysis to further improve the quality of GA based combination.

VI. CONCLUSIONS

This paper presented a unique task decomposition approach called Task Decomposition with Pattern Distributor (PD). In this design, a special module called the distributor module is introduced in order to improve the accuracy of the whole network. A theoretical model was presented to compare the performance of PD with that of Output Parallelism (OP), a typical class decomposition method. The analysis showed that PD can outperform OP when the classification accuracy of the distributor module is guaranteed, and the experimental results confirmed this. In order to further improve the performance of PD, Reduced Pattern Training was introduced; it clearly increased the accuracy of the PD network.
According to our model, the distributor module's classification accuracy dominates the whole network's performance. Two combination methods, Cross-talk based combination and GA based combination, were proposed to find good class groupings for the distributor module. Cross-talk based combination could find a suitable combination set for the distributor module, while GA based combination could find the optimal (or near-optimal) combination set, at a larger computation cost. Our experimental results confirmed the effectiveness of the proposed combination methods. We will continue to improve the combination methods in the future; we hope to design new combination methods which not only find optimal or near-optimal sets for the distributor module but also further reduce the computation time. In this paper, the number of distributor modules is restricted to one. This can be relaxed by building multi-level PD networks with two or more distributor modules. How to further reduce the training pattern set while retaining the recognition rate is also on our future research agenda.

APPENDIX

The procedure for standardizing a chromosome is as follows:
(1) Add a minus sign to all the places. For example, a place with number 3 becomes -3. Chromosome 233111 becomes (-2)(-3)(-3)(-1)(-1)(-1).
(2) Set t = 1. Find the number in the first place and find all the places with the same number as the first one. Change the numbers in the first place and all the matching places into t. In the above example, chromosome (-2)(-3)(-3)(-1)(-1)(-1) becomes 1(-3)(-3)(-1)(-1)(-1).
(3) Set t = t + 1. Scanning from left to right, find the leftmost place whose number is negative and find all the following places with the same number. Change the numbers in these places into t. In the above example, when t = 2, chromosome 1(-3)(-3)(-1)(-1)(-1) becomes 122(-1)(-1)(-1).
(4) Repeat Step (3) until all the places contain positive numbers.

ACKNOWLEDGMENT

The authors thank the reviewers for their valuable comments.

REFERENCES

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.
[2] S. U. Guan and S. C. Li, "Parallel growing and training of neural networks using output parallelism," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 542-550, May 2002.
[3] R. Anand, K. Mehrotra, C. K. Mohan, and S. Ranka, "Efficient classification for multiclass problems using modular neural networks," IEEE Transactions on Neural Networks, vol. 6, no. 1, pp. 117-124, 1995.
[4] B. L. Lu and M. Ito, "Task decomposition and module combination based on class relations: A modular neural network for pattern classification," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1244-1256, 1999.
[5] S. U. Guan and S. C. Li, "An approach to parallel growing and training of neural networks," in Proceedings of the 2000 IEEE International Symposium on Intelligent Signal Processing and Communication Systems, Honolulu, Hawaii, vol. 2, pp. 101-104, 2000.
[6] S. U. Guan and J. Liu, "Feature selection for modular networks based on incremental training," Journal of Intelligent Systems, vol. 13, pp. 45-70, 2004.
[7] S. U. Guan and F. M. Zhu, "Modular feature selection using relative importance factors," International Journal of Computational Intelligence and Applications, vol. 4, pp. 57-75, 2004.
[8] S. U. Guan and F. M.
Zhu, "Class decomposition for GA-based classifier agents: a Pitt approach," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 34, pp. 381-392, 2004.
[9] S. U. Guan and R. Kiruthika, "Recursive percentage based hybrid pattern (RPHP) training for curve fitting," in Proceedings of the IEEE International Conference on Cybernetics and Intelligent Systems (CIS), pp. 445-450, 2004.
[10] B. L. Lu, H. Kita, and Y. Nishikawa, "A multisieving neural-network architecture that decomposes learning tasks automatically," in Proceedings of the IEEE Conference on Neural Networks, Orlando, FL, pp. 1319-1324, 1994.
[11] J. Baker, "Reducing bias and inefficiency in the selection algorithm," in Proceedings of the Second International Conference on Genetic Algorithms and Their Application, Cambridge, Massachusetts, United States, pp. 14-21, 1987.
[12] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[13] M. Lehtokangas, "Modeling with constructive backpropagation," Neural Networks, vol. 12, pp. 707-716, 1999.
[14] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: the RPROP algorithm," in Proceedings of the IEEE International Conference on Neural Networks, pp. 586-591, 1993.
[15] L. Prechelt, "PROBEN1: A set of neural network benchmark problems and benchmarking rules," Technical Report 21/94, Department of Informatics, University of Karlsruhe, Germany, 1994.
[16] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[17] C. S. Squires and J. W. Shavlik, "Experimental analysis of aspects of the cascade-correlation learning architecture," Machine Learning Research Group Working Paper 91-1, Department of Computer Science, University of Wisconsin, Madison, 1991.
[18] L. Prechelt, "Investigation of the CasCor family of learning algorithms," Neural Networks, vol. 10, no. 5, pp. 885-896, 1997.
[19] N. J. Nilsson, Learning Machines: Foundations of Trainable Pattern Classifying Systems.
New York: McGraw-Hill, 1965; reissued as The Mathematical Foundations of Learning Machines. San Mateo, CA: Morgan Kaufmann, 1990.
[20] M. R. Berthold and J. Diamond, "Constructive training of probabilistic neural networks," Neurocomputing, vol. 19, pp. 167-183, 1998.
[21] S. Ishihara and T. Nagano, "Text-independent speaker recognition utilizing neural-network techniques," Tech. Rep. IEICE, vol. NC93-2, pp. 71-77, 1994 (in Japanese).
[22] G. Auda, M. Kamel, and H. Raafat, "Modular neural network architectures for classification," in IEEE International Conference on Neural Networks, vol. 2, pp. 279-284, 1996.
[23] Y. Li, M. Dong, and R. Kothari, "Classifiability-based omnivariate decision trees," IEEE Transactions on Neural Networks, vol. 16, no. 6, pp. 1547-1560, 2005.
[24] K. Z. Mao and G. B. Huang, "Neuron selection for RBF neural network classifier based on data structure preserving criterion," IEEE Transactions on Neural Networks, vol. 16, no. 6, pp. 1531-1540, 2005.
[25] A. Szymkowiak-Have, M. A. Girolami, and J. Larsen, "Clustering via kernel decomposition," IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 256-264, 2006.
[26] N. Takahashi and T. Nishi, "Global convergence of decomposition learning methods for support vector machines," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1362-1369, 2006.

Sheng-Uei Guan received his M.Sc. and Ph.D. from the University of North Carolina at Chapel Hill. He is currently a professor in the School of Engineering and Design at Brunel University, UK. Prof. Guan worked in a prestigious R&D organization for several years, serving as a design engineer, project leader, and manager. After leaving industry, he joined Yuan-Ze University in Taiwan, where he served as deputy director of the Computing Center and chairman of the Department of Information & Communication Technology. Later he joined the Electrical & Computer Engineering Department at the National University of Singapore as an associate professor.

Chunyu Bao received the B.Sc. and M.Sc. degrees in physics from Beijing University, P.R. China, in 1999 and 2002, respectively. He is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His research interests include machine learning, neural networks, pattern classification and clustering.

TseNgee Neo received his B.Eng. degree from the Department of Electrical and Computer Engineering, National University of Singapore, Singapore.