Ensembling Neural Networks: Many Could Be Better Than All


Artificial Intelligence, 2002, vol.137, no.1-2.

Ensembling Neural Networks: Many Could Be Better Than All

Zhi-Hua Zhou*, Jianxin Wu, Wei Tang
National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, P.R. China

Abstract

Neural network ensemble is a learning paradigm where many neural networks are jointly used to solve a problem. In this paper, the relationship between the ensemble and its component neural networks is analyzed from the context of both regression and classification, which reveals that it may be better to ensemble many instead of all of the neural networks at hand. This result is interesting because at present, most approaches ensemble all the available neural networks for prediction. Then, in order to show that the appropriate neural networks for composing an ensemble can be effectively selected from a set of available neural networks, an approach named GASEN is presented. GASEN trains a number of neural networks at first. Then it assigns random weights to those networks and employs a genetic algorithm to evolve the weights so that they can characterize to some extent the fitness of the neural networks in constituting an ensemble. Finally it selects some neural networks based on the evolved weights to make up the ensemble. A large empirical study shows that, compared with some popular ensemble approaches such as Bagging and Boosting, GASEN can generate neural network ensembles with far smaller sizes but stronger generalization ability. Furthermore, in order to understand the working mechanism of GASEN, the bias-variance decomposition of the error is provided in this paper, which shows that the success of GASEN may lie in that it can significantly reduce the bias as well as the variance.

Keywords: Neural networks; Neural network ensemble; Machine learning; Selective ensemble; Boosting; Bagging; Genetic algorithm; Bias-variance decomposition.

1. Introduction

Neural network ensemble is a learning paradigm where a collection of a finite number of neural networks is trained for the same task [42]. It originates from Hansen and Salamon's work [2], which shows that the generalization ability of a neural network system can be significantly improved through ensembling a number of neural networks, i.e. training many neural networks and then combining their predictions. Since this technology behaves remarkably well, it has recently become a very hot topic in both the neural networks and machine learning communities [4], and has already been successfully applied to diversified areas such as face recognition [6, 22], optical character recognition [9, 9, 3], scientific image analysis [5], medical diagnosis [6, 47], seismic signals classification [4], etc.

In general, a neural network ensemble is constructed in two steps, i.e. training a number of component neural networks and then combining the component predictions.

* Corresponding author. E-mail addresses: zhouzh@nju.edu.cn (Z.-H. Zhou), wujx@ai.nju.edu.cn (J. Wu), tangwei@ai.nju.edu.cn (W. Tang).

As for training the component neural networks, the most prevailing approaches are Bagging and Boosting. Bagging is proposed by Breiman [3] based on bootstrap sampling []. It generates several training sets from the original training set and then trains a component neural network from each of those training sets. Boosting is proposed by Schapire [39] and improved by Freund et al. [, 2]. It generates a series of component neural networks whose training sets are determined by the performance of former ones. Training instances that are wrongly predicted by former networks play more important roles in the training of later networks.

There are also many other approaches for training the component neural networks. Examples are as follows. Hampshire and Waibel [7] utilize different objective functions to train distinct component neural networks. Cherkauer [5] trains component networks with different numbers of hidden units. Maclin and Shavlik [29] initialize component networks at different points in the weight space. Krogh and Vedelsby [28] employ cross-validation to create component networks. Opitz and Shavlik [34] exploit a genetic algorithm to train diverse knowledge-based component networks. Yao and Liu [46] regard all the individuals in an evolved population of neural networks as component networks.

As for combining the predictions of component neural networks, the most prevailing approaches are plurality voting or majority voting [2] for classification tasks, and simple averaging [33] or weighted averaging [35] for regression tasks. There are also many other approaches for combining predictions. Examples are as follows. Wolpert [45] utilizes learning systems to combine component predictions. Merz and Pazzani [3] employ principal component regression to determine the appropriate constraint for the weights of the component networks in combining their predictions. Jimenez [24] uses dynamic weights determined by the confidence of the component networks to combine the predictions. Ueda [43] exploits optimal linear weights to combine component predictions based on statistical pattern recognition theory.

Note that there are some approaches that use a number of neural networks to accomplish a task in the style of divide-and-conquer [23, 25]. However, in those approaches, the neural networks are in fact trained for different sub-tasks instead of for the same task, which makes those approaches usually be categorized as mixtures of experts instead of ensembles, and the discussion of them is beyond the scope of this paper.

It is worth mentioning that when a number of neural networks are available, at present most ensemble approaches employ all of those networks to constitute an ensemble. Yet the goodness of such a process has not been formally proved. In this paper, from the viewpoint of prediction, i.e. regression and classification, the relationship between the ensemble and its component neural networks is analyzed, which reveals that ensembling many of the available neural networks may be better than ensembling all of those networks. Then, in order to show that those many neural networks can be effectively selected from a number of available neural networks, an approach named GASEN (Genetic Algorithm based Selective ENsemble) is presented. This approach selects some neural networks to constitute an ensemble according to some evolved weights that could characterize the fitness of including the networks in the ensemble.
An empirical study on twenty big data sets shows that in most cases, the neural network ensembles generated by GASEN outperform those generated by some popular ensemble approaches such as Bagging and Boosting, in that GASEN utilizes far fewer component neural networks but achieves stronger generalization ability. Moreover, this paper employs the bias-variance decomposition to analyze the empirical results, which shows that the success of GASEN may owe to its ability of significantly reducing the bias along with the variance.

The rest of this paper is organized as follows. In Section 2, the relationship between the ensemble and its component neural networks is analyzed. In Section 3, GASEN is presented. In Section 4, a large empirical study is reported. In Section 5, the bias-variance decomposition of the error is provided. Finally, in Section 6, the contributions of this paper are summarized and several issues for future work are indicated.

2. Should we ensemble all the neural networks?

In order to know whether it is a good choice to ensemble all the available neural networks, this section analyzes the relationship between the ensemble and its component neural networks. Note that since regression and classification have distinct characteristics, the analyses are separated into two subsections.

2.1. Regression

Suppose the task is to use an ensemble comprising N component neural networks to approximate a function f: R^m → R^n, and the predictions of the component networks are combined through weighted averaging, where a weight w_i (i = 1, 2, ..., N) satisfying both Eq. (1) and Eq. (2) is assigned to the i-th component network f_i:

w_i \ge 0    (1)

\sum_{i=1}^{N} w_i = 1    (2)

The l-th output variable of the ensemble is determined according to Eq. (3), where f_{i,l} is the l-th output variable of the i-th component network:

\hat{f}_l = \sum_{i=1}^{N} w_i f_{i,l}    (3)

For convenience of discussion, here we assume that each component neural network has only one output variable, i.e. the function to be approximated is f: R^m → R. But note that the following derivation can be easily generalized to situations where each component neural network has more than one output variable.

Now suppose x ∈ R^m is sampled according to a distribution p(x), the expected output on x is d(x), and the actual output of the i-th component neural network is f_i(x). Then the output of the ensemble on x is:

\hat{f}(x) = \sum_{i=1}^{N} w_i f_i(x)    (4)

The generalization error E_i(x) of the i-th component neural network on x and the generalization error \hat{E}(x) of the ensemble on x are respectively:

E_i(x) = \left( f_i(x) - d(x) \right)^2    (5)

\hat{E}(x) = \left( \hat{f}(x) - d(x) \right)^2    (6)

Then the generalization error of the i-th component neural network and that of the ensemble, i.e. E_i and \hat{E}, on the distribution p(x) are respectively:

E_i = \int \mathrm{d}x \, p(x) \, E_i(x)    (7)

\hat{E} = \int \mathrm{d}x \, p(x) \, \hat{E}(x)    (8)

Now we define the correlation between the i-th and the j-th component neural networks as:

C_{ij} = \int \mathrm{d}x \, p(x) \left( f_i(x) - d(x) \right) \left( f_j(x) - d(x) \right)    (9)

It is obvious that C_{ij} satisfies both Eq. (10) and Eq. (11):

C_{ii} = E_i    (10)

C_{ij} = C_{ji}    (11)

Considering Eq. (4) and Eq. (6) we get:

\hat{E}(x) = \left( \sum_{i=1}^{N} w_i f_i(x) - d(x) \right) \left( \sum_{j=1}^{N} w_j f_j(x) - d(x) \right)    (12)

Then considering Eq. (8), Eq. (9), and Eq. (12) we get:

\hat{E} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j C_{ij}    (13)

For convenience of discussion, here we assume that all the component neural networks have equal weights, i.e. w_i = 1/N (i = 1, 2, ..., N). In other words, here we assume that the component predictions are combined via simple averaging. Then Eq. (13) becomes:

\hat{E} = \sum_{i=1}^{N} \sum_{j=1}^{N} C_{ij} / N^2    (14)

Now suppose that the k-th component neural network is excluded from the ensemble. Then the generalization error of the new ensemble is:

\hat{E}' = \sum_{i=1, i \ne k}^{N} \sum_{j=1, j \ne k}^{N} C_{ij} / (N-1)^2    (15)

From Eq. (14) and Eq. (15) we can derive that if Eq. (16) is satisfied then \hat{E} is not smaller than \hat{E}', which means that the ensemble excluding the k-th component neural network is better than the one including the k-th component neural network:

\hat{E} \le \left( 2 \sum_{j=1, j \ne k}^{N} C_{kj} + E_k \right) / (2N - 1)    (16)

Then considering Eq. (16) along with Eq. (14), we get the constraint on the k-th component neural network that should be excluded from the ensemble:

(2N - 1) \sum_{i=1}^{N} \sum_{j=1}^{N} C_{ij} \le N^2 \left( 2 \sum_{j=1, j \ne k}^{N} C_{kj} + E_k \right)    (17)

It is obvious that there are cases where Eq. (17) is satisfied. For an extreme example, when all the component neural networks are duplicates of the same neural network, Eq. (17) indicates that the size of the ensemble can be reduced without sacrificing the generalization ability.

Now we reach the conclusion that in the context of regression, when a number of neural networks are available, ensembling many of them may be better than ensembling all of them, and the networks that should be excluded from the ensemble satisfy Eq. (17).
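To make the criterion concrete, the following minimal Python sketch (illustrative, not the authors' code) estimates the correlation matrix of Eq. (9) from component predictions on a finite sample drawn from p(x), evaluates the equal-weight ensemble error of Eq. (14), and flags every network k for which the reduced ensemble of Eq. (15) is estimated to be no worse, i.e. for which the exclusion condition expressed by Eqs. (16)-(17) holds. The array shapes and function names are assumptions made for this example.

```python
import numpy as np

def correlation_matrix(preds, d):
    """Estimate C_ij of Eq. (9): average of (f_i(x) - d(x)) (f_j(x) - d(x)) over the sample.

    preds : (N, n) array, preds[i, t] = f_i(x_t);  d : (n,) array of expected outputs d(x_t).
    """
    resid = preds - d                       # residuals f_i(x_t) - d(x_t)
    return resid @ resid.T / d.shape[0]

def ensemble_error(C, w):
    """Generalization error of the weighted ensemble, Eq. (13): sum_ij w_i w_j C_ij."""
    return float(w @ C @ w)

def networks_worth_excluding(preds, d):
    """Indices k for which the equal-weight ensemble without network k is estimated
    to be no worse than the full ensemble (Eq. (15) compared with Eq. (14))."""
    N = preds.shape[0]
    C = correlation_matrix(preds, d)
    full_error = ensemble_error(C, np.full(N, 1.0 / N))
    flagged = []
    for k in range(N):
        keep = [i for i in range(N) if i != k]
        reduced = ensemble_error(C[np.ix_(keep, keep)], np.full(N - 1, 1.0 / (N - 1)))
        if reduced <= full_error:
            flagged.append(k)
    return flagged
```

On real data the integrals of Eqs. (7)-(9) are of course only approximated by such sample averages, and checking combinations of exclusions rather than single ones quickly becomes expensive, which is the computational difficulty that motivates the heuristic approach of Section 3.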

2.2. Classification

Suppose the task is to use an ensemble comprising N component neural networks to approximate a function f: R^m → L, where L is the set of class labels, and the predictions of the component networks are combined through majority voting, where each component network votes for a class and the class label receiving the most votes is regarded as the output of the ensemble. For convenience of discussion, here we assume that L contains only two class labels, i.e. the function to be approximated is f: R^m → {−1, +1}.¹ But note that the following derivation can also be generalized to situations where L contains more than two class labels.

Now suppose there are m instances, the expected output, i.e. D, on those instances is [d_1, d_2, ..., d_m]^T where d_i denotes the expected output on the i-th instance, and the actual output of the j-th component neural network, i.e. f_j, on those instances is [f_{j1}, f_{j2}, ..., f_{jm}]^T where f_{ji} denotes the actual output of the j-th component network on the i-th instance. D and f_j satisfy d_i ∈ {−1, +1} (i = 1, 2, ..., m) and f_{ji} ∈ {−1, +1} (j = 1, 2, ..., N; i = 1, 2, ..., m) respectively. It is obvious that if the actual output of the j-th component network on the i-th instance is correct according to the expected output then f_{ji} d_i = +1, otherwise f_{ji} d_i = −1. Thus the generalization error of the j-th component neural network on those m instances is:

E_j = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Error}(f_{ji} d_i)    (18)

where Error(x) is a function defined as:

Error(x) = 1 if x = −1;  0.5 if x = 0;  0 if x = 1    (19)

Now we introduce a vector Sum = [Sum_1, Sum_2, ..., Sum_m]^T where Sum_i denotes the sum of the actual outputs of all the component neural networks on the i-th instance,² i.e.

\mathrm{Sum}_i = \sum_{j=1}^{N} f_{ji}    (20)

Then the output of the neural network ensemble on the i-th instance is:

\hat{f}_i = \mathrm{Sgn}(\mathrm{Sum}_i)    (21)

where Sgn(x) is a function defined as:

Sgn(x) = 1 if x > 0;  0 if x = 0;  −1 if x < 0    (22)

It is obvious that \hat{f}_i ∈ {−1, 0, +1} (i = 1, 2, ..., m). If the actual output of the ensemble on the i-th instance is correct according to the expected output then \hat{f}_i d_i = +1; if it is wrong then \hat{f}_i d_i = −1; otherwise \hat{f}_i d_i = 0, which means that there is a tie on the i-th instance, e.g. three component networks vote for +1 while three other networks vote for −1. Thus the generalization error of the ensemble is:

\hat{E} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Error}(\hat{f}_i d_i)    (23)

Now suppose that the k-th component neural network is excluded from the ensemble. Then the output of the new ensemble on the i-th instance is:

\hat{f}'_i = \mathrm{Sgn}(\mathrm{Sum}_i - f_{ki})    (24)

¹ The set of two class labels is often denoted as {0, 1}. However, using {−1, +1} here is more helpful for the following derivation.
² Here the class labels, i.e. −1 and +1, are regarded as integers, which is the benefit of using {−1, +1} instead of {0, 1} in denoting the class labels.

and the generalization error of the new ensemble is:

\hat{E}' = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Error}(\hat{f}'_i d_i)    (25)

From Eq. (23) and Eq. (25) we can derive that if Eq. (26) is satisfied then \hat{E} is not smaller than \hat{E}', which means that the ensemble excluding the k-th component neural network is better than the one including the k-th component neural network:

\sum_{i=1}^{m} \left\{ \mathrm{Error}(\mathrm{Sgn}(\mathrm{Sum}_i) d_i) - \mathrm{Error}(\mathrm{Sgn}(\mathrm{Sum}_i - f_{ki}) d_i) \right\} \ge 0    (26)

Then considering that the exclusion of the k-th component neural network won't impact the output of the ensemble on the i-th instance where |Sum_i| > 1, and considering the property of the combination of the functions Error(x) and Sgn(x) when x ∈ {−1, 0, +1} and y ∈ {−1, +1}, i.e.

\mathrm{Error}(\mathrm{Sgn}(x)) - \mathrm{Error}(\mathrm{Sgn}(x - y)) = -\frac{1}{2} \mathrm{Sgn}(x + y)    (27)

we get the constraint on the k-th component neural network that should be excluded from the ensemble:

\sum_{i=1, |\mathrm{Sum}_i| \le 1}^{m} \mathrm{Sgn}\left( (\mathrm{Sum}_i + f_{ki}) d_i \right) \le 0    (28)

It is obvious that there are cases where Eq. (28) is satisfied. For an extreme example, when all the component neural networks are duplicates of the same neural network, Eq. (28) indicates that the size of the ensemble can be reduced without sacrificing the generalization ability.

Now we reach the conclusion that in the context of classification, when a number of neural networks are available, ensembling many of them may be better than ensembling all of them, and the networks that should be excluded from the ensemble satisfy Eq. (28).
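A matching sketch for the classification case (again illustrative Python, not from the paper): given an N × m array of ±1 votes f_ji and the ±1 labels d_i, it computes the majority-voting error of Eqs. (20)-(23) and tests the exclusion condition of Eq. (28) for a chosen network k. Only instances with |Sum_i| ≤ 1 are inspected, since the other outputs cannot change when a single vote is removed.

```python
import numpy as np

def voting_error(votes, d):
    """Majority-voting error of Eqs. (20)-(23); a tie (Sgn = 0) counts as 0.5."""
    sgn = np.sign(votes.sum(axis=0))            # Sgn(Sum_i), Eqs. (20)-(21)
    agree = sgn * d                             # +1 correct, -1 wrong, 0 tie
    return float(np.mean(np.where(agree == 1, 0.0, np.where(agree == 0, 0.5, 1.0))))

def satisfies_eq_28(votes, d, k):
    """Eq. (28): sum over instances with |Sum_i| <= 1 of Sgn((Sum_i + f_ki) d_i) <= 0."""
    s = votes.sum(axis=0)
    near_tie = np.abs(s) <= 1                   # excluding network k only matters here
    return float(np.sign((s + votes[k]) * d)[near_tie].sum()) <= 0.0
```

When satisfies_eq_28(votes, d, k) is true, voting_error(np.delete(votes, k, axis=0), d) should be no larger than voting_error(votes, d), which is exactly the statement derived above.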

3. Selective ensemble of neural networks

In Section 2 we have proved that ensembling many of the available neural networks may be better than ensembling all of those networks in both regression and classification, and the networks that should not be included in the ensemble satisfy Eq. (17) and Eq. (28) respectively. However, excluding those bad neural networks from the ensembles is not as easy a task as we may have imagined. Let us look at Eq. (17) and Eq. (28) again. It is obvious that even with assumptions such as there being only one output variable in regression and only two class labels in classification, the computational cost required by those equations for identifying the neural networks that should not join the ensembles is still too extensive to be met in real-world applications.

In this section we present a practical approach, i.e. GASEN, to find the neural networks that should be excluded from the ensemble. The basic idea of this approach is a heuristic, i.e. assuming each neural network can be assigned a weight that could characterize the fitness of including this network in the ensemble, then the networks whose weight is bigger than a pre-set threshold λ could be selected to join the ensemble.

Here we explain the motivation of GASEN from the context of regression. Suppose the weight of the i-th component neural network is w_i, which satisfies both Eq. (1) and Eq. (2). Then we get a weight vector w = (w_1, w_2, ..., w_N). Since the optimum weights should minimize the generalization error of the ensemble, considering Eq. (13), the optimum weight vector w_opt can be expressed as:

w_{opt} = \arg\min_{w} \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j C_{ij}    (29)

w_opt.k, i.e. the k-th (k = 1, 2, ..., N) component of w_opt, can be solved by the method of Lagrange multipliers, and satisfies:

\frac{\partial \left( \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j C_{ij} - 2\lambda \left( \sum_{i=1}^{N} w_i - 1 \right) \right)}{\partial w_{opt.k}} = 0    (30)

Eq. (30) can be simplified as:

\sum_{j=1}^{N} w_{opt.j} C_{kj} = \lambda    (31)

Considering that w_opt.k satisfies Eq. (2), we get:

w_{opt.k} = \sum_{j=1}^{N} (C^{-1})_{kj} \Big/ \sum_{i=1}^{N} \sum_{j=1}^{N} (C^{-1})_{ij}    (32)

It seems that we can solve w_opt from Eq. (32). But in fact, this equation rarely works well in real-world applications. This is because when a number of neural networks are available, there are often some networks that are quite similar in performance, which makes the correlation matrix (C_{ij}) an irreversible or ill-conditioned matrix, so that Eq. (32) cannot be solved.

However, although we cannot solve the optimum weights of the neural networks directly, we can try to approximate them in some way. Looking at Eq. (29) again, we may find that it can be viewed as defining an optimization problem. Considering that the genetic algorithm has been shown to be a powerful optimization tool [5], GASEN is developed. GASEN assigns a random weight to each of the available neural networks at first. Then it employs a genetic algorithm to evolve those weights so that they can characterize to some extent the fitness of the neural networks in joining the ensemble. Finally it selects the networks whose weight is bigger than a pre-set threshold λ to make up the ensemble. It is worth noting that if every evolved weight is bigger than λ, then all the available neural networks will join the ensemble. We believe that this corresponds to the situation where none of the component networks satisfies Eq. (17) in regression or Eq. (28) in classification.

Note that GASEN can be applied not only to regression but also to classification, because the aim of evolving the weights is only to select the component neural networks. In particular, the component predictions for regression are combined via simple averaging instead of weighted averaging. This is because we believe that using the weights both in the selection of the component neural networks and in the combination of the component predictions easily leads to overfitting, which is supported by the experiments described in Section 4.

Here GASEN is realized by utilizing the standard genetic algorithm [5] and a floating coding scheme that represents each weight in 64 bits. Thus each individual in the evolving population is coded in 8N bytes, where N is the number of the available neural networks. Note that GASEN can also be realized by employing other kinds of genetic algorithms and coding schemes. In each generation of the evolution, the weights are normalized so that they can be compared with the pre-set threshold λ. Currently GASEN uses a quite simple normalization scheme, i.e.

w_i' = w_i \Big/ \sum_{j=1}^{N} w_j    (33)

In order to evaluate the goodness of the individuals in the evolving population, a validation data set bootstrap sampled from the training set is used. Let E^V_w denote the estimated generalization error of the ensemble corresponding to the individual w on the validation set V. It is obvious that E^V_w can express the goodness of w, i.e. the smaller E^V_w is, the better w is. So, GASEN uses f(w) = 1 / E^V_w as the fitness function. It is worth mentioning that with the help of Eq. (4), E^V_w can be evaluated efficiently for regression tasks. But since we do not have such an intermediate result in the derivation presented in Section 2.2, the evaluation of E^V_w for classification tasks is relatively time-consuming.

The GASEN approach is summarized in Fig. 1, where T bootstrap samples S_1, S_2, ..., S_T are generated from the original training set, a component neural network N_t is trained from each S_t, and an ensemble N* is built from N_1, N_2, ..., N_T whose output is the average output of the component networks in regression, or the class label receiving the most votes in classification.

Input: training set S, learner L, trials T, threshold λ
Procedure:
  1. for t = 1 to T {
  2.     S_t = bootstrap sample from S
  3.     N_t = L(S_t)
  4. }
  5. generate a population of weight vectors
  6. evolve the population, where the fitness of a weight vector w is measured as f(w) = 1 / E^V_w
  7. w* = the evolved best weight vector
Output: ensemble N*
  N*(x) = Ave_{t: w*_t > λ} N_t(x)                         for regression
  N*(x) = argmax_{y ∈ Y} Σ_{t: w*_t > λ, N_t(x) = y} 1     for classification

Fig. 1. The GASEN approach.
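As a concrete illustration of Fig. 1, the following sketch implements the selection step for regression under stated assumptions: the T component networks are already trained, their predictions on the bootstrap-sampled validation set V are supplied as an array, and a simple real-coded genetic algorithm (elitist truncation selection, arithmetic crossover, Gaussian mutation) stands in for the standard GA used in the paper; the operators, population size, and generation count are illustrative choices, not the authors' settings. The fitness is f(w) = 1 / E^V_w, with E^V_w evaluated through the weighted combination of Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def gasen_select(preds_val, d_val, lam=0.05, pop_size=40, generations=100, mutate_sd=0.05):
    """Sketch of the GASEN selection step (regression).

    preds_val : (T, n) array of component predictions on the validation set V
    d_val     : (n,) array of validation targets
    lam       : threshold lambda applied to the normalized evolved weights
    Returns the indices of the selected component networks.
    """
    T = preds_val.shape[0]

    def normalize(w):                               # Eq. (33), after clipping negatives
        w = np.clip(w, 0.0, None)
        return w / w.sum() if w.sum() > 0 else np.full(T, 1.0 / T)

    def fitness(w):                                 # f(w) = 1 / E^V_w
        err = np.mean((w @ preds_val - d_val) ** 2)
        return 1.0 / (err + 1e-12)

    pop = [normalize(rng.random(T)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]                # keep the better half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.choice(len(elite), size=2, replace=False)
            alpha = rng.random()
            child = alpha * elite[a] + (1.0 - alpha) * elite[b]   # arithmetic crossover
            child = child + rng.normal(0.0, mutate_sd, size=T)    # Gaussian mutation
            children.append(normalize(child))
        pop = elite + children
    best = max(pop, key=fitness)
    return [t for t in range(T) if best[t] > lam]   # select networks with w_t > lambda
```

The selected networks would then be combined by simple averaging for regression or majority voting for classification, exactly as in the Output step of Fig. 1; a classification version would evaluate E^V_w by the voting error of Eq. (23) instead of the squared error used here.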

4. Empirical study

In order to know how well GASEN works, a large empirical study is performed. This section briefly introduces the approaches used to compare with GASEN, then presents the information on the data sets, then describes the experimental methodology, and finally reports on the experimental results.

4.1. Bagging and Boosting

In our experiments, GASEN is compared with two prevailing ensemble approaches, i.e. Bagging and Boosting. The Bagging algorithm [3] employs bootstrap sampling [] to generate many training sets from the original training set, and then trains a neural network from each of those training sets. The component predictions are combined via simple averaging for regression tasks and majority voting for classification tasks. In classification tasks, ties are broken arbitrarily.

The Boosting algorithms used for classification and regression are AdaBoost [2] and AdaBoost.R2 [8] respectively. Both algorithms sequentially generate a series of neural networks, where the training instances that are wrongly predicted by the previous neural networks play more important roles in the training of later networks. The component predictions are combined via weighted averaging for regression tasks and weighted voting for classification tasks, where the weights are determined by the algorithms themselves. Note that there are two ways, i.e. resampling [3] and reweighting [36], of determining the training sets used in Boosting. In our experiments resampling is employed because neural networks cannot explicitly support weighted instances. Moreover, it is worth mentioning that Boosting requires a weak learning algorithm whose error is bounded by a constant strictly less than 0.5. In practice, this requirement cannot be guaranteed, especially when dealing with multiclass tasks. In our experiments, instead of aborting the learning process when the error bound is breached, we generate a bootstrap sample from the original training set and continue up to a limit of 25 such samples at a given trial. Such an option has been adopted by Bauer and Kohavi [] before.

4.2. Data sets

Twenty big data sets are used in our experiments, each of which contains at least 1,000 instances. Among those data sets, ten are used for regression while the remaining ten are used for classification.

The information on the data sets used for regression is tabulated in Table 1. 2-d Mexican Hat and 3-d Mexican Hat have been used by Weston et al. [44] in investigating the performance of support vector machines. Friedman #1, Friedman #2, and Friedman #3 have been used by Breiman [3] in testing the performance of Bagging. Gabor, Multi, and SinC have been used by Hansen [8] in comparing several ensemble approaches. Plane has been used by Ridgeway et al. [37] in exploring the performance of boosted naive Bayesian regressors. In our experiments, the instances contained in those data sets are generated from the functions listed in Table 1. The constraints on the variables are also shown in Table 1, where U[x, y] means a uniform distribution over the interval determined by x and y. Note that in our experiments some noise terms have been added to the functions, but we have not shown them in Table 1 because the focus of our experiments is on the relative performance instead of the absolute performance of the compared approaches.

All the data sets used for classification are from the UCI machine learning repository [2], which has been extensively used in testing the performance of diversified kinds of classifiers. Here the data sets are selected according to the criterion that, after the removal of instances with missing values, each data set should contain at least 1,000 instances. The Credit (German) we used is the numerical version donated by Strathclyde University. In Image segmentation, a constant attribute is removed. In Allbp and Sick, seven useless nominal attributes are removed. In Hypothyroid and Sick-euthyroid, six useless nominal attributes are removed. Besides, in Allbp, Sick, Hypothyroid, and Sick-euthyroid, a continuous attribute that has a great number of missing values is removed. The information on the data sets used for classification is tabulated in Table 2.

4.3. Experimental methodology

In our experiments, 10-fold cross validation is performed on each data set, where ten neural network ensembles are trained by each compared approach in each fold. For Bagging and Boosting, each ensemble contains twenty neural networks. But for GASEN, the component networks are selected from twenty neural networks, that is, the number of networks in an ensemble generated by GASEN is far less than twenty.

Table 1
Data sets used for regression

data set        | function                                                                 | variable                                                        | size
2-d Mexican Hat | y = sinc(x) = sin(x) / x                                                 | x ~ U[-2π, 2π]                                                  | 5,000
3-d Mexican Hat | y = sinc(sqrt(x1^2 + x2^2)) = sin(sqrt(x1^2 + x2^2)) / sqrt(x1^2 + x2^2) | x_i ~ U[-4π, 4π]                                                | 3,000
Friedman #1     | y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5                     | x_i ~ U[0, 1]                                                   | 5,000
Friedman #2     | y = sqrt( x1^2 + (x2 x3 - 1/(x2 x4))^2 )                                 | x1 ~ U[0, 100], x2 ~ U[40π, 560π], x3 ~ U[0, 1], x4 ~ U[1, 11]  | 5,000
Friedman #3     | y = arctan( (x2 x3 - 1/(x2 x4)) / x1 )                                   | as for Friedman #2                                              | 3,000
Gabor           | y = (π/2) exp(-2 (x1^2 + x2^2)) cos(2π (x1 + x2))                        | x_i ~ U[0, 1]                                                   | 3,000
Multi           | y = 0.79 + 1.27 x1 x2 + 1.56 x1 x4 + 3.42 x2 x5 + 2.06 x3 x4 x5          | x_i ~ U[0, 1]                                                   | 4,000
Plane           | y = 0.6 x1 + 0.3 x2                                                      | x_i ~ U[0, 1]                                                   | 1,000
Polynomial      | y = 1 + 2x + 3x^2 + 4x^3 + 5x^4                                          | x ~ U[0, 1]                                                     | 3,000
SinC            | y = sin(x) / x                                                           | x ~ U[0, 2π]                                                    | 3,000

Table 2
Data sets used for classification (columns: class, nominal attributes, continuous attributes, size; data sets: Allbp, Chess, Credit (German), Hypothyroid, Image segmentation, LED-7, LED-24, Sick, Sick-euthyroid, Waveform-40)

The training sets of the ensembles are bootstrap sampled from the training set of the fold. In order to increase the diversity of those ensembles, the size of their training sets is roughly half of that of the fold. For example, for a data set with 1,000 instances, the training set of each fold comprises 900 instances, and each of the training sets of the ensembles contains 450 instances that are bootstrap sampled from those 900 instances. The training sets of the neural networks used to constitute the ensembles are bootstrap sampled from the training set of the ensembles. Such a methodology is helpful in estimating the bias and variance [4] of the ensemble approaches, which will be described in Section 5.
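A minimal sketch of the sampling scheme just described (10 folds, ten ensembles per fold, twenty component networks per ensemble); it is written here in Python, works purely with index sets, and leaves the actual network training to the learner. The function name and the generator interface are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fold_samples(n_instances, n_folds=10, n_ensembles=10, n_components=20):
    """Yield, per fold: the test indices and, for each ensemble, one bootstrap
    training index array per component network (nested as described above)."""
    order = rng.permutation(n_instances)
    folds = np.array_split(order, n_folds)
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        ensembles = []
        for _ in range(n_ensembles):
            # ensemble training set: bootstrap sample of roughly half the fold's training set
            ens_idx = rng.choice(train_idx, size=len(train_idx) // 2, replace=True)
            # component training sets: bootstrap samples from the ensemble's training set
            ensembles.append([rng.choice(ens_idx, size=len(ens_idx), replace=True)
                              for _ in range(n_components)])
        yield test_idx, ensembles
```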

Here the genetic algorithm employed by GASEN is realized by the GAOT toolbox developed by Houck et al. [2]. The genetic operators, including selection, crossover, and mutation, and the system parameters, including the crossover probability, the mutation probability, and the stopping criterion, are all set to the default values of GAOT. The pre-set threshold λ used by GASEN is set to 0.05. The validation set used by GASEN is bootstrap sampled from its training set.

The neural networks in the ensembles are trained by the implementation of the Backpropagation algorithm [38] in MATLAB [7]. Each network has one hidden layer that comprises five hidden units. The parameters such as the learning rate are set to the default values of MATLAB. Here we do not optimize the architecture and the parameters of those networks because we care about the relative performance of the compared ensemble approaches instead of their absolute performance. During the training process, the generalization error of each network is estimated in each epoch on a validation set. If the error does not change in five consecutive epochs, the training of the network is terminated in order to avoid overfitting. The validation set used by a neural network is bootstrap sampled from its training set.

In order to know how well the compared ensemble approaches work, i.e. how significantly the generalization ability is improved by utilizing those ensemble approaches, in our experiments we also test the performance of single neural networks. For each data set, in each fold, ten single neural networks are trained. The training sets, the architecture, the parameters, and the training process of those neural networks are all crafted in the same way as those of the networks used in the ensembles.

4.4. Results

The result of an approach in each fold is the average result of the ten learning systems (ensembles or single neural networks) generated by the approach in the fold, and the reported result is the average result of the ten folds, i.e. the 10-fold cross validation result. For regression tasks, the error is measured as the mean squared error on the test instances. For classification tasks, the error is measured as the number of test instances incorrectly predicted divided by the number of test instances. The comparison results on regression and classification are shown in Fig. 2 and Fig. 3 respectively. Note that since we care about relative performance instead of absolute performance, the error of Bagging, Boosting, and GASEN has been normalized according to that of the single neural networks. In other words, the error of single neural networks is regarded as 1.0, and the reported error of Bagging, Boosting, and GASEN is in fact the ratio against the error of the single neural networks. Moreover, in each of those two figures there is a subfigure titled average which shows the average error of the compared approaches on all those regression/classification tasks.

Fig. 2 shows that all three ensemble approaches are consistently better than single neural networks in regression. Pairwise two-tailed t-tests indicate that GASEN is significantly better than both Bagging and Boosting in most regression tasks, i.e. 2-d Mexican Hat, Friedman #1, Friedman #2, Gabor, Multi, Polynomial, and SinC. As for the remaining three tasks, in Friedman #3 and Plane all three ensemble approaches obtain similar performance, while in 3-d Mexican Hat GASEN is better than Bagging but worse than Boosting. Note that in half of those ten tasks, i.e. 2-d Mexican Hat, Friedman #2, Gabor, Polynomial, and SinC, the performance of GASEN is so good that the error is reduced to a degree close to zero. So, we believe that GASEN is better than both Bagging and Boosting when utilized in regression, which is supported by the subfigure titled average in Fig. 2.
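The paper does not spell out how the pairwise two-tailed t-tests are configured; one plausible reading, pairing the per-fold errors of two approaches on the same data set, can be sketched as follows (the pairing unit and the 0.05 significance level are assumptions of this example).

```python
from scipy.stats import ttest_rel

def significantly_better(errors_a, errors_b, alpha=0.05):
    """Paired two-tailed t-test on matched error measurements (e.g. the ten fold errors).
    Returns True when approach A has the lower total error and the difference is significant."""
    _, p_value = ttest_rel(errors_a, errors_b)   # two-sided by default
    return p_value < alpha and sum(errors_a) < sum(errors_b)
```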
Fig. 3 shows that GASEN is consistently better than single neural networks in classification. Moreover, pairwise two-tailed t-tests indicate that GASEN is significantly better than both Bagging and Boosting in half of the tasks, i.e. Chess, Credit (German), Hypothyroid, Sick, and Sick-euthyroid.

Fig. 2. Comparison of the error of Bagging, Boosting, and GASEN on regression tasks (subfigures: 2-d Mexican Hat, 3-d Mexican Hat, Friedman #1, Friedman #2, Friedman #3, Gabor, Multi, Plane, Polynomial, SinC, and average).

As for the remaining tasks, in LED-7 all three ensemble approaches obtain similar performance; in Image segmentation and Waveform-40, GASEN is far better than Boosting but comparable to Bagging; in LED-24 GASEN is far better than Boosting but slightly worse than Bagging; and in Allbp GASEN is worse than Boosting but comparable to Bagging. So, we believe that GASEN is better than both Bagging and Boosting when utilized in classification, which is supported by the subfigure titled average in Fig. 3.

In summary, Fig. 2 and Fig. 3 show that GASEN is superior to both Bagging and Boosting in both regression and classification, which strongly supports our theory formally proved in Section 2 that it may be a better choice to ensemble many instead of all of the neural networks at hand. Fig. 2 and Fig. 3 also show that Bagging is consistently better than a single neural network in both regression and classification, but the performance of Boosting is not so stable. There are tasks such as 3-d Mexican Hat and Allbp where Boosting obtains the best performance, but there are also tasks such as Credit (German), LED-24, and Waveform-40 where the performance of Boosting is even worse than that of single neural networks. Such an observation is accordant with those reported in previous works [, 32].

Fig. 3. Comparison of the error of Bagging, Boosting, and GASEN on classification tasks (subfigures: Allbp, Chess, Credit (German), Hypothyroid, Image segmentation, LED-7, LED-24, Sick, Sick-euthyroid, Waveform-40, and average).

We also compare GASEN with its two variants on those twenty data sets with 10-fold cross validation. The first variant is GASEN-w, which uses the evolved weights to select the component neural networks but combines the predictions of the selected networks with the normalized version of their evolved weights. In other words, weighted averaging or weighted voting is used instead of simple averaging or majority voting for combining the predictions of the selected networks. The second variant is GASEN-wa, which also uses the genetic algorithm to evolve the weights but does not select the component neural networks according to the evolved weights. In other words, all the available neural networks are kept in the ensembles and their predictions are combined via weighted averaging or weighted voting with the normalized version of their evolved weights. Note that the computational cost of GASEN-w and GASEN-wa is similar to that of GASEN because the main difference of those approaches only lies in the utilization of the evolved weights. The comparison results on regression and classification are shown in Table 3 and Table 4 respectively.
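For clarity, the three combination rules being compared can be written out for regression as the following sketch (illustrative Python: preds is an N × n array of component predictions, w the normalized evolved weight vector, and the threshold the λ = 0.05 used earlier); the classification counterparts would replace averaging by majority or weighted voting.

```python
import numpy as np

def combine_gasen(preds, w, lam=0.05):
    """GASEN: keep networks with w_i > lambda, then simple averaging."""
    keep = w > lam
    return preds[keep].mean(axis=0)

def combine_gasen_w(preds, w, lam=0.05):
    """GASEN-w: same selection, but weighted averaging with the renormalized weights."""
    keep = w > lam
    wk = w[keep] / w[keep].sum()
    return wk @ preds[keep]

def combine_gasen_wa(preds, w):
    """GASEN-wa: no selection; weighted averaging over all available networks."""
    return (w / w.sum()) @ preds
```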

Note that since we care about relative performance instead of absolute performance, the error of GASEN, GASEN-w, and GASEN-wa has been normalized according to that of the single neural networks. It is also worth mentioning that each ensemble generated by GASEN-wa contains twenty component neural networks, but each ensemble generated by GASEN-w contains the same number of component networks as that generated by GASEN, which is far less than twenty. The average number of component neural networks used by GASEN in constituting an ensemble is also shown in Table 3 and Table 4.

Table 3
Comparison of the error of GASEN, GASEN-w, and GASEN-wa on regression tasks (columns: GASEN, GASEN-w, GASEN-wa, num. of networks used by GASEN; rows: 2-d Mexican Hat, 3-d Mexican Hat, Friedman #1, Friedman #2, Friedman #3, Gabor, Multi, Plane, Polynomial, SinC, average)

Table 4
Comparison of the error of GASEN, GASEN-w, and GASEN-wa on classification tasks (columns: GASEN, GASEN-w, GASEN-wa, num. of networks used by GASEN; rows: Allbp, Chess, Credit (German), Hypothyroid, Image segmentation, LED-7, LED-24, Sick, Sick-euthyroid, Waveform-40, average)

Pairwise two-tailed t-tests indicate that GASEN is significantly better than GASEN-w on almost all the classification data sets. We believe that this is because using the evolved weights both in the selection of the component neural networks and in the combination of the component predictions easily leads to overfitting. There is no significant difference between GASEN and GASEN-w in regression. We believe that this is because the regression data sets we used are artificially generated while most of the classification data sets we used are from real-world tasks, so the noise in the regression data sets is far less than that in the classification data sets. Therefore, overfitting happens more easily on the classification data sets than on the regression data sets in our experiments.

Pairwise two-tailed t-tests also indicate that there is no significant difference between the generalization ability of the ensembles generated by GASEN and those generated by GASEN-wa.

We believe that this is because GASEN-wa does not use the evolved weights to select the component neural networks, so overfitting may not be as serious as in GASEN-w. But since the size of the ensembles generated by GASEN is on average only about 19% (3.71/20) in classification and 36% (7.10/20) in regression of the size of the ensembles generated by GASEN-wa, and those two approaches have similar computational cost, we believe that GASEN is better than GASEN-wa.

5. Bias-variance decomposition

In order to explore the reason for the success of GASEN, the bias-variance decomposition is employed to analyze the empirical results of Bagging, Boosting, and GASEN. This section briefly introduces the bias-variance decomposition and then presents the decomposition results.

5.1. Bias and variance

The bias-variance decomposition [4] is a powerful tool for investigating the working mechanism of learning approaches. Given a learning target and the size of the training set, it breaks the expected error of a learning approach into the sum of three non-negative quantities, i.e. the intrinsic noise, the bias, and the variance. The intrinsic noise is a lower bound on the expected error of any learning approach on the target. The bias measures how closely the average estimate of the learning approach is able to approximate the target. The variance measures how much the estimate of the learning approach fluctuates for the different training sets of the same size.

At present there are several kinds of bias-variance decomposition schemes [4, 26, 27]. Here we adopt the one proposed by Kohavi and Wolpert [26]. Let Y_H be the random variable representing the label of an instance in the hypothesis space, and Y_F be the random variable representing the label of an instance in the target. Then the bias and the variance are expressed as Eq. (34) and Eq. (35) respectively:

\mathrm{bias}_x^2 = \frac{1}{2} \sum_{y \in Y} \left( P(Y_F = y \mid x) - P(Y_H = y \mid x) \right)^2    (34)

\mathrm{variance}_x = \frac{1}{2} \left( 1 - \sum_{y \in Y} P(Y_H = y \mid x)^2 \right)    (35)

According to Kohavi and Wolpert [26], for estimating the bias and variance of a learning approach, the original data set is split into two parts, that is, D and E. Then, training sets are sampled from D, whose size is roughly half of that of D, to guarantee that there are not many duplicate training sets among those training sets even for small D. After that, the learning approach is run on each of those training sets and the bias and variance are estimated with Eq. (34) and Eq. (35). The whole process can be repeated several times to improve the estimates.

Since it is difficult to estimate the intrinsic noise in practice, the actual bias-variance decomposition scheme of Kohavi and Wolpert [26] generates a bias term that includes the intrinsic noise. Therefore the bias plus the variance should be equal to the average error. However, if an ensemble approach employs majority voting in classification, then the sum of the bias and the variance generated by such a decomposition scheme may not be strictly equal to the average error. Nevertheless, this is not a serious problem in our scenario, because such a problem also occurs in some other bias-variance decomposition schemes [4] and the generated bias and variance are still useful in exploring the reason for the success of GASEN.
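A minimal sketch of how the classification form of this scheme (Eqs. (34)-(35)) could be estimated, assuming R models trained on the sampled training sets and evaluated on the held-out part E; as in the text, P(Y_F = y | x) is taken as a point mass on the observed label, so the intrinsic noise is folded into the bias term. The function and variable names are illustrative.

```python
import numpy as np

def kohavi_wolpert(pred_labels, true_labels, classes):
    """Estimate the averaged bias^2 and variance terms of Eqs. (34)-(35).

    pred_labels : (R, n) labels predicted by the R trained models on n test instances
    true_labels : (n,) observed labels, used as point-mass estimates of P(Y_F = y | x)
    """
    bias2 = np.zeros(true_labels.shape[0])
    var = np.zeros(true_labels.shape[0])
    for y in classes:
        p_h = (pred_labels == y).mean(axis=0)      # estimated P(Y_H = y | x)
        p_f = (true_labels == y).astype(float)     # point-mass P(Y_F = y | x)
        bias2 += (p_f - p_h) ** 2
        var += p_h ** 2
    return 0.5 * bias2.mean(), 0.5 * (1.0 - var).mean()
```

Averaging the per-instance terms over the test set gives the aggregate bias and variance that are then normalized against those of the single networks.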

5.2. Results

With the experimental methodology described in Section 4.3, it is easy for us to estimate the bias and variance of the compared approaches according to Kohavi and Wolpert's [26] decomposition scheme. In detail, in our experiments, 90% of the original data set is used as the original training set while the remaining 10% is used as the test set. From the original training set, ten training sets whose size is roughly half that of the original training set are sampled. Then, the ensemble approaches are run on each of those ten training sets and their bias and variance are estimated with Eq. (34) and Eq. (35). Such a process is repeated ten times to improve the estimates.

The bias of the compared ensemble approaches on regression and classification is shown in Fig. 4 and Fig. 5 respectively, and the variance is shown in Fig. 6 and Fig. 7. Note that since we care about relative performance instead of absolute performance, the bias/variance of Bagging, Boosting, and GASEN has been normalized according to that of single neural networks. In other words, the bias/variance of single neural networks is regarded as 1.0, and the reported bias/variance of Bagging, Boosting, and GASEN is in fact the ratio against the bias/variance of the single neural networks. Moreover, in each of those figures there is a subfigure titled average which shows the average bias/variance of the compared approaches on all those regression/classification tasks.

Fig. 4. Comparison of the bias of Bagging, Boosting, and GASEN on regression tasks (subfigures: 2-d Mexican Hat, 3-d Mexican Hat, Friedman #1, Friedman #2, Friedman #3, Gabor, Multi, Plane, Polynomial, SinC, and average).

Fig. 5. Comparison of the bias of Bagging, Boosting, and GASEN on classification tasks (subfigures: Allbp, Chess, Credit (German), Hypothyroid, Image segmentation, LED-7, LED-24, Sick, Sick-euthyroid, Waveform-40, and average).

Fig. 4 shows that in most regression tasks, i.e. 2-d Mexican Hat, 3-d Mexican Hat, Friedman #2, Gabor, Multi, Polynomial, and SinC, Boosting can significantly reduce the bias, and the degree of its reduction is bigger than that of Bagging except in Multi. As for the remaining three tasks, in Friedman #3 both Boosting and Bagging cannot reduce the bias, and in Friedman #1 and Plane Boosting even increases the bias. Therefore it seems that Boosting is better than Bagging in reducing the bias but its performance is not very stable, which is accordant with the observations reported in previous works [, 4].

Pairwise two-tailed t-tests indicate that in almost all the regression tasks except 3-d Mexican Hat, Friedman #3, and Plane, GASEN is significantly better than Boosting in reducing the bias. In particular, GASEN's ability to reduce the bias is so good that in Friedman #2, Polynomial, and SinC the bias is even reduced to a degree close to zero. Therefore we believe that in regression tasks, GASEN is the best among the compared ensemble approaches in reducing the bias, which is supported by the subfigure titled average in Fig. 4.

Fig. 6. Comparison of the variance of Bagging, Boosting, and GASEN on regression tasks (subfigures: 2-d Mexican Hat, 3-d Mexican Hat, Friedman #1, Friedman #2, Friedman #3, Gabor, Multi, Plane, Polynomial, SinC, and average).

Fig. 5 shows that in the majority of the classification tasks, i.e. Allbp, Chess, Hypothyroid, Image segmentation, Sick, and Sick-euthyroid, Boosting can reduce the bias, but Bagging can only reduce the bias in Allbp and Chess. Moreover, when Boosting cannot reduce the bias, such as in Credit (German), LED-7, LED-24, and Waveform-40, neither can Bagging. Therefore it seems that Boosting is more effective than Bagging in reducing the bias, which is accordant with the observations reported in previous works [, 4].

Pairwise two-tailed t-tests indicate that when Boosting can significantly reduce the bias in the classification tasks, GASEN can also do so, although the degree of its reduction may not be as large as that of Boosting. Therefore we believe that in classification tasks, although GASEN's ability to reduce the bias is not as good as that of Boosting, it is still better than that of Bagging, which is supported by the subfigure titled average in Fig. 5.

So, from Fig. 4 and Fig. 5, we believe that the success of GASEN may partially owe to its ability of significantly reducing the bias.

Fig. 6 shows that Bagging can significantly reduce the variance in all regression tasks, but the performance of Boosting is not so stable.

Fig. 7. Comparison of the variance of Bagging, Boosting, and GASEN on classification tasks (subfigures: Allbp, Chess, Credit (German), Hypothyroid, Image segmentation, LED-7, LED-24, Sick, Sick-euthyroid, Waveform-40, and average).

There are tasks such as 2-d Mexican Hat, Gabor, and SinC where Boosting reduces the variance more significantly than Bagging, but there are also tasks such as Plane where Boosting greatly increases the variance. Pairwise two-tailed t-tests indicate that GASEN can also significantly reduce the variance in all the regression tasks. Moreover, GASEN's ability to reduce the variance is even significantly better than that of Bagging in almost half of those tasks, i.e. Friedman #2, Gabor, Polynomial, and SinC. Therefore we believe that in regression tasks, GASEN is the best among the compared ensemble approaches in reducing the variance, which is supported by the subfigure titled average in Fig. 6.

Fig. 7 shows that Bagging can significantly reduce the variance in all classification tasks, but the performance of Boosting is not so stable. There are tasks such as Allbp, Chess, LED-7, and Sick where Boosting greatly reduces the variance, but there are also tasks such as Credit (German), LED-24, and Waveform-40 where Boosting greatly increases the variance.

Pairwise two-tailed t-tests indicate that when Bagging can significantly reduce the variance in the classification tasks, GASEN can also do so, although the degree of its reduction may not be as large as that of Bagging. Therefore we believe that in classification tasks, although GASEN's ability to reduce the variance is not as good as that of Bagging, it is still better than that of Boosting, which is supported by the subfigure titled average in Fig. 7.

So, from Fig. 6 and Fig. 7, we believe that the success of GASEN may partially owe to its ability of significantly reducing the variance.

In summary, from Fig. 4 to Fig. 7 we find that in regression tasks GASEN can do better than both Bagging and Boosting in reducing both the bias and the variance, and in classification tasks GASEN is better than Bagging in reducing the bias and better than Boosting in reducing the variance. So, we believe that the success of GASEN may lie in that it has the ability of significantly reducing both the bias and the variance simultaneously. We guess that GASEN can reduce the bias because it efficiently utilizes the training data in that it employs a validation set that is bootstrap sampled from the training set, and it can reduce the variance because it combines multiple versions of the same learning approach. However, those guesses should be justified by rigorous theoretical analysis.

6. Conclusions

At present, most neural network ensemble approaches utilize all the available neural networks to constitute an ensemble. However, the goodness of such a process has not yet been formally proved. In this paper, the relationship between the ensemble and its component neural networks is analyzed, which reveals that it may be a better choice to ensemble many instead of all of the available neural networks. This theory may be useful in designing powerful ensemble approaches.

Then, in order to show the feasibility of the theory, an ensemble approach named GASEN is presented. A large empirical study shows that GASEN is superior to both Bagging and Boosting in both regression and classification because it utilizes far fewer component neural networks but achieves stronger generalization ability. Note that although GASEN has obtained impressive performance in our empirical study, we believe that there are approaches that could do better than GASEN along the way that GASEN goes, i.e. ensembling many instead of all available neural networks under certain circumstances. The reason is that GASEN has not been finely tuned, because its aim is only to show the feasibility of our theory. In other words, the aim of GASEN is just to show that the networks appropriate for constituting the ensemble can be effectively selected from a collection of available neural networks. So, its performance might at least be improved through utilizing better fitness functions, coding schemes, or genetic operators. In the future we hope to use some other large-scale data sets such as NIST to test GASEN and tune its performance, and then apply it to real-world applications. Moreover, it is worth mentioning that finding stronger ensemble approaches based on the recognition that many could be better than all is an interesting issue for future work.

In order to explore the reason for the success of GASEN, the bias-variance decomposition is employed in this paper to analyze the empirical results. It seems that the success of GASEN mainly lies in that GASEN can reduce the bias as well as the variance. We guess that GASEN can reduce the bias because it efficiently utilizes the training data in that it employs a validation set bootstrap sampled from the training set, and it can reduce the variance because it combines multiple versions of the same learning approach. Rigorous theoretical analysis may be necessary to justify those guesses, which is another interesting issue for future work.
