ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble

Sai Zhang, szhang12@illinois.edu
Dept. of Electr. & Comput. Eng., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA

Abstract—In this paper, we propose to use classifier ensembles (CE) as a method to enhance the robustness of machine learning (ML) kernels in the presence of hardware errors. Different ensemble methods (Bagging and Adaboost) are explored with the decision tree (C4.5) and the artificial neural network (ANN) as base classifiers. Simulation results show that the ANN is inherently tolerant to hardware errors up to a 10% hardware error rate. With a simple majority voting scheme, CE is able to effectively reduce the classification error rate for almost all tested data sets, with a maximum test error reduction of 48%. For tree ensembles, Adaboost with the decision stump as the weak learner gives the best results, while for the ANN, bagging and boosting outperform each other depending on the data set.

I. INTRODUCTION

This project is motivated by recent work on machine learning (ML) on silicon [1]. There is growing interest in the VLSI and circuits community in efficiently bringing ML kernels onto silicon, largely due to the performance and power limitations of software-only approaches. On the other hand, in the deep sub-micron era, devices exhibit statistical behaviors which degrade system reliability. To reduce the energy consumption of circuits, sub/near-threshold computing has been proposed to offer a 10X power reduction at the expense of large delay variation, as shown in Fig. 1. To enhance the reliability and robustness of ML kernels on silicon, Verma et al. [1] propose to utilize ML algorithms to efficiently learn the error behavior and make correct predictions/classifications in the presence of hardware errors. Kim et al. [2] show that some ML algorithms are inherently tolerant to hardware errors. These works motivate us to look into methods to effectively enhance the robustness of ML kernels in the presence of hardware errors.
Classifier ensembles (CE, also referred to as Multiple Classifier Systems) have been employed to enhance the performance of single-classifier systems [3], [4]. As summarized in [5]-[8], popular ensemble methods include Bagging, Adaboost, Bayesian voting, Random forests, Rotation forests, error-correcting output coding, etc. In terms of decision combining, [9], [10] summarize typical classifier fusion methods. Therefore, in this paper, we explore the idea of using CE to enhance the robustness of ML kernels. Different ensemble methods (Bagging and Adaboost) are explored with the decision tree (C4.5) and the artificial neural network (ANN) as base classifiers. Simulation results show that the ANN is inherently tolerant to hardware errors up to a 10% hardware error rate. With a simple majority voting scheme, CE is able to effectively reduce the classification error rate for almost all tested data sets, with a maximum test error reduction of 48%. For tree ensembles, Adaboost with the decision stump as the weak learner gives the best results, while for the ANN, bagging and boosting outperform each other depending on the data set.

The rest of the report is organized as follows: Section II gives background on the CE methods we use in the project; Section III presents the error model and simulation setup; Section IV presents simulation results, with conclusions provided in Section V.

II. BACKGROUND

A. Ensemble Methods

We give a brief overview of methods to construct classifier ensembles.

1) Average Bayes Classifier: In the average Bayes setting, every classifier output is interpreted as a posterior probability of class membership, conditioned on the input and the hypothesis:

h(x) = P(f(x) = y | x, h)   (1)

Here we assume the hypotheses follow a distribution over the hypothesis space H. If the hypothesis space is small, it is possible to enumerate all possible classifiers h(x). The final output is the average of these hypotheses, weighted by the posterior probability of each h, as shown in the following equation:

P(f(x) = y | S, x) = Σ_{h∈H} h(x) P(h | S)   (2)

Fig. 1. With reduced supply voltage, energy efficiency is improved but delay variation is larger.
Using Bayes' rule, we can write the posterior probability P(h | S) as:
P(h | S) ∝ P(S | h) P(h)   (3)

The practical difficulty in implementing this method lies in the fact that, when the hypothesis space is large, it is difficult to enumerate all h(x); moreover, it is not always possible to know the prior of each hypothesis. Fig. 2 provides an illustration of the algorithm. The true mapping which the algorithm tries to learn is denoted f(x). To construct the ensemble, all hypotheses h(x) in the space H are learned, and the final mapping is a weighted average based on (2).

Fig. 2. Average Bayes Classifier

2) Bagging: Bootstrap aggregating is a popular method to construct ensembles. The basic idea is to first generate many training sets from the original training samples by random sampling with replacement, then train multiple classifiers, each based on one of the training sets, as shown in Fig. 3. Through random sampling with replacement, each training set contains on average 63.2% of the original training set. The final decision is made by taking the majority vote of the individual weak classifiers. Bagging has been shown to improve the performance of unstable classifiers, such as neural networks and decision trees.

3) Adaptive Boosting: Adaboost is another popular method for ensemble generation; the basic idea is depicted in Fig. 4. The training has T iterations, and each iteration produces a new weak classifier. The key idea is that after each iteration, the examples are re-weighted so that misclassified examples get higher weight. In the subsequent iteration, the training process will therefore focus more on classifying the samples misclassified in the previous iteration. Fig. 5 gives the algorithm flow of Adaboost. Note that the basic idea of Adaboost is very similar to Bagging, with two notable differences: 1) in generating different training samples, Bagging uses random sampling with replacement while Adaboost uses the accuracy of previous classifiers to guide the selection of subsequent training samples; and 2) Bagging uses simple majority voting to generate the decision while Adaboost uses a weighted average.
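To make the bagging procedure concrete, the following is a minimal Python sketch of bagging with majority voting, using a depth-1 decision stump as the unstable base learner. The noisy 1-D data set, the stump learner, and the ensemble size of 25 are illustrative assumptions, not the experimental setup of this project:

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a depth-1 decision stump on 1-D points (x, y), y in {-1, +1}:
    exhaustively pick the threshold/polarity with the fewest training errors."""
    xs = sorted(set(x for x, _ in data))
    thresholds = [xs[0] - 1.0] + [0.5 * (a + b) for a, b in zip(xs, xs[1:])]
    best = (len(data) + 1, thresholds[0], 1)  # (error count, threshold, polarity)
    for t in thresholds:
        for sign in (1, -1):
            err = sum(1 for x, y in data if (sign if x > t else -sign) != y)
            if err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign if x > t else -sign

def bagging(data, n_classifiers=25, seed=0):
    """Train each stump on a bootstrap resample (random sampling with
    replacement) of the original training set."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])
            for _ in range(n_classifiers)]

def majority_vote(ensemble, x):
    """Combine the hard decisions of the weak classifiers by majority voting."""
    return Counter(h(x) for h in ensemble).most_common(1)[0][0]

# Toy data: class -1 below 0, class +1 above 0, with two noisy labels.
data = [(-2.0, -1), (-1.5, -1), (-1.0, -1), (-0.5, 1),
        (0.5, 1), (1.0, 1), (1.5, 1), (2.0, -1)]
ensemble = bagging(data)
```

Since each bootstrap sample leaves out roughly 1/e ≈ 36.8% of the points, the individual stumps disagree near the noisy labels, and the majority vote averages out those disagreements.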
It can be shown that the selection of the training sample distribution and the weights in Adaboost minimizes an exponential loss function defined as

L(f(x), y) = exp(−y f(x))   (4)

Fig. 3. Bagging

Fig. 4. Adaboost

Fig. 5. Adaboost algorithm

Proof: We are trying to find a classifier of the form

H(x) = Σ_{m=1}^{M} α_m f_m(x)

that minimizes the loss in (4). At step m, we have

H_m(x) = H_{m−1}(x) + α_m f_m(x)

The minimization problem can be formulated as

[α_m, f_m] = arg min_{α,f} Σ_{i=1}^{N} exp(−y_i (H_{m−1}(x_i) + α f(x_i)))
           = arg min_{α,f} { Σ_{i=1}^{N} exp(−y_i H_{m−1}(x_i)) exp(α) 1[y_i ≠ f(x_i)]
                            + Σ_{i=1}^{N} exp(−y_i H_{m−1}(x_i)) exp(−α) 1[y_i = f(x_i)] }

Defining the normalized weights

w_{m−1}(i) = exp(−y_i H_{m−1}(x_i)) / Σ_{j=1}^{N} exp(−y_j H_{m−1}(x_j)),

this is equivalent to

[α_m, f_m] = arg min_{α,f} { exp(α) Σ_{i=1}^{N} w_{m−1}(i) 1[y_i ≠ f(x_i)] + exp(−α) (1 − Σ_{i=1}^{N} w_{m−1}(i) 1[y_i ≠ f(x_i)]) },

which we can further simplify into

[α_m, f_m] = arg min_{α,f} { exp(−α) + [exp(α) − exp(−α)] Σ_{i=1}^{N} w_{m−1}(i) 1[y_i ≠ f(x_i)] }.

The optimal α and f can be found by setting the derivative with respect to each of them to 0; we thus obtain the optimal α_m and f_m as

f_m = arg min_f Σ_{i=1}^{N} w_{m−1}(i) 1[y_i ≠ f(x_i)]

α_m = (1/2) log((1 − ε_m)/ε_m),

where ε_m = Σ_{i=1}^{N} w_{m−1}(i) 1[y_i ≠ f_m(x_i)]. Therefore, we can see that in Adaboost, the sample distribution and the classifier weights are selected such that the loss function in (4) is minimized.

4) Injecting Randomness: The final method to construct an ensemble is to inject randomness into the classifier. For good ensemble performance, the weak classifiers used to construct the ensemble need to be unstable; good unstable classifiers are trees and neural networks. In the randomness-injection method, the ensemble is constructed simply by changing the initial weights of the NN, or by randomly choosing the features used to split the trees.

B. Decision Combination

The second important aspect of CE is deciding how to combine the outputs of the individual classifiers to obtain the final decision. Many decision combination methods have been proposed, as shown in Fig. 6. If the output of each classifier is not a hard decision but a continuous measure, we can use an averaging method to get the ensemble output; popular averaging methods include the mean, median, weighted mean, etc. If the output of each classifier is a discrete label, voting methods can be used. The simplest voting method is majority voting, which takes as the final output the class on which most of the classifiers agree.
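The Adaboost procedure above can be sketched directly in code. Below is a minimal Python implementation with decision stumps as weak learners and weighted voting; the toy 1-D data set (which no single stump can separate) and the choice of T = 20 rounds are illustrative assumptions, not this project's setup:

```python
import math

def train_weighted_stump(data, w):
    """Fit a decision stump minimizing the weighted error
    sum_i w[i] * 1[y_i != f(x_i)]  (the quantity f_m minimizes)."""
    xs = sorted(set(x for x, _ in data))
    thresholds = [xs[0] - 1.0] + [0.5 * (a + b) for a, b in zip(xs, xs[1:])]
    best = (sum(w) + 1.0, thresholds[0], 1)  # (weighted error, threshold, polarity)
    for t in thresholds:
        for sign in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if (sign if x > t else -sign) != y)
            if err < best[0]:
                best = (err, t, sign)
    eps, t, sign = best
    return eps, (lambda x: sign if x > t else -sign)

def adaboost(data, T=20):
    """Each round: fit a stump on the current weights, compute its weighted
    error eps_m, set alpha_m = 0.5*log((1-eps_m)/eps_m), then up-weight the
    misclassified samples via the exponential re-weighting rule."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        eps, f = train_weighted_stump(data, w)
        eps = max(eps, 1e-12)          # guard against a zero-error stump
        if eps >= 0.5:                 # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, f))
        w = [wi * math.exp(-alpha * y * f(x)) for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]       # renormalize to the distribution w_m(i)
    return ensemble

def predict(ensemble, x):
    """Weighted vote: sign of sum_m alpha_m * f_m(x)."""
    return 1 if sum(alpha * f(x) for alpha, f in ensemble) >= 0 else -1

# Toy 1-D set: +1 on the outer regions, -1 in the middle; no single stump
# classifies it perfectly, but the boosted combination can.
data = [(0.0, 1), (1.0, 1), (2.0, -1), (3.0, -1), (4.0, 1), (5.0, 1)]
ens = adaboost(data)
```

Each round picks the stump f_m minimizing the weighted error ε_m, so the sample weights concentrate on whichever region the current ensemble still misclassifies; after a few rounds the weighted vote Σ_m α_m f_m(x) separates the two outer +1 regions from the central −1 region, which no individual stump can do.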
A weighted average can also be used to weight the decision of each classifier based on its accuracy measure, as in the case of Adaboost.

III. METHODOLOGY

A. Hardware error simulation

Error injection is performed at the simulation level, since we do not have a realistic hardware realization of the classifiers. The error injection presents difficulty due to at least two
requirements: 1) the mapping from the input space to the output error space needs to have sufficient randomness, otherwise it would be trivial to remove the error by adding some bias to the network output; and 2) the same input pattern must always generate the same random error pattern. To satisfy both requirements, we use an input-pattern-dependent random number generator for error injection, as shown in Fig. 7. When presented with a feature vector, the random generator uses each dimension as a seed to generate a Gaussian random number; the random numbers from the different dimensions are then added and scaled to provide an output err ~ N(0, 1). This method ensures that different feature vectors generate different random errors, while the same feature vector always results in the same error value. The err is then used to inject hardware errors into the trees and ANNs. For decision trees, we generate two thresholds, Threshold1 and Threshold2, for each node; when err falls into the interval [Threshold1, Threshold2], the decision of the node is flipped. For ANNs, similar thresholds are generated for each weight, and if err falls into [Threshold1, Threshold2], the weight matrix is perturbed to simulate hardware defects. We can control the error rate through the thresholds, i.e., a higher error rate is achieved if Threshold2 − Threshold1 is increased.

Fig. 6. Decision combination methods

Fig. 7. Error injection methodology

B. Experiment Setup

Table I shows the experiment setup. In this project, we explore Bagging and Adaboost with the ANN, the C4.5 decision tree, and the decision stump (a decision tree of depth 1) as base classifiers. When training the ANNs, random initial weights are used to further introduce diversity into the ensemble. For Adaboost, we use weighted voting as shown in Fig. 5, and for Bagging, a simple majority voting scheme is employed. We use 4 data sets from the UCI machine learning repository. For each data set, 10-fold cross-validation is used to evaluate performance.

TABLE I
EXPERIMENTAL SETUP

Base Classifier | Ensemble Methods | Combination
ANN             | Bagging          | Majority voting
C4.5            | Boosting         | Weighted voting
Dec. stump      | Initial weights  |

IV. RESULTS

Fig. 8 shows the test error rate vs. ensemble size for all evaluated ensemble methods. For Bagging with decision trees, the test error rate decreases as the ensemble size increases, for both the hardware-error-free (red) and hardware-error-injected (blue) cases. However, hardware errors degrade the performance of the ensemble, and this degradation cannot be effectively compensated for by increasing the ensemble size. For Adaboost with decision trees, Fig. 8 indicates that the ensemble suffers from over-fitting: as the ensemble grows, the error rate first decreases, then starts to increase at ensemble sizes of 30 and 20 for the error-free and erroneous cases, respectively. Adaboost with the decision stump gives the best performance in the family of tree ensembles; the error-injected ensemble almost achieves the same error reduction as the error-free ensemble. For the ANN with Bagging, both the error-injected and error-free ensembles achieve an effective reduction of the test error rate with increasing ensemble size. However, for the ANN with Adaboost, over-fitting starts to occur at ensemble sizes of 40 and 30 for the error-free and erroneous ensembles, respectively.

Fig. 8. Results: test error rate vs. ensemble size

Fig. 9 shows the results for the 4 data sets from the UCI machine learning repository at different injected hardware error rates. Several observations can be made from the figure: 1) In terms of inherent error tolerance, the ANN is better than decision trees. This can be seen from Fig. 9 by noting that at a 3% error rate, the erroneous C4.5 already has severely degraded performance on data sets 1 and 2, but the ANN is able to maintain similar performance regardless of error injection
up to a 10% error rate. The difference can be partially explained by noting that a decision tree makes a hard decision at each node, so errors have a high chance of propagating to the output, while an ANN layer does not make a hard decision but uses a sigmoid function to suppress its output, which might help reduce the effect of node errors. 2) For tree ensembles, Adaboost with the decision stump always gives the best performance, consistent with Fig. 8. Boosted decision trees tend to suffer from over-fitting at low error rates; this problem is mitigated at high error rates, because there the ensemble size tends to be reduced. 3) For ANN ensembles, Adaboost and Bagging can outperform each other depending on the data set, and there is no consistent behavior across different error rates. The over-fitting problem of the ANN is less severe compared with trees, and similarly the over-fitting is mitigated at high error rates due to the reduction of the ensemble size.

Fig. 9. Results: test error rate for the 4 data sets at error rates of 3%, 10%, and 20% (left panels: tree ensembles — C4.5 error free, C4.5 error injected, C4.5 bagging, C4.5 boosting, decision stump boosting; right panels: NN ensembles — NN error free, NN error injected, NN bagging, NN boosting)

V. CONCLUSION AND FUTURE WORK

In this paper, we show that CE is an effective method to enhance the robustness of ML hardware. Bagging and Adaboost are explored with decision trees and ANNs as base classifiers. Simulation results show that the ANN is inherently tolerant to hardware errors up to a 10% hardware error rate. With a simple majority voting scheme, CE is able to effectively reduce the classification error rate for almost all tested data sets, with a maximum test error reduction of 48%. For tree ensembles, Adaboost with the decision stump as the weak learner gives the best results, while for the ANN, bagging and boosting outperform each other depending on the data set.
In the future, we can extend the framework to include more ensemble methods, to handle multi-class problems, and to mitigate the over-fitting problem by using regularization. It will also be helpful to implement part of the algorithm in hardware to evaluate the performance of CE on ML kernels.

ACKNOWLEDGMENT

The author would like to thank Prof. Hasegawa-Johnson and Sujeeth Subramanya Bharadwaj for their help and guidance.

REFERENCES

[1] N. Verma, K. H. Lee, K. J. Jang, and A. Shoeb, "Enabling system-level platform resilience through embedded data-driven inference capabilities in electronic devices," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, 2012, pp. 5285-5288.
[2] J. Choi, E. P. Kim, R. A. Rutenbar, and N. R. Shanbhag, "Error resilient MRF message passing architecture for stereo matching," in Signal Processing Systems (SiPS), 2013 IEEE Workshop on, 2013, pp. 348-353.
[3] L.-Y. Yang, Z. Qin, and R. Huang, "Design of a multiple classifier system," in Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on, vol. 5, 2004, pp. 3272-3276.
[4] G. Giacinto, F. Roli, and G. Fumera, "Design of effective multiple classifier systems by clustering of classifiers," in Pattern Recognition, 2000. Proceedings. 15th International Conference on, vol. 2, 2000, pp. 160-163.
[5] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems, LBCS-1857. Springer, 2000, pp. 1-15.
[6] J. Rodriguez, L. Kuncheva, and C. Alonso, "Rotation forest: A new classifier ensemble method," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, no. 10, pp. 1619-1630, 2006.
[7] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999.
[8] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[9] L. Xu, A. Krzyzak, and C. Suen, "Methods of combining multiple classifiers and their applications to handwriting recognition," Systems, Man and Cybernetics, IEEE Transactions on, vol. 22, no. 3, pp. 418-435, 1992.
[10] T. K. Ho, J. Hull, and S. Srihari, "Decision combination in multiple classifier systems," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 16, no. 1, pp. 66-75, 1994.