Large Scale Extreme Learning Machine using MapReduce

Large Scale Extreme Learnng Machne usng MapReduce Large Scale Extreme Learnng Machne usng MapReduce * L Dong, Pan Zhsong, 3 Deng Zhantao, 4 Zhang Yanyan Insttute of Command Automaton, PLA Unversty of Scence and echnology, anjng, 0007, Jangsu Provnce, PR Chna donggeat006@yahoo.com.cn, hotpzs@hotmal.com, dengzhantao@sna.cn zhyany@gmal.com Abstract Extreme learnng machne (ELM) s a new method n neural networks, contrasted wth conventonal gradent-based algorthms such as BP, It can remarkably shorten the tranng tme, all the learnng process s done by only once. he algorthm cannot deal wth large scale dataset due to the memory lmtaton. Here we mplement a large scale ELM based on MapReduce, the new parallel programmng model. he foundaton of the algorthm s parallel matrx multplcaton, whch has been dscussed somewhere else, but we gves out the whole computaton and I/O cost n detals. Experments on large scale dataset show scalablty of ths method.. Introducton Keywords: ELM, MapReduce, Large scale We are now undergong the ncreasng of massve nformaton and people have emergng demand to deal wth bg data. Recently, large scale machne learnng technology receves hghly attenton as ts performance of mnng not so bg data n past decades. Many tradtonal algorthms could fal dealng wth large scale data smply because they cannot load all the data nto the memory at once, whle these algorthms are desgned based on the hypothess that all data can be read. here are roughly two classes of approaches whch can work, streamng data to an onlne learnng algorthm and parallelzng a batch-learnng algorthm []. hs paper focuses on the second approach. MapReduce s a new parallel programmng model publshed by Google at 004 [], whch can make programs run at large cluster bult up by common computers as well as mult-cores system. hs technology smplfed processng bg data through supplyng hgh level nterfaces, hdng system related detals et al. Usng MapReduce to parallelze machne learnng algorthms begns at Chu et al [3]. hey concluded that algorthms that ft the Statstcal Query model can be wrtten n summaton form whch can be easly parallelzed. Qng He et al [4] gves out several popular parallel classfcaton algorthms based on MapReduce. Cu et al [5] used MapReduce to parallelze an algorthm for fndng communty n a moble socal network. here exsts a few mplementatons of MapReduce, but Apache Hadoop [6] becomes the de-facto standard verson [7] and get wdely used n ndustry [8]. he man experments n ths paper are taken on ths platform.. Extreme learnng machne Extreme learnng machne (ELM) s proposed by Guang-bn Huang at 004 [9], and later the whole detals and expermental results are publshed at 006 [0].It shows that the ELM has great advantages at tranng speed compared wth tradtonal backpropagaton (BP) algorthm. And ELM has better generalzaton performance n most cases. he new research [] shows that ELM and SVM have the same optmzaton objectve functon, whle the formal has mlder constrants. Here gves the bref revew of ELM. Based on rgorously proved n theory that the nput weghts and the hdden layer bas of Snglehdden Layer Forward etworks (SLF) can be chosen randomly, Huang proposed ELM, whch shows how output weghts (lnkng the hdden layer to the output layer) could be analytcally determned. Internatonal Journal of Dgtal Content echnology and ts Applcatons(JDCA) Volume6,umber0,ovember 0 do:0.456/jdcta.vol6.ssue0.7 6

Large Scale Extreme Learnng Machne usng MapReduce For samples ( x, t ),where x [,,..., ] n x x xn R, t [,,..., ] m t t tm R,a SLF has hdden nodes and actve functon g( x) can be expressed as: g ( x) g( wx b) o, j,... () j j where w [,,..., ] w x wm s the weghts connect the nput layer and the th hdden node, [,,..., ] m s the weghts connect the output layer and the th hdden node, b s the bas of th hdden node. gx can be Sgmod or RBF, or even many o s the output of sample j. ( ) j nondfferentable actvaton functons. he tranng error s formula () can be wrtten as Where, j o t, when totally fttng, j j H () H ( w,... w, b,... b, x,... x ) gw ( xb) gw ( x b ) g( wx b) g( w x b ) t, t m m H s called hdden layer output matrx, ts the output of th hdden node. ow the weght w and bas b have generated randomly, the task s to fnd proper. In most cases, lner Equaton () has no solutons, Accordng to the error mnmal prncple ^ mn H ( w,... w, b,... b, x,... x ), the smallest norm least squares soluton of the above lnear system s : ^ H (3) Where H s the Moore Penrose generalzed nverse of matrx H, whch can be acqured by sngular value decomposton (SVD), and also can be computed by the flowng formula, If matrx H H s nonsngular ( H H H) H (4) Or f matrx HH s nonsngular H H ( HH ) (5) 63

Large Scale Extreme Learnng Machne usng MapReduce But f the above two matrces are sngular, accordng to rdge regresson theory, a postve value added to dagonal of orgnal matrx could help to get the stable soluton. So, formula (4) (5) can be wrtten as I H ( H H) H (6) or I H H ( HH ) (7) Huang et al [9] ponts out that calculated by above optmzaton problem. H s consstent wth the followng 3. Parallel Extreme learnng machne 3.. MapReduce mn H (8) When dataset exceeds the memory lmtaton, the hdden layer output matrx H cannot be loaded once, Moore Penrose generalzed nverse of matrx H and object weghts are all can t be calculated by above formulas. Lang et al [] developed an onlne sequental learnng algorthm called OS-ELM, makng the tranng data arrve one by one or chunk by chunk. hs method partly loosens the memory bottle neck when all data can be stored n one computer. When the data scale up to dstrbuted storage, such method may have problem to access data over machnes, and usng only one processng unt may brng up neffcency n ths case. Due to the fact that all data s stored n multple machnes, dstrbuted computng s a good choce. And movng processng to data, namely localty s the hghlght of MapReduce [3]. User mplement MapReduce programs should specfy two operatons: Map and Reduce. Map takes key/value pars as nput and generates mmedate results n the same form; Reduce takes all values whch have the same key, and processes n further step. he data flows and types can be expressed as followng: Map (k,v) lst(k,v) Reduce (k,lst(v)) lst(v) It s to be notced that the data type of Map s output must be the same as Reduce s nput. All the detals of communcaton and synchronzaton are hdden by system as well as fal-torrent mechansm. Many Maps and Reduces can be run smultaneously over machnes. 3. Matrx Multplcaton When the dataset becomes large, for example we have 0 7 samples, t s dffcult to calculate Moore Penrose generalzed nverse of matrx H through (6) or (7).However, the foundaton of them s matrx multplcaton. Sun et al [4] summarzed three schemes of matrx multplcaton n the background to solve MF. Part of them wll be used here. mn nk For common matrx A, B, the basc operaton of multplcaton AB s to dvde A as rows and B as columns, and the element of result matrx s the nner producton of two vectors. 64

Large Scale Extreme Learnng Machne usng MapReduce a a AB b b bk am ab abk a b a b m k m MapReduce can calculate the rows of scheme algorthm-. (9 AB n parallel wthout Reduce operaton. We call ths Map key s the row d of A, value s the correspondng row of A newvalue=value*b wrte(key,newvalue)to HDFS Fgure. Algorthm- Map Algorthm- can work well f m n, and matrx B s shared across over machnes. Large matrx A and the result both stored n rows. Or reversely, calculate the columns of B n parallel, and share A across over machnes. Algorthm- could fal f the two matrces are both large, any one cannot be shared n memory; there s a dfferent dvson method. Matrx A s dvded as columns and B as rows; the result s sum of matrces, each one s the outer producton of two vectors. We call ths scheme algorthm-. AB a a b a b b an bn 0 Where denotes outer producton a b a bk a b a b a b m k m Algorthm- works well when n m, n k. Matrx A s stored n columns, and B n rows. he result can be stored n ether rows or columns. he outer producton and summarzaton both need a MapReduce job. he detals s gven as fgure 5. 65

Large Scale Extreme Learnng Machne usng MapReduce Map key s the row d of A or the column d of B, value s the correspondng row of A or column of B newvalue= row(a) or column(b) Pass(key,newvalue) to phase- Reduce Fgure. Algorthm- phase-i Map Reduce phase-i Map s output newvalue= a b Wrte(key,newvalue)to HDFS Fgure 3. Algorthm- phase-i Reduce Map phase-i Reduce s output Pass(key,value) to phase-ii Reduce Fgure 4. Algorthm- phase-ii Map Reduce phase-ii Map s output newvalue=sum(lst[values]) Wrte(key,newvalue) to HDFS Fgure 5. Algorthm- phase-ii Reduce 3.3 Parallel ELM It s crtcal to choose dfferent matrx multplcaton schemes accordng to partcular demand of algorthm. In ELM, the hdden layer output matrx H commonly has far more rows than columns,, the formal s the number of samples and the later s number of hdden nodes, whch s controllable. A MapReduce job read fles by lnes through Maps, and many Maps access fles H n rows. he multplcaton H H n (6) fts algorthm- perfectly, whle HH n (7) s hard to calculate. What s more, HH s szed by, whch s dffcult to get the nverse matrx. Storng matrx H n rows s equvalent to store H n columns. In ths case, algorthm- could be reduced to one job, the map calculates outer smultaneously. So t s reasonable to store large matrx producton of a row and ts self, and the reduce summarzes. Denote C ( I H H),C, as the number of hdden nodes s controllable,the nverse operaton s easy to mplement n memory through exstng tools such as LAPACK. H CH, H s stored n columns, C can be shared n memory, ths multplcaton can be well done usng algorthm-, one can smultaneously calculate the nner producton of each column of H and C ;the result H s stored n columns. 66

Large Scale Extreme Learnng Machne usng MapReduce H, c s the orgnal object values, c s the dmenson, for regresson equals and classfcaton equals the number of class labels. c, s stored n rows, the multplcaton fts algorthm- perfectly. In predcaton phase, we need to calculate the predcaton matrx Y., can be shared n memory, ths multplcaton can be done usng algorthm-. In testng phase, we get object matrx and predcaton Y are both stored n rows, here we need a MapReduce job to compare the predcted value and actual value of each sample, and then calculate the fnal RMSE for regresson or success rate for classfcaton. 3.4 Cost analyss Recent researches begn to consder dsk a cost and network cost when dealng wth large scale data usng MapReduce [5], nstead of only evaluatng the tradtonal computaton cost. Yu et al [6] ponts out the tranng tme actually contans two parts, tme to run data n memory and tme to access data from dsk. In ELM, the formal part s manly the tmes of multplcaton. Assumng we have large memory, and use formula (6) as the tranng rule, theoretcal cost of ELM s ( C ) tmes multplcaton and loadng samples. If there are k mapers and reducers n the MapReduce runnng system, computaton cost s ( C ) / k tmes multplcaton. However, evaluatng the network cost and I/O cost s a bt more complcated. ot all mapper s output s shuffled to reducers through network. Actually, MapReduce framework mnmzes the network cost by assgnng reduce tasks to the machnes whch have already stored the requred data [3].o smplfy the estmatng, we could assume the rato of actual shuffled data s a constant r. In computng H H, the shuffled data s r ; In computng H, there are no reduces; In computng, the shuffled data of two phases s r( C C ). he total network cost s H etwork[( C C ) r / k] () where etwork (.) means the network cost whch s manly affected by transfer speed. All the mmedate results are stored on dsks, the readng tasks are done by mapers and wrtng tasks by reducers n most cases. he I/O cost of MapReduce jobs n the tranng phase of parallel ELM s c Read(3 C ) k Wrte C C k ( ) (3) where Read (.) and Wrte (.) means the readng and wrtng tme respectvely. able shows above costs of jobs nvolved wth tranng. he major cost comes from computng, but network and dsk I/O s also tme-consumng. 67

Large Scale Extreme Learnng Machne usng MapReduce able. ranng Cost for Parallel ELM 7 0, 0, Data: MB C me: Sec Shuffle Read Wrte me H H 0.99 3940.8 69 CH 0 390 4374 47 H p- 976 6760 0774 6 H p- 0.86 673 0.85 90 Other costs such as solvng nverse matrx and loadng small varables are neglgble when the number of hdden layer nodes s small. 4 Experments 4. Experments setup All experments are conducted on a cluster of 8 common servers; each has a quad-core.8ghz CPU, 8GB memory and B dsk, wth one ggabt Ethernet connected. he operaton system s Lnux server, nstalled wth Hadoop 0.0. and JDK.6. Here shows the capablty of ths parallel algorthm to handle large scale data from two aspects, regresson and classfcaton. 4. Regresson Artfcal dataset snc has been used wdely n regresson problem. All data can be generated by followng sn( x) / x, x 0 yx ( ) (4), x 0 A tranng set and test set can be generated at any scale, where x are randomly dstrbuted n nterval (-0, 0), noses n nterval (-, ) can be added to tranng data whle testng data remans orgnal. he crtera of parallel algorthm nclude RMSE and tranng tme. Fgure6 shows the tranng tme ncludng four parts showed n table as data scale up from 0 7 to 0 8, and the dataset scale vares from 335MB to 3.3GB. he total tranng tme ncreases sub lnearly due to proper matrx multplcaton schemes chosen by the algorthms. 6000 5000 HtH CHt H total 4000 tme(seconds) 3000 000 000 0 3 4 5 6 7 8 9 0 number of samples x 0 7 Fgure 6. ranng tme for regresson 68

Large Scale Extreme Learnng Machne usng MapReduce Fgure7 shows the speedup acheved by addng computaton node. Each node has 5 reducers and 7 mappers n our system. he hghest speedup s 5.7 usng 8 nodes n the experment. 8 7 6 Speedup 5 4 3 speedup deal lnear 3 4 5 6 7 8 umber of node(each has 5 reducers) Fgure 7. Speedup for regresson 4. Classfcaton ELM s essentally a supervsed learnng technology, whch requres the labeled tranng data. But t s hard to fnd large scale dataset labeled wth every sample; In fact automatcally labelng for unlabeled samples s an mportant motvaton of unsupervsed learnng. Above parallel algorthm manly execute the matrx multplcaton n parallel, other to mprove the orgnal algorthm n theory. So here shows the capablty to handle large scale data, all tranng data can be acqured by smply duplcatng orgnal data to a specfy scale. Fgure8 shows the tranng tme ncreases as the dataset scales up. he orgnal dataset s Image segmentaton wth 9 features and 7classes, has been wdely used n classfcaton problem. Smlar to the regresson case, here exsts sub lnearty between data scale and tranng tme. 000 800 600 ranng tme(second) 400 00 000 800 parallel ELM deal lnear 600 400 3 4 5 6 7 8 9 0 Dataset(GB) Fgure 8. ranng tme for classfcaton 5. Concluson In ths paper, we manly mplement parallel extreme learnng machne based on MapReduce. he foundaton of ths mplement s proper formula to calculate the parameters n ELM, and the 69

Large Scale Extreme Learnng Machne usng MapReduce correspondng parallel matrx multplcaton schemes. Experment shows the lnearty between tranng tme and dataset scale. As two man methods to deal wth large scale problem, parallel algorthm and onlne learnng algorthm may have somethng n common. It s an nterestng topc to compare these two methods n the case of ELM, and to dscuss the best appled area of ether n the future. 6. References [] John Langford, Lhong L, ong Zhang, Sparse Onlne Learnng va runcated Gradent, Journal of Machne Learnng Research, vol.0,no.0, pp.777-80, 009. [] J. Dean, S. Ghemawat, Mapreduce: Smplfed data processng on large clusters, In Proceedngs of Operatng Systems Desgn and Implementaton, pp.37 49, 004. [3] Cheng-ao Chu, Sang Kyun Km, Y-An Ln et al, Map-Reduce for machne learnng on multcore, In Proceedngs of Advances n eural Informaton Processng Systems, pp.8 88, 006. [4] Q.He, F.Z.Zhuang, J.C.L et al, Parallel mplementaton of classfcaton algorthms based on mapreduce, In Proceedngs of Rough Sets and Knowledge echnology, pp.655-6, 00. [5] Wen Cu, Guoyong Wang, Ke Xu, "Parallel Communty Mnng n Socal etwork usng Mapreduce", IJAC: Internatonal Journal of Advancements n Computng echnology, vol. 4, no. 5, pp. 445 453, 0 [6] Apache Hadoop, http://hadoop.apache.org/ [7] A. Verma, X. Llora, D. E. Goldberg, et al, Scalng genetc algorthms usng mapreduce, In Proceedngs of Internatonal Conference on Intellgent Systems Desgn and Applcatons, pp.3-8, 009 [8] LEI Le, "owards a Hgh Performance Vrtual Hadoop Cluster", JCI: Journal of Convergence Informaton echnology, vol. 7, no. 6, pp. 9 303, 0 [9] Guang-Bn Huang, Qn-Yu Zhu, Chee-Kheong Sew, Extreme Learnng Machne: A ew Learnng Scheme of Feedforward eural etworks, In Proceedngs of Internatonal Jont Conference on eural etworks, pp.985-990, 004. [0] Guang-Bn Huang, Qn-Yu Zhu, Chee-Kheong Sew, Extreme learnng machne: heory and applcatons, eurocomputng, vol.70, no.-3, pp.489-50, 006. [] G.-B. Huang, H. Zhou, X. Dng et al, Extreme Learnng Machne for Regresson and Multclass Classfcaton, IEEE ransactons on Systems, Man, and Cybernetcs - Part B: Cybernetcs, vol.4, no., pp.53-59, 0. [] Lang -Y, Huang G-B, Saratchandran P et al, A fast and accurate on-lne sequental learnng algorthm for feedforward networks, IEEE ransactons on eural etwork, vol.7, no.6, pp.4 43, 006. [3] J. Ln, C. Dyer, Data-Intensve ext Processng wth MapReduce, Morgan & Claypool Publshers, USA, 00. [4] Zhengguo Sun, ao L, aphtal Rshe, Large-Scale Matrx Factorzaton usng MapReduce, In Proceedngs of IEEE Internatonal Conference on Data Mnng Workshops, pp.4-48, 00 [5] Robson Leonardo Ferrera Cordero, Caetano rana Jr, Agma Juc Machado rana et al, Clusterng Very Large Mult-dmensonal Datasets wth MapReduce, In Proceedngs of ACM SIGKDD Conference on Knowledge Dscovery and Data Mnng, pp.690-698, 0 [6] Hsang-Fu Yu, Cho-Ju Hseh, Ka-We Chang et al, Large Lnear Classfcaton When Data Cannot Ft In Memory, In Proceedngs of ACM SIGKDD Conference on Knowledge Dscovery and Data Mnng, pp.833-84, 00. 70