Big Data Deep Learning: Challenges and Perspectives


Received April 20, 2014, accepted May 13, 2014, date of publication May 16, 2014, date of current version May 28, 2014. Digital Object Identifier 10.1109/ACCESS.2014.2325029

Big Data Deep Learning: Challenges and Perspectives
XUE-WEN CHEN 1, (Senior Member, IEEE), AND XIAOTONG LIN 2
1 Department of Computer Science, Wayne State University, Detroit, MI 48404, USA
2 Department of Computer Science and Engineering, Oakland University, Rochester, MI 48309, USA
Corresponding author: X.-W. Chen (xwen.chen@gmail.com)

ABSTRACT Deep learning is currently an extremely active research area in the machine learning and pattern recognition community. It has achieved huge success in a broad range of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings big opportunities and transformative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning and highlight current research efforts, the challenges posed by big data, and future trends.

INDEX TERMS Classifier design and evaluation, feature representation, machine learning, neural net models, parallel processing.

I. INTRODUCTION
Deep learning and Big Data are two of the hottest trends in the rapidly growing digital world. While Big Data has been defined in different ways, herein it refers to the exponential growth and wide availability of digital data that are difficult or even impossible to manage and analyze using conventional software tools and technologies. Digital data, in all shapes and sizes, is growing at astonishing rates. For example, according to the National Security Agency, the Internet is processing 1,826 Petabytes of data per day [1]. In 2011, digital information had grown nine times in volume in just five years [2], and by 2020 its amount in the world will reach 35 trillion gigabytes [3]. This explosion of digital data brings big opportunities and transformative potential for various sectors such as enterprises, the healthcare industry, manufacturing, and educational services [4]. It also leads to a dramatic paradigm shift in scientific research towards data-driven discovery.

While Big Data offers great potential for revolutionizing all aspects of our society, harvesting valuable knowledge from Big Data is not an ordinary task. The large and rapidly growing body of information hidden in unprecedented volumes of non-traditional data requires both the development of advanced technologies and interdisciplinary teams working in close collaboration. Today, machine learning techniques, together with advances in available computational power, have come to play a vital role in Big Data analytics and knowledge discovery (see [5]-[8]). They are employed widely to leverage the predictive power of Big Data in fields like search engines, medicine, and astronomy. As an extremely active subfield of machine learning, deep learning is considered, together with Big Data, as one of the "big deals and the bases for an American innovation and economic revolution" [9]. In contrast to most conventional learning methods, which use shallow-structured learning architectures, deep learning refers to machine learning techniques that use supervised and/or unsupervised strategies to automatically learn hierarchical representations in deep architectures for classification [10], [11].
Inspired by biological observations on human brain mechanisms for processing natural signals, deep learning has attracted much attention from the academic community in recent years due to its state-of-the-art performance in many research domains such as speech recognition [12], [13], collaborative filtering [14], and computer vision [15], [16]. Deep learning has also been successfully applied in industry products that take advantage of the large volume of digital data. Companies like Google, Apple, and Facebook, which collect and analyze massive amounts of data on a daily basis, have been aggressively pushing forward deep-learning-related projects. For example, Apple's Siri, the virtual personal assistant in iPhones, offers a wide variety of services including weather reports, sports news, answers to users' questions, and reminders by utilizing deep learning and the growing amount of data collected by Apple services [17]. Google applies deep learning algorithms to massive chunks of messy data obtained from the Internet for Google's translator,

Android's voice recognition, Google's Street View, and image search [18]. Other industry giants are not far behind either. For example, Microsoft's real-time language translation in Bing voice search [19] and IBM's brain-like computer [18], [20] use techniques like deep learning to leverage Big Data for competitive advantage. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions, particularly with the increased processing power and the advances in graphics processors.

In this paper, our goal is not to present a comprehensive survey of all related work in deep learning, but mainly to discuss the most important issues related to learning from massive amounts of data, highlight current research efforts and the challenges posed by big data, as well as future trends. The rest of the paper is organized as follows. Section 2 presents a brief review of two commonly used deep learning architectures. Section 3 discusses strategies for deep learning from massive amounts of data. Finally, we discuss the challenges and perspectives of deep learning for Big Data in Section 4.

II. OVERVIEW OF DEEP LEARNING
Deep learning refers to a set of machine learning techniques that learn multiple levels of representations in deep architectures. In this section, we present a brief overview of two well-established deep architectures: deep belief networks (DBNs) [21]-[23] and convolutional neural networks (CNNs) [24]-[26].

A. DEEP BELIEF NETWORKS
Conventional neural networks are prone to getting trapped in local optima of a non-convex objective function, which often leads to poor performance [27]. Furthermore, they cannot take advantage of unlabeled data, which are often abundant and cheap to collect in Big Data. To alleviate these problems, a deep belief network (DBN) uses a deep architecture that is capable of learning feature representations from both the labeled and unlabeled data presented to it [21]. It incorporates both unsupervised pre-training and supervised fine-tuning strategies to construct the model: the unsupervised stages intend to learn the data distribution without using label information, while the supervised stages perform a local search for fine-tuning.

Fig. 1 shows a typical DBN architecture, which is composed of a stack of Restricted Boltzmann Machines (RBMs) and/or one or more additional layers for discrimination tasks. RBMs are probabilistic generative models that learn a joint probability distribution of observed (training) data without using data labels [28]. They can effectively utilize large amounts of unlabeled data for exploiting complex data structures. Once the structure of a DBN is determined, the goal of training is to learn the weights (and biases) between layers. This is conducted first by unsupervised learning of the RBMs. A typical RBM consists of two layers: nodes in one layer are fully connected to nodes in the other layer, and there is no connection between nodes within the same layer (in Fig. 1, for example, the input layer and the first hidden layer H1 form an RBM) [28]. Consequently, each node is independent of the other nodes in the same layer given all nodes in the other layer. This characteristic allows us to train the generative weights W of each RBM using Gibbs sampling [29], [30].

FIGURE 1. Illustration of a deep belief network architecture. This particular DBN consists of three hidden layers, each with three neurons, one input layer with five neurons, and one output layer also with five neurons. Any two adjacent layers can form an RBM trained with unlabeled data. The outputs of the current RBM (e.g., h(1) in the first RBM, marked in red) are the inputs of the next RBM (e.g., h(2) in the second RBM, marked in green).
The weights W can then be fine-tuned with labeled data after pre-training. Before fine-tuning, a layer-by-layer pre-training of the RBMs is performed: the outputs of one RBM are fed as inputs to the next RBM, and the process repeats until all the RBMs are pre-trained. This layer-by-layer unsupervised learning is critical in DBN training, as in practice it helps avoid local optima and alleviates the over-fitting problem observed when millions of parameters are used. Furthermore, the algorithm is very efficient in terms of time complexity, which is linear in the number and size of the RBMs [21]. Features at different layers contain different information about the data structure, with higher-level features constructed from lower-level features. Note that the number of stacked RBMs is a parameter predetermined by the user, and pre-training requires only unlabeled data (for good generalization).

For a simple RBM with a Bernoulli distribution for both the visible and hidden layers, the sampling probabilities are as follows [21]:

p(h_j = 1 \mid \mathbf{v}; W) = \sigma\Big( \sum_{i=1}^{I} w_{ij} v_i + a_j \Big)   (1)

and

p(v_i = 1 \mid \mathbf{h}; W) = \sigma\Big( \sum_{j=1}^{J} w_{ij} h_j + b_i \Big)   (2)

where v and h represent an I x 1 visible unit vector and a J x 1 hidden unit vector, respectively; W is the matrix of weights (w_ij) connecting the visible and hidden layers; a_j and b_i are bias terms; and \sigma(\cdot) is a sigmoid function. For the case of real-valued visible units, the conditional probability distributions are slightly different: typically, a Gaussian-Bernoulli distribution is assumed and p(v | h; W) is Gaussian [30].
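The conditional independence expressed in Eqs. (1)-(2) is what makes block Gibbs sampling convenient: all hidden units can be sampled at once given the visible units, and vice versa. The following is a minimal NumPy sketch of one such alternating sampling step for a Bernoulli-Bernoulli RBM; the array names and sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative RBM dimensions: I visible units, J hidden units.
I, J = 784, 500
W = 0.01 * rng.standard_normal((I, J))  # weights w_ij
a = np.zeros(J)                         # hidden biases a_j
b = np.zeros(I)                         # visible biases b_i

def sample_h_given_v(v):
    """Eq. (1): p(h_j = 1 | v) = sigmoid(sum_i w_ij v_i + a_j)."""
    p_h = sigmoid(v @ W + a)
    return p_h, (rng.random(J) < p_h).astype(float)

def sample_v_given_h(h):
    """Eq. (2): p(v_i = 1 | h) = sigmoid(sum_j w_ij h_j + b_i)."""
    p_v = sigmoid(W @ h + b)
    return p_v, (rng.random(I) < p_v).astype(float)

# One alternating Gibbs step starting from a (random) visible vector.
v0 = (rng.random(I) < 0.5).astype(float)
p_h0, h0 = sample_h_given_v(v0)   # sample all hidden units in parallel
p_v1, v1 = sample_v_given_h(h0)   # reconstruct all visible units in parallel
```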

The weights w_ij are updated based on an approximate method called contrastive divergence (CD) [31]. For example, the (t + 1)-th update for w_ij can be written as:

w_{ij}(t+1) = c\, w_{ij}(t) + \alpha \big( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \big)   (3)

where \alpha is the learning rate and c is the momentum factor; \langle \cdot \rangle_{\text{data}} and \langle \cdot \rangle_{\text{model}} are the expectations under the distributions defined by the data and the model, respectively. While the expectations could be computed by running Gibbs sampling infinitely many times, in practice one-step CD is often used because it performs well [31]. Other model parameters (e.g., the biases) can be updated similarly.

As a generative model, RBM training uses a Gibbs sampler to sample hidden units based on the visible units and vice versa (Eqs. (1) and (2)). The weights between these two layers are then updated using the CD rule (Eq. (3)). This process repeats until convergence. An RBM models the data distribution using hidden units without employing label information. This is a very useful feature in Big Data analysis, as a DBN can potentially leverage much more data (without knowing its labels) for improved performance.

After pre-training, information about the input data is stored in the weights between adjacent layers. The DBN then adds a final layer representing the desired outputs, and the overall network is fine-tuned using labeled data and back-propagation strategies for better discrimination (in some implementations, on top of the stacked RBMs there is another layer, called associative memory, determined by supervised learning methods). There are other variations for pre-training: instead of using RBMs, for example, stacked denoising auto-encoders [32], [33] and stacked predictive sparse coding [34] have also been proposed for unsupervised feature learning. Furthermore, recent results show that, when a large amount of training data is available, fully supervised training using random initial weights instead of pre-trained weights (i.e., without using RBMs or auto-encoders) works well in practice [13], [35]. For example, a discriminative model starts with a network with one single hidden layer (i.e., a shallow neural network), which is trained by the back-propagation method. Upon convergence, a new hidden layer is inserted into this shallow NN (between the first hidden layer and the desired output layer) and the full network is discriminatively trained again. This process is continued until a predetermined criterion is met (e.g., the number of hidden neurons).

In summary, DBNs use a greedy and efficient layer-by-layer approach to learn the latent variables (weights) in each hidden layer and a back-propagation method for fine-tuning. This hybrid training strategy thus improves both the generative performance and the discriminative power of the network.
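Continuing the RBM sketch above, a one-step contrastive divergence (CD-1) update of W follows Eq. (3) directly: the positive statistics come from the data, the negative statistics from the one-step reconstruction. This is again a minimal illustration under the same assumed array shapes, not the authors' implementation; the momentum term of Eq. (3) is omitted for brevity.

```python
import numpy as np

def cd1_update(W, a, b, v0, sample_h_given_v, sample_v_given_h, alpha=0.01):
    """One CD-1 update for a Bernoulli-Bernoulli RBM, following Eq. (3).

    Adding the momentum of Eq. (3) amounts to blending this gradient with
    the previous update; it is left out here to keep the sketch short.
    """
    p_h0, h0 = sample_h_given_v(v0)      # positive phase: hidden activations from data
    p_v1, v1 = sample_v_given_h(h0)      # negative phase: one-step reconstruction
    p_h1, _  = sample_h_given_v(v1)      # hidden probabilities for the reconstruction

    # Single-sample approximation of <v_i h_j>_data - <v_i h_j>_model.
    W += alpha * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += alpha * (p_h0 - p_h1)           # hidden biases, updated analogously
    b += alpha * (v0 - v1)               # visible biases, updated analogously
    return W, a, b
```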
B. CONVOLUTIONAL NEURAL NETWORKS
A typical CNN is composed of many layers of hierarchy, with some layers for feature representations (or feature maps) and others acting as a conventional neural network for classification [24]. It often starts with two alternating types of layers, called convolutional and subsampling layers: convolutional layers perform convolution operations with several filter maps of equal size, while subsampling layers reduce the sizes of the preceding layers by averaging pixels within a small neighborhood (or by max-pooling [36], [37]). Fig. 2 shows a typical CNN architecture. The input is first convolved with a set of filters (the C layers in Fig. 2). These 2D filtered data are called feature maps. After a nonlinear transformation, subsampling is performed to reduce the dimensionality (the S layers in Fig. 2). The sequence of convolution/subsampling can be repeated many times (predetermined by the user).

FIGURE 2. Illustration of a typical convolutional neural network architecture. The input is a 2D image, which is convolved with four different filters (i.e., h_i(1), i = 1 to 4), followed by a nonlinear activation, to form the four feature maps in the second layer (C1). These feature maps are down-sampled by a factor of 2 to create the feature maps in layer S1. The sequence of convolution/nonlinear activation/subsampling can be repeated many times. In this example, to form the feature maps in layer C2, we use eight different filters (i.e., h_i(2), i = 1 to 8): the first, third, fourth, and sixth feature maps in layer C2 are defined by one corresponding feature map in layer S1, each convolved with a different filter, while the second and fifth maps in layer C2 are formed by two maps in S1 convolved with two different filters. The last layer is an output layer forming a fully connected 1D neural network, i.e., the 2D outputs from the last subsampling layer (S2) are concatenated into one long input vector, with each neuron fully connected to all the neurons in the next layer (a hidden layer in this figure).

As illustrated in Fig. 2, the lowest level of this architecture is the input layer with 2D N x N images as inputs. Through local receptive fields, upper-layer neurons extract elementary and complex visual features. Each convolutional layer (labeled Cx in Fig. 2) is composed of multiple feature maps, which are constructed by convolving inputs with different filters (weight vectors).

In other words, the value of each unit in a feature map depends on a local receptive field in the previous layer and the corresponding filter, followed by a nonlinear activation:

y_j^{(l)} = f\Big( \sum_{i} K_{ij} \ast x_i^{(l-1)} + b_j \Big)   (4)

where y_j^{(l)} is the j-th output of the l-th convolutional layer C_l; f(\cdot) is a nonlinear function (most recent implementations use a scaled hyperbolic tangent as the nonlinear activation function [38]: f(x) = 1.7159 tanh(2x/3)); K_{ij} is a trainable filter (or kernel) in the filter bank that convolves with the feature map x_i^{(l-1)} from the previous layer to produce a new feature map in the current layer; the symbol \ast represents the discrete convolution operator; and b_j is a bias. Note that each filter K_{ij} can connect to all or a portion of the feature maps in the previous layer (in Fig. 2, we show partially connected feature maps between S1 and C2). The sub-sampling layer (labeled Sx in Fig. 2) reduces the spatial resolution of the feature maps (thus providing some level of distortion invariance). In general, each unit in the sub-sampling layer is constructed by averaging a 2 x 2 area in the feature map or by max-pooling over a small region. The key parameters to be decided are the weights between layers, which are normally trained by standard back-propagation and a gradient descent algorithm with mean squared error as the loss function.

Alternatively, training deep CNN architectures can be unsupervised. Herein we review one particular method for unsupervised training of CNNs: predictive sparse decomposition (PSD) [39]. The idea is to approximate the inputs X with a linear combination of some basic and sparse functions:

Z^{*} = \arg\min_{Z} \; \|X - WZ\|_2^2 + \lambda \|Z\|_1 + \alpha \|Z - D \tanh(KX)\|_2^2   (5)

where W is a matrix with a linear basis set, Z is a sparse coefficient matrix, D is a diagonal gain matrix, and K is the filter bank with predictor parameters. The goal is to find the optimal basis function set W and the filter bank K that minimize the reconstruction error (the first term in Eq. (5)) with a sparse representation (the second term) and the code prediction error simultaneously (the third term in Eq. (5), which measures the difference between the predicted code and the actual code and preserves invariance to certain distortions). PSD can be trained with a feed-forward encoder to learn the filter bank and the pooling together [39].

In summary, inspired by biological processes [40], CNN algorithms learn a hierarchical feature representation by utilizing strategies like local receptive fields (the size of each filter is normally small), shared weights (using the same weights to construct all the feature maps at the same level significantly reduces the number of parameters), and subsampling (to further reduce the dimensionality). Each filter bank can be trained with either supervised or unsupervised methods. A CNN is capable of learning good feature hierarchies automatically and provides some degree of translational and distortional invariance.
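To make Eq. (4) and the subsampling step concrete, the following NumPy sketch computes one feature map of a convolutional layer followed by 2 x 2 average pooling. The scaled tanh activation follows the description above; everything else (array names, shapes, the use of SciPy's convolution routine) is an illustrative assumption.

```python
import numpy as np
from scipy.signal import convolve2d  # assumed available; performs the 2D convolution in Eq. (4)

def scaled_tanh(x):
    # f(x) = 1.7159 * tanh(2x/3), the activation cited in the text [38].
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def conv_feature_map(prev_maps, kernels, bias):
    """Eq. (4): y_j = f( sum_i K_ij * x_i + b_j ) for one output map j."""
    acc = np.zeros_like(convolve2d(prev_maps[0], kernels[0], mode="valid"))
    for x_i, K_ij in zip(prev_maps, kernels):
        acc += convolve2d(x_i, K_ij, mode="valid")
    return scaled_tanh(acc + bias)

def avg_pool_2x2(fmap):
    """Subsampling layer: average each non-overlapping 2 x 2 block."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    f = fmap[:h, :w]
    return 0.25 * (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2])

# Toy usage: one 28x28 input map, one 5x5 filter.
rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))
K = rng.standard_normal((5, 5))
y = conv_feature_map([x], [K], bias=0.0)   # 24 x 24 feature map
s = avg_pool_2x2(y)                        # 12 x 12 subsampled map
```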
III. DEEP LEARNING FOR MASSIVE AMOUNTS OF DATA
While deep learning has shown impressive results in many applications, its training is not a trivial task for Big Data learning, because the iterative computations inherent in most deep learning algorithms are often extremely difficult to parallelize. Thus, with the unprecedented growth of commercial and academic data sets in recent years, there is a surge of interest in effective and scalable parallel algorithms for training deep models [12], [13], [15], [41]-[44]. In contrast to shallow architectures, where few parameters are preferable to avoid overfitting, deep learning algorithms enjoy their success with a large number of hidden neurons, often resulting in millions of free parameters. Thus, large-scale deep learning often involves both large volumes of data and large models.

Some algorithmic approaches have been explored for large-scale learning: for example, locally connected networks [24], [39], improved optimizers [42], and new structures that can be implemented in parallel [44]. Recently, Deng et al. [44] proposed a modified deep architecture called the Deep Stacking Network (DSN), which can be effectively parallelized. A DSN consists of several specialized neural networks (called modules) with a single hidden layer. Stacked modules, whose inputs are composed of the raw data vector and the outputs from the previous module, form a DSN. Most recently, a new deep architecture called the Tensor Deep Stacking Network (T-DSN), which is based on the DSN, was implemented using CPU clusters for scalable parallel computing [45].

The use of great computing power to speed up the training process has shown significant potential in Big Data deep learning. For example, one way to scale up DBNs is to use multiple CPU cores, with each core dealing with a subset of the training data (data-parallel schemes). Vanhoucke et al. [46] discussed several technical details, including carefully designing the data layout, batching the computation, using SSE2 instructions, and leveraging SSE3 and SSE4 instructions for fixed-point implementation; these techniques allow modern CPUs to perform markedly better for deep learning. Another recent work aims to parallelize Gibbs sampling of hidden and visible units by splitting the hidden and visible units across n machines, each responsible for 1/n of the units [47]. To make this work, data transfer between machines is required (i.e., when sampling the hidden units, each machine must have the data for all the visible units, and vice versa). This method is efficient if both the hidden and visible units are binary and the sample size is modest. The communication cost, however, can rise quickly if large-scale data sets are used. Other methods for large-scale deep learning explore FPGA-based implementations [48] with a custom architecture: a control unit implemented in a CPU, a grid of multiple full-custom processing tiles, and a fast memory.

In this survey, we focus on some recently developed deep learning frameworks that take advantage of the great computing power available today, such as Graphics Processing Units (GPUs).

As of August 2013, NVIDIA single-precision GPUs exceeded 4.5 TFLOP/s with a memory bandwidth of nearly 300 GB/s [49]. They are particularly suited for massively parallel computing, with more transistors devoted to data processing needs, and the deep learning frameworks recently built on them have shown significant advances in making large-scale deep learning practical.

Fig. 3 shows a schematic of a typical CUDA-capable GPU with four multiprocessors. Each multiprocessor (MP) consists of several streaming multiprocessors (SMs) that form a building block (Fig. 3 shows two SMs per block). Each SM has multiple stream processors (SPs) that share control logic and low-latency memory. Furthermore, each GPU has a global memory with very high bandwidth and high latency when accessed by the CPU (host). This architecture allows for two levels of parallelism: instruction (memory) level (i.e., MPs) and thread level (SPs). This SIMT (Single Instruction, Multiple Threads) architecture allows thousands or tens of thousands of threads to run concurrently, which is best suited for operations with a large number of arithmetic operations and small memory access times.

FIGURE 3. An illustrative architecture of a CUDA-capable GPU with highly threaded streaming processors (SPs). In this example, the GPU has 64 stream processors (SPs) organized into four multiprocessors (MPs), each with two streaming multiprocessors (SMs). Each SM has eight SPs that share a control unit and instruction cache. The four MPs (building blocks) also share a global memory (e.g., graphics double data rate DRAM) that often functions as very-high-bandwidth, off-chip memory (memory bandwidth is the data exchange rate). Global memory typically has high latency and is accessible to the CPU (host). A typical processing flow is as follows: input data are first copied from host memory to GPU memory, followed by loading and executing the GPU program; results are then sent back from GPU memory to host memory. In practice, one needs to pay careful attention to data transfer between host and GPU memory, which may take a considerable amount of time.

Such levels of parallelism can be effectively utilized with special attention to the data flow when developing GPU parallel computing applications. One consideration, for example, is to reduce the data transfer between RAM and the GPU's global memory [50] by transferring data in large chunks. This is achieved by uploading as large a set of unlabeled data as possible and by storing free parameters as well as intermediate computations, all in global memory. In addition, data parallelism and learning updates can be implemented by leveraging the two levels of parallelism: input examples can be assigned across MPs, while individual nodes can be treated in each thread (i.e., SPs).

A. LARGE-SCALE DEEP BELIEF NETWORKS
Raina et al. [41] proposed a GPU-based framework for massively parallelizing unsupervised learning models, including DBNs (in that paper, the algorithms are referred to as stacked RBMs) and sparse coding [21]. While previous models tended to use one to four million free parameters (e.g., Hinton & Salakhutdinov [21] used 3.8 million parameters for face images and Ranzato and Szummer used three million parameters for text processing [51]), the proposed approach can train more than 100 million free parameters with millions of unlabeled training examples [41]. Because transferring data between host and GPU global memory is time consuming, one needs to minimize host-device transfers and take advantage of shared memory. To achieve this, one strategy is to store all parameters and a large chunk of training examples in global memory during training [41]. This reduces the number of data transfers between host and global memory and also allows parameter updates to be carried out entirely inside the GPU. In addition, to utilize the MP/SP levels of parallelism, a few of the unlabeled training examples in global memory are selected each time to compute the updates concurrently across blocks (data parallelism) (Fig. 3). Meanwhile, each component of an input example is handled by an SP.
When implementing DBN learning, Gibbs sampling [52], [53] is repeated using Eqs. (1)-(2). This can be implemented by first generating two sampling matrices, P(h|x) and P(x|h), whose (i, j)-th elements are P(h_j | x^(i)) (i.e., the probability of the j-th hidden node given the i-th input example) and P(x_j | h^(i)), respectively [41]. The sampling matrices can then be computed in parallel on the GPU, where each block takes an example and each thread works on one element of that example. Similarly, after new examples are generated, the weight update operations (Eq. (3)) can be performed in parallel using linear algebra packages for the GPU. Experimental results show that, with 45 million parameters in an RBM and one million examples, the GPU-based implementation increases the speed of DBN learning by a factor of up to 70 compared to a dual-core CPU implementation (around 29 minutes for the GPU-based implementation versus more than one day for the CPU-based implementation) [41].
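The two ideas above, keeping parameters and a large chunk of training data resident in GPU memory and expressing the per-example, per-unit sampling as dense matrix operations, can be sketched with a modern array library. The snippet below uses PyTorch purely as an illustration (the original work used hand-written GPU kernels and GPU linear algebra libraries); tensor names and sizes are assumptions.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

I, J, chunk = 784, 500, 4096                  # visible units, hidden units, examples kept on-device
W = 0.01 * torch.randn(I, J, device=device)   # parameters stay in device (global) memory
a = torch.zeros(J, device=device)
b = torch.zeros(I, device=device)
X = torch.rand(chunk, I, device=device).bernoulli()  # a large chunk of training data, transferred once

alpha = 0.01
for step in range(100):
    v0 = X[torch.randint(0, chunk, (256,), device=device)]  # mini-batch drawn on-device
    # "Sampling matrices": one row per example, one column per unit (Eqs. (1)-(2)).
    p_h0 = torch.sigmoid(v0 @ W + a)
    h0 = torch.bernoulli(p_h0)
    p_v1 = torch.sigmoid(h0 @ W.t() + b)
    v1 = torch.bernoulli(p_v1)
    p_h1 = torch.sigmoid(v1 @ W + a)
    # CD-1 update (Eq. (3)), computed entirely on the device; no host transfer needed.
    W += alpha * (v0.t() @ p_h0 - v1.t() @ p_h1) / v0.shape[0]
    a += alpha * (p_h0 - p_h1).mean(dim=0)
    b += alpha * (v0 - v1).mean(dim=0)
```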

B. LARGE-SCALE CONVOLUTIONAL NEURAL NETWORKS
A CNN is a type of locally connected deep learning method. Large-scale CNN learning is often implemented on GPUs with several hundred parallel processing cores. CNN training involves both forward and backward propagation. For parallelizing forward propagation, one or more blocks are assigned to each feature map, depending on the size of the maps [36]. Each thread in a block is devoted to a single neuron in a map. Consequently, the computation of each neuron, which includes convolution of shared weights (kernels) with neurons from the previous layer, activation, and summation, is performed in an SP. The outputs are then stored in global memory.

Weights are updated by back-propagation of errors \delta_k. The error signal \delta_k^{(l-1)} of a neuron k in the previous layer (l-1) depends on the error signals \delta_j^{(l)} of some neurons in a local field of the current layer l. Parallelizing backward propagation can be implemented either by pulling or by pushing [36]. Pulling error signals refers to the process of computing the delta signals for each neuron in the previous layer by pulling the error signals from the current layer. This is not straightforward because of the subsampling and convolution operations: for example, neurons in the previous layer may connect to different numbers of neurons in the current layer due to border effects [54]. For illustration, we plot a one-dimensional convolution and subsampling in Fig. 4. As can be seen, the first six units have different numbers of connections, so we first need to identify the list of neurons in the current layer that contribute to the error signals of each neuron in the previous layer. On the contrary, all the units in the current layer have exactly the same number of incoming connections. Consequently, pushing the error signals from the current layer to the previous layer is more efficient, i.e., for each unit in the current layer, we update the related units in the previous layer.

FIGURE 4. An illustration of the operations involved in 1D convolution and subsampling. The convolution filter size is six; consequently, each unit in the convolution layer is defined by six input units. Subsampling involves averaging two adjacent units in the convolution layer.

For implementing data parallelism, one needs to consider the size of global memory and the feature map size. Typically, at any given stage, only a limited number of training examples can be processed in parallel. Furthermore, within each block where the convolution operation is performed, only a portion of a feature map can be maintained at any given time due to the extremely limited amount of shared memory. For convolution operations, Scherer et al. suggested using the limited shared memory as a circular buffer [37], which holds only a small portion of each feature map loaded from global memory at a time. Convolution is performed by threads in parallel, and the results are written back to global memory. To further overcome the GPU memory limitation, the authors implemented a modified architecture in which the convolution and subsampling operations are combined into one step [37]. This modification allows both the activities and error values to be stored with reduced memory usage while running back-propagation.

For further speedup, Krizhevsky et al. proposed the use of two GPUs for training CNNs with five convolutional layers and three fully connected classification layers. The CNN uses Rectified Linear Units (ReLUs) as the nonlinear function (f(x) = max(0, x)), which has been shown to run several times faster than other commonly used functions [55]. For some layers, about half of the network is computed on one GPU and the other portion is calculated on the other GPU; the two GPUs communicate at certain other layers. This architecture takes full advantage of cross-GPU parallelization, which allows the two GPUs to communicate and transfer data without using host memory.
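The push strategy for back-propagating through a 1D convolution, as illustrated in Fig. 4, can be written in a few lines: each unit of the current layer simply adds its weighted error signal to the input units that produced it, so border effects need no special handling. The following NumPy sketch is illustrative only; names and the filter size are assumptions matching the figure.

```python
import numpy as np

def push_conv1d_deltas(delta_curr, kernel, n_prev):
    """Push error signals from a 'valid' 1D convolution layer back to its input layer.

    delta_curr : error signals of the current (convolution) layer, length n_prev - len(kernel) + 1
    kernel     : the shared 1D filter (length 6 in Fig. 4)
    n_prev     : number of units in the previous layer
    """
    delta_prev = np.zeros(n_prev)
    k = len(kernel)
    for j, d in enumerate(delta_curr):
        # Unit j of the current layer was computed from inputs j .. j+k-1,
        # so it pushes its error back to exactly those units.
        delta_prev[j:j + k] += d * kernel
    return delta_prev

# Toy usage: 12 input units, a filter of size 6, hence 7 convolution units.
rng = np.random.default_rng(0)
kernel = rng.standard_normal(6)
delta_curr = rng.standard_normal(7)
delta_prev = push_conv1d_deltas(delta_curr, kernel, n_prev=12)
```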
C. COMBINATION OF DATA- AND MODEL-PARALLEL SCHEMES
DistBelief is a software framework recently designed for distributed training and learning of deep networks with very large models (e.g., a few billion parameters) and large-scale data sets. It leverages large clusters of machines to manage both data and model parallelism via multithreading, message passing, synchronization, and communication between machines [56].

For large-scale data with high dimensionality, deep learning often involves many densely connected layers with a large number of free parameters (i.e., large models). To deal with large-model learning, DistBelief first implements model parallelism by allowing users to partition large network architectures into several smaller structures (called blocks), whose nodes are assigned to and computed on several machines (collectively called a partitioned model). Each block is assigned to one machine (see Fig. 5). Boundary nodes (nodes whose edges belong to more than one partition) require data transfer between machines. Naturally, fully connected networks have more boundary nodes and often demand higher communication costs than locally connected structures, and thus gain less performance benefit. Nevertheless, as many as 144 partitions have been reported for large models in DistBelief [56], which leads to a significant improvement in training speed.

DistBelief also implements data parallelism and employs two separate distributed optimization procedures: Downpour stochastic gradient descent (SGD) and Sandblaster [56], which perform online and batch optimization, respectively. Herein we discuss Downpour in detail; more information about Sandblaster can be found in [56]. First, multiple replicas of the partitioned model are created for training and inference. As with the models, the large data set is partitioned into many subsets. DistBelief then runs multiple replicas of the partitioned model to compute gradients via Downpour SGD on different subsets of the training data. Specifically, DistBelief employs a centralized parameter server, which stores and applies updates to all parameters of the model.
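A toy, single-process sketch of the Downpour-style pattern, in which model replicas asynchronously fetch parameters from a central server, compute gradients on their own data shard, and push updates back, is given below. It is a schematic illustration of the idea only (the real system shards parameters across many server machines and runs replicas on separate workers); all names and the quadratic toy objective are assumptions.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the shared parameters; applies updates asynchronously."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad       # apply the (possibly stale) gradient

def replica(server, data_shard, steps=200):
    """One model replica: works only on its own shard of the data."""
    X, y = data_shard
    for _ in range(steps):
        w = server.fetch()                          # asynchronous parameter fetch
        i = np.random.randint(len(y))
        grad = (X[i] @ w - y[i]) * X[i]             # SGD gradient for a toy least-squares model
        server.push(grad)                           # asynchronous update

# Toy usage: two replicas, each with its own data shard, one shared server.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
X = rng.standard_normal((1000, 5)); y = X @ w_true
server = ParameterServer(dim=5)
shards = [(X[:500], y[:500]), (X[500:], y[500:])]
threads = [threading.Thread(target=replica, args=(server, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
```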

FIGURE 5. DistBelief: models are partitioned into four blocks and consequently assigned to four machines [56]. Information for nodes that belong to two or more partitions is transferred between machines (e.g., the lines marked in yellow). This model is more effective for less densely connected networks.

Parameters are grouped into server shards. At any given time, each machine in a partitioned model needs to communicate only with the parameter server shards that hold the relevant parameters. This communication is asynchronous: each machine in a partitioned model runs independently, and each parameter server shard acts independently as well. One advantage of asynchronous communication over standard synchronous SGD is its fault tolerance: in the event of the failure of one machine in a model copy, the other model replicas continue communicating with the central parameter server to process the data and update the shared weights. In practice, the Adagrad adaptive learning rate procedure [57] is integrated into Downpour SGD for better performance.

DistBelief was applied to two deep learning models: a fully connected network with 42 million model parameters and 1.1 billion examples, and a locally connected convolutional neural network with 16 million images of 100 by 100 pixels and 21,000 categories (as many as 1.7 billion parameters). The experimental results show that locally connected learning models benefit more from DistBelief: indeed, with 81 machines and 1.7 billion parameters, the method is 12x faster than using a single machine. As demonstrated in [56], a significant advantage of DistBelief is its ability to scale from a single machine to thousands of machines, which is key to Big Data analysis.

Most recently, the DistBelief framework was used to train a deep architecture with a sparse deep autoencoder, local receptive fields, pooling, and local contrast normalization [50]. The deep learning architecture consists of three stacked layers, each with sublayers of local filtering, local pooling, and local contrast normalization. The filtering sublayers are not convolutional; each filter has its own weights. The optimization of this architecture involves an overall objective function that is the sum of the objective functions of the three layers, each aiming to minimize a reconstruction error while maintaining sparsity of the connections between sublayers. The DistBelief framework is able to scale up the data set, the model, and the resources all together. The model is partitioned across 169 machines, each with 16 CPU cores. Multiple cores allow for another level of parallelism, where each subset of cores can perform different tasks. Asynchronous SGD is implemented with several replicas of the core model and mini-batches of training examples. The framework was able to train on as many as 14 million images with a size of 200 by 200 pixels and more than 20 thousand categories over three days on a cluster of 1,000 machines with 16,000 cores. The model is capable of learning high-level features to detect objects without using labeled data.
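Adagrad, mentioned above as the adaptive learning rate used with Downpour SGD, scales each parameter's step by the inverse square root of its accumulated squared gradients, which makes asynchronous updates far less sensitive to a single global learning rate. A minimal sketch follows; the class and variable names are assumptions.

```python
import numpy as np

class Adagrad:
    """Per-parameter adaptive step: w_i -= lr * g_i / sqrt(sum of past g_i^2)."""
    def __init__(self, dim, lr=0.01, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.g2_sum = np.zeros(dim)     # running sum of squared gradients, per parameter

    def update(self, w, grad):
        self.g2_sum += grad ** 2
        w -= self.lr * grad / (np.sqrt(self.g2_sum) + self.eps)
        return w
```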
D. THE COTS HPC SYSTEMS
While DistBelief can learn very large models (more than one billion parameters), its training requires 16,000 CPU cores, which are not commonly available to most researchers. Most recently, Coates et al. presented an alternative approach that trains comparable deep network models with more than 11 billion free parameters using just three machines [58]. The Commodity Off-The-Shelf High Performance Computing (COTS HPC) system is comprised of a cluster of 16 GPU servers with InfiniBand adapters for interconnects and MPI for data exchange within the cluster. Each server is equipped with four NVIDIA GTX680 GPUs, each having 4 GB of memory. With a well-balanced number of GPUs and CPUs, COTS HPC is capable of running very large-scale deep learning.

The implementation includes carefully designed CUDA kernels for effective usage of memory and efficient computation. For example, to efficiently compute a matrix multiplication Y = WX (e.g., where W is the filter matrix and X is the input matrix), Coates et al. [58] take full advantage of matrix sparseness and local receptive fields by extracting the nonzero columns of W for neurons that share identical receptive fields, which are then multiplied by the corresponding rows of X. This strategy successfully avoids the situation where the requested memory is larger than the shared memory of the GPU. In addition, matrix operations are performed using a highly optimized tool, the MAGMA BLAS matrix-matrix multiply kernels [59]. Furthermore, the GPUs are used to implement a model-parallel scheme: each GPU is used for a different part of the model optimization with the same input examples, and collectively they communicate through the MVAPICH2 MPI. This very large-scale deep learning system is capable of training more than 11 billion parameters, the largest model reported so far, with far fewer machines.
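The receptive-field trick above can be illustrated with a small NumPy sketch: for a group of neurons that share the same local receptive field, only the columns of W inside that field are nonzero, so the product reduces to a much smaller dense multiply against the corresponding rows of X. This is an illustrative reconstruction of the idea, not the authors' CUDA implementation; all names and sizes are assumptions.

```python
import numpy as np

def grouped_local_matmul(W, X, groups):
    """Compute Y = W @ X when rows of W are grouped by a shared local receptive field.

    W      : (n_neurons, n_inputs) filter matrix, zero outside each neuron's receptive field
    X      : (n_inputs, n_examples) input matrix
    groups : list of (row_indices, col_indices); each group of neurons (rows of W)
             shares the same receptive field (columns of W / rows of X)
    """
    Y = np.zeros((W.shape[0], X.shape[1]))
    for rows, cols in groups:
        # Dense multiply only over the nonzero columns shared by this group.
        Y[rows] = W[np.ix_(rows, cols)] @ X[cols]
    return Y

# Toy usage: 8 neurons, 12 inputs, two groups of 4 neurons with receptive fields of size 6.
rng = np.random.default_rng(0)
groups = [(np.arange(0, 4), np.arange(0, 6)), (np.arange(4, 8), np.arange(6, 12))]
W = np.zeros((8, 12))
for rows, cols in groups:
    W[np.ix_(rows, cols)] = rng.standard_normal((len(rows), len(cols)))
X = rng.standard_normal((12, 5))
assert np.allclose(grouped_local_matmul(W, X, groups), W @ X)
```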

Table 1 summarizes the current progress in large-scale deep learning. It has been observed by several groups (see [41]) that a single CPU is impractical for deep learning with a large model. With multiple machines, the running time may not be a big concern any more (see [56]); however, significant computational resources are needed to achieve this goal. Consequently, major research efforts are directed towards experiments with GPUs.

TABLE 1. Summary of recent research progress in large-scale deep learning.

IV. REMAINING CHALLENGES AND PERSPECTIVES: DEEP LEARNING FOR BIG DATA
In recent years, Big Data has taken center stage in government and society at large. In 2012, the Obama Administration announced a Big Data Research and Development Initiative to help solve some of the Nation's most pressing challenges [60]. Consequently, six Federal departments and agencies (NSF, HHS/NIH, DOD, DOE, DARPA, and USGS) committed more than $200 million to support projects that can transform our ability to harness knowledge in novel ways from huge volumes of digital data. In May of the same year, the state of Massachusetts announced the Massachusetts Big Data Initiative, which funds a variety of research institutions [61]. In April 2013, U.S. President Barack Obama announced another federal project, a new brain mapping initiative called BRAIN (Brain Research through Advancing Innovative Neurotechnologies) [62], aiming to develop new tools to help map human brain functions, understand the complex links between function and behavior, and treat and cure brain disorders. This initiative might test and extend the current limits of technologies for Big Data collection and analysis, as NIH director Francis Collins stated that collection, storage, and processing of yottabytes (a billion petabytes) of data would eventually be required.

While the potential of Big Data is undoubtedly significant, fully achieving this potential requires new ways of thinking and novel algorithms to address many technical challenges. For example, most traditional machine learning algorithms were designed for data that would be completely loaded into memory. With the arrival of the Big Data age, however, this assumption no longer holds. Therefore, algorithms that can learn from massive amounts of data are needed. In spite of all the recent achievements in large-scale deep learning discussed in Section 3, this field is still in its infancy. Much more needs to be done to address the many significant challenges posed by Big Data, often characterized by the three V's model: volume, variety, and velocity [63], which refer to the large scale of data, the different types of data, and the speed of streaming data, respectively.

A. DEEP LEARNING FROM HIGH VOLUMES OF DATA
First and foremost, high volumes of data present a great challenge for deep learning. Big data often possesses a large number of examples (inputs), large varieties of class types (outputs), and very high dimensionality (attributes). These properties directly lead to high running-time complexity and model complexity. The sheer volume of data often makes it impossible to train a deep learning algorithm with a central processor and storage; instead, distributed frameworks with parallelized machines are preferred. Recently, impressive progress has been made to mitigate the challenges related to high volumes. These novel models utilize clusters of CPUs or GPUs to increase the training speed without sacrificing the accuracy of deep learning algorithms. Strategies for data parallelism, model parallelism, or both have been developed: for example, data and models are divided into blocks that fit in memory, and the forward and backward propagations are implemented effectively in parallel [56], [58], although deep learning algorithms are not trivially parallelizable. The most recent deep learning frameworks can handle a significantly large number of samples and parameters, and it is also possible to scale up further with more GPUs.
It is less clear, however, how deep learning systems can continue scaling significantly beyond the current frameworks. While we can expect continued growth in computer memory and computational power (mainly through parallel or distributed computing environments), further research and effort on addressing the issues associated with computation and communication management (e.g., copying data, parameters, or gradient values to different machines) are needed for scaling up to very large data sets. Ultimately, to build future deep learning systems scalable to Big Data, one needs to develop high-performance computing infrastructure-based systems together with theoretically sound parallel learning algorithms or novel architectures.

Another challenge associated with high volumes is data incompleteness and noisy labels. Unlike most conventional data sets used for machine learning, which are highly curated and noise free, Big Data is often incomplete, resulting from its disparate origins. To make things even more complicated, the majority of the data may not be labeled or, if labeled, may carry noisy labels. Take the 80 million tiny image database as an example, which has 80 million low-resolution color images covering 79,000 search terms [64].

This image database was created by searching the Web with every non-abstract English noun in WordNet. Several search engines, such as Google and Flickr, were used to collect the data over a span of six months. Some manual curation was conducted to remove duplicates and low-quality images. Still, the image labels are extremely unreliable because of the limitations of search technologies. One of the unique characteristics of deep learning algorithms is their ability to utilize unlabeled data during training: they learn the data distribution without using label information. Thus, the availability of large amounts of unlabeled data presents ample opportunities for deep learning methods. While data incompleteness and noisy labels are part of the Big Data package, we believe that using vastly more data is preferable to using a smaller amount of exact, clean, and carefully curated data. Advanced deep learning methods are required to deal with noisy data and to tolerate some messiness; for example, a more efficient cost function and novel training strategies may be needed to alleviate the effect of noisy labels. Strategies used in semi-supervised learning [65]-[68] may also help alleviate problems related to noisy labels.

B. DEEP LEARNING FOR HIGH VARIETY OF DATA
The second dimension of Big Data is its variety, i.e., data today come in all types of formats from a variety of sources, probably with different distributions. For example, the rapidly growing multimedia data coming from the Web and mobile devices include a huge collection of still images, video and audio streams, graphics and animations, and unstructured text, each with different characteristics. A key to dealing with high variety is data integration. Clearly, one unique advantage of deep learning is its ability for representation learning: with either supervised or unsupervised methods, or a combination of both, deep learning can be used to learn good feature representations for classification. It is able to discover intermediate or abstract representations, which is carried out using unsupervised learning in a hierarchical fashion: one level at a time, with higher-level features defined by lower-level features. Thus, a natural solution to the data integration problem is to learn data representations from each individual data source using deep learning methods, and then to integrate the learned features at different levels.

Deep learning has been shown to be very effective in integrating data from different sources. For example, Ngiam et al. [69] developed a novel application of deep learning algorithms to learn representations by integrating audio and video data. They demonstrated that deep learning is generally effective in (1) learning single-modality representations through multiple modalities with unlabeled data and (2) learning shared representations capable of capturing correlations across multiple modalities. Most recently, Srivastava and Salakhutdinov [70] developed a multimodal Deep Boltzmann Machine (DBM) that fuses two very different data modalities, real-valued dense image data and text data with sparse word frequencies, to learn a unified representation. The DBM is a generative model without fine-tuning: it first builds multiple stacked RBMs for each modality; to form a multimodal DBM, an additional layer of binary hidden units is added on top of these RBMs for the joint representation. It learns a joint distribution over the multimodal input space, which allows for learning even with missing modalities.
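As a rough illustration of learning per-modality representations and then integrating them in a shared layer, the sketch below wires two small encoders (one per modality) into a common joint representation. It is only a schematic stand-in for the multimodal architectures discussed above, which use stacked RBMs and a joint hidden layer; the layer sizes, names, and the use of simple feed-forward encoders are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Per-modality encoders: image features (e.g., dense pixels) and text features (e.g., word counts).
W_img = 0.01 * rng.standard_normal((1024, 128))   # image modality -> 128-d representation
W_txt = 0.01 * rng.standard_normal((2000, 128))   # text modality  -> 128-d representation
W_joint = 0.01 * rng.standard_normal((256, 64))   # concatenated representations -> joint code

def encode(img_x, txt_x):
    h_img = relu(img_x @ W_img)                   # modality-specific representation
    h_txt = relu(txt_x @ W_txt)
    joint_in = np.concatenate([h_img, h_txt])     # integrate the learned features
    return relu(joint_in @ W_joint)               # shared (joint) representation

z = encode(rng.standard_normal(1024), rng.random(2000))
```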
While current experiments have demonstrated that deep learning is able to utilize heterogeneous sources for significant gains in system performance, numerous questions remain open. For example, given that different sources may offer conflicting information, how can we resolve such conflicts and fuse data from different sources effectively and efficiently? Current deep learning methods are mainly tested on bi-modalities (i.e., data from two sources); will system performance benefit from significantly more modalities? Furthermore, which levels of a deep learning architecture are appropriate for feature fusion with heterogeneous data? Deep learning seems well suited to the integration of heterogeneous data with multiple modalities due to its capability of learning abstract representations and the underlying factors of data variation.

C. DEEP LEARNING FOR HIGH VELOCITY OF DATA
Emerging challenges for Big Data learning also arise from high velocity: data are generated at extremely high speed and need to be processed in a timely manner. One solution for learning from such high-velocity data is online learning. Online learning processes one instance at a time, and the true label of each instance soon becomes available and can be used for refining the model [71]-[76]. This sequential learning strategy works particularly well for Big Data, as current machines cannot hold the entire data set in memory. While conventional neural networks have been explored for online learning [77]-[87], only limited progress on online deep learning has been made in recent years. Interestingly, deep learning is often trained with a stochastic gradient descent approach [88], [89], where one training example with its known label is used at a time to update the model parameters. This strategy may be adapted for online learning as well. To speed up learning, instead of proceeding sequentially one example at a time, the updates can be performed on a mini-batch basis [37]. In practice, the examples in each mini-batch should be as independent as possible. Mini-batches provide a good balance between computer memory and running time.

Another challenging problem associated with high velocity is that data are often non-stationary, i.e., the data distribution changes over time. In practice, non-stationary data are normally separated into chunks containing data from a small time interval. The assumption is that data close in time are piece-wise stationary and may be characterized by a significant degree of correlation and, therefore, follow the same distribution [90]-[97]. Thus, an important feature of a deep learning algorithm for Big Data is the ability to learn the data as a stream. One area that needs to be explored is deep online learning: online learning often scales naturally, is memory bounded, readily parallelizable, and theoretically guaranteed [98].
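The mini-batch strategy described above translates directly into a streaming training loop: examples arrive one at a time, are buffered into small batches, and each batch triggers one gradient update, so nothing beyond the current batch needs to stay in memory. A minimal sketch for a toy linear model follows; the stream, the model, and the batch size are assumptions.

```python
import numpy as np

def minibatch_stream_sgd(stream, dim, batch_size=32, lr=0.01):
    """Online learning over a (possibly unbounded) stream of (x, y) pairs via mini-batch SGD."""
    w = np.zeros(dim)
    batch_x, batch_y = [], []
    for x, y in stream:                      # examples arrive one at a time
        batch_x.append(x); batch_y.append(y)
        if len(batch_y) == batch_size:       # one update per mini-batch
            X = np.stack(batch_x); Y = np.asarray(batch_y)
            grad = X.T @ (X @ w - Y) / batch_size   # least-squares gradient on this batch only
            w -= lr * grad
            batch_x, batch_y = [], []        # discard the batch: memory stays bounded
    return w

# Toy usage: a synthetic stream of 10,000 examples drawn from a fixed linear model.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
stream = ((x, x @ w_true + 0.01 * rng.standard_normal()) for x in rng.standard_normal((10_000, 5)))
w_hat = minibatch_stream_sgd(stream, dim=5)
```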

Algorithms capable of learning from non-i.i.d. data are crucial for Big Data learning. Deep learning can also leverage both the high variety and velocity of Big Data through transfer learning or domain adaptation, where the training and test data may be sampled from different distributions [99]-[107]. Recently, Glorot et al. implemented a stacked denoising auto-encoder based deep architecture for domain adaptation, in which an unsupervised representation is trained on a large amount of unlabeled data from a set of domains and then used to train a classifier with few labeled examples from only one domain [100]. Their empirical results demonstrated that deep learning is able to extract a meaningful, high-level representation that is shared across different domains. The intermediate high-level abstraction is general enough to uncover the underlying factors of domain variation, which are transferable across domains. Most recently, Bengio also applied deep learning of multiple levels of representation to transfer learning, where the training examples may not represent the test data well [99]. They showed that the more abstract features discovered by deep learning approaches are most likely generic between training and test data. Thus, deep learning is a top candidate for transfer learning because of its ability to identify shared factors present in the input. Although preliminary experiments have shown much potential of deep learning in transfer learning, applying deep learning to this field is relatively new and much more needs to be done to improve performance. Of course, the big question is whether we can benefit from Big Data with deep architectures for transfer learning.

In conclusion, Big Data presents significant challenges to deep learning, including large scale, heterogeneity, noisy labels, and non-stationary distributions, among many others. In order to realize the full potential of Big Data, we need to address these technical challenges with new ways of thinking and transformative solutions. We believe that the research challenges posed by Big Data are not only timely, but will also bring ample opportunities for deep learning. Together, they will provide major advances in science, medicine, and business.

REFERENCES
[1] National Security Agency. The National Security Agency: Missions, Authorities, Oversight and Partnerships [Online]. Available: http://www.nsa.gov/public_info/_files/speeches_testimonies/2013_08_09_the_nsa_story.pdf
[2] J. Gantz and D. Reinsel, Extracting Value from Chaos. Hopkinton, MA, USA: EMC, Jun. 2011.
[3] J. Gantz and D. Reinsel, The Digital Universe Decade: Are You Ready? Hopkinton, MA, USA: EMC, May 2010.
[4] (2011, May). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute [Online]. Available: http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
[5] J. Lin and A. Kolcz, "Large-scale machine learning at Twitter," in Proc. ACM SIGMOD, Scottsdale, Arizona, USA, 2012, pp. 793-804.
[6] A. Smola and S. Narayanamurthy, "An architecture for parallel topic models," Proc. VLDB Endowment, vol. 3, no. 1, pp. 703-710, 2010.
[7] A. Ng et al., "Map-reduce for machine learning on multicore," in Proc. Adv. Neural Inf. Process. Syst., vol. 19, 2006, pp. 281-288.
[8] B. Panda, J. Herbach, S. Basu, and R. Bayardo, "MapReduce and its application to massively parallel learning of decision tree ensembles," in Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge, U.K.: Cambridge Univ. Press, 2012.
Another challenging problem associated with high velocity is that data are often non-stationary, i.e., the data distribution changes over time. In practice, non-stationary data are normally separated into chunks, each containing data from a small time interval, under the assumption that data close in time are piece-wise stationary, exhibit a significant degree of correlation, and therefore follow the same distribution [90]–[97]. Thus, an important feature of a deep learning algorithm for Big Data is the ability to learn from data as a stream. One area that needs to be explored is deep online learning: online learning often scales naturally, is memory bounded, is readily parallelizable, and comes with theoretical guarantees [98]. Algorithms capable of learning from non-i.i.d. data are crucial for Big Data learning. The sketch below illustrates such chunk-wise learning from a drifting stream.
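The chunk-based treatment of non-stationary streams can likewise be illustrated with a small sketch. The code below assumes a synthetic drifting stream and a linear classifier updated with a hinge-loss subgradient step; the chunk size, drift rate, and function names are illustrative choices, not part of the referenced methods.

```python
# Minimal sketch of chunk-wise learning on a non-stationary stream: data arrive in
# chunks from small time intervals, each chunk is treated as piece-wise stationary,
# and the model keeps adapting as the underlying distribution drifts.
import numpy as np

rng = np.random.default_rng(2)
d, chunk_size, lr = 10, 200, 0.05
w = np.zeros(d)                                   # linear classifier weights

def drifting_chunks(n_chunks):
    """Each chunk is drawn from a slowly drifting 'true' weight vector (concept drift)."""
    true_w = rng.normal(size=d)
    for _ in range(n_chunks):
        true_w += 0.05 * rng.normal(size=d)       # the concept drifts a little every chunk
        x = rng.normal(size=(chunk_size, d))
        y = np.sign(x @ true_w)
        yield x, y

for x, y in drifting_chunks(50):
    # One pass over the current chunk only; older chunks are discarded, so memory
    # stays bounded while the model tracks the current distribution.
    for i in range(chunk_size):
        margin = y[i] * (x[i] @ w)
        if margin < 1.0:                          # hinge-loss subgradient step
            w += lr * y[i] * x[i]
    acc = np.mean(np.sign(x @ w) == y)

print("accuracy on the most recent chunk:", round(float(acc), 3))
```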
Deep learning can also leverage both the high variety and the high velocity of Big Data through transfer learning or domain adaptation, where training and test data may be sampled from different distributions [99]–[107]. Recently, Glorot et al. implemented a stacked denoising autoencoder-based deep architecture for domain adaptation, in which an unsupervised representation is first trained on a large amount of unlabeled data from a set of domains and is then used to train a classifier with few labeled examples from a single domain [100]. Their empirical results demonstrated that deep learning is able to extract a meaningful, high-level representation that is shared across different domains. The intermediate high-level abstraction is general enough to uncover the underlying factors of domain variation and is therefore transferable across domains. More recently, Bengio also applied deep learning of multiple levels of representation to transfer learning where the training examples may not represent the test data well [99]. That work showed that the more abstract features discovered by deep learning approaches are most likely to be generic across training and test data. Thus, deep learning is a strong candidate for transfer learning because of its ability to identify shared factors present in the input. Although preliminary experiments have shown much potential for deep learning in transfer learning, applying deep learning to this field is relatively new, and much more needs to be done to improve performance. Of course, the big question is whether we can benefit from Big Data with deep architectures for transfer learning. A simplified sketch of the two-stage strategy of [100] is given below.
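As a rough illustration of the two-stage idea in [100], though greatly simplified (a single-layer denoising autoencoder rather than a stacked one, with synthetic data), the sketch below first learns a representation from unlabeled examples pooled across domains and then fits a small classifier on that representation using only a few labeled examples from one domain. All sizes, noise levels, and variable names are assumptions made for the example.

```python
# Stage 1: denoising autoencoder on pooled unlabeled data (shared representation).
# Stage 2: logistic regression on the learned codes of a few labeled examples.
import numpy as np

rng = np.random.default_rng(3)
d, h, lr = 50, 25, 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: unlabeled examples from several domains share structure;
# labels are available only for a small sample from one domain.
unlabeled = rng.normal(size=(2000, d))
few_x = rng.normal(size=(30, d))
few_y = (few_x[:, 0] > 0).astype(float)

# Stage 1: corrupt the input, learn to reconstruct the clean version.
W, b = rng.normal(0, 0.1, (d, h)), np.zeros(h)        # encoder
V, c = rng.normal(0, 0.1, (h, d)), np.zeros(d)        # decoder
for epoch in range(20):
    for i in range(0, len(unlabeled), 50):
        x = unlabeled[i:i + 50]
        noisy = x * (rng.random(x.shape) > 0.3)       # randomly mask 30% of the inputs
        code = sigmoid(noisy @ W + b)
        recon = code @ V + c
        err = recon - x                                # squared-error gradient
        dpre = (err @ V.T) * code * (1 - code)         # backprop through the encoder
        V -= lr * code.T @ err / len(x); c -= lr * err.mean(0)
        W -= lr * noisy.T @ dpre / len(x); b -= lr * dpre.mean(0)

# Stage 2: train a classifier on the shared codes using the few labeled examples.
codes = sigmoid(few_x @ W + b)
w2, b2 = np.zeros(h), 0.0
for _ in range(500):
    p = sigmoid(codes @ w2 + b2)
    w2 -= lr * codes.T @ (p - few_y) / len(few_y)
    b2 -= lr * np.mean(p - few_y)

print("training accuracy on the labeled domain:",
      np.mean((sigmoid(codes @ w2 + b2) > 0.5) == (few_y > 0.5)))
```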

In conclusion, Big Data presents significant challenges to deep learning, including large scale, heterogeneity, noisy labels, and non-stationary distributions, among many others. In order to realize the full potential of Big Data, we need to address these technical challenges with new ways of thinking and transformative solutions. We believe that the research challenges posed by Big Data are not only timely, but will also bring ample opportunities for deep learning. Together, they will provide major advances in science, medicine, and business.

REFERENCES
[1] National Security Agency. The National Security Agency: Missions, Authorities, Oversight and Partnerships [Online]. Available: http://www.nsa.gov/public_info/_files/speeches_testimonies/2013_08_09_the_nsa_story.pdf
[2] J. Gantz and D. Reinsel, Extracting Value from Chaos. Hopkinton, MA, USA: EMC, Jun. 2011.
[3] J. Gantz and D. Reinsel, The Digital Universe Decade: Are You Ready? Hopkinton, MA, USA: EMC, May 2010.
[4] (2011, May). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute [Online]. Available: http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
[5] J. Lin and A. Kolcz, Large-scale machine learning at Twitter, in Proc. ACM SIGMOD, Scottsdale, AZ, USA, 2012, pp. 793–804.
[6] A. Smola and S. Narayanamurthy, An architecture for parallel topic models, Proc. VLDB Endowment, vol. 3, no. 1, pp. 703–710, 2010.
[7] A. Ng et al., Map-reduce for machine learning on multicore, in Proc. Adv. Neural Inf. Process. Syst., vol. 19, 2006, pp. 281–288.
[8] B. Panda, J. Herbach, S. Basu, and R. Bayardo, MapReduce and its application to massively parallel learning of decision tree ensembles, in Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[9] E. Crego, G. Munoz, and F. Islam. (2013, Dec. 8). Big data and deep learning: Big deals or big delusions? Business [Online]. Available: http://www.huffingtonpost.com/george-munoz-frank-islam-and-ed-crego/big-data-and-deep-learnin_b_3325352.html
[10] Y. Bengio and S. Bengio, Modeling high-dimensional discrete data with multi-layer neural networks, in Proc. Adv. Neural Inf. Process. Syst., vol. 12, 2000, pp. 400–406.
[11] M. Ranzato, Y.-L. Boureau, and Y. LeCun, Sparse feature learning for deep belief networks, in Proc. Adv. Neural Inf. Process. Syst., vol. 20, 2007, pp. 1185–1192.
[12] G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–41, Jan. 2012.
[13] G. Hinton et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[14] R. Salakhutdinov, A. Mnih, and G. Hinton, Restricted Boltzmann machines for collaborative filtering, in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 791–798.
[15] D. Cireşan, U. Meier, L. Gambardella, and J. Schmidhuber, Deep, big, simple neural nets for handwritten digit recognition, Neural Comput., vol. 22, no. 12, pp. 3207–3220, 2010.
[16] M. Zeiler, G. Taylor, and R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2018–2025.
[17] A. Efrati. (2013, Dec. 11). How deep learning works at Apple, beyond. The Information [Online]. Available: https://www.theinformation.com/how-deep-learning-works-at-apple-beyond
[18] N. Jones, Computer science: The learning machines, Nature, vol. 505, no. 7482, pp. 146–148, 2014.
[19] Y. Wang, D. Yu, Y. Ju, and A. Acero, Voice search, in Language Understanding: Systems for Extracting Semantic Information From Speech, G. Tur and R. De Mori, Eds. New York, NY, USA: Wiley, 2011, ch. 5.
[20] J. Kirk. (2013, Oct. 1). Universities, IBM join forces to build a brain-like computer. PCWorld [Online]. Available: http://www.pcworld.com/article/2051501/universities-join-ibm-in-cognitive-computing-research-project.html
[21] G. Hinton and R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, vol. 313, no. 5786, pp. 504–507, 2006.
[22] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[23] V. Nair and G. Hinton, 3D object recognition with deep belief nets, in Proc. Adv. NIPS, vol. 22, 2009, pp. 1339–1347.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[25] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
[26] P. Le Callet, C. Viard-Gaudin, and D. Barba, A convolutional neural network approach for objective video quality assessment, IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1316–1327, Sep. 2006.
[27] D. Rumelhart, G. Hinton, and R. Williams, Learning representations by back-propagating errors, Nature, vol. 323, pp. 533–536, Oct. 1986.
[28] G. Hinton, A practical guide to training restricted Boltzmann machines, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep. UTML TR 2010-003, 2010.
[29] G. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[30] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, in Proc. Neural Inf. Process. Syst., 2006, pp. 153–160.
[31] G. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
[32] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
[33] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, Exploring strategies for training deep neural networks, J. Mach. Learn. Res., vol. 10, pp. 1–40, Jan. 2009.
[34] H. Lee, A. Battle, R. Raina, and A. Ng, Efficient sparse coding algorithms, in Proc. Neural Inf. Process. Syst., 2006, pp. 801–808.
[35] F. Seide, G. Li, and D. Yu, Conversational speech transcription using context-dependent deep neural networks, in Proc. Interspeech, 2011, pp. 437–440.
[36] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, Flexible, high performance convolutional neural networks for image classification, in Proc. 22nd Int. Conf. Artif. Intell., 2011, pp. 1237–1242.
[37] D. Scherer, A. Müller, and S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in Proc. Int. Conf. Artif. Neural Netw., 2010, pp. 92–101.
[38] Y. LeCun, L. Bottou, G. Orr, and K. Muller, Efficient backprop, in Neural Networks: Tricks of the Trade, G. Orr and K. Muller, Eds. New York, NY, USA: Springer-Verlag, 1998.
[39] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun, Learning invariant features through topographic filter maps, in Proc. Int. Conf. CVPR, 2009, pp. 1605–1612.
[40] D. Hubel and T. Wiesel, Receptive fields and functional architecture of monkey striate cortex, J. Physiol., vol. 195, pp. 215–243, Mar. 1968.
[41] R. Raina, A. Madhavan, and A. Ng, Large-scale deep unsupervised learning using graphics processors, in Proc. 26th Int. Conf. Mach. Learn., Montreal, QC, Canada, 2009, pp. 873–880.
[42] J. Martens, Deep learning via Hessian-free optimization, in Proc. 27th Int. Conf. Mach. Learn., 2010.
[43] K. Zhang and X. Chen, Large-scale deep belief nets with MapReduce, IEEE Access, vol. 2, pp. 395–403, Apr. 2014.
[44] L. Deng, D. Yu, and J. Platt, Scalable stacking and learning for building deep architectures, in Proc. IEEE ICASSP, Mar. 2012, pp. 2133–2136.
[45] B. Hutchinson, L. Deng, and D. Yu, Tensor deep stacking networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1944–1957, Aug. 2013.
[46] V. Vanhoucke, A. Senior, and M. Mao, Improving the speed of neural networks on CPUs, in Proc. Deep Learn. Unsupervised Feature Learn. Workshop, 2011.
[47] A. Krizhevsky, Learning multiple layers of features from tiny images, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009.
[48] C. Farabet et al., Large-scale FPGA-based convolutional networks, in Machine Learning on Very Large Data Sets, R. Bekkerman, M. Bilenko, and J. Langford, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2011.
[49] CUDA C Programming Guide, PG-02829-001_v5.5, NVIDIA Corporation, Santa Clara, CA, USA, Jul. 2013.
[50] Q. Le et al., Building high-level features using large scale unsupervised learning, in Proc. Int. Conf. Mach. Learn., 2012.
[51] M. Ranzato and M. Szummer, Semi-supervised learning of compact document representations with deep networks, in Proc. Int. Conf. Mach. Learn., 2008, pp. 792–799.
[52] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 6, pp. 721–741, Nov. 1984.
[53] G. Casella and E. George, Explaining the Gibbs sampler, Amer. Statist., vol. 46, no. 3, pp. 167–174, 1992.
[54] P. Simard, D. Steinkraus, and J. Platt, Best practices for convolutional neural networks applied to visual document analysis, in Proc. 7th ICDAR, 2003, pp. 958–963.
[55] A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks, in Proc. Adv. NIPS, 2012, pp. 1106–1114.
[56] J. Dean et al., Large scale distributed deep networks, in Proc. Adv. NIPS, 2012, pp. 1232–1240.
[57] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011.
[58] A. Coates, B. Huval, T. Wang, D. Wu, and A. Ng, Deep learning with COTS HPC systems, J. Mach. Learn. Res., vol. 28, no. 3, pp. 1337–1345, 2013.
[59] S. Tomov, R. Nath, P. Du, and J. Dongarra. (2011). MAGMA users guide. ICL, Univ. Tennessee, Knoxville, TN, USA [Online]. Available: http://icl.cs.utk.edu/magma
[60] (2012). Obama Administration Unveils Big Data Initiative: Announces $200 Million in New R&D Investments. Office of Science and Technology Policy, Executive Office of the President, Washington, DC, USA [Online]. Available: http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
[61] K. Haberlin, B. McGilpin, and C. Ouellette. Governor Patrick Announces New Initiative to Strengthen Massachusetts' Position as a World Leader in Big Data. Commonwealth of Massachusetts [Online]. Available: http://www.mass.gov/governor/pressoffice/pressreleases/2012/2012530-governor-announces-big-data-initiative.html
[62] Fact Sheet: BRAIN Initiative, Office of the Press Secretary, The White House, Washington, DC, USA, 2013.
[63] D. Laney, The Importance of 'Big Data': A Definition. Stamford, CT, USA: Gartner, 2012.
[64] A. Torralba, R. Fergus, and W. Freeman, 80 million tiny images: A large data set for nonparametric object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[65] J. Wang and X. Shen, Large margin semi-supervised learning, J. Mach. Learn. Res., vol. 8, no. 8, pp. 1867–1891, 2007.
[66] J. Weston, F. Ratle, and R. Collobert, Deep learning via semi-supervised embedding, in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, 2008.
[67] K. Sinha and M. Belkin, Semi-supervised learning using sparse eigenfunction bases, in Proc. Adv. NIPS, 2009, pp. 1687–1695.
[68] R. Fergus, Y. Weiss, and A. Torralba, Semi-supervised learning in gigantic image collections, in Proc. Adv. NIPS, 2009, pp. 522–530.
[69] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, Multimodal deep learning, in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011.
[70] N. Srivastava and R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, in Proc. Adv. NIPS, 2012.
[71] L. Bottou, Online algorithms and stochastic approximations, in On-Line Learning in Neural Networks, D. Saad, Ed. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[72] A. Blum and C. Burch, On-line learning and the metrical task system problem, in Proc. 10th Annu. Conf. Comput. Learn. Theory, 1997, pp. 45–53.
[73] N. Cesa-Bianchi, Y. Freund, D. Helmbold, and M. Warmuth, On-line prediction and conversion strategies, in Proc. Conf. Comput. Learn. Theory EuroCOLT, vol. 53, Oxford, U.K., 1994, pp. 205–216.
[74] Y. Freund and R. Schapire, Game theory, on-line prediction and boosting, in Proc. 9th Annu. Conf. Comput. Learn. Theory, 1996, pp. 325–332.
[75] N. Littlestone, P. M. Long, and M. K. Warmuth, On-line learning of linear functions, in Proc. 23rd Symp. Theory Comput., 1991, pp. 465–475.
[76] S. Shalev-Shwartz, Online learning and online convex optimization, Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, 2012.
[77] T. M. Heskes and B. Kappen, On-line learning processes in artificial neural networks, North-Holland Math. Library, vol. 51, pp. 199–233, 1993.
[78] R. Martí and A. El-Fallahi, Multilayer neural networks: An experimental evaluation of on-line training methods, Comput. Oper. Res., vol. 31, no. 9, pp. 1491–1513, 2004.
[79] C. P. Lim and R. F. Harrison, Online pattern classification with multiple neural network systems: An experimental study, IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 33, no. 2, pp. 235–247, May 2003.
[80] M. Rattray and D. Saad, Globally optimal on-line learning rules for multi-layer neural networks, J. Phys. A, Math. General, vol. 30, no. 22, pp. L771–L776, 1997.
[81] P. Riegler and M. Biehl, On-line backpropagation in two-layered neural networks, J. Phys. A, vol. 28, no. 20, pp. L507–L513, 1995.
[82] D. Saad and S. Solla, Exact solution for on-line learning in multilayer neural networks, Phys. Rev. Lett., vol. 74, no. 21, pp. 4337–4340, 1995.
[83] A. West and D. Saad, On-line learning with adaptive back-propagation in two-layer networks, Phys. Rev. E, vol. 56, no. 3, pp. 3426–3445, 1997.
[84] P. Campolucci, A. Uncini, F. Piazza, and B. Rao, On-line learning algorithms for locally recurrent neural networks, IEEE Trans. Neural Netw., vol. 10, no. 2, pp. 253–271, Mar. 1999.
[85] N. Liang, G. Huang, P. Saratchandran, and N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1411–1423, Nov. 2006.
[86] V. Ruiz de Angulo and C. Torras, On-line learning with minimal degradation in feedforward networks, IEEE Trans. Neural Netw., vol. 6, no. 3, pp. 657–668, May 1995.

[87] M. Choy, D. Srinivasan, and R. Cheu, Neural networks for continuous online learning and control, IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1511–1531, Nov. 2006.
[88] L. Bottou and O. Bousquet, Stochastic gradient learning in neural networks, in Proc. Neuro-Nîmes, 1991.
[89] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: Primal estimated sub-gradient solver for SVM, in Proc. Int. Conf. Mach. Learn., 2007.
[90] J. Chen and H. Hsieh, Nonstationary source separation using sequential and variational Bayesian learning, IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 681–694, May 2013.
[91] M. Sugiyama and M. Kawanabe, Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge, MA, USA: MIT Press, Mar. 2012.
[92] R. Elwell and R. Polikar, Incremental learning in nonstationary environments with controlled forgetting, in Proc. Int. Joint Conf. Neural Netw., 2009, pp. 771–778.
[93] R. Elwell and R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517–1531, Oct. 2011.
[94] C. Alippi and M. Roveri, Just-in-time adaptive classifiers, Part I: Detecting nonstationary changes, IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1145–1153, Jul. 2008.
[95] C. Alippi and M. Roveri, Just-in-time adaptive classifiers, Part II: Designing the classifier, IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2053–2064, Dec. 2008.
[96] L. Rutkowski, Adaptive probabilistic neural networks for pattern classification in time-varying environment, IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 811–827, Jul. 2004.
[97] W. de Oliveira, The Rosenblatt Bayesian algorithm learning in a nonstationary environment, IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 584–588, Mar. 2007.
[98] P. Bartlett, Optimal online prediction in adversarial environments, in Proc. 13th Int. Conf. DS, 2010, p. 371.
[99] Y. Bengio, Deep learning of representations for unsupervised and transfer learning, J. Mach. Learn. Res., vol. 27, pp. 17–37, 2012.
[100] X. Glorot, A. Bordes, and Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011.
[101] G. Mesnil et al., Unsupervised and transfer learning challenge: A deep learning approach, J. Mach. Learn. Res., vol. 7, pp. 1–15, 2011.
[102] S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[103] S. Gutstein, O. Fuentes, and E. Freudenthal, Knowledge transfer in deep convolutional neural nets, Int. J. Artif. Intell. Tools, vol. 17, no. 3, pp. 555–567, 2008.
[104] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, in Proc. 11th Annu. Conf. Comput. Learn. Theory, 1998, pp. 92–100.
[105] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, Self-taught learning: Transfer learning from unlabeled data, in Proc. 24th ICML, 2007.
[106] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
[107] G. Mesnil, S. Rifai, A. Bordes, X. Glorot, Y. Bengio, and P. Vincent, Unsupervised and transfer learning under uncertainty: From object detections to scene categorization, in Proc. ICPRAM, 2013, pp. 345–354.

XUE-WEN CHEN (M'00–SM'03) is currently a Professor and the Chair with the Department of Computer Science, Wayne State University, Detroit, MI, USA. He received the Ph.D. degree from Carnegie Mellon University, Pittsburgh, PA, USA, in 2001.
He is currently serving as an Associate Editor or an Editorial Board Member for several international journals, including IEEE ACCESS, BMC Systems Biology, and the IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE. He served as a Conference Chair or Program Chair for a number of conferences, such as the 21st ACM Conference on Information and Knowledge Management in 2012 and the 10th IEEE International Conference on Machine Learning and Applications in 2011. He is a Senior Member of the IEEE Computer Society.

XIAOTONG LIN is currently a Visiting Assistant Professor with the Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA. She received the Ph.D. degree from the University of Kansas, Lawrence, KS, USA, in 2012, and the M.Sc. degree from the University of Pittsburgh, Pittsburgh, PA, USA, in 1999. Her research interests include large scale machine learning, data mining, high-performance computing, and bioinformatics.