A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning



A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning

Jianwu Wang 1, Yan Tang 2, Mai Nguyen 1, Ilkay Altintas 1
1 San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, USA, {jianwu, mhnguyen,
2 College of Computer and Information, Hohai University, Nanjing, China

Abstract—In the Big Data era, machine learning has more potential to discover valuable insights from the data. As an important machine learning technique, the Bayesian Network (BN) has been widely used to model probabilistic relationships among variables. To deal with the challenges of Big Data BN learning, we apply distributed data-parallelism (DDP) and scientific workflow techniques to the BN learning process. We first propose an intelligent Big Data pre-processing approach and a data quality score to measure and ensure data quality and data faithfulness. Then, a new weight-based ensemble algorithm is proposed to learn a BN structure from an ensemble of local results. To easily integrate the algorithm with DDP engines, such as Hadoop, we employ the Kepler scientific workflow system to build the whole learning process. We demonstrate how Kepler can facilitate building and running our Big Data BN learning application. Our experiments show good scalability and learning accuracy when running the application in real distributed environments.

Keywords—Big Data; Bayesian network; Distributed computing; Ensemble learning; Scientific workflow; Kepler; Hadoop

I. INTRODUCTION

With the explosive growth of data encountered in our daily lives, we have entered the Big Data era [1]. For the United States health care sector alone, the creative and effective use of Big Data could bring more than $300 billion in potential value every year [35]. Making the best use of Big Data and releasing its value are core problems that researchers all over the world have been striving to solve since the new millennium. The Bayesian Network (BN), a probabilistic graph model, provides intuitive and theoretically solid mechanisms for processing uncertain information and presenting causalities among variables.
BN is an ideal tool for causal relationship modeling and probabilistic reasoning. BN is widely used in modeling [2][3][4], prediction [5][6][7], and risk analysis [8]. BNs have been applied in a wide range of domains such as health care, education, finance, environment, bioinformatics, telecommunication, and information technology [9]. With the abundant data resources available nowadays, learning BNs from Big Data could discover valuable business insights [4] and bring potential revenue [8] to different domains. To efficiently process large quantities of data, a scalable approach is needed. Although distributed data-parallelism (DDP) patterns, such as Map, Reduce, Match, CoGroup and Cross, are promising techniques for building scalable data-parallel analysis and analytics applications, applying the DDP patterns to Big Data BN learning still faces several challenges: (1) How can we effectively pre-process Big Data to evaluate its quality and reduce its size if necessary? (2) How to design a workflow capable of taking gigabytes of data and learning BNs with decent accuracy? (3) How to provide easy scalability support to BN learning algorithms? These three questions have not received substantial attention in current research. This is the main motivation for this research: the creation of a novel workflow, the Scalable Bayesian Network Learning (SBNL) workflow. The SBNL workflow makes three novel contributions to the current literature:
- Intelligent Big Data pre-processing through a proposed data quality score, called S_Arc, to measure and ensure data quality and data faithfulness.
- Effective BN learning from Big Data by leveraging ensemble learning and a distributed computing model. A new weight-based ensemble algorithm is proposed to learn a BN structure from an ensemble of local results. This algorithm is implemented as an R package and is reusable by third parties.
- A user-friendly approach to build and run scalable Big Data machine learning applications on top of DDP patterns and engines via scientific workflows.
Users do not need to write programs to fit the interfaces of different DDP engines. They only need to build algorithm-specific components using languages like R or Matlab, with which they are already familiar. SBNL is validated using three publicly available data sets. SBNL obtains significant performance gains when applied in distributed environments while keeping the same learning accuracy, making it an ideal workflow for Big Data Bayesian Network learning. The remainder of this paper is organized as follows. Section II presents the background of BN learning techniques and

evaluation. Section III discusses how to build scalable Big Data applications via the Kepler scientific workflow system. Section IV describes the SBNL workflow in detail. The evaluation results and related work are presented in Sections V and VI, respectively. Section VII concludes this paper with future work.

II. BAYESIAN NETWORK LEARNING TECHNIQUES AND EVALUATION

A. Ensemble Learning

One trend in machine learning is to combine the results of multiple learners to obtain better accuracy. This trend is commonly known as ensemble learning. Ensemble learning leverages multiple models to obtain better predictive performance than could be obtained from any of the constituent models [19]. There is no definitive taxonomy for ensemble learning. Zenko [18] details four methods of combining multiple models: bagging, boosting, stacking and error-correcting output. In 2009, the Netflix challenge grand prize winner used an ensemble learning technique to build the most accurate predictive model for movie recommendation and won a million dollars 1. In this paper, we propose a weight-based algorithm to combine local and global learning results to learn a BN from Big Data. The main symbols used in the paper are listed in Table I.

TABLE I. SYMBOL TABLE
Symbol | Meaning
B | BN structure
D | Data set
P(B, D) | The joint probability of a BN structure (B) given the data set (D)
P(B) | Prior probability of the network B
N' | Prior equivalent sample size
X_i | Variable X_i
Pa(X_i) | Parents of X_i
Γ | Gamma function
N_D | The number of rows in D
N_arc | The total number of arcs in B
S_Arc | Arc Score

B. The BDe Score

Score functions describe how well a BN structure fits the data set. The BDe score function [9] provides a way to compute the joint probability of a BN structure (B) given the data set (D), denoted as P(B, D). The best BN structure maximizes the BDe score. Thus the BDe score is a good measure of data set quality. The BDe score function is defined for discrete variables. Let n be the total number of variables, q_i denote the number of configurations of the parent set Pa(X_i), and r_i denote the number of states of variable X_i.
The BDe score function is:

P(B, D) = P(B) · ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ Γ(N'/q_i) / Γ(N'/q_i + N_ij) ] · ∏_{k=1}^{r_i} [ Γ(N'/(r_i·q_i) + N_ijk) / Γ(N'/(r_i·q_i)) ]   (1)

where N_ijk is the number of instances in D for which X_i = k and Pa(X_i) is in the j-th configuration, and

N_ij = Σ_{k=1}^{r_i} N_ijk   (2)

The BDe score function uses a parameter prior that is uniform and requires both a prior equivalent sample size N' and a prior structure. In practice, the BDe score function uses the value log(P(B, D)). The BDe score function is decomposable: the total score is the sum of the scores of each variable X_i. This study uses the BDe score to evaluate the learning accuracy of BN learning algorithms as well as data set quality (Section V.B).

In order to measure the quality of a data set D, we introduce a new measure called Arc Score, denoted as S_Arc:

S_Arc = P(B, D) / (N_D · N_arc)   (3)

where P(B, D) is the (log) BDe score given data D and BN structure B, N_D is the number of rows in D, and N_arc is the total number of arcs in B. After an empirical study on known BN structures (Section V), we discovered that data sets suitable for BN learning have S_Arc larger than -0.5. On the other hand, data sets that produce inferior BN structures have very low S_Arc, generally smaller than -0.5, reaching -1 or even -2. Hence, S_Arc can be used as an effective measure of data set quality. In the situation where we only have a data set D, how can we obtain S_Arc? To address this problem, we can use an accurate BN learning algorithm like Max-Min Hill Climbing (MMHC) [10]. Because BN learning algorithms are neutral, given a good data set, a structure B with high S_Arc will be learned, and given a bad data set D, a structure B with low S_Arc will be learned. Therefore, we use MMHC, one of the most accurate and robust algorithms in the BN learning algorithm comparison study [9], to learn a BN from any data set D and calculate the corresponding S_Arc.

C. Structural Hamming Distance

Structural Hamming Distance (SHD) [14] is an important measure to evaluate the quality of a BN.
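As a concrete illustration of Eqs. (1)–(3), the following Python sketch computes the log BDe score of a discrete structure by direct counting, and the Arc Score from it. This is an illustrative re-implementation, not the paper's R code (which calls bnlearn's mmhc and score functions); the function names and data layout are our own, and the constant log P(B) term of the uniform structure prior is omitted.

```python
from collections import Counter
from math import lgamma, log

def bde_log_score(data, parents, states, n_prime=1.0):
    # data: list of {variable: state} rows; parents: {X: list of parents};
    # states: {X: list of possible states}; n_prime: the prior equivalent
    # sample size N'. The constant log P(B) of the uniform prior is omitted.
    total = 0.0
    for x, pa in parents.items():
        r = len(states[x])                      # r_i: states of X_i
        q = 1
        for p in pa:                            # q_i: parent configurations
            q *= len(states[p])
        n_ij, n_ijk = Counter(), Counter()
        for row in data:                        # count N_ij and N_ijk
            j = tuple(row[p] for p in pa)
            n_ij[j] += 1
            n_ijk[(j, row[x])] += 1
        for j in n_ij:                          # configurations seen in D
            total += lgamma(n_prime / q) - lgamma(n_prime / q + n_ij[j])
            for k in states[x]:
                total += (lgamma(n_prime / (r * q) + n_ijk[(j, k)])
                          - lgamma(n_prime / (r * q)))
    return total

def s_arc(log_bde, n_rows, n_arcs):
    # Eq. (3): Arc Score normalizes the (log) BDe score by N_D * N_arc.
    return log_bde / (n_rows * n_arcs)
```

For a single binary variable observed once in each state, the sketch yields log Γ(1) − log Γ(3) + 2·(log Γ(1.5) − log Γ(0.5)) = log(1/8), matching a hand calculation of the marginal likelihood under a Beta(1/2, 1/2) prior.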
The learned BN structure B is directly compared with the structure of a gold standard network (GSN), the network with the known correct structure. Each arc in the learned structure fits one of the following cases:
Correct Arc: the same arc as in the GSN.
Added Arc (AA): an arc that does not exist in the GSN.
Missing Arc (MA): an arc that exists in the GSN but not in the learned structure.

1 Netflix prize, Netflix Inc.

Wrongly Directed Arc (WDA): an arc that exists in the GSN with the opposite direction.

Since some algorithms may return non-oriented edges when the direction of an edge cannot be statistically distinguished using orientation rules, the learned BN is a partially directed acyclic graph (PDAG). SHD is defined as the number of operators required to make two PDAGs identical: either by adding, removing, or reversing an arc, or by adding, removing or orienting an edge:

SHD = #AA + #MA + #WDA + #AE + #ME + #NOE   (4)

where AE is Added Edges, ME is Missing Edges and NOE is Non-Oriented Edges.

D. Bayesian Network Learning Algorithm

A comprehensive and comparative survey was carried out on BN learning algorithms [9]. Out of more than 50 learning algorithms, Max-Min Hill Climbing (MMHC) [10], the Three-Phase Dependency Analysis Algorithm (TPDA) [11] and the Recursive BN learning Algorithm (REC) [12] are shown to have superior learning accuracy and robustness.

Input: D: data set; ε: threshold for conditional independence test
MMHC(D, ε)
Phase I:
1. For all variables X_i, set PC(X_i) = MMPC(X_i, D);
Phase II:
2. Starting from an empty graph, perform greedy hill-climbing with the operators add_edge, delete_edge and reverse_edge. Only try operator add_edge Y → X if Y ∈ PC(X);
Return the highest-scoring DAG found.
Fig. 1. The MMHC algorithm.

The Max-Min Hill Climbing (MMHC) algorithm [14] combines concepts from constraint-based [13] and search-and-score-based algorithms [11]. It takes as input a data set D and returns a BN structure with the highest score. MMHC is a two-phase algorithm: Phase I identifies the candidate sets for each variable X_i by calling a local-discovery algorithm called Max-Min Parents and Children (MMPC), and discovers the BN's skeleton. Phase II performs a Bayesian-scoring, greedy hill-climbing search starting from an empty graph to orient and delete the edges. Fig. 1 presents a detailed description of the MMHC algorithm. The major structural search process of the MMHC algorithm is the MMPC procedure, which returns the parents and children of a target variable X_i, denoted as PC(X_i).
By invoking MMPC with each variable as the target, one can identify all the edges that form the BN's skeleton.

III. BUILDING SCALABLE BIG DATA APPLICATIONS VIA THE KEPLER SCIENTIFIC WORKFLOW SYSTEM

A. Distributed Data-Parallel Patterns for Scalable Big Data Applications

Several DDP patterns, such as Map, Reduce, Match, CoGroup, and Cross, have been identified to easily build efficient and scalable data-parallel analysis and analytics applications [27]. DDP patterns enable programs to execute in parallel by splitting data in distributed computing environments. Originating from higher-order functional programming, each DDP pattern executes user-defined functions (UDFs) in parallel over input data sets. Since DDP execution engines often provide many features for execution, including parallelization, communication, and fault tolerance, application developers only need to select the appropriate DDP pattern for their specific data processing task and implement the corresponding UDF. Due to the increasing popularity and adoption of these DDP patterns, a number of execution engines have been implemented to support one or more of them. These DDP execution engines manage distributed resources and execute UDF instances in parallel. When running on distributed resources, DDP engines can achieve good scalability and performance acceleration. Hadoop is the most popular MapReduce execution engine. The Stratosphere system [27] supports five different DDP patterns. Many of the above DDP patterns are also supported by Spark 2. Since each DDP execution engine defines its own API for how UDFs should be implemented, an application implemented for one engine may be difficult to run on another engine.

B. Kepler Scientific Workflow

The Kepler scientific workflow system 3 is an open-source, cross-project collaboration to serve scientists from different disciplines [28]. Kepler adopts an actor-oriented modeling paradigm for the design and execution of scientific workflows. Kepler has been used in a wide variety of projects to manage, process, and analyze scientific data.
Kepler provides a graphical user interface (GUI) for designing, managing and executing scientific workflows, which are structured sets of steps or tasks linked together to implement a computational solution to a scientific problem. In Kepler, Actors provide implementations of specific tasks and can be linked together via input and output Ports. Data is encapsulated in messages, or Tokens, and transferred between actors through ports. Actor execution is governed by Models of Computation (MoCs), called Directors in Kepler [29]. We found that the actor-oriented programming paradigm of Kepler fits the DDP framework very well [30]. Since each DDP pattern expresses an independent higher-order function, we define a separate DDP actor for each pattern. Unlike normal actors, these higher-order DDP actors do not process

2 Spark Project: 3 Kepler Project: https://kepler-project.org/

their input data as a whole. Instead, they first partition the input data and then process each partition separately. The UDF for a DDP pattern is an independent component and can naturally be encapsulated within a DDP actor. The logic of the UDF can either be expressed as a sub-workflow or as compiled code. In the first case, users can compose a sub-workflow for their UDF via the Kepler GUI, using specific subsidiary actors for the DDP pattern and any other general actors. Since the sub-workflow is not specific to any engine API, the same sub-workflow can be executed on different DDP engines. Like other actors, multiple DDP actors can be linked to construct bigger applications. Each DDP pattern defines its execution semantics, i.e., how data partitions are processed by the pattern. This clear definition enables decoupling between a DDP pattern and its execution engines. To execute DDP workflows on different DDP execution engines, we have implemented a DDP director in Kepler. Currently, this director can execute DDP workflows with Hadoop, Stratosphere and Spark. At runtime, the director detects the availability of DDP execution engines and transforms workflows into their corresponding jobs. The adaptability of the director makes it user-friendly, since it hides the underlying execution engines from users.

C. Machine Learning Support in Kepler

There are many popular tools/languages for machine learning, such as R, Matlab, Python and Knime [31]. Complex machine learning applications might need to integrate different components implemented in different tools/languages. Kepler supports easy integration of these tools/languages within one process. Besides the ExternalExecution actor in Kepler to invoke arbitrary binary tools in batch mode, we also have actors specifically for many scripting languages. For instance, users can embed their own R scripts in the RExpression actor. Users can further customize the input/output ports of the RExpression actor to connect with other actors and build complex applications.
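The decoupling described above, in which the engine owns parallel execution and the user supplies only one UDF per pattern, can be sketched in a few lines. This is a thread-backed Python toy, not Kepler's or Hadoop's actual API; ddp_map and ddp_reduce are hypothetical names, and a real DDP engine would add data distribution, communication and fault tolerance.

```python
from multiprocessing.dummy import Pool  # thread pool standing in for a DDP engine

def ddp_map(udf, partitions, workers=4):
    # Map pattern: the "engine" applies the user-defined function to each
    # data partition independently and in parallel.
    with Pool(workers) as pool:
        return pool.map(udf, partitions)

def ddp_reduce(udf, mapped_results):
    # Reduce pattern: a single UDF call combines all Map outputs; it can
    # only run after every Map instance has finished.
    return udf(mapped_results)
```

For example, ddp_reduce(sum, ddp_map(sum, [[1, 2], [3, 4], [5]])) computes partial sums in parallel and then combines them into 15, mirroring the parallel local learner (Map) and sequential master learner (Reduce) split used by SBNL.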
In addition, we are investigating how to integrate other popular machine learning tools, such as Mahout 4, into Kepler. Users will be able to use their machine learning functions/libraries as actors and connect them with other actors.

IV. PROPOSED APPROACH

A. Overview of SBNL Workflow

After introducing the background knowledge in the previous sections, we give an overview of our SBNL workflow.

4 Mahout: https://mahout.apache.org/

Fig. 2. Overview of the SBNL algorithm (Big Data → quality evaluation & data partitioning → local learners with local ensemble learning → master learner with master ensemble learning → final BN structure; built as a Kepler workflow).

As shown in Fig. 2, the SBNL workflow consists of four components: (1) data partitioning, (2) local learner, (3) master learner, (4) Kepler workflow. In the data partitioning component, the SBNL workflow partitions the data set into partitions of reasonable size. SBNL has a score-based algorithm to dynamically determine the best partition size to balance both learning complexity and accuracy. Data partitions are then sent evenly to each local learner. The local learner first uses the value of S_Arc to examine the data partition's quality. If the quality is good, SBNL enters the local ensemble learning (LEL) step: each local learner runs the MMHC algorithm separately on each local data partition to learn an individual BN. Then, the local learner applies our proposed ensemble method on the individual BNs to generate a final local BN. During local learning, the best local data partition is obtained in each local learner. Finally, the SBNL workflow reaches the master learner component. This component receives the local BNs and best local data partitions from all local learners. The global best data partition can be obtained in the master learner. Then, the master learner runs our proposed ensemble algorithm on the local BNs using the best data partition. Note that the master learner does not run any BN learning algorithm; it just assigns a weight to each local BN and ensembles the final BN.
So, all the computing heavy-lifting tasks are distributed among the local learners. Details of each component of the SBNL workflow are given in the following subsections.

B. Quality Evaluation and Data Partitioning

The first thing the SBNL workflow does is evaluate the quality of the given big data set. It runs a scoring algorithm to incrementally evaluate a partition D_p of the whole data set D. Each time, the scoring algorithm doubles the size of D_p until a threshold is reached. If the S_Arc value of D_p is larger than 1.0 in absolute value, the whole data set will not be used by the SBNL workflow.

Facing Big Data larger than the memory size, a single machine cannot compute the Bayesian score of the whole data set. Therefore, we use a distributed computing model in the SBNL workflow. A big data set is partitioned into K slices of size N_s:

N_d = N_s · K + N_r   (5)

where N_d is the size of data set D and N_r is the number of rows in the remainder of D after K partitions. Counting the last N_r rows as another slice, the total number of slices is K + 1. Given the total number of local learners, denoted as N_local, we try to send data slices evenly to the local learners for better load balance. An important task here is to determine N_s; we propose a fast incremental algorithm, FindNs, to find a proper partition size. FindNs is described in Table II.

TABLE II. THE FINDNS FUNCTION

function FindNs(data, maxstep, maxsize) {
  bestscore = 100;
  currentstep = 1;
  nrowdata = number of rows in data;
  ncoldata = number of columns in data;
  Ns = 1000 * (ncoldata % 10);
  sliceddata = data[1:Ns];
  score = SarcCalculator(sliceddata) * (-1);
  while (score < bestscore && currentstep < maxstep && Ns < maxsize) {
    bestscore = score;
    Ns = Ns * 2;
    sliceddata = data[1:Ns];
    score = SarcCalculator(sliceddata) * (-1);
    currentstep = currentstep + 1;
  }
  return Ns;
}

function SarcCalculator(D, parameters) {
  network = mmhc(D, parameters);
  score = score(network, D, type = "BDe");
  S_Arc = score / (N_D * (# of arcs in network));
  return S_Arc;
}

The FindNs algorithm begins with the initial slice size:

N_s = 1000 · (n_coldata mod 10)   (7)

It then doubles the value of N_s iteratively and evaluates the data partition (data[1:N_s]) until its S_Arc value can no longer be improved, or the maximum number of iterations or the maximum partition size is reached. In this way, the quality of each data partition is ensured by S_Arc, and the data partition size is controlled under a threshold.

C. Local Learner

The first activity in the local learner is Data Quality Evaluation (DQE). During DQE, each data partition is examined with the function SarcCalculator. If the data partition's S_Arc is less than -0.5, the partition is dropped by the SBNL workflow.
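The slice arithmetic of Eq. (5), with the remainder rows forming slice K + 1 and slices dealt evenly to the N_local learners, can be sketched as follows. This Python fragment is our illustration of the stated bookkeeping; plan_partitions and the round-robin assignment policy are assumptions (the paper states only that slices are sent evenly).

```python
def plan_partitions(n_d, n_s, n_local):
    # Eq. (5): N_d = N_s * K + N_r. The N_r remainder rows are counted
    # as one extra slice, so there are K + 1 slices when N_r > 0.
    k, n_r = divmod(n_d, n_s)
    sizes = [n_s] * k + ([n_r] if n_r else [])
    # Deal slices round-robin so each of the N_local learners receives an
    # even share (an assumed policy; the paper says only "evenly").
    assignment = [[] for _ in range(n_local)]
    for i, size in enumerate(sizes):
        assignment[i % n_local].append(size)
    return sizes, assignment
```

With N_d = 10,500,000 rows, N_s = 1,000,000 and four local learners, this yields ten full slices plus one 500,000-row remainder slice, spread 3/3/3/2 across the learners.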
After DQE, the local learner enters its second activity, Local Ensemble Learning (LEL), shown in Table III. In LEL, the first step is learning local BNs from the data partitions using the MMHC algorithm. This step also examines each data partition and selects the best partition. Then, LEL calculates learnScores for the local BNs using the best data partition.

TABLE III. LOCAL ENSEMBLE LEARNING

LocalEnsembleLearning(dataPartitions) {
  # Initialization
  localBNs = learn BNs from dataPartitions using MMHC;
  learnScores = BDe scores of localBNs using the best data partition;
  finalLocalBN = ensembleBNs(localBNs, learnScores, bestPartition);
}

ensembleBNs(localBNs, learnScores, bestPartition) {
  weights = weightCalculator(learnScores, bestPartition);
  mergedMatrix = matrix(0, nNodes, nNodes);
  # Transform and merge local BNs
  for (n in 1:length(localBNs)) {
    adjMatrix = BNToAdjMatrix(localBNs[n]);
    mergedMatrix = adjMatrix * weights[n] + mergedMatrix;
  }
  # Transform merged matrix into final local BN
  minThreshold = min(weights);
  finalLocalBN = MergedMatrixToBN(mergedMatrix, minThreshold * 2);
  return finalLocalBN;
}

Assigning a weight to each individual learner is an important technique in ensemble learning. LEL leverages this weighting technique in a proposed method called ensembleBNs. In ensembleBNs, a weight vector is calculated from the local scores, with weights proportional to the reciprocals of the absolute scores. For example, if localScores = [-0.2, -0.3, -0.25, -0.25], then the corresponding normalized weight vector is [0.306, 0.204, 0.245, 0.245]. A smaller local score in absolute value yields a higher weight. After obtaining the weights, ensembleBNs transforms the local BNs into adjacency matrices and merges them into one matrix using the weight vector. In the end, ensembleBNs uses the merged matrix to generate the final local BN. A threshold is set as the minimal value in the weight vector; ensembleBNs then iterates over the merged matrix and identifies an arc when mergedMatrix[i, j] > minThreshold * 2. This is a voting mechanism that promotes an arc when it is present in more than two local BNs.

D. Master Learner

Having described the local learner, we now introduce the final component of the SBNL workflow: the master learner. The master learner adopts a similar strategy to the local learner and reuses the function ensembleBNs. There are two inputs: the final local BNs, denoted as S_local, and the best local partitions, denoted as D_localBest. The master learner performs four steps: 1) obtain the global best partition D_best from D_localBest; 2) calculate scores for S_local using D_best; 3) call the ensembleBNs function to obtain the final BN; 4) return the final BN as the learning result of the SBNL workflow from Big Data.
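The weight-and-vote scheme shared by the local and master learners can be sketched in Python as follows. The reciprocal-of-absolute-score normalization is reverse-engineered from the paper's numeric example (scores [-0.2, -0.3, -0.25, -0.25] giving weights [0.306, 0.204, 0.245, 0.245]) and reproduces it, but the exact formula, like the function names below, is our assumption; the paper's implementation is the R ensembleBNs of Table III.

```python
def ensemble_weights(scores):
    # Weights proportional to 1/|score|: a smaller absolute (log BDe)
    # score yields a higher weight. (Assumed normalization; it matches
    # the paper's worked example.)
    inv = [1.0 / abs(s) for s in scores]
    total = sum(inv)
    return [w / total for w in inv]

def ensemble_bns(adj_matrices, scores):
    # adj_matrices: one 0/1 adjacency matrix (list of lists) per local BN.
    weights = ensemble_weights(scores)
    n = len(adj_matrices[0])
    merged = [[0.0] * n for _ in range(n)]
    for adj, w in zip(adj_matrices, weights):
        for i in range(n):
            for j in range(n):
                merged[i][j] += w * adj[i][j]
    # Voting: keep arc (i, j) only if its weighted support exceeds twice
    # the minimum weight, i.e. it is present in more than two local BNs.
    threshold = 2 * min(weights)
    return [[1 if merged[i][j] > threshold else 0 for j in range(n)]
            for i in range(n)]
```

With four equally scored learners (weights 0.25 each, threshold 0.5), an arc present in three local BNs has support 0.75 and is kept, while one present in only two has support 0.5 and is dropped, matching the "more than two local BNs" voting rule.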

TABLE IV. MASTER LEARNER

CentralEnsembleLearner(BN_local, D_localBest) {
  obtain the best data partition D_best from D_localBest;
  scores = BDe scores of S_local using D_best;
  finalBN = ensembleBNs(BN_local, scores, D_best);
  return finalBN;
}

E. SBNL Workflow in Kepler

We build our SBNL workflow by embedding the above components in Kepler, as shown in Fig. 3. All the code snippets (namely Tables II, III and IV) are implemented in an R package as the core of the Kepler big data BN learning workflow. The main actors of the top-level workflow, shown in Fig. 3 (a), are the PartitionData and DDPNetworkLearner actors. The first is an RExpression actor that includes the R script for the data partitioning component in Fig. 2. The main parts of this script are provided in Table II.

Fig. 3 (a): Top-level SBNL workflow. Fig. 3 (b): DDP sub-workflow. Fig. 3 (c): Local learner sub-workflow in Map. Fig. 3 (d): Master learner sub-workflow in Reduce. Fig. 3. SBNL workflow in Kepler.

DDPNetworkLearner is a composite actor whose sub-workflow is shown in Fig. 3 (b). Map and Reduce DDP actors are used here to achieve parallel local learner execution and sequential master learner execution. The DDP Director manages the sub-workflow execution by communicating with the underlying DDP engines. The DDPDataSource actor reads the partitions generated by the PartitionData actor and sends each partition to a local learner instance that runs across the computing nodes. The sub-workflow of the Map actor, shown in Fig. 3 (c), mainly calls an RExpression actor to run the Local Learner R script. The main parts of this script are provided in Table III. The sub-workflow of the Reduce actor, shown in Fig. 3 (d), mainly calls an RExpression actor to run the Master Learner R script. The main parts of this script are provided in Table IV. Based on the dependency between the Map and Reduce actors in Fig. 3 (b), the DDP Director manages their executions so that the Reduce actor is only executed after the Map actor finishes all local learner processing. This workflow demonstrates how Kepler can facilitate building parallel network learner algorithms. The DDP framework of Kepler provides basic building blocks for the DDP patterns and supports the dependencies between them. The RExpression actor can easily integrate user R scripts with other parts of a workflow. Kepler also provides subsidiary actors, such as Expression and DDPDataSource, to support operations needed for a complete and executable workflow. Overall, Kepler users can build scalable network learner workflows without writing programs beyond the needed R scripts.

V. EVALUATION

The evaluation results of SBNL are presented in this section. Several big data sets are used to evaluate SBNL. The goal of the evaluation is to address the following questions:
1. When constructing the SBNL workflow, what is the best slice size for each big data set?
2. On all big data sets, does the SBNL workflow achieve good learning accuracy with significant performance improvement?
A brief description of the data sets and a threshold selection study are presented in Subsection A. Subsection B answers the two questions above.

A. Background

The background of the empirical study is described in detail in this subsection. First, the data sets are described and the evaluation measures are presented. Then the threshold selection study is shown. The machine specification for the evaluation of all results is as follows: four compute nodes in a cluster environment are employed, where each node has two eight-core 2.6 GHz CPUs and 64 GB of memory. Each node can access the input data via a shared file system.

1) Data sets and measurements

Three large data sets are used in this empirical study. A brief description of each data set is presented below. The properties of all data sets are summarized in Table V.

TABLE V. DATA SETS
Data set | #Rows (million) | #Arcs | #Variables | Data size (GB)
Alarm10M | | | |
HailFinder10M | | | |
Insurance10M | | | |

Alarm: a medical BN for patient monitoring. HailFinder: a BN that forecasts severe summer hail in the northeastern Colorado area. Insurance: an adaptive BN modeling the car insurance problem. All data sets are generated from these well-known Bayesian networks using logic sampling [20]: for the Alarm network, the data set contains 10 million rows and is called Alarm10M; similarly, the data set for the HailFinder network is called HailFinder10M, and the data set for the Insurance network is called Insurance10M. Since each data set contains 10 million rows, all the data set sizes exceed the normal data set size applicable for BN learning. It is very time consuming, and sometimes infeasible, to learn a BN from these data sets using a traditional BN learning algorithm.

2) Threshold Selection Study

To measure the BN structures learned by SBNL, we use the BDe score and SHD described in Sections II.B and II.C. In Section II, two functions are described: SarcCalculator calculates the Arc Score to measure the quality of a data set D, and FindNs uses SarcCalculator to find the ideal data slice size N_s. It is critical to study and verify the correctness of the function SarcCalculator, to make sure that the data pre-processing phase of SBNL is sound and practical. To evaluate the correctness of S_Arc, we used six different data sets: three good data sets without any noise, followed by three bad data sets with 5% noise, from each BN listed in Table VI. We then calculate S_Arc for each data set and compare it with the S_Arc of the gold standard network (GSN). The SHD is listed for each learned BN. Table VI shows that, given a good data set, the value of S_Arc is very close to the S_Arc of the GSN. This indicates that S_Arc is indeed an accurate measure of the quality of the data sets. Furthermore, it is observed that bad data sets with noise have very low S_Arc, generally lower than -0.5, and the SHD of the corresponding learned BN is far from the correct structure. The column Select indicates whether SBNL selects the data set in the DQE activity.

TABLE VI.
S_ARC OF SIX DIFFERENT DATA SETS
Data set | Rows (K) | S_Arc (MMHC) | S_Arc (GSN) | Select | SHD
Alarm_good | | | | Yes | 4
HailFinder_good | | | | Yes | 26
Insurance_good | | | | Yes | 9
Alarm_Bad | | | | No | 12
HailFinder_Bad | | | | No | 58
Insurance_Bad | | | | No | 21

According to Table VI, we can claim that SBNL has 100% data selection accuracy in its local learner component. Therefore, we can conclude that S_Arc is an accurate measure of the faithfulness of a data set D. After running FindNs on the three big data sets, we obtain N_s for each big data set (Table VII). Note that the S_Arc values shown in Table VII are very close to the S_Arc of the GSN listed in Table VI. This ensures the correctness of the partition size N_s.

TABLE VII. ACCURACY RESULTS OF THREE NETWORKS
Network | N_s | S_Arc
Alarm | |
HailFinder | |
Insurance | |

B. Experiments

We conducted our experiments using four compute nodes in a cluster environment. The tests were done with Hadoop version 2.2. In the tests, one node is assigned to task coordination and the others to worker tasks. We ran our workflow with different numbers of worker nodes to see the scalability of the executions and how performance changes. We also implemented an R program that only uses the original MMHC algorithm for the network learning task. Because the R program has no parallel execution across multiple nodes, no data partitioning step is needed, and it can only run on one node. Its execution time is the baseline for the performance comparisons. We ran our experiments with three data sets, whose execution information is shown in Table VIII, from which we can see that our workflow achieved good scalability when running on more worker nodes. Although our SBNL workflow has an additional step for data partitioning, its execution times are still better than the baseline execution. The overall performance shows less improvement as the worker node number increases, because some steps of the workflow (data partitioning, master learner) cannot utilize the distributed environment for parallel execution. We plan to speed up the data partitioning step by utilizing the parallel data loading and partitioning capability of HDFS 5. We will also run experiments with bigger data sets on larger environments.

TABLE VIII. EXECUTION PERFORMANCE OF THE NETWORK ANALYSIS WORKFLOW AND BASELINE R PROGRAM (UNIT: MINUTES)
Data set | Baseline (16 Core) | 32 Core | 48 Core | 64 Core
Alarm5M (936 MB) | | | |
Alarm10M (1.9 GB) | | | |
Insurance10M (1.9 GB) | | | |

We first give the HailFinder data set to the SBNL workflow. In the first data evaluation actor, the S_Arc value of HailFinder remains very high in absolute value, around 1.5. So the SBNL workflow determines that the HailFinder data set is not suitable for BN learning. To confirm this, we further apply a data set of HailFinder to the MMHC

5 Hadoop Distributed File System (HDFS):

algorithm. The learned BN is very different from the actual HailFinder network, with over 30 missing arcs. This study affirms the correctness of the SBNL workflow: low-quality data sets are rejected at the beginning by SBNL so as to ensure good learning results. We also evaluated the Alarm and Insurance data sets. Both data sets have good quality. The accuracy analysis is summarized in Table IX. The Alarm10M data set is partitioned into 208 partitions and the Insurance10M data set into 625 partitions. For the Alarm10M data set, we compare SBNL's result with that of a smaller data set (Alarm96K) applied directly to the MMHC algorithm on a single machine. Similarly, for the Insurance10M data set, we compare SBNL's result with that of a smaller data set (Insurance16K) applied to the MMHC algorithm.

TABLE IX. NETWORK ACCURACY ANALYSIS
Data set | S_Arc | AA | MA | SHD
Alarm10M (SBNL) | | | |
Alarm96K (Single) | | | |
Insurance10M (SBNL) | | | |
Insurance16K (Single) | | | |

The Alarm data set has good data quality with a very low S_Arc in absolute value; therefore, the learned BN is close to the actual network. The best partition size of the Alarm data set is given in Table VII. We use a separate Alarm data set, Alarm96K, to compare SBNL's accuracy. It is observed that, after applying the Alarm10M data set to SBNL, we learned a BN with 37 correct arcs and zero missing arcs, with a structural Hamming distance of nine. This is close to the learning result of the Alarm96K data set, showing the good learning accuracy of the SBNL workflow. Note that there is no added arc; this is due to the ensemble weighting mechanism of SBNL, which selects popular arcs discovered by the local learners, resulting in a very compact BN with most of the correct arcs. On the other hand, the Insurance data set has a higher S_Arc value in absolute terms, so its learning accuracy is not as good as for the Alarm network. The best partition size of Insurance10M is given in Table VII. It can be observed that the learning results of the Insurance10M data set with the SBNL workflow are similar to those of the Insurance16K data set. Again, this comparison confirms the learning accuracy of the SBNL workflow.

VI.
RELATED WORK

To efficiently manage the massive amounts of data encountered in big data applications, approaches to in-situ analytics have been investigated. Zou et al. explore data reduction via online data compression [32] and apply this idea to large-scale remote visual data exploration [33]. Our approach addresses the data set size problem by using a pre-processing technique to eliminate poor-quality data and by leveraging an ensemble model coupled with distributed processing.

Learning a BN from data is a traditional research area with a long history. Chickering et al. [16] show that finding the optimal BN structure in the graph search space is NP-hard. A comprehensive comparative survey of BN learning algorithms was carried out in [9]. The majority of learning algorithms are not designed for Big Data BN learning: the number of possible BN structures grows super-exponentially with the number of variables, and large data sets can hardly fit in the memory of a single machine. It is therefore advisable to learn a BN from Big Data through distributed computing in a divide-and-conquer fashion. Chen et al. [13] study the problem of learning the structure of a BN from distributed heterogeneous data sources, but their approach focuses on learning sparsely connected networks with different features at each site. In 2010, Na and Yang proposed a method for learning the structure of a BN from distributed data sources [14], but their local learning uses the K2 algorithm, which has only medium accuracy, and the approach does not scale to big data sets. In 2011, Tamada et al. proposed a parallel algorithm for learning optimal BN structure [15], but it is limited to optimal structure search, which is not suitable for large data sets with millions of records. In the Big Data BN learning area, current research focuses mainly on distributed computing methods and scale-up implementations. To the best of our knowledge, this research is the first to bring the workflow concept into Big Data BN learning, which is a key contribution beyond the existing research. 
There are several studies on scaling up machine learning applications. The MapReduce framework has been shown to be broadly applicable to many machine learning algorithms [26]. Das et al. use a JSON query language, called Jaql, as a bridge between R and Hadoop [23]; it provides a new package for HDFS operations. Ghoting and Pednault propose Hadoop-ML, an infrastructure on which developers can build task-parallel or data-parallel machine learning algorithms from program blocks under the language runtime environment [24]. Budiu et al. demonstrate how to use DryadLINQ for machine learning applications such as decision tree induction and k-means [34]. Yet the learning curves of these tools are relatively steep, since researchers have to learn their architectures and interfaces to implement their own data mining algorithms. Wegener et al. introduce a system architecture for GUI-based data mining of large data on clusters, based on MapReduce, that overcomes the limitations of data mining toolkits [25]; it is verified with an implementation based on Weka and Hadoop. This work is similar to ours in that both provide GUI support and Hadoop integration, but our work targets another popular machine learning and data mining tool, namely R, and our framework can adapt to different DDP engines. There are also machine learning workflow tools such as KNIME and IPython Notebook. For instance, KNIME provides many machine learning packages, yet its Big Data extension is currently limited to Hadoop/HDFS access. We have not seen how DDP patterns/sub-workflows are supported in these workflow tools.

VII. CONCLUSIONS

In the Big Data era, techniques for processing and analyzing data must work in contexts where the data set consists of millions of samples and the amount of data is measured in petabytes. By combining machine learning, distributed computing, and workflow techniques, we designed a Scalable Bayesian Network Learning (SBNL) workflow. The workflow includes intelligent Big Data pre-processing and effective BN learning from Big Data by leveraging ensemble learning and a distributed computing model. We also illustrated how the Kepler scientific workflow system can easily provide scalability to Bayesian network learning. It should be noted that this approach can be applied to many other machine learning techniques as well, to make them scalable and Big Data ready. For future work, we plan to improve the performance of the data partition step by integrating the current data partition approach with HDFS to achieve parallel data partitioning and loading. We also plan to apply our work to bigger data sets with more distributed resources to further verify its scalability.

ACKNOWLEDGMENT

This work is supported by the Natural Science Foundation of Jiangsu Province, China under grant No. BK and the National Science Foundation, U.S. under grant DBI.

REFERENCES

[1] R. Lu, H. Zhu, X. Liu, J. K. Liu, J. Shao, "Toward efficient and privacy-preserving computing in big data era," IEEE Network, Vol. 28, Issue 4, 2014.
[2] Y. Zhang, Y. Zhang, E. Swears, N. Larios, Z. Wang, Q. Ji, "Modeling Temporal Interactions with Interval Temporal Bayesian Networks for Complex Activity Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, Issue 10, 2013.
[3] M. Neil, C. Xiaoli, N. Fenton, "Optimizing the Calculation of Conditional Probability Tables in Hybrid Bayesian Networks Using Binary Factorization," IEEE Transactions on Knowledge and Data Engineering, Vol. 24, Issue 7, 2012.
[4] M. Gui, A. Pahwa, S. Das, "Bayesian Network Model With Monte Carlo Simulations for Analysis of Animal-Related Outages in Overhead Distribution Systems," IEEE Transactions on Power Systems, Vol. 26, Issue 3, 2011.
[5] N. E. Fenton, M. Neil, "A critique of software defect prediction models," IEEE Transactions on Software Engineering, Vol. 25, Issue 5.
[6] S. Sun, C. Zhang, G. Yu, "A Bayesian network approach to traffic flow forecasting," IEEE Transactions on Intelligent Transportation Systems, Vol. 7, Issue 1.
[7] K. Dejaeger, T. Verbraken, B. Baesens, "Toward Comprehensible Software Fault Prediction Models Using Bayesian Network Classifiers," IEEE Transactions on Software Engineering, Vol. 39, Issue 2.
[8] M. Neil, N. Fenton, "Using Bayesian Networks to Model the Operational Risk to Information Technology Infrastructure in Financial Institutions," Journal of Financial Transformation, Vol. 22.
[9] Y. Tang, K. Cooper, C. Cangussu, "Bayesian Belief Network Structure Learning Algorithms," Technical Report UTDCS, University of Texas at Dallas.
[10] I. Tsamardinos, L. E. Brown, C. F. Aliferis, "The max-min hill-climbing Bayesian network structure learning algorithm," Machine Learning, Vol. 65, Issue 1.
[11] J. Cheng, R. Greiner, J. Kelly, D. A. Bell, W. Liu, "Learning Bayesian networks from data: An information-theory based approach," Artificial Intelligence, Vol. 137.
[12] X. Xie, Z. Geng, "A Recursive Method for Structural Learning of Directed Acyclic Graphs," Journal of Machine Learning Research, Vol. 9.
[13] R. Chen, K. Sivakumar, H. Kargupta, "Learning Bayesian network structure from distributed data," in Proceedings of the 3rd SIAM International Data Mining Conference.
[14] Y. Na, J. Yang, "Distributed Bayesian network structure learning," in Proceedings of the 2010 IEEE International Symposium on Industrial Electronics (ISIE).
[15] Y. Tamada, S. Imoto, S. Miyano, "Parallel Algorithm for Learning Optimal Bayesian Network Structure," Journal of Machine Learning Research, Vol. 12.
[16] D. M. Chickering, D. Geiger, D. Heckerman, "Learning Bayesian networks is NP-hard," Technical Report MSR-TR-94-17, Microsoft Research.
[17] D. Opitz, R. Maclin, "Popular ensemble methods: An empirical study," Journal of Artificial Intelligence Research, Vol. 11.
[18] B. Zenko, "A comparison of stacking with meta decision trees to bagging, boosting, and stacking with other methods," in Proceedings of the IEEE International Conference on Data Mining (ICDM 2001).
[19] K. Monteith, J. L. Carroll, K. Seppi, T. Martinez, "Turning Bayesian Model Averaging into Bayesian Model Combination," in Proceedings of the International Joint Conference on Neural Networks (IJCNN'11).
[20] J. A. Hoeting, D. Madigan, A. E. Raftery, C. T. Volinsky, "Bayesian Model Averaging: A Tutorial," Statistical Science, Vol. 14, Issue 4.
[21] M. Scutari, Bayesian Network Repository.
[22] I. Beinlich, H. J. Suermondt, R. M. Chavez, G. Cooper, "The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks," in Proceedings of Artificial Intelligence in Medical Care.
[23] S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, J. McPherson, "Ricardo: Integrating R and Hadoop," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10).
[24] A. Ghoting, E. Pednault, "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics," in Proceedings of the Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS'09).
[25] D. Wegener, M. Mock, D. Adranale, S. Wrobel, "Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters," in Proceedings of the International Conference on Data Mining Workshops (ICDMW'09).
[26] C. Chu, S. K. Kim, Y. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, K. Olukotun, "Map-Reduce for machine learning on multicore," in Advances in Neural Information Processing Systems 19.
[27] D. Battre, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke, "Nephele/PACTs: A programming model and execution framework for web-scale analytical processing," in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10), ACM.
[28] B. Ludaescher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. A. Lee, J. Tao, Y. Zhao, "Scientific workflow management and the Kepler system," Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, Vol. 18, Issue 10.
[29] A. Goderis, C. Brooks, I. Altintas, E. Lee, C. Goble, "Heterogeneous composition of models of computation," Future Generation Computer Systems, Vol. 25, Issue 5, 2009.
[30] J. Wang, D. Crawl, I. Altintas, W. Li, "Big Data Applications using Workflows for Data Parallel Computing," Computing in Science & Engineering, Vol. 16, Issue 4, July-Aug. 2014, IEEE.
[31] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, B. Wiswedel, "KNIME: The Konstanz Information Miner," Studies in Classification, Data Analysis, and Knowledge Organization.
[32] H. Zou, F. Zheng, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, Q. Liu, N. Podhorszki, S. Klasky, "Quality-Aware Data Management for Large Scale Scientific Applications," in Proceedings of High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion.
[33] H. Zou, M. Slawinska, K. Schwan, M. Wolf, G. Eisenhauer, F. Zheng, J. Dayal, J. Logan, S. Klasky, T. Bode, M. Kinsey, M. Clark, "FlexQuery: An Online In-situ Query System for Interactive Remote Visual Data Exploration at Large Scale," in Proceedings of the 2013 IEEE International Conference on Cluster Computing (Cluster 2013), pp. 1-8.
[34] M. Budiu, D. Fetterly, M. Isard, F. McSherry, Y. Yu, "Large-scale machine learning using DryadLINQ," in R. Bekkerman, M. Bilenko, J. Langford (Eds.), Scaling up Machine Learning: Parallel and Distributed Approaches, Cambridge University Press, pp. 49-68.
[35] W. Raghupathi, V. Raghupathi, "Big data analytics in healthcare: promise and potential," Health Information Science and Systems, Vol. 2, Issue 1, 3, 2014.