CloudAtlas: Cloud-Based Predictive Modeling Platform for Healthcare Research


Hang Su, Georgia Institute of Technology
Fei Xia, Tsinghua University
Jimeng Sun, Georgia Institute of Technology

ABSTRACT
Rapid adoption of electronic health records (EHR) has led to an increase in the secondary use of EHR data for research, including predictive modeling. While predictive modeling is useful for various clinical applications, domain experts face several challenges: 1) Complex workflow. The predictive modeling process involves many tasks with complex dependencies; effective management and automation of these tasks is challenging. 2) Big data. When dealing with data from millions of patients, it is often impossible to carry out analyses in a timely manner on a single machine using traditional data analysis tools. 3) Infrastructure provisioning. Setting up and maintaining the right infrastructure is time-consuming and expensive, and uncertainty and lack of transparency in running time and cost often drive users away from using cloud computing for predictive modeling. In this paper, we present CloudAtlas, a web-based predictive modeling platform on the cloud for healthcare research. CloudAtlas enables users to conduct predictive modeling in a way that mimics the online shopping experience. To hide complex workflows, CloudAtlas provides configuration guidance through various model templates. To handle big data, CloudAtlas runs predictive modeling workflows in parallel through big data systems such as Hadoop and Spark. To provision the right infrastructure, CloudAtlas estimates the running time and cost of a predictive modeling workflow and then provisions the proper cluster on demand in the cloud. We evaluate CloudAtlas using two clinical predictive modeling problems with large EHR datasets. In particular, we demonstrate that CloudAtlas can achieve a 40X speedup plus 40% cost savings compared to traditional sequential execution. We also show accurate estimation of cost, time and progress using machine learning, with R² ranging from above 0.74 (cost estimation) to above 0.94 (progress estimation).

1. INTRODUCTION
Predictive modeling has numerous applications in medicine and healthcare, including early detection of disease[35], disease state prediction[37], better individualization of care[33] and effective patient management[12]. Predictive models using electronic health records (EHR) have been applied successfully to a wide range of targets including hypertension[37], heart failure[39], and cancer[42]. The need for predictive modeling is rapidly increasing as EHR are adopted widely in all types of clinical environments. The predictive modeling process, however, is extremely challenging for clinical researchers with no or limited computer science background.

Complex workflow: Many predictive modeling pipelines need to be compared in order to find the best model. It is time-consuming (often impractical) to configure all the pipelines manually, and even after complex configuration, efficient predictive pipeline scheduling tools are very limited.

Big data: With the rapid adoption of EHR, population-level analysis requires interactively analyzing data from millions of patients from various information sources. Large-scale data and fast modeling time pose great challenges for clinical predictive modeling.

Infrastructure provisioning: Cloud computing provides an easy way for users to access large compute resources on demand. However, users are often uncertain about the right infrastructure to provision and the associated running time and financial cost.
Beyond healthcare applications, similar needs and challenges arise in many industries including retail[36], energy, cybersecurity[15], and science and medicine[27][7]. We argue that predictive modeling has today become a building-block service in healthcare analytics, which demands an intuitive, scalable and transparent development process. New systems need to be designed to address these challenges. In this paper, we attempt to design such a system by leveraging an analogous activity that everyone is familiar with and capable of conducting: online shopping. Table 1 illustrates the parallels between online shopping and predictive modeling. Our goal is to study the analogy and to design a predictive modeling system that provides benefits similar to those online shopping offers. We present CloudAtlas, an intuitive, scalable and transparent predictive modeling platform that runs on the cloud with the following benefits:

Intuitive: CloudAtlas is intentionally designed with a simple web-based interface to take predictive modeling orders from our target users, clinical researchers. We further differentiate users' skill levels by providing template-based auto-configuration and recommendation for basic users, while exposing detailed choices of analytic tasks to advanced users.

                 Online shopping                             Predictive modeling
Workflow         simple intuitive web interface              programming with multiple tools
Backend          parallel processing of millions of orders   sequential model building
Infrastructure   predictable shipping estimates              lack of running time/cost estimates
Table 1: Analogy between online shopping and predictive modeling.

Scalable: CloudAtlas runs on a cloud cluster using big data techniques, Spark[41] and Hadoop[10], so that predictive modeling pipelines can be quickly computed and compared in parallel. CloudAtlas also facilitates launching a compute cluster on demand for a predictive modeling workload according to the user's requirements.

Transparent: To avoid uncertainty about the infrastructure needed and the associated time and cost, we design a machine learning estimation method for running time and financial cost given the predictive modeling configuration. In turn, we can recommend the optimal cloud setup for the current workload.

We systematically evaluate CloudAtlas using two EHR datasets with 30K and 2M patients, respectively. CloudAtlas demonstrates significant performance gains compared to sequential computation: a 40X speedup and 40% cost reduction can be achieved on a 20-node AWS on-demand cluster. In terms of time/cost estimation, our experiments showed that our estimation is very accurate, with R² > 0.94 for progress estimation, R² > 0.83 for running time estimation and R² > 0.74 for cost estimation.

The rest of this paper is organized as follows. First, in Section 2 we describe the architecture of CloudAtlas and give a high-level overview of the system. Next, in Sections 3, 4 and 5 we show how we make the system intuitive, scalable and transparent, respectively. We show experimental results in Section 6 and discuss issues like privacy in Section 7. Then we present related work in Section 8. Finally, we conclude in Section 9.

2. CLOUDATLAS OVERVIEW
In this section, we first show the architecture of the system and how each component corresponds to the intuitive, scalable and transparent design characteristics that we want to achieve. Then, we introduce the terminology used in this paper.

2.1 Architecture
Figure 1 shows the architecture of CloudAtlas, which consists of the following components: 1) the User Interface provides the online shopping-like web interface for users to configure predictive modeling pipelines; 2) the Workflow Builder extracts distinct tasks and their dependencies from the user-specified predictive modeling pipelines; 3) the Cost Estimator provides cost, running time and progress estimation; 4) the Cluster Launcher provisions clusters with the necessary modules in the cloud computing environment; and 5) the Execution Engine computes all the tasks in parallel by following the task dependencies.

Figure 1: CloudAtlas system architecture (User Interface; Web App Server hosting the Workflow Builder, Cost Estimator and Cluster Launcher; and an Elastic MapReduce cluster running the Execution Engine).

Next we present the design highlights of CloudAtlas in pursuing an intuitive, scalable and transparent predictive modeling platform. Towards building an intuitive platform for clinical researchers, we simplify predictive modeling processes through a simple web-based interface defined based on modeling templates. Users can then use the Cluster Launcher to provision clusters in the cloud on demand. The detailed design tradeoffs for intuitiveness are presented in Section 3. The Workflow Builder and the Execution Engine work together to make the system scalable.
In particular, we use big data systems such as Hadoop, Cascading and Spark to handle each step of predictive modeling and parallelize the different tasks as much as possible. We describe the scalability design choices in Section 4. The Cost Estimator makes cost, running time and progress transparent to users and can suggest a cluster configuration. We show the design choices for transparency in Section 5.

2.2 Terminology
Now we introduce the terminology used in this paper.

Event is the 4-field input tuple with the format (patient id, event name, date, value). For example, (patient123, icd9v250.03, …, 1) encodes that patient123 was diagnosed with the disease of ICD-9 code v250.03 (diabetes) on December 31 of the recorded year.

Pipeline refers to a sequence of computational tasks that leads to a predictive model. For example, Figure 2.A is one example of such a pipeline, where the feature selector and classifier are part of the pipeline. The resulting predictive model is a function that maps existing patient data to an expected future target.

We call a specific step in the pipeline a Task, denoted as one rectangular block in Figure 2.

Workflow indicates a task dependency graph derived from multiple predictive modeling pipelines. To compare a set of pipelines, distinct tasks are identified to form a workflow represented as a Directed Acyclic Graph (DAG).

Within a workflow, we call the set of tasks of the same type a Stage; e.g., the feature selection stage contains all feature selection tasks from all pipelines.
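To make these terms concrete, here is a small illustrative sketch (plain Python; the names are ours, not CloudAtlas's actual internals) of how events can be represented and how merging pipelines into a workflow deduplicates shared tasks:

    from collections import namedtuple

    # One raw EHR record in the 4-field Event format (illustrative field names).
    Event = namedtuple("Event", ["patient_id", "event_name", "date", "value"])

    def build_workflow(pipelines):
        """Merge pipelines into a DAG: {task: set of prerequisite tasks}.

        A task is identified by a hashable (type, params) tuple, so identical
        tasks from different pipelines collapse into a single DAG node.
        """
        dag = {}
        for pipeline in pipelines:
            prev = None
            for task in pipeline:
                dag.setdefault(task, set())
                if prev is not None:
                    dag[task].add(prev)   # edge: prev must finish before task
                prev = task
        return dag

    # Two pipelines sharing cohort and feature construction yield a single
    # DAG in which the shared prefix appears only once (cf. Figure 2.B).
    p1 = [("cohort", "case-control"), ("features", "count"), ("train", "kNN")]
    p2 = [("cohort", "case-control"), ("features", "count"), ("train", "SVM")]
    assert len(build_workflow([p1, p2])) == 4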

Figure 2: Predictive modeling pipeline and workflow. Panel A shows the pipeline producing one predictive model (cohort construction, feature construction, cross-validation, feature selection, classifier training and classifier testing). Panel B shows the workflow formed by merging multiple pipelines (two cross-validation folds, 5% and 20% feature selection, kNN and SVM classifiers) and removing duplicate tasks.

Figure 3: Feature construction for asthma readmission prediction from EHR data. Different horizontal lines represent different types of events (diagnostic, procedure, medication); e.g., blue triangles represent diagnostic events. The observation window precedes the prediction window, which ends at the index date (the asthma readmission date). Events that happened within the observation window are aggregated for predicting asthma readmission.

3. INTUITIVENESS
To make the system intuitive for clinical researchers to use, we simplify the predictive modeling process by 1) introducing a predictive modeling pipeline template and the associated web interface and 2) easing infrastructure provisioning using the cloud. In this section, we start by showing the predictive modeling pipeline template. Then we describe the design tradeoffs in the User Interface component. Finally, we describe the Cluster Launcher component.

3.1 Pipeline Template
Different clinical studies may have different goals and use different datasets and algorithms. However, most predictive models are built using a similar pipeline template, as in Figure 2.A. This particular pipeline template has six tasks: cohort construction, feature construction, cross-validation, feature selection, training and testing. Note that we focus on this particular template, but our system design is generalizable to other templates.

Cohort construction specifies the patient set and their corresponding prediction target values and index dates. We support various study designs for constructing a patient set, including case-control studies and cohort studies[13]. For example, to conduct a case-control study for asthma readmission, we first identify all the cases with asthma readmission. Each case patient has target value 1, which indicates the presence of asthma readmission, and an index date equal to the readmission date. Then we match each case to one or more control patients on certain characteristics such as age, race, gender and clinic. We also make sure the selected controls do not have asthma readmission (i.e., target value 0). The choice of index date for controls is flexible; it can be the index date of the corresponding case or the latest date in the control patient's records.

Feature construction computes a feature vector for each patient based on the patient's EHR data. More specifically, feature construction is anchored by the index date (such as the asthma readmission date). The prediction window is a time window right before the index date, while the observation window is another time window prior to the prediction window, as illustrated in Figure 3. To convert event sequences into features, all events with the same event name within the observation window are aggregated into a single feature value. Basic aggregation functions include count, sum, mean and latest. Note that the target to predict is also an event or an aggregation of events. For example, for mortality prediction, the prediction target is the occurrence of a certain event; for heart failure prediction, the target could be the occurrence of any diagnostic event representing a kind of heart failure.
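As a sketch of this aggregation step (illustrative Python; the function and field names are ours, not CloudAtlas's API, and the real system performs this in parallel with Spark), feature construction can be expressed as a scan that keeps only the events falling inside each patient's observation window:

    from collections import defaultdict
    from datetime import timedelta

    def construct_features(events, index_dates, observation_days=360,
                           prediction_days=10, aggregate=len):
        """Aggregate each patient's events inside the observation window.

        events:      iterable of (patient_id, event_name, date, value) tuples
        index_dates: {patient_id: index date} produced by cohort construction
        aggregate:   aggregation over matched values (count by default;
                     sum, mean and latest are analogous)
        """
        buckets = defaultdict(list)   # (patient_id, event_name) -> [values]
        for pid, name, date, value in events:
            if pid not in index_dates:
                continue              # patient not in the cohort
            window_end = index_dates[pid] - timedelta(days=prediction_days)
            window_start = window_end - timedelta(days=observation_days)
            if window_start <= date < window_end:
                buckets[(pid, name)].append(value)
        # One sparse feature per (patient, event name) pair.
        return {key: aggregate(vals) for key, vals in buckets.items()}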
Feature selection, training and testing are typical machine learning steps. EHR data involved in predictive modeling always includes diagnoses, procedures, medications, etc., so the raw feature dimension can reach hundreds of thousands. Feature selection reduces the dimension by calculating a score for each feature and retaining only the most predictive features. The training task trains a specific classifier on training data using the features retained by feature selection. The testing task applies a feature selector and a classifier to testing data to calculate metrics such as area under the ROC curve (AUC), F1, precision and recall for the trained model.

In some predictive modeling settings, other tasks may be required, for example, incorporating domain knowledge to refine the feature construction result. We argue that our abstraction captures most settings. For special scenarios, the result from our system provides a quick starting point for further exploration. For example, rather than trying various feature selectors or classifiers, the user can just pick the model with the best performance and refine it using domain knowledge, which dramatically speeds up the predictive modeling process.

3.2 User Interface
Our target users are clinical domain experts, not programmers or machine learning researchers. To let clinical researchers conduct predictive modeling independently, we want to mimic the experience of online shopping. In online shopping, basic customers quickly order with the default settings if they think the defaults are fine or know little about the options, while advanced users can customize the product with just a few simple inputs on top of the default settings. We provide a web wizard to collect user inputs for all predictive modeling tasks. For cohort construction and feature construction we provide default settings, e.g., a prediction window[14] of 365 days and accepting all patients into the cohort. However, users usually need to try many different models to find the most useful one. For feature selection and model training/testing, we therefore allow the user to select all classifiers and feature selectors that the user thinks may be useful. A high-level specification sample is as follows:

- Cohort construction: use the MIMIC2 patient dataset; predict mortality.
- Feature construction: diagnosis, procedure, lab test, and fluid balance events, with a sum aggregation function for lab tests and a count aggregation for the others.
- Cross-validation: 3-fold cross-validation.
- Feature selection: select the top 5% and 20% of features based on chi-squared statistics.
- Classifier: k-nearest Neighbors (kNN)[6] and Support Vector Machine (SVM)[9].
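For reference, the same order might be rendered as the following JSON document (a hypothetical example: the key names are illustrative and not the exact schema accepted by the RESTful API described below):

    import json

    spec = {
        "cohort": {"dataset": "MIMIC2", "target": "mortality"},
        "features": {"diagnoses": "count", "procedures": "count",
                     "lab_test": "sum", "fluid_balance": "count"},
        "cross_validation": {"folds": 3},
        "feature_selection": {"method": "chi2", "percentiles": [5, 20]},
        "classifiers": ["knn", "svm"],
    }
    print(json.dumps(spec, indent=2))   # body of a POST to the API endpoint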

This specification results in a set of 12 pipelines (2 feature selectors x 2 classifiers x 3 cross-validation folds). After the predictive modeling workflow is computed, we deliver a model performance report and the optimal model to the user. A graphical user interface is chosen to enable fast modeling by clinical researchers, but it is not flexible by design. To alleviate this constraint, we also provide a RESTful API that accepts pipeline specifications in JSON format.

3.3 Cluster Launcher
To ease the pain of maintaining infrastructure and let users focus on the analysis itself, we utilize AWS Elastic MapReduce to set up a cluster for each analysis workflow when the user submits a workflow specification. Elastic MapReduce provisions machines on top of EC2 and configures the Hadoop cluster; we just need to specify the size of the cluster and the type of machines to use. As we will see later in the experiments and in the description of the Cost Estimator component, a cost/performance tradeoff decision must be made here. We also enable stand-alone machine learning packages such as scikit-learn[32] on the cloud. After all the tasks are finished, the cluster is terminated. Allocating a new cluster per workflow helps avoid resource competition between different workflows and makes the real cost of running a workflow easy to determine. A potential drawback is that provisioning and terminating a cluster consumes some time; according to our observations, it usually takes less than 10 minutes to provision a cluster with 20 nodes, and termination is almost instant. We develop the Cluster Launcher as a separate component because we want to be flexible in infrastructure. Currently, the Cluster Launcher can either launch clusters in AWS for production or use a dedicated cluster for development. In the future, we want to support other cloud providers like Azure and containers like Docker[29].

4. SCALABILITY
One key challenge in predictive modeling is handling big data. We solve this by leveraging big data frameworks including Hadoop[10] and Spark[41]. In this section, we first describe the Workflow Builder component and then detail the implementation of each task within the Execution Engine component.

4.1 Workflow Builder
The Workflow Builder converts pipeline configurations into unique tasks and their dependencies, represented by a dependency graph as shown in Figure 2.B. The dependency graph is a directed acyclic graph (DAG) where each node represents a task and each edge encodes a dependency between two tasks. Such a dependency graph eliminates all the redundant tasks and provides the essential guide for the Execution Engine to compute all tasks in the right order in parallel.

4.2 Execution Engine
The Execution Engine exploits two types of parallelism for speedup:

- Inter-task parallelism: Given the dependency graph, individual tasks can be computed in a topological order. Multiple tasks can then run in parallel as long as their dependencies have completed (see the sketch after this list).
- Intra-task parallelism: An individual task can often itself be computed in parallel. However, the specific strategy for parallelism varies across different tasks.
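The production system delegates this scheduling to Cascading, as described next, but the idea behind inter-task parallelism is simple enough to sketch. The toy scheduler below (an illustration only, not our actual engine) submits a task as soon as all of its prerequisites have finished, which yields a topological-order, maximally parallel execution of the dependency DAG:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_workflow(dag, run_task, max_workers=8):
        """Execute a dependency DAG ({task: set of prerequisites}) in parallel."""
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while len(done) < len(dag):
                # Submit every task whose prerequisites have all completed.
                for task, deps in dag.items():
                    if task not in done and task not in running and deps <= done:
                        running[task] = pool.submit(run_task, task)
                # Block until at least one running task finishes.
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for task, fut in list(running.items()):
                    if fut in finished:
                        fut.result()          # re-raise task failures
                        done.add(task)
                        del running[task]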
In terms of implementation, we choose the big data framework best suited to each specific purpose; hence multiple big data frameworks appear in our technology stack.

For inter-task parallelism, we use the high-level programming tool Cascading[2] for running tasks in parallel with respect to their dependencies. In particular, Cascading supports setting up a DAG of Cascading flows, where each Cascading flow contains several operations and corresponds to a task in our system. The framework resolves dependencies and parallelizes the execution of Cascading flows in topological order. Currently, the backend of Cascading is Hadoop, so a sequence of MapReduce jobs is carried out. In the future, Cascading may switch to more efficient backends such as Spark, and we expect to receive that speedup benefit almost for free.

For intra-task parallelism, we focus on speeding up individual tasks in the workflow, including cohort construction, feature construction, feature selection and model training/testing. Since cohort and feature construction are data processing and data integration in nature, an in-memory data flow system like Spark fits very well. The results of these tasks are a feature matrix and a target vector stored in SVMLight format, which is accepted by many machine learning packages such as Python scikit-learn. For feature selection and model training/testing, intra-task parallelism is more complex and subtle, and often requires parallel implementations of the specific algorithms being used. Instead of trying to implement those algorithms on our own (which could be time-consuming and difficult to maintain), we could explore the existing parallel implementations in Spark MLlib. Instead, we wrap the algorithms of Python scikit-learn through Cascading, since scikit-learn provides a more comprehensive set of machine learning algorithms than Spark MLlib. In particular, all these ML implementations are invoked through Hadoop Streaming jobs inside Cascading. Some detailed considerations about the implementation follow. For feature selection, mappers just emit their input training samples to a single reducer as-is; the output of the single reducer is the feature selector fitted on the input training data. For model training, besides the normal input data, mappers load a feature selector model, apply it to the training samples, and emit the transformed training samples. Just like feature selection, training requires all training samples to be collected together, so we also set the number of reducers for this Hadoop Streaming job to 1 and run classifier training in the single reducer; the output of the reducer is a predictive model (e.g., a classification model) fitted on the samples with the feature selector applied. For model testing, mappers load the feature selector and the trained classifier as side data: our system copies the feature selector and classifier to a local directory for each mapper. Each mapper can then apply a feature selector and a predictive model to the testing samples from standard input and emit the prediction results together with the real target values to the reducer. The reducer aggregates them to calculate the score for the given combination of feature selector and predictive model.
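As an illustration of this pattern, a minimal single-reducer training script might look as follows. This is a sketch under the assumption that SVMLight-formatted training samples arrive on standard input; the real system wires scikit-learn into Cascading via Hadoop Streaming and handles side-data distribution itself:

    #!/usr/bin/env python
    # Sketch of the single-reducer model-training step (illustrative only).
    import pickle
    import sys
    from io import BytesIO

    from sklearn.datasets import load_svmlight_file
    from sklearn.svm import LinearSVC

    def main():
        # Hadoop Streaming delivers the mappers' output on standard input.
        X, y = load_svmlight_file(BytesIO(sys.stdin.buffer.read()))
        clf = LinearSVC().fit(X, y)          # train on all collected samples
        with open("model.pkl", "wb") as f:   # shipped as side data to testers
            pickle.dump(clf, f)

    if __name__ == "__main__":
        main()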

5. TRANSPARENCY
One possible source of reluctance for users of CloudAtlas is uncertainty about the expected cost and running time. The analogy to online shopping is that shoppers would be very reluctant to make purchase decisions if item prices and shipping times were not given. However, that is the current reality of using cloud computing for predictive modeling: no clear cost or running time estimates are provided. Making the situation even worse, users also have to decide what infrastructure to provision without a clear understanding of the total cost and running time. Provisioning too large a cluster wastes resources, while provisioning too small a cluster can incur huge delays and sometimes even execution failures. Our goal in this section is to propose methods that provide transparency to users in terms of running time and cost estimation.

5.1 Problem Definition
Given a runtime environment r and a workflow w, we are interested in estimating the following:

- Cost is the total cost c that will be charged by the cloud provider for completing w on r using CloudAtlas.
- Running time is the total time t_total taken to complete w on r; t_total starts from the user's invocation of the pipelines in our system and ends with cluster termination.
- Progress is the percentage of the running time completed at a given time point, p = t_elapsed / t_total, where t_elapsed is the time elapsed since the beginning of the workflow. If t_remain denotes the remaining running time, then t_total = t_elapsed + t_remain.

5.2 Estimation through Regression
The high-level idea of our method is 1) to characterize the workflow, the associated data and the underlying machines as a feature vector x, and then 2) to learn a regression function mapping x to the expected output y (i.e., running time, cost or progress). In particular, we train the regression model using historical runs as the collection of training data {(x_i, y_i)}, i = 1..n, where n is the number of modeling runs in the history. We choose such a black-box model instead of a white-box model, in which we would have to explicitly model the underlying algorithms or parallel computing framework (such as Hadoop), because we want our methods to be generally applicable to broad (and even unseen) settings. The tradeoff is that we first have to acquire enough history in order to model future runs. In a real-world setting, we believe such history can be quickly acquired by most web services, including the predictive modeling service we are creating. We systematically evaluate the estimation performance of such black-box models in the experiments. In terms of specific model choices, we use both linear and non-linear regression models:

- Linear regression is fast and easy to interpret, but assumes a linear relation between input features and output. In particular, the model assumes that y_i = θᵀx_i + ε, where θ holds the parameters of the model to learn and ε ~ N(0, σ²) is the error. The linear assumption is too strong for our problem, as we show in the experiments.
- Polynomial regression is similar to linear regression but applies a polynomial transformation to the original features. In our experiments, we use second-order polynomials. For example, an original feature vector [a, b] is replaced with [a, b, a², ab, b²] if the polynomial degree d is 2. Such a transformation gives a richer model but increases the dimensionality of the input features; as a result, it is more expensive than the linear model.
- Gaussian process is a non-linear model in which any finite combination of the random variables has a joint Gaussian distribution. The kernel we use is the polynomial kernel defined as k(x, x') = σ² (xᵀx' + c)ᵈ.

To deal with high dimensionality and insufficient samples, regularization is commonly used to provide more robust estimates. In our system, we adopt two popular regularizations: Ridge regularization (L2)[19] and Lasso regularization (L1)[38].
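For instance, a polynomial regression with Ridge regularization can be assembled directly in scikit-learn (a minimal sketch; the hyperparameter values are illustrative):

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    def fit_runtime_estimator(x, y, degree=2, alpha=1.0):
        """x: one row of features per historical run; y: the observed target."""
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=degree),  # second-order expansion of inputs
            Ridge(alpha=alpha),                 # L2 regularization
        )
        return model.fit(x, y)

    # estimator = fit_runtime_estimator(history_features, history_runtimes)
    # t_hat = estimator.predict(new_workflow_features)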
5.3 Features
Features play an important role in the success of machine learning models. Here we describe all the features that we use.

Machine-related features include the number of machines in the cluster, the total memory of the cluster and the total number of CPU cores of the cluster. We realize that many other potential features could be added, but we hope to capture the key parallelization factors of the underlying machines with a few simple measures.

Dataset-related features include the size of the input dataset, the number of patients, the number of features and the number of non-zero entries. CloudAtlas deals with sparse patient healthcare events; both the input to the system and the intermediate results use sparse representations. In such cases, besides the number of features and the number of patients, the number of non-zero entries also matters.

Workflow-related features comprise characteristics of the analysis workflow w. We use the number of independent tasks and the number of dependency relationships. In addition, we incorporate healthcare-analysis-specific settings, including the observation window size, as features.

Progress-specific features are used for progress estimation only. A workflow contains tasks of different types, e.g., feature construction, model training, etc. When estimating the progress of a workflow at a certain time point, we use the time elapsed since the beginning of the workflow and the percentage of completed tasks of each type as features. Unlike the other features mentioned above, which can be collected before running a workflow, these features are available at run time only.

The total number of features is 16, consisting of 3 machine, 4 dataset, 3 workflow and 6 progress features.
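A sketch of how such a 16-dimensional vector might be assembled (the field names are our own illustration, not the system's exact feature schema):

    def estimation_features(cluster, dataset, workflow, runtime=None):
        """3 machine + 4 dataset + 3 workflow features, plus 6 at run time."""
        x = [
            cluster["num_machines"], cluster["total_memory"], cluster["total_cores"],
            dataset["size_bytes"], dataset["num_patients"],
            dataset["num_features"], dataset["num_nonzeros"],
            workflow["num_tasks"], workflow["num_dependencies"],
            workflow["observation_window_days"],
        ]
        if runtime is not None:  # progress estimation only
            x.append(runtime["elapsed_seconds"])
            # Completion fraction for each of the 5 task types (stages).
            x.extend(runtime["completed_fraction_by_stage"])
        return x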

5.4 Running Time and Cost Estimation
Estimations for both running time and cost use the same set of features in the regression model but with different target variables: the actual running time and the cloud computing cost (we compute the cost based on the charge price of on-demand instances on AWS), respectively. It is worth pointing out that we estimate running time at the granularity of seconds, but the cloud provider usually charges for service consumption in hours, where any fraction of an hour is counted as a full hour; for example, 74 minutes is billed as 2 hours. Given a running time estimate t in seconds and the unit price of the cloud service u in dollars per hour, the cost function g(t) is

    g(t) = ⌈t / 3600⌉ · u.    (1)

With the distribution of the total running time estimated as t ~ N(µ, σ²), we can estimate the expected cost as

    E(cost) = ∫ g(t) f(t) dt    (2)

where f is the probability density function of t (Gaussian).

5.5 Progress Estimation
We use all the candidate features described in the features subsection above for progress estimation. Our prediction target is the remaining time of a workflow. With the estimated remaining time of a workflow t̂_remain, we can calculate the estimated progress as

    p̂ = t_elapsed / (t_elapsed + t̂_remain).

We can generate training data for this estimation task by collecting the history of past workflow runs. Specifically, we evenly sample 200 progress records from each workflow, accumulating 8,600 samples.

6. EXPERIMENTS
In this section, we first show the intuitiveness of CloudAtlas using a real use case for asthma readmission. Then, we evaluate the scalability of the system with two EHR datasets of different characteristics. Finally, we show the accuracy of our estimation methods for running time, cost and progress.

6.1 Intuitiveness: Case Demonstration
We demonstrate an actual use case of CloudAtlas in collaboration with Children's Healthcare of Atlanta (CHOA) on predicting pediatric asthma readmission for 3K inpatients (balanced cases and controls). The web-based UI was used by a clinical researcher (an MD-PhD student in our lab) with clinical guidance from one physician and one nurse at CHOA. The input data include patient demographics, diagnoses, medications, procedures and lab results exported from CHOA's internal reporting system. The raw inputs are de-identified and processed into the tuple format that CloudAtlas requires; those tuples are then uploaded to secure storage for predictive modeling. With the adoption of CloudAtlas, the clinical researcher was able to make quick progress on predictive modeling once the data was ready: he obtained prediction performance and feature characteristics within an hour with minimal help from our system developers. The predictive modeling results are reported on the web front-end, including model accuracy measures such as AUC, F1, precision and recall. Furthermore, the results can be exported to a spreadsheet for offline reporting and analysis. The predictive features are also listed in descending order of importance to the predictive model. The performance results and predictive features were presented to physicians, who gave feedback about incorporating length of stay as an additional feature. Based on the feedback, the clinical researcher could immediately rebuild the model with the new features during the meeting and discuss the new results right away.
In contrast, before adopting CloudAtlas it might take days for a researcher to update predictive modeling results, and it often takes weeks to schedule a follow-up meeting with busy physicians. CloudAtlas provides a simple predictive modeling framework that can shorten the development and validation process dramatically.

6.2 Scalability Evaluation
In this subsection, we first describe the dataset setup. Then, we evaluate scalability in terms of running time under different cluster and data settings: we run the system with various machine types, numbers of machines, observation window sizes and input data sizes. We also show the corresponding cost of using our system in the cloud.

6.2.1 Environment and Data Setup
To demonstrate the scalability and efficiency of our system, we use the MIMIC2 clinical dataset[34] to predict mortality within 3 months after discharge, as in [12], and the CMS synthetic dataset[18] to predict heart failure. High-level statistics about these two datasets are listed in Table 2.

             MIMIC2          CMS
Source       ICU patients    Medicare claims
Duration     8 years         3 years
Size         844MB           6.88GB
#Patients    32,431          2,069,684
#Features    21,…            …,914
#Records     31,813,…        …,309,933
Table 2: Statistics of the two EHR datasets used to evaluate the performance of our system.

We extract diagnosis, procedure, lab test, and fluid balance events from the MIMIC2 clinical dataset. From the CMS synthetic dataset we extract diagnosis, procedure, payment, and medication events. We choose these two datasets because they have very different characteristics: MIMIC2 provides detailed data on a smaller population, while CMS provides less detailed data on a larger population. In AWS Elastic MapReduce, the Hadoop version we use is …. Before running predictive modeling workflows with our system, patient data must be stored somewhere accessible to the cluster. When running Hadoop with Elastic MapReduce, one typically uploads data into Amazon's Simple Storage Service (S3). The two datasets we use are converted into the Event format and uploaded to Amazon S3 using s3cmd. The time spent to upload the CMS and MIMIC2 datasets is 8m11s and 1m7s, respectively, and the costs of storing them are $0.21/month and $0.03/month, which is minimal compared to the remaining predictive modeling process.

Type        Description       CPU  Mem (GB)  Price
m3.2xlarge  General Purpose   8    26        $0.70/h
c3.2xlarge  Compute Optimize  8    15        $0.53/h
r3.xlarge   Memory Optimize   …    …         $0.44/h
r3.2xlarge  Large Memory      8    61        $0.88/h
Table 3: Configuration and cost of 4 machine types in the Amazon US East Region (N. Virginia). Prices were collected on 2014/11/13 for EMR plus the Linux virtual machines that host EMR; we report costs according to Amazon US East Region (N. Virginia) pricing.

The predictive modeling workflow for mortality prediction uses the following specification:

- Feature types: diagnoses, procedures, lab tests, and fluid balance, with a sum aggregation function for lab tests and a count aggregation for the others. Normalization of the aggregation result is enabled for all feature types.
- Events filtered with an observation window of 360 days and a prediction window of 10 days.
- Eight feature selection settings: select the top 1%, 2%, 5%, and 10% of features based on chi-squared statistics and the ANOVA F-value.
- Four classifier algorithms: Logistic Regression, Perceptron, Naive Bayes and Linear Support Vector Classifier.
- 5 times 10-fold cross-validation (i.e., running 10-fold CV 5 times).

The resulting dependency graph contains more than 3000 nodes; in particular, there are 400 feature selection tasks and 1600 classification tasks. The predictive modeling workflow specification for heart failure detection with the CMS synthetic dataset is similar to that of mortality prediction with the MIMIC2 clinical dataset, differing in the following components:

- Feature types: diagnoses, procedures, payments, and medications, with a sum aggregation function for payments and a count aggregation function for the others.
- Three classifier algorithms: Logistic Regression, Perceptron and Linear Support Vector Classifier.

6.2.2 Machine Type
In the AWS cloud computing environment, different machine types are provided, usually optimized for different purposes: some for CPU-intensive jobs, some for memory-intensive jobs. Here we evaluate 4 types of virtual machine: c3.2xlarge, r3.2xlarge, r3.xlarge and m3.2xlarge. The configuration and price of the different machines are listed in Table 3. We run this experiment with the same cluster size of 10. This size includes one master node of the m3.xlarge machine type (priced at $0.35/h); the actual cluster has 9 slave nodes that carry out the real computation. The corresponding running time and cost for the two workflows on the different machine types are shown in Figure 4.

Figure 4: Running time and cost of the workflows for the two datasets. Larger circles represent runs with the CMS dataset, smaller ones with MIMIC2. Cost is calculated using the same pricing mechanism as AWS (less than one hour counts as one hour) with the prices listed in Table 3.

For both cases, the Large Memory machine type gives the fastest result. This is because most tasks in the workflow are memory intensive; more memory can lead to better parallelization. The costs of running with different machine types do not vary much for the CMS heart failure prediction workflow, but the running time varies a lot, mainly because a slow machine can be cheap but may require a longer running time. Given similar cost, users will definitely want to get results earlier. In the later experiments, we therefore use the Large Memory machine type for the CMS data, as it gives the lowest running time.
For mortality prediction with the MIMIC2 clinical dataset, Compute Optimize is slowest and Memory Optimize is cheapest. The reason Memory Optimize is cheapest is that it best balances parallelization and unit cost for the given workflow on the given dataset. In the later experiments, we use Memory Optimize for the MIMIC2 data. Our system scales well for different sizes of data, as the total running time is within hours for both workflows. In Figure 5, we also show the breakdown of running time for the different stages of a workflow, corresponding to a workflow run with the CMS dataset on a 10-node Large Memory cluster.

Figure 5: Running time of each stage for a workflow instance with the CMS synthetic dataset for heart failure prediction. The cluster size is 10 and the machine type is Large Memory. The top yellow bar is the total time of running all the tasks of the workflow sequentially, in topological order, on a single Large Memory machine.

The overlap between the training and testing phases means that the system can deliver testing results almost immediately after training finishes. Boosted by the parallelization of tasks, the total running time is dramatically reduced compared to running all tasks sequentially on a single machine of the same kind.

6.2.3 Cluster Size
Besides machine type, the size of the cluster also impacts the total running time and thus the cost. In this part of the experiment, we run the two workflows with different cluster sizes to show how our system scales up. The choice of machine type is as described in the previous experiment. We run clusters of size 5, 10 and 20 for the two workflows. Figure 6 shows the running time and cost for the different cluster sizes.

Figure 6: Running time (in log scale) and cost for the workflows run on clusters of various sizes. The single blue dot represents the running time and cost of running every task sequentially on a single machine of the same type as the cluster. For CMS, a 40x speedup is achieved with 20 nodes in the cluster.

We can see that long-running workflows benefit more from our system. For the smaller MIMIC2 dataset, the speedup is limited. For the CMS dataset, there is a larger reduction in running time with more machines: with 20 nodes in the cluster, the workflow on CMS achieves about a 40x speedup compared to a sequential run. The speedup is not linear in the cluster size, as the time consumed for cluster provisioning, dependency graph generation, etc. cannot be reduced by adding more nodes. The delay caused by the distributed processing framework makes it not worthwhile to analyze small datasets with a cluster. To balance time and cost, in the remaining experiments we choose a cluster size of 10 for both datasets.

6.2.4 Workflow and Dataset Property
In this part of the experiment, we test how our system behaves under different workflow and dataset settings. Specifically, we vary the observation window and the input dataset size. The size of the observation window determines the number of records that enter model training and thus may lead to different workflow running times and prediction accuracies. We test observation windows of 180, 240, 300 and 360 days. Other workflow settings are identical to the previous experiments. As stated previously, the machine type for the CMS dataset is Large Memory and for MIMIC2 is Memory Optimize, and the cluster size is 10. Figure 7 shows the result: smaller observation windows lead to shorter running times, as expected.

Figure 7: Running time and cost of the workflow with different observation windows.

Next, we test the performance of our system using different dataset sizes. We split the CMS dataset into datasets of smaller size and run the same predictive modeling for heart failure. Figure 8 clearly shows that the larger the data size (represented by the number of event records in the dataset), the longer the running time. The running time and cost increase linearly with the input data size. This trend shows that our system scales well.

Figure 8: Time and cost of the same workflow with different input dataset sizes.

6.3 Transparency Evaluation
With past running records of the system, we conduct experiments to test the effectiveness of our estimation methods and show results in the order of running time estimation, cost estimation and progress estimation. We collect detailed logs of 43 individual workflow runs. From these workflows, we accumulate 8,600 workflow progress records by sampling 200 data points from each workflow. As all three estimation problems in this paper are formulated as regression problems, we use two widely used metrics to validate the performance of our methods with 5-fold cross-validation. First, we test the root mean squared error (RMSE), also known as the root mean squared prediction error (RMSPE), on test samples, defined as

    RMSE = sqrt( (1/n) · Σ_{i=1..n} (ŷ_i − y_i)² )

where n is the number of testing samples and ŷ_i is our estimate of y_i. In our problem, y is either the cost c of the workflow, the total running time t, or the remaining running time t_remain, depending on the problem. Second, we measure R² on the cross-validation set, defined as

    R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where ȳ is the average of the y_i. The higher the R² and the lower the RMSE, the more effective the method.
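Both metrics can be computed directly with scikit-learn; a minimal sketch of this evaluation protocol:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import KFold, cross_val_predict

    def evaluate(model, x, y):
        """5-fold cross-validated RMSE and R^2 for a regression model."""
        y_hat = cross_val_predict(model, x, y, cv=KFold(n_splits=5, shuffle=True))
        rmse = np.sqrt(mean_squared_error(y, y_hat))
        return rmse, r2_score(y, y_hat)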

6.3.1 Running Time Estimation
For running time estimation, we compare six methods:

- Linear Basic: linear regression without regularization.
- Linear Lasso: linear regression with Lasso regularization.
- Linear Ridge: linear regression with Ridge regularization.
- Polynomial Ridge: polynomial regression with Ridge regularization.
- Polynomial Lasso: polynomial regression with Lasso regularization.
- Gaussian Process: polynomial-kernel Gaussian process.

Method             RMSE   R²
Linear Basic       …      …
Linear Lasso       …      …
Linear Ridge       …      …
Polynomial Ridge   …      …
Polynomial Lasso   …      …
Gaussian Process   …      …
Table 4: Running time estimation performance evaluation.

The performance is shown in Table 4. From the result we can see that Polynomial Ridge performs best.

6.3.2 Feature Contribution
As described in Section 5, we use machine-related, dataset-related and workflow-related features for running time estimation. In this part of the experiment, we show how the three feature groups perform in combination. We test three settings: 1) machine-related features only; 2) machine-related and dataset-related features together; 3) all three groups of features. We only test the Polynomial Ridge method, as it performs best in the previous experiment. From Figure 9 we can see that estimation works best when all groups of features are applied, which means the features we use are meaningful. Note that machine features alone seem inadequate for predicting running time, but dataset and workflow features provide much more predictive power.

Figure 9: Feature contribution analysis for running time estimation. With more features added, we see a decrease in RMSE and an increase in R². The estimation performance is best when all features are applied.

6.3.3 Extrapolation Ability Analysis
In real-world settings, the most challenging case is needing to use the history of small workflows to estimate larger workloads. This is called the extrapolation problem. In this experiment, we simulate this situation by sorting the workflow samples by running time in ascending order, using the first 60% of workflows (small workflows) as the training set and the last m% (large workflows) as the test set. We use Polynomial Ridge as the regression method and vary m from 5 to 40 with a step of 5. Extrapolation is especially hard when m is small (i.e., only the largest workflows are tested): a low m means that the testing data is very different from the training data, and the estimation performance will be worse. In this experiment, we measure the coefficient of variation of the RMSE (CVRMSE), a normalized version of RMSE, defined as

    CVRMSE = RMSE / ȳ

The performance is shown in Figure 10, where we observe that CVRMSE decreases slowly, which shows reasonable performance in the challenging extrapolation setting.

Figure 10: Extrapolation ability analysis. The x-axis shows m, the percentage of data used in the test. A lower m means more difference between training and testing data and thus more difficulty in estimating running time. The y-axis shows the coefficient of variation of the RMSE (CVRMSE). The nearly stable line means that our method can extrapolate fairly well.

6.3.4 Cost Estimation
For cost estimation, our method works by first estimating the distribution of the total running time as t ~ N(µ, σ²). We compare the following methods:

- Polynomial Ridge Distribution: estimate the distribution of the running time t ~ N(µ, σ²) using Polynomial Ridge, then estimate the cost using Eq. (2).
- Gaussian Process Distribution: estimate t ~ N(µ, σ²) using a Gaussian process, then estimate the cost using Eq. (2).

Method                         RMSE   R²
Polynomial Ridge Distribution  …      …
Gaussian Process Distribution  …      …
Table 5: Cost estimation performance comparison.

Table 5 shows the performance of the two methods. From the table we see that Polynomial Ridge Distribution works better.
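To make Eq. (2) concrete: because g(t) is a step function in whole hours, the integral collapses into a sum of hourly bills weighted by the probability mass that t falls in each hour. A sketch using SciPy (the tail-truncation threshold is an illustrative choice):

    from scipy.stats import norm

    def expected_cost(mu, sigma, unit_price):
        """E[cost] for t ~ N(mu, sigma^2) under hour-granularity billing (Eq. 2).

        g(t) = ceil(t/3600) * unit_price is piecewise constant, so the
        integral reduces to a sum over hour-long intervals. Mass below t = 0
        is ignored, which is negligible when mu >> sigma.
        """
        cost, hours = 0.0, 1
        while norm.cdf((hours - 1) * 3600.0, mu, sigma) < 0.9999:
            p = (norm.cdf(hours * 3600.0, mu, sigma)
                 - norm.cdf((hours - 1) * 3600.0, mu, sigma))
            cost += hours * unit_price * p
            hours += 1
        return cost

    # expected_cost(5000, 600, 0.88)  # e.g. an r3.2xlarge at $0.88/h (Table 3)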
With accurate estimation, CloudAtlas can make the cost transparent to the user, rather than presenting a quote that is significantly different from the actual cost.

6.3.5 Progress Estimation
Like running time estimation, we compare five methods: Linear Basic, Linear Ridge, Linear Lasso, Polynomial Ridge and Polynomial Lasso. We do not compare with Gaussian Process because it is not a sparse model and is not efficient in high-dimensional spaces; for progress estimation in CloudAtlas, which requires frequent re-estimation, Gaussian Process is not suitable. In this experiment, we split the training and test progress samples on a per-workflow basis: samples from the same workflow are either all in the training set or all in the test set. Table 6 shows that Polynomial Ridge performs best among all the methods considered.

Method            RMSE   R²
Linear Basic      …      …
Linear Ridge      …      …
Linear Lasso      …      …
Polynomial Ridge  …      …
Polynomial Lasso  …      …
Table 6: Progress estimation performance comparison. Gaussian Process is not compared, as it is not a sparse model and is not efficient in high-dimensional space.

In real-world settings, in fact, at any time point while the workflow is still running we have the same workflow's past progress records, which can be used for estimating current progress by retraining the regression model. Intuitively, this gives a more accurate estimation, as progress records from the same workflow are strongly correlated with each other. We verify this by training regression models not only with progress records from other workflows, but also with progress records of the same workflow collected earlier. This setting improves Polynomial Ridge's RMSE and R² further. Next, we show the estimated progress of two workflow instances; the results are displayed in Figure 11. By retraining a regression model using the workflow's own past progress records as well as other workflows' records, we reduce the absolute error. Also, as time passes and more progress records of the same workflow accumulate, the error is further reduced. For the estimation with retraining, we observe stage-like patterns, corresponding to the different stages of the workflow: within a workflow stage the error is nearly constant, and after a stage completes the error drops.

Figure 11: Instances of progress estimation. The x-axis is the workflow progress. The y-axis is the absolute error of the estimated remaining time. The blue line is obtained by training regression models using progress records of other workflows only. The red line is obtained by training regression models using the current workflow's past progress records plus the other workflows' records. Clearly, adding the current workflow's progress reduces the error. The estimated progress error shows a pattern of stages, which can be explained by the stages of the workflow.
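Putting the pieces together, the re-estimation step itself is small (a sketch; the estimator is any of the regression models above, retrained as new progress records arrive):

    def estimate_progress(estimator, runtime_features, t_elapsed):
        """Estimated progress p = t_elapsed / (t_elapsed + estimated t_remain).

        estimator: a regression model (e.g. Polynomial Ridge) trained on past
        progress records; runtime_features: the feature vector sampled now.
        Assumes t_elapsed > 0, which holds whenever progress is re-estimated.
        """
        t_remain = max(float(estimator.predict([runtime_features])[0]), 0.0)
        return t_elapsed / (t_elapsed + t_remain)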
7. DISCUSSION
We discuss related issues and potential extensions.

Data Privacy and Security: EHR data can be sensitive, especially Protected Health Information (PHI). Many national efforts, such as Vanderbilt's Synthetic Derivative, have started to produce comprehensive de-identified EHR data for research, so the security risk of analyzing such data on the public cloud is decreasing. At the same time, cloud security has improved significantly, and most cloud vendors, including AWS and Azure, offer HIPAA-compliant cloud infrastructure. Besides that, private clouds using the same technology have already been deployed, such as AWS GovCloud[1], where a system like CloudAtlas can be directly applied safely.

Heterogeneous data: One of the limitations is that we currently only support structured data. In the future, we will explore feature construction with unstructured data such as unstructured notes, time series and images. Our system is implemented with distributed storage support from the AWS S3 service, where clinical notes and images can easily be stored and retrieved from the same S3 data store in a HIPAA-compliant manner.

8. RELATED WORK
Related work falls into two categories: predictive modeling systems and system performance estimation.

8.1 Predictive Modeling Platform
Once features are constructed and formatted in the required format, general-purpose machine learning tools like Weka[16] and RapidMiner[20] can help users try specific machine learning models. However, as described in Section 3, many possible models need to be explored, and the cohort construction and feature construction steps should be automated to speed up the process and reduce possible human errors. Cloud-based machine learning platforms like Azure Machine Learning can help create workflows to compare models, but they also need formatted feature input; otherwise, users need to apply their own scripts. As a result, there is still a lot of work left for clinical researchers to complete manually. Other healthcare data mining systems [22, 28, 14] may provide complete predictive modeling pipeline functions like those in Figure 2, as well as web-based user interfaces for workflow configuration and result display. However, unlike our system, which runs on the cloud and uses big data techniques, they tend to run on a single dedicated server and are thus not as scalable and flexible as ours. The PARAMO system[31] is the most relevant work in terms of function. That work explored the benefits of using a Hadoop[3] cluster to speed up predictive modeling.

However, PARAMO is designed for a dedicated cluster, which is not always available to clinical researchers. Furthermore, users cannot change cluster settings on demand for different predictive modeling workloads, so it is not flexible enough.

8.2 System Performance Estimation
Several studies exist on system performance estimation in database systems and big data systems like Hadoop MapReduce. Some can be regarded as white-box models, as they try to analyze the underlying complex mechanisms; others are black-box models that use machine learning. [5] and [11] are typical methods using machine learning for database query performance prediction. For machine learning methods, a proper feature combination is important; [24] studied operator-specific features. For MapReduce running time estimation, [17] employed a white-box model to analyze a single MapReduce job. For a complex workflow like ours, it is hard to apply such analysis, as error accumulated on each MapReduce job can lead to huge error in the workflow-level estimate. [21] proposed a grey-box approach for performance estimation of individual Hadoop jobs, applying machine learning to a set of Hadoop-specific features. We adopt a similar method but at the workflow level, which can involve thousands of Hadoop jobs. Progress estimation is continuous, online performance prediction with additional runtime information; it can use either white-box or black-box models. [25] and [23] are examples of white-box and black-box database progress estimation, respectively. [30] proposes a white-box analysis method for estimating the progress of MapReduce DAGs. In our setting, however, a workflow is composed of thousands of heterogeneous tasks, and white-box modeling of that kind would be too complex.

9. CONCLUSION
We developed CloudAtlas, an intuitive, scalable and transparent predictive modeling platform that runs on the cloud. CloudAtlas is developed as an intuitive web application for clinical researchers with various skill levels. CloudAtlas is scalable, as it runs on the cloud using big data frameworks such as Spark and Hadoop in parallel. Instead of dedicated servers, CloudAtlas provisions appropriate clusters based on the configuration of the predictive modeling pipelines. One novel aspect of CloudAtlas is to provide accurate cost and running time estimates using black-box regression models based on historical runs, which helps users make more informed and transparent choices. Systematic experimental evaluation using two EHR datasets demonstrates significant performance gains of CloudAtlas compared to sequential computation: a 40X speedup and 40% cost reduction can be achieved on a 20-node AWS on-demand cluster. In terms of time/cost estimation, our experiments showed that our estimation is very accurate, with R² > 0.94 for progress estimation, R² > 0.83 for running time estimation and R² > 0.74 for cost estimation.

10. ACKNOWLEDGEMENTS
This work was supported by the National Science Foundation (award #…), a Google Faculty Award, an AWS Research Award, a Microsoft Azure Research Award, and Children's Healthcare of Atlanta.

11. REFERENCES
[1] AWS GovCloud (US) Region Overview - Government Cloud Computing.
[2] Cascading application platform for enterprise big data.
[3] Welcome to Apache Hadoop!
[4] M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Predicting completion times of batch query workloads using interaction-aware models and simulation. In Proceedings of the 14th International Conference on Extending Database Technology. ACM.
[5] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. IEEE, 2012.
[6] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 1992.
[7] R. Bellazzi and B. Zupan. Predictive data mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics, 77(2):81-97.
[8] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer, 2010.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 2008.
[11] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. I. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Data Engineering (ICDE), 2009 IEEE 25th International Conference on. IEEE, 2009.
[12] M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits. Unfolding physiological state: mortality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 75-84. ACM, 2014.
[13] L. Gordis. Epidemiology. Elsevier Health Sciences.
[14] D. Gotz, H. Stavropoulos, J. Sun, and F. Wang. ICDA: a platform for intelligent care delivery analytics. In AMIA Annual Symposium Proceedings, volume 2012, page 264. American Medical Informatics Association, 2012.
[15] F. L. Greitzer and D. A. Frincke. Combining traditional cyber security audit data with psychosocial data: towards predictive modeling for insider threat mitigation. In Insider Threats in Cyber Security. Springer.

12 mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10 18,2009. [17] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proceedings of the VLDB Endowment, 4(11): , [18] J. C. Ho, J. Ghosh, and J. Sun. Extracting phenotypes from patient claim records using nonnegative tensor factorization. In Brain Informatics and Health, pages Springer, [19] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics,12(1):55 67,1970. [20] M. Hofmann and R. Klinkenberg. RapidMiner: Data mining use cases and business analytics applications. CRC Press, [21] S. Kadirvel and J. A. Fortes. Grey-box approach for performance prediction in map-reduce based platforms. In Computer Communications and Networks (ICCCN), st International Conference on, pages1 9.IEEE,2012. [22] S. Klenk, J. Dippon, P. Fritz, and G. Heidemann. Interactive survival analysis with the ocdm system: From development to application. Information Systems Frontiers, 11(4): ,2009. [23] A. C. König, B. Ding, S. Chaudhuri, and V. Narasayya. A statistical approach towards robust progress estimation. Proceedings of the VLDB Endowment, 5(4): ,2011. [24] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for sql queries using statistical techniques. Proceedings of the VLDB Endowment, 5(11): ,2012. [25] J. Li, R. Nehme, and J. Naughton. Gslpi: A cost-based query progress indicator. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages IEEE,2012. [26] J. Lin and A. Kolcz. Large-scale Machine Learning at Twitter. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD 12, pages , New York, NY, USA, ACM. [27] Z. J. Ling, Q. T. Tran, J. Fan, G. C. Koh, T. Nguyen, C. S. Tan, J. W. Yip, and M. Zhang. Gemini: An integrative healthcare analytics system. Proceedings of the VLDB Endowment, 7(13),2014. [28] D. McAullay, G. Williams, J. Chen, H. Jin, H. He, R. Sparks, and C. Kelman. A delivery framework for health data mining and analytics. In Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38, pages Australian Computer Society, Inc., [29] D. Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2,2014. [30] K. Morton, M. Balazinska, and D. Grossman. Paratimer: a progress indicator for mapreduce dags. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages ACM, [31] K. Ng, A. Ghoting, S. R. Steinhubl, W. F. Stewart, B. Malin, and J. Sun. PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. Journal of Biomedical Informatics. [32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12: ,2011. [33] J. Pittman, E. Huang, H. Dressman, C.-F. Horng, S. H. Cheng, M.-H. Tsou, C.-M. Chen, A. Bild, E. S. Iversen, A. T. Huang, et al. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences of the United States of America, 101(22): ,2004. [34] M. Saeed, M. Villarroel, A. T. Reisner, G. Cli ord, L.-W. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark. 
Multiparameter intelligent monitoring in intensive care ii (mimic-ii): a public-access intensive care unit database. Critical care medicine, 39(5):952,2011. [35] R. Summers, M. Pipke, S. Wegerich, G. Conkright, and K. Isom. Functionality of empirical model-based predictive analytics for the early detection of hemodynamic instabilty. Biomedical sciences instrumentation, 50: ,2014. [36] C. Sun, N. Rampalli, F. Yang, and A. Doan. Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. Proceedings of the VLDB Endowment, 7(13),2014. [37] J. Sun, C. D. McNaughton, P. Zhang, A. Perer, A. Gkoulalas-Divanis, J. C. Denny, J. Kirby, T. Lasko, A. Saip, and B. A. Malin. Predicting changes in hypertension control using electronic health records from a chronic disease management program. Journal of the American Medical Informatics Association, pages amiajnl 2013, [38] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages ,1996. [39] J. Wu, J. Roy, and W. F. Stewart. Prediction modeling using ehr data: challenges, strategies, and a comparison of machine learning approaches. Medical care, 48(6):S106 S113,2010. [40] W. Wu, Y. Chi, S. Zhu, J. Tatemura, H. Hacigumus, and J. F. Naughton. Predicting query execution time: Are optimizer cost models really unusable? In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages IEEE,2013. [41] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10 10, [42] D. Zhao and C. Weng. Combining pubmed knowledge and ehr data to develop a weighted bayesian network for pancreatic cancer prediction. Journal of biomedical informatics, 44(5): ,
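As promised in Section 8.2, the sketch below illustrates the black-box, workflow-level estimation idea: fit a regression model on features of historical workflow runs to predict running time, then derive cost from the time estimate and the cluster configuration. This is a minimal illustration, not the CloudAtlas implementation: the feature set (input size, number of jobs, cluster size), the gradient-boosted regressor, the synthetic training data, and the hourly node price are all assumptions made for the example.

```python
# Minimal sketch of black-box running-time/cost estimation from historical
# workflow runs, in the spirit of Section 8.2. Feature names, model choice,
# synthetic data, and pricing are illustrative assumptions only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_runs = 500

# Hypothetical workflow-level features logged for each historical run:
# input size (GB), number of Hadoop/Spark jobs in the workflow, cluster size.
X = np.column_stack([
    rng.uniform(1, 500, n_runs),       # input_gb
    rng.integers(10, 2000, n_runs),    # num_jobs
    rng.integers(2, 20, n_runs),       # num_nodes
])

# Synthetic ground-truth running time (hours): grows with data size and job
# count, shrinks with cluster size, plus noise standing in for variability.
y_time = (0.02 * X[:, 0] + 0.005 * X[:, 1]) / X[:, 2] + rng.normal(0, 0.2, n_runs)

# Train a black-box regressor on past runs; predict time for held-out runs.
X_tr, X_te, y_tr, y_te = train_test_split(X, y_time, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred_time = model.predict(X_te)
print("running-time R^2:", r2_score(y_te, pred_time))

# Cost follows from the time estimate and the provisioned cluster:
# cost = predicted_hours * num_nodes * hourly_price (assumed $0.25/node-hour).
HOURLY_PRICE = 0.25
pred_cost = pred_time * X_te[:, 2] * HOURLY_PRICE
true_cost = y_te * X_te[:, 2] * HOURLY_PRICE
print("cost R^2:", r2_score(true_cost, pred_cost))
```

In this spirit, each completed workflow contributes one training example, and the fitted model can be queried before provisioning to present the user with an estimated running time and dollar cost for a candidate cluster size.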
