CloudAtlas: Cloud-Based Predictive Modeling Platform for Healthcare Research


Hang Su, Georgia Institute of Technology
Fei Xia, Tsinghua University
Jimeng Sun, Georgia Institute of Technology

ABSTRACT
Rapid adoption of electronic health records (EHR) has led to an increase in the secondary use of EHR data for research, including predictive modeling. While predictive modeling is useful for various clinical applications, domain experts face several challenges: 1) Complex workflow. The predictive modeling process involves many tasks with complex dependencies; effective management and automation of these tasks is challenging. 2) Big data. When dealing with data from millions of patients, it is often impossible to carry out analyses in a timely manner on a single machine using traditional data analysis tools. 3) Infrastructure provisioning. Setting up and maintaining the right infrastructure is time-consuming and expensive, and uncertainty and lack of transparency in running time and cost often drive users away from using cloud computing for predictive modeling. In this paper, we present CloudAtlas, a web-based predictive modeling platform on the cloud for healthcare research. CloudAtlas enables users to conduct predictive modeling in a way that mimics the online shopping experience. To hide complex workflows, CloudAtlas provides configuration guidance through various model templates. To handle big data, CloudAtlas runs predictive modeling workflows in parallel through big data systems such as Hadoop and Spark. To provision the right infrastructure, CloudAtlas estimates the running time and cost of a predictive modeling workflow and then provisions the proper cluster on demand in the cloud. We evaluate CloudAtlas using two clinical predictive modeling problems with large EHR datasets. In particular, we demonstrate that CloudAtlas can achieve a 40X speedup plus 40% cost savings compared to traditional sequential execution. We also show accurate estimation of cost, time and progress using machine learning, with R² ranging from above 0.74 (cost estimation) to above 0.94 (progress estimation).

1. INTRODUCTION
Predictive modeling has numerous applications in medicine and healthcare, including early detection of disease[35], disease state prediction[37], better individualization of care[33] and effective patient management[12]. Predictive models using electronic health records (EHR) have been applied successfully to a wide range of targets including hypertension[37], heart failure[39], and cancer[42]. The need for predictive modeling is rapidly increasing as EHR are adopted widely in all types of clinical environments. The predictive modeling process, however, is extremely challenging for clinical researchers with no or limited computer science background.

Complex workflow: Many predictive modeling pipelines need to be compared in order to find the best model. It is time-consuming (often impractical) to configure all the pipelines manually, and even after complex configuration, efficient predictive pipeline scheduling tools are very limited.

Big data: With the rapid adoption of EHR, population-level analysis requires interactively analyzing data from millions of patients from various information sources. Large-scale data and fast modeling time pose great challenges for clinical predictive modeling.

Infrastructure provisioning: Cloud computing provides an easy way for users to access large compute resources on demand. However, users are often uncertain about the right infrastructure to provision and the associated running time and financial cost.
Beyond healthcare applications, similar needs and challenges arise in many industries including retail[36], energy, cybersecurity[15], and science and medicine[27][7]. We argue that predictive modeling has today become a building-block service in healthcare analytics, which demands an intuitive, scalable and transparent development process. New systems need to be designed to address these challenges. In this paper, we attempt to design such a system by leveraging an analogous activity that everyone is familiar with and capable of conducting: online shopping. Table 1 illustrates the parallels between online shopping and predictive modeling. Our goal is to study the analogy and to design a predictive modeling system that provides benefits similar to those online shopping offers. We present CloudAtlas, an intuitive, scalable and transparent predictive modeling platform that runs on the cloud with the following benefits:

Intuitive: CloudAtlas is intentionally designed with a simple web-based interface to take predictive modeling orders from our target users, clinical researchers. We further differentiate users' skill levels by providing template-based auto-configuration and recommendation for basic users, while exposing detailed choices of analytic tasks to advanced users.

                 Online shopping                             Predictive modeling
Workflow         simple intuitive web interface              programming with multiple tools
Backend          parallel processing of millions of orders   sequential model building
Infrastructure   predictable shipping estimates              lack of running time/cost estimates
Table 1: Analogy between online shopping and predictive modeling.

Scalable: CloudAtlas runs on a cloud cluster using big data techniques, Spark[41] and Hadoop[10], so that predictive modeling pipelines can be quickly computed and compared in parallel. CloudAtlas also facilitates launching a compute cluster on demand for a predictive modeling workload according to the user's requirements.

Transparent: To avoid uncertainty about the infrastructure needed and the associated time and cost, we design a machine learning estimation method for running time and financial cost given the predictive modeling configuration. In turn, we can recommend the optimal cloud setup for the current workload.

We systematically evaluate CloudAtlas using two EHR datasets with 30K and 2M patients, respectively. CloudAtlas demonstrates significant performance gains compared to sequential computation: a 40X speedup and 40% cost reduction can be achieved on a 20-node AWS on-demand cluster. In terms of time/cost estimation, our experiments showed that our estimation is very accurate, with R² > 0.94 for progress estimation, R² > 0.83 for running time estimation and R² > 0.74 for cost estimation.

The rest of this paper is organized as follows. First, in Section 2 we describe the architecture of CloudAtlas and give a high-level overview of the system. Next, in Sections 3, 4 and 5 we show how we make the system intuitive, scalable and transparent, respectively. We show experimental results in Section 6 and discuss issues like privacy in Section 7. Then we present related work in Section 8. Finally, we conclude in Section 9.

2. CLOUDATLAS OVERVIEW
In this section, we first show the architecture of the system and how each component corresponds to the intuitive, scalable and transparent design characteristics that we want to achieve. Then, we introduce the terminology used in this paper.

2.1 Architecture
Figure 1 shows the architecture of CloudAtlas, which consists of the following components: 1) the User Interface provides the online shopping-like web interface for users to configure predictive modeling pipelines; 2) the Workflow Builder extracts distinct tasks and their dependencies from the user-specified predictive modeling pipelines; 3) the Cost Estimator provides cost, running time and progress estimation; 4) the Cluster Launcher provisions clusters with the necessary modules in the cloud computing environment; and 5) the Execution Engine computes all the tasks in parallel by following the task dependencies.

Figure 1: CloudAtlas system architecture (User Interface; Web App Server hosting the Workflow Builder, Cost Estimator and Cluster Launcher; and an Elastic MapReduce cluster running the Execution Engine).

Next we present the design highlights of CloudAtlas in pursuing an intuitive, scalable and transparent predictive modeling platform. Towards building an intuitive platform for clinical researchers, we simplify predictive modeling processes through a simple web-based interface defined based on modeling templates. Users can then use the Cluster Launcher to provision clusters in the cloud on demand. The detailed design tradeoffs for intuitiveness are presented in Section 3. The Workflow Builder and the Execution Engine work together to make the system scalable.
In particular, we use big data systems such as Hadoop, Cascading and Spark to handle each step of predictive modeling and parallelize the different tasks as much as possible. We describe the scalability design choices in Section 4. The Cost Estimator makes cost, running time and progress transparent to users and can suggest a cluster configuration. We show the design choices for transparency in Section 5.

2.2 Terminology
Now we introduce the terminology used in this paper.

Event is the 4-field input tuple with the format (patient id, event name, date, value). For example, (patient123, icd9v250.03, …, 1) encodes that patient123 was diagnosed with the disease of ICD-9 code v250.03 (diabetes) on December 31 of the recorded year.

Pipeline refers to a sequence of computational tasks that leads to a predictive model. For example, Figure 2.A is one example of such a pipeline, where the feature selector and classifier are part of the pipeline. The resulting predictive model is a function that maps existing patient data to an expected future target.

We call a specific step in the pipeline a Task, denoted as one rectangular block in Figure 2.

Workflow indicates a task dependency graph derived from multiple predictive modeling pipelines. To compare a set of pipelines, distinct tasks are identified to form a workflow represented as a Directed Acyclic Graph (DAG).

Within a workflow, we call the set of tasks of the same type a Stage; e.g., the feature selection stage contains all feature selection tasks from all pipelines.
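To make these terms concrete, here is a small illustrative sketch (plain Python; the names are ours, not CloudAtlas's actual internals) of how events can be represented and how merging pipelines into a workflow deduplicates shared tasks:

    from collections import namedtuple

    # One raw EHR record in the 4-field Event format (illustrative field names).
    Event = namedtuple("Event", ["patient_id", "event_name", "date", "value"])

    def build_workflow(pipelines):
        """Merge pipelines into a DAG: {task: set of prerequisite tasks}.

        A task is identified by a hashable (type, params) tuple, so identical
        tasks from different pipelines collapse into a single DAG node.
        """
        dag = {}
        for pipeline in pipelines:
            prev = None
            for task in pipeline:
                dag.setdefault(task, set())
                if prev is not None:
                    dag[task].add(prev)   # edge: prev must finish before task
                prev = task
        return dag

    # Two pipelines sharing cohort and feature construction yield a single
    # DAG in which the shared prefix appears only once (cf. Figure 2.B).
    p1 = [("cohort", "case-control"), ("features", "count"), ("train", "kNN")]
    p2 = [("cohort", "case-control"), ("features", "count"), ("train", "SVM")]
    assert len(build_workflow([p1, p2])) == 4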

Figure 2: Predictive modeling pipeline and workflow. Panel A shows the pipeline producing one predictive model (cohort construction, feature construction, cross-validation, feature selection, classifier training and classifier testing). Panel B shows the workflow formed by merging multiple pipelines (two cross-validation folds, 5% and 20% feature selection, kNN and SVM classifiers) and removing duplicate tasks.

Figure 3: Feature construction for asthma readmission prediction from EHR data. Different horizontal lines represent different types of events (diagnostic, procedure, medication); e.g., blue triangles represent diagnostic events. The observation window precedes the prediction window, which ends at the index date (the asthma readmission date). Events that happened within the observation window are aggregated for predicting asthma readmission.

3. INTUITIVENESS
To make the system intuitive for clinical researchers to use, we simplify the predictive modeling process by 1) introducing a predictive modeling pipeline template and the associated web interface and 2) easing infrastructure provisioning using the cloud. In this section, we start by showing the predictive modeling pipeline template. Then we describe the design tradeoffs in the User Interface component. Finally, we describe the Cluster Launcher component.

3.1 Pipeline Template
Different clinical studies may have different goals and use different datasets and algorithms. However, most predictive models are built using a similar pipeline template, as in Figure 2.A. This particular pipeline template has six tasks: cohort construction, feature construction, cross-validation, feature selection, training and testing. Note that we focus on this particular template, but our system design is generalizable to other templates.

Cohort construction specifies the patient set and their corresponding prediction target values and index dates. We support various study designs for constructing a patient set, including case-control studies and cohort studies[13]. For example, to conduct a case-control study for asthma readmission, we first identify all the cases with asthma readmission. Each case patient has target value 1, which indicates the presence of asthma readmission, and an index date equal to the readmission date. Then we match each case to one or more control patients on certain characteristics such as age, race, gender and clinic. We also make sure the selected controls do not have asthma readmission (i.e., target value 0). The choice of index date for controls is flexible; it can be the index date of the corresponding case or the latest date in the control patient's records.

Feature construction computes a feature vector for each patient based on the patient's EHR data. More specifically, feature construction is anchored by the index date (such as the asthma readmission date). The prediction window is a time window right before the index date, while the observation window is another time window prior to the prediction window, as illustrated in Figure 3. To convert event sequences into features, all events with the same event name within the observation window are aggregated into a single feature value. Basic aggregation functions include count, sum, mean and latest. Note that the target to predict is also an event or an aggregation of events. For example, for mortality prediction, the prediction target is the occurrence of a certain event; for heart failure prediction, the target could be the occurrence of any diagnostic event representing a kind of heart failure.
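As a sketch of this aggregation step (illustrative Python; the function and field names are ours, not CloudAtlas's API, and the real system performs this in parallel with Spark), feature construction can be expressed as a scan that keeps only the events falling inside each patient's observation window:

    from collections import defaultdict
    from datetime import timedelta

    def construct_features(events, index_dates, observation_days=360,
                           prediction_days=10, aggregate=len):
        """Aggregate each patient's events inside the observation window.

        events:      iterable of (patient_id, event_name, date, value) tuples
        index_dates: {patient_id: index date} produced by cohort construction
        aggregate:   aggregation over matched values (count by default;
                     sum, mean and latest are analogous)
        """
        buckets = defaultdict(list)   # (patient_id, event_name) -> [values]
        for pid, name, date, value in events:
            if pid not in index_dates:
                continue              # patient not in the cohort
            window_end = index_dates[pid] - timedelta(days=prediction_days)
            window_start = window_end - timedelta(days=observation_days)
            if window_start <= date < window_end:
                buckets[(pid, name)].append(value)
        # One sparse feature per (patient, event name) pair.
        return {key: aggregate(vals) for key, vals in buckets.items()}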
Feature selection, training and testing are typical machine learning steps. EHR data involved in predictive modeling always includes diagnoses, procedures, medications, etc., so the raw feature dimension can reach hundreds of thousands. Feature selection reduces the dimension by calculating a score for each feature and retaining only the most predictive features. The training task trains a specific classifier on training data using the features retained by feature selection. The testing task applies a feature selector and a classifier to testing data to calculate metrics such as area under the ROC curve (AUC), F1, precision and recall for the trained model.

In some predictive modeling settings, other tasks may be required, for example, incorporating domain knowledge to refine the feature construction result. We argue that our abstraction captures most settings. For special scenarios, the result from our system provides a quick starting point for further exploration. For example, rather than trying various feature selectors or classifiers, the user can just pick the model with the best performance and refine it using domain knowledge, which dramatically speeds up the predictive modeling process.

3.2 User Interface
Our target users are clinical domain experts, not programmers or machine learning researchers. To let clinical researchers conduct predictive modeling independently, we want to mimic the experience of online shopping. In online shopping, basic customers quickly order with the default settings if they think the defaults are fine or know little about the options, while advanced users can customize the product with just a few simple inputs on top of the default settings. We provide a web wizard to collect user inputs for all predictive modeling tasks. For cohort construction and feature construction we provide default settings, e.g., a prediction window[14] of 365 days and accepting all patients into the cohort. However, users usually need to try many different models to find the most useful one. For feature selection and model training/testing, we therefore allow the user to select all classifiers and feature selectors that the user thinks may be useful. A high-level specification sample is as follows:

- Cohort construction: use the MIMIC2 patient dataset; predict mortality.
- Feature construction: diagnosis, procedure, lab test, and fluid balance events, with a sum aggregation function for lab tests and a count aggregation for the others.
- Cross-validation: 3-fold cross-validation.
- Feature selection: select the top 5% and 20% of features based on chi-squared statistics.
- Classifier: k-nearest Neighbors (kNN)[6] and Support Vector Machine (SVM)[9].
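For reference, the same order might be rendered as the following JSON document (a hypothetical example: the key names are illustrative and not the exact schema accepted by the RESTful API described below):

    import json

    spec = {
        "cohort": {"dataset": "MIMIC2", "target": "mortality"},
        "features": {"diagnoses": "count", "procedures": "count",
                     "lab_test": "sum", "fluid_balance": "count"},
        "cross_validation": {"folds": 3},
        "feature_selection": {"method": "chi2", "percentiles": [5, 20]},
        "classifiers": ["knn", "svm"],
    }
    print(json.dumps(spec, indent=2))   # body of a POST to the API endpoint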

This specification results in a set of 12 pipelines (2 feature selectors x 2 classifiers x 3 cross-validation folds). After the predictive modeling workflow is computed, we deliver a model performance report and the optimal model to the user. A graphical user interface is chosen to enable fast modeling by clinical researchers, but it is not flexible by design. To alleviate this constraint, we also provide a RESTful API that accepts pipeline specifications in JSON format.

3.3 Cluster Launcher
To ease the pain of maintaining infrastructure and let users focus on the analysis itself, we utilize AWS Elastic MapReduce to set up a cluster for each analysis workflow when the user submits a workflow specification. Elastic MapReduce provisions machines on top of EC2 and configures the Hadoop cluster; we just need to specify the size of the cluster and the type of machines to use. As we will see later in the experiments and in the description of the Cost Estimator component, a cost/performance tradeoff decision must be made here. We also enable stand-alone machine learning packages such as scikit-learn[32] on the cloud. After all the tasks are finished, the cluster is terminated. Allocating a new cluster per workflow helps avoid resource competition between different workflows and makes the real cost of running a workflow easy to determine. A potential drawback is that provisioning and terminating a cluster consumes some time; according to our observations, it usually takes less than 10 minutes to provision a cluster with 20 nodes, and termination is almost instant. We develop the Cluster Launcher as a separate component because we want to be flexible in infrastructure. Currently, the Cluster Launcher can either launch clusters in AWS for production or use a dedicated cluster for development. In the future, we want to support other cloud providers like Azure and containers like Docker[29].

4. SCALABILITY
One key challenge in predictive modeling is handling big data. We solve this by leveraging big data frameworks including Hadoop[10] and Spark[41]. In this section, we first describe the Workflow Builder component and then detail the implementation of each task within the Execution Engine component.

4.1 Workflow Builder
The Workflow Builder converts pipeline configurations into unique tasks and their dependencies, represented by a dependency graph as shown in Figure 2.B. The dependency graph is a directed acyclic graph (DAG) where each node represents a task and each edge encodes a dependency between two tasks. Such a dependency graph eliminates all the redundant tasks and provides the essential guide for the Execution Engine to compute all tasks in the right order in parallel.

4.2 Execution Engine
The Execution Engine exploits two types of parallelism for speedup:

- Inter-task parallelism: Given the dependency graph, individual tasks can be computed in a topological order. Multiple tasks can then run in parallel as long as their dependencies have completed (see the sketch after this list).
- Intra-task parallelism: An individual task can often itself be computed in parallel. However, the specific strategy for parallelism varies across different tasks.
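The production system delegates this scheduling to Cascading, as described next, but the idea behind inter-task parallelism is simple enough to sketch. The toy scheduler below (an illustration only, not our actual engine) submits a task as soon as all of its prerequisites have finished, which yields a topological-order, maximally parallel execution of the dependency DAG:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_workflow(dag, run_task, max_workers=8):
        """Execute a dependency DAG ({task: set of prerequisites}) in parallel."""
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while len(done) < len(dag):
                # Submit every task whose prerequisites have all completed.
                for task, deps in dag.items():
                    if task not in done and task not in running and deps <= done:
                        running[task] = pool.submit(run_task, task)
                # Block until at least one running task finishes.
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for task, fut in list(running.items()):
                    if fut in finished:
                        fut.result()          # re-raise task failures
                        done.add(task)
                        del running[task]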
In terms of implementation, we choose the big data framework best suited to each specific purpose; hence multiple big data frameworks appear in our technology stack.

For inter-task parallelism, we use the high-level programming tool Cascading[2] for running tasks in parallel with respect to their dependencies. In particular, Cascading supports setting up a DAG of Cascading flows, where each Cascading flow contains several operations and corresponds to a task in our system. The framework resolves dependencies and parallelizes the execution of Cascading flows in topological order. Currently, the backend of Cascading is Hadoop, so a sequence of MapReduce jobs is carried out. In the future, Cascading may switch to more efficient backends such as Spark, and we expect to receive that speedup benefit almost for free.

For intra-task parallelism, we focus on speeding up individual tasks in the workflow, including cohort construction, feature construction, feature selection and model training/testing. Since cohort and feature construction are data processing and data integration in nature, an in-memory data flow system like Spark fits very well. The results of these tasks are a feature matrix and a target vector stored in SVMLight format, which is accepted by many machine learning packages such as Python scikit-learn. For feature selection and model training/testing, intra-task parallelism is more complex and subtle, and often requires parallel implementations of the specific algorithms being used. Instead of trying to implement those algorithms on our own (which could be time-consuming and difficult to maintain), we could explore the existing parallel implementations in Spark MLlib. Instead, we wrap the algorithms of Python scikit-learn through Cascading, since scikit-learn provides a more comprehensive set of machine learning algorithms than Spark MLlib. In particular, all these ML implementations are invoked through Hadoop Streaming jobs inside Cascading. Some detailed considerations about the implementation follow. For feature selection, mappers just emit their input training samples to a single reducer as-is; the output of the single reducer is the feature selector fitted on the input training data. For model training, besides the normal input data, mappers load a feature selector model, apply it to the training samples, and emit the transformed training samples. Just like feature selection, training requires all training samples to be collected together, so we also set the number of reducers for this Hadoop Streaming job to 1 and run classifier training in the single reducer; the output of the reducer is a predictive model (e.g., a classification model) fitted on the samples with the feature selector applied. For model testing, mappers load the feature selector and the trained classifier as side data: our system copies the feature selector and classifier to a local directory for each mapper. Each mapper can then apply a feature selector and a predictive model to the testing samples from standard input and emit the prediction results together with the real target values to the reducer. The reducer aggregates them to calculate the score for the given combination of feature selector and predictive model.
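As an illustration of this pattern, a minimal single-reducer training script might look as follows. This is a sketch under the assumption that SVMLight-formatted training samples arrive on standard input; the real system wires scikit-learn into Cascading via Hadoop Streaming and handles side-data distribution itself:

    #!/usr/bin/env python
    # Sketch of the single-reducer model-training step (illustrative only).
    import pickle
    import sys
    from io import BytesIO

    from sklearn.datasets import load_svmlight_file
    from sklearn.svm import LinearSVC

    def main():
        # Hadoop Streaming delivers the mappers' output on standard input.
        X, y = load_svmlight_file(BytesIO(sys.stdin.buffer.read()))
        clf = LinearSVC().fit(X, y)          # train on all collected samples
        with open("model.pkl", "wb") as f:   # shipped as side data to testers
            pickle.dump(clf, f)

    if __name__ == "__main__":
        main()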

5. TRANSPARENCY
One possible source of reluctance for users of CloudAtlas is uncertainty about the expected cost and running time. The analogy to online shopping is that shoppers would be very reluctant to make purchase decisions if item prices and shipping times were not given. However, that is the current reality of using cloud computing for predictive modeling: no clear cost or running time estimates are provided. Making the situation even worse, users also have to decide what infrastructure to provision without a clear understanding of the total cost and running time. Provisioning too large a cluster wastes resources, while provisioning too small a cluster can incur huge delays and sometimes even execution failures. Our goal in this section is to propose methods that provide transparency to users in terms of running time and cost estimation.

5.1 Problem Definition
Given a runtime environment r and a workflow w, we are interested in estimating the following:

- Cost is the total cost c that will be charged by the cloud provider for completing w on r using CloudAtlas.
- Running time is the total time t_total taken to complete w on r; t_total starts from the user's invocation of the pipelines in our system and ends with cluster termination.
- Progress is the percentage of the running time completed at a given time point, p = t_elapsed / t_total, where t_elapsed is the time elapsed since the beginning of the workflow. If t_remain denotes the remaining running time, then t_total = t_elapsed + t_remain.

5.2 Estimation through Regression
The high-level idea of our method is 1) to characterize the workflow, the associated data and the underlying machines as a feature vector x, and then 2) to learn a regression function mapping x to the expected output y (i.e., running time, cost or progress). In particular, we train the regression model using historical runs as the collection of training data {(x_i, y_i)}, i = 1..n, where n is the number of modeling runs in the history. We choose such a black-box model instead of a white-box model, in which we would have to explicitly model the underlying algorithms or parallel computing framework (such as Hadoop), because we want our methods to be generally applicable to broad (and even unseen) settings. The tradeoff is that we first have to acquire enough history in order to model future runs. In a real-world setting, we believe such history can be quickly acquired by most web services, including the predictive modeling service we are creating. We systematically evaluate the estimation performance of such black-box models in the experiments. In terms of specific model choices, we use both linear and non-linear regression models:

- Linear regression is fast and easy to interpret, but assumes a linear relation between input features and output. In particular, the model assumes that y_i = θᵀx_i + ε, where θ holds the parameters of the model to learn and ε ~ N(0, σ²) is the error. The linear assumption is too strong for our problem, as we show in the experiments.
- Polynomial regression is similar to linear regression but applies a polynomial transformation to the original features. In our experiments, we use second-order polynomials. For example, an original feature vector [a, b] is replaced with [a, b, a², ab, b²] if the polynomial degree d is 2. Such a transformation gives a richer model but increases the dimensionality of the input features; as a result, it is more expensive than the linear model.
- Gaussian process is a non-linear model in which any finite combination of the random variables has a joint Gaussian distribution. The kernel we use is the polynomial kernel defined as k(x, x') = σ² (xᵀx' + c)ᵈ.

To deal with high dimensionality and insufficient samples, regularization is commonly used to provide more robust estimates. In our system, we adopt two popular regularizations: Ridge regularization (L2)[19] and Lasso regularization (L1)[38].
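For instance, a polynomial regression with Ridge regularization can be assembled directly in scikit-learn (a minimal sketch; the hyperparameter values are illustrative):

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    def fit_runtime_estimator(x, y, degree=2, alpha=1.0):
        """x: one row of features per historical run; y: the observed target."""
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=degree),  # second-order expansion of inputs
            Ridge(alpha=alpha),                 # L2 regularization
        )
        return model.fit(x, y)

    # estimator = fit_runtime_estimator(history_features, history_runtimes)
    # t_hat = estimator.predict(new_workflow_features)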
5.3 Features
Features play an important role in the success of machine learning models. Here we describe all the features that we use.

Machine-related features include the number of machines in the cluster, the total memory of the cluster and the total number of CPU cores of the cluster. We realize that many other potential features could be added, but we hope to capture the key parallelization factors of the underlying machines with a few simple measures.

Dataset-related features include the size of the input dataset, the number of patients, the number of features and the number of non-zero entries. CloudAtlas deals with sparse patient healthcare events; both the input to the system and the intermediate results use sparse representations. In such cases, besides the number of features and the number of patients, the number of non-zero entries also matters.

Workflow-related features comprise characteristics of the analysis workflow w. We use the number of independent tasks and the number of dependency relationships. In addition, we incorporate healthcare-analysis-specific settings, including the observation window size, as features.

Progress-specific features are used for progress estimation only. A workflow contains tasks of different types, e.g., feature construction, model training, etc. When estimating the progress of a workflow at a certain time point, we use the time elapsed since the beginning of the workflow and the percentage of completed tasks of each type as features. Unlike the other features mentioned above, which can be collected before running a workflow, these features are available at run time only.

The total number of features is 16, consisting of 3 machine, 4 dataset, 3 workflow and 6 progress features.
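A sketch of how such a 16-dimensional vector might be assembled (the field names are our own illustration, not the system's exact feature schema):

    def estimation_features(cluster, dataset, workflow, runtime=None):
        """3 machine + 4 dataset + 3 workflow features, plus 6 at run time."""
        x = [
            cluster["num_machines"], cluster["total_memory"], cluster["total_cores"],
            dataset["size_bytes"], dataset["num_patients"],
            dataset["num_features"], dataset["num_nonzeros"],
            workflow["num_tasks"], workflow["num_dependencies"],
            workflow["observation_window_days"],
        ]
        if runtime is not None:  # progress estimation only
            x.append(runtime["elapsed_seconds"])
            # Completion fraction for each of the 5 task types (stages).
            x.extend(runtime["completed_fraction_by_stage"])
        return x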

5.4 Running Time and Cost Estimation
Estimations for both running time and cost use the same set of features in the regression model but with different target variables: the actual running time and the cloud computing cost (we compute the cost based on the charge price of on-demand instances on AWS), respectively. It is worth pointing out that we estimate running time at the granularity of seconds, but the cloud provider usually charges for service consumption in hours, where any fraction of an hour is counted as a full hour; for example, 74 minutes is billed as 2 hours. Given a running time estimate t in seconds and the unit price of the cloud service u in dollars per hour, the cost function g(t) is

    g(t) = ⌈t / 3600⌉ · u.    (1)

With the distribution of the total running time estimated as t ~ N(µ, σ²), we can estimate the expected cost as

    E(cost) = ∫ g(t) f(t) dt    (2)

where f is the probability density function of t (Gaussian).

5.5 Progress Estimation
We use all the candidate features described in the features subsection above for progress estimation. Our prediction target is the remaining time of a workflow. With the estimated remaining time of a workflow t̂_remain, we can calculate the estimated progress as

    p̂ = t_elapsed / (t_elapsed + t̂_remain).

We can generate training data for this estimation task by collecting the history of past workflow runs. Specifically, we evenly sample 200 progress records from each workflow, accumulating 8,600 samples.

6. EXPERIMENTS
In this section, we first show the intuitiveness of CloudAtlas using a real use case for asthma readmission. Then, we evaluate the scalability of the system with two EHR datasets of different characteristics. Finally, we show the accuracy of our estimation methods for running time, cost and progress.

6.1 Intuitiveness: Case Demonstration
We demonstrate an actual use case of CloudAtlas in collaboration with Children's Healthcare of Atlanta (CHOA) on predicting pediatric asthma readmission for 3K inpatients (balanced cases and controls). The web-based UI was used by a clinical researcher (an MD-PhD student in our lab) with clinical guidance from one physician and one nurse at CHOA. The input data include patient demographics, diagnoses, medications, procedures and lab results exported from CHOA's internal reporting system. The raw inputs are de-identified and processed into the tuple format that CloudAtlas requires; those tuples are then uploaded to secure storage for predictive modeling. With the adoption of CloudAtlas, the clinical researcher was able to make quick progress on predictive modeling once the data was ready: he obtained prediction performance and feature characteristics within an hour with minimal help from our system developers. The predictive modeling results are reported on the web front-end, including model accuracy measures such as AUC, F1, precision and recall. Furthermore, the results can be exported to a spreadsheet for offline reporting and analysis. The predictive features are also listed in descending order of importance to the predictive model. The performance results and predictive features were presented to physicians, who gave feedback about incorporating length of stay as an additional feature. Based on the feedback, the clinical researcher could immediately rebuild the model with the new features during the meeting and discuss the new results right away.
In contrast, before adopting CloudAtlas it might take days for a researcher to update predictive modeling results, and it often takes weeks to schedule a follow-up meeting with busy physicians. CloudAtlas provides a simple predictive modeling framework that can shorten the development and validation process dramatically.

6.2 Scalability Evaluation
In this subsection, we first describe the dataset setup. Then, we evaluate scalability in terms of running time under different cluster and data settings: we run the system with various machine types, numbers of machines, observation window sizes and input data sizes. We also show the corresponding cost of using our system in the cloud.

6.2.1 Environment and Data Setup
To demonstrate the scalability and efficiency of our system, we use the MIMIC2 clinical dataset[34] to predict mortality within 3 months after discharge, as in [12], and the CMS synthetic dataset[18] to predict heart failure. High-level statistics about these two datasets are listed in Table 2.

             MIMIC2          CMS
Source       ICU patients    Medicare claims
Duration     8 years         3 years
Size         844MB           6.88GB
#Patients    32,431          2,069,684
#Features    21,…            …,914
#Records     31,813,…        …,309,933
Table 2: Statistics of the two EHR datasets used to evaluate the performance of our system.

We extract diagnosis, procedure, lab test, and fluid balance events from the MIMIC2 clinical dataset. From the CMS synthetic dataset we extract diagnosis, procedure, payment, and medication events. We choose these two datasets because they have very different characteristics: MIMIC2 provides detailed data on a smaller population, while CMS provides less detailed data on a larger population. In AWS Elastic MapReduce, the Hadoop version we use is …. Before running predictive modeling workflows with our system, patient data must be stored somewhere accessible to the cluster. When running Hadoop with Elastic MapReduce, one typically uploads data into Amazon's Simple Storage Service (S3). The two datasets we use are converted into the Event format and uploaded to Amazon S3 using s3cmd. The time spent to upload the CMS and MIMIC2 datasets is 8m11s and 1m7s, respectively, and the costs of storing them are $0.21/month and $0.03/month, which is minimal compared to the remaining predictive modeling process.

Type        Description       CPU  Mem (GB)  Price
m3.2xlarge  General Purpose   8    26        $0.70/h
c3.2xlarge  Compute Optimize  8    15        $0.53/h
r3.xlarge   Memory Optimize   …    …         $0.44/h
r3.2xlarge  Large Memory      8    61        $0.88/h
Table 3: Configuration and cost of 4 machine types in the Amazon US East Region (N. Virginia). Prices were collected on 2014/11/13 for EMR plus the Linux virtual machines that host EMR; we report costs according to Amazon US East Region (N. Virginia) pricing.

The predictive modeling workflow for mortality prediction uses the following specification:

- Feature types: diagnoses, procedures, lab tests, and fluid balance, with a sum aggregation function for lab tests and a count aggregation for the others. Normalization of the aggregation result is enabled for all feature types.
- Events filtered with an observation window of 360 days and a prediction window of 10 days.
- Eight feature selection settings: select the top 1%, 2%, 5%, and 10% of features based on chi-squared statistics and the ANOVA F-value.
- Four classifier algorithms: Logistic Regression, Perceptron, Naive Bayes and Linear Support Vector Classifier.
- 5 times 10-fold cross-validation (i.e., running 10-fold CV 5 times).

The resulting dependency graph contains more than 3000 nodes; in particular, there are 400 feature selection tasks and 1600 classification tasks. The predictive modeling workflow specification for heart failure detection with the CMS synthetic dataset is similar to that of mortality prediction with the MIMIC2 clinical dataset, differing in the following components:

- Feature types: diagnoses, procedures, payments, and medications, with a sum aggregation function for payments and a count aggregation function for the others.
- Three classifier algorithms: Logistic Regression, Perceptron and Linear Support Vector Classifier.

6.2.2 Machine Type
In the AWS cloud computing environment, different machine types are provided, usually optimized for different purposes: some for CPU-intensive jobs, some for memory-intensive jobs. Here we evaluate 4 types of virtual machine: c3.2xlarge, r3.2xlarge, r3.xlarge and m3.2xlarge. The configuration and price of the different machines are listed in Table 3. We run this experiment with the same cluster size of 10. This size includes one master node of the m3.xlarge machine type (priced at $0.35/h); the actual cluster has 9 slave nodes that carry out the real computation. The corresponding running time and cost for the two workflows on the different machine types are shown in Figure 4.

Figure 4: Running time and cost of the workflows for the two datasets. Larger circles represent runs with the CMS dataset, smaller ones with MIMIC2. Cost is calculated using the same pricing mechanism as AWS (less than one hour counts as one hour) with the prices listed in Table 3.

For both cases, the Large Memory machine type gives the fastest result. This is because most tasks in the workflow are memory intensive; more memory can lead to better parallelization. The costs of running with different machine types do not vary much for the CMS heart failure prediction workflow, but the running time varies a lot, mainly because a slow machine can be cheap but may require a longer running time. Given similar cost, users will definitely want to get results earlier. In the later experiments, we therefore use the Large Memory machine type for the CMS data, as it gives the lowest running time.
For mortality prediction with the MIMIC2 clinical dataset, Compute Optimize is slowest and Memory Optimize is cheapest. The reason Memory Optimize is cheapest is that it best balances parallelization and unit cost for the given workflow on the given dataset. In the later experiments, we use Memory Optimize for the MIMIC2 data. Our system scales well for different sizes of data, as the total running time is within hours for both workflows. In Figure 5, we also show the breakdown of running time for the different stages of a workflow, corresponding to a workflow run with the CMS dataset on a 10-node Large Memory cluster.

Figure 5: Running time of each stage for a workflow instance with the CMS synthetic dataset for heart failure prediction. The cluster size is 10 and the machine type is Large Memory. The top yellow bar is the total time of running all the tasks of the workflow sequentially, in topological order, on a single Large Memory machine.

The overlap between the training and testing phases means that the system can deliver testing results almost immediately after training finishes. Boosted by the parallelization of tasks, the total running time is dramatically reduced compared to running all tasks sequentially on a single machine of the same kind.

6.2.3 Cluster Size
Besides machine type, the size of the cluster also impacts the total running time and thus the cost. In this part of the experiment, we run the two workflows with different cluster sizes to show how our system scales up. The choice of machine type is as described in the previous experiment. We run clusters of size 5, 10 and 20 for the two workflows. Figure 6 shows the running time and cost for the different cluster sizes.

Figure 6: Running time (in log scale) and cost for the workflows run on clusters of various sizes. The single blue dot represents the running time and cost of running every task sequentially on a single machine of the same type as the cluster. For CMS, a 40x speedup is achieved with 20 nodes in the cluster.

We can see that long-running workflows benefit more from our system. For the smaller MIMIC2 dataset, the speedup is limited. For the CMS dataset, there is a larger reduction in running time with more machines: with 20 nodes in the cluster, the workflow on CMS achieves about a 40x speedup compared to a sequential run. The speedup is not linear in the cluster size, as the time consumed for cluster provisioning, dependency graph generation, etc. cannot be reduced by adding more nodes. The delay caused by the distributed processing framework makes it not worthwhile to analyze small datasets with a cluster. To balance time and cost, in the remaining experiments we choose a cluster size of 10 for both datasets.

6.2.4 Workflow and Dataset Property
In this part of the experiment, we test how our system behaves under different workflow and dataset settings. Specifically, we vary the observation window and the input dataset size. The size of the observation window determines the number of records that enter model training and thus may lead to different workflow running times and prediction accuracies. We test observation windows of 180, 240, 300 and 360 days. Other workflow settings are identical to the previous experiments. As stated previously, the machine type for the CMS dataset is Large Memory and for MIMIC2 is Memory Optimize, and the cluster size is 10. Figure 7 shows the result: smaller observation windows lead to shorter running times, as expected.

Figure 7: Running time and cost of the workflow with different observation windows.

Next, we test the performance of our system using different dataset sizes. We split the CMS dataset into datasets of smaller size and run the same predictive modeling for heart failure. Figure 8 clearly shows that the larger the data size (represented by the number of event records in the dataset), the longer the running time. The running time and cost increase linearly with the input data size. This trend shows that our system scales well.

Figure 8: Time and cost of the same workflow with different input dataset sizes.

6.3 Transparency Evaluation
With past running records of the system, we conduct experiments to test the effectiveness of our estimation methods and show results in the order of running time estimation, cost estimation and progress estimation. We collect detailed logs of 43 individual workflow runs. From these workflows, we accumulate 8,600 workflow progress records by sampling 200 data points from each workflow. As all three estimation problems in this paper are formulated as regression problems, we use two widely used metrics to validate the performance of our methods with 5-fold cross-validation. First, we test the root mean squared error (RMSE), also known as the root mean squared prediction error (RMSPE), on test samples, defined as

    RMSE = sqrt( (1/n) · Σ_{i=1..n} (ŷ_i − y_i)² )

where n is the number of testing samples and ŷ_i is our estimate of y_i. In our problem, y is either the cost c of the workflow, the total running time t, or the remaining running time t_remain, depending on the problem. Second, we measure R² on the cross-validation set, defined as

    R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where ȳ is the average of the y_i. The higher the R² and the lower the RMSE, the more effective the method.
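Both metrics can be computed directly with scikit-learn; a minimal sketch of this evaluation protocol:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import KFold, cross_val_predict

    def evaluate(model, x, y):
        """5-fold cross-validated RMSE and R^2 for a regression model."""
        y_hat = cross_val_predict(model, x, y, cv=KFold(n_splits=5, shuffle=True))
        rmse = np.sqrt(mean_squared_error(y, y_hat))
        return rmse, r2_score(y, y_hat)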

6.3.1 Running Time Estimation
For running time estimation, we compare six methods:

- Linear Basic: linear regression without regularization.
- Linear Lasso: linear regression with Lasso regularization.
- Linear Ridge: linear regression with Ridge regularization.
- Polynomial Ridge: polynomial regression with Ridge regularization.
- Polynomial Lasso: polynomial regression with Lasso regularization.
- Gaussian Process: polynomial-kernel Gaussian process.

Method             RMSE   R²
Linear Basic       …      …
Linear Lasso       …      …
Linear Ridge       …      …
Polynomial Ridge   …      …
Polynomial Lasso   …      …
Gaussian Process   …      …
Table 4: Running time estimation performance evaluation.

The performance is shown in Table 4. From the result we can see that Polynomial Ridge performs best.

6.3.2 Feature Contribution
As described in Section 5, we use machine-related, dataset-related and workflow-related features for running time estimation. In this part of the experiment, we show how the three feature groups perform in combination. We test three settings: 1) machine-related features only; 2) machine-related and dataset-related features together; 3) all three groups of features. We only test the Polynomial Ridge method, as it performs best in the previous experiment. From Figure 9 we can see that estimation works best when all groups of features are applied, which means the features we use are meaningful. Note that machine features alone seem inadequate for predicting running time, but dataset and workflow features provide much more predictive power.

Figure 9: Feature contribution analysis for running time estimation. With more features added, we see a decrease in RMSE and an increase in R². The estimation performance is best when all features are applied.

6.3.3 Extrapolation Ability Analysis
In real-world settings, the most challenging case is needing to use the history of small workflows to estimate larger workloads. This is called the extrapolation problem. In this experiment, we simulate this situation by sorting the workflow samples by running time in ascending order, using the first 60% of workflows (small workflows) as the training set and the last m% (large workflows) as the test set. We use Polynomial Ridge as the regression method and vary m from 5 to 40 with a step of 5. Extrapolation is especially hard when m is small (i.e., only the largest workflows are tested): a low m means that the testing data is very different from the training data, and the estimation performance will be worse. In this experiment, we measure the coefficient of variation of the RMSE (CVRMSE), a normalized version of RMSE, defined as

    CVRMSE = RMSE / ȳ

The performance is shown in Figure 10, where we observe that CVRMSE decreases slowly, which shows reasonable performance in the challenging extrapolation setting.

Figure 10: Extrapolation ability analysis. The x-axis shows m, the percentage of data used in the test. A lower m means more difference between training and testing data and thus more difficulty in estimating running time. The y-axis shows the coefficient of variation of the RMSE (CVRMSE). The nearly stable line means that our method can extrapolate fairly well.

6.3.4 Cost Estimation
For cost estimation, our method works by first estimating the distribution of the total running time as t ~ N(µ, σ²). We compare the following methods:

- Polynomial Ridge Distribution: estimate the distribution of the running time t ~ N(µ, σ²) using Polynomial Ridge, then estimate the cost using Eq. (2).
- Gaussian Process Distribution: estimate t ~ N(µ, σ²) using a Gaussian process, then estimate the cost using Eq. (2).

Method                         RMSE   R²
Polynomial Ridge Distribution  …      …
Gaussian Process Distribution  …      …
Table 5: Cost estimation performance comparison.

Table 5 shows the performance of the two methods. From the table we see that Polynomial Ridge Distribution works better.
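To make Eq. (2) concrete: because g(t) is a step function in whole hours, the integral collapses into a sum of hourly bills weighted by the probability mass that t falls in each hour. A sketch using SciPy (the tail-truncation threshold is an illustrative choice):

    from scipy.stats import norm

    def expected_cost(mu, sigma, unit_price):
        """E[cost] for t ~ N(mu, sigma^2) under hour-granularity billing (Eq. 2).

        g(t) = ceil(t/3600) * unit_price is piecewise constant, so the
        integral reduces to a sum over hour-long intervals. Mass below t = 0
        is ignored, which is negligible when mu >> sigma.
        """
        cost, hours = 0.0, 1
        while norm.cdf((hours - 1) * 3600.0, mu, sigma) < 0.9999:
            p = (norm.cdf(hours * 3600.0, mu, sigma)
                 - norm.cdf((hours - 1) * 3600.0, mu, sigma))
            cost += hours * unit_price * p
            hours += 1
        return cost

    # expected_cost(5000, 600, 0.88)  # e.g. an r3.2xlarge at $0.88/h (Table 3)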
With accurate estimation, CloudAtlas can make the cost transparent to the user, rather than presenting a quote that is significantly different from the actual cost.

6.3.5 Progress Estimation
Like running time estimation, we compare five methods: Linear Basic, Linear Ridge, Linear Lasso, Polynomial Ridge and Polynomial Lasso. We do not compare with Gaussian Process because it is not a sparse model and is not efficient in high-dimensional spaces; for progress estimation in CloudAtlas, which requires frequent re-estimation, Gaussian Process is not suitable. In this experiment, we split the training and test progress samples on a per-workflow basis: samples from the same workflow are either all in the training set or all in the test set. Table 6 shows that Polynomial Ridge performs best among all the methods considered.

Method            RMSE   R²
Linear Basic      …      …
Linear Ridge      …      …
Linear Lasso      …      …
Polynomial Ridge  …      …
Polynomial Lasso  …      …
Table 6: Progress estimation performance comparison. Gaussian Process is not compared, as it is not a sparse model and is not efficient in high-dimensional space.

In real-world settings, in fact, at any time point while the workflow is still running we have the same workflow's past progress records, which can be used for estimating current progress by retraining the regression model. Intuitively, this gives a more accurate estimation, as progress records from the same workflow are strongly correlated with each other. We verify this by training regression models not only with progress records from other workflows, but also with progress records of the same workflow collected earlier. This setting improves Polynomial Ridge's RMSE and R² further. Next, we show the estimated progress of two workflow instances; the results are displayed in Figure 11. By retraining a regression model using the workflow's own past progress records as well as other workflows' records, we reduce the absolute error. Also, as time passes and more progress records of the same workflow accumulate, the error is further reduced. For the estimation with retraining, we observe stage-like patterns, corresponding to the different stages of the workflow: within a workflow stage the error is nearly constant, and after a stage completes the error drops.

Figure 11: Instances of progress estimation. The x-axis is the workflow progress. The y-axis is the absolute error of the estimated remaining time. The blue line is obtained by training regression models using progress records of other workflows only. The red line is obtained by training regression models using the current workflow's past progress records plus the other workflows' records. Clearly, adding the current workflow's progress reduces the error. The estimated progress error shows a pattern of stages, which can be explained by the stages of the workflow.
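Putting the pieces together, the re-estimation step itself is small (a sketch; the estimator is any of the regression models above, retrained as new progress records arrive):

    def estimate_progress(estimator, runtime_features, t_elapsed):
        """Estimated progress p = t_elapsed / (t_elapsed + estimated t_remain).

        estimator: a regression model (e.g. Polynomial Ridge) trained on past
        progress records; runtime_features: the feature vector sampled now.
        Assumes t_elapsed > 0, which holds whenever progress is re-estimated.
        """
        t_remain = max(float(estimator.predict([runtime_features])[0]), 0.0)
        return t_elapsed / (t_elapsed + t_remain)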
7. DISCUSSION
We discuss related issues and potential extensions.

Data Privacy and Security: EHR data can be sensitive, especially Protected Health Information (PHI). Many national efforts, such as Vanderbilt's Synthetic Derivative, have started to produce comprehensive de-identified EHR data for research, so the security risk of analyzing such data on the public cloud is decreasing. At the same time, cloud security has improved significantly, and most cloud vendors, including AWS and Azure, offer HIPAA-compliant cloud infrastructure. Besides that, private clouds using the same technology have already been deployed, such as AWS GovCloud[1], where a system like CloudAtlas can be directly applied safely.

Heterogeneous data: One of the limitations is that we currently only support structured data. In the future, we will explore feature construction with unstructured data such as unstructured notes, time series and images. Our system is implemented with distributed storage support from the AWS S3 service, where clinical notes and images can easily be stored and retrieved from the same S3 data store in a HIPAA-compliant manner.

8. RELATED WORK
Related work falls into two categories: predictive modeling systems and system performance estimation.

8.1 Predictive Modeling Platform
Once features are constructed and formatted in the required format, general-purpose machine learning tools like Weka[16] and RapidMiner[20] can help users try specific machine learning models. However, as described in Section 3, many possible models need to be explored, and the cohort construction and feature construction steps should be automated to speed up the process and reduce possible human errors. Cloud-based machine learning platforms like Azure Machine Learning can help create workflows to compare models, but they also need formatted feature input; otherwise, users need to apply their own scripts. As a result, there is still a lot of work left for clinical researchers to complete manually. Other healthcare data mining systems [22, 28, 14] may provide complete predictive modeling pipeline functions like those in Figure 2, as well as web-based user interfaces for workflow configuration and result display. However, unlike our system, which runs on the cloud and uses big data techniques, they tend to run on a single dedicated server and are thus not as scalable and flexible as ours. The PARAMO system[31] is the most relevant work in terms of function. That work explored the benefits of using a Hadoop[3] cluster to speed up predictive modeling.

However, PARAMO is designed for a dedicated cluster, which is not always available to clinical researchers. Furthermore, users cannot change cluster settings on demand for different predictive modeling workloads, so it is not flexible enough.

8.2 System Performance Estimation
Several studies exist on system performance estimation in database systems and big data systems like Hadoop MapReduce. Some can be regarded as white-box models, as they try to analyze the underlying complex mechanisms; others are black-box models that use machine learning. [5] and [11] are typical methods using machine learning for database query performance prediction. For machine learning methods, a proper feature combination is important; [24] studied operator-specific features. For MapReduce running time estimation, [17] employed a white-box model to analyze a single MapReduce job. For a complex workflow like ours, it is hard to apply such analysis, as error accumulated on each MapReduce job can lead to huge error in the workflow-level estimate. [21] proposed a grey-box approach for performance estimation of individual Hadoop jobs, applying machine learning to a set of Hadoop-specific features. We adopt a similar method but at the workflow level, which can involve thousands of Hadoop jobs. Progress estimation is continuous, online performance prediction with additional runtime information; it can use either white-box or black-box models. [25] and [23] are examples of white-box and black-box database progress estimation, respectively. [30] proposes a white-box analysis method for estimating the progress of MapReduce DAGs. In our setting, however, a workflow is composed of thousands of heterogeneous tasks, and white-box modeling of that kind would be too complex.

9. CONCLUSION
We developed CloudAtlas, an intuitive, scalable and transparent predictive modeling platform that runs on the cloud. CloudAtlas is developed as an intuitive web application for clinical researchers with various skill levels. CloudAtlas is scalable, as it runs on the cloud using big data frameworks such as Spark and Hadoop in parallel. Instead of dedicated servers, CloudAtlas provisions appropriate clusters based on the configuration of the predictive modeling pipelines. One novel aspect of CloudAtlas is to provide accurate cost and running time estimates using black-box regression models based on historical runs, which helps users make more informed and transparent choices. Systematic experimental evaluation using two EHR datasets demonstrates significant performance gains of CloudAtlas compared to sequential computation: a 40X speedup and 40% cost reduction can be achieved on a 20-node AWS on-demand cluster. In terms of time/cost estimation, our experiments showed that our estimation is very accurate, with R² > 0.94 for progress estimation, R² > 0.83 for running time estimation and R² > 0.74 for cost estimation.

10. ACKNOWLEDGEMENTS
This work was supported by the National Science Foundation (award #…), a Google Faculty Award, an AWS Research Award, a Microsoft Azure Research Award, and Children's Healthcare of Atlanta.

11. REFERENCES
[1] AWS GovCloud (US) Region Overview - Government Cloud Computing.
[2] Cascading application platform for enterprise big data.
[3] Welcome to Apache Hadoop!
[4] M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Predicting completion times of batch query workloads using interaction-aware models and simulation. In Proceedings of the 14th International Conference on Extending Database Technology. ACM.
[5] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. IEEE, 2012.
[6] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 1992.
[7] R. Bellazzi and B. Zupan. Predictive data mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics, 77(2):81-97.
[8] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer, 2010.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 2008.
[11] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. I. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Data Engineering (ICDE), 2009 IEEE 25th International Conference on. IEEE, 2009.
[12] M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits. Unfolding physiological state: mortality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 75-84. ACM, 2014.
[13] L. Gordis. Epidemiology. Elsevier Health Sciences.
[14] D. Gotz, H. Stavropoulos, J. Sun, and F. Wang. ICDA: a platform for intelligent care delivery analytics. In AMIA Annual Symposium Proceedings, volume 2012, page 264. American Medical Informatics Association, 2012.
[15] F. L. Greitzer and D. A. Frincke. Combining traditional cyber security audit data with psychosocial data: towards predictive modeling for insider threat mitigation. In Insider Threats in Cyber Security. Springer.

12 mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10 18,2009. [17] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proceedings of the VLDB Endowment, 4(11): , [18] J. C. Ho, J. Ghosh, and J. Sun. Extracting phenotypes from patient claim records using nonnegative tensor factorization. In Brain Informatics and Health, pages Springer, [19] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics,12(1):55 67,1970. [20] M. Hofmann and R. Klinkenberg. RapidMiner: Data mining use cases and business analytics applications. CRC Press, [21] S. Kadirvel and J. A. Fortes. Grey-box approach for performance prediction in map-reduce based platforms. In Computer Communications and Networks (ICCCN), st International Conference on, pages1 9.IEEE,2012. [22] S. Klenk, J. Dippon, P. Fritz, and G. Heidemann. Interactive survival analysis with the ocdm system: From development to application. Information Systems Frontiers, 11(4): ,2009. [23] A. C. König, B. Ding, S. Chaudhuri, and V. Narasayya. A statistical approach towards robust progress estimation. Proceedings of the VLDB Endowment, 5(4): ,2011. [24] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for sql queries using statistical techniques. Proceedings of the VLDB Endowment, 5(11): ,2012. [25] J. Li, R. Nehme, and J. Naughton. Gslpi: A cost-based query progress indicator. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages IEEE,2012. [26] J. Lin and A. Kolcz. Large-scale Machine Learning at Twitter. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD 12, pages , New York, NY, USA, ACM. [27] Z. J. Ling, Q. T. Tran, J. Fan, G. C. Koh, T. Nguyen, C. S. Tan, J. W. Yip, and M. Zhang. Gemini: An integrative healthcare analytics system. Proceedings of the VLDB Endowment, 7(13),2014. [28] D. McAullay, G. Williams, J. Chen, H. Jin, H. He, R. Sparks, and C. Kelman. A delivery framework for health data mining and analytics. In Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38, pages Australian Computer Society, Inc., [29] D. Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2,2014. [30] K. Morton, M. Balazinska, and D. Grossman. Paratimer: a progress indicator for mapreduce dags. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages ACM, [31] K. Ng, A. Ghoting, S. R. Steinhubl, W. F. Stewart, B. Malin, and J. Sun. PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. Journal of Biomedical Informatics. [32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12: ,2011. [33] J. Pittman, E. Huang, H. Dressman, C.-F. Horng, S. H. Cheng, M.-H. Tsou, C.-M. Chen, A. Bild, E. S. Iversen, A. T. Huang, et al. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences of the United States of America, 101(22): ,2004. [34] M. Saeed, M. Villarroel, A. T. Reisner, G. Cli ord, L.-W. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark. 
Multiparameter intelligent monitoring in intensive care ii (mimic-ii): a public-access intensive care unit database. Critical care medicine, 39(5):952,2011. [35] R. Summers, M. Pipke, S. Wegerich, G. Conkright, and K. Isom. Functionality of empirical model-based predictive analytics for the early detection of hemodynamic instabilty. Biomedical sciences instrumentation, 50: ,2014. [36] C. Sun, N. Rampalli, F. Yang, and A. Doan. Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. Proceedings of the VLDB Endowment, 7(13),2014. [37] J. Sun, C. D. McNaughton, P. Zhang, A. Perer, A. Gkoulalas-Divanis, J. C. Denny, J. Kirby, T. Lasko, A. Saip, and B. A. Malin. Predicting changes in hypertension control using electronic health records from a chronic disease management program. Journal of the American Medical Informatics Association, pages amiajnl 2013, [38] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages ,1996. [39] J. Wu, J. Roy, and W. F. Stewart. Prediction modeling using ehr data: challenges, strategies, and a comparison of machine learning approaches. Medical care, 48(6):S106 S113,2010. [40] W. Wu, Y. Chi, S. Zhu, J. Tatemura, H. Hacigumus, and J. F. Naughton. Predicting query execution time: Are optimizer cost models really unusable? In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages IEEE,2013. [41] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10 10, [42] D. Zhao and C. Weng. Combining pubmed knowledge and ehr data to develop a weighted bayesian network for pancreatic cancer prediction. Journal of biomedical informatics, 44(5): ,
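As promised in Section 8.2, the sketch below illustrates the black-box, workflow-level estimation idea: fit a regression model on features of historical workflow runs to predict running time, then derive cost from the time estimate and the cluster configuration. This is a minimal illustration, not the CloudAtlas implementation: the feature set (input size, number of jobs, cluster size), the gradient-boosted regressor, the synthetic training data, and the hourly node price are all assumptions made for the example.

```python
# Minimal sketch of black-box running-time/cost estimation from historical
# workflow runs, in the spirit of Section 8.2. Feature names, model choice,
# synthetic data, and pricing are illustrative assumptions only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_runs = 500

# Hypothetical workflow-level features logged for each historical run:
# input size (GB), number of Hadoop/Spark jobs in the workflow, cluster size.
X = np.column_stack([
    rng.uniform(1, 500, n_runs),       # input_gb
    rng.integers(10, 2000, n_runs),    # num_jobs
    rng.integers(2, 20, n_runs),       # num_nodes
])

# Synthetic ground-truth running time (hours): grows with data size and job
# count, shrinks with cluster size, plus noise standing in for variability.
y_time = (0.02 * X[:, 0] + 0.005 * X[:, 1]) / X[:, 2] + rng.normal(0, 0.2, n_runs)

# Train a black-box regressor on past runs; predict time for held-out runs.
X_tr, X_te, y_tr, y_te = train_test_split(X, y_time, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred_time = model.predict(X_te)
print("running-time R^2:", r2_score(y_te, pred_time))

# Cost follows from the time estimate and the provisioned cluster:
# cost = predicted_hours * num_nodes * hourly_price (assumed $0.25/node-hour).
HOURLY_PRICE = 0.25
pred_cost = pred_time * X_te[:, 2] * HOURLY_PRICE
true_cost = y_te * X_te[:, 2] * HOURLY_PRICE
print("cost R^2:", r2_score(true_cost, pred_cost))
```

In this spirit, each completed workflow contributes one training example, and the fitted model can be queried before provisioning to present the user with an estimated running time and dollar cost for a candidate cluster size.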
