Adapting Scientific Computing Problems to Cloud Computing Frameworks
Ph.D. Thesis
Pelle Jakovits
Outline
- Problem statement
- State of the art
- Approach
- Solutions and contributions
- Current work
- Conclusions
3/21/2013
The Problem
- Accurate and large-scale scientific modeling and simulation applications:
  - require large amounts of computing resources
  - tend to run for long periods of time
  - are complicated to create and debug
- Using de facto tools like MPI to create such applications takes a lot of time and resources: the programmer has to take care of data partitioning, distribution and synchronization, deadlocks, fault recovery, etc.
The Problem in the Cloud
- Public clouds provide very convenient access to computing resources:
  - on demand and in real time
  - as long as you can afford them
- Clouds are built on commodity hardware, which means there is a constant risk of hardware and network failures.
- Thus, large-scale and potentially long-running scientific applications have to be able to handle faults.
State of the Art in 2010
- Google published MapReduce in 2004.
- Hadoop MapReduce appeared in 2008:
  - a distributed computing framework for huge-scale data processing
  - freely usable and open source
  - provides automatic parallelization and fault tolerance
  - algorithms have to follow the MapReduce model
The MapReduce Model
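The data flow of the model can be sketched in plain Python. This is a toy, single-process illustration of the map, shuffle and reduce phases, not Hadoop's actual API; all function names here are my own.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's map function to every input record,
    # collecting all emitted (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    # Group values by key -- in a real framework this is the
    # distributed sort/shuffle between the Map and Reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user's reduce function to each key group.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example:
def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

lines = ["a b a", "b c"]
result = reduce_phase(shuffle_phase(map_phase(lines, word_map)), word_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```

In a real framework the map and reduce calls run concurrently on many machines; only the two user functions change from problem to problem.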
Hadoop MapReduce
- The user only has to write the Map and Reduce functions.
- Parallelism is achieved by executing Map and Reduce tasks concurrently.
- The framework handles everything else for the user, allowing them to focus on implementing the algorithm instead of managing the parallelization.
- Fault tolerance is achieved through data replication and re-execution of failed tasks.
Goals of This Study
- Precisely identify which algorithm characteristics affect efficiency and scalability when algorithms are adapted to the MapReduce model.
- Create a classification of scientific computing algorithms that provides guidelines for deciding the suitability of new algorithms.
- Provide alternatives for algorithms that are not suitable for MapReduce, retaining all or most of the advantages that MapReduce provides.
Approach: Classification
We have created a classification of scientific computing algorithms based on how they are adapted to the MapReduce model:
1. As a single MapReduce job
2. As a constant number of MapReduce jobs
3. As a single MapReduce job for each iteration
4. As multiple MapReduce jobs for each iteration
We adapted algorithms from each of these classes to Hadoop MapReduce and investigated the results.
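The third class is the structurally interesting one: a driver loop launches one full MapReduce job per iteration and tests convergence in between. A minimal sketch of that pattern, where `run_mapreduce_job` is a hypothetical stand-in for a framework call (in Hadoop, each call would also re-read its input from HDFS):

```python
# Class 3 pattern: one MapReduce job per iteration, driven by a loop
# that checks convergence between job launches.
def iterate_until_converged(state, run_mapreduce_job, max_iters=100, tol=1e-6):
    for i in range(max_iters):
        new_state = run_mapreduce_job(state)   # full job launch each iteration
        if abs(new_state - state) < tol:       # convergence test in the driver
            return new_state, i + 1
        state = new_state
    return state, max_iters

# Toy stand-in "job": one Newton step of x -> (x + 2/x)/2, which
# converges to sqrt(2); a real job would be a distributed computation.
result, iters = iterate_until_converged(1.0, lambda x: (x + 2.0 / x) / 2.0)
```

The per-iteration job-launch and I/O cost of this pattern is exactly where Hadoop's overhead concentrates, as the next slide shows.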
Issues with Hadoop MapReduce
- It is designed for, and well suited to:
  - large-scale data processing tasks
  - embarrassingly parallel tasks
- It has serious issues with iterative algorithms:
  - long start-up and clean-up times (~17 seconds per job)
  - no way to keep important data in memory between MapReduce job executions
  - at each iteration, all data is read from HDFS again and written back there at the end
- Thus, there is significant overhead in every iteration.
Approach: Alternatives
We investigated three alternative approaches for algorithms that are not suitable for the classical MapReduce model:
1. Restructuring algorithms into non-iterative versions
2. Alternative MapReduce frameworks
3. Alternative distributed computing models
Restructuring Algorithms
- From PAM clustering to CLARA:
  - CLARA is designed from the start to be more efficient.
- From Conjugate Gradient (CG) to Monte Carlo matrix inversion:
  - the result is much less efficient
  - does not work well with sparse matrices
- It is generally very difficult to find an alternative that performs close to the original algorithm:
  - this approach can only be applied in a small number of cases
  - it requires a deep understanding of both the algorithms involved and the framework they are applied on.
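The PAM-to-CLARA restructuring illustrates the idea: CLARA runs PAM on small random samples and keeps the medoid set that scores best on the full data, so the expensive search never touches the whole data set at once. A toy 1-D sketch of that structure (an exhaustive search stands in for the real PAM, and all names and data here are illustrative):

```python
import itertools
import random

def cost(points, medoids):
    # Total distance of each point to its nearest medoid (1-D toy data).
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    # Exhaustive k-medoid search: feasible only on a small sample,
    # which is exactly why CLARA samples before running PAM.
    best = min(itertools.combinations(points, k),
               key=lambda m: cost(points, m))
    return list(best)

def clara(points, k, samples=5, sample_size=10, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)
        c = cost(points, medoids)   # evaluate candidates on the FULL data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return sorted(best_medoids)

# Three obvious 1-D clusters; CLARA should pick their centers as medoids.
data = [1, 2, 3, 50, 51, 52, 100, 101, 102]
medoids = clara(data, 3)
# medoids == [2, 51, 101]
```

Each sample's PAM run is independent, which is why CLARA maps far more naturally onto MapReduce than PAM itself does.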
Alternative MapReduce Frameworks
Alternative MapReduce implementations designed to handle iterative algorithms:
- Twister
- HaLoop
- Spark
We compared these, for example, on a Conjugate Gradient linear system solver with a matrix of 64 million non-zero elements.
Alternative MapReduce Frameworks (continued)
But, as a result, alternative MapReduce frameworks often:
- step away from the classical MapReduce model
- give up advantages of the MapReduce model, such as fault tolerance or multiple concurrent reduce tasks
- are less stable
- are more complicated to use and debug
Alternative Models
- Bulk Synchronous Parallel (BSP) model, created by Valiant in 1990.
- Google gave up using MapReduce for large-scale complex graph processing and instead designed a new framework, Pregel, based on the BSP model; details were published in 2010.
- Pregel is proprietary. As with MapReduce, third parties have designed freely usable alternatives:
  - jpregel
  - Hama
  - Giraph
Bulk Synchronous Parallel (BSP)
- A distributed computing model for iterative applications.
- Computations consist of a sequence of supersteps.
- Each superstep consists of three sub-steps:
  1. Local computation
  2. Sending messages to neighboring tasks, which can be accessed only at the next superstep
  3. Barrier synchronization
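The three sub-steps above can be sketched as a toy single-process simulation. This is an illustration of the model only, not the API of Hama, BSPonMPI, or any real framework; every name here is my own.

```python
# Toy BSP simulator: each task computes locally and sends messages,
# then all tasks hit a barrier; messages sent in superstep s become
# visible only at superstep s+1.
def bsp_run(num_tasks, init, compute, supersteps):
    states = [init(t) for t in range(num_tasks)]
    inboxes = [[] for _ in range(num_tasks)]
    for _ in range(supersteps):
        outboxes = [[] for _ in range(num_tasks)]
        for t in range(num_tasks):
            # sub-steps 1 and 2: local computation plus message sending
            states[t] = compute(t, states[t], inboxes[t], outboxes)
        # sub-step 3: barrier -- only now do messages become deliverable
        inboxes = outboxes
    return states

# Example on a ring of 4 tasks: each task adds up messages from its
# left neighbor, then passes its new state to the right.
def init(t):
    return t

def compute(t, state, inbox, outboxes):
    state += sum(inbox)
    right = (t + 1) % 4               # ring of 4 tasks
    outboxes[right].append(state)     # readable only next superstep
    return state

final = bsp_run(4, init, compute, supersteps=2)
# final == [3, 1, 3, 5]
```

Because communication is deferred to the barrier, a task never observes a half-updated neighbor, which is what makes iterative algorithms straightforward to express in BSP.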
BSP Results
Comparison of different BSP (BSPonMPI and HAMA) and MPI (MPJ and mpiJava) implementations against Hadoop when clustering 80,000 objects using the k-medoid method.
Current Work
- Extend the evaluation of iterative MapReduce frameworks for scientific applications.
- Investigate the efficiency of MapReduce applications in more detail (Karl Potisepp).
- Investigate creating a fault-tolerant BSP framework for scientific computing algorithms (Ilja Kromonov).
- Enhance the current classification to improve its accuracy.
More Current Work
- Design a methodology for this approach:
  - for classifying scientific computing algorithms
  - for adapting them to MapReduce or the chosen alternatives
- Create a repository of design patterns and performance measurement results from adapting scientific computing algorithms to MapReduce and its alternatives.
Even More Work, Not Directly Connected to the Thesis
- Direct migration of scientific computing experiments to the cloud:
  - D2CM tool for migrating electrochemical experiments
- CloudML project with SINTEF:
  - model-based deployment of large-scale scientific experiments
- Quantifying the cost of virtualization for distributed computing applications
- MapReduce in image processing:
  - SAR satellite image processing in MapReduce
  - large-scale image processing in MapReduce
Conclusions
This study aims to support computer scientists who need to scale up scientific computing applications and who would like to know:
- whether their algorithms are suitable for the MapReduce model
- what the best approach is for adapting them to Hadoop or to the alternative solutions
- what parallel speedup and efficiency they can expect from the result
- how the results compare, in scalability and parallel efficiency, to custom MPI implementations of the same algorithms
Thank you for your attention! Questions?