Final Project Proposal. CSCI.6500 Distributed Computing over the Internet




Qingling Wang 660795696

1. Purpose
Implement an application layer on a Hybrid Grid-Cloud infrastructure that automatically, or at least semi-automatically, determines the best environment settings for the user's tasks. It will simplify the procedure of deploying computational tasks on a Cloud or Hybrid Grid-Cloud infrastructure, especially for end users who know little about the technical details of Cloud or Grid computing. The user can obtain different computing performance (e.g., running time, cost) by adjusting a few parameters, without restarting the whole batch of tasks: the system suspends, picks up the new settings, and resumes computing from the stopping point.

2. Background
Cloud Computing
There is no concrete definition of Cloud Computing yet. Generally speaking, it is a reliable, virtualized, Internet-based architecture that provides on-demand computing, storage and many other services. Shared resources and virtualization are its key features. Cloud Computing has the following outstanding advantages:
1. It enables users to access their data from anywhere.
2. It provides on-demand services.
3. It provides more powerful computing capacity.
4. It has a more reliable and flexible architecture that guarantees QoS for users.
Organizations are impressed with these computing and storage capabilities. Indeed, the Cloud provides on-demand, scalable and fault-tolerant services over globally scattered hardware resources. But Cloud Computing is not a universal architecture; it may give poor performance in some situations:
1. It is hard to estimate exactly how many compute instances and data transfers are still needed.
2. Coordination between different virtual machines degrades performance.
3. It is difficult to separate the data into independent data sets.
4. For an individual customer or a small group, public cloud computing services are still very expensive.

Usually, an existing grid has been used for years inside an organization, and it would be a waste to give up all of those grid resources. The ideal approach is to find the proper way and time to combine the Grid with the Cloud, so as to maximize computing capability at a lower cost.

Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. It is a large-scale distributed processing infrastructure designed to efficiently spread large amounts of work across a set of machines. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. The application is divided into small work units, and each unit can be executed on any computational node. The framework also provides a distributed file system, which stores data across the various nodes. Thus, Hadoop MapReduce is well suited to testing the performance of Grid, Cloud and Grid&Cloud (G&C) on different data sizes. Usually, one of the machines (a physical machine in the Grid, a virtual machine in the Cloud, either one in G&C) acts as the master, responsible for scheduling tasks among the other machines (the slaves). The master machine can also execute tasks itself.

End User's Concerns
End users do not care which environment they are using; performance, cost and ease of use are the most important issues for a framework. Most papers about Cloud Computing still focus on introducing what Cloud Computing is and the advantages of using it. In fact, users, especially those with little knowledge of Grid/Cloud computing, need concrete result data to compare in order to gain a real understanding of it. Thus, real experimental data would help a lot.

3. Scope
Hadoop MapReduce helps to collect the experimental results on Grid, Cloud and Hybrid Grid-Cloud.
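As a minimal illustration of the MapReduce model described above (this is not the project code; the word-count task and all names here are assumptions for the sketch), the map, shuffle and reduce phases can be simulated in plain Python:

```python
from collections import defaultdict

# Minimal simulation of the MapReduce phases. The word-count task and
# all function names are illustrative assumptions, not the project code.

def map_phase(record):
    # Emit (key, value) pairs for one input split.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the values for one key into the final result.
    return (key, sum(values))

splits = ["grid cloud grid", "cloud cloud"]  # one input split per "node"
intermediate = [pair for s in splits for pair in map_phase(s)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'grid': 2, 'cloud': 3}
```

In Hadoop the splits would live on different DataNodes and the master would schedule the map and reduce work units across the slaves; the data flow, however, is the same.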
For the computation-intensive tasks, I will use a linear classification algorithm from the Machine Learning domain, called the Pocket Algorithm, run it on Hadoop, and collect results from the different environments.
Grid: the RPI Grid, which includes several single-core and some dual-core physical machines.
Cloud: runs virtual machines on the same physical machines as the Grid, using Xen 3.4.1 to host the VM instances.

4. Framework

There are mainly three layers in this framework.

[Figure: framework diagram. The Client Input Parser, Schedule Process (initialize input, refine demand, performance evaluation) and Temporary Staging Storage exchange an XML configure file on top of Hadoop (MapReduce), which runs over the Grid, the Xen-based Cloud and a public Cloud, with one machine acting as master.]

The bottom layer is the physical environment: a Grid comprising several single-core and dual-core servers, and a Cloud that uses the same server machines as the Grid but runs them under Xen, so that the Cloud can host a number of virtual instances. This layer also includes a public Cloud (e.g., Amazon EC2), from which we can obtain extra computing capacity when necessary.

The second layer is Hadoop, deployed on the bottom layer. We mainly use MapReduce, and sometimes HBase, to execute the computational task. MapReduce is responsible for splitting the data, distributing it to the different DataNodes and collecting the distributed results. HBase is used to store intermediate results and other useful status information.

The top layer is what we need to implement. It is responsible for interacting with end users and is composed of three correlated components: the Client Input Parser, the Temporary Staging Storage and the Schedule Process.

5. Method
Generally speaking, there are two parts. The first part is to collect experimental results, analyze the data and draw conclusions. We will use Hadoop for the experiments. Hadoop has a built-in command, jar, which supports executing a jar file. The experiments will be computation-intensive. In the Java project we will use a well-known Machine Learning algorithm, the Pocket Algorithm, to separate two different sets of points (2,000 unique random points in total). It takes almost two hours to run on a laptop and almost one hour on the server, so it is suitable as a computation-intensive experiment. We will collect all the statistics from this experiment on Grid, Cloud and Hybrid Grid-Cloud, analyze the reasons for the different performance in each environment, and conclude what the best settings are for different inputs (data size, block splitting and so on) and different end-user requirements.

The second part is the implementation. Based on those conclusions, we will write an application layer with three related components: Client Input Parser, Schedule Process and Temporary Staging Storage. It automatically, or at least semi-automatically, maps the user's demands onto a proper running environment, executes the task and returns the results. The user can also change the demand parameters during execution, but only at certain time points. The Client Input Parser takes the client's input and parses it into an XML file. The Schedule Process takes the XML file as input and, based on the current computation capacity of the environment, produces several practical plans for executing the task. It analyzes the pros and cons of each plan, mainly in terms of running time, cost and fault tolerance.
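As a sketch of the experiment's classifier, the standard Pocket Algorithm (a perceptron that keeps the best weights seen so far "in the pocket") can be written as follows. The 2-D toy data and all parameters here are illustrative assumptions; the project uses a larger Java implementation on Hadoop.

```python
import random

# Sketch of the Pocket Algorithm. The toy data and parameters are
# assumptions for illustration, not the actual experiment's setup.

def errors(w, data):
    # Count points the weight vector w (bias first) misclassifies.
    return sum(1 for x, y in data
               if (w[0] + w[1] * x[0] + w[2] * x[1] >= 0) != (y > 0))

def pocket(data, iters=1000, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]                      # current perceptron weights
    pocket_w, pocket_err = list(w), errors(w, data)
    for _ in range(iters):
        # Pick a random misclassified point and apply the perceptron update.
        wrong = [(x, y) for x, y in data
                 if (w[0] + w[1] * x[0] + w[2] * x[1] >= 0) != (y > 0)]
        if not wrong:
            return w                         # data perfectly separated
        x, y = rng.choice(wrong)
        w = [w[0] + y, w[1] + y * x[0], w[2] + y * x[1]]
        # Keep the best weights seen so far "in the pocket".
        err = errors(w, data)
        if err < pocket_err:
            pocket_w, pocket_err = list(w), err
    return pocket_w

# Two linearly separable clusters of labelled points (+1 / -1).
data = [((1.0, 1.0), +1), ((2.0, 1.5), +1),
        ((-1.0, -1.0), -1), ((-2.0, -0.5), -1)]
w = pocket(data)                             # separates the toy set
```

In the Hadoop version, each map task would evaluate candidate weights against its own split of the 2,000 points, which is what makes the experiment a good fit for comparing the environments.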
It returns this analysis back to the user, who can choose whichever plan they want through the application layer interface. If the user does not choose any of the plans, the Schedule Process selects the plan with the best performance. During execution the user can reset the demand settings, but only at certain interruptible points. The Schedule Process then suspends at that point and saves all parameters, states and other useful status into the Temporary Staging Storage. It generates a new configure file with environment settings that satisfy the end user's new requirements. Finally, the Schedule Process resumes and runs Hadoop again to obtain the new results.

Sometimes the computing capacity inside the organization (the Grid plus the private Cloud) is not enough, and we need to request public cloud computing capacity such as Amazon EC2. In our method, if none of the feasible plans can satisfy the client's requirements, we then turn to the public cloud; cost and privacy become the biggest issues there.

6. Timetable
Date          Milestone
Experiment Preparation
11.1-11.5     Finish the MapReduce version of the Pocket Algorithm
11.8-11.12    Finish the computation-intensive experiments on all environments
Implementation Part
11.15-11.19   Implement the Client Input Parser and Temporary Staging Storage of the application layer
11.22-11.26   Implement the Schedule Process of the application layer
11.29-12.3    Testing and a case study using this application layer
Conclusion
12.6-12.10    Conclusion and future work

7. Reference
The reference paper is also the one I will present.
Hyunjoo Kim, Yaakoub el Khamra, Shantenu Jha, Manish Parashar, "Exploring Application and Infrastructure Adaptation on Hybrid Grid-Cloud Infrastructure," in HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 402-412, Chicago, Illinois, USA, 2010.