A programming model in Cloud: MapReduce




A programming model in Cloud: MapReduce
- A programming model and implementation developed by Google for processing large data sets
- Users specify:
  - a map function that generates a set of intermediate key/value pairs
  - a reduce function that merges all intermediate values associated with the same intermediate key
- Programs can be automatically parallelized and executed on a large cluster of machines; the runtime takes care of:
  - partitioning the input data
  - scheduling execution
  - handling machine failures
  - managing communication
- Allows users without much experience to utilize the resources of a large distributed system

An example: Count word frequency
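The word-count example can be sketched in Python. The function names `map_fn`, `reduce_fn`, and `word_count` are illustrative, and the single-process driver only simulates the shuffle step that a real MapReduce runtime performs across machines:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Merge all intermediate values associated with the same key.
    return word, sum(counts)

def word_count(documents):
    # Shuffle: group intermediate values by key, then reduce each group.
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(word_count({"d1": "the cat sat", "d2": "the dog"}))
# -> {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}
```

The user only writes `map_fn` and `reduce_fn`; everything inside `word_count` is what the MapReduce library automates at cluster scale.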

MapReduce operations
When the user program calls the MapReduce function, the following sequence of operations occurs:
- Partition the input data: split the input files into M pieces
- Assign tasks: start up many copies of the program on a cluster of machines; one copy is the master and the others are workers
- The master assigns map tasks and reduce tasks to workers
- A map worker:
  - reads its input split, extracts key/value pairs from the input data, and passes each pair to the Map function; the Map function produces the intermediate key/value pairs
  - buffers the intermediate key/value pairs to its local disk; the disk locations of the data are passed back to the master, and the master forwards the locations to the reduce workers

MapReduce operations
- A reduce worker:
  - uses remote procedure calls to read the buffered intermediate key/value pairs from the local disks of the map workers
  - passes the intermediate key/value pairs to the Reduce function
- After all map and reduce tasks have completed, the call to MapReduce returns to the user code
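The sequence above can be mimicked in a single process. Splitting the input into M pieces and assigning each intermediate pair to one of R reduce tasks are from the source; the use of `hash(key) % R` as the partitioning function is a common convention assumed here, not something the slides specify:

```python
from collections import defaultdict

M, R = 3, 2  # number of map tasks and reduce tasks (illustrative)

def run_job(records, map_fn, reduce_fn):
    # 1. Partition the input into M splits.
    splits = [records[i::M] for i in range(M)]
    # 2. Each "map worker" buffers its output into R partitions,
    #    standing in for the local-disk buffers later read via RPC.
    buffers = [[defaultdict(list) for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for record in split:
            for key, value in map_fn(record):
                buffers[m][hash(key) % R][key].append(value)
    # 3. Each "reduce worker" r reads partition r from every map worker,
    #    merges values by key, and applies the Reduce function.
    output = {}
    for r in range(R):
        merged = defaultdict(list)
        for m in range(M):
            for key, values in buffers[m][r].items():
                merged[key].extend(values)
        for key, values in merged.items():
            output[key] = reduce_fn(key, values)
    return output

result = run_job(["a b", "b c", "a"],
                 lambda rec: [(w, 1) for w in rec.split()],
                 lambda k, vs: sum(vs))
print(result)
```

Each key lands in exactly one of the R partitions, which is why every reduce worker sees all values for the keys it owns.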

MapReduce execution flow

MapReduce library
- Partitions the input data
- Starts up and schedules execution: scheduling map and reduce workers
- Manages communication: reduce workers retrieve the intermediate results from map workers
- Handles machine failures:
  - pings every worker periodically; if there is no response within a certain amount of time, the worker is marked as failed
  - any map or reduce task in progress on a failed worker is rescheduled
  - map tasks completed by the failed worker are also rescheduled, because their output is stored on the failed machine's local disk and is no longer accessible
  - reduce tasks completed by the failed worker do not need to be rescheduled, because their output is stored in the global file system

Popular Cloud Systems
- Google Cloud
- Amazon Cloud

Google provides numerous online services

Google Cloud Computing: Focusing on User Data
- User data is stored in the Cloud: e-mails, messages, contacts, calendars, preferences, news, investments, maps, mailing lists, photos, music, phone numbers
- Data access is not constrained by geographic location
- Data can be conveniently shared

Google Cloud Infrastructure
- User applications run on top of the Google Cloud infrastructure
- Cluster-wide services: a scheduler, the GFS master, and the Chubby lock service
- Each node runs Linux together with a scheduler slave, a GFS chunkserver, and a Bigtable server; MapReduce jobs execute across the nodes

Bigtable: a distributed storage system in Google Cloud
- Resembles a database
- Aims to scale to a very large size: petabytes of data across thousands of nodes
- Provides a simple data model:
  - data is indexed using row and column names
  - supports dynamic partitioning of the table
  - data is treated as strings

Bigtable: a distributed storage system in Google Cloud
- A row key in a table is an arbitrary string; Bigtable maintains data in lexicographic order by row key
- Web pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs
- A table is dynamically partitioned into row ranges; each row range is called a tablet
  - The tablet is the unit of distribution and load balancing
  - Tablets are distributed and stored on multiple machines
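The hostname-reversal trick can be illustrated with a small helper (the function name is hypothetical; Bigtable leaves row-key construction to the application):

```python
def bigtable_row_key(url):
    # Reverse the hostname components so pages from the same domain
    # sort into contiguous rows: maps.google.com/index.html
    # becomes com.google.maps/index.html.
    host, sep, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + sep + path

keys = sorted(map(bigtable_row_key, [
    "maps.google.com/index.html",
    "news.yahoo.com/world",
    "mail.google.com/inbox",
]))
print(keys)
# Both google.com hosts now sort next to each other:
# ['com.google.mail/inbox', 'com.google.maps/index.html', 'com.yahoo.news/world']
```

Because rows are stored in lexicographic order, a scan over the prefix `com.google.` reads the whole domain from one contiguous row range.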

BigTable timestamps and API
- Timestamps: 64-bit integers; can be used to specify that only new-enough versions are kept
- API: operations for creating and deleting tables and columns

Big picture of BigTable
- A Bigtable cluster stores a number of tables
- Each table consists of a set of tablets
- Each tablet contains all data associated with a row range
- Initially, each table consists of just one tablet
- As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default
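Tablet growth and splitting can be sketched as follows. The 100-200 MB default is from the slide; the data structure (a tablet as a sorted list of `(row_key, size_mb)` pairs) and the split-at-the-middle rule are illustrative simplifications:

```python
SPLIT_THRESHOLD_MB = 200  # tablets are split at roughly 100-200 MB

def maybe_split(tablet):
    # tablet: sorted list of (row_key, size_mb) pairs covering one row range.
    size = sum(mb for _, mb in tablet)
    if size <= SPLIT_THRESHOLD_MB:
        return [tablet]
    mid = len(tablet) // 2  # split the row range roughly in half
    return [tablet[:mid], tablet[mid:]]

tablet = [("a", 90), ("b", 80), ("c", 70), ("d", 60)]
left, right = maybe_split(tablet)
print(left, right)
# -> [('a', 90), ('b', 80)] [('c', 70), ('d', 60)]
```

Because each half covers a contiguous row range, the master can hand the two new tablets to different tablet servers for load balancing.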

Infrastructure of BigTable
- When a number of machines are needed to store a Bigtable, one machine is elected as the master and the other machines are called tablet servers
- The master is responsible for assigning tablets to tablet servers and balancing tablet-server load
- Each tablet server manages a set of tablets (typically between ten and a thousand tablets per tablet server)

Infrastructure of BigTable

Chubby
- A lock service
  - Synchronizing activities: e.g., two clients cannot update a cell at the same time; maintaining data consistency among different tablet servers
  - Distributed consensus: e.g., electing a leader from a set of equivalent servers
- A simple file system
  - Performs whole-file reads and writes
  - Both GFS and Bigtable use Chubby to store a small amount of metadata (e.g., the root of their data structures)

Google File System
- Provides the traditional file system interface
- Files are divided into chunks, which are stored on chunkservers
- Consists of a single master and multiple chunkservers
- The master stores the locations of the file chunks
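Chunking can be sketched with a single function. The 64 MB chunk size is GFS's documented default (the slide does not state it); the function itself is an illustration, since in GFS the master only records chunk-to-chunkserver mappings rather than holding the bytes:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64 MB chunks

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    # Cut a file's bytes into fixed-size chunks; the master would then
    # record, for each chunk index, which chunkservers hold a replica.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = split_into_chunks(b"x" * 100, chunk_size=40)
print([len(c) for c in chunks])  # -> [40, 40, 20]
```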

Google App Engine
Running your own web servers means:
- complex server structures: Apache, SQL, etc.
- endless monitoring and debugging
- exhausting system upgrades and fault-tolerance work
Google App Engine instead offers:
- a concise development and running platform for web applications
- running on the data centres provided by Google
- management of the entire life cycle of web applications

Google App Engine
- Enables clients to run web applications on Google's infrastructure
- Clients do not have to maintain servers: users just upload the applications, and the applications can then serve the users of the web applications
- Supports the Java and Python languages
- You pay for what you use; there are no set-up or maintenance costs


Amazon Elastic Compute Cloud (EC2) and related services
- EC2: running instances of virtual machines
- S3: Simple Storage Service, with SOAP and object interfaces
- SimpleDB: a simplified database
- EBS: Elastic Block Store, providing a block interface and storing virtual machine images
- SQS: Simple Queue Service

AWS is a production-level Cloud system

Amazon SimpleDB
- A web service for running queries on structured data in real time
- Data model: domains, items, attributes, values

Amazon SimpleDB
- API: CreateDomain, DeleteDomain, PutAttributes, Select
- Consistency: SimpleDB keeps multiple copies of each domain
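The domain/item/attribute/value data model can be sketched in memory. The class below illustrates the model only, not the SimpleDB client API; note that a SimpleDB attribute can hold multiple values, which is why values are stored as sets:

```python
from collections import defaultdict

class Domain:
    # A domain maps item names to attributes; each attribute
    # may hold several values.
    def __init__(self, name):
        self.name = name
        self.items = defaultdict(lambda: defaultdict(set))

    def put_attributes(self, item, attrs):
        # attrs: iterable of (attribute_name, value) pairs.
        for attr_name, value in attrs:
            self.items[item][attr_name].add(value)

    def select(self, attr_name, value):
        # Names of items whose attribute attr_name contains value.
        return sorted(item for item, attrs in self.items.items()
                      if value in attrs.get(attr_name, set()))

books = Domain("books")
books.put_attributes("item1", [("author", "Smith"), ("topic", "cloud")])
books.put_attributes("item2", [("topic", "cloud"), ("topic", "databases")])
print(books.select("topic", "cloud"))  # -> ['item1', 'item2']
```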

Amazon Simple Storage Service (S3)
- S3 has a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web
- Buckets: containers for objects
- Objects: consist of object data and metadata; the metadata is a set of name-value pairs that describe the object
- Keys and versioning: an object is identified by two components, a key and a version ID
- Operations: create a bucket, write an object, list keys
- S3 provides both a REST and a SOAP interface
  - The REST API: use a web browser to access S3
  - The SOAP API: write a web service client to access S3
- Regions: the user chooses the region where S3 will store the buckets they create, e.g., US, US-West, EU

Amazon S3 data consistency model
- High availability is achieved by replicating data across multiple servers within an Amazon data centre
- Suppose a process writes a new object to S3 and immediately attempts to read it: until the change is fully propagated, S3 might return "key does not exist"
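The read-after-write behaviour can be demonstrated with a toy replica set. This is a simulation of eventual consistency under stated assumptions (writes land on one replica, a background step propagates them, reads hit an arbitrary replica), not Amazon's implementation:

```python
import random

class EventuallyConsistentStore:
    # A write is acknowledged once one replica has it; propagation to
    # the remaining replicas happens later, so a read issued in between
    # may miss the new object.
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]

    def put(self, key, value):
        self.replicas[0][key] = value   # acknowledged before propagation

    def propagate(self):
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

    def get(self, key):
        # Reads are served by a random replica.
        return random.choice(self.replicas).get(key)

store = EventuallyConsistentStore()
store.put("photo.jpg", b"...")
maybe_stale = store.get("photo.jpg")  # may be None ("key does not exist")
store.propagate()
value = store.get("photo.jpg")        # after propagation, always b"..."
```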

Reference papers
- Xen and the Art of Virtualization
- MapReduce: Simplified Data Processing on Large Clusters
- Bigtable: A Distributed Storage System for Structured Data
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- The Google File System

Summary
- Cloud computing concepts
- Virtualization technology
- The programming model in the Cloud
- Google Cloud
- Amazon Cloud