From GWS to MapReduce: Google s Cloud Technology in the Early Days

Size: px
Start display at page:

Download "From GWS to MapReduce: Google s Cloud Technology in the Early Days"

Transcription

1 Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org

2 MapReduce/Hadoop Around 2004, Google invented MapReduce to parallelize computation of large data sets. It s been a key component in Google s technology foundation Around 2008, Yahoo! developed the open-source variant of MapReduce named Hadoop After 2008, MapReduce/Hadoop become a key technology component in cloud computing MapReduce Hadoop Hadoop or variants In 2010, the U.S. conferred the MapReduce patent to Google

3 Data-Intensive Computation MapReduce parallel computing for Webscale data Map: a higher-order function applied to all elements in a list Result is a new list The MapReduce Approach x 2 x 2 x 2 x 2 x (map (lambda (x) (* x x)) '( )) '( )

4 Data-Intensive Computation The MapReduce Approach Reduce is also a higher-order function Like fold : aggregate elements of a list Accumulator set to initial value Function applied to list element and the accumulator Result stored in the accumulator Repeated for every item in the list Result is the final value in the accumulator Initial value final value (fold + 0 '( )) 15 (fold * 1 '( )) 120

5 Massive parallel processing made simple Example: word count MapReduce programming The MapReduce Approach Map: parse a document and generate <word, 1> pairs Reduce: receive all pairs for a specific word, and count (sum) Map // D is a document for each word w in D output <w, 1> Reduce Reduce for key w: count = 0 for each input item count = count + 1 output <w, count>

6 Workflow 1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines. 2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

7 Workflow 3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs (KV pairs) produced by the Map function are buffered in memory. 4. Periodically, the buffered pairs are written to local disks, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

8 Workflow 5. A reduce worker processes one partition of data. When a reduce worker, r, is notified by the master about the locations of intermediate data, r uses RPCs to read the data in its partition from the local disks of the map workers. After all intermediate data are read, r sorts and groups them by the intermediate keys. 6. Each reduce worker iterates over the sorted intermediate KV pairs in its partition and, for each unique intermediate key encountered, it passes the key and the associated set of values to the Reduce function. The output of the Reduce function is appended to a temporary output file on GFS. 7. A completed reduce task renames its temporary output file to the output file of its partition. When all tasks have completed, MapReduce returns to the user code.

9 Programming How to write a MapReduce program to Generate inverted indices? Sort? How to express more sophisticated logic? What if some workers (slaves) or the master fails?

10 MapReduce Limitations MapReduce provides an easy-to-use framework for parallel programming, but is it the most efficient and best solution to program execution in datacenters? MapReduce has its discontents DeWitt and Stonebraker: MapReduce: A major step backwards MapReduce is far less sophisticated and efficient than parallel query processing MapReduce is a parallel processing framework, not a database system, nor a query language It is possible to use MapReduce to implement some of the parallel query processing functions What are the real limitations? Inefficient for general programming (and not designed for that) Hard to handle data with complex dependence, frequent updates, etc. High overhead, bursty I/O, difficult to handle long streaming data Limited opportunity for optimization

11 Slow Problems of MapReduce/Hadoop Cannot support low-latency, interactive, or real-time programs Very slow, if the algorithm processes data in a graph model contains non-trivial dependences among operations walks through multiple iterations follows a complex data/control flow Not programmable, if the algorithm invokes recursion of Map and Reduce steps Requires an external glue language takes unpredictable execution time 11

12 Time (sec) Computing Capability 9044 Z. Ma and L. Gu. The Limitation of MapReduce: a Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010 gcc (on one node) mrcc/hadoop mrcc/mrlite Linux kernel ImageMagick Xen tools Using Hadoop, parallel compilation is even slower than sequential compilation! 12

13 More Solutions Coming Dryad/DryadLINQ Megastore/Spanner Pregel OpenCL RamCloud The prosperity of solutions in this area, however, indicates that there has not been a solid, general-purpose technology for cloud systems. There are challenges in constructing sophisticated big-data processing capability, wide-area computation and very-large database systems.

14 SPARK/RDD RDD: Resilient Distributed Datasets Compute in the memory Operators transform data 20 times faster than Hadoop on some iterative workloads Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proc. of NSDI '12. RDD is not a general-purpose instrument for big-data computation. We also expect empirical data on larger-scale workload to understand its scalability. In-memory computation will be increasingly important in big-data computation.

15 Inside MapReduce-Style Computation Network activities under MapReduce/Hadoop workload Hadoop: open-source implementation of MapReduce Processing data with 3 servers (20 cores) 116.8GB input data Network activities captured with Xen virtual machines

16 MapReduce workflow Initial data split into 64MB blocks Computed, results locally stored Master informed of result locations R reducers retrieve Data from mappers Final output written Communication intensive

17 Inside MapReduce Packet reception under MapReduce/Hadoop workload Large data volume Bursty network traffic Genrality widely observed in MapReduce workloads Packet reception on a slave server

18 Inside MapReduce Packet reception on the master server

19 Inside MapReduce Packet transmission on the master server

20 Datacenter Networking Common data center topology Internet Core Layer-3 router Datacenter Aggregation Layer-2/3 switch Access Layer-2 switch Servers

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

CSE-E5430 Scalable Cloud Computing Lecture 11

CSE-E5430 Scalable Cloud Computing Lecture 11 CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus

More information

Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Rakam: Distributed Analytics API

Rakam: Distributed Analytics API Rakam: Distributed Analytics API Burak Emre Kabakcı May 30, 2014 Abstract Today, most of the big data applications needs to compute data in real-time since the Internet develops quite fast and the users

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Apache Spark and Distributed Programming

Apache Spark and Distributed Programming Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Big Data and Scripting Systems beyond Hadoop

Big Data and Scripting Systems beyond Hadoop Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid

More information

VariantSpark: Applying Spark-based machine learning methods to genomic information

VariantSpark: Applying Spark-based machine learning methods to genomic information VariantSpark: Applying Spark-based machine learning methods to genomic information Aidan R. O BRIEN a a,1 and Denis C. BAUER a CSIRO, Health and Biosecurity Flagship Abstract. Genomic information is increasingly

More information

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004) MapReduce from the paper MapReduce: Simplified Data Processing on Large Clusters (2004) What it is MapReduce is a programming model and an associated implementation for processing and generating large

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data Spark and Shark High- Speed In- Memory Analytics over Hadoop and Hive Data Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li,

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

ANALYSIS OF BILL OF MATERIAL DATA USING KAFKA AND SPARK

ANALYSIS OF BILL OF MATERIAL DATA USING KAFKA AND SPARK 44 ANALYSIS OF BILL OF MATERIAL DATA USING KAFKA AND SPARK Ashwitha Jain *, Dr. Venkatramana Bhat P ** * Student, Department of Computer Science & Engineering, Mangalore Institute of Technology & Engineering

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

A Survey of Cloud Computing Guanfeng Octides

A Survey of Cloud Computing Guanfeng Octides A Survey of Cloud Computing Guanfeng Nov 7, 2010 Abstract The principal service provided by cloud computing is that underlying infrastructure, which often consists of compute resources like storage, processors,

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

MapReduce and Hadoop Distributed File System V I J A Y R A O

MapReduce and Hadoop Distributed File System V I J A Y R A O MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

Management & Analysis of Big Data in Zenith Team

Management & Analysis of Big Data in Zenith Team Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

Advanced Computer Networks. Scheduling

Advanced Computer Networks. Scheduling Oriana Riva, Department of Computer Science ETH Zürich Advanced Computer Networks 263-3501-00 Scheduling Patrick Stuedi, Qin Yin and Timothy Roscoe Spring Semester 2015 Outline Last time Load balancing

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Big Data in the Enterprise: Network Design Considerations

Big Data in the Enterprise: Network Design Considerations White Paper Big Data in the Enterprise: Network Design Considerations What You Will Learn This document examines the role of big data in the enterprise as it relates to network design considerations. It

More information

How Companies are! Using Spark

How Companies are! Using Spark How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof. CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Radoop: Analyzing Big Data with RapidMiner and Hadoop

Radoop: Analyzing Big Data with RapidMiner and Hadoop Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large

More information

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

More information

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks Haoyuan Li UC Berkeley Outline Motivation System Design Evaluation Results Release Status Future Directions Outline Motivation

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

MATE-EC2: A Middleware for Processing Data with AWS

MATE-EC2: A Middleware for Processing Data with AWS MATE-EC2: A Middleware for Processing Data with AWS Tekin Bicer Department of Computer Science and Engineering Ohio State University bicer@cse.ohio-state.edu David Chiu School of Engineering and Computer

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Amazon EC2 Product Details Page 1 of 5

Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of

More information

Big Data Analysis and HADOOP

Big Data Analysis and HADOOP Big Data Analysis and HADOOP B.Jegatheswari and M.Muthulakshmi III year MCA AVC College of engineering, Mayiladuthurai. Email ID: jjega.cool@gmail.com Mobile: 8220380693 Abstract: - Digital universe with

More information