Hadoop Parallel Data Processing



Similar documents
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Architecture. Part 1

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Apache Hadoop. Alexandru Costan

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop implementation of MapReduce computational model. Ján Vaňo

Chapter 7. Using Hadoop Cluster and MapReduce


MapReduce, Hadoop and Amazon AWS

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Large scale processing using Hadoop. Ján Vaňo

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Open source Google-style large scale data analysis with Hadoop

How To Use Hadoop

Parallel Processing of cluster by Map Reduce

Data-Intensive Computing with Map-Reduce and Hadoop

Map Reduce & Hadoop Recommended Text:

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

A Cost-Evaluation of MapReduce Applications in the Cloud

CSE-E5430 Scalable Cloud Computing Lecture 2

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Big Data With Hadoop

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

BBM467 Data Intensive ApplicaAons

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

L1: Introduction to Hadoop

Internals of Hadoop Application Framework and Distributed File System

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Hadoop and Map-Reduce. Swati Gore

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Introduction to MapReduce and Hadoop

Open source large scale distributed data management with Google s MapReduce and Bigtable

Advances in Natural and Applied Sciences

Hadoop Mapreduce Framework in Big Data Analytics Vidyullatha Pellakuri 1, Dr.D. Rajeswara Rao 2

Introduction to Hadoop

MapReduce. Tushar B. Kute,

HadoopRDF : A Scalable RDF Data Analysis System

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan

Intro to Map/Reduce a.k.a. Hadoop

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

MapReduce. Introduction and Hadoop Overview. 13 June Lab Course: Databases & Cloud Computing SS 2012

Storage Architectures for Big Data in the Cloud

HADOOP MOCK TEST HADOOP MOCK TEST II

Introduction to Hadoop

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Snapshots in Hadoop Distributed File System

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Fault Tolerance in Hadoop for Work Migration

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

GraySort and MinuteSort at Yahoo on Hadoop 0.23

A programming model in Cloud: MapReduce

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Performance and Energy Efficiency of. Hadoop deployment models

Distributed Computing and Big Data: Hadoop and MapReduce

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Yuji Shirasaki (JVO NAOJ)

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Research Article Hadoop-Based Distributed Sensor Node Management System

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Verification and Validation of MapReduce Program model for Parallel K-Means algorithm on Hadoop Cluster

Hadoop Scalability at Facebook. Dmytro Molkov YaC, Moscow, September 19, 2011

Processing of Hadoop using Highly Available NameNode

Survey on Scheduling Algorithm in MapReduce Framework

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases

HDFS. Hadoop Distributed File System

Big Data and Apache Hadoop s MapReduce

Hadoop Ecosystem B Y R A H I M A.

Hadoop IST 734 SS CHUNG

Log Mining Based on Hadoop s Map and Reduce Technique

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

MapReduce and Hadoop Distributed File System V I J A Y R A O

Lecture 10 - Functional programming: Hadoop and MapReduce

The Hadoop Framework

Introduction to Hadoop

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

MapReduce and Hadoop Distributed File System

Big Data Management and NoSQL Databases

Transcription:

MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for many big data processing applications; the underlying system can automatically support parallelism, data movement, load balancing, and fault tolerance behind this interface. Original MapReduce implementation Implemented in C/C++, but support many interfaces including Java Used at Google, but not shared with the public 1 2 Hadoop Hadoop Distributed File System Hadoop: Java based implementation of MapReduce Originated at Yahoo but made open source Used by Facebook (more than 100 PetaBytes of data), Amazon (EC2 used by NYTimes), and many other places Hadoop implements MapReduce Multiple TaskTrackers run map()/reduce() tasks Single JobTracker assigns tasks to TaskTrackers and manage the entire workflow Each TaskTracker sends periodic heartbeats to the JobTracker for fault detection and management Big data applications typically run on a distributed file system Too much data that must be spread over multiple nodes a distributed file system tracks where the data is A single file can be chopped up to blocks for distribution (allowing parallel accesses) Replication for reliability Google s MapReduce works on Google File System (GFS) Hadoop works with Hadoop Distributed File System (HDFS) Inferior to GFS in many ways (poor support for mutable data replication, incompatible with POSIX standard); but open source 3 4 1

Hadoop System Structure Hadoop Programming http://upload.wikimedia.org/wikipedia/en/2/2b/hadoop_1.png JobTracker and NameNode work together to support intelligent task placement Java based, implement several interfaces Mapper() Maps input key/value pairs to a set of intermediate key/value pairs Reducer() Reduces a set of intermediate values which share a key to a smaller set of values. Tool() Handles your custom command inputs or any other custom preparation work 5 6 Hadoop Task Mapping Hadoop Performance Hadoop tries to assign map tasks to where the data is Surprisingly where the data is also affects how many map tasks are launched The real number is typically no smaller than the number you specify, but could be larger if there are more distributed splits of your data Always check the completion status of your run to be sure about the number of map tasks What affects Hadoop performance? Parallelism: number of map tasks; load at reduce Data locality Load balancing: uniform/heterogeneous nodes Speed at each map/reduce task 7 8 2

Assignment #2 Word Counting You will program three applications on the Hadoop parallel data processing platform. Goal: gain practical experience on Hadoop programming and learn the performance implication of parallel data processing. Work on your own. Where is it used in practice? The input is one or more text files. You should count the number of files each word appears in. Your word identification should be case insensitive and ignore anything that isn't a letter. We provide two datasets: one for correctness testing and another for performance tests. 9 10 Matrix Vector Multiplication K means Where is it used in practice? Matrix vector multiplication takes a file describing a matrix as an input and has to also load a file with a vector. The matrix will be split across map tasks, but every task will need to load the entire vector. Widely used for data clustering in practice. Iterative. MapReduce for each iteration. 11 12 3

TA s Solution Our Hadoop Cluster The TA has implemented his solutions for the three applications and tested on some sample datasets One master node and 9 slave nodes Small scale Status/usage check Hadoop and HDFS Courtesy of using the shared Hadoop cluster Particular for performance testing Speedup isn t always the right performance metric! 13 14 Grading Miscellaneous Things Grade on correctness and performance On performance, we care about the raw speed at 8 maps, not so much on the speedup over your own sequential solution, which can be poor Will post the assignment tomorrow E turnin Start working early Strongly encourage early turn ins (bonuses). Note that the last turn in counts. Possibly run into issues, bear with us, potential award credit if discover and help solve problems with us 15 16 4

Data Parallel Computing Data Parallel Computing For some big data applications, data can fit into one machine and the main challenge is parallel data processing on multiprocessor MapReduce on multicore machines Rationale: MapReduce is easy to program; big data application has already been written on MapReduce Phoenix Rebirth: Scalable MapReduce on a Large Scale Shared Memory, developed at Stanford Data parallel computing on multicore Phoenix Threads Very different challenges and techniques for high performance and scalability (shared cache, synchronization, operating system scalability, ) GPUs 17 18 5