Big Data and Apache Hadoop s MapReduce



Similar documents
Jeffrey D. Ullman slides. MapReduce for data intensive computing

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

CSE-E5430 Scalable Cloud Computing Lecture 2

How To Scale Out Of A Nosql Database

MapReduce (in the cloud)

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related


Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop IST 734 SS CHUNG

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

Introduction to Hadoop

Large scale processing using Hadoop. Ján Vaňo

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

MapReduce with Apache Hadoop Analysing Big Data

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop. Sunday, November 25, 12

Big Data Processing with Google s MapReduce. Alexandru Costan

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Hadoop Architecture. Part 1

Application Development. A Paradigm Shift

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Introduction to Hadoop

Chapter 7. Using Hadoop Cluster and MapReduce

Open source Google-style large scale data analysis with Hadoop

Big Data With Hadoop

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

BBM467 Data Intensive ApplicaAons

Data-Intensive Computing with Map-Reduce and Hadoop

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

Advanced Data Management Technologies

MapReduce and Hadoop Distributed File System

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

THE HADOOP DISTRIBUTED FILE SYSTEM

Open source large scale distributed data management with Google s MapReduce and Bigtable

BIG DATA, MAPREDUCE & HADOOP

CS54100: Database Systems

A Brief Outline on Bigdata Hadoop

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Accelerating and Simplifying Apache

Introduction to Hadoop

Lecture 10 - Functional programming: Hadoop and MapReduce

Introduction to Parallel Programming and MapReduce

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

The Hadoop Framework

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Map Reduce & Hadoop Recommended Text:

Parallel Processing of cluster by Map Reduce

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Map Reduce / Hadoop / HDFS

Lecture Data Warehouse Systems

Hadoop & its Usage at Facebook

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Big Data Management and NoSQL Databases

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

MapReduce and Hadoop Distributed File System V I J A Y R A O

Apache Hadoop FileSystem and its Usage in Facebook

Cloud Computing using MapReduce, Hadoop, Spark

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive

Hadoop and Map-Reduce. Swati Gore

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Transcription:

Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

Table of Contents 1 Introduction 2 Hadoop Distributed File System 3 MapReduce 4 More Examples Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 2 / 23

Single Machine CPU Memory Disk Typical set up for data processing/mining! What are the problems with big data? Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 3 / 23

Big Data Challenge Data Sources Internet and social networks Sensors and science Needed Infrastructure Scale to thousands of CPUs Run on cheap commodity hardware (fault-tolerant hardware is expensive!) Automatically handle data replication and node failure Data distribution and load balancing Easy to implement solutions (thinking in terms of parallel computing is hard!) Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 4 / 23

Hardware: Cluster Architecture Switch Backbone Switch Rack 1... Switch Rack n CPU Memory CPU CPU... Memory Memory... CPU Memory Disk Disk Disk Disk Node failure! Rack contains 16-64 nodes Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 5 / 23

Software: Apache Hadoop What is Apache Hadoop? A software framework that supports data-intensive (petabytes) distributed applications under a free license....inspired by Google s MapReduce and Google File System (GFS). Hadoop provides: Distributed file system HDFS API to work with MapReduce Job configuration and scheduling Track progress and utilization Written in the Java programming language. Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 6 / 23

What Problem can be solved with Hadoop? Characteristics Processing can easily be made in parallel (simple computations) Process large amounts of unstructured data Running batch jobs is acceptable Examples Creating statistics (word counting) Searching (distributed grep) Sorting Indexing (postings list) Document clustering Graph algorithms (E.g. pagerank) Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 7 / 23

Who uses Hadoop? Adobe Amazon.com AOL Facebook Google IBM Microsoft NY Times Yahoo!... Source http://wiki.apache.org/hadoop/poweredby Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 8 / 23

Table of Contents 1 Introduction 2 Hadoop Distributed File System 3 MapReduce 4 More Examples Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 9 / 23

The Hadoop Distributed File System (HDFS) Source: http://hadoop.apache.org/common/docs/current/hdfs_design.html Files are split into large block size: 64 MB (typical fs has 4 kb) Replication (2-3x on different racks) Fault-tolerance Master node stores meta information Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 10 / 23

Table of Contents 1 Introduction 2 Hadoop Distributed File System 3 MapReduce 4 More Examples Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 11 / 23

MapReduce Basic Idea 1 Apply a map function to each input element and emit key/value pairs. map(k1, v1) list(k2, v2) 2 Summarize the results for each key using a reduce function. reduce(k2, list(v2)) (k2, v3) The user only has to specify the map and reduce functions and the framework takes care of the rest! The user does not have to think about concurrency, load balancing, data distribution, fault-tolerance! Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 12 / 23

Example: Word Counting Count how often each word appears in a large number of documents. 1 void map ( String name, String document ){ 2 // name: document name 3 // document : document contents 4 for each word w in document : 5 EmitIntermediate ( w, 1 ) 6 } 7 8 void reduce ( String word, Iterator partialcounts ){ 9 // word: a word 10 // partialcounts : a list of aggregated partial counts 11 int sum = 0 ; 12 for each pc in partialcounts : 13 sum += ParseInt ( pc ) ; 14 Emit ( word, AsString ( sum ) ) ; 15 } Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 13 / 23

Example: Word Counting the quick brown fox the lazy old dog the quick brown dog the, 1 quick, 1 brown, 1 fox, 1 the, 1 lazy, 1 old, 1 dog, 1 the, 1 quick, 1 brown, 1 dog, 1 brown, 1 brown, 1 dog, 1 dog, 1 fox, 1 lazy, 1 old, 1 quick, 1 quick, 1 the, 1 the, 1 the,1 split map shuffle/sort reduce brown, 2 dog, 2 fox, 1 lazy, 1 old, 1 quick, 2 the, 3 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 14 / 23

Execution of a MapReduce Job Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004. Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 15 / 23

Fault-tolerance Switch Backbone Switch Rack 1... Switch Rack n Master reschedules Task 1 here CPU Memory Disk ping... Task CPU1 Memory Block Disk 1 CPU Memory Block Disk 1... CPU Memory Disk Master Node failure! Rack contains 16-64 nodes Master re-executes tasks for failed workers. Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 16 / 23

Locality Switch Backbone Switch Rack 1... Schedule Task which needs block 2 here or at least on this rack! Switch Rack n CPU Memory Disk CPU CPU... Memory Memory... Block Disk 1 Block Disk 2 CPU Memory Disk Master Rack contains 16-64 nodes Schedule map tasks near to the data to preserve network bandwidth. Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 17 / 23

More Properties Load Balancing: Subdivide work in many small tasks ( # of workers). Since the master dynamically assigns tasks to idle nodes, this automatically provides load balancing. Chaining: MapReduce operations can be chained (output of one operation is input for another operation) to solve more complicated computations. Backup tasks: Reduce phase can only start after all map tasks are finished (use backup tasks to avoid stragglers ). Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 18 / 23

Table of Contents 1 Introduction 2 Hadoop Distributed File System 3 MapReduce 4 More Examples Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 19 / 23

Example: Distributed Grep Map task? Reduce task? Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 20 / 23

Example: Creating an Inverted Index Map task? Reduce task? Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 21 / 23

Related Projects Apache HBase: open source, non-relational, distributed database modeled after Google s BigTable Apache Hive: data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Provides a SQL-like query language called HiveQL. Apache Pig: is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform. Apache Lucene: a high-performance, full-featured text search engine library written entirely in Java. Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 22 / 23

Reading Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html The Apache Hadoop Project http://hadoop.apache.org/ Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 23 / 23