CSE 590: Special Topics Course (Supercomputing)
Lecture 10: MapReduce & Hadoop
Rezaul A. Chowdhury, Department of Computer Science, SUNY Stony Brook, Spring 2016

MapReduce
MapReduce is a programming model for expressing distributed computations on massive datasets, and an execution framework for large-scale data processing on commodity clusters. It was developed at Google in 2004 (Jeffrey Dean & Sanjay Ghemawat). An open-source version called Hadoop was later developed at Yahoo; Hadoop is now an Apache project, and Amazon Elastic MapReduce runs Hadoop on Amazon EC2.

MapReduce
MapReduce provides simple APIs, along with automatic:
o Parallelization
o Data distribution
o Load balancing
o Fault tolerance

Big Ideas behind MapReduce
o Scale Out Instead of Scaling Up: A large number of commodity low-end servers is preferred over a small number of high-end servers.
o Be Ready to Tackle Failures: Failures are the norm in warehouse-scale computing.
o Move Code to the Data: Code transfer is much cheaper than transferring massive amounts of data.
o Process Data Sequentially: Random accesses to data stored on disks are much costlier than sequential accesses.
o Hide System-Level Details from Programmers: Provide a simple abstraction that is easy to reason about.
o Seamless Scalability: A simple programming model that approaches ideal scaling characteristics in many circumstances.

A Simplified View of MapReduce
[Figure: input files are split into input key-value pairs; Map produces intermediate key-value pairs; the Shuffle groups all values associated with the same key; Reduce writes the output files.] Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
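In Lin & Dyer's formulation, the programmer supplies only two functions, with the following type signatures (square brackets denote lists):

    map: (k1, v1) -> [(k2, v2)]
    reduce: (k2, [v2]) -> [(k3, v3)]

The execution framework guarantees that all intermediate values associated with the same key k2 are grouped together and passed to a single reduce invocation.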

A Simple Word Count Example
Count the number of occurrences of every word in a text collection. Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
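The slide's pseudocode does not survive the transcription; below is a minimal sketch of the same algorithm against the Hadoop Java API (class names are illustrative, not from the slide):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (word, 1) for every token of every input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: the shuffle delivers (word, [1, 1, ...]); sum the ones.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }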

Combiner & Partitioner
Combiner: a mini-reducer that runs in the map phase. Partitioner: determines which key goes to which reducer. Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
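To make the two roles concrete, here is a sketch (the partitioner shown reimplements the default hash rule, and the driver calls are illustrative, not from the slide):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partitioner: maps a key to one of numPartitions reducers.
    // This rule is exactly what Hadoop's default HashPartitioner does.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // In the driver, for the word count job:
    //   job.setCombinerClass(WordCountReducer.class);   // safe: a sum is associative and commutative
    //   job.setPartitionerClass(WordPartitioner.class);

A reducer can double as a combiner only when the reduce function is associative and commutative, which a sum is; the mean example below shows what to do when it is not.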

Word Count with In-Mapper Combining Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce

Word Count with Improved In-Mapper Combining Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
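A sketch of the idea behind the improved version (buffer-size management and periodic flushing are omitted): the mapper itself accumulates partial counts in an in-memory map across all records of its input split and emits them once, in cleanup(), rather than relying on a combiner that the framework may or may not run:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Partial counts, preserved across calls to map() within one task.
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit each distinct word exactly once per map task.
            Text word = new Text();
            IntWritable count = new IntWritable();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                word.set(e.getKey());
                count.set(e.getValue());
                context.write(word, count);
            }
        }
    }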

Compute Mean of Values Associated with Each Key Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce

Mean of Values with a Separate Combiner Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce

Mean of Values with an In-Mapper Combiner Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
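The subtlety these slides work through: the mean is not associative, so a combiner cannot simply emit means of means. For example, mean(1, 2, 3, 4, 5) = 3, but mean(mean(1, 2), mean(3, 4, 5)) = mean(1.5, 4) = 2.75. The fix is to make the partial result a (sum, count) pair, which does combine associatively, and to divide only in the final reducer. A sketch with a hypothetical pair type (the mapper is assumed to emit (key, SumCount(value, 1)) for each observed value):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Reducer;

    // A running (sum, count): unlike a mean, this merges associatively.
    public class SumCount implements Writable {
        public double sum;
        public long count;

        public SumCount() {}
        public SumCount(double sum, long count) { this.sum = sum; this.count = count; }

        @Override public void write(DataOutput out) throws IOException {
            out.writeDouble(sum);
            out.writeLong(count);
        }
        @Override public void readFields(DataInput in) throws IOException {
            sum = in.readDouble();
            count = in.readLong();
        }
    }

    // Combiner: merges partials into a partial; it must NOT divide.
    class MeanCombiner extends Reducer<Text, SumCount, Text, SumCount> {
        @Override protected void reduce(Text key, Iterable<SumCount> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0; long count = 0;
            for (SumCount v : values) { sum += v.sum; count += v.count; }
            context.write(key, new SumCount(sum, count));
        }
    }

    // Reducer: merges partials, then divides exactly once.
    class MeanReducer extends Reducer<Text, SumCount, Text, DoubleWritable> {
        @Override protected void reduce(Text key, Iterable<SumCount> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0; long count = 0;
            for (SumCount v : values) { sum += v.sum; count += v.count; }
            context.write(key, new DoubleWritable(sum / count));
        }
    }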

Computing Word Co-occurrences Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce

Word Co-occurrences (Stripes Approach) Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
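A sketch of the stripes mapper (the window size and whitespace tokenization are assumptions): each word's co-occurring neighbors within the window are accumulated into an associative array, Hadoop's MapWritable here, and emitted as a single value; the reducer then merges stripes element-wise.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        private static final int WINDOW = 2;   // assumed co-occurrence window

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] terms = value.toString().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                if (terms[i].isEmpty()) continue;
                MapWritable stripe = new MapWritable();   // neighbor -> count
                int lo = Math.max(0, i - WINDOW), hi = Math.min(terms.length - 1, i + WINDOW);
                for (int j = lo; j <= hi; j++) {
                    if (j == i) continue;
                    Text neighbor = new Text(terms[j]);
                    IntWritable old = (IntWritable) stripe.get(neighbor);
                    stripe.put(neighbor, new IntWritable(old == null ? 1 : old.get() + 1));
                }
                context.write(new Text(terms[i]), stripe);
            }
        }
    }

Compared with emitting one ((w1, w2), 1) pair per co-occurrence, stripes produce far fewer but larger intermediate key-value pairs; the trade-off is that each stripe must fit in memory.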

Baseline Inverted Indexing for Text Retrieval Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
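A minimal sketch of the baseline algorithm (the doc-id convention, one document per input file, is an assumption):

    import java.io.IOException;
    import java.util.Set;
    import java.util.TreeSet;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Mapper: emit (term, docId) for every term occurrence.
    public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text docId = new Text();
        private final Text term = new Text();

        @Override
        protected void setup(Context context) {
            // Assumption: one document per file, so the file name is the doc id.
            docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String t : value.toString().toLowerCase().split("\\W+")) {
                if (t.isEmpty()) continue;
                term.set(t);
                context.write(term, docId);
            }
        }
    }

    // Reducer: collect the full posting list for each term in memory.
    class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new TreeSet<>();   // sorted and de-duplicated
            for (Text d : docIds) postings.add(d.toString());
            context.write(term, new Text(String.join(", ", postings)));
        }
    }

The in-memory posting list in the reducer is the baseline's scalability bottleneck; the scalable version on the next slide removes it by moving the sorting of postings into the shuffle (value-to-key conversion with composite (term, docId) keys).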

Scalable Inverted Indexing for Text Retrieval Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce

Parallel Breadth-First Search Source: Lin & Dyer, Data-Intensive Text Processing with MapReduce
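The slide presents Lin & Dyer's iterative formulation; here is a sketch of one iteration (the record encoding, nodeId TAB distance TAB comma-separated adjacency list, is my assumption). The map phase pushes each reached node's distance to its neighbors, the reduce phase keeps the minimum distance seen for each node, and a driver re-runs the job until no distance changes:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BFSMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String nodeId = parts[0], neighbors = parts.length > 2 ? parts[2] : "";
            int dist = Integer.parseInt(parts[1]);   // Integer.MAX_VALUE = not yet reached
            // Pass the graph structure through to the reducer ("N" record).
            context.write(new Text(nodeId), new Text("N|" + dist + "|" + neighbors));
            // Push tentative distances to the neighbors ("D" records).
            if (dist < Integer.MAX_VALUE && !neighbors.isEmpty()) {
                for (String m : neighbors.split(",")) {
                    context.write(new Text(m), new Text("D|" + (dist + 1)));
                }
            }
        }
    }

    class BFSReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text nodeId, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int best = Integer.MAX_VALUE;
            String neighbors = "";
            for (Text v : values) {
                String[] parts = v.toString().split("\\|", -1);
                if (parts[0].equals("N")) neighbors = parts[2];   // recover the structure
                best = Math.min(best, Integer.parseInt(parts[1]));
            }
            context.write(nodeId, new Text(best + "\t" + neighbors));
        }
    }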

Hadoop Subprojects
Source: Tom White, Hadoop: The Definitive Guide
o Core: A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
o Avro: A data serialization system for efficient, cross-language RPC and persistent data storage.
o MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
o HDFS: A distributed filesystem that runs on large clusters of commodity machines.

Hadoop Subprojects
Source: Tom White, Hadoop: The Definitive Guide
o Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
o HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
o ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

Hadoop Subprojects
Source: Tom White, Hadoop: The Definitive Guide
o Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
o Chukwa: A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports.

The Building Blocks of Hadoop
On a fully configured Hadoop cluster, a set of daemons, or resident programs, runs on the different servers in the network:
o NameNode
o DataNode
o Secondary NameNode
o JobTracker
o TaskTracker

The Building Blocks of Hadoop
Source: Chuck Lam, Hadoop in Action
o NameNode: The bookkeeper of HDFS; it keeps track of how files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
o DataNode: Each slave machine in the cluster hosts a DataNode daemon to perform the reading and writing of HDFS blocks to actual files on the local filesystem.

The Building Blocks of Hadoop
Source: Chuck Lam, Hadoop in Action
o JobTracker: Determines the execution plan for a job by deciding which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch it, possibly on a different node.
o TaskTracker: Manages the execution of individual (map or reduce) tasks on each slave node.

The Building Blocks of Hadoop
Source: Chuck Lam, Hadoop in Action
o Secondary NameNode: Communicates with the NameNode to take periodic snapshots of the HDFS metadata. It does not keep track of real-time changes to HDFS, but it can be configured to work as the NameNode in the event of the failure of the original NameNode.

Hadoop Distributed File System (HDFS) Design
HDFS was designed for:
o Very large files
o Streaming data access
o Commodity hardware
But not for:
o Low-latency access
o Lots of small files
o Multiple writers and arbitrary file modifications
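For a feel of the client-side API, here is a minimal sketch that streams a file out of HDFS (the path is a placeholder; the cluster address comes from the configuration files on the classpath):

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml, hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // the configured default filesystem
            InputStream in = null;
            try {
                in = fs.open(new Path(args[0]));        // e.g. /user/alice/input.txt (placeholder)
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

This sequential read pattern matches the "streaming data access" design goal above.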

Hadoop MapReduce
Source: Tom White, Hadoop: The Definitive Guide

An Example: Mining Weather Data
Find the maximum temperature for every year. Source: Tom White, Hadoop: The Definitive Guide

Maximum Temperature Every Year (Java) Source: Tom White, Hadoop: The Definitive Guide
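The original slides show the book's full Java listing, which the transcription drops; below is a condensed sketch reconstructed along the lines of that example (the column offsets refer to the NCDC record format: year in columns 15-19, signed temperature in 87-92, quality code in 92):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: parse one NCDC record and emit (year, temperature).
    public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature = (line.charAt(87) == '+')
                    ? Integer.parseInt(line.substring(88, 92))    // drop the explicit plus sign
                    : Integer.parseInt(line.substring(87, 92));   // tenths of a degree Celsius
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reducer: keep the maximum temperature seen for each year.
    class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }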

Maximum Temperature Every Year (Python) Source: Tom White, Hadoop: The Definitive Guide

Maximum Temperature Every Year (C++) Source: Tom White, Hadoop: The Definitive Guide