MapReduce & Hadoop




Big Data: MapReduce & Hadoop

Recommended text: Hadoop: The Definitive Guide, Tom White, O'Reilly.
(Slides © 2010 VMware Inc. All rights reserved.)

Why Big Data?

- Large datasets are becoming more common:
  - The New York Stock Exchange generates about one terabyte of new trade data per day.
  - Facebook hosts approximately 10 billion photos.
  - Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
  - The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
  - The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
- There is a need for a framework to process them.

MapReduce

- A programming model for data processing
- Created by Google
- Assumes a cluster of commodity servers
- Fault tolerant

Programming Model

- Break processing into two (disjoint) phases:
  - Map: (K1, V1) -> list(K2, V2)
  - Reduce: (K2, list(V2)) -> list(K3, V3)
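The two phase signatures above can be written down as generic Java interfaces. This is an illustrative sketch of the model only, not Hadoop's actual API; the interface and type-parameter names here are our own.

```java
import java.util.List;
import java.util.Map;

// Map: (K1, V1) -> list(K2, V2)
interface Mapper<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

// Reduce: (K2, list(V2)) -> list(K3, V3)
interface Reducer<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
}
```

The framework's job is everything between those two signatures: grouping every V2 emitted for the same K2 and delivering the resulting (K2, list(V2)) pairs to reducers.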

Example: Weather Data

- Logs from the National Climatic Data Center (NCDC)
  - Data files organized by year: 1990.gz, 1991.gz, ...
- Question: what is the highest recorded global temperature for each year in the dataset?
- Data stored using a line-oriented ASCII format
  - Each line is a record
  - Includes many meteorological attributes; we will focus on temperature

  332130    # USAF weather station identifier
  19500101  # observation date
  0300      # observation time
  +51317    # latitude (degrees x 1000)
  10268     # atmospheric pressure (hectopascals x 10)

Example: Map Function

- Input
  - Key is the offset of the reading in the file (ignored)
  - Value is one weather station reading
- Sample input records in the file:

  0067011990999991950051507004...9999999N9+00001+99999999999...
  0043011990999991950051512004...9999999N9+00221+99999999999...
  0043011990999991950051518004...9999999N9-00111+99999999999...

- Presented to the map function as key-value pairs:

  (0,   0067011990999991950051507004...9999999N9+00001+99999999999...)
  (106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
  (212, 0043011990999991950051518004...9999999N9-00111+99999999999...)

- The map function extracts the year and air temperature from each record:

  (1950, 0)
  (1950, 22)
  (1950, -11)

Example: Reduce Function

- Input
  - Key is the year
  - Value is the list of temperature readings for that year

  (1949, [111, 78])
  (1950, [0, 22, -11])

- Iterate through the list and pick the maximum reading:

  (1949, 111)
  (1950, 22)
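A plain-Java sketch of these two functions (deliberately outside Hadoop's API, so it runs standalone): the fixed-width character offsets for the year and temperature fields follow the NCDC format as described in Hadoop: The Definitive Guide and should be treated as an assumption here, and the book's quality-code filtering is omitted for brevity.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class MaxTemperature {
    // Assumed NCDC fixed-width offsets: year at columns 15-18,
    // signed temperature (tenths of a degree) at columns 87-91.
    static final int YEAR_START = 15, YEAR_END = 19;
    static final int TEMP_START = 87, TEMP_END = 92;

    // "Map": extract (year, temperature) from one line of the file.
    static Map.Entry<String, Integer> map(String record) {
        String year = record.substring(YEAR_START, YEAR_END);
        int temp;
        if (record.charAt(TEMP_START) == '+') {
            // skip the explicit plus sign, which parseInt of older Java rejects
            temp = Integer.parseInt(record.substring(TEMP_START + 1, TEMP_END));
        } else {
            // a leading '-' is parsed as part of the number
            temp = Integer.parseInt(record.substring(TEMP_START, TEMP_END));
        }
        return Map.entry(year, temp);
    }

    // "Reduce": maximum of all readings collected for one year.
    static int reduce(List<Integer> temps) {
        return Collections.max(temps);
    }
}
```

With the sample values above, reduce applied to (1950, [0, 22, -11]) yields 22, matching the slide.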

Apache Hadoop

- Open source MapReduce implementation
- Created by Doug Cutting
  - "Hadoop" is a made-up name: the name Cutting's kid gave to his stuffed yellow elephant
- Supports different programming languages
  - Java, Ruby, Python, C++
  - Our focus will be on Java
- In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
- In April 2008, Hadoop broke the world record by sorting a terabyte of data in 209 seconds

Hadoop Job

A Hadoop job consists of:
- Input data, stored on the Hadoop Distributed File System (HDFS)
- A MapReduce program
- Configuration information

Hadoop Job Execution and Data Flow

- Divide the input into fixed-size input splits
  - Typical split size is 1 HDFS block (64 MB)
- Run a map task for each split
  - Map tasks run in parallel
  - Preference is given to running a map on the node where its split is locally stored
  - Map tasks write their output to the local hard drive
- Sort the map output and send it to the reduce tasks (shuffle)
  - All records with the same key are sent to the same reducer
  - By default, keys are partitioned between reducers using a hash function
- On each reducer, merge the records arriving from multiple mappers
- Run the reduce tasks
- Write the output to HDFS
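The default hash-based assignment of keys to reducers can be sketched in a few lines. The logic below mirrors what Hadoop's default HashPartitioner does (mask off the sign bit of the key's hash, then take the remainder modulo the number of reducers), but the class itself is our own standalone illustration, not Hadoop's class.

```java
public class Partitioning {
    // Deterministically assign a key to one of numReducers partitions.
    // Masking with Integer.MAX_VALUE clears the sign bit so the
    // modulo result is never negative, even for negative hash codes.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the function depends only on the key, every record with the same key lands on the same reducer, which is exactly the guarantee the shuffle step above relies on.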

Hadoop 1 - System Architecture

- Client: submits the job for execution
- JobTracker: one per Hadoop cluster; coordinates the job
- TaskTracker: one per cluster node; executes individual map/reduce tasks
- MapTask or ReduceTask: application code

Hadoop 2 - YARN

http://hortonworks.com/hadoop/yarn/

Hadoop 2 - System Architecture

- ResourceManager
  - Scheduler: global resource allocation
  - ApplicationsManager: starts the ApplicationMaster
- NodeManager
  - One per machine; monitors resources
- ApplicationMaster
  - One per application
  - Negotiates appropriate resource containers from the Scheduler
  - Tracks container status and monitors progress

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn.html

Failures

- Mapper or reducer tasks
  - Failures are expected to be common in large clusters
  - A failed task is re-run on a different node, up to a configurable number of times (default is 4)
  - It is also possible to configure the number of tasks that may fail before the whole job is terminated
- TaskTracker/NodeManager
  - Failures are expected to be common in large clusters
  - Incomplete tasks allocated to the failed node are re-run on other nodes
- JobTracker/ApplicationMaster
  - Fails infrequently, but is a single point of failure: if it fails, the job fails
  - Hadoop 2 can run multiple Application Masters in high-availability mode
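The configurable retry limit mentioned above is set in the job configuration; a minimal sketch, assuming the Hadoop 2 property names for mapred-site.xml:

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml: per-task retry limits (Hadoop 2 property names) -->
<configuration>
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value> <!-- re-run a failed map task up to 4 times (the default) -->
  </property>
  <property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value> <!-- same limit for reduce tasks -->
  </property>
</configuration>
```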

Example: Word Count (Java Implementation)

- Reads text files and counts how often words occur
- WordCountMapper
  - Implements the map function
  - Takes a line as input and breaks it into words; emits a key/value pair of the word and 1
- WordCountReducer
  - Implements the reduce function
  - Sums the counts for each word and emits a single key/value pair with the word and its sum
- WordCountDriver
  - Configures the job
  - Runs the job

Running Hadoop Locally

- Install from http://hadoop.apache.org/
- Set environment variables:
  JAVA_HOME
  HADOOP_INSTALL
  export PATH=$PATH:$HADOOP_INSTALL/bin
- Create a Java project in Eclipse
- Add the Hadoop JAR files to the classpath
  - e.g., $HADOOP_INSTALL/hadoop-core-1.0.4.jar
- Create a configuration file:

  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>file:///</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>
    </property>
  </configuration>

Running Hadoop Locally (cont.)

- Add the source for the mapper, reducer, and driver
- Compile the application into a JAR file
- Run
  - From the command line:
    hadoop --config conf/hadoop-local.xml jar ece1779hadoop.jar WordCountDriver input output
  - From the IDE: create a run configuration and set the arguments to:
    -conf conf/hadoop-local.xml input output

Amazon Elastic MapReduce

- Amazon supports Hadoop versions 1.0.3, 2.2, and 2.4
- Upload the application JAR file to S3
- Upload the input files to S3
- Create a new job flow in the AWS Management Console
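The three-part WordCount structure described above can be sketched in plain Java, simulating the shuffle with an in-memory map so the example runs without a cluster. The class and method names are our own; this is a single-process model of the job, not the actual Hadoop WordCountMapper/WordCountReducer/WordCountDriver classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // "Mapper": split a line into words and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // "Reducer": sum the counts collected for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // "Driver": run map over every line, group values by key
    // (the in-memory stand-in for the shuffle), then run reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }
}
```

For example, run(List.of("the cat", "the dog")) groups the emitted pairs into {cat=[1], dog=[1], the=[1, 1]} and reduces them to {cat=1, dog=1, the=2}.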

Amazon Elastic MapReduce (cont.)

(Two slides of AWS Management Console screenshots.)

Hadoop High-Level Languages

- Pig: a data flow language and execution environment for exploring very large datasets.
- Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
- HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
- Spark: a fast and general in-memory compute engine for Hadoop data.