MapReduce, Hadoop and Amazon AWS

Similar documents

Open source Google-style large scale data analysis with Hadoop

Chapter 7. Using Hadoop Cluster and MapReduce

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Architecture. Part 1

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Open source large scale distributed data management with Google s MapReduce and Bigtable

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Parallel Data Processing

Large scale processing using Hadoop. Ján Vaňo

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Apache Hadoop. Alexandru Costan

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

MapReduce. Tushar B. Kute,

University of Maryland. Tuesday, February 2, 2010

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Data-Intensive Computing with Map-Reduce and Hadoop

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Map Reduce & Hadoop Recommended Text:

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Lecture 3 Hadoop Technical Introduction CSE 490H

How To Use Hadoop

Intro to Map/Reduce a.k.a. Hadoop

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

Hadoop Setup. 1 Cluster

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Apache Hadoop new way for the company to store and analyze big data

TP1: Getting Started with Hadoop

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Parallel Processing of cluster by Map Reduce

CS54100: Database Systems

Hadoop (pseudo-distributed) installation and configuration

THE HADOOP DISTRIBUTED FILE SYSTEM

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Application Development. A Paradigm Shift

A Cost-Evaluation of MapReduce Applications in the Cloud

Hadoop & its Usage at Facebook

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

How To Install Hadoop From Apa Hadoop To (Hadoop)

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Hadoop Installation MapReduce Examples Jake Karnes

Sriram Krishnan, Ph.D.

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

Hadoop 2.6 Configuration and More Examples

Data-intensive computing systems

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Internals of Hadoop Application Framework and Distributed File System

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology

Hadoop & its Usage at Facebook

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation

Extreme Computing. Hadoop MapReduce in more detail.

A very short Intro to Hadoop

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

MapReduce for Statistical NLP/Machine Learning

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Introduction to Hadoop

Reduction of Data at Namenode in HDFS using harballing Technique

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Introduction to MapReduce and Hadoop

Tutorial for Assignment 2.0

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

MapReduce and Hadoop Distributed File System

Comparative analysis of Google File System and Hadoop Distributed File System

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

MapReduce with Apache Hadoop Analysing Big Data

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Introduction to Cloud Computing

Tutorial for Assignment 2.0

Distributed Filesystems

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Apache Hadoop FileSystem and its Usage in Facebook

Big Data With Hadoop

Hadoop Lab - Setting a 3 node Cluster. Java -

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

Introduc)on to Map- Reduce. Vincent Leroy

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Single Node Hadoop Cluster Setup

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Hadoop IST 734 SS CHUNG

Hadoop and Map-Reduce. Swati Gore

Optimize the execution of local physics analysis workflows using Hadoop

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Transcription:

MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011

What is Hadoop? A software framework that supports data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS). Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.

Who uses Hadoop? http://wiki.apache.org/hadoop/poweredby

Who uses Hadoop? Yahoo! More than 100,000 CPUs in >36,000 computers. Facebook Used in reporting/analytics and machine learning and also as storage engine for logs. A 1100-machine cluster with 8800 cores and about 12 PB raw storage. A 300-machine cluster with 2400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage.

Very Large Storage Requirements Facebook has Hadoop clusters with 15 PB of raw storage (15,000,000 GB). No single storage can handle this amount of data. We need a large set of nodes each storing part of the data.

HDFS: Hadoop Distributed File System 1. filename, index Namenode Client 2. Datanodes, Blockid 3. Read data 1 2 2 3 1 3 1 3 2 Data Nodes

Terabyte Sort Benchmark http://sortbenchmark.org/ Task: Sorting 100TB of data and writing results on disk (10^12 records each 100 bytes). Yahoo s Hadoop Cluster is the current winner: 173 minutes 3452 nodes x (2 Quadcore Xeons, 8 GB RAM) This is the first time that a Java program has won this competition.

Counting Words by MapReduce Hello World Bye World Hello Hadoop Goodbye Hadoop Split Hello World Bye World Hello Hadoop Goodbye Hadoop

Counting Words by MapReduce Hello World Bye World Mapper Hello, <1> World, <1> Bye, <1> World, <1> Sort & Merge Bye, <1> Hello, <1> World, <1, 1> Combiner Bye, <1> Hello, <1> World, <2> Node 1

Counting Words by MapReduce Bye, <1> Hello, <1> World, <2> Goodbye, <1> Hadoop, <2> Hello, <1> Sort & Merge Bye, <1> Goodbye, <1> Hadoop, <2> Hello, <1, 1> World, <2> Split Bye, <1> Goodbye, <1> Hadoop, <2> Hello, <1, 1> World, <2>

Counting Words by MapReduce Node 1 Bye, <1> Goodbye, <1> Hadoop, <2> Reducer Bye, <1> Goodbye, <1> Hadoop, <2> part 00000 Bye 1 Goodbye 1 Hadoop 2 Node 2 Write on Disk part 00001 Hello, <1, 1> World, <2> Reducer Hello, <2> World, <2> Hello 2 World 2

Writing Word Count in Java Download hadoop core (version 0.20.2): http://www.apache.org/dyn/closer.cgi/hadoop/core/ It would be something like: hadoop-0.20.2.tar.gz Unzip the package and extract: hadoop-0.20.2-core.jar Add this jar file to your project class path Warning! Most of the sample codes on web are for older versions of Hadoop.

Word Count: Mapper Source files are available at: http://www.ics.uci.edu/~yganjisa/files/2011/hadoop-presentation/wordcount-v1-src.zip

Word Count: Reducer

Word Count: Main Class

My Small Test Cluster 3 nodes 1 master (ip address: 50.17.65.29) 2 slaves Copy your jar file to master node: Linux: scp WordCount.jar john@50.17.65.29:wordcount.jar Windows (you need to download pscp.exe): pscp.exe WordCount.jar john@50.17.65.29:wordcount.jar Login to master node: ssh john@50.17.65.29

Counting words in U.S. Constitution! Download text version: wget http://www.usconstitution.net/const.txt Put input text file on HDFS: hadoop dfs -put const.txt const.txt Run the job: hadoop jar WordCount.jar edu.uci.hadoop.wordcount const.txt word-count-result

Counting words in U.S. Constitution! List my files on HDFS: Hadoop dfs -ls List files in word-count-result folder: Hadoop dfs -ls word-count-result/

Counting words in U.S. Constitution! Downloading results from HDFS: hadoop dfs -cat word-count-result/part-r-00000 > word-count.txt Sort and view results: sort -k2 -n -r word-count.txt more

Hadoop Map/Reduce - Terminology Running Word Count across 20 files is one job Job Tracker initiates some number of map tasks and some number of reduce tasks. For each map task at least one task attempt will be performed more if a task fails (e.g., machine crashes).

High Level Architecture of MapReduce Master Node Client Computer JobTracker TaskTracker TaskTracker TaskTracker Task Task Task Task Task Slave Node Slave Node Slave Node

High Level Architecture of Hadoop Master Node Slave Node Slave Node TaskTracker TaskTracker TaskTracker MapReduce layer JobTracker HDFS layer NameNode DataNode DataNode DataNode

Web based User interfaces JobTracker: http://50.17.65.29:9100/ NameNode: http://50.17.65.29:9101/

Hadoop Job Scheduling FIFO queue matches incoming jobs to available nodes No notion of fairness Never switches out running job Warning! Start your job as soon as possible.

Reporting Progress If your tasks don t report anything in 10 minutes they would be killed by Hadoop! Source files are available at: http://www.ics.uci.edu/~yganjisa/files/2011/hadoop-presentation/wordcount-v2-src.zip

Distributed File Cache The Distributed Cache facility allows you to transfer files from the distributed file system to the local file system (for reading only) of all participating nodes before the beginning of a job.

TextInputFormat LineRecordReader <offset 1, line 1 > <offset 2, line 2 > <offset 3, line 3 > Split For more complex inputs, You should extend: InputSplit RecordReader InputFormat

Part 2: Amazon Web Services (AWS)

What is AWS? A collection of services that together make up a cloud computing platform: S3 (Simple Storage Service) EC2 (Elastic Compute Cloud) Elastic MapReduce Email Service SimpleDB Flexibile Payments Service

Case Study: yelp Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. Features powered by Amazon Elastic MapReduce include: People Who Viewed this Also Viewed Review highlights Auto complete as you type on search Search spelling suggestions Top searches Ads Yelp runs approximately 200 Elastic MapReduce jobs per day, processing 3TB of data.

Amazon S3 Data Storage in Amazon Data Center Web Service interface 99.99% monthly uptime guarantee Storage cost: $0.15 per GB/Month S3 is reported to store more than 102 billion objects as of March 2010.

Amazon S3 You can think of S3 as a big HashMap where you store your files with a unique key: HashMap: key -> File

References Hadoop Project Page: http://hadoop.apache.org/ Amazon Web Services: http://aws.amazon.com/