Introduction to Hadoop

1 What is Hadoop?
    the big data revolution
    extracting value from data
    cloud computing
2 Understanding MapReduce
    the word count problem
    more examples

MCS 572 Lecture 24, Introduction to Supercomputing
Jan Verschelde, 10 March 2014
extract meaning from the data avalanche

goal: apply complex analytics to large data sets.
A complementary technology is cloud computing,
in particular: Amazon Web Services.

Assumptions for writing MapReduce applications:
1 comfortable writing Java programs; and
2 familiar with the Unix command-line interface.
What is the value of data?

Some questions are relevant only for large data sets. For example:
movie preferences are inaccurate when based on just one other person,
but patterns can be extracted from the viewing history of millions.

Big data tools enable processing at a larger scale and at lower cost;
additional hardware makes up for latency.
The notion of what a database is should be revisited.

Realize that one no longer needs to be among the largest corporations
or government agencies to extract value from data.
data processing systems

Two ways to process large data sets:
1 scale-up: a large and expensive computer (supercomputer);
  we move the same software onto larger and larger servers.
2 scale-out: spread the processing onto more and more machines
  (a commodity cluster).

Limiting factors of scale-up:
  the complexity of concurrency on multiple CPUs;
  CPUs are much faster than memory and hard disk speeds.

There is a third way:
3 cloud computing: the provider deals with the scaling problems.
principles of Hadoop

1 All roads lead to scale-out: scale-up architectures are rarely used;
  scale-out is the standard in big data processing.
2 Share nothing: communication and dependencies are bottlenecks;
  individual components should be as independent as possible,
  so that each can proceed regardless of whether others fail.
3 Expect failure: components will fail at inconvenient times.
  See our previous exercise on multi-component expected life span;
  resilient scale-up architectures require much effort.
4 Smart software, dumb hardware: push the smarts into the software,
  which is responsible for allocating the generic hardware.
5 Move processing, not data: perform the processing locally on the data.
  What gets moved through the network are program binaries and status
  reports, which are dwarfed in size by the actual data set.
6 Build applications, not infrastructure: let the platform take care
  of data movement, job scheduling, error handling, and coordination,
  so that the focus stays on the application.
History of Hadoop

Whom to thank for Hadoop?

1 Google released two academic papers on their technology:
    in 2003: the Google file system; and
    in 2004: MapReduce.
  The papers are available for download from research.google.com.
2 Doug Cutting started implementing Google's systems,
  as a project within the Apache open source foundation.
3 Yahoo hired Doug Cutting in 2006 and supported the Hadoop project.

From Tom White: Hadoop: The Definitive Guide, published by O'Reilly:

Definition (The name Hadoop is...)
The name my kid gave a stuffed yellow elephant. Short, relatively easy
to spell and pronounce, meaningless, and not used elsewhere: those are
my naming criteria. Kids are good at generating such.
components of Hadoop

The two main components of Hadoop are:

1 The Hadoop Distributed File System (HDFS) stores very large data
  sets across a cluster of hosts, optimized for throughput instead of
  latency, achieving high availability through replication of the data
  instead of hardware redundancy.
2 MapReduce is a data processing paradigm that takes a specification
  of the input transformation (map) and the output aggregation (reduce)
  and applies it to the data.
  MapReduce integrates tightly with HDFS, running directly on HDFS nodes.
common building blocks

Both HDFS and MapReduce:
  run on commodity clusters and scale out: one can add more servers;
  have mechanisms for identifying and working around failures;
  provide their services transparently,
  so that the user can concentrate on the problem;
  are software clusters sitting on the physical servers,
  controlling the execution.
the Hadoop Distributed File System (HDFS)

HDFS spreads the storage across the nodes. Features:
  files are stored in blocks of 64MB (typical file system: 4-32KB);
  optimized for throughput over latency: efficient at streaming large
  files, but poor at seek requests for many small ones;
  optimized for workloads that are read-many and write-once;
  each storage node runs a DataNode process that manages its blocks,
  coordinated by a master NameNode process running on a separate host;
  blocks are replicated on multiple nodes to handle disk failures.
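As an aside added to these notes (not in the original lecture), the
sketch below shows how a Java program talks to HDFS through the
org.apache.hadoop.fs API: the client asks the NameNode where the blocks
live and then streams bytes to and from the DataNodes. It assumes a
configured Hadoop installation on the classpath; the path
/user/demo/hello.txt is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads the cluster settings
            FileSystem fs = FileSystem.get(conf);     // connects via the NameNode
            Path p = new Path("/user/demo/hello.txt"); // hypothetical path

            FSDataOutputStream out = fs.create(p);    // write once ...
            out.writeUTF("hello HDFS");
            out.close();

            FSDataInputStream in = fs.open(p);        // ... read many
            System.out.println(in.readUTF());
            in.close();
        }
    }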
MapReduce

A new technology built on fundamental concepts:
  functional programming languages offer map and reduce;
  divide and conquer breaks a problem into multiple subtasks.

The input data goes through a series of transformations:
  the developer defines the data transformations,
  Hadoop's MapReduce job manages the parallel execution.
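To illustrate the functional-programming roots (an aside, not from the
lecture): Java 8 streams offer map and reduce on a single machine,
which is exactly the pattern that MapReduce distributes over a cluster.

    import java.util.Arrays;
    import java.util.List;

    public class MapReduceAnalogy {
        public static void main(String[] args) {
            List<String> words = Arrays.asList("the", "cat", "and", "the", "hat");
            // map: transform each element independently (here: word -> count 1)
            // reduce: combine all mapped values into a single result
            int total = words.stream()
                             .map(w -> 1)
                             .reduce(0, Integer::sum);
            System.out.println(total); // prints 5
        }
    }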
What is the point?

Unlike relational databases, which require structured data,
the data is provided as a series of key-value pairs
and the output of the map is another set of key-value pairs.

The most important point: the Hadoop platform takes responsibility for
every aspect of executing the processing across the data.
The user is unaware of the actual size of the data and the cluster;
the platform determines how best to utilize all the hosts.

The challenge to the user is to break the problem into
1 the best combination of chains of map and reduce functions,
  where the output of one is the input of the next; and
2 chains in which each step can be applied independently to each
  data element, allowing for parallelism.
software clusters

The common architecture of HDFS and MapReduce is the software cluster:
  a cluster of worker nodes managed by a coordinator node;
  the master (NameNode for HDFS, JobTracker for MapReduce)
  monitors the cluster and handles failures;
  worker processes (DataNode for HDFS, TaskTracker for MapReduce)
  perform the work on their physical host,
  receiving instructions from the master and reporting status.
a multi-node Hadoop cluster

[figure: a multi-node Hadoop cluster, copied from Wikipedia]
what is it good for, and what not

Strengths and weaknesses:
+ Hadoop is a flexible and scalable data processing platform;
- batch processing, not real time, e.g.: serving web queries.

Application by Google:
1 The web crawler retrieves updated webpage data.
2 MapReduce processes the huge data set.
3 The web index is then used by a fleet of MySQL servers
  to serve search requests.
cloud computing

Cloud computing with Amazon Web Services is another technology.
Two main aspects:
1 a new architecture option; and
2 a different approach to cost.
Instead of running your own clusters, all you need is a credit card.

This is the third way:
1 scale-up: supercomputer
2 scale-out: commodity cluster
3 cloud computing: the provider deals with the scaling problem.
using cloud computing

How does it work? Two steps:
1 the user develops software according to published guidelines
  or interfaces;
2 deploy onto the cloud platform and allow the service to scale
  on demand.

The service-on-demand aspect: start the application on small hardware
and scale up, offloading infrastructure costs to the cloud provider.

The major benefit: a great reduction of the cost of entry
for an organization to launch an online service.
Amazon Web Services (AWS)

AWS is infrastructure on demand, a set of cloud computing services:
  Elastic Compute Cloud (EC2): a server on demand;
  Simple Storage Service (S3): a programmatic interface to store objects;
  Elastic MapReduce (EMR): Hadoop in the cloud.

In its most impressive mode: EMR pulls data from S3
and processes it on EC2.
key/value transformations

MapReduce as a series of key/value transformations:

               map                      reduce
    {K1, V1} ------> {K2, List<V2>} ----------> {K3, V3}

Two steps, three sets of key-value pairs:
1 The input to the map is a series of key-value pairs,
  called K1 and V1.
2 The output of the map (and the input to the reduce) is a series of
  keys and an associated list of values, called K2 and V2.
3 The output of the reduce is another series of key-value pairs,
  called K3 and V3.
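In Hadoop's Java API these transformations appear as type parameters.
The sketch below (an addition to these notes; MyMapper and MyReducer
are placeholder names) shows where K1/V1 through K3/V3 live:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // K1 = LongWritable (position of the line), V1 = Text (the line),
    // K2 = Text (e.g. a word), V2 = IntWritable (e.g. a count)
    class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // emit K2/V2 pairs with context.write(...)
        }
    }

    // the framework groups the map output by key, so the reducer
    // receives each K2 with an Iterable over its V2 values: {K2, List<V2>}
    class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            // emit K3/V3 pairs with context.write(...)
        }
    }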
Hadoop's hello world

The word count problem:
  Input: a text file.
  Output: the frequency of the words in the file.

Project Gutenberg (at www.gutenberg.org) offers many free books
(mostly classics) in plain text file format.

A word count is a very basic data analytic:
the same author will use the same patterns of words.
solving the word count problem with MapReduce

Every word in the text file is a key.
Its value is the count, its number of occurrences.

Keys must be unique, but values need not be.
Each value must be associated with a key,
but a key could have no values.

One must be careful when defining what a key is, e.g., for the word
count problem: do we distinguish between lower and upper case words?
a MapReduce job

A complete MapReduce job for the word count problem:
1 Input to the map: K1/V1 pairs are of the form
  < line number, text on the line >.
  The mapper takes the line and breaks it up into words.
2 Output of the map, input to the reduce: K2/V2 pairs are of the form
  < word, 1 >.
3 Output of the reduce: K3/V3 pairs are of the form < word, count >.
pseudo code for the word count problem

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
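For comparison (an addition to these notes, not an original slide),
the standard Apache Hadoop tutorial realizes this pseudo code in Java
roughly as follows. It treats every whitespace-separated token as a
word, so upper and lower case are distinguished.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // K1/V1 = <position of the line, text on the line>
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {      // break the line into words
                    word.set(itr.nextToken());
                    context.write(word, one);      // emit K2/V2 = <word, 1>
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values)     // add up the ones
                    sum += val.get();
                result.set(sum);
                context.write(key, result);        // emit K3/V3 = <word, count>
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Assuming the class is packaged in a jar named wordcount.jar, a run
would look like: hadoop jar wordcount.jar WordCount input output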
grep and counting URLs

Distributed Grep: The map function emits a line if it matches a
supplied pattern. The reduce function is an identity function that
just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web
page requests and outputs (URL, 1). The reduce function adds together
all values for the same URL and emits a (URL, total count) pair.
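A possible Hadoop mapper for distributed grep (a sketch added to these
notes; the configuration key grep.pattern is a made-up name that the
job driver would be expected to set):

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // the driver must have stored the pattern in the job configuration
            pattern = Pattern.compile(
                context.getConfiguration().get("grep.pattern"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (pattern.matcher(value.toString()).find())
                context.write(value, NullWritable.get()); // emit matching line
        }
    }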
reverse web-link graph

Reverse Web-Link Graph: The map function outputs (target, source)
pairs for each link to a target URL found in a page named source.
The reduce function concatenates the list of all source URLs
associated with a given target URL and emits the pair:
(target, list(source)).
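A sketch of the map side (added to these notes): it assumes an input
format that presents the page name as the key and the page contents as
the value, e.g. Hadoop's KeyValueTextInputFormat; extractLinks is a
hypothetical helper standing in for real HTML parsing.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReverseLinkMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text source, Text contents, Context context)
                throws IOException, InterruptedException {
            for (String target : extractLinks(contents.toString()))
                context.write(new Text(target), source); // (target, source)
        }

        // hypothetical stand-in: a real job would parse anchor tags here
        private static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<String>();
            // ... populate from href attributes ...
            return links;
        }
    }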
term-vector per host

Term-Vector per Host: A term vector summarizes the most important
words that occur in a document or a set of documents as a list of
(word, frequency) pairs. The map function emits a
(hostname, term vector) pair for each input document
(where the hostname is extracted from the URL of the document).
The reduce function is passed all per-document term vectors for a
given host. It adds these term vectors together, throwing away
infrequent terms, and then emits a final (hostname, term vector) pair.
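A sketch of the reduce side (added to these notes): term vectors are
encoded here as space-separated "word:count" strings, and MIN_COUNT is
an assumed pruning threshold; both choices are illustrative only.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TermVectorReducer extends Reducer<Text, Text, Text, Text> {
        private static final int MIN_COUNT = 5; // assumed threshold

        @Override
        protected void reduce(Text host, Iterable<Text> vectors,
                              Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> sum = new HashMap<String, Integer>();
            for (Text vec : vectors)             // add the term vectors together
                for (String entry : vec.toString().split(" ")) {
                    String[] wc = entry.split(":");
                    if (wc.length != 2) continue;
                    int c = Integer.parseInt(wc[1]);
                    Integer old = sum.get(wc[0]);
                    sum.put(wc[0], old == null ? c : old + c);
                }
            StringBuilder out = new StringBuilder();
            for (Map.Entry<String, Integer> e : sum.entrySet())
                if (e.getValue() >= MIN_COUNT)   // throw away infrequent terms
                    out.append(e.getKey()).append(':')
                       .append(e.getValue()).append(' ');
            context.write(host, new Text(out.toString().trim()));
        }
    }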
inverted index

Inverted Index: The map function parses each document and emits a
sequence of (word, document ID) pairs. The reduce function accepts all
pairs for a given word, sorts the corresponding document IDs, and
emits a (word, list(document ID)) pair. The set of all output pairs
forms a simple inverted index. It is easy to augment this computation
to keep track of word positions.
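A sketch of the reduce side (added to these notes), assuming a
WordCount-style mapper that emits (word, document ID) pairs; the list
of IDs is emitted as a comma-separated string for simplicity.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds,
                              Context context)
                throws IOException, InterruptedException {
            List<String> ids = new ArrayList<String>();
            for (Text id : docIds)
                ids.add(id.toString());
            Collections.sort(ids); // sort the document IDs
            // emit (word, list(document ID))
            context.write(word, new Text(String.join(",", ids)));
        }
    }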
distributed sort

Distributed Sort: The map function extracts the key from each record
and emits a (key, record) pair. The reduce function emits all pairs
unchanged. This computation depends on
  the partitioning facilities: hashing keys; and
  the ordering properties: within a partition,
  key-value pairs are processed in increasing key order.
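A sketch of the map step (added to these notes); extractKey is a
hypothetical helper that here takes the first tab-separated field.
The identity reduce then writes every pair unchanged, and within each
partition Hadoop delivers the keys in sorted order; for a total order
across partitions one would swap in a range partitioner such as the
TotalOrderPartitioner that ships with Hadoop.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            context.write(extractKey(record), record); // emit (key, record)
        }

        // hypothetical: the sort key is the first tab-separated field
        private static Text extractKey(Text record) {
            String s = record.toString();
            int tab = s.indexOf('\t');
            return new Text(tab < 0 ? s : s.substring(0, tab));
        }
    }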
suggested reading

Garry Turkington. Hadoop Beginner's Guide. www.it-ebooks.info.

Papers available from research.google.com/archive (below: URL):

Luiz Barroso, Jeffrey Dean, and Urs Hoelzle: Web Search for a Planet:
The Google Cluster Architecture. At URL/googlecluster.html.

Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data
Processing on Large Clusters. At URL/mapreduce.html. Appeared in
OSDI'04: Sixth Symposium on Operating System Design and
Implementation, San Francisco, CA, December 2004.

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File
System. At URL/gfs.html. Appeared in the 19th ACM Symposium on
Operating Systems Principles, Lake George, NY, October 2003.
Summary and Exercises

Hadoop and cloud computing reduce the entry cost
for large scale data processing.

Exercises:
1 Install Hadoop on your computer and run the word count problem.
2 Consider the pancake problem we have seen in Lecture 23 and
  formulate a solution within the MapReduce programming model.

Applying Hadoop to a nontrivial problem could be the topic
of an interesting final project.