Hadoop Framework: technology basics for data scientists. Spring 2014. Jordi Torres, UPC - BSC. www.jorditorres.eu @JordiTorresBCN
Warning! The slides are only a presentation guide. We will discuss and debate additional concepts and ideas that appear during your participation! (and we may skip part of the content)
Hadoop MapReduce. Hadoop is the dominant open-source MapReduce implementation. Funded by Yahoo!, it emerged in 2006. The Hadoop project is now hosted by Apache. Implemented in Java. (The data to be processed must first be loaded into a shared store, e.g. the Hadoop Distributed Filesystem.) Source: Wikipedia
Hadoop MapReduce. Hadoop is an open-source MapReduce runtime provided by the Apache Software Foundation: the de facto standard, free, open-source MapReduce implementation. Endorsed by: http://wiki.apache.org/hadoop/poweredby
Hadoop - Architecture. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/part2.eedc_.bigdata.hadoop.pdf
Hadoop: Very high-level overview. When data is loaded into the system, it is split into blocks of 64 MB/128 MB. A map task typically works on a single block. A master program allocates work to nodes (which work in parallel) such that a map task operates on a block of data stored locally on that node. If a node fails, the master detects the failure and re-assigns its work to a different node in the system.
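A hedged sketch of where that block size lives: it is a cluster configuration value, readable from a Hadoop Configuration object (the dfs.block.size key and the 64 MB default are Hadoop 1.x-era assumptions):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeProbe {
    public static void main(String[] args) {
        // Reads the configured HDFS block size; falls back to the
        // classic 64 MB default if the key is not set.
        Configuration conf = new Configuration();
        long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
        System.out.println("HDFS block size (bytes): " + blockSize);
    }
}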
Hadoop essentials. Computation: move the computation to the data. Storage: keep track of the data and metadata; data is sharded across the cluster. Cluster management tools...
(default) Hadoop's stack (more detail in the next part!!!):
Applications
Compute Services: Hadoop's MapReduce
Data Services: HBase (NoSQL database)
Storage Services: Hadoop Distributed File System (HDFS)
Resource Fabrics
Basic cluster components. One of each per cluster: NameNode (NN), JobTracker (JT). One of each per slave machine: TaskTracker (TT), DataNode (DN).
Putting everything together. [Diagram: a namenode running the namenode daemon and a job submission node running the jobtracker, connected to several slave nodes; each slave node runs a tasktracker and a datanode daemon on top of the Linux file system.]
Anatomy of a job. A MapReduce program in Hadoop = a Hadoop job. Jobs are divided into map and reduce tasks. An instance of a running task is called a task attempt. Multiple jobs can be composed into a workflow.
Anatomy of a job: the submission process. The client (i.e., the driver program) creates a job, configures it, and submits it to the JobTracker. The JobClient computes the input splits (on the client end). The job data (JAR, configuration XML) are sent to the JobTracker, which puts the job data in a shared location and enqueues the tasks. TaskTrackers poll for tasks. Off to the races!
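A minimal client-side sketch of that flow, assuming the classic org.apache.hadoop.mapred API (the class and job name are hypothetical, and a real job would also set mapper/reducer classes and input/output paths):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitSketch.class);  // job configuration
        conf.setJobName("my-job");                       // hypothetical name
        // runJob() computes the input splits on the client, ships the job
        // JAR and configuration XML to the JobTracker (which enqueues the
        // tasks for TaskTrackers to poll), and blocks until completion.
        RunningJob running = JobClient.runJob(conf);
        System.out.println("Succeeded: " + running.isSuccessful());
    }
}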
Running a MapReduce job with Hadoop. Steps: (1) define the map and reduce stages in a Java program; (2) load the data into the Hadoop Distributed Filesystem; (3) submit the job for execution; (4) retrieve the results from the filesystem (a sketch of the load/retrieve steps follows). MapReduce has also been implemented in a variety of other programming languages and systems, and several NoSQL database systems have integrated MapReduce (later in this course).
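For the load/retrieve steps, a hedged sketch using the HDFS FileSystem API (all paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Step 2: load the local input data into HDFS
        fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/me/input/"));
        // Step 4: after the job completes, retrieve its results
        fs.copyToLocalFile(new Path("/user/me/output/part-00000"), new Path("results.txt"));
    }
}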
Hadoop and the enterprise? Hadoop is a complement to a relational data warehouse: enterprises are generally not replacing their relational data warehouse with Hadoop. Hadoop's strengths: inexpensive; high reliability; extreme scalability; flexibility (data can be added without defining a schema). Hadoop's weaknesses: it is not an interactive query environment, and processing data in Hadoop requires writing code.
Who is using Hadoop? Source: Wikipedia, April 2013
What is the MapReduce model used for? At Google: index construction for Google Search; article clustering for Google News; statistical machine translation. At Yahoo!: the Web map powering Yahoo! Search; spam detection for Yahoo! Mail. At Facebook: data mining; ad optimization; spam detection.
Hadoop 1.0. 04-01-2012: The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data: six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform.
Getting started with Hadoop. Different ways to write jobs: the Java API; Hadoop Streaming (for Python, Perl, etc.); the Pipes API (C++); R.
Hadoop API. Different APIs to write Hadoop programs: a rich Java API (the main way to write Hadoop programs); a Streaming API that can be used to write map and reduce functions in any programming language (using standard input and output); a C++ API (Hadoop Pipes); and higher-level languages (e.g., Pig, Hive).
Hadoop API (classic org.apache.hadoop.mapred interfaces):
Mapper
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
  void configure(JobConf job)
  void close() throws IOException
Reducer/Combiner
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
  void configure(JobConf job)
  void close() throws IOException
Partitioner
  int getPartition(K2 key, V2 value, int numPartitions)
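As an illustration of the Partitioner signature, a hedged sketch of a custom partitioner that routes keys to reducers by hashing, which is what the default HashPartitioner already does; it would be registered with conf.setPartitionerClass(WordPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { /* no per-job setup needed */ }
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}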
WordCount.java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
WordCount.java
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
WordCount.java
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
WordCount.java
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
E.g. common wordcount
Hello World
Hello MapReduce
Fig. 1: Sample input
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/part2.eedc_.bigdata.hadoop.pdf
E.g. common wordcount
void map(string i, string line):
  for word in line:
    print word, 1
Fig. 2: wordcount map function (pseudocode)
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/part2.eedc_.bigdata.hadoop.pdf
E.g. common wordcount
void reduce(string word, list partial_counts):
  total = 0
  for c in partial_counts:
    total += c
  print word, total
Fig. 3: wordcount reduce function (pseudocode)
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/part2.eedc_.bigdata.hadoop.pdf
E.g. common wordcount
MAP
  Input: "Hello World" and "Hello MapReduce"
  First intermediate output: (Hello, 1), (World, 1)
  Second intermediate output: (Hello, 1), (MapReduce, 1)
REDUCE
  Final output: (Hello, 2), (MapReduce, 1), (World, 1)
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez http://www.jorditorres.org/wp-content/uploads/2012/03/part2.eedc_.bigdata.hadoop.pdf
Word Count Python Mapper
import sys  # added: the mapper reads from standard input

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # emit one tab-separated (word, 1) pair per token
            print '%s%s%d' % (word, separator, 1)

if __name__ == '__main__':
    main()  # added so the script runs under Hadoop Streaming

Source: Robert Grossman Tutorial, Supercomputing 2011
Word Count R Mapper
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# assumed helper (not shown on the original slide): tokenize a line
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    # emit one "word<TAB>1" line per token
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Source: Robert Grossman Tutorial, Supercomputing 2011
Word Count Java Mapper
public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
Source: Robert Grossman Tutorial, Supercomputing 2011
Code Comparison: Word Count Mapper. The Python, Java, and R mappers from the previous three slides, shown side by side. Source: Robert Grossman Tutorial, Supercomputing 2011
Word Count Python Reducer
import sys  # added: the reducer reads the mapper output from stdin
from itertools import groupby   # added: group consecutive lines by word
from operator import itemgetter # added: key extractor for groupby

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    # groupby is valid here because Hadoop Streaming sorts the
    # mapper output by key before it reaches the reducer
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == '__main__':
    main()  # added so the script runs under Hadoop Streaming

Source: Robert Grossman Tutorial, Supercomputing 2011
Word Count R Reducer
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

# accumulate per-word counts in an environment used as a hash map
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    # (loop body continues on the next slide)

Source: Robert Grossman Tutorial, Supercomputing 2011
Word Count R Reducer (cont'd)
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else {
        assign(word, count, envir = env)
    }
}
close(con)

# print the accumulated "word<TAB>count" totals
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
Word Count Java Reducer
public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
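The Context-based Map and Reduce classes above come from the newer org.apache.hadoop.mapreduce API; a hedged sketch of a matching driver (JobConf/JobClient are replaced by Job; the class names assume the mapper and reducer shown on the previous slides live in the same file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(Map.class);       // the Context-based mapper above
        job.setCombinerClass(Reduce.class);  // reuse the reducer as a combiner
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion(true) prints progress and blocks until the job ends.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}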
Code Comparison: Word Count Reducer. The Python, Java, and R reducers from the previous slides, shown side by side. Source: Robert Grossman Tutorial, Supercomputing 2011