Hadoop Framework. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

1 Hadoop Framework technology basics for data scientists Spring Jordi Torres, UPC - BSC

2 Warning! Slides are only a presentation guide. We will discuss and debate additional concepts and ideas that come up during your participation (and we could skip part of the content).

3 Hadoop MapReduce
- Hadoop is the dominant open-source MapReduce implementation.
- Funded by Yahoo, it emerged in 2006.
- The Hadoop project is now hosted by Apache.
- Implemented in Java.
- The data to be processed must be loaded into, e.g., the Hadoop Distributed Filesystem.
Source: Wikipedia

4 Hadoop MapReduce
- Hadoop is an open-source MapReduce runtime provided by the Apache Software Foundation.
- De facto standard, free, open-source MapReduce implementation.
- Endorsed by:

5 Hadoop - Architecture
(Figure: Hadoop architecture diagram.)
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez

6 Hadoop: Very high-level overview
- When data is loaded into the system, it is split into blocks of 64 MB or 128 MB.
- A Map task typically works on a single block.
- A master program allocates work to nodes (which work in parallel) such that a Map task works on a block of data stored locally on that node.
- If a node fails, the master detects the failure and reassigns the work to a different node in the system.
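The overview above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the node names, the 200 MB file, and the idea that each block is held by two nodes are hypothetical, and the functions `split_into_blocks` and `assign_map_tasks` are invented for illustration; they are not part of Hadoop.

```python
import math

BLOCK_SIZE_MB = 64  # one of the two block sizes named on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks are needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def assign_map_tasks(block_locations, live_nodes):
    """Assign each block's Map task to a live node, preferring a data-local one.

    block_locations maps a block id to the list of nodes holding that block
    (assumption: more than one node may hold a copy).
    """
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        local = [node for node in holders if node in live_nodes]
        # Run where the data is if possible, otherwise on any live node.
        assignment[block] = local[0] if local else sorted(live_nodes)[0]
    return assignment


if __name__ == "__main__":
    print(split_into_blocks(200))  # hypothetical 200 MB input file
    locations = {0: ["node1", "node2"], 1: ["node2", "node3"],
                 2: ["node3", "node1"], 3: ["node1", "node2"]}
    print(assign_map_tasks(locations, {"node1", "node2", "node3"}))
    # Simulate node1 failing: its tasks move to surviving nodes.
    print(assign_map_tasks(locations, {"node2", "node3"}))
```

Calling `assign_map_tasks` a second time with a smaller set of live nodes plays the role of the master detecting a failure and reassigning the work.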

7 Hadoop essentials
- Computation: move the computation to the data.
- Storage: keep track of the data and metadata. Data is sharded across the cluster.
- Cluster management tools...

8 (default) Hadoop's Stack (more detail in the next part!)
- Applications
- Compute Services: Hadoop's MapReduce
- Data Services: HBase (NoSQL databases)
- Storage Services: Hadoop Distributed File System (HDFS)
- Resource Fabrics

9 Basic Cluster Components
- One of each: Namenode (NN), Jobtracker (JT)
- Set of each per slave machine: Tasktracker (TT), Datanode (DN)

10 Putting Everything Together
- namenode: namenode daemon
- job submission node: jobtracker
- slave node: tasktracker, datanode daemon, Linux file system
- slave node: tasktracker, datanode daemon, Linux file system
- slave node: tasktracker, datanode daemon, Linux file system

11 Anatomy of a Job
- MapReduce program in Hadoop = Hadoop job.
- Jobs are divided into map and reduce tasks.
- An instance of a running task is called a task attempt.
- Multiple jobs can be composed into a workflow.

12 Anatomy of a Job
Job submission process:
- The client (i.e., the driver program) creates a job, configures it, and submits it to the JobTracker.
- JobClient computes input splits (on the client end).
- Job data (jar, configuration XML) are sent to the JobTracker.
- The JobTracker puts the job data in a shared location and enqueues tasks.
- TaskTrackers poll for tasks.
- Off to the races.
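The submission process can be mimicked in a few lines of plain Python. This is a toy model, not Hadoop code: `ToyJobTracker`, `compute_input_splits`, and `run_job` are hypothetical names, input splits are counted in lines rather than bytes, and polling is simulated by letting the trackers take turns.

```python
from collections import deque


def compute_input_splits(lines, split_size):
    """Client side: cut the input into splits of at most split_size lines."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


class ToyJobTracker:
    """Hypothetical stand-in for the JobTracker: it only holds a task queue."""

    def __init__(self):
        self.queue = deque()

    def submit(self, splits):
        # Enqueue one map task per input split.
        for task_id, split in enumerate(splits):
            self.queue.append((task_id, split))

    def poll(self):
        """Called by a TaskTracker; returns a task, or None if none is left."""
        return self.queue.popleft() if self.queue else None


def run_job(lines, split_size, tracker_names):
    """Submit a job and let the TaskTrackers poll until the queue is empty."""
    job_tracker = ToyJobTracker()
    job_tracker.submit(compute_input_splits(lines, split_size))
    executed = []
    while job_tracker.queue:
        for name in tracker_names:
            task = job_tracker.poll()
            if task is None:
                break
            executed.append((name, task[0]))  # (tracker, task id)
    return executed


if __name__ == "__main__":
    print(run_job(["a", "b", "c", "d", "e"], 2, ["tt1", "tt2"]))
```

With five input lines and a split size of two, the client produces three splits, so three map tasks are handed out across the two polling trackers.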

13 Running a MapReduce job with Hadoop
Steps:
1. Define the MapReduce stages in a Java program.
2. Load the data into the Hadoop Distributed Filesystem.
3. Submit the job for execution.
4. Retrieve the results from the filesystem.
MapReduce has also been implemented in a variety of other programming languages and systems. Several NoSQL database systems have integrated MapReduce (later in this course).

14 Hadoop and enterprise?
- Hadoop is a complement to a relational data warehouse.
- Enterprises are generally not replacing their relational data warehouse with Hadoop.
Hadoop's strengths:
- Inexpensive
- High reliability
- Extreme scalability
- Flexibility: data can be added without defining a schema
Hadoop's weaknesses:
- Hadoop is not an interactive query environment.
- Processing data in Hadoop requires writing code.

15 Who is using Hadoop? Source: Wikipedia, April

16 What is the MapReduce model used for?
At Google:
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation
At Yahoo!:
- Web map powering Yahoo! Search
- Spam detection for Yahoo! Mail
At Facebook:
- Data mining
- Ad optimization
- Spam detection

17 Hadoop: The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data. The release follows six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform.

18 Getting Started with Hadoop
Different ways to write jobs:
- Java API
- Hadoop Streaming (for Python, Perl, etc.)
- Pipes API (C++)
- R

19 Hadoop API
Different APIs to write Hadoop programs:
- A rich Java API (the main way to write Hadoop programs)
- A Streaming API that can be used to write map and reduce functions in any programming language (using standard input and output)
- A C++ API (Hadoop Pipes)
- Higher-level languages (e.g., Pig, Hive)

20 Hadoop API
Mapper
    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
    void configure(JobConf job)
    void close() throws IOException
Reducer/Combiner
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
    void configure(JobConf job)
    void close() throws IOException
Partitioner
    int getPartition(K2 key, V2 value, int numPartitions)
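The role of the Partitioner is to decide which reduce task receives each intermediate key. A minimal sketch of hash partitioning follows, written in Python rather than against the Java interface. It assumes string keys and uses `zlib.crc32` as a deterministic stand-in for a key's hash code; `get_partition` and `partition_pairs` are illustrative names only.

```python
import zlib


def get_partition(key, num_partitions):
    """Map a key to a partition index in the range [0, num_partitions)."""
    # crc32 is an assumption here: any deterministic hash of the key would do.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Group (key, value) pairs by the partition their key is assigned to."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[get_partition(key, num_partitions)].append((key, value))
    return partitions


if __name__ == "__main__":
    pairs = [("Hello", 1), ("World", 1), ("Hello", 1), ("MapReduce", 1)]
    for index, part in enumerate(partition_pairs(pairs, 2)):
        print(index, part)
```

The property that matters is that every occurrence of the same key lands in the same partition, so a single reduce task sees all the values for that key.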

21 WordCount.java
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

22 WordCount.java
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every whitespace-separated token in the line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

23 WordCount.java
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum all partial counts received for one word.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

24 WordCount.java
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
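The driver registers the same Reduce class as both combiner and reducer. That works for word count because addition is associative and commutative, so partial sums computed on each mapper's output can be summed again later. The following Python sketch illustrates the effect on the amount of intermediate data; the function names and the two sample lines are hypothetical and nothing here uses the Hadoop API.

```python
from collections import Counter


def map_words(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate one mapper's output locally (same logic as the reducer)."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_all(list_of_pair_lists):
    """Sum the counts per word across the output of all mappers."""
    totals = Counter()
    for pairs in list_of_pair_lists:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


if __name__ == "__main__":
    lines = ["a b a", "b b c"]  # one line per (hypothetical) mapper
    raw = [map_words(line) for line in lines]
    combined = [combine(pairs) for pairs in raw]
    print(sum(len(p) for p in raw), "pairs without a combiner")
    print(sum(len(p) for p in combined), "pairs with a combiner")
    print(reduce_all(raw) == reduce_all(combined))
```

The final counts are identical either way; only the number of pairs that would have to be moved between map and reduce changes.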

25 E.g. Common wordcount
    Hello World
    Hello MapReduce
Fig 1: Sample input
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez

26 E.g. Common wordcount
    void map(string i, string line):
        for word in line:
            print word, 1
Fig 2: wordcount map function
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez, March 2012

27 E.g. Common wordcount
    void reduce(string word, list partial_counts):
        total = 0
        for c in partial_counts:
            total += c
        print word, total
Fig 3: wordcount reduce function
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez

28 E.g. Common wordcount
Input:
    Hello World
    Hello MapReduce
MAP
    First intermediate output: (Hello, 1), (World, 1)
    Second intermediate output: (Hello, 1), (MapReduce, 1)
REDUCE
    Final output: (Hello, 2), (MapReduce, 1), (World, 1)
Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez
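The trace above can be reproduced with a small in-memory sketch of the map, shuffle, and reduce phases. It runs in a single process and is not how Hadoop executes a job; `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are names chosen for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(intermediate):
    """Shuffle: group the values by key and order the keys."""
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    return sorted(groups.items())


def reduce_fn(word, partial_counts):
    """Reduce: sum the partial counts of one word."""
    return (word, sum(partial_counts))


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_fn(line)]
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]


if __name__ == "__main__":
    print(word_count(["Hello World", "Hello MapReduce"]))
    # prints [('Hello', 2), ('MapReduce', 1), ('World', 1)]
```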

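29 Word Count Python Mapper
    import sys

    def read_input(file):
        # Yield the list of whitespace-separated words on each input line.
        for line in file:
            yield line.split()

    def main(separator='\t'):
        data = read_input(sys.stdin)
        for words in data:
            for word in words:
                # Emit "word<TAB>1" for every word.
                print('%s%s%d' % (word, separator, 1))

    if __name__ == "__main__":
        main()
Source: Robert Grossman Tutorial Supercomputing 2011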

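30 Word Count R Mapper
    # Strip leading and trailing spaces from a line.
    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    # Split a line into words (this helper is not shown on the slide; assumed definition).
    splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        line <- trimWhiteSpace(line)
        words <- splitIntoWords(line)
        # Emit "word<TAB>1" for every word.
        cat(paste(words, "\t1\n", sep=""), sep="")
    }
    close(con)
Source: Robert Grossman Tutorial Supercomputing 2011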

31 Word Count Java Mapper
    public static class Map
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every whitespace-separated token in the line.
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
Source: Robert Grossman Tutorial Supercomputing 2011

32 Code Comparison: Word Count Mapper
Side-by-side view of the Python, Java and R mappers from slides 29-31.
Source: Robert Grossman Tutorial Supercomputing 2011

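33 Word Count Python Reducer
    import sys
    from itertools import groupby
    from operator import itemgetter

    def read_mapper_output(file, separator='\t'):
        # Yield [word, count] for each "word<TAB>count" line.
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(sep='\t'):
        data = read_mapper_output(sys.stdin, separator=sep)
        # groupby relies on the input arriving sorted by word.
        for word, group in groupby(data, itemgetter(0)):
            total_count = sum(int(count) for _, count in group)
            print("%s%s%d" % (word, sep, total_count))

    if __name__ == "__main__":
        main()
Source: Robert Grossman Tutorial Supercomputing 2011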
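A streaming reducer built on `groupby` only merges adjacent lines with the same key, so it depends on the mapper output being sorted before it arrives. The sketch below imitates a `mapper | sort | reducer` pipeline inside one Python process to make that dependency visible. The functions `mapper`, `reducer`, and `run_streaming` are illustrative; `sorted()` is an assumption standing in for the sort performed between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines, sep="\t"):
    """Emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield "%s%s%d" % (word, sep, 1)


def reducer(lines, sep="\t"):
    """Sum the counts of adjacent lines that share the same word."""
    pairs = (line.rstrip().split(sep, 1) for line in lines)
    for word, group in groupby(pairs, itemgetter(0)):
        yield "%s%s%d" % (word, sep, sum(int(count) for _, count in group))


def run_streaming(lines):
    """Imitate mapper | sort | reducer in a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    text = ["Hello World", "Hello MapReduce"]
    print(run_streaming(text))
    # Without the sort, the two "Hello" lines are not adjacent and stay separate.
    print(list(reducer(mapper(text))))
```

To try the real scripts outside a cluster, the same idea applies: pipe the mapper's standard output through a sort before feeding it to the reducer.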

34 Word Count R Reducer
    # Strip leading and trailing spaces from a line.
    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    # Parse "word<TAB>count" into a word and an integer count.
    splitLine <- function(line) {
        val <- unlist(strsplit(line, "\t"))
        list(word = val[1], count = as.integer(val[2]))
    }

    env <- new.env(hash = TRUE)  # maps each word to its running count
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
        line <- trimWhiteSpace(line)
        split <- splitLine(line)
        word <- split$word
        count <- split$count
        # (loop body continues on the next slide)
Source: Robert Grossman Tutorial Supercomputing 2011

35 Word Count R Reducer (cont'd)
        # Add the count to the running total for this word.
        if (exists(word, envir = env, inherits = FALSE)) {
            oldcount <- get(word, envir = env)
            assign(word, oldcount + count, envir = env)
        } else assign(word, count, envir = env)
    }
    close(con)

    # Print "word<TAB>total" for every word seen.
    for (w in ls(env, all = TRUE))
        cat(w, "\t", get(w, envir = env), "\n", sep = "")

36 Word Count Java Reducer
    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum all partial counts received for one word.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

37 Code Comparison: Word Count Reducer
Side-by-side view of the Python, R and Java reducers from slides 33-36.
Source: Robert Grossman Tutorial Supercomputing 2011