Hadoop. HDFS & MapReduce. The Basics: divide and conquer petabyte-scale data. Matthew McCullough, Ambient Ideas, LLC

Transcription

1 Hadoop divide and conquer petabyte-scale data The Basics: HDFS & MapReduce Matthew McCullough, Ambient Ideas, LLC

7 Metadata Matthew McCullough Ambient Ideas, LLC Code

10 Why Hadoop? HDFS MapReduce

11 Your big data Data type? Business need? Processing? How large? Growth projections? What if you saved everything?

13 Why Hadoop?

14 I use Hadoop often.

15 "I use Hadoop often. That's the sound my TiVo makes every time I skip a commercial." -Brian Goetz, author of Java Concurrency in Practice

16 Big Data

17 NOSQL

18 NO SQL

20 Not SQL

22 Not Only SQL

23 NOSQL. Applies to data: no strong schemas, no foreign keys. Applies to processing: no SQL-99 standard, no execution plan.

24 The computer you are using right now may very well have the fastest GHz processor you'll ever own

25 Scale up?

26 Scale out

27 Web Crawling

28 Doug Cutting

29 open source

30 Marketing?

31 Name Research?

33 Daughter's stuffed toy

34 Lucene

35 Lucene Nutch

36 Lucene Nutch Hadoop

38 Lucene Mahout Nutch Hadoop

40 Today

41 current version Dozens of companies contributing Hundreds of companies using

48 Bing

50 HDFS

51 Virtual Machine

52 VM Sources: Yahoo (true to the OSS distribution), Cloudera (desktop tools). Both VMware based.

53 Starting Up

54 Tailing the logs

56 Filesystem

58 scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware

59 HDFS Basics: Open source implementation of the Google File System (GFS) design. Replicated data store. Stored in 64MB blocks.

60 HDFS Rack location aware Configurable redundancy factor Self-healing Looks almost like *NIX filesystem
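A rough sketch of the arithmetic behind the 64MB block size and the configurable redundancy factor. The class and method names are made up for illustration, and the replication factor of 3 in the example is an assumption (it is the usual Hadoop default), not something the slides state.

```java
final class HdfsBlockMath {
    // 64 MB, the block size quoted on the slide.
    static final long BLOCK_SIZE_BYTES = 64L * 1024 * 1024;

    /** Number of blocks a file occupies; the last block may be only partly filled. */
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes < 0) {
            throw new IllegalArgumentException("file size must not be negative");
        }
        // Ceiling division without floating point.
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    /** Total number of block replicas the cluster stores for that file. */
    static long replicaCount(long fileSizeBytes, int replicationFactor) {
        return blockCount(fileSizeBytes) * replicationFactor;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));      // 16
        System.out.println(replicaCount(oneGiB, 3)); // 48 (assumed replication factor of 3)
    }
}
```

Because each block is replicated independently, losing one machine only means re-copying the blocks it held, which is what makes the self-healing behaviour practical.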

61 Why HDFS? Random reads Parallel reads Redundancy

62 Data Overload

63 Data Overload overflow from traditional RDBMS or log files destination?

68 Lab Test HDFS Upload a file List directories

69 Lab HDFS Upload Show contents of file in HDFS Show vocabulary of HDFS

70 HDFS Challenges Writes are re-writes today Append is planned Block size, alignment Small file inefficiencies NameNode SPOF

72 MapReduce

73 MapReduce is functional programming on a distributed processing platform

74 Hadoop's Premise: Geography-aware distribution of brute-force processing

75 MapReduce the algorithm

76 To appear in OSDI

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

77 A programming model and implementation for processing and generating large data sets

78 Lab: Verify Setup. Launch the cloudera-training VMware instance. Open a terminal window. Verify hadoop.

79 The process ➊
Map(k1, v1) -> list(k2, v2): every item is a parallel candidate for Map
Shuffle: (group) pairs from all lists by key
Reduce(k2, list(v2)) -> list(v3): Reduce in parallel on each group of keys
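The three steps above can be sketched in plain Java, with no Hadoop involved. This is a single-process illustration of the type signatures only; `MiniMapReduce`, `pair`, `mapParity` and `reduceSum` are hypothetical names, and a real cluster would run the map and reduce calls in parallel on different machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

final class MiniMapReduce {

    /** Convenience factory for a (key, value) pair. */
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    /** Map(k1,v1) -> list(k2,v2); shuffle by k2; Reduce(k2, list(v2)) -> list(v3). */
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> List<V3> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {

        // Map every input record, then shuffle: group emitted values by key.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> out : map.apply(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }

        // Reduce each group of values independently.
        List<V3> results = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            results.addAll(reduce.apply(group.getKey(), group.getValue()));
        }
        return results;
    }

    // Hypothetical example job: sum the even and the odd numbers separately.
    static List<Map.Entry<String, Integer>> mapParity(String id, Integer n) {
        return Collections.singletonList(pair(n % 2 == 0 ? "even" : "odd", n));
    }

    static List<String> reduceSum(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return Collections.singletonList(key + "=" + sum);
    }

    static List<String> demo() {
        List<Map.Entry<String, Integer>> input = Arrays.asList(
                pair("a", 1), pair("b", 2), pair("c", 3), pair("d", 4));
        return run(input, MiniMapReduce::mapParity, MiniMapReduce::reduceSum);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [even=6, odd=4]
    }
}
```

Neither `mapParity` nor `reduceSum` shares state with any other call, which is the property that lets a framework spread them across machines.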

80-84 ➋ Diagram, built up one stage per slide: Split, then Map (three tasks), then Shuffle (two groups), then Reduce (two tasks)

85 Lab MapReduce Run the grep Shakespeare job View the status consoles

86 MapReduce a word counting conceptual example

87 The Goal Provide the occurrence count of each distinct word across all documents

88 Raw Data: a folder of documents ➌
mydoc1.txt: At four years old I acted out
mydoc2.txt: Glad I am not four years old
mydoc3.txt: Yet I still act like someone four

89 Map: break documents into words
at | four | years | old | I | acted | out
glad | I | am | not | four | years | old
yet | I | still | act | like | someone | four

90 Shuffle: physically group (relocate) by key
at | four four four | old old | glad | I I I | am | yet | not | still | act | like | acted | years years | someone | out

91 Reduce: count word occurrences
at = 1, four = 3, old = 2, acted = 1, glad = 1, I = 3, am = 1, years = 2, yet = 1, not = 1, still = 1, act = 1, like = 1, someone = 1, out = 1

92 Reduce Again: sort occurrences alphabetically
act = 1, acted = 1, am = 1, at = 1, four = 3, glad = 1, I = 3, like = 1, not = 1, old = 2, out = 1, someone = 1, still = 1, years = 2, yet = 1
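The word-count walk-through can be reproduced in a few lines of plain Java. This is an in-memory sketch, not a Hadoop job: `WordCountSketch` and its method names are invented for illustration, every word is lowercased (so "I" is counted as "i"), and a `TreeMap` stands in for the alphabetical sort of the final step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {

    /** Map step: break one document into lowercase words. */
    static List<String> mapToWords(String document) {
        List<String> words = new ArrayList<>();
        for (String word : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Shuffle step: file a 1 under each word; the TreeMap keeps keys in alphabetical order. */
    static Map<String, List<Integer>> shuffle(List<String> documents) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String document : documents) {
            for (String word : mapToWords(document)) {
                groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return groups;
    }

    /** Reduce step: sum the 1s collected for each word. */
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(documents).entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    /** The three documents from the slides. */
    static List<String> slideDocuments() {
        return Arrays.asList(
                "At four years old I acted out",
                "Glad I am not four years old",
                "Yet I still act like someone four");
    }

    public static void main(String[] args) {
        System.out.println(countWords(slideDocuments()));
    }
}
```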

93 Grep.java

94-98
    package org.apache.hadoop.examples;

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.*;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /* Extracts matching regexs from input files and counts them. */
    public class Grep extends Configured implements Tool {
      private Grep() {}                               // singleton

      public int run(String[] args) throws Exception {
        if (args.length < 3) {
          System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
          ToolRunner.printGenericCommandUsage(System.out);
          return -1;
        }

        // Intermediate output of the first job, removed in the finally block.
        Path tempDir =
          new Path("grep-temp-" +
              Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

        JobConf grepJob = new JobConf(getConf(), Grep.class);

        try {
          // Job 1: emit (match, 1) per regex match and sum the counts.
          grepJob.setJobName("grep-search");

          FileInputFormat.setInputPaths(grepJob, args[0]);

          grepJob.setMapperClass(RegexMapper.class);
          grepJob.set("mapred.mapper.regex", args[2]);
          if (args.length == 4)
            grepJob.set("mapred.mapper.regex.group", args[3]);

          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);

          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormat(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);

          JobClient.runJob(grepJob);

          // Job 2: swap (match, count) to (count, match) and sort by count.
          JobConf sortJob = new JobConf(Grep.class);
          sortJob.setJobName("grep-sort");

          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormat(SequenceFileInputFormat.class);

          sortJob.setMapperClass(InverseMapper.class);

          sortJob.setNumReduceTasks(1);                 // write a single file
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          sortJob.setOutputKeyComparatorClass           // sort by decreasing freq
            (LongWritable.DecreasingComparator.class);

          JobClient.runJob(sortJob);
        }
        finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
      }

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Grep(), args);
        System.exit(res);
      }
    }

99 RegexMapper.java

100-101
    /**
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements. See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership. The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License. You may obtain a copy of the License at
     *
     *
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    package org.apache.hadoop.mapred.lib;

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    /** A {@link Mapper} that extracts text matching a regular expression. */
    public class RegexMapper<K> extends MapReduceBase
        implements Mapper<K, Text, Text, LongWritable> {

      private Pattern pattern;
      private int group;

      // Read the regex and the capture group from the job configuration.
      public void configure(JobConf job) {
        pattern = Pattern.compile(job.get("mapred.mapper.regex"));
        group = job.getInt("mapred.mapper.regex.group", 0);
      }

      // Emit (matched text, 1) for every match found in the input line.
      public void map(K key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter)
        throws IOException {
        String text = value.toString();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
          output.collect(new Text(matcher.group(group)), new LongWritable(1));
        }
      }
    }
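To see what the mapper and the summing reducer compute together without starting a cluster, here is a plain-Java sketch of the same logic. `RegexCountSketch` and `countMatches` are hypothetical names; the loop mirrors the `matcher.find()` loop above, and the `merge` call plays the role of the summing reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexCountSketch {

    /** Counts how often each regex match (or capture group) occurs across all lines. */
    static Map<String, Long> countMatches(List<String> lines, String regex, int group) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                // "Map" emits (match, 1); "reduce" adds the 1s up per match.
                counts.merge(matcher.group(group), 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Made-up sample input, two lines of text. */
    static Map<String, Long> demo() {
        List<String> lines = Arrays.asList("to be or not to be", "be brief");
        return countMatches(lines, "b\\w+", 0);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {be=3, brief=1}
    }
}
```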

102 Lab MapReduce View the Shakespeare output part-xxxx file Why part-xxxx naming?
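One way to think about the part-xxxx question: each reduce task writes its own output file, and each key is routed to exactly one reduce task. The sketch below imitates hash partitioning in plain Java. `PartitionSketch` is a hypothetical class, and it hashes a `String`, whereas Hadoop hashes its own `Text` keys, so the actual assignments on a cluster would differ.

```java
import java.util.Locale;

final class PartitionSketch {

    /** Picks the reduce task for a key: a non-negative hash modulo the number of reducers. */
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Output file name for a reduce task, zero-padded to five digits. */
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    public static void main(String[] args) {
        // With a single reducer every key lands in the same file.
        System.out.println(partFileName(partitionFor("hamlet", 1))); // part-00000
        // With four reducers there are up to four files, part-00000 to part-00003.
        System.out.println(partFileName(partitionFor("hamlet", 4)));
    }
}
```

The grep example sets the number of reduce tasks of its sort job to 1, which is why its final output is a single part file.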

103 Have Code, Will Travel Code travels to the data Opposite of traditional systems

104 Streaming

105 Unix via MapReduce Any UNIX command Any shell-invokable script Perl Python Ruby

106 Unix via MapReduce Line at a time Tab separator

107
    hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input people.csv \
      -output outputstuff \
      -mapper 'cut -f 2 -d,' \
      -reducer 'uniq'
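What this streaming job computes can be imitated in plain Java, which may help when checking results on a small sample. This is a sketch under assumptions: `StreamingSketch` is a hypothetical class, the sample rows are invented (the contents of people.csv are not shown in the slides), and the `Collections.sort` call stands in for the sort the framework performs between the mapper and the reducer, which is what lets `uniq` see duplicates next to each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class StreamingSketch {

    /** Mimics `cut -f 2 -d,`: the second comma-separated field, or the whole line if it has no comma. */
    static String cutSecondField(String line) {
        String[] fields = line.split(",", -1);
        return fields.length < 2 ? line : fields[1];
    }

    /** Mimics `uniq`: drops a line when it equals the line immediately before it. */
    static List<String> uniq(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        for (String line : sortedLines) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(line)) {
                out.add(line);
            }
        }
        return out;
    }

    /** Mapper, then sort, then reducer. */
    static List<String> run(List<String> inputLines) {
        List<String> mapped = new ArrayList<>();
        for (String line : inputLines) {
            mapped.add(cutSecondField(line));
        }
        Collections.sort(mapped); // stands in for the framework's sort by key
        return uniq(mapped);
    }

    /** Invented sample rows: id, surname, city. */
    static List<String> demo() {
        return run(Arrays.asList("1,Smith,Denver", "2,Jones,Boulder", "3,Smith,Aurora"));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Jones, Smith]
    }
}
```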

108 Lab Streaming Run a streaming job

109 The Grid DSLs Tools

110 The Grid

111 Motivations

112 1 Gigabyte

113 1 Terabyte

114 1 Petabyte

115 16 Petabytes

116 Near-Linear Hardware Scalability

117 Applications: Protein folding (pharmaceutical research). Search engine indexing (walking billions of web pages). Product recommendations (based on other customer purchases). Sorting (terabytes to petabytes in size). Classification (government intelligence).

118 Contextual Ads

119 Contextual Ads: Shopper A, looking at Product X, is told that 66% of other customers who bought X also bought Product Y. Customer B also bought Y. Customer C did not buy Y. Customer D also bought Y.

120 SELECT reccprod.name, reccprod.id FROM products reccprod WHERE purchases.customerid =

121 SELECT reccprod.name, reccprod.id FROM products reccprod WHERE purchases.customerid = (SELECT customerid FROM customers WHERE purchases.productid = thisprod) LIMIT 5
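The "66% of other customers who bought X also bought Y" figure is a simple ratio, sketched here in plain Java. `CoPurchaseSketch` and the basket data are hypothetical; the three customers are chosen so that two out of three buyers of X also bought Y, which rounds down to the slide's 66%.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CoPurchaseSketch {

    /** Percentage of customers who bought x that also bought y (0 if nobody bought x). */
    static double percentAlsoBought(Map<String, Set<String>> purchases, String x, String y) {
        int boughtX = 0;
        int boughtBoth = 0;
        for (Set<String> basket : purchases.values()) {
            if (basket.contains(x)) {
                boughtX++;
                if (basket.contains(y)) {
                    boughtBoth++;
                }
            }
        }
        return boughtX == 0 ? 0.0 : 100.0 * boughtBoth / boughtX;
    }

    /** Made-up purchase history for three customers. */
    static double slideExample() {
        Map<String, Set<String>> purchases = new LinkedHashMap<>();
        purchases.put("Customer B", new HashSet<>(Arrays.asList("X", "Y")));
        purchases.put("Customer C", new HashSet<>(Arrays.asList("X")));
        purchases.put("Customer D", new HashSet<>(Arrays.asList("X", "Y")));
        return percentAlsoBought(purchases, "X", "Y");
    }

    public static void main(String[] args) {
        System.out.printf("%.0f%%%n", Math.floor(slideExample())); // 66%
    }
}
```

On a grid the per-customer check would be the map step and the two counters would be summed in the reduce step, so the history never has to fit on one machine.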

122 Grid Benefits

123 Scalable Data storage is pipelined Code travels to data Near linear hardware scalability

124 Optimized: Preemptive execution. Maximizes use of faster hardware. New jobs trump P.E.

125 Fault Tolerant Configurable data redundancy Minimizes hardware failure impact Automatic job retries Self healing filesystem

126 Sproinnnng! Bzzzt! Poof!

127 Server Funerals: No pagers go off when machines die. Report of dead machines once a week. Clean out the carcasses.

128 Robustness attributes prevented from bleeding into application code Data redundancy Node death Retries Data geography Parallelism Scalability

129 Components

130 Hadoop Components

131 Hadoop Components Column Storage Workflow/ Transactions Data Warehousing MapReduce Common Filesystem

132 Hadoop Components
Tool: Purpose
Common: MapReduce
HDFS: Filesystem
Pig: Analyst MapReduce language
HBase: Column-oriented data storage
Hive: SQL-like language for data warehousing on Hadoop
ZooKeeper: Workflow & distributed transactions
Chukwa: Log file processing

134 DSLs

135 Sync, Async Pig is asynchronous Hive is asynchronous HBase is near realtime RDBMS is realtime

136 Pig

137 Pig Basics Yahoo-authored add-on High-level language for authoring data analysis programs Console

138 Pig Sample
    Person = LOAD 'people.csv' USING PigStorage(',');
    Names = FOREACH Person GENERATE $2 AS name;
    OrderedNames = ORDER Names BY name ASC;
    GroupedNames = GROUP OrderedNames BY name;
    NameCount = FOREACH GroupedNames GENERATE group, COUNT(OrderedNames);
    STORE NameCount INTO 'names.out';

141 Hive

142 Hive Basics: Authored by Facebook. SQL-like interface to data stored in Hadoop. Hive is low-level. Hive-specific metadata. Data warehousing.

143 SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;

144 Sync, Async RDBMS SQL is realtime Hive is primarily asynchronous

145 HBase

149 HBase Basics Map-oriented storage Key value pairs Column families Stores to HDFS Fast Usable for synchronous responses

150
    hbase> help
    create 'mylittletable', 'mylittlecolumnfamily'
    describe 'mylittletable'
    put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x'
    get 'mylittletable', 'r2'
    scan 'mylittletable'
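"Map-oriented storage" can be pictured as a sorted map of rows, each holding a sorted map of columns. The following is a toy, in-memory sketch of that idea and nothing more: `TinyTableSketch` is a hypothetical class, it ignores timestamps, versions, and persistence to HDFS, and it only mirrors the `create`, `put`, `get` and `scan` commands from the shell session above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

final class TinyTableSketch {
    private final Set<String> families;
    // row key -> (column name -> value), both levels kept sorted by key.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    /** Like `create`: the column families are fixed when the table is made. */
    TinyTableSketch(String... columnFamilies) {
        this.families = new HashSet<>(Arrays.asList(columnFamilies));
    }

    /** Like `put`: the column is "family" or "family:qualifier". */
    void put(String row, String column, String value) {
        String family = column.split(":", 2)[0];
        if (!families.contains(family)) {
            throw new IllegalArgumentException("unknown column family: " + family);
        }
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    /** Like `get`: all columns stored for one row (empty if the row is absent). */
    Map<String, String> get(String row) {
        return rows.getOrDefault(row, Collections.<String, String>emptyNavigableMap());
    }

    /** Like `scan`: every row in row-key order. */
    NavigableMap<String, NavigableMap<String, String>> scan() {
        return Collections.unmodifiableNavigableMap(rows);
    }

    public static void main(String[] args) {
        TinyTableSketch table = new TinyTableSketch("mylittlecolumnfamily");
        table.put("r2", "mylittlecolumnfamily", "x");
        System.out.println(table.get("r2")); // {mylittlecolumnfamily=x}
        System.out.println(table.scan());    // {r2={mylittlecolumnfamily=x}}
    }
}
```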

151 Sqoop

153 Sqoop A utility Imports from RDBMS Outputs plaintext, SequenceFile, or Hive

154 sqoop --connect jdbc:mysql://database.example.com/

156 Tools

157 Monitoring

158 Web Status Panels NameNode JobTracker

160 Extended Family Members

161 Cascading

162 Another DSL Java-based Different vocabulary Abstraction from MapReduce Uses Hadoop Cascalog for Clojure
