Hadoop and ecosystem * The opinions in this document are solely the author's own * Some figures in this document come from the Internet. Information Management. IBM CDL Lab

Transcription

1 IBM CDL Lab: Hadoop and ecosystem
* The opinions in this document are solely the author's own
* Some figures in this document come from the Internet
Information Management
2012 IBM Corporation

2 Agenda
- Hadoop technology
  - Hadoop overview
  - Hadoop 1.x
  - Hadoop 2.x
- Hadoop ecosystem

3 The Big Data Challenge
Extracting insight from a high volume, variety and velocity of data in a timely and cost-effective manner is difficult with traditional technology. Why?
- Data size is too large for traditional technology to process
  - Too slow: reading 1 TB of data from 1 disk takes 3.4 hours
  - System performance drops sharply and the system may crash
- Too expensive
  - A large server costs a lot of money (millions)
- Unable to process semi-structured or unstructured data
  - Current database technology and other popular technologies can only compute on structured data
- Unable to process flowing data
  - Traditional technology can only process data at rest
We need a new technology!

4 Technology improvements through the years
- CPU speeds: … MIPS at 40 MHz; …,561 MIPS at 1.2 GHz; …,600 MIPS at 3.3 GHz
- RAM: … K conventional memory (256K extended memory recommended); … MB memory; … GB (and more)
- Disk capacity: … MB; … GB; … TB
- Disk latency (speed of reads and writes): not much improvement in the last 7-10 years
Access speeds haven't caught up with the increase in storage capacity. Disk latency is a bottleneck for data processing.

5 How long does it take to read 1 TB of data?
1 TB (at 80 MB/s per disk):
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1000 disks: 12 sec
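The figures on this slide follow from dividing the data size by the aggregate read rate. The sketch below reproduces that arithmetic, assuming a decimal terabyte (1,000,000 MB), 80 MB/s per disk, and a perfectly even spread of data over the disks; the class and method names are illustrative only. Under these assumptions a single disk needs 12,500 seconds (about 3.5 hours), in line with the slide's rounded numbers.

```java
final class ParallelReadTime {
    /** Seconds needed to read totalMegabytes when the load is spread evenly over the disks. */
    static double secondsToRead(double totalMegabytes, double megabytesPerSecondPerDisk, int disks) {
        if (disks <= 0) {
            throw new IllegalArgumentException("disks must be positive");
        }
        return totalMegabytes / (megabytesPerSecondPerDisk * disks);
    }

    public static void main(String[] args) {
        double oneTerabyteInMb = 1_000_000.0; // assumption: decimal terabyte
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double seconds = secondsToRead(oneTerabyteInMb, 80.0, disks);
            System.out.printf("%4d disk(s): %.1f s (%.2f h)%n", disks, seconds, seconds / 3600.0);
        }
    }
}
```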

6 Parallel data processing is the answer!
It has been with us for a while:
- Grid computing: spreads the processing load
- Distributed workload: applications are hard to manage, overhead for the developer
- Parallel databases: DB2 DPF, Teradata, DB2 PureScale, etc. (distribute the data)

7 However, parallel data processing is not easy
- Handle partial hardware failures without going down: if a machine fails, the system should switch over to a standby machine
- Scalability: increase capacity without restarting the whole system (PureScale); more computing power should mean faster processing
- Result consistency: the answer should be consistent (regardless of what fails) and returned in a reasonable amount of time
Is there a technology that handles all of this itself, so that developers can focus on business data analysis?
The answer is YES: Hadoop is a good one to use

8 What is Hadoop?
- Open source project
- Written in Java
- Optimized to handle
  - Massive amounts of data through parallelism
  - A variety of data (structured, unstructured, semi-structured)
  - Using inexpensive commodity hardware
- Reliability provided through replication
- Inspired by Google technologies:
  - MapReduce
  - Google File System

9 Hadoop: some history
- Created by Doug Cutting. "Hadoop" is the name his child gave to a stuffed yellow elephant
- In 2002, Doug Cutting was working on Nutch, an open source web search engine. The Nutch team realized that their architecture wouldn't scale to the billions of pages on the Web
- In 2003, Google published a paper about its distributed filesystem, GFS
- In 2004, Google published the paper that introduced MapReduce to the world
- The Nutch developers built the Nutch Distributed Filesystem and a working MapReduce implementation
- In 2006, these moved out of Nutch and became a subproject of Lucene called Hadoop; in 2008, Hadoop became a top-level Apache project

10 Timeline
- 2009: Cloudera delivered CDH
- 2010: v0.20.2 (first version to include all features)
- 2010/05: IBM delivered InfoSphere BigInsights
- 2011/07: Yahoo! created Hortonworks
- 2012: v…
- …: v2.1.0-beta
- 2013: v…
- …: v…
- …: v2.4
[Photo: Doug Cutting]

11 Hadoop Ecosystem
- Core: HDFS and MapReduce
- Other open source projects: HBase, Pig, Hive

12 Hadoop Applicable Area
Applicable to:
- TB-level data processing
- Write once, read many times
- Massively parallel workloads using hundreds of CPU cores
- Examples: ad placement, personal recommendations, result ranking, advertiser ROI
- More on …
Not applicable to:
- Workloads where low-latency data access is critical
- Processing a small subset of the data within a large data set
- Real-time processing of data that must be handled immediately

13 HDFS

14 HDFS: Hadoop Distributed File System
Characteristics:
- Stores very large files (GBs to TBs; a modest number of large files)
- Designed for streaming data access patterns (write once, read many times)
- Runs on clusters of commodity hardware, with built-in fault tolerance
HDFS is not suitable for:
- Low-latency data access
- Lots of small files
- Multiple writers or arbitrary file modifications

15 Hadoop 1.x Framework - HDFS

16 HDFS Concepts
- Blocks: HDFS also has the concept of a block, but it is a much larger unit than a disk block: 64 MB by default. For larger files, 128 MB is recommended
- NameNode: manages the filesystem namespace. It maintains the filesystem tree and the metadata for all files and directories in the tree
- Secondary NameNode: periodically merges the namespace image with the edit log to prevent the edit log from becoming too large. The merged namespace image is also kept on the NameNode
- DataNode: stores the data. DataNodes report back to the NameNode periodically with lists of the blocks they are storing

17 Blocks
- Blocks are stored on DataNodes
- Each block is stored in 3 locations (the default, configurable through the replication factor), one of them on a separate rack
- The names and locations of a file's blocks are managed by the NameNode
- If the last chunk of a file is smaller than an HDFS block, only the needed space is used
[Figure: a file stored as blocks of 128 MB, 128 MB, 128 MB and 66 MB]
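The figure's block sizes can be reproduced with a few lines of arithmetic. This is a minimal sketch of how a file length maps onto fixed-size blocks with a shorter final block; it is not HDFS code, and the class and method names are made up for illustration. Sizes are in MB for readability.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockSplitter {
    /** Returns the size of each block needed to hold a file of the given size. */
    static List<Long> blockSizes(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> sizes = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long size = Math.min(blockSize, remaining); // the last block only uses what it needs
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // A 450 MB file with 128 MB blocks, as in the figure: three full blocks and one of 66 MB
        System.out.println(blockSizes(450, 128));
    }
}
```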

18 NameNode
- Manages the filesystem namespace; file data never passes through or is stored on the NameNode
- Single NameNode per cluster
- Stores all metadata in memory to serve read requests:
  - File name
  - Permissions
  - Owner
  - Group
  - Location of all blocks for a file
  - ...
- All write operations are recorded in the edit log (synced after every write); the in-memory metadata is updated after the edit log has been written
- Also stores the metadata and the edit log on disk (fsimage; the location of a file's blocks is not stored in fsimage)

19 NameNode directory structure
Directory structure:
${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime
- VERSION is a Java properties file that records the HDFS version
- edits is the edit log
- fsimage is the on-disk checkpoint of the metadata. On recovery from a failure, fsimage is loaded into memory and then the edit log is replayed
- fstime records when the current fsimage was generated

20 Secondary NameNode
- Responsible for checkpoints: periodically merges fsimage and the edit log -> generates a new fsimage -> sends the new fsimage back to the NameNode -> the old edit log and fsimage on the NameNode are replaced and fstime is updated
- Its directory structure is the same as the NameNode's, except that it keeps the previous checkpoint in addition to the current one
- Memory requirements are the same as the NameNode's (large)
- Typically runs on a separate machine in a large cluster (> 10 nodes)
- Can be used to restore a failed NameNode (copy its current directory to the new NameNode)
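The merge step can be pictured as replaying the edit log over the last image. The sketch below is a toy model, not the HDFS implementation: it assumes the namespace is just a sorted set of paths and that the edit log contains only create and delete operations. All names (CheckpointSketch, Edit, checkpoint) are hypothetical. The same replay is what the NameNode does on recovery when it loads fsimage and then applies the edit log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class CheckpointSketch {
    /** One logged namespace change; only create and delete are modelled. */
    static final class Edit {
        final boolean create;
        final String path;

        Edit(boolean create, String path) {
            this.create = create;
            this.path = path;
        }
    }

    /** Replays the edit log, in order, over a copy of the image and returns the merged image. */
    static SortedSet<String> checkpoint(SortedSet<String> fsimage, List<Edit> editLog) {
        SortedSet<String> merged = new TreeSet<>(fsimage);
        for (Edit edit : editLog) {
            if (edit.create) {
                merged.add(edit.path);
            } else {
                merged.remove(edit.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedSet<String> fsimage = new TreeSet<>(Arrays.asList("/logs", "/tmp/a"));
        List<Edit> editLog = Arrays.asList(new Edit(true, "/data/b"), new Edit(false, "/tmp/a"));
        // After the checkpoint the edit log can be discarded: its effect is now in the image
        System.out.println(checkpoint(fsimage, editLog));
    }
}
```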

21 DataNode
- Storage for file blocks
- Different blocks of the same file are stored on different DataNodes
- The replication factor is configurable
- The block size is configurable
- Each DataNode reports to the NameNode periodically

22 Read file from HDFS

23 Read file from HDFS (cont'd)
- During reading, if the DFSInputStream encounters an error while communicating with a datanode, it tries the next closest one for that block. It also remembers datanodes that have failed so that it doesn't needlessly retry them for later blocks
- The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before the DFSInputStream attempts to read a replica of the block from another datanode
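The retry behaviour described on this slide can be sketched as a loop over replicas ordered by distance. This is a simplified illustration, not the DFSInputStream code: BlockSource is a hypothetical stand-in for the network read, checksum verification is left out, and the list of replicas is assumed to arrive already sorted closest-first.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ReplicaReadSketch {
    /** Hypothetical stand-in for the network read from one datanode. */
    interface BlockSource {
        byte[] read(String datanode, String blockId) throws IOException;
    }

    // Datanodes that failed earlier; they are skipped for later blocks
    private final Set<String> failedNodes = new HashSet<>();

    Set<String> failedNodes() {
        return failedNodes;
    }

    /** Reads a block from the closest replica that is not known to have failed. */
    byte[] readBlock(String blockId, List<String> replicasByDistance, BlockSource source) {
        for (String datanode : replicasByDistance) {
            if (failedNodes.contains(datanode)) {
                continue; // do not retry a node that already failed
            }
            try {
                return source.read(datanode, blockId);
            } catch (IOException e) {
                failedNodes.add(datanode); // remember the failure, then try the next closest
            }
        }
        throw new IllegalStateException("no readable replica for " + blockId);
    }

    public static void main(String[] args) {
        ReplicaReadSketch reader = new ReplicaReadSketch();
        BlockSource source = (datanode, blockId) -> {
            if (datanode.equals("dn1")) {
                throw new IOException("connection refused");
            }
            return new byte[] {42};
        };
        reader.readBlock("blk_1", Arrays.asList("dn1", "dn2"), source);
        System.out.println("failed so far: " + reader.failedNodes());
    }
}
```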

24 Write file to HDFS

25 How does the NameNode choose the DataNodes that hold the replicas?
[Figure: a client creates blocks B1, B2 and B3; the NameNode assigns each block's replicas to DataNodes spread over three racks (R1, R2, R3). DataNodes are labelled by rack and node, e.g. DataNodeR1N1, DataNodeR2N1, DataNodeR3N1.]
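The figure itself did not survive, so the following is only a sketch of rack-aware placement under a stated assumption: the commonly described default policy of putting the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. Real HDFS picks nodes at random and also considers load; this version is deterministic so that it is easy to follow, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReplicaPlacementSketch {
    /** Picks three targets: the writer's node, then two distinct nodes on one other rack. */
    static List<String> chooseTargets(String writerRack, String writerNode,
                                      Map<String, List<String>> nodesByRack) {
        List<String> targets = new ArrayList<>();
        targets.add(writerNode); // replica 1: the node the client writes from
        for (Map.Entry<String, List<String>> rack : nodesByRack.entrySet()) {
            boolean remote = !rack.getKey().equals(writerRack);
            if (remote && rack.getValue().size() >= 2) {
                targets.add(rack.getValue().get(0)); // replica 2: a node on a different rack
                targets.add(rack.getValue().get(1)); // replica 3: another node on that rack
                return targets;
            }
        }
        throw new IllegalStateException("no remote rack with at least two nodes");
    }

    public static void main(String[] args) {
        Map<String, List<String>> nodesByRack = new LinkedHashMap<>();
        nodesByRack.put("R1", Arrays.asList("DataNodeR1N1", "DataNodeR1N2"));
        nodesByRack.put("R2", Arrays.asList("DataNodeR2N1", "DataNodeR2N2"));
        System.out.println(chooseTargets("R1", "DataNodeR1N1", nodesByRack));
    }
}
```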

26 Write file to HDFS (cont'd)
If writing to a replica fails:
- The failed datanode is removed from the pipeline and the remainder of the block's data is written to the two good datanodes in the pipeline
- It's possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (default one) are written, the write succeeds, and the block is asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three)
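The success rule on this slide comes down to two comparisons. The sketch below encodes them using the slide's defaults (dfs.replication.min of one, dfs.replication of three); it is an illustration of the rule, not HDFS code, and the class and method names are invented.

```java
final class WritePipelineSketch {
    /** The write succeeds once at least minReplication replicas have been written. */
    static boolean writeSucceeds(int replicasWritten, int minReplication) {
        return replicasWritten >= minReplication;
    }

    /** Replicas the cluster still has to create asynchronously to reach the target factor. */
    static int replicasToCreateLater(int replicasWritten, int targetReplication) {
        return Math.max(0, targetReplication - replicasWritten);
    }

    public static void main(String[] args) {
        int minReplication = 1;    // dfs.replication.min default from the slide
        int targetReplication = 3; // dfs.replication default from the slide
        int replicasWritten = 2;   // one of three pipeline datanodes failed
        System.out.println("write succeeds: " + writeSucceeds(replicasWritten, minReplication));
        System.out.println("replicas to create later: "
                + replicasToCreateLater(replicasWritten, targetReplication));
    }
}
```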

27 HDFS command line
hadoop fs <args>

28 MapReduce

29 MapReduce framework
- A distributed data processing model and execution environment that runs on large clusters of commodity machines
- MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output
- The programmer usually needs to specify the map function and the reduce function, plus the configuration needed to run them

30 JobTracker and TaskTrackers
Two types of daemons control job execution:
- JobTracker (master node)
- TaskTrackers (slave nodes)
How a job runs:
- The job is sent to the JobTracker
- The JobTracker communicates with the NameNode and assigns parts of the job to TaskTrackers (a TaskTracker runs on each DataNode)
- A task is a single map or reduce operation over a piece of the data
- Hadoop divides the input of a MapReduce job into equal-size splits
- The JobTracker knows (from the NameNode) which node contains the data and which other machines are nearby
- TaskTrackers send heartbeats to the JobTracker

31 Hadoop 1.x Framework - MapReduce

32 An example of MapReduce
We have a weather data set (lots of files) in which each line of a file is one record for a given year. We need the highest temperature for each year.
- Input for Map (byte offset, record line): (0, …) (33, …) (66, …) (99, …) (132, …) (165, …) (198, …) (231, …) (264, …) (297, …)
- Output for Map: (1950, 0) (1950, 22) (1950, -11) (1949, 111) (1949, 78) (1937, 1) (1937, -2) (1945, 1) (1945, 2) (1945, 78)
- Shuffle
- Input for Reduce: (1950, [0, 22, -11]) (1949, [111, 78]) (1937, [1, -2]) (1945, [1, 2, 78])
- Output for Reduce: (1950, 22) (1949, 111) (1937, 1) (1945, 78)
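The data flow on this slide can be traced without a cluster. The sketch below takes the slide's map output, groups it by year (the shuffle) and keeps the maximum per year (the reduce), all in memory with plain Java collections. It is meant to show the shape of the computation, not how Hadoop executes it; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MaxTemperatureInMemory {
    /** Shuffle: groups (year, temperature) pairs by year. */
    static Map<Integer, List<Integer>> shuffle(int[][] mapOutput) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], year -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    /** Reduce: keeps the highest temperature seen for each year. */
    static Map<Integer, Integer> reduce(Map<Integer, List<Integer>> grouped) {
        Map<Integer, Integer> maxByYear = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : grouped.entrySet()) {
            maxByYear.put(entry.getKey(), Collections.max(entry.getValue()));
        }
        return maxByYear;
    }

    public static void main(String[] args) {
        // Map output copied from the slide
        int[][] mapOutput = {
            {1950, 0}, {1950, 22}, {1950, -11}, {1949, 111}, {1949, 78},
            {1937, 1}, {1937, -2}, {1945, 1}, {1945, 2}, {1945, 78}
        };
        System.out.println(reduce(shuffle(mapOutput)));
    }
}
```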

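33 Map code

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Sentinel the weather records use for a missing temperature reading
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt in older Java versions rejects a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for readings that are present and have an accepted quality code
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}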

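34 Reduce code

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Scan all temperatures for this year and keep the highest
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}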

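35 Code to run the job

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");
        // Input and output locations come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Block until the job finishes; exit code 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}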

36 MapReduce Processing
- The user runs a program on a client computer
- The program submits a job to Hadoop. The job contains:
  - Input data
  - Map / Reduce program
  - Configuration information
- Hadoop runs the job by dividing it into tasks: map tasks and reduce tasks
- Hadoop divides the job's input into fixed-size splits (one HDFS block each) and creates 1 map task for each split
- Hadoop does its best to run each map task on a node where its input data resides in HDFS. This is called the data locality optimization
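Since one map task is created per split, the number of map tasks for a job can be estimated from the input file sizes. The sketch below assumes each file is split independently into block-size splits and rounds up per file; it ignores details of the real input format, such as allowing a slightly oversized final split, and its names are illustrative only.

```java
final class MapTaskEstimate {
    /** Estimated map tasks: one per split, with each file split on its own. */
    static long mapTasksFor(long[] fileSizes, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        long tasks = 0;
        for (long size : fileSizes) {
            tasks += (size + splitSize - 1) / splitSize; // ceiling division per file
        }
        return tasks;
    }

    public static void main(String[] args) {
        long[] fileSizesInMb = {450, 128, 1}; // hypothetical input files
        System.out.println(mapTasksFor(fileSizesInMb, 128) + " map tasks");
    }
}
```

The small third file still costs a whole map task, which is one reason many small files suit this model poorly.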

37 MapReduce application (word count)

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text val, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts collected for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> val, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : val) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

[Figure: the MapReduce application distributes map tasks to the Hadoop data nodes in the cluster and returns a single result set]
1. Map phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)

38 RDBMS and Hadoop: complementary, not competing
RDBMS | Hadoop
Structured data with known schemas | Unstructured and structured data
Records, long fields, objects, XML | Files
Updates allowed | Only inserts and deletes
SQL & XQuery | Hive, Pig, Jaql
Quick response, random access | Batch processing
Data loss is not acceptable | Data loss can happen sometimes
Security and auditing | Not yet
Encryption | Not yet
Sophisticated data compression | Simple file compression
Enterprise hardware | Commodity hardware
40+ years of innovation | 10+ years old technology
Random access (indexing) | Access files only (streaming)
Large DBA and application development community, widely used | Small number of companies using it in production, many startups

39 Hadoop 1.x limitations
HDFS:
- Single point of failure
- Heavy workload on the NameNode
MapReduce:
- Scalability (maximum 4000 nodes)
- Availability (the JobTracker carries a heavy workload and is a single point of failure)
- Inefficient resource management (resources are divided into map task slots and reduce task slots)
Processing frameworks work separately and are unable to share data or resources

40 Hadoop 2.x Framework: HDFS HA

41 Hadoop 2.x Framework: HDFS Federation

42 Hadoop 2.x Framework: HDFS Federation
- Two layers
- Client-side mount table

43 HA & Federation

44 Hadoop 1.x limitations
HDFS:
- Single point of failure
- Heavy workload on the NameNode
MapReduce:
- Scalability (maximum 4000 nodes)
- Availability (the JobTracker carries a heavy workload and is a single point of failure)
- Inefficient resource management (resources are divided into map task slots and reduce task slots)
Processing frameworks work separately and are unable to share data or resources

45 Hadoop 2.x Framework - YARN (Yet Another Resource Negotiator)

46 Hadoop 2.x Framework - YARN

47 Hadoop 2.x Framework - YARN

48 Job runs on YARN

49 Advantages of YARN
- Good scalability
- Supports multiple processing frameworks
- Efficient cluster usage

50 HBase on YARN: Hoya (HBase v…)

51 HBase on YARN: Hoya

52 Hadoop Ecosystem

53 Questions?