CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

1 CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

2 Objectives: Explain why and how Hadoop has become the foundation for virtually all modern data architectures. Explain the architecture and design of HDFS and MapReduce.

3 Agenda: What is it? Why should I care? Example Usage. Hadoop Core: HDFS, MapReduce. Ecosystem, maybe.

4 WHAT IS HADOOP?

5 You tell me

6 Distributed Systems: "Collection of independent computers that appears to its users as a single coherent system"* Take a bunch of Ethernet cable, wire together machines with a hefty amount of CPU, RAM, and storage. Figure out how to build software to parallelize storage/computation/etc. across our cluster, while tolerating failures and all kinds of other garbage. The hard part. *Andrew S. Tanenbaum

7 Properties of Distributed Systems: Reliability, Availability, Scalability, Efficiency

8 Reliability: Can the system deliver services in the face of several component failures?

9 Availability: How much latency is imposed on the system when a failure occurs?

# of Nines  Availability %  Per year       Per month      Per week
1           90%             36.5 days      72 hours       16.8 hours
2           99%             3.65 days      7.20 hours     1.68 hours
3           99.9%           8.76 hours     43.8 minutes   10.1 minutes
4           99.99%          ... minutes    4.38 minutes   1.01 minutes
5           99.999%         5.26 minutes   25.9 seconds   6.05 seconds
6           99.9999%        31.5 seconds   2.59 seconds   ... ms
7           99.99999%       3.15 seconds   ... ms         ... ms
8           99.999999%      ... ms         ... ms         ... ms
9           99.9999999%     ... ms         ... ms         ... ms
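
The per-year column follows directly from the availability percentage. As a minimal sketch (the class and method names are hypothetical, and a 365-day year is assumed), the yearly downtime budget for a given number of nines can be computed like this:

```java
// Sketch: downtime budget implied by "n nines" of availability.
// Assumes a 365-day year; names here are illustrative only.
final class Downtime {
    static final double SECONDS_PER_YEAR = 365 * 24 * 60 * 60; // 31,536,000

    /** Availability as a fraction: 1 nine = 0.9, 2 nines = 0.99, and so on. */
    static double availability(int nines) {
        return 1.0 - Math.pow(10, -nines);
    }

    /** Seconds per year the system may be down and still meet the target. */
    static double secondsPerYear(int nines) {
        return SECONDS_PER_YEAR * Math.pow(10, -nines);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 9; n++) {
            System.out.printf("%d nines: %.7f%% available, %.4f s/year of downtime%n",
                    n, availability(n) * 100, secondsPerYear(n));
        }
    }
}
```

Each extra nine divides the allowed downtime by ten, which is why the later rows drop from seconds into milliseconds.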

10 Scalability: Can the system scale to support a growing number of tasks?

11 Efficiency: How efficient is the system, in terms of latency and throughput?

12 WHY SHOULD I CARE?

13 The Short Answer is

14 The V's of Big Data: Volume, Velocity, Variety, Veracity, Value. New and Improved!

15 Data Sources: Social Media; Web Logs; Video; Networks; Sensors; Transactions (banking, etc.); Text Messaging; Paper Documents

16 Value in all of this! Fraud Detection; Predictive Models; Recommendations; Analyzing Threats; Market Analysis; Others!

17 How do we extract value? Monolithic Computing! Build bigger, faster computers! A limited solution: expensive, and it does not scale as data volume increases.

18 Enter Distributed Computing: Processing is distributed across many computers. Distribute the workload to powerful compute nodes with some separate storage.

19 New Class of Challenges: Does not scale gracefully. Hardware failure is common. Sorting, combining, and analyzing data spread across thousands of machines is not easy.

20 RDBMS is still alive! SQL is still very relevant, with many complex business rules mapping to a relational model. But there is much more that can be done by leaving relational behind and looking at all of your data. NoSQL can expand what you have, but it has its disadvantages.

21 NoSQL Downsides: Increased middle-tier complexity; Constraints on query capability; No standard semantics for query; Complex to set up and maintain

22 An Ideal Cluster: Linear horizontal scalability; Analytics run in isolation; Simple API with multiple language support; Available in spite of hardware failure

23 Enter Hadoop: Hits these major requirements. Two core pieces: a Distributed File System (HDFS) and a flexible analytic framework (MapReduce). Many ecosystem components to expand on what core Hadoop offers.

24 Scalability: Near-linear horizontal scalability; Clusters are built on commodity hardware; Component failure is an expectation

25 Data Access: Moving data from storage to a processor is expensive. Store data and process the data on the same machines. Process data intelligently by being local.

26 Disk Performance: Disk technology has made significant advancements. Take advantage of multiple disks in parallel: 1 disk, 3 TB of data, 300 MB/s, ~2.8 hours to read; 1,000 disks, same data, ~10 seconds to read. Distribution of data and co-location of processing makes this a reality.
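
The arithmetic behind this slide is a single division. The sketch below is a hypothetical helper, assuming decimal units (3 TB = 3,000,000 MB) and data striped evenly over the disks with no coordination overhead. At 300 MB/s a single disk needs 10,000 seconds (roughly 2.8 hours), and 1,000 disks read the same data in about 10 seconds.

```java
// Sketch: time to scan a dataset when it is spread evenly across disks.
// Assumes perfect parallelism and decimal units; names are illustrative.
final class DiskScan {
    /** Seconds to read dataMb megabytes from `disks` disks in parallel. */
    static double scanSeconds(double dataMb, double mbPerSecondPerDisk, int disks) {
        return dataMb / (mbPerSecondPerDisk * disks);
    }

    public static void main(String[] args) {
        double threeTbInMb = 3_000_000;
        System.out.println("1 disk:     " + scanSeconds(threeTbInMb, 300, 1) + " s");
        System.out.println("1000 disks: " + scanSeconds(threeTbInMb, 300, 1000) + " s");
    }
}
```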

27 Complex Processing Code: The Hadoop framework abstracts the complex distributed computing environment: no synchronization code, no networking code, no I/O code. The MapReduce developer focuses on the analysis. A job runs the same on one node or 4,000 nodes!

28 Fault Tolerance: Failure is inevitable, and it is planned for. Component failure is automatically detected and seamlessly handled. System continues to operate as expected with minimal degradation.

29 Hadoop History: Spun off of Nutch, an open source web search engine. Google whitepapers: GFS, MapReduce. Nutch re-architected to birth Hadoop. Hadoop is very mainstream today.

30 Hadoop Versions: Hadoop is the latest release. Two Java APIs, referred to as the old and new APIs. Two task management systems: MapReduce v1 (JobTracker, TaskTracker) and MapReduce v2 (a YARN application; MR runs in containers in the YARN framework).

31 Common Use Cases: Log Processing; Image Identification; Extract Transform Load (ETL); Recommendation Engines; Time-Series Storage and Processing; Building Search Indexes; Long-Term Archive; Audit Logging

32 Non-Use Cases: Data processing handled by one large server; ACID Transactions

33 LinkedIn Architecture

34 LinkedIn Applications

37 HDFS

38 Hadoop Distributed File System: Inspired by Google File System (GFS). High performance file system for storing data. Relatively simple centralized management. Fault tolerance through data replication. Optimized for MapReduce processing, exposing data locality. Linearly scalable. Written in Java. APIs in all the useful languages.

39 Hadoop Distributed File System: Use of commodity hardware; Files are write once, read many; Leverages large streaming reads vs random; Favors high throughput vs low latency; Modest number of huge files

40 HDFS Architecture: Split large files into blocks. Distribute and replicate blocks to nodes. Two key services: a master NameNode and many DataNodes. Backup/Checkpoint NameNode for HA. User interacts with a UNIX file system.

41 NameNode: Single master service for HDFS. Was a single point of failure. Stores file to block to location mappings in a namespace. All transactions are logged to disk. Can recover based on checkpoints of the namespace and transaction logs.

42 NameNode Memory: HDFS prefers fewer, larger files as a result of the metadata. Consider 1 GB of data with a 64 MB block size. Stored as one file: Name: 1 item; Blocks: 16*3 = 48 items; Total items: 49. Stored as 1,024 1 MB files: Names: 1024 items; Blocks: 1024*3 = 3072 items; Total items: 4096. Each item is around 200 bytes.
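
The item counts on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch using the slide's own accounting (one item per file name plus one item per block replica, about 200 bytes each); the class and method names are hypothetical, and it is not how the NameNode itself measures its heap.

```java
// Sketch: NameNode metadata items for a set of equally sized files,
// following the slide's accounting. Names are illustrative only.
final class NameNodeItems {
    static final long BYTES_PER_ITEM = 200; // "around 200 bytes" per the slide

    /** One item per file name plus one per block replica. */
    static long items(long fileCount, long fileSizeMb, long blockSizeMb, int replication) {
        // Ceiling division: even a tiny file occupies one block.
        long blocksPerFile = Math.max(1, (fileSizeMb + blockSizeMb - 1) / blockSizeMb);
        return fileCount + fileCount * blocksPerFile * replication;
    }

    static long approxBytes(long items) {
        return items * BYTES_PER_ITEM;
    }

    public static void main(String[] args) {
        long oneBigFile = items(1, 1024, 64, 3);    // 1 GB in a single file
        long manySmall = items(1024, 1, 64, 3);     // the same 1 GB as 1 MB files
        System.out.println(oneBigFile + " items, ~" + approxBytes(oneBigFile) + " bytes");
        System.out.println(manySmall + " items, ~" + approxBytes(manySmall) + " bytes");
    }
}
```

The same gigabyte costs roughly 80 times more NameNode memory when stored as small files, which is the reason behind the "fewer, larger files" advice.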

43 Checkpoint Node (Secondary NN): Performs checkpoints of the namespace and logs. Not a hot backup! HDFS 2.0 introduced NameNode HA: an Active and a Standby NameNode service coordinated via ZooKeeper.

44 DataNode: Stores blocks on local disk; Sends frequent heartbeats to NameNode; Sends block reports to NameNode; Clients connect to DataNodes for I/O

45 How HDFS Works - Writes: (1) Client contacts NameNode to write data. (2) NameNode says write it to these nodes. (3) Client sequentially writes blocks to DataNodes. [Diagram: Client, NameNode, and blocks A1-A4 written across DataNode A, DataNode B, DataNode C, DataNode D]

46 How HDFS Works - Writes: DataNodes replicate data blocks, orchestrated by the NameNode. [Diagram: Client, NameNode, and DataNodes A-D; each of the blocks A1-A4 now appears on three DataNodes]

47 How HDFS Works - Reads: (1) Client contacts NameNode to read data. (2) NameNode says you can find it here. (3) Client sequentially reads blocks from DataNodes. [Diagram: Client, NameNode, and DataNodes A-D holding three replicas each of blocks A1-A4]

48 How HDFS Works - Failure: Client connects to another node serving that block. [Diagram: Client, NameNode, and DataNodes A-D holding three replicas each of blocks A1-A4]

49 HDFS Blocks: Default block size was 64 MB, now 128 MB. Configurable; larger sizes are common, say 256 MB or 512 MB. Default replication is three. Also configurable, but rarely changed. Stored as files on the DataNode's local fs. Cannot associate any block with its true file.

50 Replication Strategy: First copy is written to the same node as the client. If the client is not part of the cluster, first block goes to a random node. Second copy is written to a node on a different rack. Third copy is written to a different node on the same rack as the second copy.
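
The three placement rules on this slide can be sketched in plain Java. Everything here is an assumption made for illustration: the `Node` type, the `chooseTargets` method, and the uniform random choice among eligible nodes. The actual HDFS placement policy also weighs factors such as free space and load that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Sketch of the slide's replica placement rules; not the HDFS implementation.
final class ReplicaPlacement {
    /** Hypothetical node descriptor: a host name and the rack it sits in. */
    static final class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    /** Picks a random node that satisfies the given rule. */
    static Node pick(List<Node> cluster, Random rng, Predicate<Node> eligible) {
        List<Node> candidates = new ArrayList<Node>();
        for (Node n : cluster) {
            if (eligible.test(n)) candidates.add(n);
        }
        if (candidates.isEmpty()) throw new IllegalStateException("no eligible node");
        return candidates.get(rng.nextInt(candidates.size()));
    }

    /** writer is the client's own node, or null if the client is outside the cluster. */
    static List<Node> chooseTargets(List<Node> cluster, Node writer, Random rng) {
        // Rule 1: the writer's node, or a random node for an external client.
        final Node first = (writer != null) ? writer : cluster.get(rng.nextInt(cluster.size()));
        // Rule 2: any node on a different rack than the first copy.
        final Node second = pick(cluster, rng, n -> !n.rack.equals(first.rack));
        // Rule 3: a different node on the same rack as the second copy.
        final Node third = pick(cluster, rng, n -> n.rack.equals(second.rack) && n != second);
        List<Node> targets = new ArrayList<Node>();
        targets.add(first);
        targets.add(second);
        targets.add(third);
        return targets;
    }
}
```

Placing two of the three copies on one remote rack keeps cross-rack traffic to a single transfer while still surviving the loss of an entire rack.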

51 Data Locality: Key in achieving performance. MapReduce tasks run as close to data as possible. Some terms: Local, On-Rack, Off-Rack.
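
The three locality terms amount to a simple comparison between where a task runs and where its data lives. The helper below is a hypothetical illustration, assuming each node is identified by a host name and a rack name.

```java
// Sketch: classifying a task's distance from its input block.
// Names and the string-based identifiers are illustrative only.
final class Locality {
    static String classify(String taskNode, String taskRack, String dataNode, String dataRack) {
        if (taskNode.equals(dataNode)) return "Local";     // same machine, no network hop
        if (taskRack.equals(dataRack)) return "On-Rack";   // stays behind one rack switch
        return "Off-Rack";                                 // crosses racks, the most expensive case
    }
}
```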

52 Data Corruption: Use of checksums to ensure block integrity. Checksum is calculated on read and compared against the one computed when the block was written. Fast to calculate and space-efficient. If the checksums differ, client reads the block from another DataNode. Corrupted block will be deleted and replaced by a non-corrupt replica. DataNode periodically runs a background thread to do this checksum process, for those files that aren't read very often.
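
The read-time check described on this slide can be illustrated with the JDK's built-in CRC-32. This is a sketch under assumptions: `java.util.zip.CRC32` stands in for whatever checksum the cluster is configured to use, a single checksum covers the whole block for simplicity, and the method names are hypothetical.

```java
import java.util.zip.CRC32;

// Sketch: verify a block against the checksum stored at write time,
// falling back to another replica when the check fails.
final class BlockChecksum {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    /** Index of the first replica whose checksum matches, or -1 if all are corrupt. */
    static int firstValidReplica(byte[][] replicas, long checksumAtWrite) {
        for (int i = 0; i < replicas.length; i++) {
            if (checksum(replicas[i]) == checksumAtWrite) {
                return i;
            }
        }
        return -1;
    }
}
```

A client following this pattern would try replicas in order of locality and report any mismatch so the corrupt copy can be replaced.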

53 Fault Tolerance: If no heartbeat is received from a DataNode within a configurable time window, it is considered lost (10 minute default). NameNode will: determine which blocks were on the lost node; locate other DataNodes with valid copies; instruct DataNodes with copies to replicate blocks to other DataNodes.
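
The recovery steps listed on this slide map onto two small lookups. The sketch below is a simplified model, not NameNode code: heartbeat times and block locations are held in plain maps, and the class, method, and constant names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch: detect lost DataNodes and find surviving copies of their blocks.
final class HeartbeatMonitor {
    static final long TIMEOUT_MS = 10L * 60 * 1000; // the slide's 10 minute default

    /** DataNodes whose last heartbeat is older than the timeout. */
    static Set<String> lostNodes(Map<String, Long> lastHeartbeatMs, long nowMs) {
        Set<String> lost = new TreeSet<String>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }

    /** For each block that lost a replica, the DataNodes that still hold a copy. */
    static Map<String, List<String>> replicationSources(
            Map<String, List<String>> blockLocations, Set<String> lost) {
        Map<String, List<String>> sources = new TreeMap<String, List<String>>();
        for (Map.Entry<String, List<String>> e : blockLocations.entrySet()) {
            List<String> survivors = new ArrayList<String>(e.getValue());
            survivors.removeAll(lost);
            if (survivors.size() < e.getValue().size()) {
                sources.put(e.getKey(), survivors); // under-replicated block
            }
        }
        return sources;
    }
}
```

The final step, choosing a destination for each new copy, would reuse the same placement rules applied when the block was first written.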

54 Interacting with HDFS: Primary interfaces are CLI and Java API. Web UI for read-only access. WebHDFS provides RESTful read/write. HttpFS also exists. snakebite is a Python library from Spotify. FUSE allows HDFS to be mounted on a standard file system, typically used so legacy apps can use HDFS data. hdfs dfs -help will show CLI usage.

55 HADOOP MAPREDUCE

56 MapReduce: A programming model for processing data. Contains two phases: Map (perform a map function on key/value pairs) and Reduce (perform a reduce function on key/value groups). Groups are created by sorting map output. Operations on key/value pairs open the door for very parallelizable algorithms.

57 Hadoop MapReduce: Automatic parallelization and distribution of tasks. Framework handles scheduling tasks and repeating failed tasks. Developer can code many pieces of the puzzle. Framework handles a Shuffle and Sort phase between map and reduce tasks. Developer need only focus on the task at hand, rather than how to manage where data comes from and where it goes.

58 MRv1 and MRv2: Both manage compute resources, jobs, and tasks. Job API the same. MRv1 is proven in production (JobTracker / TaskTrackers). MRv2 is a new application on YARN. YARN is a generic platform for developing distributed applications.

59 MRv1 - JobTracker: Monitors job and task progress. Issues task attempts to TaskTrackers. Re-tries failed task attempts. Four failed attempts of same task = one failed job. Default scheduler is FIFO order; CapacityScheduler or FairScheduler is preferred. Single point of failure for MapReduce.
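
The retry rule on this slide (four failed attempts of one task fail the whole job) can be sketched as a bounded loop. The names are hypothetical, and the attempt itself is modeled as a predicate over the attempt index so the example stays self-contained.

```java
import java.util.function.IntPredicate;

// Sketch: re-issue a task attempt until it succeeds or the limit is hit.
final class TaskRetry {
    static final int MAX_ATTEMPTS = 4; // per the slide

    /**
     * Runs attempts 0, 1, 2, 3 until one reports success.
     * Returns the number of attempts used, or -1 if the job must be failed.
     */
    static int runWithRetries(IntPredicate attempt) {
        for (int i = 0; i < MAX_ATTEMPTS; i++) {
            if (attempt.test(i)) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

For example, `TaskRetry.runWithRetries(i -> i == 2)` models a task that fails twice and then succeeds on its third attempt.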

60 MRv1 - TaskTrackers: Runs on same node as DataNodes. Sends heartbeats and task reports to JT. Configurable number of map and reduce slots. Runs map and reduce task attempts in a separate JVM.

61 How MapReduce Works: (1) Client submits job to JobTracker. (2) JobTracker submits tasks to TaskTrackers. (3) Job output is written to DataNodes w/replication. (4) JobTracker reports metrics. [Diagram: Client, JobTracker, and four nodes, each running a DataNode and a TaskTracker (A-D); input blocks A1-A4 and output blocks B1-B4 are each stored on three nodes]

62 How MapReduce Works - Failure: JobTracker assigns task to different node. [Diagram: Client, JobTracker, and four nodes, each running a DataNode and a TaskTracker (A-D); input blocks A1-A4 and output blocks B1-B4 are each stored on three nodes]

63 Map Input: (0, "hadoop is fun") (52, "I love hadoop") (104, "Pig is more fun")
Map Output:
    Map Task 0: ("hadoop", 1) ("is", 1) ("fun", 1)
    Map Task 1: ("I", 1) ("love", 1) ("hadoop", 1)
    Map Task 2: ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)
SHUFFLE AND SORT
Reducer Input Groups (Reduce Task 0, Reduce Task 1): ("fun", {1,1}) ("hadoop", {1,1}) ("love", {1}) ("I", {1}) ("is", {1,1}) ("more", {1}) ("Pig", {1})
Reducer Output: ("fun", 2) ("hadoop", 2) ("love", 1) ("I", 1) ("is", 2) ("more", 1) ("Pig", 1)

64 Example -- Word Count: Count the number of times each word is used in a body of text. Uses TextInputFormat and TextOutputFormat.

    map(byte_offset, line)
        foreach word in line
            emit(word, 1)

    reduce(word, counts)
        sum = 0
        foreach count in counts
            sum += count
        emit(word, sum)
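
To see the whole map, shuffle-and-sort, reduce pipeline without a cluster, the pseudocode can be simulated in memory. This is a plain-Java sketch that uses no Hadoop classes; the class and method names are hypothetical, and a sorted map stands in for the framework's shuffle and sort.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Sketch: word count as three in-memory phases, mirroring the pseudocode.
final class WordCountSim {
    /** Map phase: emit (word, 1) for every word in the line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    /** Shuffle and sort: group values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = groups.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                groups.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return groups;
    }

    /** Reduce phase: sum the counts for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }
}
```

Calling `WordCountSim.run` on the three example lines "hadoop is fun", "I love hadoop", and "Pig is more fun" yields a count of 2 for "hadoop", "is", and "fun", and 1 for each remaining word.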

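65 Mapper Code

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line: key is the byte offset, value is the line.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }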

66 Shuffle and Sort: (1) Each mapper outputs to a single partitioned file. (2) Reducers copy their parts. (3) Reducer merges partitions. [Diagram: Mappers 0-3 each produce partitions P0-P3; Reducer N copies partition PN from every mapper and merges the four pieces]
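
Which partition a key lands in is decided by a partitioner. The sketch below uses the hash-modulo formula that Hadoop's default HashPartitioner is based on, applied here to plain `String` keys as a simplification; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: assign map output keys to reducer partitions P0..P(n-1).
final class HashPartition {
    /** Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Splits one mapper's output keys into numReducers buckets. */
    static List<List<String>> partition(List<String> mapOutputKeys, int numReducers) {
        List<List<String>> parts = new ArrayList<List<String>>();
        for (int i = 0; i < numReducers; i++) {
            parts.add(new ArrayList<String>());
        }
        for (String key : mapOutputKeys) {
            parts.get(partitionFor(key, numReducers)).add(key);
        }
        return parts;
    }
}
```

Because the partition depends only on the key, every mapper sends a given word to the same reducer, which is what lets the reducer see all of that word's counts together.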

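67 Reducer Code

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Called once per word with all of the counts emitted for it.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total)
        }
    }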

68 Resources, Wrap-up, etc.: http://hadoop.apache.org. Supportive community: Hadoop-DC, Data Science MD, Baltimore Hadoop Users Group. Plenty of resources available to learn more: Books; lists; Blogs

69 That Ecosystem I Mentioned: HDFS, MapReduce, YARN, Sqoop, Flume, NiFi, Pig, Hive, Streaming, HBase, Greenplum, Accumulo, Avro, Geode, Flink, Twill, DataFu, Falcon, HCatalog, Samza, Tajo, HAWQ, Parquet, Mahout, Oozie, Storm, ZooKeeper, Spark, Cassandra, Kafka, Crunch, Azkaban, Knox, Ambari, Mesos, Drill, Giraph, Nutch, Sentry, Chukwa, MRUnit, Phoenix, Tez, Ranger

70 References: Hadoop: The Definitive Guide, Chapter 16.2; http:// ; http://datameer2.datameer.com/blog/wpcontent/uploads/2013/01/hadoop_ecosystem_full2.png; Google Images