Hadoop 2.6 Configuration and More Examples

1 Hadoop 2.6 Configuration and More Examples Big Data 2015

2 Apache Hadoop & YARN
Apache Hadoop (1.X)
- De facto open source Big Data platform
- Running for about 5 years in production at hundreds of companies such as Yahoo, eBay and Facebook
Hadoop 2.X
- Significant improvements in the HDFS distributed storage layer: High Availability, NFS, Snapshots
- YARN: next-generation compute framework for Hadoop, designed from the ground up based on experience gained from Hadoop 1
- YARN has been running in production at Yahoo for about a year

3 1st Generation Hadoop: Batch Focus
HADOOP 1.0: built for web-scale batch apps
- All other usage patterns MUST leverage the same infrastructure
- Forces the creation of silos to manage mixed workloads
[Diagram: separate single-app silos (BATCH, INTERACTIVE, ONLINE), each on its own HDFS]

4 Hadoop 1 Architecture
JobTracker: manages cluster resources and job scheduling
TaskTracker: per-node agent, manages tasks

5 Hadoop 1 Limitations
- Lacks support for alternate paradigms and services: everything is forced to look like MapReduce, and iterative applications in MapReduce are 10x slower
- Scalability: max cluster size ~5,000 nodes, max concurrent tasks ~40,000
- Availability: a JobTracker failure kills queued and running jobs
- Non-optimal resource utilization: hard partition of resources into map and reduce slots

6 Hadoop as Next-Gen Platform
HADOOP 1.0: single-use system (batch apps)
- MapReduce (cluster resource management & data processing)
- HDFS (redundant, reliable storage)
HADOOP 2.0: multi-purpose platform (batch, interactive, online, streaming, ...)
- MapReduce (data processing) and others
- YARN (cluster resource management)
- HDFS2 (redundant, highly-available & reliable storage)

7 Hadoop 2 - YARN Architecture
ResourceManager (RM): central agent, manages and allocates cluster resources
NodeManager (NM): per-node agent, manages and enforces node resource allocations
ApplicationMaster (AM): per-application, manages application lifecycle and task scheduling
[Diagram: Client, Resource Manager, Node Managers hosting the App Mstr and Containers; arrows labelled Job Submission, MapReduce Status, Node Status, Resource Request]

8 YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place. Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service.
Applications run natively in Hadoop:
- BATCH (MapReduce)
- INTERACTIVE (Tez)
- ONLINE (HBase)
- STREAMING (Storm, S4, ...)
- GRAPH (Giraph)
- IN-MEMORY (Spark)
- HPC MPI (OpenMPI)
- OTHER (Search, Weave, ...)
All of them run on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).

9 5 Key Benefits of YARN
1. New applications & services
2. Improved cluster utilization
3. Scale
4. Experimental agility
5. Shared services

10 5 Key Benefits of YARN
Yahoo! leverages YARN:
- 40,000+ nodes running YARN across over 365 PB of data
- ~400,000 jobs per day, for about 10 million hours of compute time
- Estimated 60% to 150% improvement in node usage per day using YARN
- Eliminated a colo (~10K nodes) due to increased utilization
For more details, check out the YARN SoCC 2013 paper: "Apache Hadoop YARN: Yet Another Resource Negotiator", Vinod Kumar Vavilapalli et al.

11 Environment variables
In your .bash_profile, export all the environment variables that Hadoop needs.

12 Hadoop 2 Configuration Allow remote login

13 Hadoop 2 Configuration Download the binary release of apache hadoop: hadoop tar.gz

14 Hadoop 2 Configuration At you can find a WIKI about Hadoop 2

15 Hadoop 2 Configuration
In the etc/hadoop directory of the Hadoop home directory, set the following files:
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml

16 Hadoop 2 Configuration In the hadoop-env.sh file you have to edit the following

17 Hadoop 2 Configuration core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml

18 Hadoop 2 Configuration & Running
Final configuration:
$:~ hadoop-*/bin/hdfs namenode -format
Running hadoop:
$:~ hadoop-*/sbin/start-dfs.sh

19 Hadoop 2: commands on hdfs
$:~ hadoop-*/bin/hdfs dfs <command> <parameters>
Create a directory in hdfs:
$:~ hadoop-*/bin/hdfs dfs -mkdir input
Copy a local file to hdfs:
$:~ hadoop-*/bin/hdfs dfs -put /tmp/example.txt input
Copy result files from hdfs to the local file system:
$:~ hadoop-*/bin/hdfs dfs -get output/result localoutput
Delete a directory in hdfs:
$:~ hadoop-*/bin/hdfs dfs -rm -r input
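The commands on this slide can also be scripted. The following is a minimal sketch, not part of the original deck, that builds the same hdfs dfs invocations from Python and only prints them unless dry_run is turned off. The helper names hdfs_dfs and run are hypothetical, and the default hdfs_bin="hdfs" assumes the hdfs binary is on the PATH (pass the full hadoop-*/bin/hdfs path otherwise).

```python
import subprocess


def hdfs_dfs(command, *params, hdfs_bin="hdfs"):
    """Build the argv list for `hdfs dfs -<command> <params...>`."""
    return [hdfs_bin, "dfs", "-" + command.lstrip("-"), *params]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(argv))
    if not dry_run:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(argv, check=True)


# The four operations shown on the slide, in the same order.
steps = [
    hdfs_dfs("mkdir", "input"),
    hdfs_dfs("put", "/tmp/example.txt", "input"),
    hdfs_dfs("get", "output/result", "localoutput"),
    hdfs_dfs("rm", "-r", "input"),
]
for argv in steps:
    run(argv)  # dry run: prints the commands without touching HDFS
```

Keeping the commands as argv lists avoids shell quoting problems when file names contain spaces.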

20 Hadoop 2 Configuration & Running
Final configuration:
$:~ hadoop-*/bin/hdfs namenode -format
Running hadoop:
$:~ hadoop-*/sbin/start-dfs.sh
Make the HDFS directories required to execute MapReduce jobs:
$:~ hadoop-*/bin/hdfs dfs -mkdir /user
$:~ hadoop-*/bin/hdfs dfs -mkdir /user/<username>
Copy the input files into the distributed filesystem:
$:~ hadoop-*/bin/hdfs dfs -put etc/hadoop input

21 Hadoop2: browse hdfs

22 Hadoop 2 Configuration & Running
Check all running daemons in Hadoop using the command jps:
$:~ jps
NameNode
DataNode
SecondaryNameNode
Jps
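As a sketch of how this check could be automated, the snippet below parses jps output and reports which of the daemons listed on the slide are missing. It assumes the usual jps line format of a process id followed by a name; the sample output and its PIDs are made up for illustration, and the function names are hypothetical.

```python
# Daemons the slide lists for a running HDFS (the Jps entry is the tool itself).
EXPECTED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode"}


def running_daemons(jps_output):
    """Return the set of process names found in `jps` output.

    Each useful line is expected to look like "<pid> <name>".
    """
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the output, sorted."""
    return sorted(set(expected) - running_daemons(jps_output))


# Hypothetical sample output: the SecondaryNameNode is not running here.
sample = """\
4021 NameNode
4150 DataNode
4390 Jps
"""
print(missing_daemons(sample))  # prints ['SecondaryNameNode']
```

An empty list means every expected daemon was found.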

23 Hadoop 2: execute MR application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: Word Count in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /tmp/parole.txt input
$:~ hadoop-*/bin/hadoop jar /tmp/word.jar WordCount input/parole.txt output/result
input/parole.txt: path on hdfs to reach the file
output/result: path on hdfs of the directory to create to store the result

24 Let's start with some examples!

25 Average Word Length by initial letter
INPUT
sopra la panca la capra campa,
sotto la panca la capra crepa
(0, sopra) (6, la) (9, panca) (15, la) (18, capra) (24, campa,) ...
MAP
(s, 5) (l, 2) (p, 5) (l, 2) (c, 5) (c, 6)
(s, 5) (l, 2) (p, 5) (l, 2) (c, 5) (c, 5)
SHUFFLE & SORT
(c, [5,6,5,5]) (l, [2,2,2,2]) (p, [5,5]) (s, [5,5])
REDUCE
(c, 5.25) (l, 2) (p, 5) (s, 5)
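The data flow on this slide can be traced with a small in-memory simulation. This is a sketch of the map, shuffle and reduce steps in plain Python, not the Java job shipped in the example jar; the function names are hypothetical. Punctuation is kept as part of the word, so "campa," counts as 6 characters, as on the slide.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (initial letter, word length) for each whitespace-separated token."""
    return [(word[0], len(word)) for word in line.split()]


def shuffle(pairs):
    """Group values by key and sort the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(letter, lengths):
    """Average the word lengths collected for one initial letter."""
    return letter, sum(lengths) / len(lengths)


lines = ["sopra la panca la capra campa,", "sotto la panca la capra crepa"]
pairs = [pair for line in lines for pair in map_phase(line)]
averages = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(averages)  # prints {'c': 5.25, 'l': 2.0, 'p': 5.0, 's': 5.0}
```

The value for "c" is (5 + 6 + 5 + 5) / 4 = 5.25, matching the REDUCE output above.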

26 Hadoop 2: execute AverageWordLength application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: AverageWordLength in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/words.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar avgword/averagewordlength input/words.txt output/result_avg_word
input/words.txt: path on hdfs to reach the file
output/result_avg_word: path on hdfs of the directory to create to store the result

27 Hadoop 2: clean AverageWordLength application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: AverageWordLength in MapReduce
$:~ hadoop-*/bin/hdfs dfs -rm -r output

28 Inverted Index
Each input line is a line number, then a tab (\t), then the text:
1 if you prick us do we not bleed,
2 if you tickle us do we not laugh,
3 if you poison us do we not die and,
4 if you wrong us shall we not revenge

29 Inverted Index
1 if you prick us do we not bleed,
2 if you tickle us do we not laugh,
3 if you poison us do we not die and,
4 if you wrong us shall we not revenge
INPUT
(1, if you prick ...) (2, if you tickle ...) (3, if you poison ...) (4, if you wrong ...)
MAP
(if, 1) (you, 1) (prick, 1) ... (if, 2) (you, 2) (tickle, 2) ...
REDUCE
(if, [1,2,...]) (you, [1,2,...]) (prick, [1]) (tickle, [2]) ...
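The same flow can be sketched in a few lines of Python. This is an in-memory illustration, not the InvertedIndex class from the example jar; the function names are hypothetical. Stripping commas and lowercasing the words is an assumption of this sketch, since the slide does not say how punctuation is handled.

```python
from collections import defaultdict


def map_phase(record):
    """Split "<line id>\\t<text>" and emit (word, line id) for each word.

    Stripping commas and lowercasing are assumptions of this sketch.
    """
    line_id, text = record.split("\t", 1)
    return [(word.strip(",").lower(), int(line_id)) for word in text.split()]


def build_index(records):
    """Shuffle and reduce in one step: collect the distinct line ids per word."""
    index = defaultdict(list)
    for record in records:
        for word, line_id in map_phase(record):
            if line_id not in index[word]:
                index[word].append(line_id)
    return dict(index)


records = [
    "1\tif you prick us do we not bleed,",
    "2\tif you tickle us do we not laugh,",
    "3\tif you poison us do we not die and,",
    "4\tif you wrong us shall we not revenge",
]
index = build_index(records)
print(index["if"], index["prick"], index["shall"])  # prints [1, 2, 3, 4] [1] [4]
```

Because the records are processed in line order, each posting list comes out sorted without an explicit sort step; a real job would not have that guarantee.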

30 Hadoop 2: execute InvertedIndex application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: InvertedIndex in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/words2.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar invertedindex/invertedindex input/words2.txt output/result_ii
input/words2.txt: path on hdfs to reach the file
output/result_ii: path on hdfs of the directory to create to store the result

31 Hadoop 2: clean InvertedIndex application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: InvertedIndex in MapReduce
$:~ hadoop-*/bin/hdfs dfs -rm -r output

32 Top K records (citations)
The objective is to find the top-10 most frequently cited patents, in descending order.
The format of the input data is CITING_PATENT,CITED_PATENT.
Input: A,B C,B F,B A,A C,A R,F R,C B,B
Result: (B, 4) (A, 2) (F, 1) (C, 1)

33 Top K records (citations)
Input: A,B C,B F,B A,A C,A R,F R,C B,B
MAP1 emits (CITEDPATENT, 1): (B, 1) (B, 1) (B, 1) (A, 1) (A, 1) (F, 1) ...
REDUCE1 emits (CITEDPATENT, N): (B, 4) (A, 2) (F, 1) ...
MAP2 (COPY) passes (CITEDPATENT, N) through: (B, 4) (A, 2) (F, 1) ...
REDUCE2 (FILTER) keeps the top records: (B, 4) (A, 2) (F, 1) ...
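The two chained jobs can be sketched as two plain Python functions: the first counts citations per cited patent, the second orders the counts and keeps the first k. This is an in-memory illustration with hypothetical function names, not the TopKRecords code from the example jar. Breaking ties by patent id is an assumption of this sketch; the slide does not specify an order for patents with equal counts.

```python
from collections import Counter


def count_citations(lines):
    """Job 1: map each "CITING,CITED" line to (CITED, 1) and sum per patent."""
    counts = Counter()
    for line in lines:
        _citing, cited = line.strip().split(",")
        counts[cited] += 1
    return counts


def top_k(counts, k):
    """Job 2: order by count, descending (ties by patent id), and keep k records."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


citations = ["A,B", "C,B", "F,B", "A,A", "C,A", "R,F", "R,C", "B,B"]
print(top_k(count_citations(citations), k=2))  # prints [('B', 4), ('A', 2)]
```

With k=10, as in the stated objective, this small input returns all four cited patents, because fewer than ten distinct patents are cited.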

34 Hadoop 2: execute TopKRecords application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: TopKRecords in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/citations.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar topk/topkrecords input/citations.txt output/result_topk
input/citations.txt: path on hdfs to reach the file
output/result_topk: path on hdfs of the directory to create to store the result

35 Hadoop 2: clean TopKRecords application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: TopKRecords in MapReduce
$:~ hadoop-*/bin/hdfs dfs -rm -r output

36 Disjoint selection
ratings: UserID::MovieID::Rating::Timestamp
1::1193::5:: ::661::3:: ::3068::4:: ::1537::4:: ::647::3:: ::3534::3::
users: UserID::Gender::Age::Occupation::Zip-Code
1::F::1::10:: ::M::56::16:: ::M::25::15::55117

37 Disjoint selection
ratings: UserID::MovieID::Rating::Timestamp
1::1193::5:: ::661::3:: ::3068::4:: ::1537::4:: ::647::3:: ::3534::3::
users: UserID::Gender::Age::Occupation::Zip-Code
1::F::1::10:: ::M::56::16:: ::M::25::15::55117
Mixed into a single file:
1::1193::5:: ::661::3:: ::F::1::10:: ::M::56::16:: ::3068::4:: ::1537::4:: ::647::3:: ::M::25::15:: ::3534::3::

38 Disjoint selection
users_ratings
1::1193::5:: ::661::3:: ::F::1::10:: ::M::56::16:: ::3068::4:: ::1537::4:: ::647::3:: ::M::25::15:: ::3534::3::
The objective is to find users who are not "power taggers", i.e., those who have rated fewer than MIN_RATINGS (25) movies.
The problem is complicated by the fact that users and ratings are mixed in a single file.

39 Disjoint selection
1::1193::5:: ::661::3:: ::F::1::10:: ::M::56::16:: ::3068::4:: ::1537::4:: ::647::3:: ::M::25::15:: ::3534::3::
For this example consider MIN_RATINGS = 3.
MAP emits (userid, recordtype): (1, R) (1, R) (1, U) (2, U) (2, R) (2, R) ...
REDUCE emits (userid, numberofratings): (1, 2) (3, 1)
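A small Python simulation of this example follows. It is a sketch under stated assumptions, not the DisjointSelection code from the example jar: records are told apart by field count (4 fields for a rating, 5 for a user), and the sample records are made up in the two formats, with placeholder timestamps and zip codes, because the values in the transcript are incomplete. The function names are hypothetical.

```python
from collections import defaultdict

MIN_RATINGS = 3


def map_phase(line):
    """Emit (user id, "R") for a rating record and (user id, "U") for a user record.

    Telling the two apart by field count (4 vs 5) is an assumption of this sketch.
    """
    fields = line.strip().split("::")
    record_type = "U" if len(fields) == 5 else "R"
    return int(fields[0]), record_type


def reduce_phase(user_id, record_types, min_ratings=MIN_RATINGS):
    """Return (user id, rating count) for known users below the threshold, else None."""
    n_ratings = record_types.count("R")
    if "U" in record_types and n_ratings < min_ratings:
        return user_id, n_ratings
    return None


def disjoint_selection(lines, min_ratings=MIN_RATINGS):
    """Group record types by user id, then keep the users the reducer selects."""
    groups = defaultdict(list)
    for line in lines:
        user_id, record_type = map_phase(line)
        groups[user_id].append(record_type)
    results = (reduce_phase(u, t, min_ratings) for u, t in sorted(groups.items()))
    return [r for r in results if r is not None]


# Made-up records: user 1 has 2 ratings, user 2 has 3, user 3 has 1.
mixed = [
    "1::1193::5::1000", "1::661::3::1001", "1::F::1::10::00001",
    "2::M::56::16::00002", "2::3068::4::1002", "2::1537::4::1003",
    "2::647::3::1004", "3::M::25::15::00003", "3::3534::3::1005",
]
print(disjoint_selection(mixed))  # prints [(1, 2), (3, 1)]
```

Emitting the "U" marker lets the reducer also report users who have a user record but no ratings at all, with a count of 0.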

40 Hadoop 2: execute DisjointSelection application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: DisjointSelection in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/users_ratings.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar rating/disjointselection input/users_ratings.txt output/result_rating
input/users_ratings.txt: path on hdfs to reach the file
output/result_rating: path on hdfs of the directory to create to store the result

41 Hadoop 2: clean DisjointSelection application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: DisjointSelection in MapReduce
$:~ hadoop-*/bin/hdfs dfs -rm -r output

42 Stopping the Hadoop 2 DFS
Stop hadoop:
$:~ hadoop-*/sbin/stop-dfs.sh
That's all! Happy Map Reducing!

43 Hadoop 2 on AMAZON

44 Hadoop 2 on AMAZON
AWS Code Credentials
User Name: BigData
Access Key Id: <access-key-id>
Secret Access Key: <secret-access-key>
AWS Web Console Credentials
User Name: BigData
Password: <password>
Direct Signin Link: http://signin.aws.amazon.com/console

47 Hadoop 2 on AMAZON Regions

48 Hadoop 2 on AMAZON S3 and buckets

49 Hadoop 2 on AMAZON S3 and buckets

50 Hadoop 2 on AMAZON EC2 virtual servers

51 Hadoop 2 on AMAZON file.pem

52 Hadoop 2 on AMAZON Elastic Map Reduce

53 Hadoop 2 on AMAZON EMR cluster

57 Hadoop 2 Running on AWS
Set the permissions of the .pem file:
$:~ chmod 400 BigData_Key.pem
Upload files (data and your personal jars) to the hadoop account of the EMR cluster:
$:~ scp -i <file.pem> <file> hadoop@<dns_emr_cluster>:~
$:~ scp -i BigData_Key.pem ./example_data/words.txt hadoop@ec compute-1.amazonaws.com:~

58 Hadoop 2 Running on AWS
Connect to the hadoop account of the EMR cluster:
$:~ ssh hadoop@<dns_emr_cluster> -i <file.pem>
$:~ ssh hadoop@ec compute-1.amazonaws.com -i BigData_Key.pem
Then you can execute MR jobs as on your local machine.
Finally, TERMINATE your cluster.
