Hadoop 2.6 Configuration and More Examples (Big Data 2015)
Apache Hadoop & YARN
Apache Hadoop (1.X)
- De facto Big Data open source platform
- Running for about 5 years in production at hundreds of companies like Yahoo!, eBay and Facebook
Hadoop 2.X
- Significant improvements in the HDFS distributed storage layer: High Availability, NFS, Snapshots
- YARN: next-generation compute framework for Hadoop, designed from the ground up based on experience gained from Hadoop 1
- YARN running in production at Yahoo! for about a year
1st Generation Hadoop: Batch Focus
HADOOP 1.0: built for web-scale batch apps
- All other usage patterns (interactive, online) MUST leverage the same infrastructure
- Forces creation of silos to manage mixed workloads
[Figure: separate single-app clusters for batch, interactive and online workloads, each with its own HDFS]
Hadoop 1 Architecture
- JobTracker: manages cluster resources & job scheduling
- TaskTracker: per-node agent, manages tasks
Hadoop 1 Limitations
- Lacks support for alternate paradigms and services: forces everything to look like MapReduce; iterative applications in MapReduce are 10x slower
- Scalability: max cluster size ~5,000 nodes, max concurrent tasks ~40,000
- Availability: a failure kills queued & running jobs
- Hard partition of resources into map and reduce slots: non-optimal resource utilization
Hadoop as Next-Gen Platform
- HADOOP 1.0: single-use system for batch apps. MapReduce (cluster resource management & data processing) on top of HDFS (redundant, reliable storage)
- HADOOP 2.0: multi-purpose platform for batch, interactive, online, streaming. MapReduce (data processing) and other frameworks on top of YARN (cluster resource management), on top of HDFS2 (redundant, highly-available & reliable storage)
Hadoop 2 - YARN Architecture
- ResourceManager (RM): central agent, manages and allocates cluster resources
- NodeManager (NM): per-node agent, manages and enforces node resource allocations
- ApplicationMaster (AM): per-application, manages application lifecycle and task scheduling
[Figure: a client submits a job to the ResourceManager; NodeManagers report node status; the App Mstr sends resource requests and runs tasks in containers, reporting MapReduce status back to the client]
YARN: Taking Hadoop Beyond Batch
- Store ALL DATA in one place
- Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service
Applications run natively in Hadoop:
- BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, ...), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search, Weave, ...)
- all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage)
5 Key Benefits of YARN
1. New applications & services
2. Improved cluster utilization
3. Scale
4. Experimental agility
5. Shared services
5 Key Benefits of YARN
Yahoo! leverages YARN:
- 40,000+ nodes running YARN across over 365 PB of data
- ~400,000 jobs per day, for about 10 million hours of compute time
- Estimated 60%-150% improvement in node usage per day using YARN
- Eliminated a colo (~10K nodes) due to increased utilization
For more details check out the YARN SoCC 2013 paper: "Apache Hadoop YARN: Yet Another Resource Negotiator", Vinod Kumar Vavilapalli et al.
Environment variables
In your ~/.bash_profile, export all the environment variables Hadoop needs
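For example, a minimal sketch of such a profile (the JDK and Hadoop paths below are assumptions; adapt them to your machine):

  # ~/.bash_profile -- hypothetical paths, adapt to your installation
  export JAVA_HOME=$(/usr/libexec/java_home)            # OS X; on Linux use your JDK path instead
  export HADOOP_HOME=$HOME/hadoop-2.6.0                 # directory where hadoop-2.6.0.tar.gz was unpacked
  export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin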
Hadoop 2 Configuration
Allow remote login, so that Hadoop can ssh into localhost (e.g. on OS X: System Preferences > Sharing > Remote Login)
Hadoop 2 Configuration
Download the binary release of Apache Hadoop: hadoop-2.6.0.tar.gz
Hadoop 2 Configuration
At https://hadoop.apache.org/docs/stable/ you can find the official documentation for Hadoop 2
Hadoop 2 Configuration
In the etc/hadoop directory of the Hadoop home directory, set up the following files:
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
Hadoop 2 Configuration
In the hadoop-env.sh file you have to edit the environment settings, most importantly JAVA_HOME
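Typically the only mandatory change, per the Apache Hadoop setup guide, is to point JAVA_HOME at your JDK (the path below is an assumption):

  # etc/hadoop/hadoop-env.sh
  export JAVA_HOME=$(/usr/libexec/java_home)   # or an absolute JDK path on Linux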
Hadoop 2 Configuration
core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
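A typical single-node (pseudo-distributed) configuration, following the Apache Hadoop 2.6 documentation (adapt host and port to your environment), looks like this:

  <!-- core-site.xml: default filesystem URI -->
  <configuration>
    <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
  </configuration>

  <!-- hdfs-site.xml: replication 1, since there is a single DataNode -->
  <configuration>
    <property><name>dfs.replication</name><value>1</value></property>
  </configuration>

  <!-- mapred-site.xml: run MapReduce jobs on YARN -->
  <configuration>
    <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  </configuration>

  <!-- yarn-site.xml: enable the shuffle service used by MapReduce -->
  <configuration>
    <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  </configuration>

Note that if mapreduce.framework.name is set to yarn, the YARN daemons must also be running (sbin/start-yarn.sh) before MapReduce jobs are submitted.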
Hadoop 2 Configuration & Running
Final configuration step, format the filesystem:
$:~ hadoop-*/bin/hdfs namenode -format
Run Hadoop, starting the HDFS daemons:
$:~ hadoop-*/sbin/start-dfs.sh
Hadoop 2: commands on HDFS
$:~ hadoop-*/bin/hdfs dfs <command> <parameters>
Create a directory in HDFS:
$:~ hadoop-*/bin/hdfs dfs -mkdir input
Copy a local file to HDFS:
$:~ hadoop-*/bin/hdfs dfs -put /tmp/example.txt input
Copy result files from HDFS to the local file system:
$:~ hadoop-*/bin/hdfs dfs -get output/result localoutput
Delete a directory in HDFS:
$:~ hadoop-*/bin/hdfs dfs -rm -r input
Hadoop 2 Configuration & Running
Final configuration:
$:~ hadoop-*/bin/hdfs namenode -format
Running Hadoop:
$:~ hadoop-*/sbin/start-dfs.sh
Make the HDFS directories required to execute MapReduce jobs:
$:~ hadoop-*/bin/hdfs dfs -mkdir /user
$:~ hadoop-*/bin/hdfs dfs -mkdir /user/<username>
Copy the input files into the distributed filesystem:
$:~ hadoop-*/bin/hdfs dfs -put etc/hadoop input
Hadoop 2: browse HDFS
The NameNode web interface is available at http://localhost:50070
Hadoop 2 Configuration & Running
Check all running Hadoop daemons using the jps command:
$:~ jps
20758 NameNode
20829 DataNode
20920 SecondaryNameNode
21001 Jps
Hadoop 2: execute an MR application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: Word Count in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /tmp/parole.txt input
$:~ hadoop-*/bin/hadoop jar /tmp/word.jar WordCount input/parole.txt output/result
(input/parole.txt: path on HDFS of the input file; output/result: path on HDFS of the directory that will be created to store the result)
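For reference, a minimal sketch of a WordCount main class like the one invoked above. It follows the standard Hadoop tutorial implementation; the actual contents of /tmp/word.jar are not shown in the slides:

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    // Mapper: emit (word, 1) for every token in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }
    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        result.set(sum);
        context.write(key, result);
      }
    }
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);  // safe: summing is associative
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }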
Let's start with some examples!
Average Word Length by initial letter
INPUT: "sopra la panca la capra campa, sotto la panca la capra crepa"
(0, sopra) (6, la) (9, panca) (15, la) (18, capra) (24, campa,) ...
MAP: (s, 5) (l, 2) (p, 5) (l, 2) (c, 5) (c, 6) (s, 5) (l, 2) (p, 5) (l, 2) (c, 5) (c, 5)
SHUFFLE & SORT: (c, [5,6,5,5]) (l, [2,2,2,2]) (p, [5,5]) (s, [5,5])
REDUCE: (c, 5.25) (l, 2) (p, 5) (s, 5)
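A minimal sketch of the mapper and reducer implementing this job (an assumption: the avgword code shipped in the example jar may differ in detail):

  import java.io.IOException;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class AverageWordLength {
    // Mapper: for each word emit (initial letter, word length)
    public static class LetterLengthMapper extends Mapper<Object, Text, Text, IntWritable> {
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
          if (word.isEmpty()) continue;
          context.write(new Text(word.substring(0, 1)), new IntWritable(word.length()));
        }
      }
    }
    // Reducer: average the lengths collected for each initial letter
    public static class LetterAverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (IntWritable len : values) { sum += len.get(); count++; }
        context.write(key, new DoubleWritable((double) sum / count));
      }
    }
    // Driver: configure the Job exactly as in the WordCount main above, with
    // DoubleWritable as the output value class. Do not reuse the reducer as a
    // combiner here: an average of partial averages would be wrong.
  }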
Hadoop 2: execute the AverageWordLength application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: AverageWordLength in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/words.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar avgword/averagewordlength input/words.txt output/result_avg_word
(input/words.txt: path on HDFS of the input file; output/result_avg_word: path on HDFS of the directory that will be created to store the result)
Hadoop 2: clean the AverageWordLength application
To clean up after the AverageWordLength example, remove the output directory on HDFS:
$:~ hadoop-*/bin/hdfs dfs -rm -r output
Inverted Index
Input format: a line number, a tab (\t), then the text:
1	if you prick us do we not bleed,
2	if you tickle us do we not laugh,
3	if you poison us do we not die and,
4	if you wrong us shall we not revenge
Inverted Index
INPUT: (1, "if you prick ...") (2, "if you tickle ...") (3, "if you poison ...") (4, "if you wrong ...")
MAP: (if, 1) (you, 1) (prick, 1) (if, 2) (you, 2) (tickle, 2) ...
REDUCE: (if, [1,2,...]) (you, [1,2,...]) (prick, [1]) (tickle, [2]) ...
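A minimal sketch of the mapper and reducer for this job (an assumption: the invertedindex class in the example jar may be organized differently):

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class InvertedIndex {
    // Mapper: input lines are "<line-number>\t<text>"; emit (word, line-number)
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;          // skip malformed lines
        String lineNumber = parts[0];
        for (String word : parts[1].split("\\s+"))
          if (!word.isEmpty()) context.write(new Text(word), new Text(lineNumber));
      }
    }
    // Reducer: collect, for each word, the list of line numbers it occurs in
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        StringBuilder lines = new StringBuilder();
        for (Text v : values) {
          if (lines.length() > 0) lines.append(",");
          lines.append(v.toString());
        }
        context.write(key, new Text("[" + lines + "]"));
      }
    }
    // Driver: configure the Job as in the WordCount main above, with Text as
    // both output key and output value class.
  }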
Hadoop 2: execute the InvertedIndex application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: InvertedIndex in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/words2.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar invertedindex/invertedindex input/words2.txt output/result_ii
(input/words2.txt: path on HDFS of the input file; output/result_ii: path on HDFS of the directory that will be created to store the result)
Hadoop 2: clean the InvertedIndex application
To clean up after the InvertedIndex example, remove the output directory on HDFS:
$:~ hadoop-*/bin/hdfs dfs -rm -r output
Top K records (citations)
The objective is to find the top-10 most frequently cited patents, in descending order.
The format of the input data is CITING_PATENT,CITED_PATENT:
A,B  C,B  F,B  A,A  C,A  R,F  R,C  B,B
Counting the citations of each patent gives: (B, 4) (A, 2) (F, 1) (C, 1)
Top K records (citations)
INPUT: A,B C,B F,B A,A C,A R,F R,C B,B
MAP1: emit (CITED_PATENT, 1) for each record: (B, 1) (B, 1) (B, 1) (A, 1) (A, 1) (F, 1) ...
REDUCE1: sum into (CITED_PATENT, N): (B, 4) (A, 2) (F, 1) ...
MAP2 (copy): pass the (CITED_PATENT, N) pairs through unchanged: (B, 4) (A, 2) (F, 1) ...
REDUCE2 (filter): keep only the top K records: (B, 4) (A, 2) (F, 1)
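One common way to implement the second (filter) job, assumed here since the slides do not show the code, is to run REDUCE2 as a single reduce task that keeps a bounded sorted map of the best K entries:

  import java.io.IOException;
  import java.util.Map;
  import java.util.TreeMap;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // REDUCE2: receives (CITED_PATENT, N) pairs from the copy mapper; must run
  // as a single reduce task so that the filter sees every record.
  public class TopKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int K = 10;
    // Sorted by citation count, ascending. Caveat: keying on the count alone
    // drops ties; a complete implementation should key on (count, patent).
    private final TreeMap<Integer, String> top = new TreeMap<>();

    public void reduce(Text patent, Iterable<IntWritable> counts, Context context) {
      for (IntWritable n : counts) {
        top.put(n.get(), patent.toString());
        if (top.size() > K) top.remove(top.firstKey()); // evict the current minimum
      }
    }

    // Emit the surviving K records in descending order of citation count
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      for (Map.Entry<Integer, String> e : top.descendingMap().entrySet())
        context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
    }
  }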
Hadoop 2: execute the TopKRecords application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: TopKRecords in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/citations.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar topk/topkrecords input/citations.txt output/result_topk
(input/citations.txt: path on HDFS of the input file; output/result_topk: path on HDFS of the directory that will be created to store the result)
Hadoop 2: clean the TopKRecords application
To clean up after the TopKRecords example, remove the output directory on HDFS:
$:~ hadoop-*/bin/hdfs dfs -rm -r output
Disjoint selection
ratings (UserID::MovieID::Rating::Timestamp):
1::1193::5::978300760
1::661::3::978302109
2::3068::4::978299000
2::1537::4::978299620
2::647::3::978299351
3::3534::3::978297068
users (UserID::Gender::Age::Occupation::Zip-Code):
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
Disjoint selection
The ratings and users records are merged into a single file:
1::1193::5::978300760
1::661::3::978302109
1::F::1::10::48067
2::M::56::16::70072
2::3068::4::978299000
2::1537::4::978299620
2::647::3::978299351
3::M::25::15::55117
3::3534::3::978297068
Disjoint selection
users_ratings (the merged file above)
The objective is to find users who are not "power raters", i.e., those who have rated fewer than MIN_RATINGS (25) movies.
The problem is complicated by the fact that user records and ratings are mixed in a single file.
Disjoint selection
For this example, consider MIN_RATINGS = 3
MAP: emit (userid, recordtype): (1, R) (1, R) (1, U) (2, U) (2, R) (2, R) (2, R) (3, U) (3, R)
REDUCE: emit (userid, numberofratings) only for users with fewer than MIN_RATINGS ratings: (1, 2) (3, 1)
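A minimal sketch of this job (an assumption: the disjointselection code in the example jar may differ). The record type can be inferred from the number of ::-separated fields, since ratings have 4 and user records have 5:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class DisjointSelection {
    private static final int MIN_RATINGS = 25;   // 3 in the walkthrough above

    // Mapper: tag each record with its type, keyed by user id
    public static class TypeMapper extends Mapper<Object, Text, Text, Text> {
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("::");
        // ratings have 4 fields (UserID::MovieID::Rating::Timestamp),
        // user records have 5 (UserID::Gender::Age::Occupation::Zip-Code)
        String type = (fields.length == 4) ? "R" : "U";
        context.write(new Text(fields[0]), new Text(type));
      }
    }
    // Reducer: count ratings per user; emit only users below the threshold
    public static class ThresholdReducer extends Reducer<Text, Text, Text, IntWritable> {
      public void reduce(Text userId, Iterable<Text> types, Context context) throws IOException, InterruptedException {
        int ratings = 0;
        for (Text t : types) if (t.toString().equals("R")) ratings++;
        if (ratings < MIN_RATINGS) context.write(userId, new IntWritable(ratings));
      }
    }
    // Driver: configure the Job as in the WordCount main above.
  }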
Hadoop 2: execute the DisjointSelection application
$:~ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: DisjointSelection in MapReduce
$:~ hadoop-*/bin/hdfs dfs -mkdir output
$:~ hadoop-*/bin/hdfs dfs -put /example_data/users_ratings.txt input
$:~ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar rating/disjointselection input/users_ratings.txt output/result_rating
(input/users_ratings.txt: path on HDFS of the input file; output/result_rating: path on HDFS of the directory that will be created to store the result)
Hadoop 2: clean the DisjointSelection application
To clean up after the DisjointSelection example, remove the output directory on HDFS:
$:~ hadoop-*/bin/hdfs dfs -rm -r output
Stopping the Hadoop 2 DFS
Stop Hadoop:
$:~ hadoop-*/sbin/stop-dfs.sh
That's all! Happy MapReducing!
Hadoop 2 on AMAZON
Hadoop 2 on AMAZON
AWS Code Credentials
  User Name: BigData
  Access Key Id: AKIAJRJMBITGSMBKLTLQ
  Secret Access Key: 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
AWS Web Console Credentials
  User Name: BigData
  Password: XkHNB0v)IVhf
  Direct Signin Link: http://signin.aws.amazon.com/console
Hadoop 2 on AMAZON: Regions
Hadoop 2 on AMAZON: S3 and buckets
Hadoop 2 on AMAZON: EC2 virtual servers
Hadoop 2 on AMAZON: file.pem
Hadoop 2 on AMAZON: Elastic MapReduce (EMR)
Hadoop 2 on AMAZON: EMR cluster
Hadoop 2 Running on AWS
Set restrictive permissions on the .pem key file:
$:~ chmod 400 BigData_Key.pem
Upload files (data and your personal jars) to the Hadoop EMR cluster:
$:~ scp -i <file.pem> <file> hadoop@<dns_emr_cluster>:~
$:~ scp -i BigData_Key.pem ./example_data/words.txt hadoop@ec2-54-164-153-7.compute-1.amazonaws.com:~
Hadoop 2 Running on AWS
Connect to the EMR cluster as the hadoop user:
$:~ ssh hadoop@<dns_emr_cluster> -i <file.pem>
$:~ ssh hadoop@ec2-54-164-153-7.compute-1.amazonaws.com -i BigData_Key.pem
Then you can execute MR jobs as on your local machine.
Finally, TERMINATE your cluster!
Hadoop 2.6 Configuration and More Examples (Big Data 2015)