Hadoop Job Oriented Training Agenda

Kapil CK hdpguru@gmail.com

Module 1: Understanding Hadoop

This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform.

1.1 Module Objectives
1.2 Additional Content
1.3 The Three Vs of Big Data
1.4 Six Key Hadoop Data Types
1.5 About Use Cases
1.5.1 Sentiment Use Case
1.5.2 Geolocation Use Case
1.6 About Hadoop
1.6.1 Relational Databases vs Hadoop
1.6.2 About Hadoop 2
1.6.3 New in Hadoop 2.x
1.6.4 The Hadoop Ecosystem
1.7 The Hortonworks Data Platform (HDP)
1.8 The Path to ROI
1.9 Review Questions
1.10 Lab: Start an HDP 2.1 Cluster
1.10.1 Objective: Start an HDP cluster in your VM

Module 2: The Hadoop Distributed File System (HDFS)

This module covers the details of how files are stored and maintained in the Hadoop Distributed File System (HDFS).

2.1 Module Objectives
2.2 Additional Content
2.3 About HDFS
2.3.1 Hadoop vs RDBMS
2.3.2 An Example of Disk Read Performance
2.3.3 HDFS Components
2.4 Understanding Block Storage
2.5 Demonstration: Understanding Block Storage
2.5.1 Objective: To understand how data is partitioned into blocks and stored in HDFS
2.6 The NameNode
2.7 The DataNodes
2.7.1 DataNode Failure
2.8 HDFS Commands
2.8.1 Examples of HDFS Commands
2.8.2 HDFS File Permissions
2.9 Review Questions
2.10 Lab: Using HDFS Commands
2.10.1 Objective: Become familiar with adding, removing, and viewing files in HDFS

Module 3: Inputting Data into HDFS

This module covers the various ways to input data into the Hadoop Distributed File System, including the Sqoop and Flume frameworks.

3.1 Module Objectives
3.2 Additional Content
3.2.1 The Hadoop Client
3.2.2 WebHDFS
3.3 Overview of Flume
3.3.1 Flume Example
3.4 Overview of Sqoop
3.4.1 The Sqoop Import Tool
3.4.2 Importing a Table
3.4.3 Importing Specific Columns
3.4.4 Importing from a Query
3.4.5 The Sqoop Export Tool
3.4.6 Exporting to a Table
3.5 Review Questions
3.6 Lab: Importing RDBMS Data into HDFS
3.6.1 Objective: Import data from a database into HDFS
3.7 Lab: Exporting HDFS Data to RDBMS
3.7.1 Objective: Export data from HDFS into a MySQL table using Sqoop

Module 4: The MapReduce Framework

This module covers the details of the MapReduce programming paradigm.

4.1 Module Objectives
4.2 Additional Content
4.3 Overview of MapReduce
4.3.1 Understanding MapReduce
4.3.2 The Key/Value Pairs of MapReduce
4.3.3 WordCount in MapReduce
4.4 Demonstration: Understanding MapReduce
4.4.1 Objective: To understand how MapReduce works
4.5 The Map Phase
4.6 The Reduce Phase
4.7 Review Questions
4.8 Lab: Running MapReduce Job
4.8.1 Objective: Run a Java MapReduce job

Module 5: Introduction to Pig

This module covers the Pig framework and describes how to load and transform data using the Pig programming language.

5.1 Module Objectives
5.2 Additional Content
5.3 About Pig
5.3.1 Pig Latin
5.3.2 The Grunt Shell
5.4 Demonstration: Understanding Pig
5.4.1 Objective: To understand Pig scripts and relations
5.5 Pig Latin Relation Names
5.5.1 Pig Latin Field Names
5.5.2 Pig Data Types
5.5.3 Pig Complex Types
5.6 Defining a Schema
5.7 Lab: Getting Started with Pig
5.7.1 Objective: Use Pig to navigate through HDFS and explore a dataset
5.8 The GROUP Operator
5.8.1 GROUP ALL
5.8.2 Relations without a Schema
5.9 The FOREACH GENERATE Operator
5.9.1 Specifying Ranges in FOREACH
5.9.2 Field Names in FOREACH
5.9.3 FOREACH with Groups
5.9.4 The FILTER Operator
5.10 The LIMIT Operator
5.11 Review Questions
5.12 Lab: Exploring Data with Pig
5.12.1 Objective: Use Pig to navigate through HDFS and explore a dataset

Module 6: Advanced Pig Programming

This module covers some of the more advanced features of Pig, including sorting, parallelization, joins, and user-defined functions.

6.1 Module Objectives
6.2 Additional Content
6.3 The ORDER BY Operator
6.4 The CASE Operator
6.5 Parameter Substitution
6.6 The DISTINCT Operator
6.7 Using PARALLEL
6.8 The FLATTEN Operator
6.9 Lab: Splitting Dataset
6.9.1 Objective: Research the White House visitor data and look for members of Congress
6.10 Nested FOREACH
6.11 About Joins
6.11.1 Performing an Inner Join
6.11.2 Performing an Outer Join
6.11.3 Replicated Joins
6.12 The COGROUP Operator
6.13 Pig User-Defined Functions
6.13.1 UDF Example
6.13.2 Invoking a UDF
6.13.3 Tips for Optimizing Pig Scripts
6.14 Lab: Joining Datasets
6.14.1 Objective: Join two datasets in Pig
6.15 Lab: Preparing Data for Hive
6.15.1 Objective: Transform and export a dataset for use with Hive
6.16 Overview of the DataFu Library
6.16.1 Computing Quantiles
6.17 Demonstration: Computing PageRank
6.17.1 Objective: To understand how to use the PageRank UDF in DataFu
6.18 Review Questions
6.19 Lab: Analyzing Clickstream Data
6.19.1 Objective: Become familiar with using the DataFu library to sessionize clickstream data
6.20 Lab: Analyzing Stock Market Data using Quantiles
6.20.1 Objective: Use DataFu to compute quantiles

Module 7: Hive Programming

This module covers the details of the Hive framework and HiveQL programming language.

7.1 Module Objectives
7.2 Additional Content
7.3 About Hive
7.3.1 Comparing Hive to SQL
7.3.2 Hive Architecture
7.3.3 Submitting Hive Queries
7.4 Defining a Hive-Managed Table
7.4.1 Defining an External Table
7.4.2 Defining a Table LOCATION
7.4.3 Loading Data into Hive Table
7.5 Performing Queries
7.6 Lab: Understanding Hive Tables
7.6.1 Objective: Understand how Hive table data is stored in HDFS
7.7 Hive Partitions
7.7.1 Hive Buckets
7.7.2 Skewed Tables
7.8 Demonstration: Understanding Partitions and Skew
7.8.1 Objective: To understand how Hive partitioning and skewed tables work
7.9 Sorting Data
7.9.1 Using Distribute By
7.9.2 Storing Results to File
7.9.3 Specifying MapReduce Properties
7.10 Lab: Analyzing Big Data with Hive
7.10.1 Objective: Analyze the White House visitor data
7.11 Lab: Understanding MapReduce in Hive
7.11.1 Objective: To understand how Hive queries get executed as MapReduce jobs
7.12 Hive Join Strategies
7.12.1 Shuffle Joins
7.12.2 Map (Broadcast) Joins
7.12.3 Sort-Merge-Bucket (SMB) Joins
7.12.4 Invoking a Hive UDF
7.12.5 Computing ngrams in Hive
7.13 Demonstration: Computing ngrams
7.13.1 Objective: To understand how to compute ngrams using Hive
7.14 Review Questions
7.15 Lab: Joining Datasets in Hive
7.15.1 Objective: Perform a join of two datasets in Hive
7.16 Lab: Computing ngrams of Emails in Avro Format
7.16.1 Objective: Use Hive to compute ngrams

Module 8: Using HCatalog

This module covers the details of how HCatalog is used to provide a central repository for defining and sharing schemas for data stored in Hadoop.

8.1 Module Objectives
8.2 Additional Content
8.3 About HCatalog
8.4 HCatalog in the Ecosystem
8.5 Defining a New Schema
8.5.1 Using HCatLoader with Pig
8.5.2 Using HCatStorer with Pig
8.5.3 The Pig SQL Command
8.6 Review Questions
8.7 Lab: Using HCatalog with Pig
8.7.1 Objective: Use HCatalog to provide the schema for a Pig relation

Module 9: Advanced Hive Programming

This module covers some of the more advanced features of Hive programming, including views, the windowing functions, and the various optimization capabilities of Hive.

9.1 Module Objectives
9.2 Additional Content
9.3 Performing a Multi-Table/File Insert
9.4 Understanding Views
9.4.1 Defining Views
9.4.2 Using Views
9.4.3 The TRANSFORM Clause
9.4.4 The OVER Clause
9.5 Using Windows
9.5.1 Hive Analytics Functions
9.6 Lab: Advanced Hive Programming
9.6.1 Objective: To understand how some of the more advanced features of Hive work
9.7 Hive File Formats
9.7.1 Hive SerDes
9.7.2 Hive ORC Files
9.8 Computing Table Statistics
9.8.1 Hive Cost Based Optimization
9.8.2 Vectorization
9.9 Using HiveServer2
9.10 Understanding Hive on Tez
9.10.1 Using Tez for Hive Queries
9.11 Demonstration: Hive Optimizations
9.11.1 Objective: To become familiar with some ways to optimize Hive
9.12 Hive Optimization Tips
9.12.1 Hive Query Tunings
9.13 Review Questions
9.14 Lab: Streaming Data with Hive and Python
9.14.1 Objective: Use a custom reducer script to optimize a Hive query

Module 10: Hadoop 2 and YARN

This module covers the newer features of Hadoop 2, like YARN, HDFS Federation, and NameNode high availability.

10.1 Module Objectives
10.2 Additional Content
10.3 About HDFS Federation
10.3.1 Multiple Federated NameNodes
10.3.2 Multiple Namespaces
10.3.3 Overview of HDFS High Availability
10.3.4 Quorum Journal Manager
10.3.5 Configuring Automatic Failover
10.4 About YARN
10.4.1 Open-source YARN Use Cases
10.4.2 The Components of YARN
10.4.3 Lifecycle of YARN Application
10.4.4 Cluster View Example
10.5 Review Questions
10.6 Lab: Running YARN Application
10.6.1 Objective: To run a YARN application

Module 11: Defining Workflow with Oozie

This module covers how to implement a Hadoop workflow using the Apache Oozie framework.

11.1 Module Objectives
11.2 Additional Content
11.3 Overview of Oozie
11.3.1 Defining an Oozie Workflow
11.3.2 Pig Actions
11.3.3 Hive Actions
11.3.4 MapReduce Actions
11.3.5 Submitting Workflow Job
11.3.6 Fork and Join Nodes
11.4 Defining an Oozie Coordinator Job
11.4.1 Schedule Job Based on Time
11.4.2 Schedule Job Based on Data Availability
11.5 Review Questions
11.6 Lab: Defining an Oozie Workflow
11.6.1 Objective: Define and run an Oozie workflow

Module 12: Hadoop Streaming

This module covers an overview of the streaming capabilities of Hadoop.

12.1 Module Objectives
12.2 Hadoop Streaming
12.3 Running a Hadoop Streaming Job

* Note: Contents subject to change