Hadoop Job Oriented Training Agenda




Kapil CK (hdpguru@gmail.com)

Module 1: Understanding Hadoop

This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform.

1.1 Module Objectives
1.2 Additional Content
1.3 The Three Vs of Big Data
1.4 Six Key Hadoop Data Types
1.5 About Use Cases
1.5.1 Sentiment Use Case
1.5.2 Geolocation Use Case
1.6 About Hadoop
1.6.1 Relational Databases vs Hadoop
1.6.2 About Hadoop 2
1.6.3 New in Hadoop 2.x
1.6.4 The Hadoop Ecosystem
1.7 The Hortonworks Data Platform (HDP)
1.8 The Path to ROI
1.9 Review Questions
1.10 Lab: Start an HDP 2.1 Cluster
1.10.1 Objective: Start an HDP cluster in your VM

Module 2: The Hadoop Distributed File System (HDFS)

This module covers the details of how files are stored and maintained in the Hadoop Distributed File System (HDFS).

2.1 Module Objectives
2.2 Additional Content
2.3 About HDFS
2.3.1 Hadoop vs RDBMS
2.3.2 An Example of Disk Read Performance
2.3.3 HDFS Components
2.4 Understanding Block Storage
2.5 Demonstration: Understanding Block Storage
2.5.1 Objective: To understand how data is partitioned into blocks and stored in HDFS
2.6 The NameNode
2.7 The DataNodes
2.7.1 DataNode Failure
2.8 HDFS Commands
2.8.1 Examples of HDFS Commands
2.8.2 HDFS File Permissions
2.9 Review Questions
2.10 Lab: Using HDFS Commands
2.10.1 Objective: Become familiar with adding, removing, and viewing files in HDFS
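As a rough illustration of the block storage topic in this module, the sketch below works out how a file would be split into blocks and how much raw storage its replicas would occupy. It is not course material: the 128 MB block size and replication factor of 3 are assumed here as the usual Hadoop 2.x defaults, and the function name `block_layout` is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: Hadoop 2.x default block size (dfs.blocksize)
REPLICATION_FACTOR = 3   # assumption: HDFS default replication (dfs.replication)


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (block count, last block size in MB, block replicas, raw storage in MB)."""
    if file_size_mb <= 0:
        return 0, 0, 0, 0
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is full except possibly the last one.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    return (num_blocks, last_block_mb,
            num_blocks * replication, file_size_mb * replication)


if __name__ == "__main__":
    blocks, last_mb, replicas, raw_mb = block_layout(500)
    print(f"{blocks} blocks (last one {last_mb} MB), "
          f"{replicas} block replicas, {raw_mb} MB raw storage")
```

Under these assumptions, a 500 MB file occupies four blocks (three full blocks and one of 116 MB), and the replicas take 1500 MB of raw disk across the DataNodes.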

Module 3: Inputting Data into HDFS

This module covers the various ways to input data into the Hadoop Distributed File System, including the Sqoop and Flume frameworks.

3.1 Module Objectives
3.2 Additional Content
3.2.1 The Hadoop Client
3.2.2 WebHDFS
3.3 Overview of Flume
3.3.1 Flume Example
3.4 Overview of Sqoop
3.4.1 The Sqoop Import Tool
3.4.2 Importing a Table
3.4.3 Importing Specific Columns
3.4.4 Importing from a Query
3.4.5 The Sqoop Export Tool
3.4.6 Exporting to a Table
3.5 Review Questions
3.6 Lab: Importing RDBMS Data into HDFS
3.6.1 Objective: Import data from a database into HDFS
3.7 Lab: Exporting HDFS Data to RDBMS
3.7.1 Objective: Export data from HDFS into a MySQL table using Sqoop
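To give a feel for the WebHDFS topic listed in this module, here is a minimal sketch that builds WebHDFS REST URLs, which take the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. It is an illustration only: the helper name `webhdfs_url`, the host name, and the paths are hypothetical, and port 50070 is assumed as the usual NameNode HTTP port on Hadoop 2.x clusters.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    # port=50070 is an assumption (typical NameNode HTTP port on Hadoop 2.x).
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Host name and paths are placeholders, not part of the course environment.
    print(webhdfs_url("namenode.example.com", "/user/train/data", "LISTSTATUS"))
    print(webhdfs_url("namenode.example.com", "/user/train/data/file.csv",
                      "OPEN", **{"user.name": "train"}))
```

The sketch only constructs the URLs; sending the request (for example with `curl` or `urllib.request`) requires a running cluster with WebHDFS enabled.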

Module 4: The MapReduce Framework

This module covers the details of the MapReduce programming paradigm.

4.1 Module Objectives
4.2 Additional Content
4.3 Overview of MapReduce
4.3.1 Understanding MapReduce
4.3.2 The Key/Value Pairs of MapReduce
4.3.3 WordCount in MapReduce
4.4 Demonstration: Understanding MapReduce
4.4.1 Objective: To understand how MapReduce works
4.5 The Map Phase
4.6 The Reduce Phase
4.7 Review Questions
4.8 Lab: Running a MapReduce Job
4.8.1 Objective: Run a Java MapReduce job

Module 5: Introduction to Pig

This module covers the Pig framework and describes how to load and transform data using the Pig programming language.

5.1 Module Objectives
5.2 Additional Content
5.3 About Pig
5.3.1 Pig Latin

5.3.2 The Grunt Shell
5.4 Demonstration: Understanding Pig
5.4.1 Objective: To understand Pig scripts and relations
5.5 Pig Latin Relation Names
5.5.1 Pig Latin Field Names
5.5.2 Pig Data Types
5.5.3 Pig Complex Types
5.6 Defining a Schema
5.7 Lab: Getting Started with Pig
5.7.1 Objective: Use Pig to navigate through HDFS and explore a dataset
5.8 The GROUP Operator
5.8.1 GROUP ALL
5.8.2 Relations without a Schema
5.9 The FOREACH GENERATE Operator
5.9.1 Specifying Ranges in FOREACH
5.9.2 Field Names in FOREACH
5.9.3 FOREACH with Groups
5.9.4 The FILTER Operator
5.10 The LIMIT Operator
5.11 Review Questions
5.12 Lab: Exploring Data with Pig
5.12.1 Objective: Use Pig to navigate through HDFS and explore a dataset

Module 6: Advanced Pig Programming

This module covers some of the more advanced features of Pig, including sorting, parallelization, joins, and user-defined functions.

6.1 Module Objectives
6.2 Additional Content
6.3 The ORDER BY Operator
6.4 The CASE Operator
6.5 Parameter Substitution
6.6 The DISTINCT Operator
6.7 Using PARALLEL
6.8 The FLATTEN Operator
6.9 Lab: Splitting a Dataset
6.9.1 Objective: Research the White House visitor data and look for members of Congress
6.10 Nested FOREACH
6.11 About Joins
6.11.1 Performing an Inner Join
6.11.2 Performing an Outer Join
6.11.3 Replicated Joins
6.12 The COGROUP Operator

6.13 Pig User-Defined Functions
6.13.1 UDF Example
6.13.2 Invoking a UDF
6.13.3 Tips for Optimizing Pig Scripts
6.14 Lab: Joining Datasets
6.14.1 Objective: Join two datasets in Pig
6.15 Lab: Preparing Data for Hive
6.15.1 Objective: Transform and export a dataset for use with Hive
6.16 Overview of the DataFu Library
6.16.1 Computing Quantiles
6.17 Demonstration: Computing PageRank
6.17.1 Objective: To understand how to use the PageRank UDF in DataFu
6.18 Review Questions
6.19 Lab: Analyzing Clickstream Data
6.19.1 Objective: Become familiar with using the DataFu library to sessionize clickstream data
6.20 Lab: Analyzing Stock Market Data using Quantiles
6.20.1 Objective: Use DataFu to compute quantiles

Module 7: Hive Programming

This module covers the details of the Hive framework and the HiveQL programming language.

7.1 Module Objectives
7.2 Additional Content
7.3 About Hive
7.3.1 Comparing Hive to SQL
7.3.2 Hive Architecture
7.3.3 Submitting Hive Queries
7.4 Defining a Hive-Managed Table
7.4.1 Defining an External Table
7.4.2 Defining a Table LOCATION
7.4.3 Loading Data into a Hive Table
7.5 Performing Queries
7.6 Lab: Understanding Hive Tables
7.6.1 Objective: Understand how Hive table data is stored in HDFS
7.7 Hive Partitions
7.7.1 Hive Buckets
7.7.2 Skewed Tables
7.8 Demonstration: Understanding Partitions and Skew
7.8.1 Objective: To understand how Hive partitioning and skewed tables work
7.9 Sorting Data

7.9.1 Using Distribute By
7.9.2 Storing Results to File
7.9.3 Specifying MapReduce Properties
7.10 Lab: Analyzing Big Data with Hive
7.10.1 Objective: Analyze the White House visitor data
7.11 Lab: Understanding MapReduce in Hive
7.11.1 Objective: To understand how Hive queries get executed as MapReduce jobs
7.12 Hive Join Strategies
7.12.1 Shuffle Joins
7.12.2 Map (Broadcast) Joins
7.12.3 Sort-Merge-Bucket (SMB) Joins
7.12.4 Invoking a Hive UDF
7.12.5 Computing ngrams in Hive
7.13 Demonstration: Computing ngrams
7.13.1 Objective: To understand how to compute ngrams using Hive
7.14 Review Questions
7.15 Lab: Joining Datasets in Hive
7.15.1 Objective: Perform a join of two datasets in Hive
7.16 Lab: Computing ngrams of Emails in Avro Format
7.16.1 Objective: Use Hive to compute ngrams

Module 8: Using HCatalog

This module covers the details of how HCatalog is used to provide a central repository for defining and sharing schemas for data stored in Hadoop.

8.1 Module Objectives
8.2 Additional Content
8.3 About HCatalog
8.4 HCatalog in the Ecosystem
8.5 Defining a New Schema
8.5.1 Using HCatLoader with Pig
8.5.2 Using HCatStorer with Pig
8.5.3 The Pig SQL Command
8.6 Review Questions
8.7 Lab: Using HCatalog with Pig
8.7.1 Objective: Use HCatalog to provide the schema for a Pig relation

Module 9: Advanced Hive Programming

This module covers some of the more advanced features of Hive programming, including views, the windowing functions, and the various optimization capabilities of Hive.

9.1 Module Objectives
9.2 Additional Content
9.3 Performing a Multi-Table/File Insert
9.4 Understanding Views
9.4.1 Defining Views
9.4.2 Using Views
9.4.3 The TRANSFORM Clause
9.4.4 The OVER Clause
9.5 Using Windows
9.5.1 Hive Analytics Functions
9.6 Lab: Advanced Hive Programming
9.6.1 Objective: To understand how some of the more advanced features of Hive work
9.7 Hive File Formats
9.7.1 Hive SerDes
9.7.2 Hive ORC Files
9.8 Computing Table Statistics
9.8.1 Hive Cost Based Optimization
9.8.2 Vectorization

9.9 Using HiveServer2
9.10 Understanding Hive on Tez
9.10.1 Using Tez for Hive Queries
9.11 Demonstration: Hive Optimizations
9.11.1 Objective: To become familiar with some ways to optimize Hive
9.12 Hive Optimization Tips
9.12.1 Hive Query Tunings
9.13 Review Questions
9.14 Lab: Streaming Data with Hive and Python
9.14.1 Objective: Use a custom reducer script to optimize a Hive query

Module 10: Hadoop 2 and YARN

This module covers the newer features of Hadoop 2, like YARN, HDFS Federation, and NameNode high availability.

10.1 Module Objectives
10.2 Additional Content
10.3 About HDFS Federation
10.3.1 Multiple Federated NameNodes
10.3.2 Multiple Namespaces
10.3.3 Overview of HDFS High Availability
10.3.4 Quorum Journal Manager
10.3.5 Configuring Automatic Failover

10.4 About YARN
10.4.1 Open-source YARN Use Cases
10.4.2 The Components of YARN
10.4.3 Lifecycle of a YARN Application
10.4.4 Cluster View Example
10.5 Review Questions
10.6 Lab: Running a YARN Application
10.6.1 Objective: To run a YARN application

Module 11: Defining Workflow with Oozie

This module covers how to implement a Hadoop workflow using the Apache Oozie framework.

11.1 Module Objectives
11.2 Additional Content
11.3 Overview of Oozie
11.3.1 Defining an Oozie Workflow
11.3.2 Pig Actions
11.3.3 Hive Actions
11.3.4 MapReduce Actions
11.3.5 Submitting a Workflow Job
11.3.6 Fork and Join Nodes
11.4 Defining an Oozie Coordinator Job
11.4.1 Schedule a Job Based on Time

11.4.2 Schedule a Job Based on Data Availability
11.5 Review Questions
11.6 Lab: Defining an Oozie Workflow
11.6.1 Objective: Define and run an Oozie workflow

Module 12: Hadoop Streaming

This module covers an overview of the streaming capabilities of Hadoop.

12.1 Module Objectives
12.2 Hadoop Streaming
12.3 Running a Hadoop Streaming Job

* Note: Contents subject to change
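As an illustration of the kind of job Module 12 refers to, the sketch below is a word count written in the Hadoop Streaming style, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. It is not taken from the course labs; the script name `wordcount.py` and the `map`/`reduce` command-line switch are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must arrive sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: "wordcount.py map" or "wordcount.py reduce".
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for out in stage(sys.stdin):
        print(out)
```

The pipeline can be simulated locally with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `sort` stands in for the shuffle-and-sort step that Hadoop performs between the two phases. On a cluster, the same script would typically be supplied through the `-mapper` and `-reducer` options of the Hadoop Streaming jar.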