Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Similar documents
Qsoft Inc

COURSE CONTENT Big Data and Hadoop Training

Hadoop Job Oriented Training Agenda

Complete Java Classes Hadoop Syllabus Contact No:

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop: The Definitive Guide

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Implement Hadoop jobs to extract business value from large and varied data sets

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Big Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce

Hadoop Ecosystem B Y R A H I M A.

BIG DATA HADOOP TRAINING

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

ITG Software Engineering

Big Data Course Highlights

Peers Techno log ies Pv t. L td. HADOOP

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Cloudera Certified Developer for Apache Hadoop

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Introduction to Big Data Training

Moving From Hadoop to Spark

This is a brief tutorial that explains the basics of Spark SQL programming.

Important Notice. (c) Cloudera, Inc. All rights reserved.

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Workshop on Hadoop with Big Data

Big Data With Hadoop

The Hadoop Eco System Shanghai Data Science Meetup

Constructing a Data Lake: Hadoop and Oracle Database United!

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

HADOOP. Revised 10/19/2015

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

HDFS. Hadoop Distributed File System

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Internals of Hadoop Application Framework and Distributed File System

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop: The Definitive Guide

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Comparing SQL and NOSQL databases

Architectures for massive data management

Unified Big Data Analytics Pipeline. 连 城

Scaling Out With Apache Spark. DTL Meeting Slides based on

Hadoop & Spark Using Amazon EMR

Certified Big Data and Apache Hadoop Developer VS-1221

Chase Wu New Jersey Ins0tute of Technology

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Oracle Big Data Fundamentals Ed 1 NEW

ITG Software Engineering

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Deploying Hadoop with Manager

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE


Hive Interview Questions

Big Data Analytics. Lucas Rego Drumond

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

How To Scale Out Of A Nosql Database

BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Unified Big Data Processing with Apache Spark. Matei

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

MapReduce with Apache Hadoop Analysing Big Data

Open source Google-style large scale data analysis with Hadoop

The Internet of Things and Big Data: Intro

How To Create A Data Visualization With Apache Spark And Zeppelin

Map Reduce & Hadoop Recommended Text:

Unlocking Hadoop for Your Rela4onal DB. Kathleen Technical Account Manager, Cloudera Sqoop PMC Member BigData.

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Introduction to Spark

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Hadoop IST 734 SS CHUNG

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

BIG DATA - HADOOP PROFESSIONAL amron

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Integrating Big Data into the Computing Curricula

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Apache Sentry. Prasad Mujumdar

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Apache HBase. Crazy dances on the elephant back

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Big Data and Scripting Systems build on top of Hadoop

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

SQL on NoSQL (and all of the data) With Apache Drill

Big Data Too Big To Ignore

Transcription:

Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce The Reduce Phase of MapReduce MapReduce Explained MapReduce Word Count Job MapReduce Shared-Nothing Architecture Similarity with SQL Aggregation Operations Example of Map & Reduce Operations using JavaScript Problems Suitable for Solving with MapReduce Typical MapReduce Jobs Fault-tolerance of MapReduce Distributed Computing Economics MapReduce Systems Hadoop Overview Apache Hadoop Apache Hadoop Logo Typical Hadoop Applications Hadoop Clusters Hadoop Design Principles Hadoop's Core Components Hadoop Simple Definition High-Level Hadoop Architecture Hadoop-based Systems for Data Analysis Other Hadoop Ecosystem Projects Hadoop Caveats Hadoop Distributions Cloudera Distribution of Hadoop (CDH) Cloudera Distributions Hortonworks Data Platform (HDP) MapR

Hadoop Distributed File System Overview Hadoop Distributed File System (HDFS) HDFS High Availability HDFS "Fine Print" Hadoop Security HDFS Rack-awareness Data Blocks Data Block Replication Example HDFS NameNode Directory Diagram Accessing HDFS Examples of HDFS Commands Client Interactions with HDFS for the Read Operation Read Operation Sequence Diagram Client Interactions with HDFS for the Write Operation Communication inside HDFS MapReduce with Hadoop Hadoop's MapReduce MapReduce v1 ("Classic MapReduce") JobTracker and TaskTracker (the Classic MapReduce) YARN (MapReduce v2) YARN vs MRv1 YARN As Data Operating System MapReduce Programming Options Java MapReduce API The Structure of a Java MapReduce Program The Mapper Class The Reducer Class The Driver Class Compiling Classes Running the MapReduce Job The Structure of a Single MapReduce Program Combiner Pass (Optional) Hadoop's Streaming MapReduce Python Word Count Mapper Program Example Python Word Count Reducer Program Example Setting up Java Classpath for Streaming Support Streaming Use Cases The Streaming API vs Java MapReduce API Amazon Elastic MapReduce

Apache Pig Scripting Platform What is Pig? Pig Latin Apache Pig Logo Pig Execution Modes Local Execution Mode MapReduce Execution Mode Running Pig Running Pig in Batch Mode What is Grunt? Pig Latin Statements Pig Programs Pig Latin Script Example SQL Equivalent Differences between Pig and SQL Statement Processing in Pig Comments in Pig Supported Simple Data Types Supported Complex Data Types Arrays Defining Relation's Schema The bytearray Generic Type Using Field Delimiters Referencing Fields in Relations Apache Pig HDFS Interface The HDFS Interface FSShell Commands (Short List) Grunt's Old File System Commands Apache Pig Relational and Eval Operators Pig Relational Operators Example of Using the JOIN Operator Example of Using the Order By Operator Caveats of Using Relational Operators Pig Eval Functions Caveats of Using Eval Functions (Operators) Example of Using Single-column Eval Operations Example of Using Eval Operators For Global Operations

Apache Pig Miscellaneous Topics Utility Commands Handling Compression User-Defined Functions Filter UDF Skeleton Code Apache Pig Performance Apache Pig Performance Performance Enhancer - Use the Right Schema Type Performance Enhancer - Apply Data Filters Use the PARALLEL Clause Examples of the PARALLEL Clause Performance Enhancer - Limiting the Data Sets Displaying Execution Plan Compress the Results of Intermediate Jobs Example of Running Pig with LZO Compression Codec Apache Oozie What is Oozie? Apache Oozie Logo Oozie Terminology Directed Acyclic Graph Oozie Job Types Oozie Architecture Oozie Configuration Oozie Workflows The Flow Of Oozie Workflows More on Oozie Workflows Oozie Workflow Control Nodes More Oozie Workflow Control Nodes A Workflow Example A More Complex Workflow Example Oozie Coordinator The Pig Action Template The Pig Action Template

Hive What is Hive? Apache Hive Logo Hive's Value Proposition Who uses Hive? Hive's Main Sub-Systems Hive Features The "Classic" Hive Architecture The New Hive Architecture HiveQL Where are the Hive Tables Located? "Legacy" Hive Command-line Interface (CLI) The Beeline Command Shell Hive Command-line Interface Hive Command-line Interface (CLI) The Hive Interactive Shell Running Host OS Commands from the Hive Shell Interfacing with HDFS from the Hive Shell The Hive in Unattended Mode The Hive CLI Integration with the OS Shell Executing HiveQL Scripts Comments in Hive Scripts Variables and Properties in Hive CLI Setting Properties in CLI Example of Setting Properties in CLI Hive Namespaces Using the SET Command Setting Properties in the Shell Setting Properties for the New Shell Session Hive Data Definition Language Hive Data Definition Language Creating Databases in Hive Using Databases Creating Tables in Hive Supported Data Type Categories Common Numeric Types String and Date / Time Types Miscellaneous Types

Example of the CREATE TABLE Statement Working with Complex Types Table Partitioning Table Partitioning Table Partitioning on Multiple Columns Viewing Table Partitions Row Format Data Serializers / Deserializers File Format Storage More on File Formats The EXTERNAL DDL Parameter Example of Using EXTERNAL Creating an Empty Table Dropping a Table Table / Partition(s) Truncation Alter Table/Partition/Column Views Create View Statement Why Use Views? Restricting Amount of Viewable Data Examples of Restricting Amount of Viewable Data Creating and Dropping Indexes Describing Data Hive Data Manipulation Language Hive Data Manipulation Language (DML) Using the LOAD DATA statement Example of Loading Data into a Hive Table Loading Data with the INSERT Statement Appending and Replacing Data with the INSERT Statement Examples of Using the INSERT Statement Multi Table Inserts Multi Table Inserts Syntax Multi Table Inserts Example Hive Select Statement HiveQL The SELECT Statement Syntax The WHERE Clause Examples of the WHERE Statement Partition-based Queries

Example of an Efficient SELECT Statement The DISTINCT Clause Supported Numeric Operators Built-in Mathematical Functions Built-in Aggregate Functions Built-in Statistical Functions Other Useful Built-in Functions The GROUP BY Clause The HAVING Clause The LIMIT Clause The ORDER BY Clause The JOIN Clause The CASE Clause Example of CASE Clause Apache Sqoop What is Sqoop? Apache Sqoop Logo Sqoop Import / Export Sqoop Help Examples of Using Sqoop Commands Data Import Example Fine-tuning Data Import Controlling the Number of Import Processes Data Splitting Helping Sqoop Out Example of Executing Sqoop Load in Parallel A Word of Caution: Avoid Complex Free-Form Queries Using Direct Export from Databases Example of Using Direct Export from MySQL More on Direct Mode Import Changing Data Types Example of Default Types Overriding File Formats The Apache Avro Serialization System Binary vs Text More on the SequenceFile Binary Format Generating the Java Table Record Source Code Data Export from HDFS Export Tool Common Arguments Data Export Control Arguments Data Export Example Using a Staging Table INSERT and UPDATE Statements

INSERT Operations UPDATE Operations Example of the Update Operation Failed Exports Sqoop2 Sqoop2 Architecture Cloudera Impala What is Cloudera Impala? Impala's Logo Impala Architecture Benefits of Using Impala Key Features How Impala Handles SQL Queries Impala Programming Interfaces Impala SQL Language Reference Differences Between Impala and HiveQL Impala Shell Impala Shell Main Options Impala Shell Commands Impala Common Shell Commands Cloudera Web Admin UI Impala Browse-based Query Editor Apache HBase What is HBase? HBase Design HBase Features HBase High Availability The Write-Ahead Log (WAL) and MemStore HBase vs RDBS HBase vs Apache Cassandra Not Good Use Cases for HBase Interfacing with HBase HBase Thrift And REST Gateway HBase Table Design Column Families A Cell's Value Versioning Timestamps Accessing Cells HBase Table Design Digest

Table Horizontal Partitioning with Regions HBase Compaction Loading Data in HBase Column Families Notes Rowkey Notes HBase Shell HBase Shell Command Groups Creating and Populating a Table in HBase Shell Getting a Cell's Value Counting Rows in an HBase Table Apache HBase Java API HBase Java Client HBase Scanners Using ResultScanner Efficiently The Scan Class The KeyValue Class The Result Class Getting Versions of Cell Values Example The Cell Interface HBase Java Client Example Scanning the Table Rows Dropping a Table The Bytes Utility Class Introduction to Functional Programming What is Functional Programming (FP)? Terminology: First-Class and Higher-Order Functions Terminology: Lambda vs Closure A Short List of Languages that Support FP FP with Java FP With JavaScript Imperative Programming in JavaScript The JavaScript map (FP) Example The JavaScript reduce (FP) Example Using reduce to Flatten an Array of Arrays (FP) Example The JavaScript filter (FP) Example

Introduction to Apache Spark What is Spark A Short History of Spark Where to Get Spark? The Spark Platform Spark Logo Common Spark Use Cases Languages Supported by Spark Running Spark on a Cluster The Driver Process The Executor and Worker Processes The Spark Application Architecture Interfaces with Data Storage Systems Limitations of Hadoop's MapReduce Spark vs MapReduce Spark as an Alternative to Apache Tez The Resilient Distributed Dataset (RDD) Spark Streaming (Micro-batching) Spark SQL Example of Spark SQL Spark Machine Learning Library GraphX The Spark Shell The Spark Shell The Spark Shell UI The Spark Context (sc) and SQL Context (sqlcontext) The Shell Spark Context Loading Files Saving Files Basic Spark ETL Operations Spark RDDs The Resilient Distributed Dataset (RDD) Ways to Create an RDD Custom RDDs Supported Data Types RDD Operations

RDDs are Immutable Spark Actions RDD Transformations Other RDD Operations Chaining RDD Operations RDD Lineage The Big Picture Parallelized Collections More on parallelize() Method The Pair RDD Where do I use Pair RDDs? Example of Creating a Pair RDD with Map Example of Creating a Pair RDD with keyby Miscellaneous Pair RDD Operations RDD Caching RDD Persistence The Tachyon Storage Parallel Data Processing with Spark Running Spark on a Cluster Spark Stand-alone Option The High-Level Execution Flow in Stand-alone Spark Cluster Data Partitioning Data Partitioning Diagram Single Local File System RDD Partitioning Multiple File RDD Partitioning Special Cases for Small-sized Files Parallel Data Processing of Partitions Spark Application, Jobs, and Tasks Stages and Shuffles The "Big Picture" Introduction to Spark SQL What is Spark SQL? Uniform Data Access with Spark SQL Hive Integration Hive Interface Integration with BI Tools Spark SQL is No Longer Experimental Developer API! What is a DataFrame? The SQLContext Object

The SQLContext API Changes Between Spark SQL 1.3 to 1.4 Example of Spark SQL (Scala Example) Example of Working with a JSON File Example of Working with a Parquet File Using JDBC Sources JDBC Connection Example Performance & Scalability of Spark SQL Lab Exercises Lab 1. Learning the Lab Environment Lab 2. The Hadoop Distributed File System Lab 3. Hadoop Streaming MapReduce Lab 4. Programming Java MapReduce Jobs on Hadoop Lab 5. Getting Started with Apache Pig Lab 6. Apache Pig HDFS Command-Line Interface Lab 7. Working with Data Sets in Apache Pig Lab 8. Using Relational Operators in Apache Pig Lab 9. Apache Oozie Lab 10. The Hive Shell Lab 11. Hive Data Definition Language Lab 12. Using Select Statement in HiveQL Lab 13. Table Partitioning in Hive Lab 14. Data Import and Export with Sqoop Lab 15. Using Impala Lab 16. Using HBase Shell Lab 17. Elements of Functional Programming with JavaScript Lab 18. The Spark Shell Lab 19. Spark ETL and HDFS Interface Lab 20. Common Map / Reduce Programs in Spark Lab 21. Spark SQL