HADOOP. Revised 10/19/2015

Table of Contents

  Hortonworks HDP Developer: Java
  Hortonworks HDP Developer: Apache Pig and Hive
  Hortonworks HDP Developer: Windows
  Hortonworks HDP Operations: Hadoop Administration 1
  Hortonworks HDP Data Science
  Hortonworks HDP Developer: Custom YARN Applications
  Hortonworks HDP Operations: Migrating to the Hortonworks Data Platform
  Hortonworks HDP Analyst: Apache HBase Essentials
  Hortonworks HDP Operations: Apache HBase Management
  Hortonworks HDP Developer: Storm and Trident Fundamentals Workshop

Hortonworks HDP Developer: Java
4 Days TE7411_

This advanced four-day course gives Java programmers a deep dive into Hadoop 2.0 application development. Students will learn how to design and develop efficient and effective MapReduce applications for Hadoop 2.0 using the Hortonworks Data Platform. Students who attend this course will learn how to harness the power of Hadoop 2.0 to manipulate, analyze and perform computations on their Big Data.

This class is for experienced Java software engineers who need to design and develop Java MapReduce applications for Hadoop 2.0.

This course assumes students have experience developing Java applications and using a Java IDE. Labs are completed using the Eclipse IDE and Maven. No prior Hadoop knowledge is required.

Course Objectives
  Explain Hadoop 2.0 and the Hadoop Distributed File System
  Explain the new YARN framework in Hadoop 2.0
  Develop a Java MapReduce application
  Develop a custom RawComparator class
  Use the Distributed Cache
  Explain the various join techniques in Hadoop
  Run a MapReduce application on YARN
  Use combiners and in-map aggregation to improve the performance of a MapReduce job
  Write a custom partitioner to avoid data skew on reducers
  Perform a secondary sort by writing custom key and group comparator classes
  Recognize use cases for the various built-in input and output formats
  Write a custom input and output format for a MapReduce job
  Optimize a MapReduce job by following best practices
  Configure various aspects of a MapReduce job to optimize mappers and reducers
  Perform a map-side join
  Use a Bloom filter to join two large datasets
  Perform unit tests using the MRUnit API
  Explain the basic architecture of HBase
  Write an HBase MapReduce application
  Explain use cases for Pig and Hive
  Write a simple Pig script to explore and transform big data
  Write a Pig UDF (User-Defined Function) in Java
  Execute a Hive query
  Write a Hive UDF in Java
  Use the JobControl class to create a workflow of MapReduce jobs

Day 1: Understanding Hadoop and HDFS; Writing MapReduce Applications; Map Aggregation
Day 2: Partitioning and Sorting; Input and Output Formats; Optimizing MapReduce Jobs
Day 3: Advanced MapReduce Features; Unit Testing; HBase Programming
Day 4: Pig Programming; Hive Programming; Defining Workflow

Lab Content
  Configuring a Hadoop 2.0 Development Environment
  Putting data into HDFS using Java
  Write a distributed grep MapReduce application
  Write an inverted index MapReduce application
  Configure and use a combiner
  Writing a custom combiner
  Writing a custom partitioner
  Globally sort output using the TotalOrderPartitioner
  Writing a MapReduce job whose data is sorted using a composite key
  Writing a custom InputFormat class
  Writing a custom OutputFormat class
  Compute a simple moving average of historical stock price data
  Use data compression
  Define a RawComparator
  Perform a map-side join
  Using a Bloom filter
  Unit testing a MapReduce job
  Import data into HBase
  Writing an HBase MapReduce job
  Writing a User-Defined Pig Function
  Writing a User-Defined Hive Function
  Defining an Oozie workflow

Hortonworks HDP Developer: Apache Pig and Hive
4 Days TE7414_

This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop 2.0 using Pig and Hive. Students will learn the details of Hadoop 2.0, YARN, the Hadoop Distributed File System (HDFS), an overview of MapReduce, and a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics covered include data ingestion using Sqoop and Flume, and defining workflow using Oozie. Labs are run in a Linux environment.

This class is for data analysts, BI analysts, BI developers, SAS developers and other types of analysts who need to answer questions and analyze Big Data stored in a Hadoop cluster.

Students should be familiar with programming principles and have experience in software development. SQL experience is strongly recommended. Java knowledge is helpful. No prior Hadoop knowledge is required.

Course Objectives
  Explain Hadoop 2.0 and YARN
  Explain use cases for Hadoop
  Explain how HDFS Federation works in Hadoop 2.0
  Explain the various tools and frameworks in the Hadoop 2.0 ecosystem
  Explain the architecture of the Hadoop Distributed File System (HDFS)
  Use the Hadoop client to input data into HDFS
  Use Sqoop to transfer data between Hadoop and a relational database
  Explain the architecture of MapReduce
  Explain the architecture of YARN
  Run a MapReduce job on YARN
  Write a Pig script to explore and transform data in HDFS
  Define advanced Pig relations
  Use Pig to apply structure to unstructured Big Data
  Invoke a Pig User-Defined Function
  Use Pig to organize and analyze Big Data
  Understand how Hive tables are defined and implemented
  Use the new Hive windowing functions
  Explain and use the various Hive file formats
  Create and populate a Hive table that uses the new ORC file format
  Use Hive to run SQL-like queries to perform data analysis
  Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
  Write efficient Hive queries
  Create ngrams and context ngrams using Hive
  Perform data analytics like quantiles and page rank on Big Data using the DataFu Pig library

Day 1: Understanding Hadoop 2.0; The Hadoop Distributed File System (HDFS); Inputting Data into HDFS; The MapReduce Framework and YARN
Day 2: Introduction to Pig; Advanced Pig Programming
Day 3: Hive Programming; Using HCatalog; Advanced Hive Programming
Day 4: Advanced Hive Programming (cont.); Data Analysis and Statistics; Defining Workflow with Oozie

Lab Content
  Use HDFS commands to add/remove files and folders from HDFS
  Use Sqoop to transfer data between HDFS and an RDBMS
  Run a MapReduce job
  Run a YARN application
  Explore and transform data using Pig
  Split a dataset using Pig
  Join two datasets using Pig
  Use Pig to transform and export a dataset for use with Hive
  Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
  Understand how a Hive table is stored in HDFS
  Use Hive to discover useful information in a dataset
  Understand how Hive queries get executed as MapReduce jobs
  Perform a join of two datasets with Hive
  Use advanced Hive features like windowing, views and ORC files
  Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
  Write a custom reducer in Python that reduces the number of underlying MapReduce jobs generated from a Hive query
  Analyze and sessionize clickstream data using the Pig DataFu library
  Compute quantiles of NYSE stock prices
  Use Hive to compute ngrams on Avro-formatted files
  Define an Oozie workflow

Hortonworks HDP Developer: Windows
4 Days TE7410_

This 4-day hands-on training course teaches students how to develop applications and analyze Big Data stored in Apache Hadoop on Windows using Pig and Hive. Students will learn the details of Hadoop 2.x, YARN, the Hadoop Distributed File System (HDFS), an overview of MapReduce, and a deep dive into using Pig and Hive to perform data analytics on Big Data. Other topics covered include using Sqoop to transfer data between Hadoop and Microsoft SQL Server, and connecting Microsoft Excel to Hadoop using the HiveODBC Driver.

This course is for software developers who need to understand and develop applications for Hadoop 2.x on Windows.

Students should be familiar with programming principles and have experience in software development. SQL knowledge and familiarity with Microsoft Windows are also helpful. No prior Hadoop knowledge is required.

Course Objectives
  Explain Hadoop and YARN
  Explain use cases for Hadoop
  Explain the various tools and frameworks in the Hadoop 2.x ecosystem
  Explain the components of the Hortonworks Data Platform on Windows
  Explain the deployment options for HDP on Windows
  Explain the architecture of the Hadoop Distributed File System (HDFS)
  Use the Hadoop client to input data into HDFS
  Use Sqoop to transfer data between Hadoop and Microsoft SQL Server
  Explain the architecture of MapReduce
  Explain the architecture of YARN
  Run a MapReduce job on YARN
  Write a Pig script to explore and transform data in HDFS
  Define advanced Pig relations
  Use Pig to apply structure to unstructured Big Data
  Invoke a Pig User-Defined Function
  Use Pig to organize and analyze Big Data
  Understand how Hive tables are defined and implemented
  Use the new Hive windowing functions
  Explain and use the various Hive file formats
  Create and populate a Hive table that uses the new ORC file format
  Use Hive to run SQL-like queries to perform data analysis
  Use Hive to join datasets using a variety of techniques, including Map-side joins and Sort-Merge-Bucket joins
  Write efficient Hive queries
  Create ngrams and context ngrams using Hive

Day 1: Understanding Hadoop; The Hadoop Distributed File System (HDFS); Inputting Data into HDFS; The MapReduce Framework
Day 2: Introduction to Pig; Advanced Pig Programming
Day 3: Hive Programming; Using HCatalog; Advanced Hive Programming
Day 4: The Hive ODBC Driver; Hadoop 2 and YARN
Appendix A: Defining Workflow with Oozie

Hands-On Labs: Students will work through the following lab exercises using the Hortonworks Data Platform 2.1 on Windows.
  Start HDP on Windows
  Use HDFS commands to add/remove files and folders from HDFS
  Use Sqoop to transfer data between HDFS and Microsoft SQL Server
  Run a MapReduce job
  Explore and transform data using Pig
  Split a dataset using Pig
  Join two datasets using Pig
  Use Pig to transform and export a dataset for use with Hive
  Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
  Understand how a Hive table is stored in HDFS
  Use Hive to discover useful information in a dataset
  Understand how Hive queries get executed as MapReduce jobs
  Perform a join of two datasets with Hive
  Use advanced Hive features like windowing, views and ORC files
  Use the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
  Analyze and sessionize clickstream data using the Pig DataFu library
  Compute quantiles of NYSE stock prices
  Use Hive to compute ngrams on Avro-formatted files
  Connect Microsoft Excel to Hadoop using the HiveODBC Driver
  Run a YARN application
  Define an Oozie workflow

Hortonworks HDP Operations: Hadoop Administration 1
4 Days TE7408_

This course is designed for administrators who will be managing the Hortonworks Data Platform (HDP) 2.3 with Ambari. It covers installation, configuration, and other typical cluster maintenance tasks.

This course is designed for IT administrators and operators responsible for installing, configuring and supporting an Apache Hadoop 2.3 deployment in a Linux environment.

Attendees should be familiar with Hadoop and Linux environments.

Course Objectives
  Summarize an enterprise environment including Big Data, Hadoop and the Hortonworks Data Platform (HDP)
  Install HDP
  Manage Ambari Users and Groups
  Manage Hadoop Services
  Use HDFS Storage
  Manage HDFS Storage
  Configure HDFS Storage
  Configure HDFS Transparent Data Encryption
  Configure the YARN Resource Manager
  Submit YARN Jobs
  Configure the YARN Capacity Scheduler
  Add and Remove Cluster Nodes
  Configure HDFS and YARN Rack Awareness
  Configure HDFS and YARN High Availability
  Monitor a Cluster
  Protect a Cluster with Backups

Lab Content: Students will work through the following lab exercises using the Hortonworks Data Platform 2.2.
  Introduction to the Lab Environment
  Performing an Interactive Ambari HDP Cluster Installation
  Configuring Ambari Users and Groups
  Managing Hadoop Services
  Using HDFS Files and Directories
  Using WebHDFS
  Configuring HDFS ACLs
  Managing HDFS
  Managing HDFS Quotas
  Configuring HDFS Transparent Data Encryption
  Configuring and Managing YARN
  Non-Ambari YARN Management
  Configuring YARN Failure Sensitivity, Work Preserving Restarts, and Log Aggregation Settings
  Submitting YARN Jobs
  Configuring Different Workload Types
  Configuring User and Groups for YARN Labs
  Configuring YARN Resource Behavior and Queues
  User, Group and Fine-Tuned Resource Management
  Adding Worker Nodes
  Configuring Rack Awareness
  Configuring HDFS High Availability
  Configuring YARN High Availability
  Configuring and Managing Ambari Alerts
  Configuring and Managing HDFS Snapshots
  Using Distributed Copy (DistCP)

Hortonworks HDP Data Science
3 Days TE7412_

Data Science for the Hortonworks Data Platform covers data science principles and techniques through lecture and hands-on experience. During this three-day course, students will learn the processes and practice of data science, including machine learning and natural language processing. Students will also learn the tools and programming languages used by data scientists, including Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, Scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.

This class is for architects, software developers, analysts and data scientists who need to understand how to apply data science and machine learning on Hadoop.

Students must have experience with at least one programming or scripting language, knowledge in statistics and/or mathematics, and a basic understanding of big data and Hadoop principles.

Course Objectives
  Recognize use cases for data science
  Describe the architecture of Hadoop and YARN
  Explain the differences between supervised and unsupervised learning
  List the six machine learning tasks
  Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
  Use Mahout to run a machine learning algorithm on Hadoop
  Write Pig scripts to transform data on Hadoop
  Use Pig to prepare data for a machine learning algorithm
  Write a Python script
  Use NumPy to analyze big data
  Use the data structure classes in the pandas library
  Write a Python script that invokes a SciPy machine learning algorithm
  Explain the options for running Python code on a Hadoop cluster
  Write a Pig User Defined Function in Python
  Use Pig streaming on Hadoop with a Python script
  Write a Python script that invokes a scikit-learn machine learning algorithm
  Use the k-nearest neighbor algorithm to predict values based on a training data set
  Run a machine learning algorithm on a distributed data set on Hadoop
  Describe use cases for Natural Language Processing (NLP)
  Perform sentence segmentation on a large body of text
  Perform part-of-speech tagging
  Use the Natural Language Toolkit (NLTK) to implement NLP tasks and machine learning algorithms
  Explain the components of a Spark application

Day 1: Using Hadoop for Data Science; Hadoop Architecture; Machine Learning; Introduction to Pig
Day 2: Python Programming; Analyzing Data with Python; Running Python on Hadoop
Day 3: Implementing Machine Learning; Natural Language Processing; Spark MLlib

Hands-On Labs: Students will complete the following hands-on labs using their own 7-node Hadoop cluster (HDP 2.1) and IPython Notebook.
  Setting Up a Development Environment
  Using HDFS Commands
  Using Mahout for Machine Learning
  Getting Started with Pig
  Exploring Data with Pig
  Using the IPython Notebook
  Data Analysis with Python
  Interpolating Data Points
  Define a Pig UDF in Python
  Streaming Python with Pig
  K-Nearest Neighbor
  K-Means Clustering
  Using NLTK for Natural Language Processing
  Classifying Text using Naive Bayes
  Spark Programming
  Running Data Science Algorithms using Spark MLlib

Hortonworks HDP Developer: Custom YARN Applications
2 Days TE7415_

This 2-day hands-on training course teaches students how to develop custom YARN applications for Apache Hadoop. Students will learn the details of the YARN architecture, the steps involved in writing a YARN application, the details of writing a YARN client and ApplicationMaster, and how to launch Containers. Applications are developed using Eclipse and Gradle connected remotely to a 7-node HDP 2.1 cluster running in a virtual machine that the students can keep for use after the training.

This course is intended for software engineers familiar with Java who need to develop YARN applications on Hadoop 2.x by writing custom YARN clients and ApplicationMasters in Java.

Students must have attended the Developing Applications with the Hortonworks Data Platform using Java course; or attended the Data Analysis with the Hortonworks Data Platform using Pig and Hive course; or possess similar Hadoop development knowledge and understand HDFS and the MapReduce framework.

Course Objectives
  Explain the architecture of YARN
  Explain the lifecycle of a YARN application
  Write a YARN client application
  Run a YARN application on a Hadoop 2.x cluster
  Monitor the status of a running YARN application
  View the aggregated logs of a YARN application
  Write a YARN ApplicationMaster
  Explain the differences between synchronous and asynchronous ApplicationMasters
  Allocate Containers in a cluster
  Launch Containers on NodeManagers
  Write a custom Container to perform specific business logic
  Configure a ContainerLaunchContext
  Define a LocalResource for sharing application files across the cluster
  Explain the job schedulers of the ResourceManager
  Define queues for the Capacity Scheduler

Day 1: Unit 1: The YARN Architecture; Unit 2: Overview of a YARN Application; Unit 3: Writing a YARN Client
Day 2: Unit 4: Writing a YARN ApplicationMaster; Unit 5: Containers; Unit 6: Job Scheduling

Lab Content: Students will work through the following lab exercises using the Hortonworks Data Platform 2.1.
  Running a YARN Application
  Set Up a YARN Development Environment
  Writing a YARN Client
  Submitting an ApplicationMaster
  Writing an ApplicationMaster
  Requesting Containers
  Running Containers
  Writing Custom Containers

Hortonworks HDP Operations: Migrating to the Hortonworks Data Platform
2 Days TE7416_

This course is designed for administrators who are familiar with administering other Hadoop distributions and are migrating to the Hortonworks Data Platform (HDP). It covers installation, configuration, maintenance, security and performance topics.

This class is for experienced Hadoop administrators and operators responsible for installing, configuring and supporting the Hortonworks Data Platform.

Attendees should be familiar with Hadoop fundamentals, have experience administering a Hadoop cluster, and have experience with the installation and configuration of Hadoop components such as Sqoop, Flume, Hive, Pig and Oozie.

Course Objectives
  Install and configure an HDP 2.x cluster
  Use Ambari to monitor and manage a cluster
  Mount HDFS to a local filesystem using the NFS Gateway
  Commission and decommission worker nodes using Ambari
  Use Falcon to define and process data pipelines
  Take snapshots using the HDFS snapshot feature
  Configure Hive for Tez
  Use Ambari to configure the schedulers of the ResourceManager
  Implement and configure NameNode HA using Ambari
  Secure an HDP cluster using Ambari
  Set up a Knox gateway

Hands-On Labs
  Install HDP 2.x using Ambari
  Add a new node to the cluster
  Stop and start HDP services
  Mount HDFS to a local file system
  Configure the capacity scheduler
  Use WebHDFS
  Dataset mirroring using Falcon
  Commission and decommission a worker node using Ambari
  Use HDFS snapshots
  Configure NameNode HA using Ambari
  Secure an HDP cluster using Ambari
  Setting up a Knox gateway

Hortonworks HDP Analyst: Apache HBase Essentials
2 Days TE7417_

This course is designed for big data analysts who want to use the HBase NoSQL database, which runs on top of HDFS to provide real-time read/write access to sparse datasets. Topics include HBase architecture, services, installation and schema design.

This class is for architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse data sets commonly found in big data use cases.

Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required.

Course Objectives
  Integrate HBase with Hadoop and HDFS
  Describe architectural components and core concepts of HBase
  Understand HBase functionality
  Install and configure HBase
  Perform backup and recovery
  Monitor and manage HBase
  Describe how Apache Phoenix works with HBase
  Integrate HBase with Apache ZooKeeper
  Understand HBase schema design
  Import and export data
  Use HBase services and perform data operations
  Optimize HBase Access

Hands-On Labs
  Using Hadoop and MapReduce
  Using HBase
  Importing Data from MySQL to HBase
  Using Apache ZooKeeper
  Examining Configuration Files
  Using Backup and Snapshot
  HBase Shell Operations
  Creating Tables with Multiple Column Families
  Exploring HBase Schema
  Blocksize and Bloom filters
  Exporting Data
  Using a Java Data Access Object Application to interact with HBase

Hortonworks HDP Operations: Apache HBase Management
4 Days TE7419_

This course is designed for administrators who will be installing, configuring and managing HBase clusters. It covers installation with Ambari, configuration, security and troubleshooting HBase implementations. The course includes an end-of-course project in which students work together to design and implement an HBase schema.

This course is for architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse data sets commonly found in big data use cases.

Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required. Students new to Hadoop are encouraged to take the HDP Overview: Apache Hadoop Essentials course.

Course Objectives
  Discuss running applications in the cloud
  Perform operational management
  Provision the cluster
  Perform backup and recovery
  Use the HBase shell
  Provide security
  Ingest data
  Monitor HBase and diagnose problems
  Perform maintenance
  Troubleshoot

Hands-On Labs
  Installing and Configuring HBase with Ambari
  Manually Installing HBase (Optional)
  Using Shell Commands
  Ingesting Data using ImportTSV
  Enabling HBase High Availability
  Viewing Log Files
  Configuring and Enabling Snapshots
  Configuring Cluster Replication
  Enabling Authentication and Authorization
  Diagnosing and Resolving Hot Spotting
  Region Splitting
  Monitoring JVM Garbage Collection
  End-of-Course Project: Designing an HBase Schema

Hortonworks HDP Developer: Storm and Trident Fundamentals Workshop
2 Days TE7418_

This course provides a technical introduction to the fundamentals of Apache Storm and Trident that includes the concepts, terminology, architecture, installation, operation, and management of Storm and Trident. Simple Storm and Trident code excerpts are provided throughout the course. The course also includes an introduction to, and code samples for, Apache Kafka. Apache Kafka is a messaging system that is commonly used in concert with Storm and Trident.

This course is for data architects, data integration architects, technical infrastructure teams, and Hadoop administrators or developers who want to understand the fundamentals of Storm and Trident.

No previous Hadoop or programming knowledge is required. Students will need browser access to the Internet.

Course Objectives
  Recognize differences between batch and real-time data processing
  Define Storm elements including tuples, streams, spouts, topologies, worker processes, executors, and stream groupings
  Explain Storm architectural components, including Nimbus, Supervisors, and ZooKeeper cluster
  Recognize/interpret Java code for a spout, bolt, or topology
  Identify how to install and configure a Storm cluster
  Identify how to develop and submit a topology to a local or remote distributed cluster
  Recognize and explain the differences between reliable and unreliable Storm operation
  Manage and monitor Storm using the command-line client or browser-based Storm User Interface (UI)
  Define Trident elements including tuples, streams, batches, partitions, topologies, Trident spouts, and operations
  Recognize and interpret the code for Trident operations, including filters, functions, aggregations, merges, and joins
  Recognize and understand Trident repartitioning operations