Hadoop Development for Big Data Solutions: Hands-On




You Will Learn How To:
- Implement Hadoop jobs to extract business value from large and varied data sets
- Write, customize and deploy MapReduce jobs to summarize data
- Develop Hive and Pig queries to simplify data analysis
- Test and debug jobs using MRUnit
- Monitor task execution and cluster health

What is this course about?
The availability of large data sets presents new opportunities and challenges to organizations of all sizes. This course provides the hands-on programming skills needed to leverage the Apache Hadoop platform to efficiently process a variety of Big Data, and to test and deploy Big Data solutions on commodity clusters. It covers Pig, Hive, HBase and other components of the Hadoop ecosystem, along with the testing, deployment and best practices required to architect and develop a complete Big Data solution.

Who will benefit from this course?
This course is for developers, architects and testers who want hands-on experience writing code for Hadoop. It can also be helpful to technical managers interested in the development process.

What background do I need?
You should have Java experience at the level of Course 471, Java Programming Introduction: Hands-On, or equivalent experience. Exposure to SQL is helpful.

Will there be any programming in the course?
Yes! Approximately 40 percent of the course time is devoted to hands-on programming.

What tools and platforms are used?
The platform is Java running on Red Hat Linux. The tools used include Eclipse and various text editors.

Which Big Data products does this course use?
The course covers a number of Big Data products including Apache Hadoop, MapReduce, the Hadoop Distributed File System (HDFS), HBase, Hive and Pig. Additional parts of the Hadoop ecosystem, such as Sqoop, Oozie and MRUnit, are also covered, and other data stores are mentioned for comparison.

What is Big Data?
Big Data is a term for data sets that can grow so large, so quickly, that they become unmanageable with conventional tools. The Big Data movement brings new tools and new ways of storing information that allow efficient processing and analysis for informed business decision-making.

What is Hadoop?
Hadoop is the Apache Software Foundation's open source implementation of MapReduce and the most widely used platform for processing large, complex data sets that would otherwise be intractable by conventional means. It is a high-performance distributed storage and processing system that provides both storage and computational capability for substantial amounts of data, with commercial support available from multiple vendors and prepackaged cloud solutions.

What is MapReduce?
MapReduce is a parallel programming model for distributed processing of large data sets on a cluster of computers. It was originally implemented by Google as part of indexing and searching the web, and it has since been widely adopted across most industries.

How are Hadoop programs developed?
Programs are primarily written in Java, although Hadoop also has facilities for handling programs written in other languages such as C++, Python and .NET. Programs can also be written in scripting languages such as Pig, and data in HDFS can be queried with a SQL-like syntax using Hive.
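
To make the programming model concrete, here is a minimal word-count sketch (illustrative only, not taken from the course materials) using the standard org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run against a text file in HDFS, the job writes one (word, count) pair per line to the output directory.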

What are the advantages of using Hadoop?
- It can process and analyze more data than was previously possible, at a lower cost
- It runs on scalable commodity clusters
- It has self-healing capabilities to survive hardware failures
- It operates on various types of data and adapts to varying degrees of structure
- HDFS automatically provides robustness and redundancy for performance and reliability
- Many associated projects enhance the Hadoop ecosystem and ease development

Course Outline

Introduction to Hadoop
- Identifying the business benefits of Hadoop
- Surveying the Hadoop ecosystem
- Selecting a suitable distribution

Parallelizing Program Execution
- Meeting the challenges of parallel programming
  - Investigating parallelizable challenges: algorithms, data and information exchange
  - Estimating the storage and complexity of Big Data
- Parallel programming with MapReduce
  - Dividing and conquering large-scale problems
  - Uncovering jobs suitable for MapReduce
  - Solving typical business problems

Implementing Real-World MapReduce Jobs
- Applying the Hadoop MapReduce paradigm
  - Configuring the development environment
  - Exploring the Hadoop distribution
  - Creating the components of MapReduce jobs
  - Introducing the Hadoop daemons
  - Analyzing the stages of MapReduce processing: splitting, mapping, shuffling and reducing
- Building complex MapReduce jobs
  - Selecting and employing multiple mappers and reducers
  - Leveraging built-in mappers, reducers and partitioners
  - Coordinating jobs with the Oozie workflow scheduler
  - Streaming tasks through various programming languages

Customizing MapReduce
- Solving common data manipulation problems
  - Executing algorithms: parallel sorts, joins and searches
  - Analyzing log files, social media data and e-mails
- Implementing partitioners and combiners (see the sketch after this section)
  - Identifying network-bound, CPU-bound and disk I/O-bound parallel algorithms
  - Reducing network traffic with combiners
  - Dividing the workload efficiently using partitioners
  - Collecting metrics with counters
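
As a hedged sketch of the partitioner, combiner and counter topics listed above (the region-keyed log format and class names are invented for illustration, not part of the course materials), the example below routes all records for a given region to the same reducer and uses a counter to track malformed input.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class RegionCounting {

    // Job-wide metrics; totals appear in the job's counter report when it completes.
    public enum ParseCounters { MALFORMED_RECORDS }

    // Mapper: emits (region, 1) for each tab-separated log record and counts bad lines.
    public static class RegionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text region = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) {
                context.getCounter(ParseCounters.MALFORMED_RECORDS).increment(1);
                return;  // skip malformed input instead of failing the whole task
            }
            region.set(fields[0]);
            context.write(region, ONE);
        }
    }

    // Partitioner: sends every record for a given region key to the same reducer.
    public static class RegionPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
    // In the driver: job.setPartitionerClass(RegionPartitioner.class);
    //                job.setCombinerClass(IntSumReducer.class);  // pre-aggregate map output
}
```

Reusing the sum reducer as a combiner is safe here because summing is associative and commutative; the combiner simply reduces the amount of map output shuffled across the network.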

Persisting Big Data with Distributed Data Stores
- Making the case for distributed data
  - Achieving high-performance data throughput
  - Recovering from media failure through redundancy
- Interfacing with the Hadoop Distributed File System (HDFS)
  - Breaking down the structure and organization of HDFS
  - Loading raw data and retrieving results
  - Reading and writing data programmatically (see the sketch after this section)
  - Partitioning text or binary data
  - Manipulating Hadoop SequenceFile types
- Structuring data with HBase
  - Migrating from structured to unstructured storage
  - Applying NoSQL concepts with schema on read
  - Connecting to HBase from MapReduce jobs
  - Comparing HBase to other types of NoSQL data stores
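
As a minimal sketch of reading and writing HDFS data programmatically (illustrative only, not from the course materials; the file path is hypothetical), the example below uses the standard FileSystem API to write a small text file and read it back.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hdfs-example/greeting.txt");  // hypothetical path

        // Write: create() returns an output stream backed by HDFS blocks.
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(fs.create(file, true), StandardCharsets.UTF_8))) {
            out.println("hello from HDFS");
        }

        // Read the file back line by line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```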

Simplifying Data Analysis with Query Languages
- Unleashing the power of SQL with Hive
  - Structuring data with the Hive MetaStore
  - Extracting, Transforming and Loading (ETL) data
  - Querying with HiveQL
  - Accessing Hive servers through JDBC
  - Extending HiveQL with User-Defined Functions (UDF)
- Executing workflows with Pig
  - Developing Pig Latin scripts to consolidate workflows
  - Integrating Pig queries with Java
  - Interacting with data through the grunt console
  - Extending Pig with User-Defined Functions (UDF)

Managing and Deploying Big Data Solutions
- Testing and debugging Hadoop code
  - Logging significant events for auditing and debugging
  - Debugging in local mode
  - Validating requirements with MRUnit (see the sketch after this outline)
- Deploying, monitoring and tuning performance
  - Deploying to a production cluster
  - Optimizing performance with administrative tools
  - Monitoring job execution through web user interfaces
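
Finally, a hedged sketch of validating a mapper with MRUnit (illustrative only, assuming the word-count mapper sketched earlier): MRUnit's MapDriver runs the mapper in memory and checks the emitted key/value pairs, with no cluster or HDFS required.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

// Unit test for the WordCount mapper: feeds one input line and asserts the expected output pairs.
public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}
```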