Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah





Contents

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Motivation for Big Data
  What Is Big Data?
  Key Idea Behind Big Data Techniques
  Data Is Distributed Across Several Nodes
  Applications Are Moved to the Data
  Data Is Processed Local to a Node
  Sequential Reads Preferred Over Random Reads
  An Example
  Big Data Programming Models
  Massively Parallel Processing (MPP) Database Systems
  In-Memory Database Systems
  MapReduce Systems
  Bulk Synchronous Parallel (BSP) Systems
  Big Data and Transactional Systems
  How Much Can We Scale?
  A Compute-Intensive Example
  Amdahl's Law
  Business Use-Cases for Big Data
  Summary

Chapter 2: Hadoop Concepts
  Introducing Hadoop
  Introducing the MapReduce Model
  Components of Hadoop
  Hadoop Distributed File System (HDFS)
  Secondary NameNode
  TaskTracker
  JobTracker
  Hadoop 2.0
  Components of YARN
  HDFS High Availability
  Summary

Chapter 3: Getting Started with the Hadoop Framework
  Types of Installation
  Stand-Alone Mode
  Pseudo-Distributed Cluster
  Multinode Node Cluster Installation
  Preinstalled Using Amazon Elastic MapReduce
  Setting up a Development Environment with a Cloudera Virtual Machine
  Components of a MapReduce program
  Your First Hadoop Program
  Prerequisites to Run Programs in Local Mode
  WordCount Using the Old API
  Building the Application
  Running WordCount in Cluster Mode
  WordCount Using the New API
  Building the Application
  Running WordCount in Cluster Mode
  Third-Party Libraries in Hadoop Jobs
  Summary

Chapter 4: Hadoop Administration
  Hadoop Configuration Files
  Configuring Hadoop Daemons
  Precedence of Hadoop Configuration Files
  Diving into Hadoop Configuration Files
  core-site.xml
  hdfs-*.xml
  mapred-site.xml
  yarn-site.xml
  Memory Allocations in YARN
  Scheduler
  Capacity Scheduler
  Fair Scheduler
  Fair Scheduler Configuration
  yarn-site.xml Configurations
  Allocation File Format and Configurations
  Determine Dominant Resource Share in drf Policy
  Slaves File
  Rack Awareness
  Providing Hadoop with Network Topology
  Cluster Administration Utilities
  Check the HDFS
  Command-Line HDFS Administration
  Rebalancing HDFS Data
  Copying Large Amounts of Data from the HDFS
  Summary

Chapter 5: Basics of MapReduce Development
  Hadoop and Data Processing
  Reviewing the Airline Dataset
  Preparing the Development Environment
  Preparing the Hadoop System
  MapReduce Programming Patterns
  Map-Only Jobs (SELECT and WHERE Queries)
  Problem Definition: SELECT Clause
  Problem Definition: WHERE Clause
  Map and Reduce Jobs (Aggregation Queries)
  Problem Definition: GROUP BY and SUM Clauses
  Improving Aggregation Performance Using the Combiner
  Problem Definition: Optimized Aggregators
  Role of the Partitioner
  Problem Definition: Split Airline Data by Month
  Bringing it All Together
  Summary

Chapter 6: Advanced MapReduce Development
  MapReduce Programming Patterns
  Introduction to Hadoop I/O
  Problem Definition: Sorting
  Problem Definition: Analyzing Consecutive Records
  Problem Definition: Join Using MapReduce
  Problem Definition: Join Using Map-Only jobs
  Writing to Multiple Output Files in a Single MR Job
  Collecting Statistics Using Counters
  Summary

Chapter 7: Hadoop Input/Output
  Compression Schemes
  What Can Be Compressed?
  Compression Schemes
  Enabling Compression
  Inside the Hadoop I/O processes
  InputFormat
  OutputFormat
  Custom OutputFormat: Conversion from Text to XML
  Custom InputFormat: Consuming a Custom XML file
  Hadoop Files
  SequenceFile
  MapFiles
  Avro Files
  Summary

Chapter 8: Testing Hadoop Programs
  Revisiting the Word Counter
  Introducing MRUnit
  Installing MRUnit
  MRUnit Core Classes
  Writing an MRUnit Test Case
  Testing Counters
  Features of MRUnit
  Limitations of MRUnit
  Testing with LocalJobRunner
  Limitations of LocalJobRunner
  Testing with MiniMRCluster
  Setting up the Development Environment
  Example for MiniMRCluster
  Limitations of MiniMRCluster
  Testing MR Jobs with Access Network Resources
  Summary

Chapter 9: Monitoring Hadoop
  Writing Log Messages in Hadoop MapReduce Jobs
  Viewing Log Messages in Hadoop MapReduce Jobs
  User Log Management in Hadoop 2.x
  Log Storage in Hadoop 2.x
  Log Management Improvements
  Viewing Logs Using Web-Based UI
  Command-Line Interface
  Log Retention
  Hadoop Cluster Performance Monitoring
  Using YARN REST APIs
  Managing the Hadoop Cluster Using Vendor Tools
  Ambari Architecture
  Summary

Chapter 10: Data Warehousing Using Hadoop
  Apache Hive
  Installing Hive
  Hive Architecture
  Metastore
  Compiler Basics
  Hive Concepts
  HiveQL Compiler Details
  Data Definition Language
  Data Manipulation Language
  External Interfaces
  Hive Scripts
  Performance
  MapReduce Integration
  Creating Partitions
  User-Defined Functions
  Impala
  Impala Architecture
  Impala Features
  Impala Limitations
  Shark
  Shark/Spark Architecture
  Summary

Chapter 11: Data Processing Using Pig
  An Introduction to Pig
  Running Pig
  Executing in the Grunt Shell
  Executing a Pig Script
  Embedded Java Program
  Pig Latin
  Comments in a Pig Script
  Execution of Pig Statements
  Pig Commands
  User-Defined Functions
  Eval Functions Invoked in the Mapper
  Eval Functions Invoked in the Reducer
  Writing and Using a Custom FilterFunc
  Comparison of PIG versus Hive
  Crunch API
  How Crunch Differs from Pig
  Sample Crunch Pipeline
  Summary

Chapter 12: HCatalog and Hadoop in the Enterprise
  HCatalog and Enterprise Data Warehouse Users
  HCatalog: A Brief Technical Background
  HCatalog Command-Line Interface
  WebHCat
  HCatalog Interface for MapReduce
  HCatalog Interface for Pig
  HCatalog Notification Interface
  Security and Authorization in HCatalog
  Bringing It All Together
  Summary

Chapter 13: Log Analysis Using Hadoop
  Log File Analysis Applications
  Web Analytics
  Security Compliance and Forensics
  Monitoring and Alerts
  Internet of Things
  Analysis Steps
  Load
  Refine
  Visualize
  Apache Flume
  Core Concepts
  Netflix Suro
  Cloud Solutions
  Summary

Chapter 14: Building Real-Time Systems Using HBase
  What Is HBase?
  Typical HBase Use-Case Scenarios
  HBase Data Model
  HBase Logical or Client-Side View
  Differences Between HBase and RDBMSs
  HBase Tables
  HBase Cells
  HBase Column Family
  HBase Commands and APIs
  Getting a Command List: help Command
  Creating a Table: create Command
  Adding Rows to a Table: put Command
  Retrieving Rows from the Table: get Command
  Reading Multiple Rows: scan Command
  Counting the Rows in the Table: count Command
  Deleting Rows: delete Command
  Truncating a Table: truncate Command
  Dropping a Table: drop Command
  Altering a Table: alter Command
  HBase Architecture
  HBase Components
  Compaction and Splits in HBase
  Compaction
  HBase Configuration: An Overview
  hbase-default.xml and hbase-site.xml
  HBase Application Design
  Tall vs. Wide vs. Narrow Table Design
  Row Key Design
  HBase Operations Using Java API
  HBase Treats Everything as Bytes
  Create an HBase Table
  Administrative Functions Using HBaseAdmin
  Accessing Data Using the Java API
  HBase MapReduce Integration
  A MapReduce Job to Read an HBase Table
  HBase and MapReduce Clusters
  Scenario I: Frequent MapReduce Jobs Against HBase Tables
  Scenario II: HBase and MapReduce have Independent SLAs
  Summary

Chapter 15: Data Science with Hadoop
  Hadoop Data Science Methods
  Apache Hama
  Bulk Synchronous Parallel Model
  Hama Hello World!
  Monte Carlo Methods
  K-Means Clustering
  Apache Spark
  Resilient Distributed Datasets (RDDs)
  Monte Carlo with Spark
  KMeans with Spark
  RHadoop
  Summary

Chapter 16: Hadoop in the Cloud
  Economics
  Self-Hosted Cluster
  Cloud-Hosted Cluster
  Elasticity
  On Demand
  Bid Pricing
  Hybrid Cloud
  Logistics
  Ingress/Egress
  Data Retention
  Security
  Cloud Usage Models
  Cloud Providers
  Amazon Web Services
  Google Cloud Platform
  Microsoft Azure
  Choosing a Cloud Vendor
  Case Study: Amazon Web Services
  Elastic MapReduce
  Elastic Compute Cloud
  Summary

Chapter 17: Building a YARN Application
  YARN: A General-Purpose Distributed System
  YARN: A Quick Review
  Creating a YARN Application
  POM Configuration
  DownloadService.java Class
  Client.java
  Steps to Launch the Application Master from the Client
  ApplicationMaster.java
  Communication Protocol between Application Master and Resource Manager: Application Master Protocol
  Node Manager Communication Protocol: Container Management Protocol
  Steps to Launch the Worker Tasks
  Executing the Application Master
  Launch the Application in Un-Managed Mode
  Launch the Application in Managed Mode
  Summary

Appendix A: Installing Hadoop
  Installing Hadoop 2.2.0 on Windows
  Preparing the Installation Environment
  Building Hadoop 2.2.0 for Windows
  Installing Hadoop 2.2.0 for Windows
  Configuring Hadoop 2.2.0
  Preparing the Hadoop Cluster
  Starting HDFS
  Starting MapReduce (YARN)
  Verifying that the Cluster Is Running
  Testing the Cluster
  Installing Hadoop 2.2.0 on Linux

Appendix B: Using Maven with Eclipse
  A Quick Introduction to Maven
  Creating a Maven Project
  Using Maven with Eclipse
  Installing the m2e Maven Eclipse Plug-in
  Creating a Maven Project from Eclipse
  Building a Maven Project from Eclipse

Appendix C: Apache Ambari
  Hadoop Components Supported by Apache Ambari
  Installing Apache Ambari
  Trying the Ambari Sandbox on Your OS

Index