Hadoop: The Definitive Guide




Tom White
Foreword by Doug Cutting

O'REILLY
Beijing Cambridge Farnham Köln Sebastopol Taipei Tokyo

Table of Contents

Foreword
Preface

1. Meet Hadoop
    Data!
    Data Storage and Analysis
    Comparison with Other Systems
        RDBMS
        Grid Computing
        Volunteer Computing
    A Brief History of Hadoop
    The Apache Hadoop Project

2. MapReduce
    A Weather Dataset
        Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
        Map and Reduce
        Java MapReduce
    Scaling Out
        Data Flow
        Combiner Functions
        Running a Distributed MapReduce Job
    Hadoop Streaming
        Ruby
        Python
    Hadoop Pipes
        Compiling and Running

3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
        Blocks
        Namenodes and Datanodes
    The Command-Line Interface
        Basic Filesystem Operations
    Hadoop Filesystems
        Interfaces
    The Java Interface
        Reading Data from a Hadoop URL
        Reading Data Using the FileSystem API
        Writing Data
        Directories
        Querying the Filesystem
        Deleting Data
    Data Flow
        Anatomy of a File Read
        Anatomy of a File Write
        Coherency Model
    Parallel Copying with distcp
        Keeping an HDFS Cluster Balanced
    Hadoop Archives
        Using Hadoop Archives
        Limitations

4. Hadoop I/O
    Data Integrity
        Data Integrity in HDFS
        LocalFileSystem
        ChecksumFileSystem
    Compression
        Codecs
        Compression and Input Splits
        Using Compression in MapReduce
    Serialization
        The Writable Interface
        Writable Classes
        Implementing a Custom Writable
        Serialization Frameworks
    File-Based Data Structures
        SequenceFile
        MapFile

5. Developing a MapReduce Application
    The Configuration API
        Combining Resources
        Variable Expansion
    Configuring the Development Environment
        Managing Configuration
        GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test
        Mapper
        Reducer
    Running Locally on Test Data
        Running a Job in a Local Job Runner
        Testing the Driver
    Running on a Cluster
        Packaging
        Launching a Job
        The MapReduce Web UI
        Retrieving the Results
        Debugging a Job
        Using a Remote Debugger
    Tuning a Job
        Profiling Tasks
    MapReduce Workflows
        Decomposing a Problem into MapReduce Jobs
        Running Dependent Jobs

6. How MapReduce Works
    Anatomy of a MapReduce Job Run
        Job Submission
        Job Initialization
        Task Assignment
        Task Execution
        Progress and Status Updates
        Job Completion
    Failures
        Task Failure
        Tasktracker Failure
        Jobtracker Failure
    Job Scheduling
        The Fair Scheduler
    Shuffle and Sort
        The Map Side
        The Reduce Side
        Configuration Tuning
    Task Execution
        Speculative Execution
        Task JVM Reuse
        Skipping Bad Records
        The Task Execution Environment

7. MapReduce Types and Formats
    MapReduce Types
        The Default MapReduce Job
    Input Formats
        Input Splits and Records
        Text Input
        Binary Input
        Multiple Inputs
        Database Input (and Output)
    Output Formats
        Text Output
        Binary Output
        Multiple Outputs
        Lazy Output
        Database Output

8. MapReduce Features
    Counters
        Built-in Counters
        User-Defined Java Counters
        User-Defined Streaming Counters
    Sorting
        Preparation
        Partial Sort
        Total Sort
        Secondary Sort
    Joins
        Map-Side Joins
        Reduce-Side Joins
    Side Data Distribution
        Using the Job Configuration
        Distributed Cache
    MapReduce Library Classes

9. Setting Up a Hadoop Cluster
    Cluster Specification
        Network Topology
    Cluster Setup and Installation
        Installing Java
        Creating a Hadoop User
        Installing Hadoop
        Testing the Installation
    SSH Configuration
    Hadoop Configuration
        Configuration Management
        Environment Settings
        Important Hadoop Daemon Properties
        Hadoop Daemon Addresses and Ports
        Other Hadoop Properties
    Post Install
    Benchmarking a Hadoop Cluster
        Hadoop Benchmarks
        User Jobs
    Hadoop in the Cloud
        Hadoop on Amazon EC2

10. Administering Hadoop
    HDFS
        Persistent Data Structures
        Safe Mode
        Audit Logging
        Tools
    Monitoring
        Logging
        Metrics
        Java Management Extensions
    Maintenance
        Routine Administration Procedures
        Commissioning and Decommissioning Nodes
        Upgrades

11. Pig
    Installing and Running Pig
        Execution Types
        Running Pig Programs
        Grunt
        Pig Latin Editors
    An Example
        Generating Examples
    Comparison with Databases
    Pig Latin
        Structure
        Statements
        Expressions
        Types
        Schemas
        Functions
    User-Defined Functions
        A Filter UDF
        An Eval UDF
        A Load UDF
    Data Processing Operators
        Loading and Storing Data
        Filtering Data
        Grouping and Joining Data
        Sorting Data
        Combining and Splitting Data
    Pig in Practice
        Parallelism
        Parameter Substitution

12. HBase
    HBasics
        Backdrop
    Concepts
        Whirlwind Tour of the Data Model
        Implementation
    Installation
        Test Drive
    Clients
        Java
        REST and Thrift
    Example
        Schemas
        Loading Data
        Web Queries
    HBase Versus RDBMS
        Successful Service
        HBase
        Use Case: HBase at streamy.com
    Praxis
        Versions
        Love and Hate: HBase and HDFS
        UI
        Metrics
        Schema Design

13. ZooKeeper
    Installing and Running ZooKeeper
    An Example
        Group Membership in ZooKeeper
        Creating the Group
        Joining a Group
        Listing Members in a Group
        Deleting a Group
    The ZooKeeper Service
        Data Model
        Operations
        Implementation
        Consistency
        Sessions
        States
    Building Applications with ZooKeeper
        A Configuration Service
        The Resilient ZooKeeper Application
        A Lock Service
        More Distributed Data Structures and Protocols
    ZooKeeper in Production
        Resilience and Performance
        Configuration

14. Case Studies
    Hadoop Usage at Last.fm
        Last.fm: The Social Music Revolution
        Hadoop at Last.fm
        Generating Charts with Hadoop
        The Track Statistics Program
        Summary
    Hadoop and Hive at Facebook
        Introduction
        Hadoop at Facebook
        Hypothetical Use Case Studies
        Hive
        Problems and Future Work
    Nutch Search Engine
        Background
        Data Structures
        Selected Examples of Hadoop Data Processing in Nutch
        Summary
    Log Processing at Rackspace
        Requirements/The Problem
        Brief History
        Choosing Hadoop
        Collection and Storage
        MapReduce for Logs
    Cascading
        Fields, Tuples, and Pipes
        Operations
        Taps, Schemes, and Flows
        Cascading in Practice
        Flexibility
        Hadoop and Cascading at ShareThis
        Summary
    TeraByte Sort on Apache Hadoop

A. Installing Apache Hadoop
B. Cloudera's Distribution for Hadoop
C. Preparing the NCDC Weather Data

Index