Hadoop: The Definitive Guide

Hadoop: The Definitive Guide, by Tom White, foreword by Doug Cutting. O'Reilly: Beijing, Cambridge, Farnham, Köln, Sebastopol, Taipei, Tokyo.

Table of Contents

Foreword
Preface

1. Meet Hadoop
  Data!
  Data Storage and Analysis
  Comparison with Other Systems
    RDBMS
    Grid Computing
    Volunteer Computing
  A Brief History of Hadoop
  The Apache Hadoop Project

2. MapReduce
  A Weather Dataset
    Data Format
  Analyzing the Data with Unix Tools
  Analyzing the Data with Hadoop
    Map and Reduce
    Java MapReduce
  Scaling Out
    Data Flow
    Combiner Functions
    Running a Distributed MapReduce Job
  Hadoop Streaming
    Ruby
    Python
  Hadoop Pipes
    Compiling and Running

3. The Hadoop Distributed Filesystem
  The Design of HDFS
  HDFS Concepts
    Blocks
    Namenodes and Datanodes
  The Command-Line Interface
    Basic Filesystem Operations
  Hadoop Filesystems
    Interfaces
  The Java Interface
    Reading Data from a Hadoop URL
    Reading Data Using the FileSystem API
    Writing Data
    Directories
    Querying the Filesystem
    Deleting Data
  Data Flow
    Anatomy of a File Read
    Anatomy of a File Write
    Coherency Model
  Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced
  Hadoop Archives
    Using Hadoop Archives
    Limitations

4. Hadoop I/O
  Data Integrity
    Data Integrity in HDFS
    LocalFileSystem
    ChecksumFileSystem
  Compression
    Codecs
    Compression and Input Splits
    Using Compression in MapReduce
  Serialization
    The Writable Interface
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
  File-Based Data Structures
    SequenceFile
    MapFile

5. Developing a MapReduce Application
  The Configuration API
    Combining Resources
    Variable Expansion
  Configuring the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
  Writing a Unit Test
    Mapper
    Reducer
  Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
  Running on a Cluster
    Packaging
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Using a Remote Debugger
  Tuning a Job
    Profiling Tasks
  MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    Running Dependent Jobs

6. How MapReduce Works
  Anatomy of a MapReduce Job Run
    Job Submission
    Job Initialization
    Task Assignment
    Task Execution
    Progress and Status Updates
    Job Completion
  Failures
    Task Failure
    Tasktracker Failure
    Jobtracker Failure
  Job Scheduling
    The Fair Scheduler
  Shuffle and Sort
    The Map Side
    The Reduce Side

    Configuration Tuning
  Task Execution
    Speculative Execution
    Task JVM Reuse
    Skipping Bad Records
    The Task Execution Environment

7. MapReduce Types and Formats
  MapReduce Types
    The Default MapReduce Job
  Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
  Output Formats
    Text Output
    Binary Output
    Multiple Outputs
    Lazy Output
    Database Output

8. MapReduce Features
  Counters
    Built-in Counters
    User-Defined Java Counters
    User-Defined Streaming Counters
  Sorting
    Preparation
    Partial Sort
    Total Sort
    Secondary Sort
  Joins
    Map-Side Joins
    Reduce-Side Joins
  Side Data Distribution
    Using the Job Configuration
    Distributed Cache
  MapReduce Library Classes

9. Setting Up a Hadoop Cluster
  Cluster Specification

    Network Topology
  Cluster Setup and Installation
    Installing Java
    Creating a Hadoop User
    Installing Hadoop
    Testing the Installation
  SSH Configuration
  Hadoop Configuration
    Configuration Management
    Environment Settings
    Important Hadoop Daemon Properties
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
  Post Install
  Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs
  Hadoop in the Cloud
    Hadoop on Amazon EC2

10. Administering Hadoop
  HDFS
    Persistent Data Structures
    Safe Mode
    Audit Logging
    Tools
  Monitoring
    Logging
    Metrics
    Java Management Extensions
  Maintenance
    Routine Administration Procedures
    Commissioning and Decommissioning Nodes
    Upgrades

11. Pig
  Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
  An Example
    Generating Examples

  Comparison with Databases
  Pig Latin
    Structure
    Statements
    Expressions
    Types
    Schemas
    Functions
  User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
  Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
  Pig in Practice
    Parallelism
    Parameter Substitution

12. HBase
  HBasics
    Backdrop
  Concepts
    Whirlwind Tour of the Data Model
    Implementation
  Installation
    Test Drive
  Clients
    Java
    REST and Thrift
  Example
    Schemas
    Loading Data
    Web Queries
  HBase Versus RDBMS
    Successful Service
    HBase
    Use Case: HBase at streamy.com
  Praxis
    Versions

    Love and Hate: HBase and HDFS
    UI
    Metrics
    Schema Design

13. ZooKeeper
  Installing and Running ZooKeeper
  An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
  The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
  Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
  ZooKeeper in Production
    Resilience and Performance
    Configuration

14. Case Studies
  Hadoop Usage at Last.fm
    Last.fm: The Social Music Revolution
    Hadoop at Last.fm
    Generating Charts with Hadoop
    The Track Statistics Program
    Summary
  Hadoop and Hive at Facebook
    Introduction
    Hadoop at Facebook
    Hypothetical Use Case Studies
    Hive
    Problems and Future Work
  Nutch Search Engine

    Background
    Data Structures
    Selected Examples of Hadoop Data Processing in Nutch
    Summary
  Log Processing at Rackspace
    Requirements/The Problem
    Brief History
    Choosing Hadoop
    Collection and Storage
    MapReduce for Logs
  Cascading
    Fields, Tuples, and Pipes
    Operations
    Taps, Schemes, and Flows
    Cascading in Practice
    Flexibility
    Hadoop and Cascading at ShareThis
    Summary
  TeraByte Sort on Apache Hadoop

A. Installing Apache Hadoop
B. Cloudera's Distribution for Hadoop
C. Preparing the NCDC Weather Data

Index

COURSE CONTENT Big Data and Hadoop Training

1. Meet Hadoop
  Data!
  Data Storage and Analysis
  Comparison with Other Systems
    RDBMS
    Grid Computing
    Volunteer Computing
  A Brief History of Hadoop
  Apache Hadoop