Hadoop: The Definitive Guide

Tom White
Foreword by Doug Cutting

O'REILLY
Beijing, Cambridge, Farnham, Köln, Sebastopol, Taipei, Tokyo

Table of Contents

Foreword
Preface

1. Meet Hadoop
    Data!
    Data Storage and Analysis
    Comparison with Other Systems
    RDBMS
    Grid Computing
    Volunteer Computing
    A Brief History of Hadoop
    The Apache Hadoop Project

2. MapReduce
    A Weather Dataset
    Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
    Map and Reduce
    Java MapReduce
    Scaling Out
    Data Flow
    Combiner Functions
    Running a Distributed MapReduce Job
    Hadoop Streaming
    Ruby
    Python
    Hadoop Pipes
    Compiling and Running

3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
    Blocks
    Namenodes and Datanodes
    The Command-Line Interface
    Basic Filesystem Operations
    Hadoop Filesystems
    Interfaces
    The Java Interface
    Reading Data from a Hadoop URL
    Reading Data Using the FileSystem API
    Writing Data
    Directories
    Querying the Filesystem
    Deleting Data
    Data Flow
    Anatomy of a File Read
    Anatomy of a File Write
    Coherency Model
    Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced
    Hadoop Archives
    Using Hadoop Archives
    Limitations

4. Hadoop I/O
    Data Integrity
    Data Integrity in HDFS
    LocalFileSystem
    ChecksumFileSystem
    Compression
    Codecs
    Compression and Input Splits
    Using Compression in MapReduce
    Serialization
    The Writable Interface
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
    File-Based Data Structures
    SequenceFile
    MapFile

5. Developing a MapReduce Application
    The Configuration API
    Combining Resources
    Variable Expansion
    Configuring the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test
    Mapper
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Using a Remote Debugger
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    Running Dependent Jobs

6. How MapReduce Works
    Anatomy of a MapReduce Job Run
    Job Submission
    Job Initialization
    Task Assignment
    Task Execution
    Progress and Status Updates
    Job Completion
    Failures
    Task Failure
    Tasktracker Failure
    Jobtracker Failure
    Job Scheduling
    The Fair Scheduler
    Shuffle and Sort
    The Map Side
    The Reduce Side
    Configuration Tuning
    Task Execution
    Speculative Execution
    Task JVM Reuse
    Skipping Bad Records
    The Task Execution Environment

7. MapReduce Types and Formats
    MapReduce Types
    The Default MapReduce Job
    Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
    Output Formats
    Text Output
    Binary Output
    Multiple Outputs
    Lazy Output
    Database Output

8. MapReduce Features
    Counters
    Built-in Counters
    User-Defined Java Counters
    User-Defined Streaming Counters
    Sorting
    Preparation
    Partial Sort
    Total Sort
    Secondary Sort
    Joins
    Map-Side Joins
    Reduce-Side Joins
    Side Data Distribution
    Using the Job Configuration
    Distributed Cache
    MapReduce Library Classes

9. Setting Up a Hadoop Cluster
    Cluster Specification
    Network Topology
    Cluster Setup and Installation
    Installing Java
    Creating a Hadoop User
    Installing Hadoop
    Testing the Installation
    SSH Configuration
    Hadoop Configuration
    Configuration Management
    Environment Settings
    Important Hadoop Daemon Properties
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
    Post Install
    Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs
    Hadoop in the Cloud
    Hadoop on Amazon EC2

10. Administering Hadoop
    HDFS
    Persistent Data Structures
    Safe Mode
    Audit Logging
    Tools
    Monitoring
    Logging
    Metrics
    Java Management Extensions
    Maintenance
    Routine Administration Procedures
    Commissioning and Decommissioning Nodes
    Upgrades

11. Pig
    Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
    An Example
    Generating Examples
    Comparison with Databases
    Pig Latin
    Structure
    Statements
    Expressions
    Types
    Schemas
    Functions
    User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
    Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
    Pig in Practice
    Parallelism
    Parameter Substitution

12. HBase
    HBasics
    Backdrop
    Concepts
    Whirlwind Tour of the Data Model
    Implementation
    Installation
    Test Drive
    Clients
    Java
    REST and Thrift
    Example
    Schemas
    Loading Data
    Web Queries
    HBase Versus RDBMS
    Successful Service
    HBase
    Use Case: HBase at streamy.com
    Praxis
    Versions
    Love and Hate: HBase and HDFS
    UI
    Metrics
    Schema Design

13. ZooKeeper
    Installing and Running ZooKeeper
    An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
    The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
    Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
    ZooKeeper in Production
    Resilience and Performance
    Configuration

14. Case Studies
    Hadoop Usage at Last.fm
    Last.fm: The Social Music Revolution
    Hadoop at Last.fm
    Generating Charts with Hadoop
    The Track Statistics Program
    Summary
    Hadoop and Hive at Facebook
    Introduction
    Hadoop at Facebook
    Hypothetical Use Case Studies
    Hive
    Problems and Future Work
    Nutch Search Engine
    Background
    Data Structures
    Selected Examples of Hadoop Data Processing in Nutch
    Summary
    Log Processing at Rackspace
    Requirements/The Problem
    Brief History
    Choosing Hadoop
    Collection and Storage
    MapReduce for Logs
    Cascading
    Fields, Tuples, and Pipes
    Operations
    Taps, Schemes, and Flows
    Cascading in Practice
    Flexibility
    Hadoop and Cascading at ShareThis
    Summary
    TeraByte Sort on Apache Hadoop

A. Installing Apache Hadoop
B. Cloudera's Distribution for Hadoop
C. Preparing the NCDC Weather Data

Index