Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah
Contents
About the Authors xix
About the Technical Reviewer xxi
Acknowledgments xxiii
Introduction xxv

Chapter 1: Motivation for Big Data 1
What Is Big Data? 1
Key Idea Behind Big Data Techniques 2
Data Is Distributed Across Several Nodes 2
Applications Are Moved to the Data 3
Data Is Processed Local to a Node 3
Sequential Reads Preferred Over Random Reads 3
An Example 4
Big Data Programming Models 4
Massively Parallel Processing (MPP) Database Systems 4
In-Memory Database Systems 5
MapReduce Systems 5
Bulk Synchronous Parallel (BSP) Systems 6
Big Data and Transactional Systems 7
How Much Can We Scale? 8
A Compute-Intensive Example 8
Amdahl's Law 9
Business Use-Cases for Big Data 9
Summary 10

Chapter 2: Hadoop Concepts 11
Introducing Hadoop 11
Introducing the MapReduce Model 12
Components of Hadoop 16
Hadoop Distributed File System (HDFS) 17
Secondary NameNode 22
TaskTracker 23
JobTracker 23
Hadoop 2.0 24
Components of YARN 26
HDFS High Availability 29
Summary 30

Chapter 3: Getting Started with the Hadoop Framework 31
Types of Installation 31
Stand-Alone Mode 31
Pseudo-Distributed Cluster 32
Multinode Node Cluster Installation 32
Preinstalled Using Amazon Elastic MapReduce 32
Setting up a Development Environment with a Cloudera Virtual Machine 33
Components of a MapReduce program 34
Your First Hadoop Program 34
Prerequisites to Run Programs in Local Mode 35
WordCount Using the Old API 36
Building the Application 38
Running WordCount in Cluster Mode 39
WordCount Using the New API 39
Building the Application 41
Running WordCount in Cluster Mode 41
Third-Party Libraries in Hadoop Jobs 41
Summary 46

Chapter 4: Hadoop Administration 47
Hadoop Configuration Files 47
Configuring Hadoop Daemons 48
Precedence of Hadoop Configuration Files 49
Diving into Hadoop Configuration Files 49
core-site.xml 50
hdfs-*.xml 51
mapred-site.xml 52
yarn-site.xml 54
Memory Allocations in YARN 55
Scheduler 56
Capacity Scheduler 57
Fair Scheduler 59
Fair Scheduler Configuration 60
yarn-site.xml Configurations 61
Allocation File Format and Configurations 62
Determine Dominant Resource Share in drf Policy 63
Slaves File 64
Rack Awareness 64
Providing Hadoop with Network Topology 64
Cluster Administration Utilities 65
Check the HDFS 66
Command-Line HDFS Administration 68
Rebalancing HDFS Data 70
Copying Large Amounts of Data from the HDFS 71
Summary 72

Chapter 5: Basics of MapReduce Development 73
Hadoop and Data Processing 73
Reviewing the Airline Dataset 73
Preparing the Development Environment 75
Preparing the Hadoop System 75
MapReduce Programming Patterns 76
Map-Only Jobs (SELECT and WHERE Queries) 76
Problem Definition: SELECT Clause 76
Problem Definition: WHERE Clause 84
Map and Reduce Jobs (Aggregation Queries) 87
Problem Definition: GROUP BY and SUM Clauses 88
Improving Aggregation Performance Using the Combiner 94
Problem Definition: Optimized Aggregators 95
Role of the Partitioner 100
Problem Definition: Split Airline Data by Month 100
Bringing it All Together 103
Summary 106

Chapter 6: Advanced MapReduce Development 107
MapReduce Programming Patterns 107
Introduction to Hadoop I/O 107
Problem Definition: Sorting 109
Problem Definition: Analyzing Consecutive Records 124
Problem Definition: Join Using MapReduce 134
Problem Definition: Join Using Map-Only jobs 140
Writing to Multiple Output Files in a Single MR Job 145
Collecting Statistics Using Counters 147
Summary 150

Chapter 7: Hadoop Input/Output 151
Compression Schemes 151
What Can Be Compressed? 152
Compression Schemes 152
Enabling Compression 153
Inside the Hadoop I/O processes 154
InputFormat 155
OutputFormat 156
Custom OutputFormat: Conversion from Text to XML 157
Custom InputFormat: Consuming a Custom XML file 161
Hadoop Files 170
SequenceFile 171
MapFiles 176
Avro Files 177
Summary 183

Chapter 8: Testing Hadoop Programs 185
Revisiting the Word Counter 185
Introducing MRUnit 187
Installing MRUnit 187
MRUnit Core Classes 187
Writing an MRUnit Test Case 188
Testing Counters 190
Features of MRUnit 193
Limitations of MRUnit 194
Testing with LocalJobRunner 194
Limitations of LocalJobRunner 197
Testing with MiniMRCluster 197
Setting up the Development Environment 197
Example for MiniMRCluster 199
Limitations of MiniMRCluster 201
Testing MR Jobs with Access Network Resources 201
Summary 202

Chapter 9: Monitoring Hadoop 203
Writing Log Messages in Hadoop MapReduce Jobs 203
Viewing Log Messages in Hadoop MapReduce Jobs 206
User Log Management in Hadoop 2.x 209
Log Storage in Hadoop 2.x 209
Log Management Improvements 211
Viewing Logs Using Web-Based UI 211
Command-Line Interface 211
Log Retention 212
Hadoop Cluster Performance Monitoring 212
Using YARN REST APIs 213
Managing the Hadoop Cluster Using Vendor Tools 213
Ambari Architecture 214
Summary 215

Chapter 10: Data Warehousing Using Hadoop 217
Apache Hive 217
Installing Hive 218
Hive Architecture 218
Metastore 219
Compiler Basics 219
Hive Concepts 219
HiveQL Compiler Details 223
Data Definition Language 227
Data Manipulation Language 228
External Interfaces 229
Hive Scripts 231
Performance 232
MapReduce Integration 232
Creating Partitions 233
User-Defined Functions 234
Impala 236
Impala Architecture 237
Impala Features 237
Impala Limitations 237
Shark 238
Shark/Spark Architecture 238
Summary 239

Chapter 11: Data Processing Using Pig 241
An Introduction to Pig 241
Running Pig 243
Executing in the Grunt Shell 244
Executing a Pig Script 244
Embedded Java Program 245
Pig Latin 246
Comments in a Pig Script 246
Execution of Pig Statements 247
Pig Commands 247
User-Defined Functions 252
Eval Functions Invoked in the Mapper 253
Eval Functions Invoked in the Reducer 253
Writing and Using a Custom FilterFunc 260
Comparison of PIG versus Hive 262
Crunch API 263
How Crunch Differs from Pig 263
Sample Crunch Pipeline 264
Summary 269

Chapter 12: HCatalog and Hadoop in the Enterprise 271
HCatalog and Enterprise Data Warehouse Users 271
HCatalog: A Brief Technical Background 272
HCatalog Command-Line Interface 274
WebHCat 274
HCatalog Interface for MapReduce 275
HCatalog Interface for Pig 278
HCatalog Notification Interface 279
Security and Authorization in HCatalog 279
Bringing It All Together 280
Summary 281

Chapter 13: Log Analysis Using Hadoop 283
Log File Analysis Applications 283
Web Analytics 283
Security Compliance and Forensics 284
Monitoring and Alerts 284
Internet of Things 285
Analysis Steps 286
Load 286
Refine 286
Visualize 287
Apache Flume 287
Core Concepts 288
Netflix Suro 290
Cloud Solutions 291
Summary 291

Chapter 14: Building Real-Time Systems Using HBase 293
What Is HBase? 293
Typical HBase Use-Case Scenarios 294
HBase Data Model 295
HBase Logical or Client-Side View 295
Differences Between HBase and RDBMSs 296
HBase Tables 297
HBase Cells 297
HBase Column Family 297
HBase Commands and APIs 298
Getting a Command List: help Command 299
Creating a Table: create Command 300
Adding Rows to a Table: put Command 300
Retrieving Rows from the Table: get Command 300
Reading Multiple Rows: scan Command 300
Counting the Rows in the Table: count Command 301
Deleting Rows: delete Command 301
Truncating a Table: truncate Command 301
Dropping a Table: drop Command 302
Altering a Table: alter Command 302
HBase Architecture 302
HBase Components 303
Compaction and Splits in HBase 309
Compaction 310
HBase Configuration: An Overview 311
hbase-default.xml and hbase-site.xml 311
HBase Application Design 312
Tall vs. Wide vs. Narrow Table Design 312
Row Key Design 313
HBase Operations Using Java API 314
HBase Treats Everything as Bytes 314
Create an HBase Table 315
Administrative Functions Using HBaseAdmin 315
Accessing Data Using the Java API 316
HBase MapReduce Integration 320
A MapReduce Job to Read an HBase Table 320
HBase and MapReduce Clusters 323
Scenario I: Frequent MapReduce Jobs Against HBase Tables 323
Scenario II: HBase and MapReduce have Independent SLAs 323
Summary 323

Chapter 15: Data Science with Hadoop 325
Hadoop Data Science Methods 325
Apache Hama 326
Bulk Synchronous Parallel Model 326
Hama Hello World! 327
Monte Carlo Methods 329
K-Means Clustering 333
Apache Spark 336
Resilient Distributed Datasets (RDDs) 336
Monte Carlo with Spark 337
KMeans with Spark 339
RHadoop 341
Summary 342

Chapter 16: Hadoop in the Cloud 343
Economics 343
Self-Hosted Cluster 343
Cloud-Hosted Cluster 344
Elasticity 344
On Demand 344
Bid Pricing 345
Hybrid Cloud 345
Logistics 345
Ingress/Egress 345
Data Retention 345
Security 346
Cloud Usage Models 346
Cloud Providers 347
Amazon Web Services 347
Google Cloud Platform 349
Microsoft Azure 350
Choosing a Cloud Vendor 350
Case Study: Amazon Web Services 351
Elastic MapReduce 351
Elastic Compute Cloud 354
Summary 356

Chapter 17: Building a YARN Application 357
YARN: A General-Purpose Distributed System 357
YARN: A Quick Review 359
Creating a YARN Application 361
POM Configuration 362
DownloadService.java Class 362
Client.java 365
Steps to Launch the Application Master from the Client 365
ApplicationMaster.java 373
Communication Protocol between Application Master and Resource Manager: Application Master Protocol 373
Node Manager Communication Protocol: Container Management Protocol 373
Steps to Launch the Worker Tasks 373
Executing the Application Master 378
Launch the Application in Un-Managed Mode 379
Launch the Application in Managed Mode 379
Summary 379

Appendix A: Installing Hadoop 381
Installing Hadoop 2.2.0 on Windows 381
Preparing the Installation Environment 381
Building Hadoop 2.2.0 for Windows 383
Installing Hadoop 2.2.0 for Windows 383
Configuring Hadoop 2.2.0 383
Preparing the Hadoop Cluster 386
Starting HDFS 387
Starting MapReduce (YARN) 387
Verifying that the Cluster Is Running 387
Testing the Cluster 387
Installing Hadoop 2.2.0 on Linux 388

Appendix B: Using Maven with Eclipse 391
A Quick Introduction to Maven 391
Creating a Maven Project 391
Using Maven with Eclipse 393
Installing the m2e Maven Eclipse Plug-in 393
Creating a Maven Project from Eclipse 393
Building a Maven Project from Eclipse 396

Appendix C: Apache Ambari 399
Hadoop Components Supported by Apache Ambari 399
Installing Apache Ambari 401
Trying the Ambari Sandbox on Your OS 401

Index 403