Hadoop Beginner's Guide
Learn how to crunch big data to extract meaning from the data avalanche
Garry Turkington

PACKT PUBLISHING
open source | community experience distilled

BIRMINGHAM - MUMBAI

Preface 1

Chapter 1: What It's All About 7
Big data processing 8
The value of data 8
Historically for the few and not the many 9
Classic data processing systems 9
Limiting factors 10
A different approach 11
All roads lead to scale-out 11
Share nothing 11
Expect failure 12
Smart software, dumb hardware 13
Move processing, not data 13
Build applications, not infrastructure 14
Hadoop 15
Thanks, Google 15
Thanks, Doug 15
Thanks, Yahoo 15
Parts of Hadoop 15
Common building blocks 16
HDFS 16
MapReduce 17
Better together 18
Common architecture 19
What it is and isn't good for 19
Cloud computing with Amazon Web Services 20
Too many clouds 20
A third way 20
Different types of costs 21
AWS - infrastructure on demand from Amazon 22
Elastic Compute Cloud (EC2) 22
Simple Storage Service (S3) 22
Elastic MapReduce (EMR) 22
What this book covers 23
A dual approach 23
Summary 24

Chapter 2: Getting Hadoop Up and Running 25
Hadoop on a local Ubuntu host 25
Other operating systems 26
Time for action - checking the prerequisites 26
Setting up Hadoop 27
A note on versions 27
Time for action - downloading Hadoop 28
Time for action - setting up SSH 29
Configuring and running Hadoop 30
Time for action - using Hadoop to calculate Pi 30
Three modes 32
Time for action - configuring the pseudo-distributed mode 32
Configuring the base directory and formatting the filesystem 34
Time for action - changing the base HDFS directory 34
Time for action - formatting the NameNode 35
Starting and using Hadoop 36
Time for action - starting Hadoop 36
Time for action - using HDFS 38
Time for action - WordCount, the Hello World of MapReduce 39
Monitoring Hadoop from the browser 42
The HDFS web UI 42
Using Elastic MapReduce 45
Setting up an account on Amazon Web Services 45
Creating an AWS account 45
Signing up for the necessary services 45
Time for action - WordCount on EMR using the management console 46
Other ways of using EMR 54
AWS credentials 54
The EMR command-line tools 54
The AWS ecosystem 55
Comparison of local versus EMR Hadoop 55
Summary 56

Chapter 3: Understanding MapReduce 57
Key/value pairs 57
What it means 57
Why key/value data? 58
Some real-world examples 59
MapReduce as a series of key/value transformations 59
The Hadoop Java API for MapReduce 60
The 0.20 MapReduce Java API 61
The Mapper class 61
The Reducer class 62
The Driver class 63
Writing MapReduce programs 64
Time for action - setting up the classpath 65
Time for action - implementing WordCount 65
Time for action - building a JAR file 68
Time for action - running WordCount on a local Hadoop cluster 68
Time for action - running WordCount on EMR 69
The pre-0.20 Java MapReduce API 72
Hadoop-provided mapper and reducer implementations 73
Time for action - WordCount the easy way 73
Walking through a run of WordCount 75
Startup 75
Splitting the input 75
Task assignment 75
Task startup 76
Ongoing JobTracker monitoring 76
Mapper input 76
Mapper execution 77
Mapper output and reduce input 77
Partitioning 77
The optional partition function 78
Reducer input 78
Reducer execution 79
Reducer output 79
Shutdown 79
That's all there is to it! 80
Apart from the combiner...maybe 80
Why have a combiner? 80
Time for action - WordCount with a combiner 80
When you can use the reducer as the combiner 81
Time for action - fixing WordCount to work with a combiner 81
Reuse is your friend 82
Hadoop-specific data types 83
The Writable and WritableComparable interfaces 83
Introducing the wrapper classes 84
Primitive wrapper classes 85
Array wrapper classes 85
Map wrapper classes 85
Time for action - using the Writable wrapper classes 86
Other wrapper classes 88
Making your own 88
Input/output 88
Files, splits, and records 89
InputFormat and RecordReader 89
Hadoop-provided InputFormat 90
Hadoop-provided RecordReader 90
Output formats and RecordWriter 91
Hadoop-provided OutputFormat 91
Don't forget Sequence files 91
Summary 92

Chapter 4: Developing MapReduce Programs 93
Using languages other than Java with Hadoop 94
How Hadoop Streaming works 94
Why to use Hadoop Streaming 94
Time for action - WordCount using Streaming 95
Differences in jobs when using Streaming 97
Analyzing a large dataset 98
Getting the UFO sighting dataset 98
Getting a feel for the dataset 99
Time for action - summarizing the UFO data 99
Examining UFO shapes 101
Time for action - summarizing the shape data 102
Time for action - correlating sighting duration to UFO shape 103
Using Streaming scripts outside Hadoop 106
Time for action - performing the shape/time analysis from the command line 107
Java shape and location analysis 107
Time for action - using ChainMapper for field validation/analysis 108
Too many abbreviations 112
Using the Distributed Cache 113
Time for action - using the Distributed Cache to improve location output 114
Counters, status, and other output 117
Time for action - creating counters, task states, and writing log output 118
Too much information! 125
Summary 126

Chapter 5: Advanced MapReduce Techniques 127
Simple, advanced, and in-between 127
Joins 128
When this is a bad idea 128
Map-side versus reduce-side joins 128
Matching account and sales information 129
Time for action - reduce-side joins using MultipleInputs 129
DataJoinMapper and TaggedMapperOutput 134
Implementing map-side joins 135
Using the Distributed Cache 135
Pruning data to fit in the cache 135
Using a data representation instead of raw data 136
Using multiple mappers 136
To join or not to join... 137
Graph algorithms 137
Graph 101 138
Graphs and MapReduce - a match made somewhere 138
Representing a graph 139
Time for action - representing the graph 140
Overview of the algorithm 140
The mapper 141
The reducer 141
Iterative application 141
Time for action - creating the source code 142
Time for action - the first run 146
Time for action - the second run 147
Time for action - the third run 148
Time for action - the fourth and last run 149
Running multiple jobs 151
Final thoughts on graphs 151
Using language-independent data structures 151
Candidate technologies 152
Introducing Avro 152
Time for action - getting and installing Avro 152
Avro and schemas 154
Time for action - defining the schema 154
Time for action - creating the source Avro data with Ruby 155
Time for action - consuming the Avro data with Java 156
Using Avro within MapReduce 158
Time for action - generating shape summaries in MapReduce 158
Time for action - examining the output data with Ruby 163
Time for action - examining the output data with Java 163
Going further with Avro 165
Summary 166
Chapter 6: When Things Break 167
Failure 167
Embrace failure 168
Or at least don't fear it 168
Don't try this at home 168
Types of failure 168
Hadoop node failure 168
The dfsadmin command 169
Cluster setup, test files, and block sizes 169
Fault tolerance and Elastic MapReduce 170
Time for action - killing a DataNode process 170
NameNode and DataNode communication 173
Time for action - the replication factor in action 174
Time for action - intentionally causing missing blocks 176
When data may be lost 178
Block corruption 179
Time for action - killing a TaskTracker process 180
Comparing the DataNode and TaskTracker failures 183
Permanent failure 184
Killing the cluster masters 184
Time for action - killing the JobTracker 184
Starting a replacement JobTracker 185
Time for action - killing the NameNode process 186
Starting a replacement NameNode 188
The role of the NameNode in more detail 188
File systems, files, blocks, and nodes 188
The single most important piece of data in the cluster - fsimage 189
DataNode startup 189
Safe mode 190
SecondaryNameNode 190
So what to do when the NameNode process has a critical failure? 190
BackupNode/CheckpointNode and NameNode HA 191
Hardware failure 191
Host failure 191
Host corruption 192
The risk of correlated failures 192
Task failure due to software 192
Failure of slow-running tasks 192
Time for action - causing task failure 193
Hadoop's handling of slow-running tasks 195
Speculative execution 195
Hadoop's handling of failing tasks 195
Task failure due to data 196
Handling dirty data through code 196
Using Hadoop's skip mode 197
Time for action - handling dirty data by using skip mode 197
To skip or not to skip... 202
Summary 202

Chapter 7: Keeping Things Running 205
A note on EMR 206
Hadoop configuration properties 206
Default values 206
Time for action - browsing default properties 206
Additional property elements 208
Default storage location 208
Where to set properties 209
Setting up a cluster 209
How many hosts? 210
Calculating usable space on a node 210
Location of the master nodes 211
Sizing hardware 211
Processor/memory/storage ratio 211
EMR as a prototyping platform 212
Special node requirements 213
Storage types 213
Commodity versus enterprise class storage 214
Single disk versus RAID 214
Finding the balance 214
Network storage 214
Hadoop networking configuration 215
How blocks are placed 215
Rack awareness 216
Time for action - examining the default rack configuration 216
Time for action - adding a rack awareness script 217
What is commodity hardware anyway? 219
Cluster access control 220
The Hadoop security model 220
Time for action - demonstrating the default security 220
User identity 223
More granular access control 224
Working around the security model via physical access control 224
Managing the NameNode 224
Configuring multiple locations for the fsimage class 225
Time for action - adding an additional fsimage location 225
Where to write the fsimage copies 226
Swapping to another NameNode host 227
Having things ready before disaster strikes 227
Time for action - swapping to a new NameNode host 227
Don't celebrate quite yet! 229
What about MapReduce? 229
Managing HDFS 230
Where to write data 230
Using balancer 230
When to rebalance 230
MapReduce management 231
Command line job management 231
Job priorities and scheduling 231
Time for action - changing job priorities and killing a job 232
Alternative schedulers 233
Capacity Scheduler 233
Fair Scheduler 234
Enabling alternative schedulers 234
When to use alternative schedulers 234
Scaling 235
Adding capacity to a local Hadoop cluster 235
Adding capacity to an EMR job flow 235
Expanding a running job flow 235
Summary 236

Chapter 8: A Relational View on Data with Hive 237
Overview of Hive 237
Why use Hive? 238
Thanks, Facebook! 238
Setting up Hive 238
Prerequisites 238
Getting Hive 239
Time for action - installing Hive 239
Using Hive 241
Time for action - creating a table for the UFO data 241
Time for action - inserting the UFO data 244
Validating the data 246
Time for action - validating the table 246
Time for action - redefining the table with the correct column separator 248
Hive tables - real or not? 250
Time for action - creating a table from an existing file 250
Time for action - performing a join 252
Hive and SQL views 254
Time for action - using views 254
Handling dirty data in Hive 257
Time for action - exporting query output 258
Partitioning the table 260
Time for action - making a partitioned UFO sighting table 260
Bucketing, clustering, and sorting... oh my! 264
User Defined Function 264
Time for action - adding a new User Defined Function (UDF) 265
To preprocess or not to preprocess... 268
Hive versus Pig 269
What we didn't cover 269
Hive on Amazon Web Services 270
Time for action - running UFO analysis on EMR 270
Using interactive job flows for development 277
Integration with other AWS products 278
Summary 278

Chapter 9: Working with Relational Databases 279
Common data paths 279
Hadoop as an archive store 280
Hadoop as a preprocessing step 280
Hadoop as a data input tool 281
The serpent eats its own tail 281
Setting up MySQL 281
Time for action - installing and setting up MySQL 281
Did it have to be so hard? 284
Time for action - configuring MySQL to allow remote connections 285
Don't do this in production! 286
Time for action - setting up the employee database 286
Be careful with data file access rights 287
Getting data into Hadoop 287
Using MySQL tools and manual import 288
Accessing the database from the mapper 288
A better way - introducing Sqoop 289
Time for action - downloading and configuring Sqoop 289
Sqoop and Hadoop versions 290
Sqoop and HDFS 291
Time for action - exporting data from MySQL to HDFS 291
Sqoop's architecture 294
Importing data into Hive using Sqoop 294
Time for action - exporting data from MySQL into Hive 295
Time for action - a more selective import 297
Datatype issues 298
Time for action - using a type mapping 299
Time for action - importing data from a raw query 300
Sqoop and Hive partitions 302
Field and line terminators 302
Getting data out of Hadoop 303
Writing data from within the reducer 303
Writing SQL import files from the reducer 304
A better way - Sqoop again 304
Time for action - importing data from Hadoop into MySQL 304
Differences between Sqoop imports and exports 306
Inserts versus updates 307
Sqoop and Hive exports 307
Time for action - importing Hive data into MySQL 308
Time for action - fixing the mapping and rerunning the export 310
Other Sqoop features 312
AWS considerations 313
Considering RDS 313
Summary 314

Chapter 10: Data Collection with Flume 315
A note about AWS 315
Data data everywhere 316
Types of data 316
Getting network traffic into Hadoop 316
Time for action - getting web server data into Hadoop 316
Getting files into Hadoop 318
Hidden issues 318
Keeping network data on the network 318
Hadoop dependencies 318
Reliability 318
Recreating the wheel 318
A common framework approach 319
Introducing Apache Flume 319
A note on versioning 319
Time for action - installing and configuring Flume 320
Using Flume to capture network data 321
Time for action - capturing network traffic to a log file 321
Time for action - logging to the console 324
Writing network data to log files 326
Time for action - capturing the output of a command in a flat file 326
Logs versus files 327
Time for action - capturing a remote file in a local flat file 328
Sources, sinks, and channels 330
Sources 330
Sinks 330
Channels 330
Or roll your own 331
Understanding the Flume configuration files 331
It's all about events 332
Time for action - writing network traffic onto HDFS 333
Time for action - adding timestamps 335
To Sqoop or to Flume... 337
Multi-level Flume networks 338
Time for action - writing to multiple sinks 340
Selectors - replicating and multiplexing 342
Handling sink failure 342
Next, the world 343
The bigger picture 343
Data lifecycle 343
Staging data 344
Scheduling 344
Summary 345

Chapter 11: Where to Go Next 347
What we did and didn't cover in this book 347
Upcoming Hadoop changes 348
Alternative distributions 349
Why alternative distributions? 349
Bundling 349
Free and commercial extensions 349
Choosing a distribution 351
Other Apache projects 352
HBase 352
Oozie 352
Whirr 353
Mahout 353
MRUnit 354
Other programming abstractions 354
Pig 354
Cascading 354
AWS resources 355
HBase on EMR 355
SimpleDB 355
DynamoDB 355
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary

Appendix: Pop Quiz Answers
Chapter 3, Understanding MapReduce
Chapter 7, Keeping Things Running

Index