Hadoop 101
Lars George
NoSQL Matters, Cologne, April 26, 2013
What's Ahead?
- Overview of Apache Hadoop (and related tools)
  - What it is
  - Why it's relevant
  - How it works
- No prior experience needed
- Feel free to ask questions
About Me
- Director EMEA Services @ Cloudera
  - Consulting on Hadoop projects (everywhere)
- Apache Committer
  - HBase and Whirr
- O'Reilly Author
  - HBase: The Definitive Guide (now in Japanese! 日本語版も出ました!)
- Contact
  - lars@cloudera.com
  - @larsgeorge
What is Apache Hadoop?
- A scalable data storage and processing system
- An open source Apache project
- Hadoop clusters are built from standard hardware
- Distributed and fault-tolerant
- Widely deployed by many organizations
What Does Apache Hadoop Offer?
- Core Hadoop offers two key features
  - Storage: Hadoop Distributed File System (HDFS)
  - Processing: MapReduce
- Other related tools provide additional capabilities
  - These include Hive, Sqoop, Flume, and Mahout
  - Collectively known as the Hadoop ecosystem
Why Do We Need Apache Hadoop?
- Let's explore this question first
- Then we'll delve into the technical details afterwards
We Generated Little Data Before
- Consider an evening of dinner and a movie (1992)
  - Look up the restaurant in the phone book
  - Consult a map (on paper!) for directions
  - Drive to the restaurant
  - Pay with cash
  - Check the newspaper for movie showtimes
  - Buy a ticket at the box office window
- Not much data is being generated here
- That's OK, because storage cost over $3,500/GB back then
We Generate Lots of Data Now
- Consider a similar evening in 2012
  - Look up the restaurant using Yelp on a mobile phone
  - Use the phone's map software to find the restaurant
  - Check into the restaurant on Foursquare
  - Pay with a credit card
  - Watch the movie online via Netflix streaming
  - Tweet about how bad the movie was
- Lots of data is being generated here
- That's OK, because storage only costs $0.05/GB now
The Value of Volume
- One tweet is an anecdote
  - But a million tweets may signal important trends
- One person's product review is an opinion
  - But a million reviews might uncover a design flaw
- One person's diagnosis is an isolated case
  - But a million medical records could identify the cause
- Traditional tools can't handle big data
  - But Hadoop scales well into the petabytes
How Are Organizations Using Hadoop?
Just a few examples...
- Analytics
- Product recommendations
- Ad targeting
- Fraud detection
- Natural language processing
- Route optimization
Where Did Hadoop Come From?
- A spinoff of Apache Nutch
- Inspired by two Google publications
  - "The Google File System"
  - "MapReduce: Simplified Data Processing on Large Clusters"
Hallmarks of Hadoop's Design
- Machine failure is unavoidable: embrace it
  - Build reliability into the system
- "More" is usually better than "faster"
  - Throughput is more important than latency
- Network bandwidth is a precious resource
- You have far more data than code
HDFS: Hadoop Distributed File System
- Inspired by the Google File System
- Provides low-cost storage for massive amounts of data
- Not a general-purpose filesystem
  - Highly optimized for processing data with MapReduce
  - Cannot modify file content once written
- It's actually a user-space Java process
  - Accessed using special commands or APIs
HDFS Blocks
- When data is loaded into HDFS, it's split into blocks
- Blocks are of a fixed size (64 MB by default)
  - These are huge compared to UNIX filesystem blocks
[Diagram: a 230 MB input file is split into Block 1 (64 MB), Block 2 (64 MB), Block 3 (64 MB), and Block 4 (38 MB)]
HDFS Replication
- Each block is then replicated to multiple machines
- The default replication factor is three (but configurable)
[Diagram: a 64 MB block stored on three of five slave nodes (A through E)]
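If you want to see the splitting and replication for yourself, HDFS ships an fsck tool that reports a file's blocks and their locations. A minimal sketch, assuming a file named /user/lars/logs.txt already exists in the cluster:

    # Lists each block of the file and the DataNodes holding its replicas
    $ hadoop fsck /user/lars/logs.txt -files -blocks -locations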
HDFS Demo
I will now demonstrate the following:
1. How to create a directory in HDFS
2. How to copy a local file to HDFS
3. How to display the contents of a file in HDFS
4. How to remove a file from HDFS
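For reference, a sketch of the commands behind this demo, using the hadoop fs utility; the directory and file names are only placeholders:

    # 1. Create a directory in HDFS
    $ hadoop fs -mkdir /user/lars/demo

    # 2. Copy a local file to HDFS
    $ hadoop fs -put access.log /user/lars/demo/

    # 3. Display the contents of a file in HDFS
    $ hadoop fs -cat /user/lars/demo/access.log

    # 4. Remove a file from HDFS
    $ hadoop fs -rm /user/lars/demo/access.log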
Basic HDFS Architecture
- NameNode: HDFS master daemon
  - Manages the namespace and metadata
  - Only one NameNode per cluster (* see the variations on the next slide)
- DataNode: HDFS slave daemon
  - Provides storage and retrieval for data blocks
HDFS Architectural Variations
- Secondary NameNode
  - Performs periodic merges of the NameNode's data
  - Despite the name, this does not provide failover
- High Availability (reliability)
  - Active/standby configuration
  - The Standby NameNode replaces the older Secondary NameNode
- HDFS Federation (scalability)
  - Multiple namespaces per cluster
  - Independent of High Availability
MapReduce Introduction
- MapReduce is a programming model
  - It's a way of processing data
  - You can implement MapReduce in any language
- MapReduce has its roots in functional programming
  - Many languages have functions named map and reduce
  - These functions serve largely the same purpose in Hadoop
- Popularized for large-scale processing by Google
- MapReduce processing in Hadoop is batch-oriented
MapReduce Benefits
- Scalability
  - Hadoop divides the processing job into individual tasks
  - Tasks execute in parallel (independently) across the cluster
- Simplicity
  - Each map or reduce call receives one input record
  - And emits zero or more output records
- Ease of use
  - Hadoop provides job scheduling and other infrastructure
  - You don't have to write any file or network I/O code
MapReduce: Data Locality by Code Routing
[Diagram: the JobTracker schedules tasks on the slave nodes (A through E) that already hold the relevant HDFS blocks, moving the code to the data rather than the data to the code]
MapReduce Architecture
- Like HDFS, MapReduce in Hadoop is master/slave
- JobTracker is the master daemon
  - One per cluster
  - Performs job scheduling and monitoring
- TaskTracker is the slave daemon
  - Many per cluster
  - Executes the individual tasks that make up a job
  - Co-located with the DataNode daemon (data locality)
MapReduce Code for Hadoop
- Usually written in Java
  - This uses Hadoop's API directly
  - Data is passed as parameters to the map and reduce methods
  - Output is emitted via Java method calls
- You can do basic MapReduce in other languages
  - Using the Hadoop Streaming wrapper program
  - Map and reduce functions use STDIN/STDOUT for data
  - Some advanced features require Java code
MapReduce Example in Python
- The following example uses Python
  - Via Hadoop Streaming
- It processes log files and summarizes events by type
- I'll explain both the data flow and the code
Job Input
Here's the job input:

    2012-09-06 22:16:49.391 CDT INFO "This can wait"
    2012-09-06 22:16:49.392 CDT INFO "Blah blah"
    2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
    2012-09-06 22:16:49.395 CDT INFO "More blather"
    2012-09-06 22:16:49.397 CDT WARN "Hey there"
    2012-09-06 22:16:49.398 CDT INFO "Spewing data"
    2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

- Each mapper gets a chunk of this data to process
- This chunk is called an InputSplit
Python Code for Map Function
Our map function will parse the event type, and then output that event (key) and a literal 1 (value):

    #!/usr/bin/env python
    import sys

    # Define the list of known log levels
    levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

    # Split every line (record) we receive on standard input
    # into fields, normalized by case
    for line in sys.stdin:
        fields = line.split()
        for field in fields:
            field = field.strip().upper()
            # If this field matches a log level, print it,
            # a tab separator, and the literal value 1
            if field in levels:
                print "%s\t1" % field
Output of Map Function
The map function produces key/value pairs as output:

    INFO    1
    INFO    1
    WARN    1
    INFO    1
    WARN    1
    INFO    1
    ERROR   1
Input to Reduce Function
The reducer receives a key and all values for that key:

    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

- Keys are always passed to reducers in sorted order
- Although not obvious here, the values are unordered
Python Code for Reduce Function
The reducer first extracts the key and value it was passed:

    #!/usr/bin/env python
    import sys

    # Initialize the loop variables
    previous_key = ''
    count = 0

    # Extract the key and value passed via standard input
    for line in sys.stdin:
        key, value = line.split()
        value = int(value)
        # continued on next slide
Python Code for Reduce Function (continued)
Then it simply adds up the values for each key:

        # continued from previous slide
        if key == previous_key:
            # If the key is unchanged, add to the running count
            count = count + value
        else:
            # If the key changed, print the sum for the previous key,
            # then re-initialize the loop variables (the new count
            # starts from this record's value, not a literal 1)
            if previous_key != '':
                print '%s\t%i' % (previous_key, count)
            previous_key = key
            count = value

    # Print the sum for the final key
    print '%s\t%i' % (previous_key, count)
Output of Reduce Function
Its output is a sum for each level:

    ERROR   1
    INFO    4
    WARN    2
Recap of Data Flow

Map input:
    2012-09-06 22:16:49.391 CDT INFO "This can wait"
    2012-09-06 22:16:49.392 CDT INFO "Blah blah"
    2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
    2012-09-06 22:16:49.395 CDT INFO "More blather"
    2012-09-06 22:16:49.397 CDT WARN "Hey there"
    2012-09-06 22:16:49.398 CDT INFO "Spewing data"
    2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
    INFO    1
    INFO    1
    WARN    1
    INFO    1
    WARN    1
    INFO    1
    ERROR   1

Reduce input:
    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

Reduce output:
    ERROR   1
    INFO    4
    WARN    2
How to Run a Hadoop Streaming Job
I'll demonstrate this now.
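For reference, a sketch of the invocation; the path to the streaming jar varies by Hadoop version and distribution, and the input and output paths here are only placeholders:

    # mapper.py and reducer.py must be executable (chmod +x);
    # -file ships them to the cluster with the job
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /user/lars/logs \
        -output /user/lars/logcounts \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py

    # Inspect the result written by the (single) reducer
    $ hadoop fs -cat /user/lars/logcounts/part-00000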
Hadoop Ecosystem Example: CDH
[Diagram: the component stack of CDH, Cloudera's Distribution including Apache Hadoop]
- Storage: HDFS, HBase
- Batch compute: MapReduce, MapReduce 2 (YARN)
- Real-time access and compute: Impala, Search, HBase
- Batch processing tools: Hive, Pig, Mahout, DataFu
- Integration: Sqoop, Flume, FUSE-DFS, WebHDFS/HttpFS, ODBC/JDBC
- Coordination and workflow: ZooKeeper, Oozie
- User interface: Hue
- Cloud: Whirr
- Management: Cloudera Manager, Cloudera Navigator (audit and access in v1.0; lineage and lifecycle to follow)
Apache Flume
- Copies data into HDFS as it's generated
- Can handle a variety of input sources
  - Data appended to log files
  - UNIX syslog
  - Output from programs
  - Data received on network ports
  - Custom sources
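As an illustration of how such a source is wired up, here is a minimal sketch of a Flume NG agent configuration that tails a log file into HDFS; the agent name, log file, and HDFS path are assumptions:

    # flume.conf: one source, one in-memory channel, one HDFS sink
    agent.sources = logsrc
    agent.channels = mem
    agent.sinks = hdfssink

    # Tail an application log file as it is written
    agent.sources.logsrc.type = exec
    agent.sources.logsrc.command = tail -F /var/log/app.log
    agent.sources.logsrc.channels = mem

    # Buffer events in memory between source and sink
    agent.channels.mem.type = memory

    # Write the events into HDFS
    agent.sinks.hdfssink.type = hdfs
    agent.sinks.hdfssink.hdfs.path = hdfs://namenode/user/lars/logs/
    agent.sinks.hdfssink.channel = mem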
Apache Sqoop
- Database integration for HDFS
  - Compatible with nearly any database via a JDBC driver
  - High-performance custom connectors are available for some databases
- Can import database tables into HDFS
  - All tables from a database, a single table, or a portion of a table
- Can also export data from HDFS back to a database
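A minimal sketch of a table import, assuming a MySQL database named sales with an orders table; the host, credentials, and target directory are placeholders:

    $ sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username dbuser --password dbpass \
        --table orders \
        --target-dir /user/lars/orders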
Apache Hive and Apache Pig
- High-level processing for data stored in HDFS
  - Hive uses a SQL-like language called HiveQL
  - Pig uses a more procedural language called Pig Latin
- An alternative to writing MapReduce code
  - Reduces development time and increases productivity
- But they have the same latency as MapReduce
  - Because they generate MapReduce jobs that run on the cluster
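For comparison, the entire log-summarizing example above collapses into a single HiveQL query. A sketch, assuming a table named logs with a level column has already been defined over the raw files:

    $ hive -e 'SELECT level, COUNT(*) FROM logs GROUP BY level;'

Behind the scenes, Hive compiles this into the same kind of MapReduce job we wrote by hand earlier.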
Apache HBase
- High-performance NoSQL database built on HDFS
- Based on Google's BigTable paper
- Very scalable
- Low-latency data access
- No high-level query language
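Access is instead via APIs and the interactive HBase shell; a quick sketch, using a hypothetical table:

    $ hbase shell
    hbase> create 'webtable', 'contents'
    hbase> put 'webtable', 'com.example/', 'contents:html', '<html>...</html>'
    hbase> get 'webtable', 'com.example/'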
Cloudera's Impala
- Offers the benefits of both Hive and HBase
  - Scalability
  - Performance
  - High-level query language (a subset of SQL-92)
- Announced at Hadoop World + Strata in October
- Open source and available under the Apache License
  - Download the beta from the Cloudera Web site
Other Notable Ecosystem Components
- Apache Whirr
  - Libraries for running cloud-based services
- Apache Mahout
  - Scalable machine learning libraries
- Apache Oozie
  - Workflow management for Hadoop jobs
- Apache ZooKeeper
  - Synchronization services for distributed systems
Typical Stack Architectures
[Diagram: three common stacks, each fed by an ingest layer (Sqoop, Flume, WebHDFS/HttpFS) and drained by an outgest layer (Sqoop, Flume, WebHDFS/HttpFS, ODBC/JDBC)]
- Batch with read-only data: Hive, Pig, and MapReduce (Java) over HDFS
- Batch with random writes: Hive, Pig, and MapReduce (Java) over HBase on HDFS
- Batch or real-time query: Impala and Hive alongside Pig and MapReduce (Java) over HDFS
Typical Hadoop Data Pipeline
[Diagram: data flows from the data sources via Sqoop and Flume into HDFS as original source data; Pig, Hive, and MapReduce jobs, coordinated by Oozie, produce results or calculated data; Sqoop then exports these to a data warehouse and data marts]
Conclusion
- Apache Hadoop: scalable data storage and processing
  - HDFS (storage)
  - MapReduce (processing)
- The Hadoop ecosystem includes additional tools
  - They help integrate Hadoop with other systems
  - They make it easier to analyze data in HDFS
Next Steps
- Cloudera's Distribution including Apache Hadoop (CDH)
  - Not just Hadoop, but also Hive, Pig, HBase, Mahout, etc.
  - Free and 100% open source (Apache license)
  - Easy-to-install packages
  - You can also download a virtual machine with CDH pre-installed
Highly Recommended Books
- Hadoop: The Definitive Guide, Tom White (ISBN: 1-449-31152-0)
- Hadoop Operations, Eric Sammer (ISBN: 1-449-32705-2)
Thank You!
Lars George, Director EMEA Services
Cloudera, Inc.