Hadoop 101
Lars George
NoSQL Matters, Cologne, April 26, 2013
What's Ahead?
- Overview of Apache Hadoop (and related tools)
  - What it is
  - Why it's relevant
  - How it works
- No prior experience needed
- Feel free to ask questions
About Me
- Director EMEA Services @ Cloudera
  - Consulting on Hadoop projects (everywhere)
- Apache Committer
  - HBase and Whirr
- O'Reilly Author
  - HBase: The Definitive Guide (now in Japanese! 日本語版も出ました!)
- Contact
  - lars@cloudera.com
  - @larsgeorge
What is Apache Hadoop?
- A scalable data storage and processing system
- An open source Apache project
- Hadoop clusters are built from standard hardware
- Distributed and fault-tolerant
- Widely deployed by many organizations
What Does Apache Hadoop Offer?
- Core Hadoop offers two key features
  - Storage: Hadoop Distributed File System (HDFS)
  - Processing: MapReduce
- Other related tools provide additional capabilities
  - These include Hive, Sqoop, Flume, and Mahout
  - Collectively known as the Hadoop ecosystem
Why Do We Need Apache Hadoop?
- Let's explore this question first
- Then we'll delve into the technical details afterwards
We Generated Little Data Before
- Consider an evening of dinner and a movie (1992)
  - Look up the restaurant in the phone book
  - Consult a map (on paper!) for directions
  - Drive to the restaurant
  - Pay with cash
  - Check the newspaper for movie showtimes
  - Buy a ticket at the box office window
- Not much data is being generated here
- That's OK, because storage cost over $3,500/GB back then
We Generate Lots of Data Now
- Consider a similar evening in 2012
  - Look up the restaurant using Yelp on a mobile phone
  - Use the phone's map software to find the restaurant
  - Check into the restaurant on Foursquare
  - Pay with a credit card
  - Watch the movie online via Netflix streaming
  - Tweet about how bad the movie was
- Lots of data is being generated here
- That's OK, because storage only costs $0.05/GB now
The Value of Volume
- One tweet is an anecdote
  - But a million tweets may signal important trends
- One person's product review is an opinion
  - But a million reviews might uncover a design flaw
- One person's diagnosis is an isolated case
  - But a million medical records could identify the cause
- Traditional tools can't handle big data
  - But Hadoop scales well into the petabytes
How Are Organizations Using Hadoop?
Just a few examples...
- Analytics
- Product recommendations
- Ad targeting
- Fraud detection
- Natural language processing
- Route optimization
Where Did Hadoop Come From?
- A spinoff of Apache Nutch
- Inspired by two Google publications
  - "The Google File System"
  - "MapReduce: Simplified Data Processing on Large Clusters"
Hallmarks of Hadoop's Design
- Machine failure is unavoidable: embrace it
  - Build reliability into the system
- "More" is usually better than "faster"
  - Throughput is more important than latency
- Network bandwidth is a precious resource
- You have far more data than code
HDFS: Hadoop Distributed File System
- Inspired by the Google File System
- Provides low-cost storage for massive amounts of data
- Not a general-purpose filesystem
  - Highly optimized for processing data with MapReduce
  - Cannot modify file content once written
- It's actually a user-space Java process
  - Accessed using special commands or APIs
HDFS Blocks
- When data is loaded into HDFS, it's split into blocks
- Blocks are of a fixed size (64 MB by default)
  - These are huge compared to UNIX filesystem blocks
[Diagram: a 230 MB input file is split into Block 1 (64 MB), Block 2 (64 MB), Block 3 (64 MB), and Block 4 (38 MB)]
HDFS Replication
- Each block is then replicated to multiple machines
- The default replication factor is three (but configurable)
[Diagram: a 64 MB block stored on three of five slave nodes (A through E)]
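If you want to see the splitting and replication for yourself, HDFS ships an fsck tool that reports a file's blocks and their locations. A minimal sketch, assuming a file named /user/lars/logs.txt already exists in the cluster:

    # Lists each block of the file and the DataNodes holding its replicas
    $ hadoop fsck /user/lars/logs.txt -files -blocks -locations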
HDFS Demo
I will now demonstrate the following:
1. How to create a directory in HDFS
2. How to copy a local file to HDFS
3. How to display the contents of a file in HDFS
4. How to remove a file from HDFS
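For reference, a sketch of the commands behind this demo, using the hadoop fs utility; the directory and file names are only placeholders:

    # 1. Create a directory in HDFS
    $ hadoop fs -mkdir /user/lars/demo

    # 2. Copy a local file to HDFS
    $ hadoop fs -put access.log /user/lars/demo/

    # 3. Display the contents of a file in HDFS
    $ hadoop fs -cat /user/lars/demo/access.log

    # 4. Remove a file from HDFS
    $ hadoop fs -rm /user/lars/demo/access.log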
Basic HDFS Architecture
- NameNode: HDFS master daemon
  - Manages the namespace and metadata
  - Only one NameNode per cluster (* see the variations on the next slide)
- DataNode: HDFS slave daemon
  - Provides storage and retrieval for data blocks
HDFS Architectural Variations
- Secondary NameNode
  - Performs periodic merges of the NameNode's data
  - Despite the name, this does not provide failover
- High Availability (reliability)
  - Active/standby configuration
  - The Standby NameNode replaces the older Secondary NameNode
- HDFS Federation (scalability)
  - Multiple namespaces per cluster
  - Independent of High Availability
MapReduce Introduction
- MapReduce is a programming model
  - It's a way of processing data
  - You can implement MapReduce in any language
- MapReduce has its roots in functional programming
  - Many languages have functions named map and reduce
  - These functions serve largely the same purpose in Hadoop
- Popularized for large-scale processing by Google
- MapReduce processing in Hadoop is batch-oriented
MapReduce Benefits
- Scalability
  - Hadoop divides the processing job into individual tasks
  - Tasks execute in parallel (independently) across the cluster
- Simplicity
  - Each map or reduce call receives one input record
  - And emits zero or more output records
- Ease of use
  - Hadoop provides job scheduling and other infrastructure
  - You don't have to write any file or network I/O code
MapReduce: Data Locality by Code Routing
[Diagram: the JobTracker schedules tasks on the slave nodes (A through E) that already hold the relevant HDFS blocks, moving the code to the data rather than the data to the code]
MapReduce Architecture
- Like HDFS, MapReduce in Hadoop is master/slave
- JobTracker is the master daemon
  - One per cluster
  - Performs job scheduling and monitoring
- TaskTracker is the slave daemon
  - Many per cluster
  - Executes the individual tasks that make up a job
  - Co-located with the DataNode daemon (data locality)
MapReduce Code for Hadoop
- Usually written in Java
  - This uses Hadoop's API directly
  - Data is passed as parameters to the map and reduce methods
  - Output is emitted via Java method calls
- You can do basic MapReduce in other languages
  - Using the Hadoop Streaming wrapper program
  - Map and reduce functions use STDIN/STDOUT for data
  - Some advanced features require Java code
MapReduce Example in Python
- The following example uses Python
  - Via Hadoop Streaming
- It processes log files and summarizes events by type
- I'll explain both the data flow and the code
Job Input
Here's the job input:

    2012-09-06 22:16:49.391 CDT INFO "This can wait"
    2012-09-06 22:16:49.392 CDT INFO "Blah blah"
    2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
    2012-09-06 22:16:49.395 CDT INFO "More blather"
    2012-09-06 22:16:49.397 CDT WARN "Hey there"
    2012-09-06 22:16:49.398 CDT INFO "Spewing data"
    2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

- Each mapper gets a chunk of this data to process
- This chunk is called an InputSplit
Python Code for Map Function
Our map function will parse the event type, and then output that event (key) and a literal 1 (value):

    #!/usr/bin/env python
    import sys

    # Define the list of known log levels
    levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

    # Split every line (record) we receive on standard input
    # into fields, normalized by case
    for line in sys.stdin:
        fields = line.split()
        for field in fields:
            field = field.strip().upper()
            # If this field matches a log level, print it,
            # a tab separator, and the literal value 1
            if field in levels:
                print "%s\t1" % field
Output of Map Function
The map function produces key/value pairs as output:

    INFO    1
    INFO    1
    WARN    1
    INFO    1
    WARN    1
    INFO    1
    ERROR   1
Input to Reduce Function
The reducer receives a key and all values for that key:

    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

- Keys are always passed to reducers in sorted order
- Although not obvious here, the values are unordered
Python Code for Reduce Function
The reducer first extracts the key and value it was passed:

    #!/usr/bin/env python
    import sys

    # Initialize the loop variables
    previous_key = ''
    count = 0

    # Extract the key and value passed via standard input
    for line in sys.stdin:
        key, value = line.split()
        value = int(value)
        # continued on next slide
Python Code for Reduce Function (continued)
Then it simply adds up the values for each key:

        # continued from previous slide
        if key == previous_key:
            # If the key is unchanged, add to the running count
            count = count + value
        else:
            # If the key changed, print the sum for the previous key,
            # then re-initialize the loop variables (the new count
            # starts from this record's value, not a literal 1)
            if previous_key != '':
                print '%s\t%i' % (previous_key, count)
            previous_key = key
            count = value

    # Print the sum for the final key
    print '%s\t%i' % (previous_key, count)
Output of Reduce Function
Its output is a sum for each level:

    ERROR   1
    INFO    4
    WARN    2
Recap of Data Flow

Map input:
    2012-09-06 22:16:49.391 CDT INFO "This can wait"
    2012-09-06 22:16:49.392 CDT INFO "Blah blah"
    2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
    2012-09-06 22:16:49.395 CDT INFO "More blather"
    2012-09-06 22:16:49.397 CDT WARN "Hey there"
    2012-09-06 22:16:49.398 CDT INFO "Spewing data"
    2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
    INFO    1
    INFO    1
    WARN    1
    INFO    1
    WARN    1
    INFO    1
    ERROR   1

Reduce input:
    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

Reduce output:
    ERROR   1
    INFO    4
    WARN    2
How to Run a Hadoop Streaming Job
I'll demonstrate this now.
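For reference, a sketch of the invocation; the path to the streaming jar varies by Hadoop version and distribution, and the input and output paths here are only placeholders:

    # mapper.py and reducer.py must be executable (chmod +x);
    # -file ships them to the cluster with the job
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /user/lars/logs \
        -output /user/lars/logcounts \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py

    # Inspect the result written by the (single) reducer
    $ hadoop fs -cat /user/lars/logcounts/part-00000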
Hadoop Ecosystem Example: CDH
[Diagram: the component stack of CDH, Cloudera's Distribution including Apache Hadoop]
- Storage: HDFS, HBase
- Batch compute: MapReduce, MapReduce 2 (YARN)
- Real-time access and compute: Impala, Search, HBase
- Batch processing tools: Hive, Pig, Mahout, DataFu
- Integration: Sqoop, Flume, FUSE-DFS, WebHDFS/HttpFS, ODBC/JDBC
- Coordination and workflow: ZooKeeper, Oozie
- User interface: Hue
- Cloud: Whirr
- Management: Cloudera Manager, Cloudera Navigator (audit and access in v1.0; lineage and lifecycle to follow)
Apache Flume
- Copies data into HDFS as it's generated
- Can handle a variety of input sources
  - Data appended to log files
  - UNIX syslog
  - Output from programs
  - Data received on network ports
  - Custom sources
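As an illustration of how such a source is wired up, here is a minimal sketch of a Flume NG agent configuration that tails a log file into HDFS; the agent name, log file, and HDFS path are assumptions:

    # flume.conf: one source, one in-memory channel, one HDFS sink
    agent.sources = logsrc
    agent.channels = mem
    agent.sinks = hdfssink

    # Tail an application log file as it is written
    agent.sources.logsrc.type = exec
    agent.sources.logsrc.command = tail -F /var/log/app.log
    agent.sources.logsrc.channels = mem

    # Buffer events in memory between source and sink
    agent.channels.mem.type = memory

    # Write the events into HDFS
    agent.sinks.hdfssink.type = hdfs
    agent.sinks.hdfssink.hdfs.path = hdfs://namenode/user/lars/logs/
    agent.sinks.hdfssink.channel = mem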
Apache Sqoop
- Database integration for HDFS
  - Compatible with nearly any database via a JDBC driver
  - High-performance custom connectors are available for some databases
- Can import database tables into HDFS
  - All tables from a database, a single table, or a portion of a table
- Can also export data from HDFS back to a database
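A minimal sketch of a table import, assuming a MySQL database named sales with an orders table; the host, credentials, and target directory are placeholders:

    $ sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username dbuser --password dbpass \
        --table orders \
        --target-dir /user/lars/orders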
Apache Hive and Apache Pig
- High-level processing for data stored in HDFS
  - Hive uses a SQL-like language called HiveQL
  - Pig uses a more procedural language called Pig Latin
- An alternative to writing MapReduce code
  - Reduces development time and increases productivity
- But they have the same latency as MapReduce
  - Because they generate MapReduce jobs that run on the cluster
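For comparison, the entire log-summarizing example above collapses into a single HiveQL query. A sketch, assuming a table named logs with a level column has already been defined over the raw files:

    $ hive -e 'SELECT level, COUNT(*) FROM logs GROUP BY level;'

Behind the scenes, Hive compiles this into the same kind of MapReduce job we wrote by hand earlier.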
Apache HBase
- High-performance NoSQL database built on HDFS
- Based on Google's BigTable paper
- Very scalable
- Low-latency data access
- No high-level query language
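Access is instead via APIs and the interactive HBase shell; a quick sketch, using a hypothetical table:

    $ hbase shell
    hbase> create 'webtable', 'contents'
    hbase> put 'webtable', 'com.example/', 'contents:html', '<html>...</html>'
    hbase> get 'webtable', 'com.example/'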
Cloudera's Impala
- Offers the benefits of both Hive and HBase
  - Scalability
  - Performance
  - High-level query language (a subset of SQL-92)
- Announced at Hadoop World + Strata in October
- Open source and available under the Apache License
  - Download the beta from the Cloudera Web site
Other Notable Ecosystem Components
- Apache Whirr
  - Libraries for running cloud-based services
- Apache Mahout
  - Scalable machine learning libraries
- Apache Oozie
  - Workflow management for Hadoop jobs
- Apache ZooKeeper
  - Synchronization services for distributed systems
Typical Stack Architectures
[Diagram: three common stacks, each fed by an ingest layer (Sqoop, Flume, WebHDFS/HttpFS) and drained by an outgest layer (Sqoop, Flume, WebHDFS/HttpFS, ODBC/JDBC)]
- Batch with read-only data: Hive, Pig, and MapReduce (Java) over HDFS
- Batch with random writes: Hive, Pig, and MapReduce (Java) over HBase on HDFS
- Batch or real-time query: Impala and Hive alongside Pig and MapReduce (Java) over HDFS
Typical Hadoop Data Pipeline
[Diagram: data flows from the data sources via Sqoop and Flume into HDFS as original source data; Pig, Hive, and MapReduce jobs, coordinated by Oozie, produce results or calculated data; Sqoop then exports these to a data warehouse and data marts]
Conclusion
- Apache Hadoop: scalable data storage and processing
  - HDFS (storage)
  - MapReduce (processing)
- The Hadoop ecosystem includes additional tools
  - They help integrate Hadoop with other systems
  - They make it easier to analyze data in HDFS
Next Steps
- Cloudera's Distribution including Apache Hadoop (CDH)
  - Not just Hadoop, but also Hive, Pig, HBase, Mahout, etc.
  - Free and 100% open source (Apache license)
  - Easy-to-install packages
  - You can also download a virtual machine with CDH pre-installed
Highly Recommended Books
- Hadoop: The Definitive Guide, Tom White (ISBN: 1-449-31152-0)
- Hadoop Operations, Eric Sammer (ISBN: 1-449-32705-2)
Thank You!
Lars George, Director EMEA Services
Cloudera, Inc.