
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Topics. The goal of this presentation is to give you a basic understanding of Hadoop's core components (HDFS and MapReduce). In this presentation, I will try to answer the following questions: Which factors led to the era of Big Data? What is Hadoop, and what are its significant features? How does it offer reliable storage for massive amounts of data with HDFS? How does it support large-scale data processing through MapReduce? How can Hadoop Ecosystem tools boost an analyst's productivity?

Topics: The Motivation for Hadoop, Hadoop Overview, HDFS, MapReduce, The Hadoop Ecosystem.

Cost of Data Storage. Source: http://www.mkomo.com/cost-per-gigabyte-update

Traditional vehicles for storing big data are prohibitively expensive. We must retain and process large amounts of data to extract its value. Unfortunately, traditional vehicles for storing this data have become prohibitively expensive, both in dollar costs ($ per GB) and in computational costs, such as the ETL jobs needed to summarize the data for warehousing.

The Traditional Workaround. How has industry typically dealt with this problem? Perform an ETL (Extract, Transform, Load) on the data to summarize and denormalize it before archiving the result in a data warehouse, discarding the details, and then run queries against the summary data. Unfortunately, this process results in a loss of detail, and there could be real nuggets in that lost detail; value is lost with data summarization or deletion. Think of the opportunities lost when we summarize 10,000 rows of data into a single record. We may not learn much from a single tweet, but we can learn a lot from 10,000 tweets. More data == deeper understanding. "There's no data like more data" (Moore 2001). It's not who has the best algorithms that wins; it's who has the most data.

We Need a System That Scales. We're generating too much data to process with traditional tools. There are two key problems to address: How can we reliably store large amounts of data at a reasonable cost? How can we analyze all the data we have stored?

Topics: The Motivation for Hadoop, Hadoop Overview, HDFS, MapReduce, The Hadoop Ecosystem.

What is Hadoop? "Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware." - Google. "A free, open source software framework that supports data-intensive distributed applications." - Cloudera. "Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation." - TechTarget

What is Hadoop? Hadoop is a free software package that provides tools enabling data discovery, question answering, and rich analytics based on very large volumes of information, all in a fraction of the time required by traditional databases.

What Is Apache Hadoop? A system providing scalable, economical data storage and processing. It is distributed and fault tolerant, and it harnesses the power of industry-standard hardware. Core Hadoop consists of two main components. Storage: the Hadoop Distributed File System (HDFS), a framework for distributing data across a cluster in ways that are scalable and fast. Processing: MapReduce, a framework for processing data in ways that are reliable and fast. Plus the infrastructure needed to make them work, including filesystem and administration utilities, and job scheduling and monitoring.

Where Did Hadoop Come From? It started with research that originated at Google back in 2003. Google's objective was to index the entire World Wide Web, and it had reached the limits of scalability of RDBMS technology. This research led to a new approach to storing and analyzing large quantities of data: "The Google File System" (Ghemawat et al. 2003) and "MapReduce: Simplified Data Processing on Large Clusters" (Dean and Ghemawat 2008). A developer by the name of Doug Cutting (later at Yahoo!) was wrestling with many of the same problems in the implementation of his own open-source search engine, Nutch. He started an open-source project based on Google's research and created Hadoop in 2005. Hadoop was named after his son's toy elephant.

Scalability. Hadoop is a distributed system. A collection of servers running Hadoop software is called a cluster, and the individual servers within a cluster are called nodes, typically standard rack-mount servers running Linux. Each node both stores and processes data. Add more nodes to the cluster to increase scalability; a cluster may contain up to several thousand nodes (Facebook and Yahoo are each running clusters in excess of 4400 nodes). Scalability is linear, and it's accomplished on standard hardware, greatly reducing storage costs.

Reliability / Availability. Reliability and availability are traditionally expensive to provide, requiring expensive hardware, redundancy, auxiliary resources, and training. With Hadoop, providing reliability and availability is much less expensive. If a node fails, just replace it (at Google, servers are not bolted into racks but attached with velcro, in anticipation of the need to replace them quickly). If a task fails, it is automatically run somewhere else. No data is lost, and no human intervention is required. In other words, reliability and availability are provided by the framework itself.

Fault Tolerance. Adding nodes increases the chance that any one of them will fail, so Hadoop builds redundancy into the system and handles failures automatically. Files loaded into HDFS are replicated across nodes in the cluster; if a node fails, its data is re-replicated using one of the other copies. Data processing jobs are broken into individual tasks, each of which takes a small amount of data as input, and thousands of tasks (or more) often run in parallel. If a node fails during processing, its tasks are rescheduled elsewhere. Routine failures are handled automatically without any loss of data.

How are reliability, availability, and fault tolerance provided? A file in Hadoop is broken down into blocks which are: typically 64MB each (a typical Windows block size is 4KB); distributed across the cluster; and replicated across multiple nodes of the cluster (default: 3). The consequences of this architecture: if a machine fails (even an entire rack), no data is lost, and tasks can run elsewhere, where the data resides. If a task fails, it can be dispatched to one of the nodes containing redundant copies of the same data block. When a failed node comes back online, it can automatically be reintegrated into the cluster.
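Why a single node failure loses no data can be sketched in a few lines of Python. This is an illustrative simulation only: the node names and the round-robin placement below are assumptions made for the sketch (real HDFS uses rack-aware replica placement).

```python
import itertools

NODES = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3  # default HDFS replication factor, as on the slide

def place(blocks):
    """Place REPLICATION replicas of each block on consecutive nodes.

    Simplified round-robin placement for illustration; real HDFS placement
    is rack-aware.
    """
    starts = itertools.cycle(range(len(NODES)))
    return {
        b: [NODES[(start + k) % len(NODES)] for k in range(REPLICATION)]
        for b, start in zip(blocks, starts)
    }

placement = place(["blk_1", "blk_2", "blk_3"])

# Simulate losing node1: every block still has surviving replicas elsewhere,
# so tasks can be dispatched to nodes holding the redundant copies.
survivors = {b: [n for n in reps if n != "node1"] for b, reps in placement.items()}
print(survivors)
```

Because each block has three replicas spread over different nodes, removing any one node leaves at least two copies of every block available.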

Topics: The Motivation for Hadoop, Hadoop Overview, HDFS, MapReduce, The Hadoop Ecosystem.

HDFS: Hadoop Distributed File System. HDFS provides inexpensive and reliable storage for massive amounts of data. It is optimized for a relatively small number of large files: each file is likely to exceed 100 MB, and multi-gigabyte files are common. Files are stored in a hierarchical directory structure, e.g., /sales/reports/asia.txt. Files cannot be modified once written; need to make changes? Remove and recreate. Use Hadoop-specific utilities to access HDFS.

HDFS and the Unix File System. In some ways, HDFS is similar to a UNIX filesystem: it is hierarchical, with UNIX-style paths (e.g., /sales/reports/asia.txt) and UNIX-style file ownership and permissions. There are also some major deviations from UNIX: there is no concept of a current directory, and files cannot be modified once written (you can delete them and recreate them, but you can't modify them). You must use Hadoop-specific utilities or custom code to access HDFS.

HDFS Architecture (1 of 3). Hadoop has a master/slave architecture. The HDFS master daemon is the NameNode, which manages the namespace (file-to-block mappings) and metadata (block-to-machine mappings) and monitors the slave nodes. The HDFS slave daemon is the DataNode, which reads and writes the actual data.

HDFS Architecture (2 of 3). Example: the NameNode holds metadata for two files, Foo.txt (300MB) and Bar.txt (200MB). Assume HDFS is configured for 128MB blocks. The DataNodes hold the actual blocks; each block is up to 128MB in size, and each block is replicated three times on the cluster. Block reports are periodically sent to the NameNode.
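The block counts in this example follow from simple arithmetic, sketched below (the file names and sizes are those on the slide; the helper function is illustrative, not part of any Hadoop API):

```python
import math

BLOCK_SIZE_MB = 128  # block size configured in the example
REPLICATION = 3      # each block is replicated three times

def num_blocks(file_size_mb):
    """Number of HDFS blocks needed for a file; the last block may be partial."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

for name, size in [("Foo.txt", 300), ("Bar.txt", 200)]:
    blocks = num_blocks(size)
    print(f"{name}: {blocks} blocks -> {blocks * REPLICATION} stored block replicas")
```

So Foo.txt occupies 3 blocks (128 + 128 + 44MB) and Bar.txt occupies 2 (128 + 72MB), for 15 stored block replicas across the cluster in total.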

HDFS Architecture (3 of 3). HDFS is optimal for handling millions of large files, rather than billions of small files, because in pursuit of responsiveness the NameNode stores all of its file/block information in memory. Too many files will cause the NameNode to run out of space, and too many blocks (if the blocks are small) will also cause the NameNode to run out of space. Processing each block requires its own Java Virtual Machine (JVM), and with too many blocks you begin to see the limits of HDFS scalability.
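To make the small-files problem concrete, here is a rough back-of-the-envelope sketch. The ~150 bytes of NameNode heap per file/block object is a commonly quoted rule of thumb, not a figure from these slides, so treat the heap estimates as order-of-magnitude only:

```python
BLOCK_SIZE_MB = 128
BYTES_PER_OBJECT = 150  # rule-of-thumb NameNode heap per file/block object (assumption)

def namenode_objects(total_mb, file_size_mb):
    """Approximate file + block objects the NameNode must track in memory."""
    n_files = total_mb // file_size_mb
    blocks_per_file = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return n_files * (1 + blocks_per_file)

one_tb_mb = 1024 * 1024
big = namenode_objects(one_tb_mb, 1024)  # 1TB stored as 1GB files
small = namenode_objects(one_tb_mb, 1)   # the same 1TB as 1MB files
print(f"1GB files: {big:>9} objects (~{big * BYTES_PER_OBJECT / 1e6:.0f} MB heap)")
print(f"1MB files: {small:>9} objects (~{small * BYTES_PER_OBJECT / 1e6:.0f} MB heap)")
```

The same terabyte of data costs the NameNode roughly 200 times more metadata objects when stored as 1MB files than as 1GB files, which is why HDFS favors a small number of large files.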

Accessing HDFS via the Command Line. HDFS is not a general-purpose file system, and HDFS files cannot be accessed through the host OS. End users typically access HDFS via the hadoop fs command. Examples: display the contents of the /user/fred/sales.txt file: hadoop fs -cat /user/fred/sales.txt. Create a directory (below the root) called reports: hadoop fs -mkdir /reports

Copying Local Data to and from HDFS. Remember that HDFS is separate from your local filesystem. Use hadoop fs -put to copy local files to HDFS, and hadoop fs -get to copy HDFS files to local files.

More Command Examples. Copy the file input.txt from the local disk to the user's home directory in HDFS: hadoop fs -put input.txt input.txt (this will copy the file to /user/username/input.txt). Get a directory listing of the HDFS root directory: hadoop fs -ls /. Delete the file /reports/sales.txt: hadoop fs -rm /reports/sales.txt. Other command options emulating POSIX commands are available.

Topics: The Motivation for Hadoop, Hadoop Overview, HDFS, MapReduce, The Hadoop Ecosystem.

Introducing MapReduce. Now that we have described how Hadoop stores data, let's turn our attention to how it processes data. We typically process data in Hadoop using MapReduce. MapReduce is not a language; it's a programming model. MapReduce consists of two functions: map and reduce.

Understanding Map and Reduce. The map function always runs first. It is typically used to filter, transform, or parse data, e.g., parse the stock symbol, price, and time from a data feed. The output from the map function (eventually) becomes the input to the reduce function. The reduce function is typically used to aggregate data from the map function, e.g., compute the average hourly price of the stock. It is not always needed and is therefore optional; you can run something called a map-only job. Between these two tasks there is typically a hidden phase known as the Shuffle and Sort, which organizes map output for delivery to the reducer. Each individual piece is simple, but collectively they are quite powerful, analogous to pipes and filters in Unix.
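The division of labor between the two functions can be sketched in plain Python using the stock-feed example. This is a conceptual sketch, not Hadoop API code: the "symbol,price,hour" record layout is an assumption for illustration, and the reduce here averages per symbol rather than per hour to keep the sketch short.

```python
def map_fn(record):
    """Map: parse one raw feed record, emit a (key, value) pair."""
    symbol, price, hour = record.split(",")
    yield symbol, float(price)

def reduce_fn(symbol, prices):
    """Reduce: aggregate all values seen for one key -- here, an average price."""
    return symbol, sum(prices) / len(prices)

feed = ["AAPL,184.10,09", "AAPL,184.50,10", "GOOG,141.20,09"]

# Simulate the (normally hidden) shuffle: group map output by key
grouped = {}
for record in feed:
    for key, value in map_fn(record):
        grouped.setdefault(key, []).append(value)

results = dict(reduce_fn(k, v) for k, v in grouped.items())
print(results)
```

Note that the reducer never sees raw feed records, only keys with their grouped values; that grouping is exactly what the Shuffle and Sort provides in Hadoop.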

MapReduce Example. MapReduce code for Hadoop is typically written in Java, but it is possible to use nearly any language with Hadoop Streaming. The following slides explain an entire MapReduce job. Input: a text file containing order ID, employee name, and sale amount. Output: the sum of all sales per employee.

Explanation of the Map Function. Hadoop splits a job into many individual map tasks; the number of map tasks is determined by the amount of input data. Each map task receives a portion of the overall job input to process in parallel. Mappers process one input record at a time, and for each input record they emit zero or more records as output. In this case, 5 map tasks each parse a subset of the input records and output the name and sales fields for each. The mapper does the simple job of extracting two of the three fields.

Shuffle and Sort. Hadoop automatically sorts and merges the output from all map tasks, sorting records by key (the name, in this example). This transparent intermediate process is known as the Shuffle and Sort. The result is supplied to the reduce tasks.

Explanation of the Reduce Function. Reducer input comes from the shuffle and sort process. As with map, the reduce function receives one record at a time, and a given reducer receives all records for a given key. For each input record, reduce can emit zero or more output records. This reduce function sums the total sales per person and emits the employee name (key) and total sales (value) as output to a file. Each reducer output file is written to a job-specific directory in HDFS.

Putting It All Together. Here's the data flow for the entire MapReduce job: Map, Shuffle & Sort, Reduce.
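The whole sales job can be simulated end to end in miniature. The order records below are hypothetical sample data invented for this sketch, and the tab-separated "order_id, employee, amount" layout is an assumption; only the three-phase flow mirrors the slides.

```python
from collections import defaultdict

# Miniature simulation of the slides' job: sum of all sales per employee.
orders = [
    "001\tAlice\t240.00",
    "002\tBob\t128.50",
    "003\tAlice\t59.99",
]

# Map: extract (employee, amount) from each record, dropping the order ID
mapped = []
for line in orders:
    _, name, amount = line.split("\t")
    mapped.append((name, float(amount)))

# Shuffle & Sort: order map output by key and group values per key
shuffled = defaultdict(list)
for name, amount in sorted(mapped):
    shuffled[name].append(amount)

# Reduce: sum the grouped sales for each employee
totals = {name: round(sum(amounts), 2) for name, amounts in shuffled.items()}
print(totals)
```

In real Hadoop the map and reduce steps run as parallel tasks on different nodes and the shuffle moves data between them over the network; the logic, however, is exactly this simple.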

Benefits of MapReduce. Simplicity (via fault tolerance), particularly when compared with other distributed programming models. Flexibility: it offers more analytic capabilities and works with more data types than platforms like SQL. Scalability: it works with small quantities of data at a time, running in parallel across a cluster and sharing nothing among the participating nodes.

Hadoop Resource Management: YARN


Working with YARN. Developers submit jobs (applications) to run on the YARN cluster, and monitor and manage those jobs. Hadoop includes three major YARN tools for developers: the Hue Job Browser, the YARN Web UI, and the YARN command line.

Topics: The Motivation for Hadoop, Hadoop Overview, HDFS, MapReduce, The Hadoop Ecosystem.


Data Ingestion

Data Processing

Data Analysis and Exploration

Other Hadoop Ecosystem Tools

The Hadoop Ecosystem. Hadoop is a framework for distributed storage and processing. Core Hadoop includes HDFS for storage and YARN for cluster resource management. The Hadoop ecosystem includes many components for: data ingestion and database integration (Sqoop, Flume, Kafka); storage (HDFS, HBase); data processing (Spark, MapReduce, Pig); SQL support (Hive, Impala, Spark SQL); and others, such as workflow (Oozie) and machine learning (Mahout).

Recap: Data Center Integration (1 of 2)

Recap: Data Center Integration (2 of 2). Hadoop is not necessarily intended to replace existing parts of a data center, but its capabilities do allow you to offload some of the functionality currently provided (more expensively) by other technologies. So while Hadoop will not replace the use of an RDBMS to retrieve data as part of a transaction, it could be used to store far more detail about each transaction than you would keep in an ordinary RDBMS (due to cost limitations).

Bibliography
10 Hadoopable Problems (recorded presentation): http://tiny.cloudera.com/dac02a
Introduction to Apache MapReduce and HDFS (recorded presentation): http://tiny.cloudera.com/dac02b
Guide to HDFS Commands: http://tiny.cloudera.com/dac02c
Hadoop: The Definitive Guide, 3rd Edition: http://tiny.cloudera.com/dac02d
Sqoop User Guide: http://tiny.cloudera.com/dac02e
Cloudera Version and Packaging Information: http://www.cloudera.com/content/support/en/documentation/version-downloaddocumentation/versions-download-latest.html

Thank you!