WA2341 Hadoop Programming EVALUATION ONLY


WA2341 Hadoop Programming
Web Age Solutions Inc.
USA: Canada: Web:

The following terms are trademarks of other companies: Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. IBM, WebSphere, DB2 and Tivoli are trademarks of the International Business Machines Corporation in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

For customizations of this book or other sales inquiries, please contact us at:
USA: getinfousa@webagesolutions.com
Canada: toll free, getinfo@webagesolutions.com

Copyright 2014 Web Age Solutions Inc. This publication is protected by the copyright laws of Canada, the United States and any other country where this book is sold. Unauthorized use of this material, including but not limited to, reproduction of the whole or part of the content, re-sale or transmission through fax, photocopy or is prohibited. To obtain authorization for any such activities, please write to:

Web Age Solutions Inc.
439 University Ave Suite 820
Toronto Ontario, M5G 1Y8

Table of Contents

Chapter 1 - MapReduce Overview: MapReduce Defined; Google's MapReduce; MapReduce Explained; MapReduce Explained; MapReduce Word Count Job; MapReduce Shared-Nothing Architecture; Similarity with SQL Aggregation Operations; Example of Map & Reduce Operations using JavaScript; Example of Map & Reduce Operations using JavaScript; Problems Suitable for Solving with MapReduce; Typical MapReduce Jobs; Fault-tolerance of MapReduce; Distributed Computing Economics; MapReduce Systems; Summary

Chapter 2 - Hadoop Overview: Apache Hadoop; Apache Hadoop Logo; Typical Hadoop Applications; Hadoop Clusters; Hadoop Design Principles; Hadoop's Core Components; Hadoop Simple Definition; High-Level Hadoop Architecture; Hadoop-based Systems for Data Analysis; Hadoop Caveats; Summary

Chapter 3 - Hadoop Distributed File System Overview: Hadoop Distributed File System; Hadoop Distributed File System; Data Blocks; Data Block Replication Example; HDFS NameNode Directory Diagram; Accessing HDFS; Examples of HDFS Commands; Client Interactions with HDFS for the Read Operation; Read Operation Sequence Diagram; Client Interactions with HDFS for the Write Operation; Communication inside HDFS; Summary

Chapter 4 - MapReduce with Hadoop: Hadoop's MapReduce; JobTracker and TaskTracker; MapReduce Programming Options; Java MapReduce API; The Structure of a Java MapReduce Program; The Mapper Class; The Reducer Class; The Driver Class; Compiling Classes; Running the MapReduce Job; The Structure of a Single MapReduce Program; Combiner Pass (Optional); Hadoop's Streaming MapReduce; Python Word Count Mapper Program Example; Python Word Count Reducer Program Example; Setting up Java Classpath for Streaming Support; Streaming Use Cases; The Streaming API vs Java MapReduce API; Amazon Elastic MapReduce; Amazon Elastic MapReduce; Amazon Elastic MapReduce; Summary

Chapter 5 - Apache Pig Scripting Platform: What is Pig?; Pig Latin; Apache Pig Logo; Pig Execution Modes; Local Execution Mode; MapReduce Execution Mode; Running Pig; Running Pig in Batch Mode; What is Grunt?; Pig Latin Statements; Pig Programs; Pig Latin Script Example; SQL Equivalent; Differences between Pig and SQL; Statement Processing in Pig; Comments in Pig; Supported Simple Data Types; Supported Complex Data Types; Arrays; Defining Relation's Schema; The bytearray Generic Type; Using Field Delimiters; Referencing Fields in Relations; Summary

Chapter 6 - Apache Pig HDFS Interface: The HDFS Interface; FSShell Commands (Short List); Grunt's Old File System Commands; Summary

Chapter 7 - Apache Pig Relational and Eval Operators: Pig Relational Operators; Example of Using the JOIN Operator; Example of Using the JOIN Operator; Example of Using the Order By Operator; Caveats of Using Relational Operators; Pig Eval Functions; Caveats of Using Eval Functions (Operators); Example of Using Single-column Eval Operations; Example of Using Eval Operators For Global Operations; Summary

Chapter 8 - Apache Pig Miscellaneous Topics: Utility Commands; Handling Compression; User-Defined Functions; Filter UDF Skeleton Code; Summary

Chapter 9 - Apache Pig Performance: Apache Pig Performance; Performance Enhancer - Use the Right Schema Type; Performance Enhancer - Apply Data Filters; Use the PARALLEL Clause; Examples of the PARALLEL Clause; Performance Enhancer - Limiting the Data Sets; Displaying Execution Plan; Summary

Chapter 10 - Hive: What is Hive?; Apache Hive Logo; Hive's Value Proposition; Who uses Hive?; Hive's Main Systems; Hive Features; Hive Architecture; HiveQL; Where are the Hive Tables Located?; Hive Command-line Interface (CLI); Summary

Chapter 11 - Hive Command-line Interface: Hive Command-line Interface (CLI); The Hive Interactive Shell; Running Host OS Commands from the Hive Shell; Interfacing with HDFS from the Hive Shell; The Hive in Unattended Mode; The Hive CLI Integration with the OS Shell; Executing HiveQL Scripts; Comments in Hive Scripts; Variables and Properties in Hive CLI; Setting Properties in CLI; Example of Setting Properties in CLI; Hive Namespaces; Using the SET Command; Setting Properties in the Shell; Setting Properties for the New Shell Session; Summary

Chapter 12 - Hive Data Definition Language: Hive Data Definition Language; Creating Databases in Hive; Using Databases; Creating Tables in Hive; Supported Data Type Categories; Common Primitive Types; Example of the CREATE TABLE Statement; The STRUCT Type; Table Partitioning; Table Partitioning; Table Partitioning on Multiple Columns; Viewing Table Partitions; Row Format; Data Serializers / Deserializers; File Format Storage; More on File Formats; The EXTERNAL DDL Parameter; Example of Using EXTERNAL; Creating an Empty Table; Dropping a Table; Table / Partition(s) Truncation; Alter Table/Partition/Column; Views; Create View Statement; Why Use Views?; Restricting Amount of Viewable Data; Examples of Restricting Amount of Viewable Data; Creating and Dropping Indexes; Describing Data; Summary

Chapter 13 - Hive Data Manipulation Language: Hive Data Manipulation Language (DML); Using the LOAD DATA statement; Example of Loading Data into a Hive Table; Loading Data with the INSERT Statement; Appending and Replacing Data with the INSERT Statement; Examples of Using the INSERT Statement; Multi Table Inserts; Multi Table Inserts Syntax; Multi Table Inserts Example; Summary

Chapter 14 - Hive Select Statement: HiveQL; The SELECT Statement Syntax; The WHERE Clause; Examples of the WHERE Statement; Partition-based Queries; Example of an Efficient SELECT Statement; The DISTINCT Clause; Supported Numeric Operators; Built-in Mathematical Functions; Built-in Aggregate Functions; Built-in Statistical Functions; Other Useful Built-in Functions; The GROUP BY Clause; The HAVING Clause; The LIMIT Clause; The ORDER BY Clause; The JOIN Clause; The CASE Clause; Example of CASE Clause; Summary

Chapter 15 - Apache Sqoop: What is Sqoop?; Apache Sqoop Logo; Sqoop Import / Export; Sqoop Help; Examples of Using Sqoop Commands; Data Import Example; Fine-tuning Data Import; Controlling the Number of Import Processes; Data Splitting; Helping Sqoop Out; Example of Executing Sqoop Load in Parallel; A Word of Caution: Avoid Complex Free-Form Queries; Using Direct Export from Databases; Example of Using Direct Export from MySQL; More on Direct Mode Import; Changing Data Types; Example of Default Types Overriding; File Formats; The Apache Avro Serialization System; Binary vs Text; More on the SequenceFile Binary Format; Generating the Java Table Record Source Code; Data Export from HDFS; Export Tool Common Arguments; Data Export Control Arguments; Data Export Example; Using a Staging Table; INSERT and UPDATE Statements; INSERT Operations; UPDATE Operations; Example of the Update Operation; Failed Exports; Summary

Chapter 16 - Cloudera Impala: What is Cloudera Impala?; Impala's Logo; Benefits of Using Impala; Key Features; How Impala Handles SQL Queries; Impala Programming Interfaces; Impala SQL Language Reference; Differences Between Impala and HiveQL; Impala Shell; Impala Shell Main Options; Impala Shell Commands; Impala Common Shell Commands; Cloudera Web Admin UI; Impala Browse-based Query Editor; Summary

Chapter 17 - Apache HBase: What is HBase?; HBase Design; HBase Features; The Write-Ahead Log (WAL) and MemStore; HBase vs RDBS; HBase vs Apache Cassandra; Interfacing with HBase; HBase Thrift And REST Gateway; HBase Table Design; Column Families; A Cell's Value Versioning; Timestamps; Accessing Cells; HBase Table Design Digest Table; Horizontal Partitioning with Regions; HBase Compaction; Loading Data in HBase; HBase Shell; HBase Shell Command Groups; Creating and Populating a Table in HBase Shell; Getting a Cell's Value; Counting Rows in an HBase Table; Summary

Chapter 18 - Apache HBase Java API: HBase Java Client; HBase Scanners; Using ResultScanner Efficiently; The Scan Class; The KeyValue Class; The Result Class; Getting Versions of Cell Values Example; The Cell Interface; HBase Java Client Example; Scanning the Table Rows; Dropping a Table; The Bytes Utility Class; Summary


Chapter 1 - MapReduce Overview

Objectives

In this chapter, participants will learn about:
- The MapReduce programming model
- Main MapReduce design principles

1.1 MapReduce Defined

There are different definitions of what MapReduce (a single word) is:
- a programming model
- a parallel processing framework
- a computational paradigm
- a batch query processor

This technique (model, etc.) was influenced by functional programming languages that have the map and reduce functions.

1.2 Google's MapReduce

- MapReduce was first introduced by Google back in 2004
- Google applied for and was granted US Patent 7,650,331 on MapReduce, called "System and method for efficient large-scale data processing"
- The patent lists Jeffrey Dean and Sanjay Ghemawat as its inventors
- The value proposition of the MapReduce framework is its scalability and fault-tolerance, achieved by its architecture

Notes:

ABSTRACT: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

Source: US Patent 7,650,331 abstract

1.3 MapReduce Explained

- MapReduce works by breaking the data processing task into two phases: the map phase and the reduce phase, executed sequentially, with the output of the map operation piped (emitted) as input into the reduce operation
- The map phase is backed by a Map() procedure that breaks the chunks of original data into a list of key/value pairs
- There is a data split operation feeding data into the Map() procedure
- The reduce phase is backed by a Reduce() procedure that performs some sort of an aggregation operation by key (e.g. counting, finding the maximum of elements in the data set, etc.) on data received from the Map() procedure
- There is also an additional step between the map and reduce phases, called "Shuffle", that prepares (sorts, etc.) and directs the output of the map phase to the reduce phase

1.4 MapReduce Explained

(Figure: MapReduce data flow diagram. Source: Wikipedia)

1.5 MapReduce Word Count Job

Notes:

The key in the Word Count MapReduce operation is the word itself. Each word's associated value is the number of occurrences of that word in the input data set. The map phase emits key/value pairs of the form (word, 1) (each word gets counted as occurring only once), with potentially many duplicates, which are combined in the shuffle step and then aggregated in the reduce phase.

1.6 MapReduce Shared-Nothing Architecture

- MapReduce is designed around the shared-nothing architecture, which leads to computational efficiencies in distributed computing environments
- Shared-nothing means "independent of others"
- A mapper process is independent of other mapper processes, and so are reducer processes
- This architecture allows the map and reduce operations to be executed in parallel (in their respective phases)
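The word-count job described in section 1.5 can be sketched in plain Python. This is a single-process simulation of the map, shuffle, and reduce phases, not the Hadoop API; the function names are ours:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(text):
    # Emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # Sort by key, then group all values for the same word together
    pairs = sorted(pairs, key=itemgetter(0))
    return [(word, [v for _, v in group])
            for word, group in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped):
    # Aggregate the per-word counts by key
    return {word: sum(counts) for word, counts in grouped}

counts = reduce_phase(shuffle(map_phase("to be or not to be")))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In Hadoop, map_phase and reduce_phase would run as many independent mapper and reducer processes, with the framework performing the shuffle between them.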

- The MapReduce programming model is linearly scalable

1.7 Similarity with SQL Aggregation Operations

It may help to compare MapReduce with the aggregation operations used in SQL. In SQL, aggregation is achieved by using COUNT(), AVG(), MIN(), and other such functions (which act as a sort of reducer) together with the GROUP BY clause, e.g.:

SELECT MONTH, SUM(SALES) FROM YEAR_END_REPORT GROUP BY MONTH

MapReduce offers more fine-grained programmatic control over data processing in multi-node computing environments using parallel programming algorithms.

1.8 Example of Map & Reduce Operations using JavaScript

JavaScript supports some elements of functional programming in the form of the map() and reduce() functions.

Problem: Find the sum of the elements of the [1,2,3] array after all its elements have been increased by 10%.

Note: An alternative solution is to find the sum of the array elements before the increase and then apply the 10% increase to the total.

1.9 Example of Map & Reduce Operations using JavaScript

Solution:

The Map phase (apply the function of a 10% value increase to each element of the input array [1,2,3]):

[1,2,3].map(function(x){return x + x/10});

Result: [1.1, 2.2, 3.3]

The Reduce phase (use the result array of the Map operation as input and sum up all its elements):

[1.1, 2.2, 3.3].reduce(function(x,y){return x + y});

Result: 6.6

Note: While the map() and reduce() functions here can be used independently from each other, MapReduce is always a single operation.

1.10 Problems Suitable for Solving with MapReduce

The following are the criteria you can use to see if the problem at hand can be efficiently solved with MapReduce:
- The problem can be split into smaller problems with no shared state that can be solved in parallel
- Those smaller problems are independent from one another and do not require interactions
- The problem can be decomposed into the map and reduce operations:
  - The map operation: execute the same operation on all data
  - The reduce operation: execute the same operation on each group of data produced by the map operation

Basically, you should see the generic "divide and conquer" pattern.

1.11 Typical MapReduce Jobs

- Counting tokens (words, URLs, etc.)
- Finding aggregate values in the target data set, e.g. the average value
- Processing geographical data
- etc.

Google Maps uses MapReduce to find the nearest feature, like a coffee shop, museum, etc., to a given address.

1.12 Fault-tolerance of MapReduce

- MapReduce operations have a degree of fault-tolerance and built-in recoverability from certain types of run-time failures, which is leveraged in production-ready systems (e.g. Hadoop)
- MapReduce enjoys these qualities of service due to its architecture based on process parallelism
- Failed units of work are rescheduled and resubmitted for execution should some mappers or reducers fail (provided the source data is still available)

Note: High Performance (e.g. Grid) Computing systems that use Message Passing Interface communication for check-points are more difficult to program for failure recovery.

1.13 Distributed Computing Economics

- MapReduce splits the workload into units of work (independent computation tasks) and intelligently distributes them across available worker nodes for parallel processing
- To improve system performance, MapReduce tries to start a computation task on the node where the data to be worked on is stored

Notes:

This helps avoid unnecessary network operations. "Data locality" (collocation of data with the compute node) underlies the principles of Distributed Computing Economics. Distributed Computing Economics was a topic covered by Jim Gray of Microsoft Research in his paper published back in 2003. His main conclusion was: "Put the computation near the data", or "One puts computing as close to the data as possible in order to avoid expensive network traffic".

1.14 MapReduce Systems

Many systems leverage the MapReduce programming model:
- Hadoop has a built-in MapReduce engine for executing both regular and streaming MapReduce jobs
- Amazon Web Services offers their clients the Elastic MapReduce (EMR) Service (running on a Hadoop cluster)

Notes:

Amazon Elastic MapReduce works as a Job flow where the client needs to follow pre-defined steps:
1. Provide the map and reduce applications/scripts to the Hadoop Framework via Amazon S3 bucket upload, or use pre-installed ones
2. Specify the S3 bucket containing the data input file(s) and the output S3 bucket to receive the result of the MapReduce workflow
3. Allocate the needed compute (EC2) instances
4. Execute the Job and pick up the output from the output S3 bucket

1.15 Summary

- The MapReduce programming model was influenced by functional programming languages
- MapReduce (one word) has the map and reduce components working together in a distributed computational environment
- Production MapReduce systems offer a number of qualities of service, such as fault-tolerance

Chapter 2 - Hadoop Overview

Objectives

In this chapter, participants will learn about:
- Apache Hadoop and its core components
- Hadoop's main design considerations

2.1 Apache Hadoop

- Apache Hadoop is a distributed fault-tolerant computing platform written in Java
- Designed as a massively parallel processing (MPP) system based on a distributed master-slave architecture
- Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models, e.g. MapReduce
- Hadoop is an open source project

Notes:

Hadoop is the name the son of Doug Cutting (the project's creator) gave to his yellow elephant toy. Doug Cutting is presently Chairman of the Apache Software Foundation and Cloudera's Chief Architect.

Massively parallel processing (MPP) is a system configuration that enlists hundreds or thousands of CPUs simultaneously for solving a particular computational task. Each computer in an MPP system controls its own resources along with task execution coordination with other nodes in the system. The coordination aspect sets Hadoop apart from MPP systems, as Hadoop is designed on the shared-nothing architecture, delegating the coordination activities to the dispatcher layer (made up of the JobTracker and TaskTrackers). According to the Wikipedia article: "A shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage."

Yahoo provided funds to make Hadoop a Web-scale technology; the initial use case for Hadoop at Yahoo was to create and analyze a WebMap graph that consists of about one trillion (10^12) Web links and 100 billion distinct URLs.

2.2 Apache Hadoop Logo

2.3 Typical Hadoop Applications

- Log and/or clickstream analysis
- Web crawling results processing
- Marketing analytics
- Machine learning and data mining
- Data archiving (e.g. for regulatory compliance, etc.)

A list of educational and production uses of Hadoop is available online.

2.4 Hadoop Clusters

- First versions of Hadoop were only able to handle 20 machines; newer versions are capable of running Hadoop clusters comprising thousands of nodes
- Hadoop clusters run on moderately high-end commodity hardware ($2-5K per machine)
- Hadoop clusters can be used as a data hub, data warehouse or a business analytics platform

2.5 Hadoop Design Principles

- Hadoop's design was influenced by ideas published in the Google File System (GFS) and MapReduce white papers
- Hadoop's core component, the Hadoop Distributed File System (HDFS), is the counterpart of GFS

- Hadoop uses a data processing system functionally equivalent to Google's MapReduce, also called MapReduce (a term coined by Google's engineers)
- One of the main principles of Hadoop's architecture is "design for failure"
- To deliver high-availability quality of service, Hadoop detects and handles failures at the application layer (rather than relying on hardware)

2.6 Hadoop's Core Components

The Hadoop project is made up of the following main components:
- Common: contains Hadoop infrastructure elements (interfaces with HDFS, system libraries, RPC connectors, Hadoop admin scripts, etc.)
- Hadoop Distributed File System (HDFS): Hadoop's persistence component, designed to run on clusters of commodity hardware and built around the "load once and read many times" concept
- MapReduce: a distributed data processing framework used as a data analysis system

2.7 Hadoop Simple Definition

In a nutshell, Hadoop is a distributed computing framework that consists of:
- Reliable data storage (provided via HDFS)
- An analysis system (provided by MapReduce)

2.8 High-Level Hadoop Architecture

Notes:

The HDFS NameNode acts as a meta-data server for all information stored in HDFS on data nodes. The MapReduce master is responsible for allocating the map and reduce jobs in a distributed environment in the most efficient way.

2.9 Hadoop-based Systems for Data Analysis

Hadoop (via HDFS) can host the following systems for data analysis:
- The MapReduce engine (the major data analytics component of the Hadoop project)
- Apache Pig
- The HBase database
- The Apache Hive data warehouse system
- The Apache Mahout machine learning system
- etc.

2.10 Hadoop Caveats

- Hadoop is a batch-oriented processing system
- Many Hadoop-centric business analytics systems have high processing latency, which comes from their dependencies on the MapReduce subsystem
- Querying even small data sets (under a gigabyte in size) may take up to several minutes
- This is in sharp contrast with the querying speed of relational databases such as MySQL, Oracle, DB2, etc., where functionally similar work can be done several orders of magnitude faster by applying indexes and other techniques
- Systems that bypass MapReduce when building queries have near real-time querying times (e.g. HBase and Cloudera Impala)

2.11 Summary

- Apache Hadoop is a distributed fault-tolerant computing platform written in Java, used for processing large data sets across clusters of computers
- Hadoop clusters may comprise thousands of nodes
- One of the main principles of Hadoop's architecture is "design for failure"


Chapter 3 - Hadoop Distributed File System Overview

Objectives

In this chapter, participants will learn about:
- The Hadoop Distributed File System (HDFS)
- Ways to access HDFS

3.1 Hadoop Distributed File System

- The Hadoop Distributed File System (HDFS) is a distributed, scalable, fault-tolerant and portable file system written in Java
- HDFS's architecture is based on the master/slave design pattern
- An HDFS cluster consists of:
  - A single NameNode (metadata server) holding directory information about files on DataNodes
  - A number of DataNodes, usually one per machine in the cluster
- HDFS design is "rack-aware" to minimize network latency

Notes:

HDFS includes a server called a secondary NameNode, which is not a fail-over NameNode; rather, the secondary NameNode connects to the primary NameNode at specified intervals and pulls out the primary NameNode's directory information for building current file-system snapshots. These snapshots can be used to restart a failed primary NameNode from the most recent checkpoint without having to replay the entire journal of file-system actions. The secondary NameNode has been deprecated in favor of the newly introduced Checkpoint node, which takes over the former's tasks and adds more functionality. Work is under way to provide automatic fail-over for the NameNode to prevent a single point of failure of a cluster.

Rack-awareness means taking into account a machine's physical location (rack) while scheduling tasks and allocating storage. Basically, HDFS is aware of the fact that network bandwidth between machines sitting in the same server rack is greater than that between machines in different racks.

The creators of HDFS made a design decision in favor of using machines with internal hard drives, as this helps ensure data locality (data is stored on the hard drive of the machine whose CPU is used for processing it, e.g. with MapReduce). For that reason, Storage Area Network (SAN) or similar storage technologies are not recommended for performance reasons.

3.2 Hadoop Distributed File System

HDFS is designed for efficient implementation of the following data processing pattern:
- Write once (normally, just load the data set onto the file system)
- Read many times (for data analysis, etc.)

HDFS functionality (and that of Hadoop) is geared towards batch-oriented rather than real-time scenarios. HDFS does not have a built-in cache. Processing of data-intensive jobs can be done in parallel.

3.3 Data Blocks

- A data file in HDFS is split into blocks (a typical block size used by HDFS is 64 MB)
- For large files, bigger block sizes help reduce the amount of metadata stored in the NameNode (metadata server)
- Data reliability is achieved by replicating data blocks across multiple machines
- By default, data blocks get replicated to three nodes: two on the same server rack, and one on a different rack for redundancy
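The block size and replication factor described above are cluster-wide settings. A minimal hdfs-site.xml sketch (property names as in Hadoop 1.x; the values shown are simply the defaults mentioned in the text):

```xml
<configuration>
  <property>
    <name>dfs.block.size</name>
    <!-- 64 MB in bytes, the typical/default block size -->
    <value>67108864</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- default three-way replication of each data block -->
    <value>3</value>
  </property>
</configuration>
```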

3.4 Data Block Replication Example

Example of a three-way (default) replication of a single data block for redundancy and achieving high data availability (the block size is usually a multiple of 64 MB).

3.5 HDFS NameNode Directory Diagram

Notes:

The HDFS NameNode maintains the directory metadata about all data nodes and data blocks stored for each file registered on HDFS.

3.6 Accessing HDFS

- HDFS is not a Portable Operating System Interface (POSIX) compliant file system
- It is modeled after traditional hierarchical file systems containing directories and files
- File-system commands (copy, move, delete, etc.) against HDFS can be performed in a number of ways:
  - Through the HDFS Command-Line Interface (CLI), which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.
  - Using the Java API for HDFS
  - Via the C-language wrapper around the Java API
  - Using a regular HTTP browser for file-system and file content viewing; this is made possible through the Web server (Jetty) embedded in the NameNode and DataNodes

3.7 Examples of HDFS Commands

The common way to invoke the HDFS CLI:

hadoop fs {HDFS commands}

Copying a file from the local file system over to HDFS (with the same name):

hadoop fs -put <filename_on_local_system>

Copying the file from HDFS to the local file system (with the same name):

hadoop fs -get <filename_on_hdfs>

Recursively listing files in a directory:

hadoop fs -ls -R <directory_on_hdfs>

Creating a directory under the current user's home directory (e.g. /user/userid/) on HDFS:

hadoop fs -mkdir REPORT

Notes:

Instead of the put command, you can use the functionally equivalent but more descriptive copyFromLocal command; the get command also has its more descriptive equivalent: copyToLocal.

3.8 Client Interactions with HDFS for the Read Operation

- Whenever a Hadoop client needs to read a file on HDFS, it first contacts the NameNode
- The NameNode locates the block ids associated with the file as well as the IP addresses of the DataNodes storing those blocks
- The NameNode returns the related information to the client
- The client contacts the related DataNodes and supplies the block ids for the DataNodes to locate the blocks on their local HDFS storage
- The DataNodes serve the blocks of data back to the client

3.9 Read Operation Sequence Diagram
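The same NameNode-then-DataNode sequence is also visible over HTTP through the WebHDFS REST interface served by the embedded Web server mentioned in section 3.6 (assuming WebHDFS is enabled on the cluster; the host name and default port below are illustrative). A read starts with an OPEN request to the NameNode, which answers with an HTTP 307 redirect to a DataNode that actually streams the block data:

```python
def webhdfs_open_url(namenode_host, path, port=50070):
    # Step 1 of a WebHDFS read: ask the NameNode to OPEN the file.
    # The NameNode replies with an HTTP 307 redirect whose Location
    # header points at a DataNode holding the data (step 2: the client
    # follows the redirect and receives the file content).
    return "http://%s:%d/webhdfs/v1%s?op=OPEN" % (namenode_host, port, path)

print(webhdfs_open_url("namenode.example.com", "/user/alice/report.txt"))
# http://namenode.example.com:50070/webhdfs/v1/user/alice/report.txt?op=OPEN
```

Any HTTP client (a browser, curl, or Python's urllib) can then follow the redirect, which mirrors the metadata-first read protocol shown in the sequence above.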

3.10 Client Interactions with HDFS for the Write Operation

- Whenever a Hadoop client needs to write a file to HDFS, it first contacts the NameNode
- The NameNode generates and registers meta-data about the file and allocates suitable DataNodes to keep the file's replicas
- Information about the allocated DataNodes is sent back to the client
- The client uses HDFS I/O facilities to stream content to the first DataNode in the list (the "primary" DataNode)
- The "primary" DataNode makes a local copy of the related data and engages the other DataNodes in peer-to-peer data sharing communication for data replication
- Each DataNode sends an acknowledgment of receiving its data back to the "primary" DataNode
- The client gets an acknowledgment from the "primary" DataNode as confirmation of the HDFS write operation
- The client notifies the NameNode on completion of the operation

Notes:

Once all DataNodes receive their replicas of the file, they cache the content in their memory (on the Java heap) and send acknowledgments to the "primary" DataNode. The actual data persistence to the local file system is done by the DataNodes asynchronously at a later time.

3.11 Communication inside HDFS

- HDFS is designed for batch processing rather than for interactive use
- This affected the design of communication protocols inside HDFS
- Communication inside HDFS is modeled after the Remote Procedure Call (RPC) application protocol layered on top of the TCP/IP protocol

3.12 Summary

File-system commands against HDFS can be performed in a number of ways:
- Through the fs command-line interface (which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.)
- Using the Java API for HDFS
- Via the C-language wrapper around the Java API



Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

From Wikipedia, the free encyclopedia

From Wikipedia, the free encyclopedia Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved. Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent. Hadoop for MySQL DBAs + 1 About me Sarah Sproehnle, Director of Educational Services @ Cloudera Spent 5 years at MySQL At Cloudera for the past 2 years sarah@cloudera.com 2 What is Hadoop? An open-source

More information

<Insert Picture Here> Big Data

<Insert Picture Here> Big Data Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

Parallel Processing of cluster by Map Reduce
