# Introduction to RHadoop. Master's Degree in Informatics Engineering. Master's Programme in ICT Innovation: Data Science (EIT ICT Labs Master School)


## Transcription


### MapReduce & DQ

Divide and Conquer (DQ), general idea:

- Divide a problem into smaller sub-problems
- Solve each sub-problem independently
- Combine the solutions

### DQ: pseudo-code

```
Function DQ(X: problem data)
  if small(X) then
    S = easy(X)
  else
    divide(X) => (X_1, ..., X_k)
    for i = 1 to k do
      S_i = DQ(X_i)
    S = combine(S_1, ..., S_k)
  return S
```
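As a rough illustration of the pseudo-code above, the following R sketch sums a numeric vector by divide and conquer. The names `dq_sum` and `threshold` are illustrative choices, not part of the slides: the `threshold` test plays the role of small(X), `sum()` plays the role of easy(X), and adding the partial results plays the role of combine.

```r
# Divide-and-conquer sum: a sketch of the DQ pseudo-code, not a practical way to sum.
# dq_sum and threshold are hypothetical names chosen for this example.
# Assumes threshold >= 1.
dq_sum <- function(x, threshold = 4) {
  # small(X): the problem is small enough to solve directly with easy(X)
  if (length(x) <= threshold) {
    return(sum(x))
  }
  # divide(X): split into two halves of (approximately) the same size
  mid <- length(x) %/% 2
  parts <- list(x[1:mid], x[(mid + 1):length(x)])
  # Solve each sub-problem independently
  partial <- lapply(parts, dq_sum, threshold = threshold)
  # combine(S_1, ..., S_k): add the partial sums
  Reduce(`+`, partial)
}

total <- dq_sum(1:100)
print(total)
```

Splitting into halves keeps the sub-problems balanced, which is one of the efficiency conditions listed for DQ.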

### DQ: efficiency

Efficiency of this approach:

- An appropriate threshold must be selected to apply easy(X)
- The decomposition and combining functions must be efficient
- Sub-problems must be (approximately) of the same size

### DQ: remarks

- It cannot be applied to every type of problem
- Sometimes it is not obvious how to divide a large problem into sub-problems
- If the division is uneven, the system will be unbalanced, which has an important impact on the overall performance of the algorithm
- The reduced problems must be significantly smaller than the original one, so that a massively parallel supercomputer can be used and the communication overhead is compensated

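### MapReduce: general scheme

[Figure: general scheme of MapReduce. Source not preserved in the transcription.]

### MapReduce: more detail

[Figure: MapReduce in more detail. Source: Hadoop Book]

### MapReduce: example

[Figure: MapReduce example. Source: MilanoR]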

### Hadoop Distributed File System (HDFS)

- Distributed file system that evolved from Google's implementation (GFS)
- Fault-tolerant: files are divided into chunks, which are distributed and replicated across the cluster
- Normally, the replication ratio is 3
- A master node (the NameNode) stores the metadata: which files exist, how many chunks each one is divided into, and where the chunks are stored
- Large block sizes are preferred (128 MB by default)
- In HDFS, blocks should be read from beginning to end (this favors the MapReduce approach)
- Files in HDFS ARE NOT stored along with the host system files
- HDFS is normally an abstraction OVER an existing file system (ext3, ext4, etc.), so there are specific commands to manipulate the HDFS file system
- To open a file stored in HDFS, the client must contact the NameNode to retrieve the location of each block of the file (at the DataNodes)
- Parallel reads are possible (and preferred)
- Data locality: normally, a job is run on the same node that stores the data it must manipulate
- The metadata stored in the NameNode is not automatically replicated (this must be done manually or with an inactive NameNode)
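To make the block size and replication figures above concrete, here is a small arithmetic sketch in R. The function name `hdfs_footprint` and the 1000 MB example file are assumptions made for illustration; the defaults of 128 MB blocks and a replication ratio of 3 come from the slides.

```r
# Rough estimate of how a file is laid out in HDFS.
# hdfs_footprint is a hypothetical helper, not part of rhdfs or Hadoop.
hdfs_footprint <- function(file_mb, block_mb = 128, replication = 3) {
  # The last block may be only partially filled, hence the ceiling
  blocks <- ceiling(file_mb / block_mb)
  list(
    blocks = blocks,                          # blocks tracked by the NameNode
    block_replicas = blocks * replication,    # block copies spread over DataNodes
    raw_storage_mb = file_mb * replication    # disk space used across the cluster
  )
}

footprint <- hdfs_footprint(1000)
print(footprint)
```

Under these assumptions a 1000 MB file occupies 8 blocks, stored as 24 block replicas that take about 3000 MB of raw disk space.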

### HDFS from the command line

- Each HDFS user has a personal directory
- No security directives are implemented, so users can write anywhere
- HDFS is accessed through the `hdfs` command: `hdfs dfs <command>`
- Important commands:
  - `-copyFromLocal` vs. `-copyToLocal`
  - `-mkdir`
  - `-cp`, `-mv`
- Documentation is available on the Hadoop website
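The commands listed above could be driven from R roughly as follows. This is only a sketch: `hdfs_dfs` is a hypothetical wrapper written for this example, all paths are made up, and the calls are shown with `run = FALSE` so that they only build the argument vector instead of touching a cluster.

```r
# Build (and optionally run) an "hdfs dfs -<command> ..." call.
# hdfs_dfs is a hypothetical helper; it does no quoting of paths with spaces.
hdfs_dfs <- function(command, ..., run = nzchar(Sys.which("hdfs"))) {
  args <- c("dfs", paste0("-", command), ...)
  if (run) {
    # Only executed when the hdfs binary is available and run is TRUE
    system2("hdfs", args)
  }
  invisible(args)
}

# Hypothetical paths, for illustration only
mkdir_args <- hdfs_dfs("mkdir", "/user/hadoop/demo", run = FALSE)
put_args <- hdfs_dfs("copyFromLocal", "quijote.txt", "/user/hadoop/demo/", run = FALSE)
get_args <- hdfs_dfs("copyToLocal", "/user/hadoop/demo/quijote.txt", "copy.txt", run = FALSE)

print(paste("hdfs", paste(put_args, collapse = " ")))
```

`-copyFromLocal` moves data from the local file system into HDFS, while `-copyToLocal` goes in the opposite direction, which is why the argument order differs between `put_args` and `get_args`.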

### Hadoop MRv1 vs YARN (MRv2)

Hadoop MRv1:

- Resource management and task scheduling and monitoring are done by a single process (a bottleneck): the Job Tracker
- Each sub-problem is run by an independent process: the Task Tracker

Hadoop MRv2:

- Resource management and task scheduling and monitoring are split into different processes
  - Resource Manager (RM): overall resource management
  - Application Master (AM): per-job task scheduling and monitoring
- A NodeManager runs the tasks at each computing node


### Example: wordcount

- Input: a document made up of words
- Output: a set of (word, count(word)) pairs
- Two functions: map and reduce

```
map(k1, v1):
  for each word w in v1
    emit(w, 1)

reduce(k2, v2_list):
  int result = 0;
  for each v in v2_list
    result += v;
  emit(k2, result)
```
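The map and reduce functions above can be simulated on a single machine in plain R, without Hadoop, to see what each phase produces. This is a minimal sketch: the input lines and the names `map_wc`, `reduce_wc`, and `grouped` are invented for the example, and `split()` stands in for the shuffle step that Hadoop performs between map and reduce.

```r
# Toy input: each element plays the role of one input record (a line of text)
docs <- c("to be or not to be", "to do is to be")

# map(k1, v1): emit (w, 1) for each word w in the line
map_wc <- function(line) {
  words <- unlist(strsplit(line, " ", fixed = TRUE))
  lapply(words, function(w) list(key = w, value = 1))
}

# Run the mapper over all records and flatten into one list of (key, value) pairs
pairs <- unlist(lapply(docs, map_wc), recursive = FALSE)
keys <- vapply(pairs, function(p) p$key, character(1))
values <- vapply(pairs, function(p) p$value, numeric(1))

# Shuffle: group all values that share the same key
grouped <- split(values, keys)

# reduce(k2, v2_list): add up the values for one key
reduce_wc <- function(word, counts) sum(counts)
counts <- mapply(reduce_wc, names(grouped), grouped)

print(counts)
```

Each word ends up with one reduce call that receives all of its 1s, which is the same contract the RHadoop `mapreduce` function provides at cluster scale.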


### RHadoop: interacting with HDFS

```r
# Load rhdfs library
library(rhdfs)
# Start rhdfs
hdfs.init()
# Basic "ls", path is mandatory
hdfs.ls("/user/hadoop")
# Create directory
work.dir <- "/user/hadoop/aux/"
hdfs.mkdir(work.dir)
# And delete
hdfs.delete(work.dir)
# Create again
hdfs.mkdir(work.dir)
```

### RHadoop: wordcount example

```r
# The rmr2 package provides mapreduce(), keyval(), to.dfs() and from.dfs()
library(rmr2)

wordcount = function(input,
                     # The output can be an HDFS path but
                     # if it is NULL some temporary file will
                     # be generated and wrapped in a big data
                     # object, like the ones generated by to.dfs
                     output = NULL,
                     pattern = " ") {
  # Defining wordcount Map function
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # Defining wordcount Reduce function
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  # Defining MapReduce parameters by calling mapreduce function
  mapreduce(input = input,
            output = output,
            # You can specify your own input and output formats
            # and produce binary formats with the functions
            # make.input.format and make.output.format
            input.format = "text",
            map = wc.map,
            reduce = wc.reduce,
            # With combiner
            combine = TRUE)
}
```

```r
# Running MapReduce Job by passing the Hadoop
# input directory location as parameter
wordcount('/user/hadoop/wordcount/quijote.txt')
# Retrieving the RHadoop MapReduce output
# data by passing output
# directory location as parameter
from.dfs("/tmp/file1b0817a5bcd0")
```

El Quijote can be downloaded from: http://

### RHadoop: airline example

- We will analyze the commercial data of an airline
- The input data file is a CSV
- We will need a custom input formatter to ease the task of processing the file
- Data can be downloaded from: http://stat-computing.org/dataexpo/2009/1987.csv.bz2

### RHadoop: airline example (input format)

```r
#
# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
#
asa.csv.input.format = make.input.format(format='csv', mode='text',
    streaming.format = NULL, sep=',',
    col.names = c('Year', 'Month', 'DayofMonth', 'DayOfWeek',
                  'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime',
                  'UniqueCarrier', 'FlightNum', 'TailNum',
                  'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
                  'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
                  'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode',
                  'Diverted', 'CarrierDelay', 'WeatherDelay', 'NASDelay',
                  'SecurityDelay', 'LateAircraftDelay'),
    stringsAsFactors=F)
```

### RHadoop: airline example (mapper)

```r
#
# the mapper gets keys and values from the input formatter
# in our case, the key is NULL and the value is a data.frame from read.table()
#
mapper.year.market.enroute_time = function(key, val.df) {
  # Remove header lines, cancellations, and diversions:
  val.df = subset(val.df, Year != 'Year' & Cancelled == 0 & Diverted == 0)
  # We don't care about direction of travel, so construct a new 'market' vector
  # with airports ordered alphabetically (e.g, LAX to JFK becomes 'JFK-LAX')
  market = with(val.df, ifelse(Origin < Dest,
                               paste(Origin, Dest, sep='-'),
                               paste(Dest, Origin, sep='-')))
  # key consists of year, market
  output.key = data.frame(year=as.numeric(val.df$Year), market=market,
                          stringsAsFactors=F)
  # emit data.frame of gate-to-gate elapsed times (CRS and actual) + time in air
  output.val = val.df[,c('CRSElapsedTime', 'ActualElapsedTime', 'AirTime')]
  colnames(output.val) = c('scheduled', 'actual', 'inflight')
  # and finally, make sure they're numeric while we're at it
  output.val = transform(output.val,
                         scheduled = as.numeric(scheduled),
                         actual = as.numeric(actual),
                         inflight = as.numeric(inflight))
  return( keyval(output.key, output.val) )
}
```
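The direction-independent `market` key built by the mapper can be tried on a toy data frame in plain R, without Hadoop or rmr2. The three flights below are invented for illustration; only the `ifelse`/`paste` expression mirrors the mapper.

```r
# Hypothetical sample of flights; the column names follow the input format
flights <- data.frame(
  Origin = c("LAX", "JFK", "SFO"),
  Dest   = c("JFK", "LAX", "BOS"),
  stringsAsFactors = FALSE
)

# Order each airport pair alphabetically so both directions share one key
market <- with(flights, ifelse(Origin < Dest,
                               paste(Origin, Dest, sep = "-"),
                               paste(Dest, Origin, sep = "-")))

print(market)
```

The first two flights travel in opposite directions between the same airports and therefore map to the same key, so their values reach the same reduce call.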

### RHadoop: airline example (reducer)

```r
#
# the reducer gets all the values for a given key
# the values (which may be multi-valued as here) come in the form of a data.frame
#
reducer.year.market.enroute_time = function(key, val.df) {
  output.key = key
  output.val = data.frame(flights = nrow(val.df),
                          scheduled = mean(val.df$scheduled, na.rm=T),
                          actual = mean(val.df$actual, na.rm=T),
                          inflight = mean(val.df$inflight, na.rm=T))
  return( keyval(output.key, output.val) )
}
```
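What the reducer computes for each key can be reproduced locally in plain R. In this sketch the values are invented, `reduce_one` is a hypothetical stand-in for the reducer body (without `keyval`), and `split()` imitates the grouping by key that Hadoop does before calling the reducer.

```r
# Hypothetical mapper output: one row per flight, keyed by market
vals <- data.frame(
  market    = c("JFK-LAX", "JFK-LAX", "BOS-SFO"),
  scheduled = c(330, 340, 360),
  actual    = c(335, NA, 350),
  stringsAsFactors = FALSE
)

# Same aggregation as the reducer: flight count and means, ignoring NAs
reduce_one <- function(val.df) {
  data.frame(flights   = nrow(val.df),
             scheduled = mean(val.df$scheduled, na.rm = TRUE),
             actual    = mean(val.df$actual, na.rm = TRUE))
}

# Group the values by key, then apply the reducer to each group
by_market <- split(vals[, c("scheduled", "actual")], vals$market)
summary.df <- do.call(rbind, lapply(by_market, reduce_one))

print(summary.df)
```

Note that `na.rm = TRUE` means a flight with a missing elapsed time still counts in `flights` but does not affect the corresponding mean.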

### RHadoop: final configuration and execution

```r
mr.year.market.enroute_time = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csv.input.format,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=2")
            ),
            verbose=T)
}

# hdfs.data and hdfs.out are the HDFS input and output paths;
# their definitions are not shown in the slides
out = mr.year.market.enroute_time(hdfs.data, hdfs.out)
```

### RHadoop: gathering results

```r
results = from.dfs( out )
results.df = as.data.frame(results, stringsAsFactors=F )
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'inflight')
print(head(results.df))
# save(results.df, file="out/enroute.time.market.RData")
```
