Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC July 10, 2014


1 Hadoop/MapReduce Workshop Dan Mazur, McGill HPC daniel.mazur@mcgill.ca guillimin@calculquebec.ca July 10, 2014

2 Outline
- Hadoop introduction and motivation
- Python review
- HDFS - The Hadoop Filesystem
- MapReduce examples and exercises: Wordcount, Distributed grep, Distributed sort, Maximum, Mean and standard deviation, Top 10, Combiners

3 Exercise 0: Login and Setup
Log-in to Guillimin
$ ssh -X class##@guillimin.hpc.mcgill.ca
Use the account number and password from the slip you received
Log-in to the Hadoop cluster
$ ssh -X lm-2r01-n01
Copy workshop files
$ cp -R /software/workshop/hadoop .
Load Hadoop module
$ module show hadoop
$ module load hadoop

4 Exercise 1: First MapReduce job
To make sure your environment is set up correctly, launch an example MapReduce job:
$ hadoop jar $HADOOP_EXAMPLES pi <num maps> <num samples>
$HADOOP_EXAMPLES=/software/CentOS-6/tools/hadoop/hadoop-<version>/share/hadoop/mapreduce/hadoop-mapreduce-examples-<version>.jar
Final output:
Job Finished in ... seconds
Estimated value of Pi is ...

5 Hadoop What is Hadoop? 5

6 Hadoop What is Hadoop?
- A collection of related software ("software framework", "software stack", "software ecosystem")
- Data-intensive computing ("Big Data"): scales with data size
- Fault-tolerant
- Parallel analysis of unstructured and semi-structured data
- Runs on a cluster of commodity hardware
- Open source

7 MapReduce What is MapReduce?
- Parallel, distributed programming model for large-scale data processing
- Map(): filter, sort, embarrassingly parallel work, e.g. sort participants by first name
- Reduce(): summary, e.g. count participants whose first name starts with each letter
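A toy serial sketch of the same idea in Python (not Hadoop code, just to fix the mental model): a "map" step emits (key, value) pairs, a sort groups them, and a "reduce" step summarizes each group.

    participants = ['Alice', 'Bob', 'Anna', 'Carol']

    # map: emit (first letter, 1) for each participant, then sort by key
    mapped = sorted((name[0], 1) for name in participants)

    # reduce: count the pairs for each key
    counts = {}
    for letter, one in mapped:
        counts[letter] = counts.get(letter, 0) + 1
    print counts   # {'A': 2, 'B': 1, 'C': 1}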

8 Hadoop Ecosystem
Apache Hadoop core:
- HDFS - Hadoop Distributed File System
- Yarn - Resource management, job scheduling
Related projects:
- Pig - High-level scripting for MapReduce programs
- Hive - Data warehouse with SQL-like queries
- HCatalog - Abstracts data storage filenames and formats
- HBase - Database
- Zookeeper - Maintains and synchronizes configuration
- Oozie - Workflow scheduling
- Sqoop - Transfers data to/from relational databases

9 Hadoop Motivation
For big data, hard disk input/output (I/O) is a bottleneck. We have seen huge technology improvements in both CPU speeds and storage capacity, but I/O performance has not improved as dramatically. We will see that Hadoop addresses this problem by parallelizing I/O operations across many disks. The genius of Hadoop is how easy this parallelization is from the developer's perspective.

10 Hadoop Motivation Big Data Challenges - The V-words
- Volume: amount of data (terabytes, petabytes). Want to distribute storage across many nodes and analyze the data in place.
- Variety: data comes in many formats, not always in a relational database. Want to store data in its original format.
- Velocity: rate at which the size of the data grows, or speed with which it must be processed. Want to expand storage and analysis capacity as the data grows.
Data is big data if its volume, variety, or velocity are too great to manage easily with traditional tools.

11 How Does Hadoop Solve These Problems?
- Distributed file system (HDFS), scalable with redundancy
- Parallel computations on data nodes
- Batch (scheduled) processing

12 Hadoop vs. Competition
Hadoop works well for loosely coupled (embarrassingly parallel) problems.
[Chart: process coupling vs. parallelism for MPI, databases, and Hadoop]

13 Hadoop vs. Competition
Map or reduce tasks are automatically re-run if they fail. Data is stored redundantly and reproduced automatically if a drive fails.
[Chart: fault tolerance vs. parallelism for databases, MPI, and Hadoop]

14 Hadoop vs. Competition
Hadoop makes certain problems very easy to solve: the hardest parts of parallel programming are abstracted away. Today we will write several practical codes that could scale to 1000s of nodes with just a few lines of code.
[Chart: developer productivity vs. parallelism for databases, MPI, and Hadoop]

15 Hadoop and the Competition
Hadoop, MPI, and databases are all improving on their weaknesses; all are becoming fault-tolerant platforms for tightly-coupled, massively parallel problems. Hadoop integrates easily into a workflow that includes MPI and/or databases:
- Sqoop, HBase, etc. for working with databases
- Hadoop for post-MPI data analysis
- Hadoop for pre-MPI data processing

16 Python
Hadoop is implemented in Java, but developers can work with any programming language. For a workshop, it is important to have a common language; we will use Python.

17 Python - for loops
Can loop over individual instances of 'iterable' objects (lists, tuples, dictionaries). Looped sections use an indented block. Be consistent: use a tab or 4 spaces, not both. Do not forget the colon.

    mylist = ['one', 'two', 'three']
    for item in mylist:
        print(item)

18 Python standard input/output

    #!/usr/bin/python
    # Load modules for system and comma-separated-value file functions
    import sys
    import csv

    reader = csv.reader(sys.stdin, delimiter=',')

    # Loop over lines in reader
    for line in reader:
        data0 = line[0]
        data1 = line[1]

19 Dictionaries
Unordered set of key/value pairs; keys are unique and can be used as an index.

    >>> dict = {'key':'value', 'apple':'the round fruit of a tree'}
    >>> print dict['key']
    value
    >>> print dict['apple']
    the round fruit of a tree

20 Hadoop Distributed File System (HDFS) Key Concepts:
- Data is read and written in minimum units ("blocks")
- A master node (the "namenode") manages the filesystem tree and the metadata for each file
- Data is stored on a group of worker nodes ("datanodes")

21 HDFS Blocks
[Diagram: a 150 MB file, myfile.txt, is split into blocks 1 (64 MB), 2 (64 MB), and 3 (22 MB) for storage across six datanodes]

22 HDFS Blocks
[Diagram: the blocks of myfile.txt are sent to different datanodes]
Data is distributed block-by-block to multiple nodes.

23 HDFS Blocks
[Diagram: each block of myfile.txt is stored on three different datanodes]
Data redundancy: default = 3x. If we lose a node, the data is still available on 2 other nodes, and the namenode arranges to create a 3rd copy on another node.
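As a quick sanity check on the numbers in the diagrams, a few lines of Python show how many block copies the 150 MB file occupies under the defaults used here (64 MB blocks, 3x replication):

    file_mb, block_mb, replication = 150, 64, 3
    # split 150 MB into 64 MB blocks; the last block holds the remainder
    blocks = [min(block_mb, file_mb - offset) for offset in range(0, file_mb, block_mb)]
    print blocks                      # [64, 64, 22]
    print len(blocks) * replication   # 9 block copies stored on the datanodes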

24 Exercise 2: Using HDFS
Put a file into HDFS
$ hdfs dfs -put titanic.txt
List files in HDFS
$ hdfs dfs -ls
Output the file contents
$ hdfs dfs -cat titanic.txt
$ hdfs dfs -tail titanic.txt
Get help
$ hdfs dfs -help
Put the workshop data sets into HDFS
$ hdfs dfs -put usask_access_logs
$ hdfs dfs -put household_power_consumption.txt

25 MapReduce
Roman census approach: we want to count (and tax) all people in the Roman empire. It is better to go to where the people are (decentralized) than to try to bring them to you (centralized): bring back information from each village (map phase), then summarize the global picture (reduce phase).

26 Roman Census: Mapping
[Diagram: a mapper visits each village and sends counts back to the capital, e.g. "287 men, 293 women, 104 children, 854 sheep, ..."]
Note: These mappers are also combiners in Hadoop language. We will discuss what this means.

27 Roman Census: Reducing
[Diagram: the per-village counts are grouped by key (men, women, children, sheep, ...) and each reducer totals one key, e.g. the sheep counts 854, 34, 206, 91, 545, and 1032 are combined into 2762 sheep]

28 MapReduce
Data -> mappers -> (key, value) pairs -> sort and shuffle -> (key, all values) -> reducers -> results
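The same flow can be sketched serially in a few lines of Python, with word counting as the example; itertools.groupby plays the role of the sort-and-shuffle step:

    from itertools import groupby
    from operator import itemgetter

    data = ['hello world', 'hello hadoop']

    # mappers: emit (key, value) pairs
    pairs = [(word, 1) for line in data for word in line.split()]
    # sort and shuffle: group pairs by key
    pairs.sort(key=itemgetter(0))
    # reducers: summarize all values for each key
    for key, group in groupby(pairs, key=itemgetter(0)):
        print key, sum(value for _, value in group)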

29 Mapper
- Takes specified data as input
- Works with a fraction of the data, in parallel
- Outputs intermediate records as (key, value) pairs (recall hash tables or Python dictionaries)

30 Reducer Takes a key or set of keys with all associated values as input Works with all data for that key Outputs the final results 30

31 MapReduce Word Counting
Want to count the frequency of words in a document. What are the (key, value) pairs for our mapper output?
A) key=1, value='word'
B) key='word', value=1
C) key=[number in hdfs block], value='word'
D) key='word', value=[number in hdfs block]
E) Something else

32 MapReduce Word Counting
Want to count the frequency of words in a document. What are the (key, value) pairs for our mapper output?
Answer: B) key='word', value=1
Explanation: We want to sort according to the words, so the word is the key. We can generate a pair for each word; we don't need the mapper to keep track of frequencies.

33 MapReduce Word Counting
If our mapper input is "hello world, hello!", what will our reducer input look like?
A) hello world hello
B) hello 1, world 1, hello 1
C) hello 1, hello 1, world 1
D) hello 2, world 1

34 MapReduce Word Counting
If our mapper input is "hello world, hello!", what will our reducer input look like?
Answer: C) hello 1, hello 1, world 1
Explanation: The reducer receives SORTED (key, value) pairs; the sorting is done automatically by Hadoop. D (hello 2, world 1) is also possible; we will learn about combiners later.

35 MapReduce Word Counting
Want to count the frequency of words in a document. What are the (key, value) pairs for our reducer output?
A) key=1, value='word'
B) key='word', value=1
C) key=[count in document], value='word'
D) key='word', value=[count in document]
E) Something else

36 MapReduce Word Counting
Want to count the frequency of words in a document. What are the (key, value) pairs for our reducer output?
Answer: D) key='word', value=[count in document]

37 Hadoop Streaming
Streaming lets developers use any programming language for mapping and reducing. Scripts use standard input and standard output, and the first tab character delimits between key and value. Similar to Bash pipes:
$ cat file.txt | ./mapper | sort | ./reducer
HADOOP_STREAM=/software/CentOS-6/tools/hadoop/hadoop-<version>/share/hadoop/tools/lib/hadoop-streaming-<version>.jar
hadoop jar $HADOOP_STREAM -input <input dir> -output <output dir> \
  -mapper <mapper script> -file <mapper script> \
  -reducer <reducer script> -file <reducer script>

38 mapper.py

    #!/usr/bin/env python
    # Scripts require a hash bang line
    # Import the sys module for stdin
    import sys

    # Loop over standard input
    for line in sys.stdin:
        # split the line into words
        words = line.split()
        # Loop over words, printing tab-separated key, value pairs
        for word in words:
            print word, '\t', 1

39 reducer.py

    #!/usr/bin/env python
    import sys

    prevword = None
    wordcount = 0
    word = None

    for line in sys.stdin:
        # parse the tab-separated output of mapper.py
        word, count = line.split('\t', 1)
        count = int(count)
        if word == prevword:
            wordcount += count
        else:
            # input is sorted, so a new word means the previous one is finished
            if prevword:
                print prevword, '\t', wordcount
            wordcount = count
            prevword = word

    # do not forget the final word
    if prevword:
        print prevword, '\t', wordcount

40 Testing map and reduce scripts
It is useful to test your scripts with a small amount of data in serial to check for syntax errors:
$ head -100 mydata.txt | ./mapper.py | sort | ./reducer.py

41 Exercise 3: Word count
Place the directory montgomery into HDFS
$ hdfs dfs -put montgomery
Submit a MapReduce job with your *tested* scripts to count the word frequencies in Lucy Maud Montgomery's books
$ hadoop jar $HADOOP_STREAM \
  -mapper mapper_wordcount.py \
  -file mapper_wordcount.py \
  -reducer reducer_wordcount.py \
  -file reducer_wordcount.py \
  -input montgomery \
  -output wordcount

42 Exercise 3: Word count
View the output directory
$ hdfs dfs -ls wordcount
View your results
$ hdfs dfs -cat wordcount/part-00000
View your (sorted) results
$ hdfs dfs -cat wordcount/part-00000 | sort -k 2 -n

43 Storage
The Guillimin Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload two 400GB files and run WordCount on them both. What will happen?
A) The data upload fails at the first file
B) The data upload fails at the second file
C) MapReduce job fails
D) None of the above. MapReduce is successful.

44 Storage
The Guillimin Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload two 400GB files and run WordCount on them both. What will happen?
Answer: D) None of the above. MapReduce is successful.
Explanation: With 3x replication the cluster can store 1.0TB (10 nodes x 300GB / 3); Alice's 0.8TB fits.

45 Storage
The Guillimin Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload three 400GB files and run WordCount on them all. What will happen?
A) The data upload fails at the first file
B) The data upload fails at the second file
C) The data upload fails at the third file
D) MapReduce job fails
E) None of the above. MapReduce is successful.

46 Storage
The Guillimin Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload three 400GB files and run WordCount on them all. What will happen?
Answer: C) The data upload fails at the third file
Explanation: Our small cluster can only store 1.0TB of data with 3x replication! Alice wants to upload 1.2TB.

47 Simplifying MapReduce Commands
The native streaming commands are cumbersome. TIP: Create simplifying aliases and functions, e.g.

    mapreduce_stream(){
        hadoop jar $HADOOP_STREAM -mapper $1 \
            -reducer $2 -file $1 -file $2 \
            -input $3 -output $4
    }
    alias mrs=mapreduce_stream

Place these commands into ~/.bashrc so they are executed in each new bash session (each login). To avoid confusion, we will only use the native commands today.
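With these definitions in place, the word count job from Exercise 3 could be launched as, for example:

    $ mrs mapper_wordcount.py reducer_wordcount.py montgomery wordcount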

48 Exercise: Hadoop Web UI
Hadoop includes a web-based user interface.
Launch a firefox window
$ firefox &
Navigate to the Hadoop Job Monitor
Navigate to the namenode and filesystem
Navigate to the job history

49 [Screenshot: Hadoop web UI]

50 Example: Distributed Grep
Note: We don't have to write any scripts! Note: There is no reducer phase.
Hadoop Command:
$ hadoop jar $HADOOP_STREAM \
  -D mapreduce.job.reduces=0 \
  -D mapred.reduce.tasks=0 \
  -input titanic.txt \
  -output grepout \
  -mapper '/bin/grep Williams'
View Results:
$ hdfs dfs -cat grepout/part-0000*

51 Household Power Consumption
Dataset: household_power_consumption.txt, from the UCI Machine Learning Repository.
9 columns, semicolon separated (see household_power_consumption.explain):
1. date: dd/mm/yyyy
2. time: hh:mm:ss
3. minute-averaged active power (kilowatts)
4. minute-averaged reactive power (kilowatts)
5. minute-averaged voltage (volts)
6. minute-averaged current (amps)
7. kitchen active energy (watt-hours)
8. laundry active energy (watt-hours)
9. water-heater and A/C active energy (watt-hours)
$ hdfs dfs -put household_power_consumption.txt

52 Exercise 4: Distributed Sort
Write a mapper that, for each line of data, outputs
Key: minute-averaged active power (3rd column, line[2])
Value: the entire line of input data
Write a hadoop command that will sort the input data by the key value. Use /bin/cat as the reducer (not necessary to attach a script with -file).
$ hadoop jar $HADOOP_STREAM \
  -D mapred.output.key.comparator.class=\
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-n \
  -mapper mapper_dist-sort.py \
  -file mapper_dist-sort.py \
  -reducer /bin/cat \
  -input household_power_consumption.txt \
  -output sortedpower
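One possible mapper_dist-sort.py (a sketch only; writing your own is the exercise). It assumes the semicolon-separated format from slide 51 and skips lines that are too short:

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        line = line.strip()
        fields = line.split(';')
        try:
            # key: minute-averaged active power (3rd column); value: whole line
            print '%s\t%s' % (fields[2], line)
        except IndexError:
            continue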

53 On which date was the maximum minute-averaged active power?
What should the output of our mapper be?
A) power 1
B) date 1
C) date power
D) something else

54 On which date was the maximum minute-averaged active power?
What should the output of our mapper be?
Answer: C) date power
Each mapper emits (date, power) pairs; the reducer then only needs to track the largest power and the date on which it occurred.

55 Exercise 5: Compute the maximum Write a mapper and a reducer to compute the maximum value of the minute-averaged active power (3rd column), as well as the date on which this power occurred 55
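One possible reducer (a sketch; the mapper is like the distributed-sort mapper, but emitting date as the key and power as the value). Since max() is associative and commutative, the same script can later be reused as a combiner (see slide 58):

    #!/usr/bin/env python
    import sys

    maxpower, maxdate = None, None
    for line in sys.stdin:
        date, power = line.split('\t', 1)
        try:
            power = float(power)
        except ValueError:
            continue   # skip malformed records
        if maxpower is None or power > maxpower:
            maxpower, maxdate = power, date

    print '%s\t%s' % (maxdate, maxpower)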

56 Combiners
To compute the max, you may have output a list of all values with a single common key and had a single reducer compute the maximum value in serial. We would like to do some pre-reduction on the mapper nodes to shift workload from the reducer to the mappers. To find the maximum, we only need to send the maximum from each mapper through the network, not every value.

57 Combiners (maximum)
[Diagram: input lines such as "20/12/2006;02:46:00;1.516;0.262;...;6.200;0.000;1.000;..." are mapped to (date, power) pairs; each combiner emits only the maximum power seen by its mapper; the shuffle/sort and reduce then find the global maximum]

58 Combiners
To compute the maximum, the reducer and the combiner can be the same script, because max() is associative and commutative:
max([a,b]) = max([b,a])
max([a, max([b,c])]) = max([max([a,b]),c])

59 Combiners
To test your combiner scripts:
$ cat data.txt | ./mapper.py | sort | ./combiner.py | ./reducer.py
Hadoop sorts locally before the combiner and globally between the combiner and reducer. Note that Hadoop does not guarantee how many times the combiner will be run.

60 Histograms
Histograms 'bin' continuous data into discrete ranges.
[Image: example histogram, credit Mwtoews]

61 Exercise 6: Histogram
Write a mapper that uses round(power) to 'bin' the minute-averaged active power readings (3rd column). Output for each reading: [power bin], 1
Write a reducer that creates combined counts. Input: [power bin], count. Output: [power bin], count. This script must also function as a combiner.
Submit your tested scripts as a Hadoop job, using the reducer script as the combiner:
$ hadoop jar $HADOOP_STREAM ... -combiner reducer_hist.py ...
You may generate a plot (plotting is outside the scope of the workshop):
$ hdfs dfs -cat histogram/part-* | solutions/plot_hist.py
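A possible mapper (a sketch): it bins each reading with round() and emits a count of 1. The summing reducer is associative and commutative, which is what lets it double as the combiner.

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        try:
            power = float(line.split(';')[2])
        except (IndexError, ValueError):
            continue   # skip the header and malformed lines
        print '%d\t1' % int(round(power))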

62 [Plot: histogram of minute-averaged active power]

63 Exercise 7: Mean Write a mapper and reducer to compute the mean value of minute-averaged active power (3rd column) 63
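One way to do it (a sketch): the mapper emits every reading under a single shared key (e.g. 'power'), so one reducer sees all the values and can accumulate a running sum and count:

    #!/usr/bin/env python
    import sys

    total, n = 0.0, 0
    for line in sys.stdin:
        key, value = line.split('\t', 1)
        total += float(value)
        n += 1

    if n:
        print 'mean\t%f' % (total / n)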

64 Mean and Standard Deviation
We can't easily use combiners to compute the mean:
max(max(a,b), max(c,d,e)) = max(a,b,c,d,e)
mean(mean(a,b), mean(c,d,e)) != mean(a,b,c,d,e)
A reducer function can be used as a combiner only if it is
associative: (A*B)*C = A*(B*C)
commutative: A*B = B*A
e.g.: counting, addition, multiplication, ...
Computing the mean and standard deviation means the reducer is stuck with a lot of math.
Combiner idea for mean: key = intermediate sum, value = number of entries
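A few lines of Python show why (sum, count) pairs combine exactly while partial means do not:

    a, b = [1.0, 2.0], [3.0, 4.0, 5.0]              # values seen by two mappers
    partials = [(sum(a), len(a)), (sum(b), len(b))]  # one (sum, count) per mapper
    total = sum(s for s, n in partials)
    count = sum(n for s, n in partials)
    print total / count    # 3.0, the true mean of all five values
    print (1.5 + 4.0) / 2  # 2.75 -- naively averaging the two means is wrong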

65 Power consumption by day of week
Are there days of the week when more power is consumed, on average? We want to know the mean and standard deviation for each week day. Simplification: compute the average of the minute-averaged powers, grouped by day of week.

66 Python datetime
The datetime module is a powerful library for working with dates and times. We can easily find the day of the week from the date:

    from datetime import datetime
    # dates in this dataset are dd/mm/yyyy; weekday() gives 0 for Monday
    weekday = datetime.strptime(date, "%d/%m/%Y").weekday()

67 Exercise 8: Mean and Standard Deviation
Write mapper and reducer code to compute the mean and standard deviation for active power (3rd column) for each of the seven days of the week. Test your scripts using serial Bash commands, then submit your job to Hadoop.
Tip: Wikipedia - "Algorithms for calculating variance" has Python code to compute the mean and variance in a single pass.
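For reference, a one-pass mean/variance accumulator in the spirit of that article (a sketch to adapt inside your reducer):

    def one_pass_stats(values):
        n, mean, m2 = 0, 0.0, 0.0
        for x in values:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / n if n else 0.0   # population variance
        return mean, variance

    print one_pass_stats([1.0, 2.0, 3.0, 4.0])   # (2.5, 1.25)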

68 Speedup - Mean and St.Dev. (~2 million entries)
Serial Bash version:
$ cat household_power_consumption.txt | ./mapper.py | sort | ./reducer.py
75 seconds
Hadoop version:
2 mappers, 1 reducer: 50 seconds (speedup 1.5X)
2 mappers, 2 reducers: 48 seconds (speedup 1.6X), with -D mapred.reduce.tasks=2
4 mappers, 4 reducers: 29 seconds (speedup 2.6X)

69 Top 10 List
Mapper output: key 1
Combiner returns the top 10 for each mapper. Output: key count
Reducer finds the global top 10. Output: key count

70 Exercise 9: Top 10 websites
Produce a top 10 list of websites accessed on the University of Saskatchewan website (usask_access_logs).
Be careful: some lines won't conform to your expectations. How to handle them? Skip? Exception?
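A sketch of one possible mapper; the field position assumes the logs follow the Apache Common Log Format ('... "GET /path HTTP/1.0" ...'), which you should verify against the actual files. A try/except is one way to handle non-conforming lines:

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        try:
            page = line.split()[6]   # the requested path in Common Log Format
        except IndexError:
            continue                 # skip lines that don't conform
        print '%s\t1' % page

The matching reducer can sum the sorted counts per page, like the word-count reducer, and keep only the ten largest totals (for example with heapq.nlargest) before printing.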

71 What questions do you have? 71

72 In the time remaining...
- Import your own data into HDFS for analysis. Your quota is 300GB (after replication) by default.
- Examine some data from Twitter: /software/workshop/twitter_data (3.8 million tweets + metadata, ~11 GB)
- Continue to work with the workshop data sets: titanic.txt, household_power_consumption.txt, usask_access_logs
- Contact us to add your user account to the Hadoop test environment (class accounts deactivate later today): guillimin@calculquebec.ca

73 Keep Learning...
Contact us for access to our Hadoop test system. Download a Hadoop virtual machine. View online training materials.
