Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ.


1 Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ.

2 Outline
1. Review of how MapReduce works: HDFS, YARN
   - sorting and shuffling
   - advanced topics: partial sort, total sort, join, chained mapper/reducer, multiple inputs
2. Review of the MapReduce programming paradigm
   - job: default settings
   - checklist of things to set
3. MapReduce streaming API
4. Command-line options, configurations

3 Challenges
General problem in the big data era: how to process a very big volume of data in a reasonable amount of time? It turns out that disk bandwidth has become the bottleneck, i.e., hard disks cannot read data fast enough. Solution: parallel processing.
Google's problem: crawl, analyze and rank web pages into a giant inverted index (to support its search engine). Google engineers went ahead and built their own systems:
Google File System: exabyte-scale data management using commodity hardware
Google MapReduce (GMR): an implementation of a design pattern applied to massively parallel processing

4 Hadoop History
Originally the Yahoo! Nutch project: crawl and index a large number of web pages. Idea: the program is distributed, and each part processes the data stored with it. Two Google papers led to the Hadoop project, an open-source implementation of a distributed file system and the MapReduce framework.
Hadoop: a scheduling and resource management framework for executing map and reduce jobs in a cluster environment. Now an open source project, Apache Hadoop.
Hadoop ecosystem: various tools that make Hadoop easier to use. Hive, Pig: tools that translate a more abstract description of a workload into MapReduce pipelines.

5 MapReduce
End-user MapReduce API: for programming MapReduce applications.
MapReduce framework: the runtime implementation of the various phases, such as the map phase, the sort/shuffle/merge aggregation, and the reduce phase.
MapReduce system: the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.

6 Hadoop: HDFS, MapReduce

7 Hadoop Daemons
Hadoop (HDFS and MapReduce) is a distributed system:
a distributed file system
support for running MapReduce programs in a distributed and parallel fashion
automatic input splitting and shuffling
fault tolerance, load balancing, ...
To support these, several Hadoop daemons (processes running in the background) run:
HDFS: namenode, datanode; MapReduce: jobtracker, resource manager, node manager
These daemons communicate with each other via RPC (Remote Procedure Call) over the SSH protocol, and usually allow users to view their status via a Web interface. Both of these kinds of inter-process communication go over sockets (the network API). Will learn more about this later.

8 HDFS: NameNode & DataNode
namenode: stores filesystem metadata, i.e., which file maps to what block locations and which blocks are stored on which datanode. A secondary namenode regularly connects to the primary namenode and snapshots the filesystem metadata into local/remote storage.
datanode: where the actual data resides.
A datanode stores file blocks and a checksum for each block. It updates the namenode with block information periodically, and verifies the checksums before reporting. If the checksum is incorrect for a particular block, i.e., there is disk-level corruption for that block, the datanode skips that block while reporting block information to the namenode, and the namenode replicates the block somewhere else.
Datanodes send heartbeat messages to the namenode to say that they are alive, so the namenode detects datanode failure and initiates replication of the lost blocks.
Datanodes can talk to each other to rebalance data, move and copy data around, and keep replication high.

9 Hadoop Daemons
Daemon               Default Port   Configuration Parameter
namenode             50070          dfs.http.address
datanode             50075          dfs.datanode.http.address
secondary namenode   50090          dfs.secondary.http.address
You can open a browser to a daemon's HTTP address to view various information about the namenode. Plan: install a text-based Web browser on puppet, so that we can use the web-based user interface.

10 Hadoop 1.x
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
jobtracker: coordinates all jobs run on the system by scheduling tasks to run on tasktrackers.
tasktrackers: run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

11 YARN: Yet Another Resource Negotiator
Resource management => a global ResourceManager
Per-node resource monitoring => NodeManager
Job scheduling/monitoring => a per-application ApplicationMaster (AM)

12 YARN: Master-slave System
The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. Its pluggable Scheduler allocates resources to the running applications based on their resource requirements, using the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, network, etc.
Per-application ApplicationMaster: negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.

13 WebUI for YARN Daemons
YARN Daemon       Default Port   Configuration Name
ResourceManager   8088           yarn.resourcemanager.webapp.address
NodeManager       8042           yarn.nodemanager.webapp.address
URL to view the status of the ResourceManager: http://<address of RM>:8088

14 Outline
1. Review of how MapReduce works: HDFS, YARN
   - sorting and shuffling
   - advanced topics: partial sort, total sort, join, chained mapper/reducer, multiple inputs
2. Review of the MapReduce programming paradigm
   - job: default settings
   - checklist of things to set
3. MapReduce streaming API
4. Command-line options, configurations

15 MapReduce
End-user MapReduce API: for programming MapReduce applications.
MapReduce framework: the runtime implementation of the various phases, such as the map phase, the sort/shuffle/merge aggregation, and the reduce phase.
MapReduce system: the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.

16 MapReduce Programming Model
[Diagram: Input, a set of [key,value] pairs -> Split -> map -> intermediate [key,value] pairs -> Shuffle -> [k1,(v11,v12,...)], [k2,(v21,v22,...)] -> reduce -> Output, a set of [key,value] pairs]

17 Parallel Execution: Scaling Out
A MapReduce job is a unit of work that the client/user wants performed, consisting of:
the input data
the MapReduce program
configuration information
The Hadoop system:
divides the job into map and reduce tasks
divides the input into fixed-size pieces called input splits, or just splits
creates one map task for each split, which runs the user-defined map function for each record in the split

18 MapReduce and HDFS
The parallelism of MapReduce, plus the very high aggregate I/O bandwidth across a large cluster provided by HDFS, makes the economics of the system extremely compelling, and this is a key factor in the popularity of Hadoop.
Key: lack of data motion, i.e., move compute to the data; do not move data to compute nodes over the network. Specifically, MapReduce tasks can be scheduled on the same physical nodes on which the data resides in HDFS, which exposes the underlying storage layout across the cluster.
Benefits: reduces network I/O and keeps most of the I/O on local disk or within the same rack.

19 Hadoop 1.x
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
jobtracker: coordinates all jobs run on the system by scheduling tasks to run on tasktrackers.
tasktrackers: run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

20 YARN: Yet Another Resource Negotiator
Resource management => a global ResourceManager
Per-node resource monitoring => NodeManager
Job scheduling/monitoring => a per-application ApplicationMaster (AM)
Hadoop daemons are Java processes, running in the background, talking to each other via RPC over the SSH protocol.

21 YARN: Master-slave System
The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. Its pluggable Scheduler allocates resources to the running applications based on their resource requirements, using the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, network, etc.
Per-application ApplicationMaster: negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.

22 Outline
1. Review of how MapReduce works: HDFS, YARN
   - sorting and shuffling
   - advanced topics: partial sort, total sort, join, chained mapper/reducer, multiple inputs
2. Review of the MapReduce programming paradigm
   - job: default settings
   - checklist of things to set
3. MapReduce streaming API
4. Command-line options, configurations

23 MapReduce Programming
[Diagram: input from a file (text, binary) or database -> InputFormat -> Split -> map -> Partition (hash key to reducer task) -> Shuffle -> reduce ([k,(v1,v2,...)]) -> OutputFormat. Input: a set of [key,value] pairs; intermediate: [key,value] pairs grouped as [k,(v1,v2,...)]; Output: a set of [key,value] pairs.]

24 Hadoop Streaming API
A generic API for the MapReduce framework: mappers/reducers can be written in any language, or even be Unix commands.
Mappers/reducers act as filters: they receive input on stdin and write output to stdout.
For text processing: each <key,value> pair takes one line, with the key and value separated by the tab character. The mapper/reducer reads each line (a <key,value> pair) from stdin, processes it, and writes a line (a <key,value> pair) to stdout.

25 Usage of InputFormat
An InputFormat is responsible for creating input splits and dividing them into records:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
The default input format is TextInputFormat, which produces one (key, value) pair for each line of a text file. We will later look at a customized InputFormat class.
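For instance, a driver can override the default input format (a minimal sketch, assuming the new-API classes in org.apache.hadoop.mapreduce.lib.input):
// If nothing is set, TextInputFormat is used (key = byte offset, value = line).
// KeyValueTextInputFormat instead splits each line on the first tab character.
job.setInputFormatClass(KeyValueTextInputFormat.class);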

26 InputSplit
An input split is a chunk of the input that is processed by a single map. Each split is divided into records, and the map processes each record, a key-value pair, in turn.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
An InputSplit has a length in bytes and a set of storage locations (hostname strings).

27 Starting map tasks
1. The client running the job calculates the splits for the job by calling getSplits().
2. The client sends the splits to the jobtracker/resource manager, which uses their storage locations to schedule map tasks to process them.
3. Each map task passes its split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records; the map task uses it to generate key-value pairs, which it passes to the map function.
Default implementation of the Mapper class's run function:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

28 Split and Record
Sometimes a record spans two blocks/input splits. Example: map task #1, located on the same node as the first block/split of the file, needs to perform a remote read to obtain record 5 (which spans two blocks).

29 InputFormat Hierarchy

30 FileInputFormat
The parent class for all InputFormats that read from files.
The input to a job is a collection of paths:
void addInputPath( )
void addInputPaths( )
void setInputPaths( )
void setInputPaths( )

31 Split Size
FileInputFormat splits only large files, where large means larger than an HDFS block.
By default, minimumSize < blockSize < maximumSize.
Formula for split size: max(minimumSize, min(maximumSize, blockSize))
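This formula is the same rule FileInputFormat applies internally; a minimal sketch of the computation in plain Java:
// max(minimumSize, min(maximumSize, blockSize))
static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
// Example: blockSize = 128 MB, minSize = 1, maxSize = 20000
// => split size = 20000 bytes, so even a file smaller than one block is split,
// which is what the demo below exploits.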

32 Demo: How to change the maximum split size so that a file smaller than the block size is split?
~/hadoop_samplecodes/citibike/shellscript/RunDefaultJob.sh -Dmapred.max.split.size=20000
Try: ./RunDefaultJob.sh >& dd; grep csv dd | wc -l  => 655 map tasks
... INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(245)) - Total input paths to process : ...
... INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(371)) - number of splits: 655
... INFO [Thread-2] mapred.Merger (Merger.java:merge(568)) - Merging 655 sorted segments

33 Demo: How to configure the number of reducers?
Does not work in local mode; see the LocalJobRunner.java code: it sequentially runs the map tasks, and then starts one reduce task.
On the command line: -Dmapred.reduce.tasks=2
In code: job.setNumReduceTasks(2);
~/hadoop_samplecodes/citibike/shellscript/RunDefaultJob.sh -Dmapred.max.split.size=20000
RunDefaultJob_pseudo.sh

34 Shuffling
[Diagram: the same pipeline as slide 23: input from a file (text, binary) or database -> InputFormat -> Split -> map -> Partition (hash key to reducer task) -> Shuffle -> reduce ([k,(v1,v2,...)]) -> OutputFormat -> output, a set of [key,value] pairs.]

35 Which reduce task?
For each intermediate key-value pair (K2, V2), which reduce task does it go to? Partition the whole domain of K2 into multiple partitions, and have each reduce task process one partition.
The partition function operates on the intermediate key and value types (K2 and V2) and returns a partition index. In practice, the partition is determined solely by the key (the value is ignored).
The default partitioner is HashPartitioner:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
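As an illustration, a custom partitioner can route records by a field of the key instead of its full hash; a sketch (YearPartitioner and the year-prefixed key format are hypothetical, not from the slides):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical: keys look like "1990-0123"; partition on the 4-digit year prefix
// so that all records of one year go to the same reduce task.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String year = key.toString().substring(0, 4);
        return (year.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// In the driver: job.setPartitionerClass(YearPartitioner.class);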

36 ChainMapper class
The ChainMapper class lets you use multiple Mapper classes within a single map task:
mapper1 => mapper2 => mapper3 => lastMapper
The output of the first becomes the input of the second, and so on until the last mapper; the output of the last mapper is written to the task's output.
Benefits:
Modularity (simple and reusable specialized mappers)
Composability (mappers can be combined to perform composite operations)
Reduced disk I/O, compared to multiple chained MapReduce jobs

37 A word count job
Job job = Job.getInstance();
Configuration splitMapConfig = new Configuration(false);
ChainMapper.addMapper(job, SplitMapper.class, LongWritable.class, Text.class,
    Text.class, IntWritable.class, splitMapConfig);
Configuration lowerCaseMapConfig = new Configuration(false);
ChainMapper.addMapper(job, LowerCaseMapper.class, Text.class, IntWritable.class,
    Text.class, IntWritable.class, lowerCaseMapConfig);
job.setJarByClass(ChainMapperDriver.class);
job.setCombinerClass(ChainMapReducer.class);
job.setReducerClass(ChainMapReducer.class);

38 Forming a chain of mappers
public static <K1,V1,K2,V2> void addMapper(JobConf job,
    Class<? extends Mapper<K1,V1,K2,V2>> klass,
    Class<? extends K1> inputKeyClass,
    Class<? extends V1> inputValueClass,
    Class<? extends K2> outputKeyClass,
    Class<? extends V2> outputValueClass,
    boolean byValue,
    JobConf mapperConf)
Adds a Mapper class to the chain job's JobConf.
byValue indicates whether keys/values should be passed by value to the next Mapper in the chain (if any) or by reference. If a Mapper relies on the assumed semantics that the key and values are not modified by the collector, 'by value' must be used. If the Mapper does not expect these semantics, 'by reference' can be used as an optimization to avoid serialization and deserialization.
IMPORTANT: There is no need to specify the output key/value classes for the ChainMapper; this is done by addMapper for the last mapper in the chain.

39 SequenceFile
SequenceFile provides a persistent data structure for binary key-value pairs.
The keys and values stored in a SequenceFile do not necessarily need to be Writable; any types that can be serialized and deserialized by a Serialization may be used (converting an object/value to/from a byte stream).
In contrast, the default TextOutputFormat writes keys and values by calling the toString() method on them, converting each object/value to a stream of text (e.g., ASCII).
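A minimal write sketch, assuming the standard org.apache.hadoop.io.SequenceFile API (the output path and key/value types here are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("numbers.seq"); // illustrative output path
        // Writer options name the file and the key/value classes recorded in the header.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("value-" + i)); // one binary pair
            }
        }
    }
}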

40 MapFile
A file-based map from keys to values. A map is a directory containing two files:
a data file, containing all the keys and values in the map
an index file, containing a fraction of the keys; the fraction is determined by MapFile.Writer.getIndexInterval()
The index file is read entirely into memory, so key implementations should try to keep themselves small. This allows quick lookup of records.
Exercise: run the MapReduce job that uses the MapFileOutputFormat class and examine the output directory.
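After the exercise, a lookup against the resulting MapFile directory might look like this (a sketch; the path and key/value types are assumptions):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileLookupDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The directory holds the 'data' and 'index' files described above.
        MapFile.Reader reader = new MapFile.Reader(new Path("output/part-r-00000"), conf);
        Text value = new Text();
        // get() binary-searches the in-memory index, then scans forward in the
        // data file from the nearest indexed key; returns null if the key is absent.
        Writable found = reader.get(new IntWritable(42), value);
        System.out.println(found == null ? "not found" : value.toString());
        reader.close();
    }
}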

41 Sorting in the MapReduce Framework

42 All about sorting
Partial sort comes for free (see previous slides): the output from map is partitioned, sorted, and merged before reduce, so each partition is sorted, i.e., each reduce task's output is sorted.
To sort globally (total sort), either:
use one partition, i.e., one reduce task; or
use a customized partitioner class

43 All about sorting
Total sort: the outputs of the reduce tasks can be concatenated together to get one sorted output.
Idea: use the TotalOrderPartitioner class, where if k1 < k2 then partition(k1) <= partition(k2).
How to make sure the partitioning is balanced, to minimize running time? Use an InputSampler to sample the input and get an estimated distribution of the keys.
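A driver-side sketch of this idea with the new-API TotalOrderPartitioner and InputSampler (the Text key/value types, partition-file path, and sampler parameters are assumptions):
// Partition by sampled key ranges instead of key hashes.
job.setPartitionerClass(TotalOrderPartitioner.class);
// Cut-point file read by the partitioner at runtime (path illustrative).
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("_partitions"));
// Sample ~10% of records, up to 10000 samples from at most 10 splits.
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<>(0.1, 10000, 10);
// Samples the job's input and writes the cut points to the partition file.
InputSampler.writePartitionFile(job, sampler);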

44 Secondary sort
Goal: sort output by year, and then within each year, sort by temperature.
You could do this by writing a reduce class that sorts the values itself: intermediate key-value pairs arrive grouped as reduce(k, <v1, v2, ...>), but v1, v2, ... are not sorted.
Or we can again take advantage of the MapReduce framework (how it already partitions, sorts, and groups data for us).

45 How?
Goal: sort output by year, and then within each year, sort by temperature.
Plan:
Use year and temperature together as the key (a composite key).
Partition and group based on the year only, so that records of the same year are sent to the same reduce task and grouped together in one list.
Sort based on the composite key (year and temperature), so that records within the same group are ordered by temperature.

46 Details
PartitionerClass: a FirstPartitioner class, which uses only the first part (e.g., year) of the composite key.
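A sketch of such a partitioner (IntPair is a hypothetical WritableComparable holding the (year, temperature) composite key; the value type is assumed to be NullWritable):
// Partition on the year only, so every record of a year lands in the same reduce task.
public static class FirstPartitioner extends Partitioner<IntPair, NullWritable> {
    @Override
    public int getPartition(IntPair key, NullWritable value, int numPartitions) {
        // Multiply by a prime to spread consecutive years across partitions.
        return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}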

47 Details
setGroupingComparatorClass: defines the comparator that controls which keys are grouped together for a single call to the Reducer.reduce function. Use a GroupComparator which compares just the first part of the key (e.g., year).
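A matching sketch, with the same hypothetical IntPair key:
// Group keys by year only: one reduce() call then sees all values for a year.
public static class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super(IntPair.class, true); // true: instantiate keys so fields can be compared
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair p1 = (IntPair) w1, p2 = (IntPair) w2;
        return Integer.compare(p1.getFirst(), p2.getFirst()); // year only
    }
}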

48 Details
setSortComparatorClass: defines the comparator that controls how the keys are sorted before they are passed to the Reducer. We use a KeyComparator class that sorts by the first part of the composite key (e.g., year), and then by the second part (e.g., temperature).
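A sketch of this comparator, plus the driver wiring that ties the three pieces together (IntPair remains hypothetical):
// Sort by year ascending, then by temperature descending within a year.
public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
        super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair p1 = (IntPair) w1, p2 = (IntPair) w2;
        int cmp = Integer.compare(p1.getFirst(), p2.getFirst());
        if (cmp != 0) return cmp;
        return -Integer.compare(p1.getSecond(), p2.getSecond()); // negate => descending
    }
}
// Driver wiring:
job.setPartitionerClass(FirstPartitioner.class);
job.setSortComparatorClass(KeyComparator.class);
job.setGroupingComparatorClass(GroupComparator.class);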

49 Outline
MapReduce: how to control the number of tasks
the InputFormat class decides the number of splits => the number of map tasks
the number of reduce tasks is configured by the client
ChainMapper: modular design
Input processing: XML file processing, whole file as a record
Binary output & sorting
Join

50 Join
Combine two datasets together using a key; here, use StationID.

51 An example of reduce-side join
Multiple inputs, of different formats: e.g., one input holds station records, another weather data.
Mapper classes: tag each record with a composite key, e.g., station_id-0 for a station record, station_id-1 for a weather record.
Secondary sort: use the first part of the composite key to partition and group.
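A sketch of the tagging side of such a join (TextPair is a hypothetical composite-key writable, and the tab-separated record layout is an assumption):
// Wire two inputs, each with its own tagging mapper:
MultipleInputs.addInputPath(job, stationPath, TextInputFormat.class, StationMapper.class);
MultipleInputs.addInputPath(job, weatherPath, TextInputFormat.class, WeatherMapper.class);

public static class StationMapper extends Mapper<LongWritable, Text, TextPair, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String stationId = value.toString().split("\t")[0]; // assumed record layout
        // Tag "0" so station records sort ahead of weather records ("1") per station.
        context.write(new TextPair(stationId, "0"), value);
    }
}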

52 Commonly Used Mapper/Reducer

53 Debugging MapReduce: job and task logs

54 User-level logs

55 MapReduce Programming
[Diagram: the same pipeline as slide 23: input from a file (text, binary) or database -> InputFormat -> Split -> map -> Partition (hash key to reducer task) -> Shuffle -> reduce ([k,(v1,v2,...)]) -> OutputFormat -> output, a set of [key,value] pairs.]

56 What happens?
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records, and a given input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit; finally cleanup(Context) is called.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control sorting and grouping by specifying two key RawComparator classes.
Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer.
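For example, a word count job can reuse a sum reducer as its combiner; a one-line sketch (IntSumReducer stands in for any associative and commutative reducer):
// Local aggregation on the map side: partial counts are summed before the
// shuffle, shrinking the intermediate data sent to reducers.
job.setCombinerClass(IntSumReducer.class);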

57 MapReduce Programming
Map: input key/value types (can be int, long, text, ...); output key/value types (can be int, long, text, ...)
Reduce: input key/value types (same as the mapper's output key/value types); output key/value types (can be int, long, text, ...)

58 A MapReduce Job
Class Job: org.apache.hadoop.mapreduce.Job
All implemented interfaces: JobContext, org.apache.hadoop.mapreduce.MRJobConfig
public class Job
    extends org.apache.hadoop.mapreduce.task.JobContextImpl
    implements JobContext
The job submitter's view of the job. It allows the user to:
configure the job, using the set***** methods, which work only until the job is submitted; afterwards they throw an IllegalStateException
submit it, control its execution, and query its state
Normally the user creates the application, describes the various facets of the job via Job, and then submits the job and monitors its progress.

59 A MapReduce Job
Here is an example of how to submit a job:
// Create a new Job
Job job = Job.getInstance();
job.setJarByClass(MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);

60 The Mapper class's KEYIN must be consistent with the InputFormat class; the Mapper class's KEYOUT must be consistent with the map output key class.

61 A MinimalMapReduce Job
Try to run the MinimalMapReduce job; compare it with the WithDefaults job.

62 Job: default settings

63 Default Streaming Job
More stripped-down streaming job:
Equivalently,

64 Usage of streaming separators


66 Configuration class
Hadoop loads the hdfs-default.xml file from the classpath resources, which the jar supplies; this provides the "default" values for several configs. After this, it loads hdfs-site.xml from the classpath (which mostly resides in the /etc/hadoop/conf/ directory) and applies its overrides onto the default config object.
~]$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*
The default XML files reside inside the hadoop-common and hadoop-hdfs jars. You do not need these files explicitly on the classpath, since they are read from the jars themselves, and they should never be modified.
Reference online: core-default.xml and hdfs-default.xml are published in the Apache Hadoop documentation.
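A minimal sketch of this layering with the Configuration API (the property name and fallback value are illustrative):
import org.apache.hadoop.conf.Configuration;

// new Configuration() loads core-default.xml, then core-site.xml, from the classpath.
Configuration conf = new Configuration();
conf.addResource("hdfs-site.xml");  // a later resource overrides matching defaults
// get() returns the effective value, or the supplied fallback if the key is unset.
String replication = conf.get("dfs.replication", "3");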
