Introduction to Map/Reduce & Hadoop


1 Introduction to Map/Reduce & Hadoop V. CHRISTOPHIDES Department of Computer Science University of Crete 1 What is MapReduce? MapReduce: a programming model and associated implementation for batch processing of large data sets MapReduce could be implemented on different architectures, but Google proposed it for large clusters of commodity PCs Functional programming meets distributed computing: Automatic parallelization & distribution A clean abstraction for programmers, factoring out many reliability concerns from application logic Fault tolerance Status and monitoring tools MapReduce implementations such as Hadoop differ in details, but the main principles are the same 2

2 Overview Clever abstraction that is a good fit for many real-world problems Programmer focuses on the algorithm itself Runtime system takes care of all the messy details: partitioning of input data, scheduling program execution, handling machine failures, managing inter-machine communication Divide and Conquer 3 MapReduce Implementations Google has a proprietary implementation in C++, with bindings in Java and Python Hadoop is an open-source implementation in Java Development led by Yahoo!, now an Apache project Used in production at Yahoo, Facebook, Twitter, LinkedIn, Netflix, but also A9.com, AOL, The New York Times, Last.fm, Baidu.com, Joost, Veoh, etc. The de facto big data processing platform Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc.

3 Hadoop: Storage & Compute on One Platform (figure from "Evolution from Apache Hadoop to the Enterprise Data Hub", A. Awadallah, Co-Founder & CTO of Cloudera, SMDB 2014) 5 Expanding Data Requires A New Approach (figure from the same talk) 6

4 Map Reduce Foundations 7 Map and Reduce: The Origins The programming ideas of Map and Reduce are 40+ years old Present in all Functional Programming Languages, e.g., APL, Lisp and ML Alternate names for Map: Apply-All Higher-order functions take function definitions as arguments, or return a function as output Map and Reduce are higher-order functions Map processes each record individually Reduce processes (combines) the set of all records in a batch 8

5 Map: A Higher Order Function F(x: int) returns r: int Let V be an array of integers W = map(F, V) W[i] = F(V[i]) for all i i.e., apply F to every element of V Examples in Haskell map (+1) [1,2,3,4,5] == [2, 3, 4, 5, 6] map toLower "ABCDEFG12!@#" == "abcdefg12!@#" map (`mod` 3) [1..10] == [1, 2, 0, 1, 2, 0, 1, 2, 0, 1] 9 reduce: A Higher Order Function reduce is also known as fold, accumulate, compress or inject Reduce/fold takes a function and folds it in between the elements of a list

6 Fold-Left in Haskell Definition foldl f z [] = z foldl f z (x:xs) = foldl f (f z x) xs Examples foldl (+) 0 [1..5] == 15 foldl (+) 10 [1..5] == 25 foldl (div) 7 [34,56,12,4,23] == 0 11 Fold-Right in Haskell Definition foldr f z [] = z foldr f z (x:xs) = f x (foldr f z xs) Example foldr (div) 7 [34,56,12,4,23] == 8
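Since the rest of these slides use Java for Hadoop programs, here is a minimal sketch (ours, not from the original slides) of the same two higher-order functions with the java.util.stream API; class and variable names are illustrative:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> v = Arrays.asList(1, 2, 3, 4, 5);
        // map: apply (+1) to every element, cf. map (+1) [1,2,3,4,5]
        List<Integer> w = v.stream().map(x -> x + 1).collect(Collectors.toList());
        // fold-left: seed 0, fold (+) between the elements, cf. foldl (+) 0 [1..5]
        int sum = v.stream().reduce(0, (z, x) -> z + x);
        System.out.println(w + " " + sum);  // prints [2, 3, 4, 5, 6] 15
    }
}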

7 Examples of Map Reduce Computation 13

8 What can I do in MapReduce? Three main functions: ❶ Querying Filtering (distributed grep, etc.) Relational-based (join, selection, projection, etc.) ❷ Summarizing Computing Aggregates (word/record count, Min/Max/Average/Median/Standard deviation, etc.) Data Organization (sort, indexing, etc.) ❸ Analyzing Iterative Message Passing (graph processing)... large datasets in offline mode for boosting other on-line processes 15 Grep Example (diagram: the input data is split; grep runs on each split; the per-split matches are concatenated by cat into all matches) Search input files for a given pattern: e.g., given a list of tweets (username, date, text), determine the tweets that contain a word Map: emits a (filename, line) pair if the pattern is matched Reduce: copies results to output
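As an illustration (ours, not part of the original slides), a minimal sketch of such a grep Map function with the new org.apache.hadoop.mapreduce API; the configuration key grep.pattern and the class name are hypothetical, and an identity reducer would simply copy the matches to the output:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String pattern;

    @Override
    protected void setup(Context context) {
        // The pattern is assumed to be passed in the job configuration
        pattern = context.getConfiguration().get("grep.pattern");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains(pattern)) {
            // Emit (filename, matching line)
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(filename), line);
        }
    }
}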

9 Word Count Example (diagram: the input data is split; each split is tokenized and counted; the per-split counts are merged) Read text files and count how often words occur The input is text files The output is a text file, each line: word, tab, count Map: produce pairs of (word, count = 1) from files Reduce: for each word, sum up the counts (i.e., fold) 17 Inverted Index Example (this was Google's original use case) (diagram: the input data is split; each split is parsed and inverted; the per-split results are concatenated into the inverted list) Generate an inverted index of words from a given set of files Map: parses a document and emits <word, docId> pairs Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
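A minimal sketch of the inverted index job (ours, for illustration; class names are invented and the docId is taken to be the input file name):

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
    // parse: tokenize the document and emit <word, docId> pairs
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                context.write(new Text(tok.nextToken()), new Text(docId));
            }
        }
    }

    // inverse: collect, sort, and deduplicate the docIds for each word
    public static class InverseReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> sorted = new TreeSet<String>();
            for (Text d : docIds) {
                sorted.add(d.toString());
            }
            context.write(word, new Text(sorted.toString()));  // <word, list(docId)>
        }
    }
}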

10 Map Reduce (diagram: Input data → MAP → Partitioning Function → REDUCE → Result) Map: accepts an input key/value pair, emits intermediate key/value pairs Reduce: accepts an intermediate key/value* pair, emits output key/value pairs 19 Real World Applications in MapReduce
Google: wide-range applications (grep/sorting, machine learning, clustering, report extraction, graph computation)
Yahoo: data model training, Web map construction, Web log processing using Pig, and much, much more
Amazon: build product search indices
Facebook: Web log processing via both MapReduce and Hive
PowerSet (Microsoft): HBase for natural language search
Twitter: Web log processing using Pig
New York Times: large-scale image conversion
Others (>74): details in (so far, the longest list of applications for MapReduce) 20

11 MapReduce Principle applied to BigData 21 Adapt MapReduce for BigData Always maps/reduces on lists of key/value pairs Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs map() produces one or more intermediate values along with an output key from the input After the map phase is over, all the intermediate values for a given output key are combined together into a list reduce() combines those intermediate values into one or more final values for that same output key (in practice, only one final value per key) 22

12 Adapt MapReduce for BigData Automatic Parallelization: Depending on the size of the RAW INPUT DATA, instantiate multiple map() tasks Depending upon the number of intermediate <key, value> partitions, instantiate multiple reduce() tasks The master program divvies up tasks based on data location: it tries to have map() tasks on the same machine as the physical file data, or at least the same rack 23 Adapt MapReduce for BigData Map/Reduce tasks execute in parallel on a cluster Fault tolerance is built into the framework Specific systems/implementation aspects matter How is data partitioned as input to map How is data serialized between processes Cloud-specific improvements: Handle elasticity Take cluster topology (e.g., node proximity, node size) into account Source: Google Developers 24

13 Execution on Clusters ❶ Input files split (M splits) ❷ Assign Master & Workers ❸ Map tasks ❹ Writing intermediate data to disk (R regions) ❺ Intermediate data read & sort ❻ Reduce tasks ❼ Return 25 Execution Overview Master assigns map and reduce tasks to workers, taking data location into account Mapper reads an assigned file split and writes intermediate key-value pairs to local disk Mapper informs master about result locations, which in turn informs the reducers Reducers pull data from the appropriate mapper disk locations After the map phase is completed, reducers sort their data by key For each key, the Reduce function is executed and the output is appended to the final output file When all reduce tasks are completed, the master wakes up the user program 26

14 MapReduce Data Flow in Hadoop with no Reduce Tasks 27 MapReduce Data Flow in Hadoop with a Single Reduce Task (diagram: map transforms (K1, V1) into (K2, V2); the shuffle groups these into (K2, List<V2>); reduce emits (K3, V3)) 28

15 MapReduce Data Flow in Hadoop with Multiple Reduce Tasks 29 MapReduce Programming Constructs in Hadoop Programmers must specify: map (k, v) → <k', v'>* (1:many) reduce (k', v'*) → <k', v''>* All values with the same key are reduced together Optionally, also: partition (k', number of partitions) → partition for k' Often a simple hash of the key, e.g., hash(k') mod n Divides up key space for parallel reduce operations combine (k', v') → <k', v'>* Mini-reducers that run in memory after the map phase Used as an optimization to reduce network traffic 30

16 The Execution Framework Handles Everything Else Scheduling: assigns workers to map and reduce tasks Data distribution: moves processes to data Synchronization: gathers, sorts, and shuffles intermediate data Errors and faults: detects worker failures and restarts Limited control over data and execution flow All algorithms must be expressed in m, r, c, p You don't know: Where mappers and reducers run When a mapper or reducer begins or finishes Which input a particular mapper is processing Which intermediate key a particular reducer is processing 31 Execution (diagram: Map Phase, Shuffle Phase, Reduce Phase) 32

17 Automatic Parallel Execution in MapReduce (Google) (diagram: master task dispatching work) Usually many more map tasks than machines, e.g., 200K map tasks, 5K reduce tasks, 2K machines 33 Map/Reduce Execution in Clusters Several map or reduce tasks can run on a single computer Each intermediate file is divided into R partitions by the partitioning function Each reduce task corresponds to one partition 34

18 Moving Data From Mappers to Reducers Mapper outputs need to be separated for different reducers Reducers need to collect their data from all mappers and group it by key Keys at each reducer are processed in order! The shuffle & sort phase = a synchronization barrier between the map & reduce phases Collects the output from all map executions Transforms the map output into the reduce input Divides the map output into chunks Often one of the most expensive parts of a MapReduce execution 35 Shuffle and Sort in Map Reduce Map output is spilled to a new disk file when the in-memory buffer is almost full Spill files are merged into a single output file Spill files on disk: partitioned by reduce task, each partition sorted by key Merge happens in memory if the data fits, otherwise also on disk A reduce task starts copying data from a map task as soon as it completes Reduce cannot start working on the data until all mappers have finished and their data has arrived 36

19 Shuffle and Sort in Hadoop Probably the most complex aspect of MapReduce! Map side Map outputs are buffered in memory in a circular buffer When the buffer reaches a threshold, its contents are spilled to disk Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs here Reduce side First, map outputs are copied over to the reducer machine Sort is a multi-pass merge of map outputs (happens in memory and on disk): the combiner runs here The final merge pass goes directly into the reducer 37 Shuffle and Sort in Hadoop (diagram: the Mapper writes into a circular buffer in memory; spills on disk, with the Combiner applied, are merged into partitioned intermediate files on disk, which are served to this and other reducers, from this and other mappers) 38

20 Tools for Synchronization Cleverly-constructed data structures: bring partial results together Sort order of intermediate keys: controls the order in which reducers process keys Partitioner: controls which reducer processes which keys Preserving state in mappers and reducers: captures dependencies across multiple keys and values 39 Master Data Structures Master keeps track of the status of each map and reduce task and who is working on it Idle, in-progress, or completed Master stores the location and size of the output of each completed map task Pushes information incrementally to workers with in-progress reduce tasks 40

21 Semantics with Failures If map and reduce are deterministic, then the output is identical to a non-faulting sequential execution For non-deterministic operators, different reduce tasks might see the output of different map executions Relies on atomic commit of map and reduce outputs An in-progress task writes its output to a private temp file Mapper: on completion, send names of all temp files to master (master ignores if task already complete) Reducer: on completion, atomically rename temp file to final output file (needs to be supported by the distributed file system) 41 Fault Recovery Workers are commodity (low cost) machines, so failures are frequent: given n workers, the probability of at least one "problem" is 1 - (1-p)^n Workers are pinged by the master periodically Non-responsive workers are marked as failed All tasks in-progress or completed by a failed worker become eligible for rescheduling (reset to idle state) Can be assigned to another mapper Completed tasks are re-executed since their results are stored on the mapper's local disk Reducers are notified about the mapper failure, so that they can read the data from the replacement mapper 42
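To make the formula concrete (illustrative numbers, not from the slides): assuming a p = 0.1% chance of a problem per worker, a cluster of n = 1,000 workers sees at least one problem with probability 1 - 0.999^1000 ≈ 1 - e^(-1) ≈ 63%, so the framework must treat failure as the normal case rather than the exception.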

22 Fault Recovery Failed reducers are identified through pings as well A reducer's in-progress tasks are reset to idle state Can be assigned to another reducer No need to restart completed reduce tasks, because their results are written to the distributed file system Current implementations abort the MapReduce computation on master failure Users re-submit aborted jobs when a new master process is up Alternative: the master writes periodic checkpoints of its data structures so that it can be re-started from the checkpointed state 43 Machines Share Roles So far, a logical view of the cluster In reality Each cluster machine stores data And runs MapReduce workers Lots of storage + compute cycles nearby 44

23 The MapReduce Runtime Handles scheduling Assigns workers to map and reduce tasks Handles data distribution Moves processes to data Handles synchronization Gathers, sorts, and shuffles intermediate data Handles errors and faults Detects worker failures and restarts Everything happens on top of a distributed filesystem 45 Implementing in MapReduce Externally, for the user Write a map function and a reduce function Submit a job; wait for the result No need to know anything about the environment Google: 4000 servers (and their disks), many failures Internally, for the MapReduce system designer Run map in parallel Shuffle: combine map results to produce the reduce input Run reduce in parallel Deal with failures 46

24 Component Overview (diagram) 47 Peta-Bytes Data Processing (diagram) 48

25 Google's MapReduce inspired Yahoo's Hadoop Now an open source (Java) project of Apache, hadoop.apache.org We will use the release of November 2014 Distributes processing of petabytes of data across thousands of commodity machines In 2008, about 100k MapReduce jobs per day in Google over 20 petabytes of data processed per day each job occupies about 400 servers Easy to use, since run-time complexity is hidden from the users 49 Hadoop Key Components MapReduce - distributes applications Hadoop Distributed File System (HDFS) - distributes data Store big files across machines Store each file as a sequence of blocks Blocks of a file are replicated for fault tolerance 50

26 MapReduce vs. Hadoop (table):
Org: Google vs. Yahoo/Apache
Impl: C++ vs. Java
Distributed File Sys: GFS vs. HDFS
Database: Bigtable vs. HBase
Distributed Lock Mgr: Chubby vs. ZooKeeper
History: Dec 2004 Google GFS paper published; July 2005 Nutch uses MapReduce; Jan 2006 Doug Cutting joins Yahoo!; Feb 2006 Becomes Lucene subproject; Apr 2007 Yahoo! on 1000-node cluster; Jan 2008 An Apache Top Level Project; Feb 2008 Yahoo! production search index 51 Google's Problem in 2003: Already Lots of Data Example: 20+ billion web pages x 20KB = 400+ terabytes One computer can read ~40 MB/sec from disk ~four months to read the web ~1,000 hard drives just to store the web Even more to do something with the data: process crawled documents process web request logs build inverted indices construct graph representations of web documents 52

27 The Need: Data & Programs Distribution! A Distributed System: Scalable Fault-Tolerant Easy to Program Applicable to Many Problems Re-discovered by Google with goals: "Reliability has to come from the software" "How can we make it easy to write distributed programs?" 53 How do we get Data to the Workers? (diagram: compute nodes on one side, NAS/SAN storage on the other) What's the problem here? 54

28 Distributed File System Don't move data to workers Move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data local Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is good A distributed file system is the answer GFS (Google File System) HDFS for Hadoop (= GFS clone) 55 Distributed File System Principles Single Namespace for the entire cluster Data Coherency Write-once-read-many access model Clients can only append to existing files Files are broken up into blocks Typically 128 MB block size Each block replicated on multiple DataNodes Intelligent Client Client can find the location of blocks Client accesses data directly from the DataNode 56
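As a quick illustration of these numbers (our arithmetic, assuming HDFS's common default of 3 replicas): a 1 GB file with 128 MB blocks is broken into 8 blocks, and with 3-way replication the cluster stores 24 block replicas spread across DataNodes.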

29 Hadoop Supported File Systems HDFS: Hadoop's own file system Amazon S3 file system Targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure Not rack-aware CloudStore, previously Kosmos Distributed File System Like HDFS, this is rack-aware: an optimization which takes into account the geographic clustering of servers; network traffic between servers in different geographic clusters is minimized FTP Filesystem stored on remote FTP servers Read-only HTTP and HTTPS file systems 57 Hadoop Distributed File System (HDFS) Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle hardware failure Detects failures and recovers from them Optimized for Batch Processing Data locations exposed so that computations can move to where data resides Provides very high aggregate bandwidth User Space, runs on top of the file systems of the underlying OS 58

30 Hadoop Distributed File System (HDFS) Master (NameNode) manages the file system and access to files by clients Runs on the master node of the HDFS cluster Directs DataNodes to perform their low-level I/O tasks Slave (DataNode) manages the storage attached to the node it runs on Runs on each slave machine in the HDFS cluster Does the low-level I/O work Files are stored as many blocks Each block has a block id A block id is associated with several nodes hostname:port (depending on the level of replication) 59 HDFS Architecture (diagram: a Client talks to the NameNode and to DataNodes; a Secondary NameNode tracks cluster membership) NameNode: maps a file to a file-id and a list of DataNodes DataNode: maps a block-id to a physical location on disk SecondaryNameNode: periodic merge of the transaction log 60

31 NameNode Metadata Meta-data in Memory The entire metadata is in main memory No demand paging of meta-data Types of Metadata List of files List of Blocks for each file List of DataNodes for each block File attributes, e.g., creation time, replication factor A Transaction Log, stored in multiple directories (on a local/remote FS) Records file creations, file deletions, etc. 61 DataNode A Block Server Stores data in the local file system (e.g., ext3) Stores meta-data of a block (e.g., CRC) Serves data and meta-data to Clients Block Report Periodically sends a report of all existing blocks to the NameNode Facilitates Pipelining of Data Forwards data to other specified DataNodes 62

32 Block Placement Current Strategy One replica on the local node Second replica on a remote rack Third replica on the same remote rack Additional replicas are randomly placed Clients read from the nearest replica Would like to make this policy pluggable 63 Data Correctness Use Checksums to validate data Use CRC32 File Creation Client computes a checksum per 512 bytes DataNode stores the checksum File access Client retrieves the data and checksum from the DataNode If validation fails, the Client tries other replicas 64
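A minimal sketch of this per-chunk checksumming idea in plain Java (ours, for illustration; HDFS's real implementation differs in detail):

import java.util.zip.CRC32;

public class ChunkChecksum {
    private static final int CHUNK = 512;  // bytes covered by one checksum

    // Compute one CRC32 value per 512-byte chunk, as the client does on file creation
    public static long[] checksums(byte[] data) {
        int chunks = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }
}

On read, the client recomputes the checksums over the retrieved data and compares them against the stored values; a mismatch triggers a retry on another replica.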

33 Primary vs Secondary NameNode NameNode daemon Head interface to the HDFS cluster Records all global metadata A single point of failure Secondary NameNode daemon One per cluster, to monitor the status of HDFS: Not a failover NameNode! Records snapshots of HDFS metadata from the real NameNode to facilitate recovery from failures Can merge update logs in flight Can upload snapshots back to the primary Need to develop a real High Availability (HA) solution 65 Data Pipelining Client retrieves a list of DataNodes on which to place replicas of a block Client writes the block to the first DataNode The first DataNode forwards the data to the next DataNode in the Pipeline When all replicas are written, the Client moves on to write the next block in the file 66

34 Rebalancer Goal: % disk full on DataNodes should be similar Usually run when new DataNodes are added The cluster stays online while the Rebalancer is active The Rebalancer is throttled to avoid network congestion Command line tool 67 Configuring HDFS dfs.name.dir: dir in the namenode to store metadata dfs.data.dir: dir in data nodes to store the data blocks; must be in a local disk partition fs.default.name: URI of the namenode conf/masters: IP address of the master conf/slaves: IP addresses of the slaves (the master should have password-less SSH access to all the nodes) conf/hdfs-site.xml: <property> <name>dfs.name.dir</name> <value>/tmp/hadoop-test/name</value> </property> <property> <name>dfs.data.dir</name> <value>/tmp/hadoop-test/data</value> </property> conf/core-site.xml: <property> <name>fs.default.name</name> <value>hdfs://<namenode-host>:9000/</value> </property> 68

35 User Interface Commands for the HDFS User hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir/myfile.txt Commands for the HDFS Administrator hadoop dfsadmin -report hadoop dfsadmin -decommission datanodename Web Interface WordCount: A Simple Hadoop Example 70

36 Word Count over a Given Set of Web Pages (worked example: inputs "see bob throw" and "see spot run"; map emits see 1, bob 1, throw 1 and see 1, spot 1, run 1; after grouping and summing: bob 1, run 1, see 2, spot 1, throw 1) How can we do word count in parallel? 71 Word Counting with MapReduce (diagram: five documents, Doc1 "Financial, IMF, Economics, Crisis", Doc2 "Financial, IMF, Crisis", Doc3 "Economics, Harry", Doc4 "Financial, Harry, Potter, Film", Doc5 "Crisis, Harry, Potter", are fed to two map tasks M1 and M2, each emitting <term, 1> key-value pairs) Kyuseok Shim (VLDB 2012 TUTORIAL) 72

37 Word Counting with MapReduce (diagram continued: the emitted pairs are grouped per distinct key, e.g., Financial → 1, 1, 1 and Harry → 1, 1, 1, and the reduce tasks sum each list, yielding Financial 3, IMF 2, Economics 2, Crisis 3, Harry 3, Film 1, Potter 2) Before the reduce functions are called, for each distinct key the list of its values is generated Kyuseok Shim (VLDB 2012 TUTORIAL) 73 Basic Hadoop API (the old org.apache.hadoop.mapred API) Mapper void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) void configure(JobConf job) void close() throws IOException Reducer/Combiner void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) void configure(JobConf job) void close() throws IOException Partitioner int getPartition(K2 key, V2 value, int numPartitions) *Note: forthcoming API changes 74

38 WordCount Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
75 WordCount Reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
76

39 Job Configuration The Job object forms the job specification and gives control for running the job Set the mapper and reducer classes to be used Set output key and value classes for the map and reduce functions For reducer: setOutputKeyClass(), setOutputValueClass() For mapper (omit if same as reducer): setMapOutputKeyClass(), setMapOutputValueClass() Can set input types similarly (default is TextInputFormat) Specify the data input path using FileInputFormat.setInputPaths() Can be a single file, a directory (to use all files there), or a file pattern addInputPath() can be called multiple times to add multiple paths Specify the output path using FileOutputFormat.setOutputPath() A single output path, which is a directory for all output files Method waitForCompletion() submits the job and waits for it to finish 77 Job Configuration
public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "WordCount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setCombinerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setNumReduceTasks(2);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    return 0;
}
78

40 Invocation of WordCount
hadoop dfs -mkdir <hdfs-dir>
hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
hadoop jar hadoop-*-examples.jar WordCount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
79 Hadoop Job Tuning Choose an appropriate number of mappers and reducers Define combiners whenever possible Consider Map output compression Optimize the expensive shuffle phase (between mappers and reducers) by setting its tuning parameters Profiling distributed MapReduce jobs is challenging 80

41 Job Configuration Parameters 190+ parameters in Hadoop Do their settings impact performance? What are ways to set these parameters? Defaults -- are they good enough? Best practices -- the best setting can depend on data, job, and cluster properties Automatic setting (not yet in Hadoop) 81 MapReduce Development Steps ❶ Write Map and Reduce functions Create unit tests ❷ Write a driver program to run a job Can run from the IDE with a small data subset for testing If a test fails, use the IDE for debugging Update unit tests and Map/Reduce functions if necessary ❸ Once the program works on a small test set, run it on the full data set If there are problems, update tests and code accordingly ❹ Fine-tune the code, do some profiling 82

42 Mechanics of Programming Hadoop Jobs 83 Anatomy of a Hadoop Job in MR v1.0 A MapReduce program in Hadoop = a Hadoop job Jobs are divided into map and reduce tasks An instance of a running task is called a task attempt Multiple jobs can be composed into a workflow 84

43 Launching a Hadoop Job The Client (i.e., driver program) creates a job, configures it, and submits it to the JobTracker The JobClient computes input splits (on the client end) Job data (jar, configuration XML) are sent to the JobTracker The JobTracker puts the job data in a shared location and enqueues tasks TaskTrackers poll the JobTracker for tasks TaskTrackers launch a task in a separate Java instance Source: 85 Job Launch: JobClient, JobTracker and TaskTracker The Client submits a MapReduce job through JobClient.runJob() // blocks or JobClient.submitJob() // does not block waitForCompletion() submits the job and polls the JobTracker about progress every sec, outputs to console if changed JobClient: determines the proper division of input into InputSplits and sends job data to the master JobTracker server JobTracker: inserts the jar and JobConf (serialized to XML) in a shared location and posts a JobInProgress to its run queue TaskTrackers running on slaves periodically query the JobTracker for work, retrieve the job-specific jar and config, and launch the task in a separate instance of Java (main() is provided by Hadoop) 86

44 Job Launch: Task and TaskRunner TaskTracker.Child.main(): Sets up the child TaskInProgress attempt Reads the XML configuration Connects back to necessary MapReduce components via RPC Uses TaskRunner to launch the user process TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch the Mapper The Task knows ahead of time which InputSplits it should be mapping Calls the Mapper once for each record retrieved from the InputSplit Running the Reducer is much the same 87 Creating the Mapper One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress It exists in a separate process from all other instances of Mapper: no data sharing! public void map(WritableComparable key, Writable value, Context context) 88

45 Data Types in Hadoop Writable: defines a de/serialization protocol; every data type in Hadoop is a Writable WritableComparable: defines a sort order; all keys must be of this type (but not values) IntWritable, LongWritable, Text: concrete classes for different data types SequenceFiles: binary encoding of a sequence of key/value pairs 89 What is Writable? Hadoop defines its own box classes for strings (Text), integers (IntWritable), etc. All values are instances of Writable All keys are instances of WritableComparable Writing for cache coherency: while (more input exists) { myIntermediate = new Intermediate(input); myIntermediate.process(); export outputs; } 90

46 Getting Data to the Mapper (diagram: input files → InputFormat → InputSplits → RecordReaders → Mappers → intermediates) Data sets are specified by InputFormats Defines the input data (e.g., a directory) Identifies partitions of the data that form an InputSplit Factory for RecordReader objects to extract (k, v) records from the input source 91 FileInputFormat and Friends TextInputFormat: treats each \n-terminated line of a file as a value KeyValueTextInputFormat: maps \n-terminated text lines of "k SEP v" SequenceFileInputFormat: binary file of (k, v) pairs with some add'l metadata SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString()) FileInputFormat will read all files out of a specified directory and send them to the mapper Delegates filtering of this file list to a method subclasses may override e.g., create your own XyzFileInputFormat to read *.xyz from a directory list These classes make use of the readFields() method of the specific Writable classes used by your MapReduce pass 92

47 A simple CustomWritable
public class MyWritable implements Writable {
    private int counter;
    private long timestamp;
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }
    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
93 Record Readers Each InputFormat provides its own RecordReader implementation Provides (unused?) capability multiplexing LineRecordReader: reads a line from a text file KeyValueRecordReader: used by KeyValueTextInputFormat 94
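As a usage note (a hypothetical round trip, not from the slides): since DataOutputStream implements DataOutput and DataInputStream implements DataInput, MyWritable can be exercised outside Hadoop, e.g., in a unit test:

// requires the java.io stream classes (ByteArrayOutputStream, DataOutputStream, etc.)
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
MyWritable before = new MyWritable();
before.write(new DataOutputStream(bytes));             // serialize
MyWritable after = MyWritable.read(
    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));  // deserialize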

48 Input Split Size FileInputFormat will divide large files into chunks The exact size is controlled by mapred.min.split.size RecordReaders receive the file, offset, and length of the chunk Custom InputFormat implementations may override the split size, e.g., NeverChunkFile 95 WritableComparator Compares WritableComparable data Will call WritableComparable.compare() Can provide a fast path for serialized data JobConf.setOutputValueGroupingComparator() 96

49 Sending Data To Reducers: The Context Class Both Mapper and Reducer define an inner class called Context which implements the JobContext interface Job also implements JobContext: when you create a new Job, you also set the context for the Mapper and Reducer Some methods of Context: write: generate an output key/value pair progress and setStatus: report progress or set the status of the task getCounter: get access (read/write) to the value of a Counter getConfiguration: return the configuration for the job getCacheFiles: get cache files set in the Configuration 97 Partitioning Which reducer will receive the intermediate output keys and values? (key, value) pairs with the same key end up at the same partition The mappers partition data independently: they never exchange information with one another Hadoop uses an interface called Partitioner to determine which partition a (key, value) pair will go to A single partition refers to all (key, value) pairs which will be sent to a single reduce task: #partitions = #reduce tasks each Reducer can process multiple reduce tasks The Partitioner determines the load balancing of the reducers JobConf sets the Partitioner implementation 98
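A small illustrative sketch (ours, with a hypothetical counter enum) of using these Context methods inside a Mapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    enum Quality { MALFORMED }  // hypothetical counter, visible in the job UI

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            context.getCounter(Quality.MALFORMED).increment(1);  // count bad records
            return;
        }
        context.write(new Text(fields[0]), new IntWritable(1));  // emit a pair
        context.setStatus("processed offset " + key.get());      // report task status
    }
}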

50 The Partitioner interface The Partitioner interface defines the getPartition() method Input: a key, a value and the number of partitions Output: a partition id for the given (key, value) pair The default Partitioner is the HashPartitioner: int getPartition(K key, V value, int numPartitions) { return key.hashCode() % numPartitions; } (Hadoop's actual HashPartitioner masks the sign bit, (key.hashCode() & Integer.MAX_VALUE) % numPartitions, so the id is never negative) (example table with numPartitions = 3: the keys hello, world, map, reduce are mapped via key.hashCode() to a partition id in 0-2) Partition and Shuffle (diagram: each Mapper's intermediates go through a Partitioner, combiners omitted here; each partition's intermediates are then collected by the corresponding Reducer) 100
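For contrast with the default, a minimal sketch of a custom Partitioner (ours; the two-bucket policy is purely illustrative), which would be registered on the job with job.setPartitionerClass(AlphabetPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (numPartitions < 2 || word.isEmpty()) return 0;
        char first = Character.toLowerCase(word.charAt(0));
        // Words starting with a-m go to partition 0, everything else to partition 1
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}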

51 Reduction public void reduce(WritableComparable key, Iterator values, Context context) Keys & values sent to one partition all go to the same reduce task Calls are sorted by key: earlier keys are reduced and output before later keys 101 Finally: Writing The Output (diagram: each Reducer writes through the OutputFormat's RecordWriter to its own output file) 102

52 OutputFormat The OutputFormat and RecordWriter interfaces dictate how to write the results of a job back to the underlying permanent storage The default format (TextOutputFormat) will write (key, value) pairs as strings to individual lines of an output file Writes "key\tval\n" strings using the toString() methods of the keys and values The SequenceFileOutputFormat will keep the data in binary, so it can later be read quickly by the SequenceFileInputFormat Uses a binary format to pack (k, v) pairs NullOutputFormat Discards output These classes make use of the write() method of the specific Writable classes used by your MapReduce pass 103 Sending Data to the Client The Reporter object sent to the Mapper allows simple asynchronous feedback incrCounter(Enum key, long amount) setStatus(String msg) Allows self-identification of input InputSplit getInputSplit() 104

53 MapReduce Coding Summary Decompose the problem into an appropriate workflow of MapReduce jobs For each job, implement the following Job configuration Map function Reduce function Combiner function (optional) Partition function (optional) Might have to create custom data types as well WritableComparable for keys Writable for values 105 Local (Standalone) Mode Runs the same MapReduce user program as the cluster version, but does it sequentially Does not use any of the Hadoop daemons Works directly with the local file system No HDFS, hence no need to copy data to/from HDFS Great for development, testing, initial debugging 106

54 Pseudo-Distributed Mode Still runs on a single machine, but now simulates a real Hadoop cluster Simulates multiple nodes Runs all daemons Uses HDFS Main purpose: more advanced testing and debugging You can also set this up on your laptop 107 Running a MapReduce Workflow Linear chain of jobs To run job2 after job1, create JobConfs conf1 and conf2 in the main function Call JobClient.runJob(conf1); JobClient.runJob(conf2); Catch exceptions to re-start failed jobs in the pipeline (see the sketch below) More complex workflows Use JobControl from org.apache.hadoop.mapred.jobcontrol 108
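A minimal sketch of such a linear chain with the old mapred API (ours; class and path names are illustrative, and each job would still need its mapper/reducer classes set):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepWorkflow {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf(TwoStepWorkflow.class);
        conf1.setJobName("step-1");
        FileInputFormat.setInputPaths(conf1, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf1, new Path("intermediate"));
        // ... setMapperClass/setReducerClass for step 1 go here ...
        JobClient.runJob(conf1);  // blocks; throws IOException if the job fails

        JobConf conf2 = new JobConf(TwoStepWorkflow.class);
        conf2.setJobName("step-2");
        FileInputFormat.setInputPaths(conf2, new Path("intermediate"));  // job1's output
        FileOutputFormat.setOutputPath(conf2, new Path(args[1]));
        // ... setMapperClass/setReducerClass for step 2 go here ...
        JobClient.runJob(conf2);
    }
}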

55 Lifecycle of a MapReduce Job (diagram: over time, input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2) How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined? 109 HDFS Limitations Almost GFS (Google FS) No file update options (record append, etc.); all files are write-once Does not implement demand replication Designed for sequential access Random seeks devastate performance 110

56 Comparison of MapReduce and Other Approaches (table from "MapReduce: The Programming Model and Practice", SIGMETRICS Tutorials 2009, Google) 111 Comparison of MapReduce and other approaches, continued (table from the same tutorial) 112

57 MapReduce: A Step Backwards? Don't need 1000 nodes to process petabytes: Parallel DBs do it in fewer than 100 nodes No support for schema: sharing across multiple MR programs is difficult No indexing: wasteful access to unnecessary data Non-declarative programming model: requires highly-skilled programmers No support for JOINs: requires multiple MR phases for the analysis Agrawal et al., VLDB 2010 Tutorial 113 Map Reduce vs Parallel DBMS (table):
Schema Support: Parallel DBMS yes; MapReduce not out of the box
Indexing: Parallel DBMS yes; MapReduce not out of the box
Programming Model: Parallel DBMS declarative (SQL); MapReduce imperative (C/C++, Java, ...), extensions through Pig and Hive
Optimizations (Compression, Query Optimization): Parallel DBMS yes; MapReduce not out of the box
Flexibility: Parallel DBMS not out of the box; MapReduce yes
Fault Tolerance: Parallel DBMS coarse-grained techniques; MapReduce built in
[Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010] 114

58 Additional Languages & Components 115 Hadoop and C++ Hadoop Pipes Library of bindings for native C++ code Operates over a local socket connection Straight computation performance may be faster Downside: kernel involvement and context switches 116

59 Hadoop and Python Option 1: Use Jython Caveat: Jython is a subset of full Python Option 2: Hadoop Streaming 117 Hadoop Streaming Effectively allows the shell pipe operator to be used with Hadoop You specify two programs for map and reduce (+) stdin and stdout do the rest (-) Requires serialization to text, context switches (+) Reuse Linux tools: cat, grep, sort, uniq 118

60 Eclipse Plugin Support for Hadoop in the Eclipse IDE Allows MapReduce job dispatch Panel tracks live and recent jobs The Hadoop Ecosystem 120

61 Hadoop Levels of Abstraction (from less Hadoop visible / more DB view to more Hadoop visible / more MapReduce view: HBase, queries against tables; Hive, SQL-like language; Pig, query and workflow language; Java, write map-reduce functions) 121 MapReduce Cloud Service Providing MapReduce frameworks as a service in clouds is becoming an attractive usage model for enterprises A MapReduce cloud service allows users to cost-effectively access a large amount of computing resources without creating their own cluster Users are able to adjust the scale of MapReduce clusters in response to changes in the resource demand of applications 122

62 Hadoop Workflow (diagram: you and the Hadoop cluster) 1. Load data into HDFS 2. Develop code locally 3. Submit MapReduce job 3a. Go back to Step 2 4. Retrieve data from HDFS 123 On Amazon: With EC2 (diagram: you and your Hadoop cluster on EC2) 0. Allocate Hadoop cluster 1. Load data into HDFS 2. Develop code locally 3. Submit MapReduce job 3a. Go back to Step 2 4. Retrieve data from HDFS 5. Clean up! Uh oh. Where did the data go? 124

63 On Amazon: EC2 and S3 (diagram: your Hadoop cluster runs on EC2 (the cloud); data is copied from S3 (persistent store) to HDFS, and from HDFS back to S3) 125 Takeaway MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance Principal philosophies: Make it scale, so you can throw hardware at problems Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance) MapReduce is not suitable for all problems, but when it works, it may save you a lot of time Agrawal et al., VLDB 2010 Tutorial 126

64 References We rely on the new org.apache.hadoop.mapreduce API, but many existing programs might be written using the old API org.apache.hadoop.mapred, and some old libraries might only support the old API
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Usenix OSDI '04
David DeWitt and Michael Stonebraker, "MapReduce: A major step backwards", craig-henderson.blogspot.com
CS9223 Massive Data Analysis, J. Freire & J. Simeon, New York University, 2013
INFM 718G / CMSC 828G Data-Intensive Computing with MapReduce, J. Lin, University of Maryland, 2013
"The MapReduce Programming Model: Introduction and Examples", Jose Maria Alvarez-Rodriguez, FP7 RELATE-ITN, April
Hadoop v1.0 Daemons: Putting it All Together JobTracker daemon: one per cluster, usually running on the master node; communicates with the client application and controls MapReduce execution in the TaskTracker daemons TaskTracker daemon: one TaskTracker per slave node; performs the actual Map and Reduce execution; can spawn multiple JVMs to do the work (diagram: a job submission node runs the jobtracker daemon; each slave node runs a tasktracker daemon and a datanode daemon on top of the Linux file system; the namenode daemon runs on its own node) Typical setup NameNode and JobTracker run on the cluster head node DataNode and TaskTracker run on all other nodes Secondary NameNode runs on a dedicated machine or on the cluster head node (usually not a good idea, but ok for small clusters) 128

65 JobTracker UI (screenshot) 129 NameNode WebUI (screenshot) 130

66 Datanodes (screenshot) 131


More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich First, an Announcement There will be a repetition exercise group on Wednesday this week. TAs will answer your questions on SQL, relational

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ.

Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ. Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ. Outline 1.Review of how map reduce works: the HDFS, Yarn sorting and shuffling advanced topics: partial sort, total sort, join, chained mapper/reducer,

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 Lecture XI: MapReduce & Hadoop The new world of Big Data (programming model) Big Data Buzzword for challenges occurring

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

Data Science in the Wild

Data Science in the Wild Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Big Data 2012 Hadoop Tutorial

Big Data 2012 Hadoop Tutorial Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Word Count Code using MR2 Classes and API

Word Count Code using MR2 Classes and API EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Hadoop Architecture and its Usage at Facebook

Hadoop Architecture and its Usage at Facebook Hadoop Architecture and its Usage at Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Microsoft Research, Seattle October 16, 2009 Outline Introduction

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information

HDFS Architecture Guide

HDFS Architecture Guide by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information