INFO5011. Cloud Computing Semester 2, 2011 Lecture 6, MapReduce
|
|
- Reginald Lucas
- 7 years ago
- Views:
Transcription
1 INFO5011 Cloud Computing Semester 2, 2011 Lecture 6, MapReduce COMMONWEALTH OF Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of the university of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice. The presentation is based on: Jeff Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04. Original slides can be found from:
2 Outline MapReduce Framework introduction Word Count Example in Hadoop - Mapper, Reducer, Job, InputFormat, OutputFormat Sort example in Hadoop - Custom Partitioner Chaining jobs 2
3 Motivation: Large Scale Data Processing Many tasks: Process lots of data to produce other data Want to use hundreds or thousands of CPUs - but this needs to be easy MapReduce - Automatic parallelization and distributio - Fault-tolerance - I/O scheduling - Status and monitoring 3
4 Input & Output: each a set of key/value pairs Programmer specifies two functions: - map (in_key, in_value) -> list(out_key, intermediate_value) - Processes input key/value pair - Produces set of intermediate pairs - reduce (out_key, list(intermediate_value)) -> list(out_value) - Combines all intermediate values for a particular key - Produces a set of merged output values (usually just one) Inspired by similar primitives in LISP and other languages Programming model 4
5 map(string input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); Example: Count word occurrences reduce(string output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result)); 5
6 Model is widely applicable Example uses: distributed grep distributed sort web link-graph reversal term-vector per host web access log stats inverted index construction document clustering machine learning statistical machine translation Diagram from the original slides by Jeff Dean and Sanjay Ghemawat 6
7 Typical cluster: - 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory - Limited bisection bandwidth - Storage is on local IDE disks - GFS: distributed file system manages data Implementation Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines 7
8 MapReduce execution overview Data Locality Split 0 and 1 locate in the same worker machine, two map tasks are assigned to this worker. Input data is read locally! 1 GFS chunk may equal 1 or more splits Master Operation Master stores the state of each map and reduce tasks It receives intermediate file locations and push them to reduce tasks incrementally Diagram from the CACM version of the original MapReduce paper 8
9 Parallel Execution GFS Local disk The shuffle process uses RPC read GFS The partition function put all map output keys into R region, in this case R =2 and k2, k4,k5 is partitioned to region 1 while k1 and k3 are partitioned to region 2 The default partition function is hashing e.g. hash( key ) mod R Diagram from the original slides by Jeff Dean and Sanjay Ghemawat 9
10 Task Granularity And Pipelining Fine granularity tasks: many more map tasks than machines - Minimizes time for fault recovery, many workers (instead of one) take over the tasks on a failed worker - Can pipeline shuffling with map execution - Better dynamic load balancing Often use 200,000 map/5000 reduce tasks w/ 2000 machines Diagram from the original slides by Jeff Dean and Sanjay Ghemawat 10
11 A trace of a MapReduce job 11
12 A Real Word Count Example Input Data: user \t photo_id \t tags \t date \t place_id \n user_id: date: :07:05 Place_id: ymji9yyya5u6djf41q tags Output: all tags that occurs more than 500 times and its corresponding frequency 12
13 Input: Sample input and output and Map/Reduce :52:38 party music night photoshop disco festa OliN.OabAZ5BL3fBcw :17:51 santa beach night la lanzarote playa playja YOTPODybCJ8C3IUN_g :40:53 photoshop portait ubowh1cebpiaydi :46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A :46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A :46:02 cap festa dany 2007 BXEugr.bAZ5Srgn39A :58:49 city hdr ubowh1cebpiaydi Output: ianbramham 922 iceland 1818 iledefrance 1187 illustration 608 Map/Reduce jobs: Map: line -> (k:tag, v:1) Reduce: (k:tag,v:(v1,v2,..)) -> (K:tag, v: v sum ) The number is not always 1 as emitted by the map function because we normally apply a combiner after mapper job imcomk 3584 installation 4072 interest 742 Map/Reduce examples: Map: first line -> (party,1),(music,1),(night,1),,(festa,1) Reduce: (interest,(5,10,..)) -> (interest, 742) 13
14 A master node runs JobTracker (a MapReduce job) Hadoop Framework - In small or medial cluster (< 40 servers), it is OK to put the HDFS Namenode and MapReduce JobTracker in the same physical nodes - In large cluster (multiple racks), it is better to have dedicated Namenode and JobTracker Lots of slave node runs TaskTracker (a Mapper or Reducer task) - TaskTrackers run on Datanodes of HDFS - Locality, moving computation to data 14
15 The Important Java Classes In Hadoop Mapper task Reducer task 15
16 Important Java Classes In Hadoop The MapReduce Job 16
17 The Mapper public static class TagMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable ONE = new IntWritable(1); private Text word = new Text(); public void map(object key, Text value, Context context) throws IOException, InterruptedException { String[] dataarray = value.tostring().split("\t"); if (dataarray.length < 5){ // a not complete record with all data return; // don't emit anything String tagstring = dataarray[3]; if (tagstring.length() > 0){ String[] tagarray = tagstring.split(" "); for(string tag: tagarray) { word.set(tag); context.write(word, ONE); This example is based on the WordCount example comes with hadoop download 17
18 The Reducer public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); private final static int MINFREQ = 500; public void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); if (sum > MINFREQ){ //only emit when sum is bigger than the threshold result.set(sum); context.write(key, result); 18
19 The Job setting and submitting public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); String[] otherargs = new GenericOptionsParser(conf, args).getremainingargs(); if (otherargs.length!= 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); Job job = new Job(conf, "word count"); job.setnumreducetasks(2); job.setjarbyclass(wordcount.class); job.setmapperclass(tagmapper.class); job.setcombinerclass(intsumreducer.class); job.setreducerclass(intsumreducer.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true)? 0 : 1); 19
20 InputFormat Input data - How input files are split up and read is defined by the InputFormat. FileInputFormat is the abstract class for all file inputs. InputFormat Description Key Value TextInputFormat InputSplit Default format for plain text files; reads lines of textfiles Byte offset of the line KeyValueTextInputFormat Parse lines into key, val pairs Everything up to the first tab character SequenceFileInputFormat A Hadoop-specific highperformance binary format User defined The line content The remainder of the line User defined - An InputSplit describes a unit of work that comprises a single map task in a MapReduce job. - By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS) - This can be modified by setting split size in configuration file or in code at run time. 20
21 RecordReader Input data - The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. - The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. - Developers can write their own RecordReader 21
22 Output Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the process of turning a byte stream back into a series of structured objects. Serialization is important for interprocess communication and for persistent storage in Hadoop - The Reducer use Remote Procedure Calls(RPC) to get intermediate data stored locally in Mapper nodes. - The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. - The Reducer also write final results to HDFS Both interprocess communication and persistent storage requires serialization to be compact, fast, extensible and interoperable Hadoop uses its own serialization format, Writable - Text, IntWritable are all subclass of Writable. 22
23 Output Context is used to collect and write the ouput into intermediate as well as final files. The method context.write() takes (Key,Value) Partition and Shuffle - This process of moving map outputs to the reducers is known as shuffling. - The Partitioner class determines which partition a given (key, value) pair emitted by mapper will go to. The default one use hashing Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer. OutputFormat: The (key, value) pairs collected by Context are then written to output files. The way they are written is governed by the OutputFormat. The default TextOutputFormat writes lines in key \t value form 23
24 How Hadoop runs a MapReduce Job This include the jar file, which has a default replication factor 10 (the hotspot mentioned in GFS paper) Diagram from Tom White, Hadoop, the definitive Guide, O reilly, 2009, page
25 The communication between mapper and reducer Diagram from Tom White, Hadoop, the definitive Guide, O reilly, 2009, page
26 Sort in Hadoop Hadoop has a standard TeraSort example, the source can be found from hadoop distribution Each reducer sorts its input <key,v> pairs based on the keys - Primitive key type such as string, int has its build comparator - Custom key type should provide a comparator Reducer are identified by id [0,R-1] - Reducer s output are named as part-r-00000, part-r-00001, - The final output is a merge of all individual outputs in the order Default hash partitioner ensure map output keys are distributed to all reducers evenly To implement sort, a customized partitioner is required to make sure all keys in reducer k-1 is always smaller than keys in reducer k. 26
27 The initial output of word count is sorted on key We want the list to be sorted on frequency Example: sort tags based on frequency automobile 2700 aviation 1957 avion 1518 awesome 626 badajoz 734 bahia 701 barcelona 1305 barros 639 bc 2271 bcn 507 beach 6010 beachwedding 51 pomegranate m 51 digitalrebel 51 grandwesterncanal 51 sa 51 linna 52 jenbrian 52 3steps 52 gazerock 52 creeks 52 27
28 A second job is needed Map/Reduce jobs: Map: (tag, freq) -> (k:freq, v:tag) Reduce: (k:freq,v:(t1,t2,..)) -> (K:t1, v: freq), (K:t1, v: freq), apple: 5 pear: 6 banana:71 Mapper 1 5:apple 6: pear 71:banana 5:apple 6: pear 6: red 23: yellow 35:palace Reducer 0 apple: 5 pear: 6 red: 6 yellow: 23 palace: 35 beijing: 60 city: 600 palace: 35 Mapper 2 35:palace 60:beijing 600: city 60:beijing 71: banana Reducer 1 beijing: 60 banana:71 red: 6 green: 1002 yellow: 23 Mapper 3 6: red 23: yellow 1002: green 600:city 1002: green Reducer 2 city: 600 green:
29 The Mapper and Reducer public class TagFreqMapper extends Mapper<Object, Text, IntWritable, Text> { private Text word = new Text(); private IntWritable freq = new IntWritable(); //just reverse the key and value public void map(object key, Text value, Context context) throws IOException, InterruptedException{ String[] tagfreqarray = value.tostring().split("\t"); if (tagfreqarray.length == 2){ word.set(tagfreqarray[0]); try{ freq.set(integer.parseint(tagfreqarray[1])); context.write(freq, word); catch(numberformatexception e){ return; //not doing anything public class TagFreqReducer extends Reducer<IntWritable,Text,Text,IntWritable> { public void reduce(intwritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException{ for (Text val : values) { context.write(val, key); 29
30 A very simple custom partitioner package info5011sort; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.partitioner; public class SortPartitioner extends Partitioner<IntWritable, Text> { public int getpartition(intwritable key, Text value, int arg2) { // a simple and static partitioner to partition the key into 3 regions // [6,50) => 0, [50, 500) => 1, [500,-) => 2 int keyint = key.get(); if (keyint <50) return 0; if (keyint < 500 ) return 1; else return 2; Set the custom partitioner for this job: job.setpartitionerclass(sortpartitioner.class); 30
31 Problem of the simple partitioner The keys are not evenly distributed! In the TeraSort algorithm, the boundaries of each region is determined dynamically based on data distribution The distribution is estimated by taking a small sample of the data - E.g. for example, the tag frequency distribution is like a zipf shape 31
32 Frequency The distribution of tag frequencies Frequency Histogram % % % % % % % tag frequency range Bin Frequency Cumulative % % % % % The bounds 10 and 30 would roughly divide the keys evenly among three reducers 32
33 Chaining multiple jobs If we want to find the tag frequencies and sort tags based on frequencies, we need to run two jobs It is easy to chain jobs together in a form of - Map1 -> Reduce 1 -> Map2 -> Reduce 2 - The first job in the chain should write its output to a path which is then used as the input path for the second job Job countjob = new Job(conf, "count Tag"); countjob.setnumreducetasks(3); TextInputFormat.addInputPath(countJob, new Path(otherArgs[0])); TextOutputFormat.setOutputPath(countJob, new Path("temp")); countjob.waitforcompletion(true); Job sortjob = new Job(conf,"sort Tag"); TextInputFormat.addInputPath(sortJob, new Path("temp")); TextOutputFormat.setOutputPath(sortJob, new Path(args[1])); System.exit(sortJob.waitForCompletion(true)? 0 : 1); 33
34 input - Photo: user \t photo_id \t tags \t date \t place_id \n Output: - Place_id \t freq Your turn 34
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
More informationMrs: MapReduce for Scientific Computing in Python
Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing
More informationBig Data 2012 Hadoop Tutorial
Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:
More informationBig Data Analytics* Outline. Issues. Big Data
Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example
More informationExtreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
More informationUniversity of Maryland. Tuesday, February 2, 2010
Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationWorking With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology
Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new
More informationHadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
More informationProcessing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
More informationHadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
More informationHadoop Design and k-means Clustering
Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise
More informationTutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
More informationLecture 3 Hadoop Technical Introduction CSE 490H
Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationBIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
More informationHow To Write A Map Reduce In Hadoop Hadooper 2.5.2.2 (Ahemos)
Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map
More informationData Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
More informationLambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
More informationHadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
More informationBig Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
More informationIntroduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
More informationIntroduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
More informationIntroduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
More informationHadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ.
Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ. Outline 1.Review of how map reduce works: the HDFS, Yarn sorting and shuffling advanced topics: partial sort, total sort, join, chained mapper/reducer,
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
More informationJordan Boyd-Graber University of Maryland. Tuesday, February 10, 2011
Data-Intensive Information Processing Applications! Session #2 Hadoop: Nuts and Bolts Jordan Boyd-Graber University of Maryland Tuesday, February 10, 2011 This work is licensed under a Creative Commons
More informationMap-Reduce and Hadoop
Map-Reduce and Hadoop 1 Introduction to Map-Reduce 2 3 Map Reduce operations Input data are (key, value) pairs 2 operations available : map and reduce Map Takes a (key, value) and generates other (key,
More informationHadoop Lab Notes. Nicola Tonellotto November 15, 2010
Hadoop Lab Notes Nicola Tonellotto November 15, 2010 2 Contents 1 Hadoop Setup 4 1.1 Prerequisites........................................... 4 1.2 Installation............................................
More informationXiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
More informationHadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology
Hadoop Dawid Weiss Institute of Computing Science Poznań University of Technology 2008 Hadoop Programming Summary About Config 1 Open Source Map-Reduce: Hadoop About Cluster Configuration 2 Programming
More informationWord Count Code using MR2 Classes and API
EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND
More informationThe Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationWhat s Big Data? Big Data: 3V s. Variety (Complexity) 5/5/2016. Introduction to Big Data, mostly from www.cs.kent.edu/~jin/bigdata by Ruoming Jin
data every day 5/5/2016 Introduction to Big Data, mostly from www.cs.kent.edu/~jin/bigdata by Ruoming Jin What s Big Data? No single definition; here is from Wikipedia: Big data is the term for a collection
More informationMapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
More informationData-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
More informationEnterprise Data Storage and Analysis on Tim Barr
Enterprise Data Storage and Analysis on Tim Barr January 15, 2015 Agenda Challenges in Big Data Analytics Why many Hadoop deployments under deliver What is Apache Spark Spark Core, SQL, Streaming, MLlib,
More informationHadoop Overview. July 2011. Lavanya Ramakrishnan Iwona Sakrejda Shane Canon. Lawrence Berkeley National Lab
Hadoop Overview Lavanya Ramakrishnan Iwona Sakrejda Shane Canon Lawrence Berkeley National Lab July 2011 Overview Concepts & Background MapReduce and Hadoop Hadoop Ecosystem Tools on top of Hadoop Hadoop
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationWeekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationMapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
More informationIstanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
More informationComparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationPerformance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationmap/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
More informationThe MapReduce Framework
The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16 Background
More informationBig Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
More informationBy Hrudaya nath K Cloud Computing
Processing Big Data with Map Reduce and HDFS By Hrudaya nath K Cloud Computing Some MapReduce Terminology Job A full program - an execution of a Mapper and Reducer across a data set Task An execution of
More informationAVRO - SERIALIZATION
http://www.tutorialspoint.com/avro/avro_serialization.htm AVRO - SERIALIZATION Copyright tutorialspoint.com What is Serialization? Serialization is the process of translating data structures or objects
More informationCS 378 Big Data Programming. Lecture 2 Map- Reduce
CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationCopy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)
Beijing Codelab 1 Introduction to the Hadoop Environment Spinnaker Labs, Inc. Contains materials Copyright 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License --
More informationHDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"
HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter
More informationOutline of Tutorial. Hadoop and Pig Overview Hands-on
Outline of Tutorial Hadoop and Pig Overview Hands-on 1 Hadoop and Pig Overview Lavanya Ramakrishnan Shane Canon Lawrence Berkeley National Lab October 2011 Overview Concepts & Background MapReduce and
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationCS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
More informationCloud Computing: MapReduce and Hadoop
Cloud Computing: MapReduce and Hadoop June 2010 Marcel Kunze, Research Group Cloud Computing KIT University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association www.kit.edu
More informationHadoop. History and Introduction. Explained By Vaibhav Agarwal
Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More informationStep 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
More informationhadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm.
hadoop Running hadoop on Grid'5000 Vinicius Cogo vielmo@lasige.di.fc.ul.pt Marcelo Pasin pasin@di.fc.ul.pt Andrea Charão andrea@inf.ufsm.br Outline 1 Introduction 2 MapReduce 3 Hadoop 4 How to Install
More informationHadoop Configuration and First Examples
Hadoop Configuration and First Examples Big Data 2015 Hadoop Configuration In the bash_profile export all needed environment variables Hadoop Configuration Allow remote login Hadoop Configuration Download
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationHadoop Streaming. Table of contents
Table of contents 1 Hadoop Streaming...3 2 How Streaming Works... 3 3 Streaming Command Options...4 3.1 Specifying a Java Class as the Mapper/Reducer... 5 3.2 Packaging Files With Job Submissions... 5
More informationINTRODUCTION TO HADOOP
Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More informationHadoop Basics with InfoSphere BigInsights
An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 2: Using MapReduce An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted Rights
More informationHowTo Hadoop. Devaraj Das
HowTo Hadoop Devaraj Das Hadoop http://hadoop.apache.org/core/ Hadoop Distributed File System Fault tolerant, scalable, distributed storage system Designed to reliably store very large files across machines
More informationHadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
More informationThe Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationMapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.
MapReduce: Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Motivation: Large Scale Data Processing Many tasks: Process lots of data to produce other data Want to use
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationHadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
More informationHadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationZebra and MapReduce. Table of contents. 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples...
Table of contents 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples... 2 1. Overview MapReduce allows you to take full advantage of Zebra's capabilities.
More informationResearch on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
More informationSriram Krishnan, Ph.D. sriram@sdsc.edu
Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
More informationDistributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)
using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed
More informationMASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More information