map/reduce: connected components
- find connected components with an algorithm analogous to the MST filtering approach:
  - map edges randomly to partitions (k subgraphs of n nodes)
  - in each partition, remove edges so that only a tree remains (reduce); connectivity is not affected
  - each reducer returns at most n edges
  - the result graph has fewer edges (≤ k·n)
- number of iterations analogous to MST
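One such filtering round can be sketched in plain Java (a toy sketch, not Hadoop code; the class and method names are made up for this example): edges are mapped to random partitions, and each "reducer" keeps only the spanning-forest edges of its partition, found with a union-find structure.

```java
import java.util.*;

public class ForestFilterSketch {
    // Minimal union-find with path compression.
    static int[] parent;
    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static boolean union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;   // edge would close a cycle
        parent[ra] = rb;
        return true;                  // edge joins two components
    }

    // One round: map edges to k random partitions, then each "reducer"
    // keeps only a spanning forest of its partition.
    public static List<int[]> filterRound(List<int[]> edges, int n, int k, Random rnd) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < k; i++) parts.add(new ArrayList<>());
        for (int[] e : edges) parts.get(rnd.nextInt(k)).add(e);   // map step

        List<int[]> kept = new ArrayList<>();
        for (List<int[]> part : parts) {                          // reduce step
            parent = new int[n];
            for (int i = 0; i < n; i++) parent[i] = i;
            for (int[] e : part)
                if (union(e[0], e[1])) kept.add(e);               // keep tree edges only
            // dropped edges closed cycles, so connectivity inside the
            // partition (and hence of the whole graph) is unchanged
        }
        return kept;  // at most k*(n-1) edges survive
    }

    public static void main(String[] args) {
        int n = 6;
        List<int[]> edges = Arrays.asList(
            new int[]{0,1}, new int[]{1,2}, new int[]{0,2},   // triangle 1
            new int[]{3,4}, new int[]{4,5}, new int[]{3,5});  // triangle 2
        List<int[]> kept = filterRound(new ArrayList<>(edges), n, 2, new Random(42));
        System.out.println("kept " + kept.size() + " of " + edges.size() + " edges");
    }
}
```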
map/reduce: generalizing the filtering approach
- partition the problem randomly into subproblems
- reduce the size of each subproblem independently
- recombine into a (reduced) problem
- repeat until the problem is small enough for a single node
- solve the small/sparse instance on a single node
- for concrete instantiations, two guarantees are needed:
  - the parts filtered out in the subproblems do not affect the optimal solution
  - the number of iterations is limited
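The generic loop can be written as a plain-Java skeleton (all names are hypothetical; this is a local simulation of the scheme, not a distributed implementation):

```java
import java.util.*;
import java.util.function.*;

public class FilteringSkeleton {
    // Generic filtering loop: shrink a problem instance round by round
    // until it fits on a single node, then solve it there.
    public static <T> List<T> solveByFiltering(
            List<T> items,
            int fitsOnOneNode,                       // threshold for "small enough"
            int k,                                   // number of partitions per round
            Function<List<T>, List<T>> reduceLocal,  // shrink one subproblem
            Function<List<T>, List<T>> solveSmall,   // exact solve on a single node
            Random rnd) {
        while (items.size() > fitsOnOneNode) {
            // partition randomly into k subproblems
            List<List<T>> parts = new ArrayList<>();
            for (int i = 0; i < k; i++) parts.add(new ArrayList<>());
            for (T it : items) parts.get(rnd.nextInt(k)).add(it);
            // reduce each subproblem independently, then recombine
            List<T> recombined = new ArrayList<>();
            for (List<T> p : parts)
                if (!p.isEmpty()) recombined.addAll(reduceLocal.apply(p));
            items = recombined;
        }
        return solveSmall.apply(items);  // final small instance, single node
    }
}
```

As a toy instantiation, taking `reduceLocal` and `solveSmall` to be "keep only the maximum of the part" computes the global maximum: discarding non-maximal elements of a part never discards the global maximum (first guarantee), and every round shrinks the instance to at most k elements (second guarantee).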
map/reduce: performance of algorithms
- map/reduce algorithms vary in a number of behavioral parameters which are characteristic for
  - performance (total computation time)
  - incurred workload (total involved work on all machines)
  - space consumption (memory)
- these can be formalized in two groups:¹
  - key complexity: resource consumption on individual nodes
  - sequential complexity: overall resource consumption

¹ Goel, Munagala, 2012
map/reduce: performance of algorithms
- key complexity
  - maximum size of any key/value pair
  - maximum running time of any mapper or reducer on any key/value pair
  - maximum memory consumption of any mapper or reducer on any key/value pair
  - individual nodes must be capable of executing single map/reduce operations
- sequential complexity
  - size of all key/value pairs input and output by mappers/reducers
  - total running time of all mappers/reducers
  - the whole system must be capable of the execution (e.g. sufficient number of machines)
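To make the two groups concrete, the space dimension of both measures can be computed for the intermediate (word, 1) pairs of a word-count map phase (a purely illustrative toy with naive byte counts, not part of the formal model):

```java
public class ComplexityDemo {
    // Returns {max pair size, total size} in bytes for the intermediate
    // (word, 1) pairs that a word-count map phase would emit.
    public static int[] measure(String[] docs) {
        int maxPairSize = 0, totalSize = 0;
        for (String doc : docs)
            for (String word : doc.split(" ")) {
                int pairSize = word.length() + Integer.BYTES;  // key bytes + int value
                maxPairSize = Math.max(maxPairSize, pairSize); // -> key complexity (space)
                totalSize += pairSize;                         // -> sequential complexity (space)
            }
        return new int[] { maxPairSize, totalSize };
    }

    public static void main(String[] args) {
        int[] r = measure(new String[] { "map reduce map", "reduce reduce" });
        System.out.println("key complexity (space): " + r[0] + " bytes per pair");
        System.out.println("sequential complexity (space): " + r[1] + " bytes total");
    }
}
```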
map/reduce: complexity classes
- the efficiency of algorithms is measured by their complexity
- complexity (time/space) is defined as a function of the size of the input problem
- problems are distinguished by the (possible) existence of efficient algorithms
- example: P²
  - P = { decision problems solvable in time O(p(n)), where p is a polynomial of the input size n }
  - problems are tractable (efficiently solvable) if a polynomial-time algorithm exists
- is there a comparable definition for map/reduce?

² This is not the exact definition, but an illustration.
complexity for parallel algorithms: NC
- PRAM setting: many cores, common memory
  - parallel algorithms can use arbitrarily many cores
  - these are usually not available (have to be emulated)
- NC (Nick's class): a decision problem is in NC if there exists an algorithm solving it which
  - uses a polynomial number of cores (O(n^k))
  - solves the problem in polylogarithmic time (O((log n)^c))
  - with k and c constant and n being the input size
- considered to be the class of efficiently parallelizable problems
- variants: NC^i solvable in time O((log n)^i) (often denoted O(log^i n))
complexity for map/reduce algorithms: MRC^i
- let ε > 0 be a fixed value; let the input be a finite sequence ⟨k_i, v_i⟩ with total size (in bits) n
- an algorithm A consists of R map (µ) and reduce (ρ) steps µ_1, ρ_1, µ_2, ρ_2, ..., µ_R, ρ_R
- A is in MRC^i if it outputs the correct answer with probability at least 3/4 and, for input size n:
  - each µ_r/ρ_r is a randomized mapper/reducer with run time polynomial in n, memory consumption O(n^(1−ε)), and word length O(log n)
  - the total space consumption of the key/value pairs resulting from any µ_r is in O(n^(2−2ε)) (note: = (n^(1−ε))²)
  - the number of rounds R ∈ O(log^i n)
- deterministic variant DMRC^i: correct answer with probability 1

source: Karloff, Suri, Vassilvitskii, A Model of Computation for MapReduce, 2010
interpretation of MRC^i
- each individual task (map/reduce) has polynomial run time
- space consumption is O(n^(1−ε)), e.g. for ε = 0.5: O(√n); ε should be maximized
- a mapper should not produce more than a quadratic amount of output: O(n^(1−ε)) in memory, O(n^(2−2ε)) output
- total memory on all nodes is limited to O(n^(2−2ε)); this needs O(n^(1−ε)) nodes
- note: P and NC are classes of problems, MRC is a class of algorithms
- open question: how to distribute (shuffle) O(n^(2−2ε)) key/value pairs with O(n^(1−ε)) memory per node

Graham's Greedy Algorithm (Graham 1966)
MST complexity
- note: the notations of MRC and MST collide; n and ε below are from the MST setting
- input size: n^(1+c)
- steps: c/ε ∈ O(log n) (the input size is n^(1+c), i.e. 1 + c = log_n of the input size)
- nodes in the MST algorithm need memory n^(1+ε)
- memory: n^(1+ε) < n^(1+c), otherwise solve directly on a single node
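The step bound can be reconstructed from the filtering argument (a sketch, assuming each round splits the current m_r edges into pieces of size n^(1+ε) and each piece returns at most n spanning-forest edges):

```latex
\[
m_0 = n^{1+c}, \qquad
m_{r+1} \le \frac{m_r}{n^{1+\varepsilon}} \cdot n = \frac{m_r}{n^{\varepsilon}}
\quad\Longrightarrow\quad
m_r \le n^{1+c-r\varepsilon}.
\]
\[
n^{1+c-r\varepsilon} \le n^{1+\varepsilon}
\;\Longleftrightarrow\;
r \ge \frac{c}{\varepsilon} - 1,
\]
```

so after ⌈c/ε⌉ rounds the remaining instance fits into the memory n^(1+ε) of a single node.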
Hadoop/HDFS
introduction
- up to here, only the mongodb implementation of mapreduce was considered
  - this is not a full implementation and comes with limitations
  - e.g. no guarantee that all pairs for one key end up at the same reducer
  - the behavior cannot be influenced
  - advantage: simple setup and simple usage
- an example of an m/r implementation with more features and possibilities is Hadoop/HDFS
  - allows implementations in Java
introduction
- two main components
- HDFS, the Hadoop distributed file system
  - runs on top of the OS file system
  - provides a view on real files stored on the actual hardware
  - fixed block size (64 MB)
  - optimized for write once, read often
- Hadoop, the execution layer
  - implements the mapreduce execution
  - handles failed tasks (retry and give up)
  - handles distribution of tasks
- both layers (storage and execution) run on the same nodes
typical architecture
- a network of nodes is connected to a Hadoop cluster
- one master node
  - NameNode: addresses data (which block on which node)
  - JobTracker: execution management
- slave nodes
  - DataNode: data storage
  - TaskTracker: execution of tasks
  - both processes run on the same (physical) node
- clients send jobs to the JobTracker
- the JobTracker distributes tasks among the slave nodes
HDFS - overview
- the NameNode
  - distributes blocks to nodes
  - ensures redundancy
  - constant contact: check-ins from the slaves
  - keeps data organized as files and directories
  - is a single point of failure
  - handles only meta-data; nodes and clients communicate directly
- optimized for streaming access
  - no random file access
  - no appending of data
- organized like Unix file systems
- can be mounted (i.e. blended into the general file system)
setting up an example installation
- requirements
  - Hadoop/HDFS use ssh and rsync for communication
    - ssh: secure shell (remote login); client and server (sshd) needed
    - rsync: remote synchronization (data transfer)
  - jobs are implemented in Java, so a Java Runtime Environment is needed
- installation from tarball³
  - use the latest stable version (2.4.1)
- local installation, standalone mode
  - extract the tarball, change into the directory, test: run bin/hadoop

³ source: https://hadoop.apache.org/releases.html
test job execution
- the Hadoop distribution provides example jobs in hadoop-examples-1.1.2.jar
- jobs need an input and an output directory
- create the input directory and copy some files into it:

  $ mkdir input
  $ cp conf/*.xml input

- execute an example job:

  $ bin/hadoop jar hadoop-examples-1.1.2.jar \
      grep input output 'dfs[a-z.]+'

- result: lots of logging output, files in output:

  $ ls output/
  _SUCCESS  part-00000
job implementation
- a job is defined in an arbitrary Java class
  - Hadoop will start the main() method
  - the main method configures and runs the actual job
- job configuration
  - name
  - input/output format: special classes providing verification and reading/writing methods for I/O operations
  - output key/value classes: type specifications as Java classes
  - mapper/combiner/reducer
    - mapper and reducer as usual, but as Java classes
    - combiner: analogous, but used to combine mapper output before sending it to other nodes
  - input/output paths for file access
putting a job together

public static void main(final String[] args) throws Exception {
    final JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
the mapper class
- implement the interface Mapper
  - has type parameters: input key, input value, output key, output value
- the class MapReduceBase provides empty implementations for the functions
- the map function is implemented as public void map()
  - key and value as input (according to the type parameters)
  - an OutputCollector is used to emit key/value pairs
  - a Reporter for logging and progress reports
- example: word count
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(final LongWritable key, final Text value,
            final OutputCollector<Text, IntWritable> output,
            final Reporter reporter) throws IOException {
        final String line = value.toString();
        final StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
the reducer class
- again, type parameters for input and output
- the function reduce() performs the actual task
- reporting and output collection analogous to the mapper class
- both classes need access to the hadoop library
  - hadoop-core-1.1.2.jar, found in the main directory of the distribution
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(final Text key, final Iterator<IntWritable> values,
            final OutputCollector<Text, IntWritable> output,
            final Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
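To see the dataflow of the mapper and reducer without a cluster, the map → shuffle → reduce steps can be simulated in plain Java (a local sketch for illustration; this is not how Hadoop executes the job):

```java
import java.util.*;

public class WordCountLocal {
    // Simulates the map -> shuffle -> reduce dataflow of the word-count
    // job in-memory, without any Hadoop dependency.
    public static Map<String, Integer> run(List<String> lines) {
        // map: emit one (word, 1) pair per token
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (StringTokenizer t = new StringTokenizer(line); t.hasMoreTokens(); )
                pairs.add(new AbstractMap.SimpleEntry<>(t.nextToken(), 1));

        // shuffle: group all values by key (this is what the framework does
        // between the map and reduce phases)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // reduce: sum the values per key
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```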
compile and execute
- before execution, the classes have to be compiled and packed into a jar file
- an eclipse export is possible; alternatively, in the distribution directory:

  $ mkdir wordcount_classes
  $ javac -classpath hadoop-core-1.1.2.jar -d wordcount_classes \
      WordCount.java
  $ jar -cvf wordcount.jar -C wordcount_classes/ .

- without further configuration, hdfs reads directly from the local file system:

  $ bin/hadoop dfs -ls /tmp
  $ bin/hadoop dfs -cat /tmp/hadoop/input/test.txt
compile and execute
- run with
  - /tmp/hadoop/input as input directory
  - /tmp/hadoop/output as output directory
  - class WordCount in package my.pack
- here: directly in the main directory of the Hadoop distribution; otherwise provide the full path to the jar:

  $ bin/hadoop jar wordcount.jar my.pack.WordCount \
      /tmp/hadoop/input /tmp/hadoop/output

- TextInputFormat reads all files in the input directory
  - uses the byte offset of each line as key, the line itself as value
- TextOutputFormat writes all key/value pairs as plain text to the output directory
- for more details, c.f. hadoop.apache.org/docs/stable/mapred_tutorial.html#Inputs+and+Outputs
using hadoop with python
- hadoop streaming supports the execution of scripts in arbitrary languages
- interface: input and output via the system in/out streams
  - scripts read input from stdin and write output to stdout

  #!/bin/bash
  /opt/hadoop/bin/hadoop jar \
      /opt/hadoop/share/hadoop/tools/lib/hadoop-*streaming*.jar \
      -mapper mapper.py -reducer reducer.py \
      -input pg4300.txt -output pg4300.out

- c.f. the example from: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- input/output comes as text/csv files (tab delimited)