Scalable Computing with Hadoop
Doug Cutting
cutting@apache.org
dcutting@yahoo-inc.com
5/4/06
Seek versus Transfer
- B-Tree requires a seek per access, unless to a recent, cached page
  - so one can buffer & pre-sort accesses
  - but, w/ fragmentation, must still seek per page
[Figure: B-Tree with root keys G, O, U and leaf pages A-F, H-M, ...]
Seek versus Transfer
- update by merging
  - merge sort takes log(updates) passes, each at transfer rate
  - merging updates is linear in db size, at transfer rate
- if 10MB/s transfer, 10ms seek, 1% update of a TB db
  - 100-byte entries, 10KB pages: 10B entries, 1B pages
  - seek per update requires 1000 days!
  - seek per page requires 100 days!
  - transferring the entire db takes 1 day
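A back-of-the-envelope check of those day figures, using only the quantities on the slide (1TB db, 10B entries, 1B pages, 10ms seek, 10MB/s transfer). Mapping the three cases to "one seek per entry", "one seek per page", and "one sequential pass over the db" is my reading of the slide, not something it spells out:

public class SeekVsTransfer {
  public static void main(String[] args) {
    double seekSec = 0.010;           // 10 ms per seek
    double xferBytesPerSec = 10e6;    // 10 MB/s transfer
    double dbBytes = 1e12;            // 1 TB database
    double entries = 10e9;            // 10B entries of 100 bytes
    double pages = 1e9;               // 1B pages, as stated on the slide
    double day = 24 * 60 * 60;        // seconds per day

    System.out.printf("seek per entry: %.0f days%n", entries * seekSec / day);         // ~1,150 days
    System.out.printf("seek per page:  %.0f days%n", pages * seekSec / day);           // ~115 days
    System.out.printf("full transfer:  %.1f days%n", dbBytes / xferBytesPerSec / day); // ~1.2 days
  }
}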
Hadoop DFS
- modelled after Google's GFS
- single namenode
  - maps name -> <blockid>*
  - maps blockid -> <host:port>*replication_level
- many datanodes, one per disk generally
  - map blockid -> <byte>*
  - poll namenode for replication, deletion, etc. requests
- client code talks to both
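From client code this split is mostly invisible: file operations go through one filesystem interface, which consults the namenode for block locations and streams the bytes to and from datanodes. A minimal sketch, using the later org.apache.hadoop.fs API rather than the 2006-era classes, with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the configured default filesystem
    FileSystem fs = FileSystem.get(conf);       // DFS client; talks to the namenode
    Path path = new Path("/demo/hello.txt");    // hypothetical path

    FSDataOutputStream out = fs.create(path);   // namenode allocates blocks,
    out.writeBytes("hello, dfs\n");             // bytes are written to datanodes
    out.close();

    FSDataInputStream in = fs.open(path);       // namenode returns block locations,
    byte[] buf = new byte[64];                  // data is read back from datanodes
    int n = in.read(buf);
    System.out.println(new String(buf, 0, n));
    in.close();
  }
}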
Hadoop MapReduce
- Platform for reliable, scalable computing.
- All data is sequences of <key,value> pairs.
- Programmer specifies two primary methods:
  - map(k, v) -> <k', v'>*
  - reduce(k', <v'>*) -> <k', v'>*
  - also partition(), compare(), & others
- All v' with the same k' are reduced together, in order.
- Bonus: built-in support for sort/merge!
MapReduce job processing
[Dataflow diagram: input splits 0-4 -> five map() tasks -> three reduce() tasks -> output parts 0-2]
Example: RegexMapper

public class RegexMapper implements Mapper {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    String text = ((UTF8)value).toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      output.collect(new UTF8(matcher.group(group)), new LongWritable(1));
    }
  }
}
Example: LongSumReducer

public class LongSumReducer implements Reducer {

  public void configure(JobConf job) {}

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += ((LongWritable)values.next()).get();
    }
    output.collect(key, new LongWritable(sum));
  }
}
Example: main()

public static void main(String[] args) throws IOException {
  NutchConf defaults = NutchConf.get();
  JobConf job = new JobConf(defaults);

  job.setInputDir(new File(args[0]));

  job.setMapperClass(RegexMapper.class);
  job.set("mapred.mapper.regex", args[2]);
  job.set("mapred.mapper.regex.group", args[3]);

  job.setReducerClass(LongSumReducer.class);

  job.setOutputDir(new File(args[1]));
  job.setOutputKeyClass(UTF8.class);
  job.setOutputValueClass(LongWritable.class);

  JobClient.runJob(job);
}
Nutch Algorithms
- inject urls into a crawl db, to bootstrap it.
- loop:
  - generate a set of urls to fetch from the crawl db;
  - fetch a set of urls into a segment;
  - parse fetched content of a segment;
  - update the crawl db with data parsed from a segment.
- invert links parsed from segments
- index segment text & inlink anchor text
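Read as code, the cycle above is a small driver loop. A hypothetical sketch with stand-in method names (in real Nutch each step is a separate tool that launches the MapReduce jobs described on the following slides):

public class CrawlDriverSketch {

  public static void crawl(String crawlDb, int rounds, int topN) {
    inject(crawlDb, "urls/");                          // bootstrap the crawl db with seed urls
    for (int i = 0; i < rounds; i++) {
      String segment = generate(crawlDb, topN);        // pick the top-n urls due for fetch
      fetch(segment);                                  // fetch them into the segment
      parse(segment);                                  // parse the fetched content
      updateDb(crawlDb, segment);                      // fold fetch & parse results back into the db
    }
    invertLinks("linkdb", "segments/");                // compute inlinks for all urls
    index("indexes", crawlDb, "linkdb", "segments/");  // build Lucene indexes
  }

  // Stand-in stubs so the sketch is self-contained; each would run one or two
  // MapReduce jobs, as described on the slides that follow.
  static void inject(String db, String urls) {}
  static String generate(String db, int topN) { return "segments/" + System.currentTimeMillis(); }
  static void fetch(String segment) {}
  static void parse(String segment) {}
  static void updateDb(String db, String segment) {}
  static void invertLinks(String linkDb, String segments) {}
  static void index(String indexes, String db, String linkDb, String segments) {}
}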
Nutch on MapReduce & NDFS
- Nutch's major algorithms converted in 2 weeks.
- Before:
  - several were undistributed scalability bottlenecks
  - distributable algorithms were complex to manage
  - collections larger than 100M pages were impractical
- After:
  - all are scalable, distributed, easy to operate
  - code is substantially smaller & simpler
  - should permit multi-billion page collections
Data Structure: Crawl DB
- CrawlDb is a directory of files containing: <URL, CrawlDatum>
- CrawlDatum: <status, date, interval, failures, linkcount, ...>
- Status: {db_unfetched, db_fetched, db_gone, linked, fetch_success, fetch_fail, fetch_gone}
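To make the record concrete, a minimal Writable sketch of a CrawlDatum-like entry with the fields listed above. It is illustrative only: the real Nutch CrawlDatum has a different layout, and the constant values here are arbitrary. The Inject and Update sketches on later slides reuse this class.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CrawlDatumSketch implements Writable {

  // status codes from the slide; numeric values are arbitrary for this sketch
  public static final byte STATUS_DB_UNFETCHED  = 1;
  public static final byte STATUS_DB_FETCHED    = 2;
  public static final byte STATUS_DB_GONE       = 3;
  public static final byte STATUS_LINKED        = 4;
  public static final byte STATUS_FETCH_SUCCESS = 5;
  public static final byte STATUS_FETCH_FAIL    = 6;
  public static final byte STATUS_FETCH_GONE    = 7;

  private byte status = STATUS_DB_UNFETCHED;
  private long fetchTime;      // date of last (or next) fetch
  private int fetchInterval;   // time between re-fetches
  private byte failures;       // fetch failures so far
  private int linkCount;       // number of known inlinks

  public byte getStatus() { return status; }
  public void setStatus(byte status) { this.status = status; }
  public int getLinkCount() { return linkCount; }
  public void setLinkCount(int linkCount) { this.linkCount = linkCount; }

  public void write(DataOutput out) throws IOException {
    out.writeByte(status);
    out.writeLong(fetchTime);
    out.writeInt(fetchInterval);
    out.writeByte(failures);
    out.writeInt(linkCount);
  }

  public void readFields(DataInput in) throws IOException {
    status = in.readByte();
    fetchTime = in.readLong();
    fetchInterval = in.readInt();
    failures = in.readByte();
    linkCount = in.readInt();
  }
}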
Algorithm: Inject
- MapReduce1: Convert input to DB format
  - In: flat text file of urls
  - Map(line) -> <url, CrawlDatum>; status=db_unfetched
  - Reduce() is identity
  - Output: directory of temporary files
- MapReduce2: Merge into existing DB
  - Input: output of Step 1 and existing DB files
  - Map() is identity
  - Reduce: merge CrawlDatums into a single entry
  - Out: new version of DB
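A sketch of the map step of the first pass, in the style of the RegexMapper example and using the illustrative CrawlDatumSketch from the Crawl DB slide (the real Nutch Injector also normalizes and filters urls):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class InjectMapperSketch implements Mapper {

  public void configure(JobConf job) {}
  public void close() {}

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    String url = ((UTF8)value).toString().trim();     // one url per input line
    if (url.length() == 0) return;                    // skip blank lines
    CrawlDatumSketch datum = new CrawlDatumSketch();  // status defaults to db_unfetched
    output.collect(new UTF8(url), datum);
  }
}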
Algorithm: Generate
- MapReduce1: select urls due for fetch
  - In: Crawl DB files
  - Map(): if date <= now, invert to <CrawlDatum, url>
  - Partition by value hash (!) to randomize
  - Reduce:
    - compare(): order by decreasing CrawlDatum.linkCount
    - output only the top-n most-linked entries
- MapReduce2: prepare for fetch
  - Map() is invert; Partition() by host; Reduce() is identity.
  - Out: set of <url, crawldatum> files to fetch in parallel
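The "partition by value hash" trick can be sketched as a custom Partitioner: records are routed by the hash of the value (the url) rather than by the key, so entries are spread randomly across reduces. The non-generic interface shown matches the era of the other examples; the class name is illustrative.

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashValuePartitionerSketch implements Partitioner {

  public void configure(JobConf job) {}

  public int getPartition(WritableComparable key, Writable value, int numReduceTasks) {
    // hash the value (the url), not the key (the CrawlDatum)
    return (value.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}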
Algorithm: Fetch
- MapReduce: fetch a set of urls
  - In: <url, crawldatum>, partition by host, sort by hash
  - Map(url, CrawlDatum) -> <url, FetcherOutput>
    - multi-threaded, async map implementation
    - calls existing Nutch protocol plugins
  - FetcherOutput: <CrawlDatum, Content>
  - Reduce is identity
  - Out: two files: <url, crawldatum>, <url, content>
Algorithm: Parse
- MapReduce: parse content
  - In: <url, Content> files from Fetch
  - Map(url, Content) -> <url, Parse>
    - calls existing Nutch parser plugins
  - Parse: <ParseText, ParseData>
  - Reduce is identity.
  - Out: split in three: <url, parsetext>, <url, parsedata>, and <url, crawldatum> for outlinks.
Algorithm: Update Crawl DB
- MapReduce: integrate fetch & parse output into the db
  - In: <url, crawldatum> from the existing db plus fetch & parse output
  - Map() is identity
  - Reduce() merges all entries into a single new entry
    - overwrite previous db status w/ new status from fetch
    - sum count of links from parse w/ previous count from db
  - Out: new crawl db
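A rough sketch of that merge, using the illustrative CrawlDatumSketch from the Crawl DB slide: fetch status overwrites the previous db status, and links found by parse are summed with the previous link count. The real Nutch reducer handles more cases than this.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class CrawlDbReducerSketch implements Reducer {

  public void configure(JobConf job) {}
  public void close() {}

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    CrawlDatumSketch result = new CrawlDatumSketch();
    int oldLinks = 0, newLinks = 0;
    while (values.hasNext()) {
      CrawlDatumSketch d = (CrawlDatumSketch)values.next();
      byte s = d.getStatus();
      if (s == CrawlDatumSketch.STATUS_LINKED) {
        newLinks++;                            // a link discovered by parse
      } else if (s == CrawlDatumSketch.STATUS_FETCH_SUCCESS
              || s == CrawlDatumSketch.STATUS_FETCH_FAIL
              || s == CrawlDatumSketch.STATUS_FETCH_GONE) {
        result.setStatus(s);                   // fetch result overwrites db status
      } else {
        oldLinks = d.getLinkCount();           // entry from the previous db
        if (result.getStatus() == CrawlDatumSketch.STATUS_DB_UNFETCHED)
          result.setStatus(s);                 // keep old status unless fetch said otherwise
      }
    }
    result.setLinkCount(oldLinks + newLinks);  // sum parse links w/ previous count
    output.collect(key, result);
  }
}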
Algorithm: Invert Links
- MapReduce: compute inlinks for all urls
  - In: <url, parsedata>, containing page outlinks
  - Map(srcUrl, ParseData) -> <destUrl, Inlinks>
    - collect a single-element Inlinks for each outlink
    - limit number of outlinks per page
  - Inlinks: <srcUrl, anchorText>*
  - Reduce() appends inlinks
  - Out: <url, Inlinks>, a complete link inversion
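A sketch of the inversion with deliberately simplified types: outlinks arrive as tab-separated "destUrl<TAB>anchorText" lines in a UTF8 value, and each inlink is emitted as a "srcUrl<TAB>anchorText" UTF8 string (real Nutch uses ParseData and Inlinks Writables). The 100-outlink cap is an arbitrary stand-in for "limit number of outlinks per page".

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LinkInverterSketch {

  public static class InvertMapper implements Mapper {
    public void configure(JobConf job) {}
    public void close() {}

    public void map(WritableComparable srcUrl, Writable outlinks,
                    OutputCollector output, Reporter reporter) throws IOException {
      String[] lines = ((UTF8)outlinks).toString().split("\n");
      int limit = Math.min(lines.length, 100);        // limit outlinks per page
      for (int i = 0; i < limit; i++) {
        String[] parts = lines[i].split("\t", 2);     // destUrl <TAB> anchorText
        String anchor = parts.length > 1 ? parts[1] : "";
        // one single-element inlink per outlink, keyed by the destination url
        output.collect(new UTF8(parts[0]), new UTF8(srcUrl + "\t" + anchor));
      }
    }
  }

  public static class InvertReducer implements Reducer {
    public void configure(JobConf job) {}
    public void close() {}

    public void reduce(WritableComparable destUrl, Iterator inlinks,
                       OutputCollector output, Reporter reporter) throws IOException {
      StringBuffer all = new StringBuffer();
      while (inlinks.hasNext()) {                     // append all inlinks for this url
        all.append(inlinks.next().toString()).append('\n');
      }
      output.collect(destUrl, new UTF8(all.toString()));
    }
  }
}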
Algorithm: Index
- MapReduce: create Lucene indexes
  - In: multiple files, values wrapped in <Class, Object>
    - <url, ParseData> from parse, for title, metadata, etc.
    - <url, ParseText> from parse, for text
    - <url, Inlinks> from invert, for anchors
    - <url, CrawlDatum> from fetch, for fetch date
  - Map() is identity
  - Reduce() creates a Lucene Document
    - calls existing Nutch indexing plugins
  - Out: build Lucene index; copy to fs at end
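What the reduce step assembles per url, sketched with Lucene's Document/Field API. The field names and the Field constructor form shown are illustrative; in real Nutch the document is built by indexing plugins.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexDocSketch {

  // Combine per-url values from parse, invert, and fetch into one Lucene Document.
  public static Document makeDoc(String url, String title, String text,
                                 String anchors, String fetchDate) {
    Document doc = new Document();
    doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));       // from ParseData
    doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));       // from ParseText
    doc.add(new Field("anchor", anchors, Field.Store.NO, Field.Index.TOKENIZED));     // from Inlinks
    doc.add(new Field("date", fetchDate, Field.Store.YES, Field.Index.UN_TOKENIZED)); // from CrawlDatum
    return doc;
  }
}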
Hadoop MapReduce Extensions
- Split output to multiple files
  - saves subsequent i/o, since inputs are smaller
- Mix input value types
  - saves MapReduce passes to convert values
- Async Map
  - permits the multi-threaded Fetcher
- Partition by Value
  - facilitates selecting subsets w/ maximum key values
Thanks! http://lucene.apache.org/hadoop/