Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team owen@yahoo-inc.com

Who Am I? Yahoo! Architect on Hadoop Map/Reduce: I design, review, and implement features in Hadoop, and have worked on Hadoop full time since Feb 2006. Before the Grid team, I worked on Yahoo's WebMap. PhD from UC Irvine in Software Testing. VP of Apache for Hadoop and Chair of the Hadoop Program Management Committee, responsible for building the Hadoop community and interfacing between the Hadoop PMC and the Apache Board.

Problem How do you scale up applications? With 100's of terabytes of data, it takes 11 days to read the data on 1 computer. You need lots of cheap computers. That fixes the speed problem (15 minutes on 1000 computers), but creates reliability problems: in large clusters, computers fail every day, and cluster size is not fixed. We need common infrastructure that is efficient and reliable.
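The scan-time arithmetic above can be sketched in a few lines. The throughput figure below (100 MB/s per machine) is an assumed round number, not from the slides; only the 100 TB / 11 days / 15 minutes figures are theirs.

```python
def read_time_hours(data_tb, machines, mb_per_sec=100):
    """Hours to scan data_tb terabytes at mb_per_sec per machine in parallel."""
    total_mb = data_tb * 1024 * 1024
    return total_mb / (machines * mb_per_sec) / 3600

# One machine: roughly 12 days; 1000 machines: well under half an hour.
one_machine = read_time_hours(100, 1)
one_thousand = read_time_hours(100, 1000)
```

With these assumed numbers the result lands close to the slide's 11-days-vs-15-minutes comparison; the point is that scan time divides by the machine count.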

Solution An open source Apache project. "Hadoop" is named after Doug Cutting's son's stuffed elephant. Hadoop Core includes: the Distributed File System, which distributes data, and Map/Reduce, which distributes the application. Written in Java; runs on Linux, Mac OS/X, Windows, and Solaris on commodity hardware.

Hadoop Timeline
2004 HDFS & map/reduce started in Nutch
Dec 2005 Nutch ported to map/reduce
Jan 2006 Doug Cutting joins Yahoo
Feb 2006 Factored out of Nutch
Apr 2006 Sorts 1.9 TB on 188 nodes in 47 hours
May 2006 Yahoo sets up research cluster
Jan 2008 Hadoop becomes a top-level Apache project
Feb 2008 Yahoo creating WebMap with Hadoop
Apr 2008 Wins the terabyte sort benchmark
Aug 2008 Ran a 4000-node Hadoop cluster

Commodity Hardware Cluster (diagram: a two-level tree of racks and nodes) Typically a 2-level architecture: nodes are commodity Linux PCs, 40 nodes per rack, an 8 gigabit uplink from each rack, and 1 gigabit all-to-all within a rack.

Distributed File System A single petabyte file system for the entire cluster, managed by a single namenode. Files can be written, read, renamed, and deleted, but writes are append-only. Optimized for streaming reads of large files. Files are broken into large blocks, transparently to the client; blocks are typically 128 MB and are replicated to several datanodes for reliability. The client talks to both the namenode and datanodes; data is not sent through the namenode. Throughput of the file system scales nearly linearly. Access from Java, C, or the command line.
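The block math implied above is simple but worth making concrete. A minimal sketch, using the 128 MB block size and 3-way replication from these slides (the function names are illustrative, not HDFS API):

```python
import math

BLOCK = 128 * 1024 * 1024  # 128 MB, the typical block size from the slide

def block_count(file_bytes):
    """Number of blocks a file of this size is split into."""
    return math.ceil(file_bytes / BLOCK)

def raw_bytes_stored(file_bytes, replication=3):
    """Physical bytes consumed across datanodes with the given replication."""
    return file_bytes * replication
```

So a 1 GB file occupies 8 blocks and, at the default replication of 3, consumes 3 GB of raw disk across the cluster.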

Block Placement The default is 3 replicas, but this is settable. Blocks are placed (writes are pipelined): on the same node as the writer, then on a node on a different rack, then on another node on that other rack. Clients read from the closest replica. If the replication for a block drops below target, it is automatically re-replicated.
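The placement policy above can be sketched as a small function. This is a toy model with a hypothetical rack map, not the NameNode's actual placement code; it follows the reading that the second and third replicas share one remote rack.

```python
import random

def place_replicas(writer_node, racks):
    """Pick 3 replica nodes per the default policy sketched above:
    the writer's own node, then two distinct nodes on one remote rack.
    racks: dict mapping rack name -> list of node names."""
    first = writer_node
    local_rack = next(r for r, nodes in racks.items() if first in nodes)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]
```

Putting two replicas off-rack survives the loss of the writer's whole rack, while keeping the write pipeline short (only one inter-rack hop).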

HDFS Dataflow (diagram: a client reading data blocks from several data nodes)

Data Correctness Data is checked with CRC32. On file creation, the client computes a checksum per 512 bytes and the DataNode stores the checksums. On file access, the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas. The DataNode also performs periodic validation.
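The per-chunk checksum scheme above can be sketched directly, with Python's zlib.crc32 standing in for HDFS's checksum implementation (the function names are illustrative):

```python
import zlib

CHUNK = 512  # checksum granularity from the slide

def checksums(data):
    """CRC32 over each 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, stored_sums):
    """Recompute and compare; False means the data is corrupt."""
    return checksums(data) == stored_sums
```

On a verification failure, a real client would fall back to another replica of the block, as the slide describes.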

Map/Reduce Map/Reduce is a programming model for efficient distributed computing. It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output corresponds to Input | Map | Shuffle & Sort | Reduce | Output. Efficiency comes from streaming through data (reducing seeks) and from pipelining. A good fit for a lot of applications: log processing, web index building, data mining and machine learning.
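The pipeline analogy above can be sketched as plain Python: map emits key/value pairs, shuffle groups them by key, and reduce folds each group. This is an in-memory toy, not the distributed framework.

```python
from itertools import groupby

def run_mapreduce(records, mapper, reducer):
    """Toy map -> shuffle & sort -> reduce over an in-memory list."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=lambda kv: kv[0])                      # shuffle & sort
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

counts = run_mapreduce(["a b", "b c"],
                       lambda line: [(w, 1) for w in line.split()],
                       lambda key, vals: sum(vals))
# counts == {"a": 1, "b": 2, "c": 1}
```

This is exactly the shape of the word-count example that follows: the framework supplies the shuffle; the user supplies only the mapper and reducer.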

Map/Reduce Dataflow

Map/Reduce features Java, C++, and text-based APIs; in Java you work with Objects, in C++ with bytes. The text-based (streaming) API is great for scripting or legacy apps. Higher-level interfaces: Pig, Hive, Jaql. Automatic re-execution on failure: in a large cluster, some nodes are always slow or flaky, so the framework re-executes failed tasks. Locality optimizations: with large data, bandwidth to the data is a problem, so Map-Reduce queries HDFS for the locations of input data and map tasks are scheduled close to their inputs when possible.

Word Count Example Mapper input: value: lines of input text; output: key: word, value: 1. Reducer input: key: word, value: set of counts; output: key: word, value: sum. The launching program defines the job and submits it to the cluster.

Word Count Dataflow

Example: Word Count Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Example: Word Count Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Word Count in Python

import dumbo

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    dumbo.run(mapper, reducer)

Why Yahoo! is investing in Hadoop We started with building better applications: scale up web-scale batch applications (search, ads, ...), factor out common code from existing systems so new applications are easier to write, and manage our many clusters more easily. The mission now includes research support: build a huge data warehouse with many Yahoo! data sets, couple it with a huge compute cluster and programming models that make using the data easy, and provide this as a service to our researchers. We are seeing great results! Experiments can be run much more quickly in this environment.

Running the Production WebMap Search needs a graph of the known web: invert edges, compute link text, run whole-graph heuristics. A periodic batch job using Map/Reduce, it uses a chain of ~100 map/reduce jobs. Scale: 100 billion nodes and 1 trillion edges; the largest shuffle is 450 TB; the final output is 300 TB compressed; it runs on 10,000 cores. Written mostly using Hadoop's C++ interface.

WebMap in Context (diagram: the WebMap sitting between the crawl and the search indexing/serving pipeline)

Inverting the Webmap One job inverts all of the edges in the Webmap from a cache of web pages, finding the text in the links that point to each page. Mapper input: key: URL, value: page HTML; output: key: target URL, value: (source URL, text from link). Reducer output: a new column in the URL table with the set of linking pages and link text.
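A toy version of that inversion, with a hypothetical input format (each record is a source URL plus its outgoing (target, link text) pairs) and an in-memory group-by standing in for the shuffle:

```python
from collections import defaultdict

def invert(pages):
    """Map: for each link, emit (target, (source, text)).
    Reduce: collect all in-links per target URL."""
    inlinks = defaultdict(list)
    for url, links in pages:
        for target, text in links:
            inlinks[target].append((url, text))
    return dict(inlinks)
```

The reducer's output per target URL is exactly the "set of linking pages and link text" column the slide describes.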

Finding Duplicate Web Pages Find close textual matches in the cache of web pages. We want to compare each pair of web pages, but that would take far too long. Instead, use a hash based on each page's visible text, ignoring headers and footers: shingles.

Finding Duplicates Job 1 finds the groups. Mapper input: key: URL, value: page HTML; output: key: (page text hash, goodness), value: URL. Reducer output: key: page text hash, value: URLs (best first). Job 2 sorts the results back into URL order. Mapper output: key: URL, value: best URL. Reducer: tags duplicate URLs with their best representative.
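Job 1 above can be sketched as a grouping pass. This toy uses MD5 of the page text as the hash and a made-up numeric "goodness" score; the real job's hash (shingle-based) and scoring are not specified here.

```python
import hashlib

def find_duplicates(pages):
    """pages: list of (url, visible_text, goodness).
    Returns {text_hash: [urls sorted best-first]}; groups of size > 1
    are duplicates sharing one best representative."""
    groups = {}
    for url, text, goodness in pages:
        h = hashlib.md5(text.encode()).hexdigest()
        groups.setdefault(h, []).append((goodness, url))
    return {h: [u for _, u in sorted(g, reverse=True)]
            for h, g in groups.items()}
```

Job 2 would then re-key this output by URL so each duplicate ends up tagged with the first (best) URL in its group.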

Research Clusters The grid team runs research clusters as a service to Yahoo researchers: analytics as a service, mostly data mining/machine learning jobs. Most research jobs are *not* Java: 42% Streaming (uses Unix text processing to define map and reduce), 28% Pig (a higher-level dataflow scripting language), 28% Java, 2% C++.

Hadoop clusters We have ~20,000 machines running Hadoop Our largest clusters are currently 2000 nodes Several petabytes of user data (compressed, unreplicated) We run hundreds of thousands of jobs every month

Research Cluster Usage

NY Times Needed offline conversion of public domain articles from 1851-1922. Used Hadoop to convert scanned images to PDF Ran 100 Amazon EC2 instances for around 24 hours 4 TB of input 1.5 TB of output Published 1892, copyright New York Times

Terabyte Sort Benchmark Started by Jim Gray at Microsoft in 1998: sorting 10 billion 100-byte records. Hadoop won the general category in 209 seconds (the previous record was 297 seconds) on 910 nodes: 2 quad-core 2.0 GHz Xeons per node, 4 SATA disks per node, 8 GB RAM per node, 1 gigabit ethernet per node, an 8 gigabit uplink per rack, and 40 nodes per rack. The only hard parts were getting a total order and converting the data generator to map/reduce. http://developer.yahoo.net/blogs/hadoop/2008/07

Hadoop Community Apache is focused on project communities: users; contributors, who write patches; committers, who can also commit patches; and the Project Management Committee, which votes on new committers and releases. Apache is a meritocracy. Use, contribution, and diversity are growing, but we need and want more!

Size of Releases

Size of Developer Community

Size of User Community

Who Uses Hadoop? Amazon/A9 Facebook Google IBM Joost Last.fm New York Times PowerSet (now Microsoft) Quantcast Veoh Yahoo! More at http://wiki.apache.org/hadoop/poweredby

What's Next? 0.19: file appends, a total order sampler and partitioner, and a pluggable scheduler. 0.20: a new map/reduce API, better scheduling for sharing between groups, and splitting Core into sub-projects (HDFS, Map/Reduce, Hive). And beyond: HDFS and Map/Reduce security, high availability via ZooKeeper, and getting ready for Hadoop 1.0.

Q&A For more information: Website: http://hadoop.apache.org/core Mailing lists: core-dev@hadoop.apache.org, core-user@hadoop.apache.org IRC: #hadoop on irc.freenode.org