Yahoo! Grid Services: Where Grid Computing at Yahoo! is Today




Marco Nicosia, Grid Services Operations, marco@yahoo-inc.com

What is Apache Hadoop?
- A Distributed File System and a Map-Reduce programming platform, combined
- Designed to scale out; it must satisfy the requirements of a full-scale WWW content system
- SAN and NAS devices don't support enough storage or I/O bandwidth
- Combines the storage and compute power of any set of computers
- Highly portable: the framework is in Java, user code is in the user's preferred language
- Apache Software Foundation Open Source; originally part of Lucene, now a sub-project
- Yahoo! does not maintain a separate source code repository: you can download and use what I'm using today

Why Open Source?
- Infrastructures that scale are not common
- If our infrastructure becomes popular, we'll be able to hire people who are already familiar with our technology
- Execution should not be a competitive advantage; the game is killer search-engine results and monetization
- My job is providing hosted infrastructure to Yahoo!'s researchers, scientists, architects and developers
- Yahoo! Search Technology was built from several search companies: Inktomi, AltaVista, Fast, Overture
- We are behind in infrastructure technology; companies in difficult situations don't invest in infrastructure
- By releasing our work, we will all catch up more quickly

A Quick Timeline
- 2004 - Initial versions of what is now the Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
- December 2005 - Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
- January 2006 - Doug Cutting joins Yahoo!
- February 2006 - Apache Hadoop project officially started, to support the standalone development of Map-Reduce and HDFS
- March 2006 - Formation of the Yahoo! Hadoop team
- April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
- May 2006 - Yahoo! sets up a Hadoop research cluster - 300 nodes
- May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
- October 2006 - Research cluster reaches 600 nodes
- December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
- January 2007 - Research cluster reaches 900 nodes
- April 2007 - Research clusters - 2 clusters of 1000 nodes

Interacting with the HDFS
Simple commands: hadoop dfs -ls, -du, -rm, -rmr
Uploading files:
  hadoop dfs -put foo mydata/foo
  cat ReallyBigFile | hadoop dfs -put - mydata/ReallyBigFile
Downloading files:
  hadoop dfs -get mydata/foo foo
  hadoop dfs -get mydata/ReallyBigFile - | grep "the answer is"
  hadoop dfs -cat mydata/foo
File types:
  Text files
  SequenceFiles: key/value pairs formatted for the framework to consume, with per-file type information (key class, value class)
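The same operations are also available programmatically. Below is a minimal sketch of the put/get/ls steps through Hadoop's Java file system API (org.apache.hadoop.fs); the paths are the illustrative ones from the commands above, and this example is an assumption of mine rather than anything shown in the talk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads the cluster's site configuration
        FileSystem fs = FileSystem.get(conf);      // handle to the configured DFS

        // Equivalent of: hadoop dfs -put foo mydata/foo
        fs.copyFromLocalFile(new Path("foo"), new Path("mydata/foo"));

        // Equivalent of: hadoop dfs -get mydata/foo foo
        fs.copyToLocalFile(new Path("mydata/foo"), new Path("foo"));

        // Equivalent of: hadoop dfs -ls mydata
        for (FileStatus status : fs.listStatus(new Path("mydata"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}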

Browse the HDFS in a Browser

Hadoop: Two Services in One Cluster
Nodes run both DFS and M-R.
[Diagram: an input file, stored as 128MB blocks (X, O, Y), is spread with replicas across the DFS nodes; Map-Reduce tasks run on the same nodes that hold the blocks.]
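Map-Reduce can place tasks on the nodes that already hold a file's blocks because the DFS exposes block locations. As a hedged sketch (not from the talk) of reading those locations through the same Java API, with an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("mydata/ReallyBigFile"));

        // One BlockLocation per block of the file (128MB blocks in the diagram above),
        // each listing the datanodes that hold a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " held on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}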

Map-Reduce Demo
hod -m 10 run jar hadoop-examples.jar wordcount -r 4 /data/vespanews/20070218 /user/marco/wc.out
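The wordcount program invoked above ships with Hadoop in hadoop-examples.jar. A sketch of what such a job looks like against the original org.apache.hadoop.mapred API follows; it is illustrative, not the exact bundled source.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Mapper: emit (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setNumReduceTasks(4);  // matches "-r 4" in the demo command
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Packaged into a jar, a job like this is submitted the same way as the bundled example above.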

Running job: One failed task automatically retried

Streaming Map-Reduce Demo
run jar hadoop-streaming.jar -numreducetasks 4 -input /data/vespanews/20070218 -output wc-streaming.out -mapper "perl -ane 'print join(\"\n\", @F), \"\n\"'" -reducer "uniq -c"

PIG
A high-level declarative language (similar to SQL). Pig Latin is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing. Pig is a software layer above Map-Reduce, implementing a grouping syntax.

Wordcount example in Pig Latin:
input = LOAD 'documents' USING StorageText();
words = FOREACH input GENERATE FLATTEN(Tokenize(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);

Questions?
http://lucene.apache.org/hadoop/
http://developer.yahoo.com/blogs/hadoop/
Marco Nicosia, Grid Services Operations, Yahoo! Inc.
marco@yahoo-inc.com
I Am Hiring!