Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22nd, 2009




Who Am I?
- Hadoop Developer
  - Core contributor since Hadoop's infancy
  - Focused on the Hadoop Distributed File System
- Facebook (Hadoop)
- Yahoo! (Hadoop)
- Veritas (SANPoint Direct, VxFS)
- IBM Transarc (Andrew File System)

Hadoop, Why?
- Need to process huge datasets on large clusters of computers
- Very expensive to build reliability into each application
- Nodes fail every day
  - Failure is expected, rather than exceptional
  - The number of nodes in a cluster is not constant
- Need common infrastructure
  - Efficient, reliable, easy to use
  - Open Source, Apache License

Hadoop History
- Dec 2004: Google GFS paper published
- July 2005: Nutch uses MapReduce
- Feb 2006: Becomes Lucene subproject
- Apr 2007: Yahoo! on 1000-node cluster
- Jan 2008: An Apache Top Level Project
- Feb 2008: Yahoo! production search index
- Nov 2008: SQL query language called Hive

Who uses Hadoop?
- Amazon/A9
- Facebook
- Google
- IBM: Blue Cloud?
- Joost
- Last.fm
- New York Times
- PowerSet
- Veoh
- Yahoo!

Commodity Hardware
- Typically in a 2-level architecture
  - Nodes are commodity PCs
  - 30-40 nodes/rack
  - Uplink from rack is 3-4 gigabit
  - Rack-internal is 1 gigabit
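A back-of-envelope check on the topology above: with a full rack of nodes each on a 1-gigabit link and the rack's 3-4 gigabit uplink, intra-rack traffic can outrun the uplink roughly 10:1. The sketch below uses the slide's upper-bound figures; real clusters vary.

```python
# Back-of-envelope: network oversubscription in the 2-level rack topology.
# Numbers are the slide's upper bounds, not a measurement.
nodes_per_rack = 40          # "30-40 nodes/rack"
node_link_gbit = 1           # rack-internal bandwidth per node
rack_uplink_gbit = 4         # "3-4 gigabit" uplink

intra_rack_capacity = nodes_per_rack * node_link_gbit   # 40 Gbit/s
oversubscription = intra_rack_capacity / rack_uplink_gbit

print(f"oversubscription ratio: {oversubscription:.0f}:1")  # 10:1
```

This ratio is one reason HDFS and MapReduce try to keep reads and shuffles rack-local where possible.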

Goals of HDFS
- Very Large Distributed File System
  - 10K nodes, 100 million files, 10 PB
- Assumes Commodity Hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
- Optimized for Batch Processing
  - Data locations exposed so that computations can move to where data resides
  - Provides very high aggregate bandwidth

HDFS Architecture
[Diagram: the Client asks the NameNode for block IDs and DataNode locations, then reads data directly from the DataNodes; a Secondary NameNode runs alongside the NameNode.]
- NameNode: maps a file to a file-id and a list of DataNodes
- DataNode: maps a block-id to a physical location on disk
- SecondaryNameNode: periodic merge of the transaction log
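The read path in the diagram above can be sketched as a toy two-step lookup (this is an illustration, not the real HDFS API; the paths, block IDs, and node names are made up):

```python
# Toy sketch of the HDFS read path: ask the NameNode for block locations,
# then fetch each block directly from one of its DataNode replicas.
namenode = {  # file -> ordered list of (block_id, replica DataNodes)
    "/logs/day1": [(101, ["dn1", "dn3"]), (102, ["dn2", "dn3"])],
}
datanodes = {  # (datanode, block_id) -> block contents
    ("dn1", 101): b"hello ", ("dn3", 101): b"hello ",
    ("dn2", 102): b"world",  ("dn3", 102): b"world",
}

def read_file(path: str) -> bytes:
    data = b""
    for block_id, replicas in namenode[path]:       # 1. metadata lookup at NameNode
        data += datanodes[(replicas[0], block_id)]  # 2. read block from a DataNode
    return data

print(read_file("/logs/day1"))  # b'hello world'
```

The key property this models is that file *data* never flows through the NameNode, only metadata does.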

Distributed File System
- Single Namespace for entire cluster
- Data Coherency
  - Write-once-read-many access model
  - Client can only append to existing files
- Files are broken up into blocks
  - Typically 128 MB block size
  - Each block replicated on multiple DataNodes
- Intelligent Client
  - Client can find location of blocks
  - Client accesses data directly from DataNode
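The block layout above can be made concrete with a minimal sketch (not HDFS code). The 128 MB block size is from the slide; the 3x replication factor and round-robin placement are assumptions for illustration, since the slide only says "multiple DataNodes":

```python
# Minimal sketch of mapping a file onto replicated fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, per the slide
REPLICATION = 3                  # common HDFS default; an assumption here

def num_blocks(file_size: int) -> int:
    """Number of blocks needed for a file of the given size in bytes."""
    return max(1, -(-file_size // BLOCK_SIZE))   # ceiling division

def place_replicas(block_id: int, datanodes: list) -> list:
    """Toy placement: spread replicas over distinct DataNodes round-robin."""
    n = len(datanodes)
    return [datanodes[(block_id + i) % n] for i in range(min(REPLICATION, n))]

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = num_blocks(400 * 1024 * 1024)           # a 400 MB file -> 4 blocks
layout = {b: place_replicas(b, nodes) for b in range(blocks)}
print(blocks, layout[0])  # 4 ['dn1', 'dn2', 'dn3']
```

Real HDFS placement is rack-aware rather than round-robin, but the shape of the metadata (block -> replica list) is the same.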

Hadoop Map/Reduce
- The Map-Reduce programming model
  - Framework for distributed processing of large data sets
  - Pluggable user code runs in generic framework
- Common design pattern in data processing:
  cat * | grep | sort | unique -c | cat > file
  input | map | shuffle | reduce | output
- Natural for:
  - Log processing
  - Web search indexing
  - Ad-hoc queries
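The shell-pipeline analogy above can be sketched in-process as a word count (a toy illustration of the pattern, not Hadoop's actual API):

```python
# Toy map -> shuffle -> reduce word count, mirroring the pipeline analogy:
# map ~ tokenize, shuffle ~ sort/group by key, reduce ~ uniq -c.
from itertools import groupby

def map_phase(lines):
    for line in lines:                 # emit (word, 1) for every word
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    return groupby(sorted(pairs), key=lambda kv: kv[0])  # group by key

def reduce_phase(grouped):
    return {key: sum(v for _, v in values) for key, values in grouped}

lines = ["the cat sat", "the cat ran"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

In real Hadoop the map and reduce phases run as distributed tasks and the shuffle moves data across the network, but the data flow is exactly this.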

Hadoop/Hive at Facebook
- Cross-functional team of 11 members
  - 5 people working on Hive development
  - 2 people on Hadoop development
  - 2 people on Data Pipelines and Oracle Data Mart
  - 1 on Production Operations

Why Hive?
- Large installed base of SQL users
- Analytics SQL queries translate well to map-reduce
- Files are insufficient data management abstractions
  - Need tables, schemas, partitions, indices
- Scalability of Hadoop

Why Hive?

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1

Hive Query Language
- Basic SQL
  - From clause subquery
  - Join
  - Multi-table insert
  - Multi group-by
  - Sampling
- Extensibility
  - Pluggable map-reduce scripts

Who generates this data?
Lots of data is generated on Facebook:
- 200 million active users
- 20 million users update their statuses at least once each day
- More than 850 million photos uploaded to the site each month
- More than 8 million videos uploaded each month
- More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

Where do we store this data?
- Hadoop/Hive Warehouse: 4800 cores, 2 petabytes total size
- Hadoop Archival Store: 200 TB

Rate of Data Growth

Data Flow
[Diagram showing data flow among: Web Servers, Scribe MidTier, Network Storage and Servers, Oracle RAC, Hadoop Hive Warehouse, MySQL]

Data Usage
Statistics per day:
- 4 TB of compressed new data added per day
- 55 TB of compressed data scanned per day
- 3200+ Hive jobs on production cluster per day
- 80M compute minutes per day
Barrier to entry is significantly reduced:
- New engineers go through a Hive training session
- 140+ people run jobs on Hadoop/Hive
- Analysts (non-engineers) use Hadoop through Hive
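As a sanity check on the per-day figures above, 55 TB scanned per day works out to a sustained aggregate read rate of roughly 0.6 GB/s (decimal units assumed; the slide does not say whether TB is decimal or binary):

```python
# Back-of-envelope: average scan bandwidth implied by the slide's
# "55 TB of compressed data scanned per day" (decimal TB assumed).
TB = 10**12
seconds_per_day = 24 * 60 * 60           # 86400

scanned_per_day = 55 * TB
avg_scan_rate = scanned_per_day / seconds_per_day   # bytes/second

print(f"average scan rate: {avg_scan_rate / 10**6:.0f} MB/s")  # ~637 MB/s
```

Peak load would of course be burstier than this daily average, which is part of why the warehouse needs high aggregate HDFS bandwidth.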

Hadoop Scribe: Avoid Costly Filers
[Diagram showing log flow among: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Oracle RAC, Hadoop Hive Warehouse, MySQL]

Archival: Move old data to cheap storage
[Diagram: Hadoop Warehouse -> (distcp) -> Hadoop Archive Node -> (NFS) -> Cheap NAS / Hadoop Archival Cluster, 20 TB per node; queried via Hive over HDFS]

Cluster Usage Dashboard
- History logs are fed into a MySQL database
- A dashboard displays cluster usage statistics from the database
- Displays cluster utilization, growth rates of cluster usage, etc.
- HADOOP-3708

Cluster Usage Dashboard

Hive WebUI

Questions?