A bit about Hadoop. Luca Pireddu, CRS4 Distributed Computing Group. March 9, 2012. luca.pireddu@crs4.it


Often seen problems
- Low parallelism
- I/O is done to/from shared storage, not locally to the computing node
  - limits scalability in number of nodes
  - load on central storage and network is higher than necessary
  - increases infrastructure cost
- High job failure rate
  - No robustness to node or equipment failure
  - A failed step requires human intervention to resolve
- Low automation and high operating costs

Hadoop
Important features:
- distributed
- scalable
- robust
- open source

Hadoop
To understand the advantages of Hadoop and how it works, let's briefly cover two things:
1. MapReduce
2. Hadoop MapReduce and Distributed File System


MapReduce
A programming model for large-scale distributed data processing.
Breaks algorithms into two steps:
1. Map: map a set of input key/value pairs to a set of intermediate key/value pairs
2. Reduce: apply a function to all values associated with the same intermediate key; emit output key/value pairs
- Functions don't have side effects; (k,v) pairs are the only input/output
- Functions don't share data structures

Example:
  Input (name, age)    Output (name, mean age)
  (luca, 27)
  (luca, 31)           (luca, 30.67)
  (luca, 34)
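The mean-age example can be sketched in plain Python. This is only an illustration of the model, not Hadoop code: `run_mapreduce` is a hypothetical single-process driver standing in for the framework, and the function names are made up for this sketch.

```python
from collections import defaultdict


def map_record(key, value):
    """Map: pass each (name, age) record through as an intermediate pair."""
    yield key, value


def reduce_ages(key, values):
    """Reduce: emit the mean of all ages seen for one name (2 decimals)."""
    ages = list(values)
    yield key, round(sum(ages) / len(ages), 2)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical driver: group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


records = [("luca", 27), ("luca", 31), ("luca", 34)]
result = run_mapreduce(records, map_record, reduce_ages)
print(result)  # [('luca', 30.67)]
```

Neither function touches anything outside its arguments, which is what lets a real framework run many copies of them on different nodes.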

MapReduce Example: Word Count
Consider a program to calculate word frequency in a document.

  The quick brown fox ate the lazy green fox.

  Word    Count
  ate     1
  brown   1
  fox     2
  green   1
  lazy    1
  quick   1
  the     2

MapReduce Example: Word Count

  The quick brown fox ate the lazy green fox.

Here's some pseudo code for a MapReduce word counting program:

  map(key, value):
      foreach word in value:
          emit(word, 1)

  reduce(key, value_list):
      int wordcount = 0
      foreach count in value_list:
          wordcount += count
      emit(key, wordcount)
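A runnable Python version of the pseudo code, as a sketch. The `word_count` driver is a hypothetical stand-in for the framework, and lowercasing plus stripping punctuation is an assumption made so that "The" and "the" are counted together, as in the table on the previous slide.

```python
import re
from itertools import groupby
from operator import itemgetter


def map_words(key, value):
    """Map: emit (word, 1) for every word in the input text."""
    # Assumption: words are lowercased and punctuation is ignored.
    for word in re.findall(r"[a-z]+", value.lower()):
        yield word, 1


def reduce_counts(key, value_list):
    """Reduce: sum the counts emitted for one word."""
    wordcount = 0
    for count in value_list:
        wordcount += count
    yield key, wordcount


def word_count(text):
    """Hypothetical driver: sort pairs by key, group them, then reduce."""
    intermediate = sorted(map_words(None, text), key=itemgetter(0))
    results = {}
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        for k, total in reduce_counts(word, (c for _, c in pairs)):
            results[k] = total
    return results


counts = word_count("The quick brown fox ate the lazy green fox.")
print(counts)
```

Sorting and grouping by key here plays the role of the framework's shuffle step.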

MapReduce Example: Word Count (data flow)
- Input: the quick brown fox ate the lazy green fox
- Map (three mappers): (the, 1) (quick, 1) (brown, 1) (fox, 1) (ate, 1) (the, 1) (lazy, 1) (green, 1) (fox, 1)
- Shuffle & Sort: intermediate pairs are grouped by key and routed to the reducers
- Reduce (two reducers): (quick, 1) (brown, 1) (fox, 2) (ate, 1) (the, 2) (lazy, 1) (green, 1)
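The data flow with three mappers and two reducers can be sketched as follows. The three-word input splits and the CRC32-based `partition` function are assumptions for illustration; they are not taken from the slide and are not how Hadoop necessarily assigns keys.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key; every copy of a key goes to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapper(split):
    """One mapper: emit (word, 1) for each word in its input split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapper_outputs, num_reducers=NUM_REDUCERS):
    """Group intermediate pairs by key inside each reducer's bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def run_reducer(bucket):
    """One reducer: sum the counts for each of its keys, in sorted order."""
    return {key: sum(bucket[key]) for key in sorted(bucket)}


# Assumed splits, one per mapper.
splits = ["the quick brown", "fox ate the", "lazy green fox"]
mapper_outputs = [run_mapper(s) for s in splits]
reducer_outputs = [run_reducer(b) for b in shuffle(mapper_outputs)]
merged = {k: v for out in reducer_outputs for k, v in out.items()}
print(merged)
```

Because the partition depends only on the key, both occurrences of "fox" reach the same reducer even though they were emitted by different mappers.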

MapReduce
The lack of side effects and shared data structures is the key.
- No multi-threaded programming
- No synchronization, locks, mutexes, deadlocks, etc.
- No shared data implies no central bottleneck.
- Failed functions can be retried, their output only being committed upon successful completion.
MapReduce allows you to put much of the parallel programming into a reusable framework, outside of the application.
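The retry-and-commit idea can be sketched like this. It is a toy model under stated assumptions, not Hadoop's actual task-attempt mechanism: `run_task`, `flaky_word_count` and the limit of three attempts are all hypothetical.

```python
def run_task(task, attempts_allowed=3):
    """Run task(attempt) until it succeeds; commit output only on success."""
    last_error = None
    for attempt in range(1, attempts_allowed + 1):
        staged = []  # output private to this attempt
        try:
            for pair in task(attempt):
                staged.append(pair)
        except RuntimeError as err:
            # The attempt failed: its partial output is simply discarded.
            last_error = err
            continue
        return staged, attempt  # commit the complete output
    raise RuntimeError(f"task failed after {attempts_allowed} attempts") from last_error


def flaky_word_count(attempt):
    """Hypothetical task that fails part-way through its first attempt."""
    yield ("fox", 2)
    if attempt == 1:
        raise RuntimeError("simulated node failure")
    yield ("the", 2)


committed, attempts_used = run_task(flaky_word_count)
print(committed, attempts_used)  # [('fox', 2), ('the', 2)] 2
```

Retrying is safe only because the function has no side effects: the partial output of the failed attempt never becomes visible.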

Hadoop MapReduce
- The MapReduce model needs an implementation
- Hadoop is arguably the most popular open-source MapReduce implementation
- Born out of Yahoo!
- Currently used by many very large operations, e.g.:
  - Yahoo!
  - Facebook
  - Amazon
  - eBay
  - etc.

Hadoop DFS
A MapReduce framework goes hand-in-hand with a distributed file system.
Multiplying the number of nodes poses challenges:
- multiplied network traffic
- multiplied disk accesses
- multiplied failure rates

Hadoop DFS
Hadoop provides the Hadoop Distributed File System (HDFS)
- Stores blocks of the data on each node.
  - Move computation to the data and decentralize data access
- Uses the disks on each node
  - Aggregate I/O throughput scales with the number of nodes
- Replicates data on multiple nodes
  - Resistance to node failure
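A toy model of block storage with replication, to show why replicas give resistance to node failure. The round-robin placement, the node names, and the file size, block size and replication values are assumptions chosen for illustration; the real HDFS placement policy is different.

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Assign each block of a file to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Simple round-robin placement (assumption, not the HDFS policy).
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def available_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size=1000, block_size=128, nodes=nodes, replication=3)
survivors = available_blocks(placement, failed_nodes={"node1", "node2"})
print(len(placement), len(survivors))  # 8 8
```

With three replicas per block, any two of the four nodes can fail and every block is still readable from a local disk somewhere in the cluster.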

Easier MapReduce
- Implementing Hadoop MapReduce programs can be time-consuming
  - Especially true for one-off scripts/hacks
- There are other options
  - Pig
  - Hive
  - Pydoop
  - others...

Easier MapReduce
- Pig is a scripting language for Hadoop
- Hive is more of a query language
- Pydoop is a Python API for Hadoop MapReduce

Pig example...

Pydoop
- Pydoop script page: http://pydoop.sf.net/docs/pydoop_script.html
- Pydoop API example: http://pydoop.sf.net/docs/examples/wordcount.html
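For a feel of what a Pydoop script looks like, here is a word count sketch. The `mapper(key, value, writer)` / `reducer(key, values, writer)` shape with `writer.emit` is my understanding of the convention described on the Pydoop script page and should be checked against the linked documentation. `ListWriter` is a hypothetical stand-in so the functions can be tried locally without Hadoop.

```python
def mapper(_, text, writer):
    """Emit (word, 1) for every whitespace-separated word in one record."""
    for word in text.split():
        writer.emit(word, 1)


def reducer(word, icounts, writer):
    """Emit the total count for one word."""
    writer.emit(word, sum(map(int, icounts)))


class ListWriter:
    """Hypothetical stand-in for the writer object, for local testing only."""

    def __init__(self):
        self.pairs = []

    def emit(self, key, value):
        self.pairs.append((key, value))


map_out = ListWriter()
mapper(None, "the quick brown fox ate the lazy green fox", map_out)

reduce_out = ListWriter()
reducer("fox", [v for k, v in map_out.pairs if k == "fox"], reduce_out)
print(reduce_out.pairs)  # [('fox', 2)]
```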

Other things
Lots of other tools in this ecosystem:
- Workflow management (Oozie)
- Data stores (HBase)
- Data transfer (Sqoop, Flume)
- ...

Questions?