
Hadoop & MapReduce

(Title slide: word cloud generated with http://www.wordle.net/)

Hadoop is an open-source software framework (or platform) for reliable, scalable, distributed storage and computation, with failures completely transparent to applications. It runs on large clusters of commodity hardware and targets input data sets that will not fit on a single computer's hard drive. It was built to process "web-scale" data, on the order of petabytes.

Hadoop Again: software for data-intensive distributed processing. Although not necessarily, one typically thinks of two main components: Hadoop DFS, a distributed file system that focuses on high-throughput access to application data; and Hadoop MapReduce, a system for parallel processing of large data sets that executes programs adhering to a specific programming model: MapReduce.

HDFS Hadoop Distributed File System. It breaks up input data into blocks of fixed length, sends the blocks to several machines in the cluster, and stores each block several times to prevent losses. Quite simple Master-Slave architecture: one machine, the Namenode, remembers what is where; all the other machines, the Datanodes, just store blocks.
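A toy sketch of the splitting and replication idea described above. The tiny block size, node list, and round-robin placement policy are illustrative assumptions; real HDFS uses large blocks (e.g. 128 MB by default) and rack-aware placement:

```python
BLOCK_SIZE = 4        # bytes per block; illustrative (HDFS defaults to 128 MB)
REPLICATION = 3       # copies of each block

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break the input into fixed-length blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Namenode-style metadata: block id -> datanodes holding a copy.
    Naive round-robin placement; real HDFS placement is rack-aware."""
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(b"foo.jar file contents")
meta = place_replicas(len(blocks), ["127.0.0.1", "127.0.0.2",
                                    "127.0.0.3", "127.0.0.4"])
# each of the 6 blocks is stored on 3 distinct Datanodes
```

Losing one Datanode then loses at most one of the three copies of any block, which is the point of replication.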

HDFS (diagram) A Client's file foo.jar is split into blocks 1, 2, 3 and spread across Datanodes 127.0.0.1 through 127.0.0.4 on Racks 1 and 2, each block stored on several Datanodes. The Namenode holds only the metadata, e.g.:
Filename: /mydir/foo.jar
Block 1: 3 replicas, at 127.0.0.1, 127.0.0.2, 127.0.0.3
Block 2: 3 replicas, at 127.0.0.2, 127.0.0.3, 127.0.0.4
Block 3: 3 replicas, at 127.0.0.7, 127.0.0.8, 127.0.0.9

Hadoop MapReduce Engine The computational platform. Also a simple Master-Slave architecture: Clients send jobs to the JobTracker; the JobTracker splits the work into small "tasks" and assigns them to the TaskTrackers. What kind of jobs? MapReduce jobs (we will get there). (Diagram: a Client submits to the JobTracker, which coordinates TaskTrackers 127.0.0.1 through 127.0.0.4.)

Hadoop Summary Software for reliable, scalable, distributed computing on a Master-Slave organized cluster. (Diagram: the MapReduce layer and the HDFS layer each span a Master and Slaves 1-3 on machines 127.0.0.0 through 127.0.0.3, and many more.)

MapReduce A programming paradigm for processing large datasets, typically in a cluster of computers. "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat of Google, appeared in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. Cited about 3660 times.

MapReduce A programming paradigm for processing large datasets, typically in a cluster of computers. Basic idea: minimal data transfer; send the code to the data and execute, in 2 steps. Map step: parts of the file are processed in parallel and produce some intermediate output. Reduce step: the intermediate output of all individual parts is combined to create the final output. (Visualization coming up.) You need not worry about fault tolerance, parallelization, or status and monitoring tools; all are taken care of by the system. You only worry about your algorithm: the Map and Reduce steps.

MapReduce (data flow diagram) Input File -> input data (preloaded locally) on Nodes 1-3 -> Mapping Process on each node -> intermediate output -> values exchanged between nodes (Shuffle Process) -> Reducing Process on each node -> final output (stored locally) -> Output File.

MapReduce A little more detail. A file consists of Records (e.g. in a text file, a line can be a Record). Each Record is fed to a map function as a <k1, v1> pair and processed independently. Mappers emit other <k2, v2> pairs. Mapper outputs are hashed by key, so that all <k2, v2> pairs with the same key go to the same Reducer, where they are grouped by key to produce <k2, list(v2)>. Finally, Reducers emit <k3, v3> pairs that form the final output. Pipeline: <k1, v1> -> Map -> <k2, v2> -> Shuffle/Order -> <k2, list(v2)> -> Reduce -> <k3, v3>.
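The "hashed by key" step can be sketched as follows. This is a simplified stand-in for Hadoop's default hash partitioner, not actual Hadoop code; the reducer count is an arbitrary assumption:

```python
from collections import defaultdict

NUM_REDUCERS = 3                         # arbitrary illustrative count

def partition(key, num_reducers=NUM_REDUCERS):
    """Hash partitioner: the same key always lands on the same reducer."""
    return hash(key) % num_reducers

def shuffle(mapper_output):
    """Bucket <k2, v2> pairs per reducer, grouped by key within each bucket."""
    buckets = defaultdict(lambda: defaultdict(list))
    for k2, v2 in mapper_output:
        buckets[partition(k2)][k2].append(v2)
    return buckets

pairs = [("the", 1), ("fox", 1), ("the", 1), ("brown", 1), ("the", 1)]
grouped = shuffle(pairs)
# all three ("the", 1) pairs end up in one reducer's bucket, as <the, [1, 1, 1]>
```

Because the partition function depends only on the key, no coordination between mappers is needed for all values of a key to meet at one reducer.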

The Word Count Example Input: "The quick brown fox jumped over the lazy dog. Nobody saw the brown fox again." Output: again, 1; brown, 2; dog, 1; fox, 2; jumped, 1; lazy, 1; nobody, 1; over, 1; quick, 1; saw, 1; the, 3.

The Word Count Example, step by step (<k1, v1> -> Map -> <k2, v2> -> Shuffle/Order -> <k2, list(v2)> -> Reduce -> <k3, v3>):
Map input (byte offset, line): <0, "the quick brown fox jumped">, <28, "over the lazy dog. Nobody">, <54, "saw the brown fox again.">
Map output: <the, 1> <quick, 1> <brown, 1> <fox, 1> <jumped, 1> <over, 1> <the, 1> <lazy, 1> <dog, 1> <nobody, 1> <saw, 1> <the, 1> <brown, 1> <fox, 1> <again, 1>
After the shuffle, grouped by key across the reducing Tasktrackers:
Tasktracker 2: <again, (1)> <brown, (1, 1)> <dog, (1)>
Tasktracker 3: <fox, (1, 1)> <jumped, (1)> <lazy, (1)> <nobody, (1)> <over, (1)>
Tasktracker 4: <quick, (1)> <saw, (1)> <the, (1, 1, 1)>
Reduce output: <again, 1> <brown, 2> <dog, 1> <fox, 2> <jumped, 1> <lazy, 1> <nobody, 1> <over, 1> <quick, 1> <saw, 1> <the, 3>
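The whole word-count pipeline can be simulated sequentially in a few lines. This sketch mimics the paradigm, not Hadoop's API; function and variable names are illustrative:

```python
from collections import defaultdict

def map_fn(offset, line):
    """Map: <k1 = byte offset, v1 = line> -> a <word, 1> pair per word."""
    for word in line.lower().replace(".", "").split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: <k2 = word, list(v2)> -> <k3 = word, v3 = total count>."""
    yield (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k1, v1 in records:               # Map phase: records are independent
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)  # shuffle: group values by key
    out = {}
    for k2, values in sorted(intermediate.items()):  # Reduce phase
        for k3, v3 in reduce_fn(k2, values):
            out[k3] = v3
    return out

records = [(0, "the quick brown fox jumped"),
           (28, "over the lazy dog. Nobody"),
           (54, "saw the brown fox again.")]
counts = run_mapreduce(records, map_fn, reduce_fn)
# counts["the"] == 3, counts["fox"] == 2, counts["brown"] == 2
```

The same `run_mapreduce` skeleton works for any map/reduce pair; only `map_fn` and `reduce_fn` encode the problem.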

MapReduce Can any problem be solved in MapReduce? Short answer: no! And even fewer can be solved in a single MapReduce cycle.

MapReduce But, using a few iterations, or chains of programs, and a smart choice of <key, value> pairs, quite a lot can be done: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation.
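One item from this list, inverted index construction, fits a single Map/Reduce cycle. A minimal in-memory sketch (names and documents are illustrative, not Hadoop API):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit <word, doc_id> once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    """Reduce: the posting list is the sorted set of documents for the word."""
    yield (word, sorted(set(doc_ids)))

def inverted_index(docs):
    grouped = defaultdict(list)          # shuffle: word -> [doc ids]
    for doc_id, text in docs:
        for w, d in map_fn(doc_id, text):
            grouped[w].append(d)
    return dict(kv for w, ids in grouped.items() for kv in reduce_fn(w, ids))

index = inverted_index([("d1", "the quick brown fox"),
                        ("d2", "the lazy dog")])
# index["the"] == ["d1", "d2"], index["fox"] == ["d1"]
```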

A simple, non-trivial example "Counting Triangles and the Curse of the Last Reducer", Siddharth Suri, Sergei Vassilvitskii, Yahoo! Research. * What follows contains extended parts of Sergei's presentation.

Counting Triangles Why? Clustering Coefficient: given a graph G = (V, E), the Clustering Coefficient cc(v) of a vertex v is the fraction of pairs of neighbors of v that are also neighbors:
cc(v) = |{(u, w) in E : u, w in N(v)}| / (deg(v) choose 2)
(The slide's examples, for four highlighted vertices: cc = N/A, cc = 1/3, cc = 1, cc = 1.)

Counting Triangles Why? Clustering Coefficient, equivalently: the numerator counts the triangles through v:
cc(v) = (# of triangles incident to v) / (deg(v) choose 2)
(Same examples: cc = N/A, cc = 1/3, cc = 1, cc = 1.)
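The definition above can be checked on a small graph; a sketch using an adjacency-set representation (the example graph is ours, not the slide's):

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """cc(v) = (# pairs of neighbors of v that are edges) / (deg(v) choose 2);
    returns None when deg(v) < 2, where cc is undefined (the slide's N/A)."""
    neighbors = adj[v]
    deg = len(neighbors)
    if deg < 2:
        return None
    closed = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return closed / (deg * (deg - 1) / 2)

# Example graph: triangle a-b-c plus pendant edge a-d
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
# cc(a) = 1/3, cc(b) = cc(c) = 1, cc(d) undefined
```

Vertex a has three neighbor pairs, of which only (b, c) is an edge, hence cc(a) = 1/3, matching the 1/3 case on the slide.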

Counting Triangles Why the Clustering Coefficient? It captures how tight a network is around a node. (The slide contrasts two highlighted nodes, one with cc = 1/2 and one with cc = 1/10.)

Counting Triangles Sequential Algorithm
T = 0;
foreach v in V do
  foreach u,w pair in N(v) do
    if (u, w) in E then T = T + 1;
return T;
Running time: sum over v in V of (deg(v) choose 2). Even for sparse graphs this can be quadratic in the number of edges if a vertex has high degree, which happens in natural graphs.
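The sequential algorithm in runnable form, again over adjacency sets. A sketch: the division by 3 is made explicit, since each triangle is found once at each of its three vertices:

```python
from itertools import combinations

def count_triangles_naive(adj):
    """Check every pair of neighbors of every vertex; each triangle is
    found once at each of its 3 vertices, hence the division by 3."""
    t = 0
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                t += 1
    return t // 3

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
# one triangle: a-b-c
```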

Counting Triangles Sequential Algorithm 2 (Schank '07)
T = 0;
foreach v in V do
  foreach u,w pair in N(v) do
    if deg(u) > deg(v) && deg(w) > deg(v) then
      if (u, w) in E then T = T + 1;
return T;
Running time: O(m^(3/2)). There exist graphs for which we cannot do better.
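Algorithm 2 in runnable form. One detail the pseudocode leaves implicit: when degrees are equal, a tie-breaking rule (here, by vertex id) is needed so every triangle is counted exactly once, at its lowest-ranked vertex:

```python
def count_triangles_schank(adj):
    """Each vertex checks only neighbor pairs of strictly higher rank
    (degree, with vertex id as tie-breaker), so every triangle is
    counted exactly once, at its lowest-ranked vertex. O(m^(3/2))."""
    rank = lambda v: (len(adj[v]), v)
    t = 0
    for v in adj:
        higher = sorted(u for u in adj[v] if rank(u) > rank(v))
        for i in range(len(higher)):
            for j in range(i + 1, len(higher)):
                if higher[j] in adj[higher[i]]:
                    t += 1
    return t

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
# one triangle: a-b-c
```

Pushing the neighbor-pair work onto low-degree endpoints is exactly what tames the high-degree vertices that make the naive algorithm quadratic.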

Counting Triangles in M/R
Map 1: Input <(u,v); 0>. If deg(v) > deg(u), then emit <u; v>.
Reduce 1: Input <v; S>, where S is a subset of N(v). For each pair u,w in S, emit <v; (u,w)>.
Map 2: If the input is of type <v; (u,w)>, emit <(u,w); v>. If the input is of type <(u,v); 0>, emit <(u,v); $>.
Reduce 2: Input <(u,w); S>, where S is a subset of V together with {$}. If $ appears in S, then for each v in S except $, emit <v; 1> (and also emit <u; 1> and <w; 1>).
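The two rounds above can be simulated in memory. This is a sketch of the slide's pseudocode, not Hadoop code; the total is summed directly instead of emitting <v; 1> records, and degree ties are broken by vertex id as in the sequential version:

```python
from collections import defaultdict
from itertools import combinations

def mr_triangles(edges):
    """Simulate the two MapReduce rounds: round 1 builds candidate
    neighbor pairs at each edge's lower-ranked endpoint; round 2
    checks which candidates are real edges ('$' marks a real edge)."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)          # degree with id tie-breaker

    # Round 1, Map: send each edge to its lower-ranked endpoint
    r1 = defaultdict(list)
    for u, v in edges:
        lo, hi = (u, v) if rank(u) < rank(v) else (v, u)
        r1[lo].append(hi)

    # Round 1, Reduce: emit every pair of a vertex's higher-ranked neighbors
    candidates = []
    for v, S in r1.items():
        for u, w in combinations(sorted(S), 2):
            candidates.append(((u, w), v))

    # Round 2, Map + shuffle: group candidate pairs and real edges by edge key
    r2 = defaultdict(list)
    for (u, w), v in candidates:
        r2[(u, w)].append(v)
    for u, v in edges:
        r2[tuple(sorted((u, v)))].append("$")

    # Round 2, Reduce: a candidate pair that is also a real edge closes a triangle
    return sum(len(vals) - 1 for vals in r2.values() if "$" in vals)

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
# graph: triangle a-b-c plus pendant edge a-d
```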

Counting Triangles in M/R Analysis
- How much main memory per reduce call? (Should be sublinear in the input.)
- Total memory used for the computation on the cluster (not to exceed n^2).
- Number of rounds.