Lecture 21 High-Performance Languages; MapReduce/Hadoop

Lecture 21 High-Performance Languages; MapReduce/Hadoop ECE 459: Programming for Performance March 25, 2014

Part I High-Performance Languages 2 / 26

Introduction DARPA began a supercomputing challenge in 2002 (the High Productivity Computing Systems program). Purpose: create multi-petaflop systems (10^15 floating-point operations per second). Three notable programming language proposals: X10 (IBM) [looks like Java]; Chapel (Cray) [looks like Fortran/math]; Fortress (Sun/Oracle). 3 / 26

Machine Model We've used multicore machines and will talk about clusters (MPI, MapReduce). These languages are targeted somewhere in the middle: thousands of cores and massive memory bandwidth. Partitioned Global Address Space (PGAS) memory model: each process has a view of the global memory; memory is distributed across the nodes, but processors know which parts of global memory are local. 4 / 26

Parallelism These languages require you to specify the parallelism structure: Fortress evaluates loops and arguments in parallel by default; the others use an explicit construct, e.g. forall or async. In terms of addressing the PGAS memory: Fortress divides memory into locations, which belong to regions (in a hierarchy; the closer, the better for communication). Similarly, X10 has places and Chapel has locales. These languages make it easier to control the locality of data structures and to have distributed (fast) data. 5 / 26

X10 Example

    import x10.io.Console;
    import x10.util.Random;

    class MontyPi {
      public static def main(args:Array[String](1)) {
        val N = Int.parse(args(0));
        val result = GlobalRef[Cell[Double]](new Cell[Double](0));
        finish for (p in Place.places()) at (p) async {
          val r = new Random();
          var myResult:Double = 0;
          for (1..(N/Place.MAX_PLACES)) {
            val x = r.nextDouble();
            val y = r.nextDouble();
            if (x*x + y*y <= 1) myResult++;
          }
          val ans = myResult;
          at (result) atomic result()() += ans;
        }
        val pi = 4*(result()())/N;
      }
    }

6 / 26

X10 Example: explained This shows a distributed computation. We could replace for (p in Place.places()) with for (1..P) (where P is a number) for a single-machine parallel solution (also removing the GlobalRef). async: creates a new child activity, which executes the statements. finish: waits for all child asyncs to finish. at: performs the statement at the specified place; here, the processor holding the result increments its value. 7 / 26

HPC Language Summary Three notable supercomputing languages: X10, Chapel, and Fortress (end-of-life). The Partitioned Global Address Space memory model allows distributed memory with explicit locality. The parallel programming aspects of these languages are very similar to everything else we've seen in the course. 8 / 26

Part II MapReduce 9 / 26

MapReduce

    (k1, v1) --map--> (k2, v2) --combine--> (k2, list(v2))
        --disk/network--> concatenate --> (k2, list(v2))
        --reduce--> list(v3) --concatenate--> output

10 / 26

Introduction: MapReduce Framework introduced by Google for large problems. Consists of two functional operations: map and reduce.

    >>> map(lambda x: x * x, [1, 2, 3, 4, 5])
    [1, 4, 9, 16, 25]

map: applies a function to an iterable data set.

    >>> reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
    15

reduce: applies a function to an iterable data set cumulatively: ((((1+2)+3)+4)+5) in this example. 11 / 26

MapReduce Intuition In functional languages, functions are pure (no side effects). Since they are pure, and hence independent, it is always safe to parallelize them. Note: functional languages, like Haskell, have their own parallel frameworks, which allow easy parallelization. Many problems can be represented as a map operation followed by a reduce. 12 / 26

Hadoop Apache Hadoop is a framework which implements MapReduce. It is the most widely used open-source MapReduce framework and runs, for example, on Amazon's EC2 (elastic compute cloud). Allows work to be distributed across many different nodes (or retried if a node goes down). Includes HDFS (Hadoop Distributed File System), which distributes data across nodes and provides failure handling. (You can also use Amazon's S3 storage service.) 13 / 26

Map (massive parallelism) You split the input file into multiple pieces. The pieces are then processed as (key, value) pairs. Your Mapper function uses these (key, value) pairs and outputs another set of (key, value) pairs. 14 / 26

Reduce (not as parallel) Collects the input files from the previous map (which may be on different nodes, requiring copying). Merge-sorts the files so that the key-value pairs for a given key are contiguous. Reads the files sequentially and splits the values into lists of values with the same key. Passes this data, consisting of keys and lists of values, to your reduce method (in parallel), and concatenates the results. 15 / 26
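The sort-and-group step between map and reduce can be sketched in a few lines of Python; this is an in-memory illustration of what Hadoop does with files, and the function name shuffle is our own.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    # Sort by key (Hadoop merge-sorts the map output files), so that
    # pairs with equal keys become contiguous; then collapse each run
    # into one (key, list-of-values) record for the reducer.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(shuffle([("World", 1), ("Hello", 1), ("Bye", 1), ("World", 1)]))
# [('Bye', [1]), ('Hello', [1]), ('World', [1, 1])]
```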

Combine (optional) This step may run right after map and before reduce. It takes advantage of the fact that elements produced by the map operation are still available in memory. Every so many elements, you can use your combine operation to take the (key, value) outputs of the map and create new (key, value) inputs of the same types. 16 / 26

WordCount Example Say we want to count the number of occurrences of words in some files. Consider, for example, the following files:

    Hello World Bye World
    Hello Hadoop Goodbye Hadoop

We want the following output:

    (Bye, 1)
    (Goodbye, 1)
    (Hadoop, 2)
    (Hello, 2)
    (World, 2)

17 / 26

WordCount: Example Operations Mapper: Split the input file into strings representing words; for each word, output the (key, value) pair (word, 1). Reduce: Sum all values for each word (key) and output (word, sum). Combine: We could do the reduce step for in-memory values while doing map. Note: here, the output of map and the input/output of reduce have the same types, but they don't have to. 18 / 26
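The three operations can be sketched as plain Python functions; this is not the Hadoop API, just an in-memory illustration of the word-count logic, and the names mapper, combiner, and reducer are ours.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for each word in the input line.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Pre-aggregate in-memory map output; note that the (key, value)
    # types going in and coming out are the same.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

def reducer(word, values):
    # Sum the list of counts collected for one word.
    return (word, sum(values))

print(combiner(mapper("Hello World Bye World")))
# [('Bye', 1), ('Hello', 1), ('World', 2)]
```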

WordCount: Running the Example (File 1)

    Hello World Bye World

After map:

    (Hello, 1)
    (World, 1)
    (Bye, 1)
    (World, 1)

After combine:

    (Hello, 1)
    (Bye, 1)
    (World, 2)

19 / 26

WordCount: Running the Example (File 2)

    Hello Hadoop Goodbye Hadoop

After map:

    (Hello, 1)
    (Hadoop, 1)
    (Goodbye, 1)
    (Hadoop, 1)

After combine:

    (Hello, 1)
    (Goodbye, 1)
    (Hadoop, 2)

20 / 26

WordCount: Running the Example (Reduce) After concatenation, sorting, and creating lists of values:

    (Bye, [1])
    (Goodbye, [1])
    (Hadoop, [2])
    (Hello, [1, 1])
    (World, [2])

After the reduce, we get what we want:

    (Bye, 1)
    (Goodbye, 1)
    (Hadoop, 2)
    (Hello, 2)
    (World, 2)

21 / 26

WordCount Example: C++ Code (1) Credit: http://wiki.apache.org/hadoop/c++wordcount. APIs for Java/Python also exist.

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    class WordCountMap : public HadoopPipes::Mapper {
    public:
      WordCountMap(HadoopPipes::TaskContext& context) {}
      void map(HadoopPipes::MapContext& context) {
        std::vector<std::string> words =
          HadoopUtils::splitString(context.getInputValue(), " ");
        for (unsigned int i = 0; i < words.size(); ++i) {
          context.emit(words[i], "1");
        }
      }
    };

22 / 26

WordCount Example C++ Code (2)

    class WordCountReduce : public HadoopPipes::Reducer {
    public:
      WordCountReduce(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        int sum = 0;
        while (context.nextValue()) {
          sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
      }
    };

    int main(int argc, char* argv[]) {
      return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
    }

23 / 26

Other Examples Distributed Grep. Count of URL Access Frequency. Reverse Web-Link Graph. Term-Vector per Host. Inverted Index: Map: parses each document, and emits a sequence of (word, document ID) pairs. Reduce: accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. Output: all of the output pairs from reducing form a simple inverted index. 24 / 26
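The inverted-index map and reduce described above can be sketched in plain Python; this is an in-memory illustration rather than Hadoop code, and the names map_doc and reduce_word are ours.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    # Map: emit a (word, document ID) pair for each word in the document.
    return [(word, doc_id) for word in text.split()]

def reduce_word(word, doc_ids):
    # Reduce: sort (and deduplicate) the document IDs for one word.
    return (word, sorted(set(doc_ids)))

docs = {1: "hello world", 2: "goodbye world"}
pairs = [p for doc_id, text in docs.items() for p in map_doc(doc_id, text)]

grouped = defaultdict(list)          # the shuffle: group pairs by word
for word, doc_id in pairs:
    grouped[word].append(doc_id)

# All the reduce outputs together form a simple inverted index.
index = dict(reduce_word(w, ids) for w, ids in grouped.items())
print(index["world"])  # [1, 2]
```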

Other Notes Hive builds on top of Hadoop and allows you to use an SQL-like language to query outputs on HDFS; or you can provide custom mappers/reducers to get more information. The cloud is a great way to start a new project: you can add or remove nodes easily as your problem changes size (Hadoop or MPI are good examples). References: http://wiki.apache.org/hadoop/ and https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (There used to be a Google MapReduce tutorial, but it no longer seems to be on the Internet.) 25 / 26

Summary MapReduce is an excellent framework for dealing with massive data sets. Hadoop is a common implementation you can use (on most cloud computing services, even!). You just need two functions (optionally three): mapper, reducer, and combiner. Just remember: the output of the mapper/combiner is the input to the reducer. 26 / 26