MapReduce for Statistical NLP/Machine Learning
Sameer Maskey
Week 7, October 17, 2012
Topics for Today
- MapReduce
- Running MapReduce jobs on Amazon Hadoop
- MapReduce for machine learning: Naïve Bayes, K-Means, regression
- MapReduce with your own code on Amazon Hadoop
Announcements
- Intermediate Report I due tonight (11:59 pm)
- Homework 2 due date to be decided; please use Courseworks to discuss Homework 2
- Reading assignment for next class: Neural Networks, sections 5.1 to 5.3 in the Bishop book
- Graded Homework 1 will be available sometime next week
MapReduce
- Programming model for embarrassingly parallel problems
- Programmers specify two functions:
- Map: (k1, v1) → list(k2, v2), applied in parallel to all pairs in the input corpus (e.g., every document)
- Reduce: (k2, list(v2)) → list(v2), applied in parallel to all groups with the same key (e.g., every word)
Components of a MapReduce Program
- Driver
- Mapper
- Reducer
- May also contain Combiners (often the same as the Reducer) and Partitioners
Data to Mapper: InputFormats
- FileInputFormat
- TextInputFormat: each \n-terminated line is treated as a value; the key is the line's byte offset
- KeyValueTextInputFormat: maps each \n-terminated line into Key SEPARATOR Value format (e.g., the line "apple<TAB>3" becomes key "apple", value "3")
- SequenceFileInputFormat
- An OutputFormat can be set as well
Partitioners and Combiners
- Partitioners decide which key goes to which reducer
- Default: hash-based (a rough sketch below)
- For our word count we could have N different reducers for N words
- Combiners
- Combine mapper output before it is sent to the reducer
- Often similar to the reducer
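As a rough illustration (not from the slides), a hash-based partitioner simply hashes the key modulo the number of reducers, so every occurrence of the same key lands on the same reducer; num_reducers is an assumed parameter here:

  # minimal sketch of a hash partitioner: the same key always
  # maps to the same reducer index
  def partition(key, num_reducers):
      return hash(key) % num_reducers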
Jobs and Tasks
- Job: a user-submitted map and reduce computation
- Task: a single mapper or a single reducer
- Job coordinator (jobtracker): manages task delegation, retries, etc.
- Tasktracker: manages the tasks on each slave node
Naïve Bayes Experiment from Homework 1
  wget http://www.cs.columbia.edu/~smaskey/cs6998-0412/homework/hw1.tar.gz
- We had a small data set which we used to build a Naïve Bayes classifier
- Let us build Naïve Bayes for a much larger corpus
Naïve Bayes Classifier for Text
$$P(Y=y_k \mid X_1, X_2, \dots, X_N) = \frac{P(Y=y_k)\,P(X_1, X_2, \dots, X_N \mid Y=y_k)}{\sum_j P(Y=y_j)\,P(X_1, X_2, \dots, X_N \mid Y=y_j)} = \frac{P(Y=y_k)\prod_i P(X_i \mid Y=y_k)}{\sum_j P(Y=y_j)\prod_i P(X_i \mid Y=y_j)}$$
$$\hat{Y} \leftarrow \arg\max_{y_k} P(Y=y_k)\prod_i P(X_i \mid Y=y_k)$$
Naïve Bayes Classifier for Text
- Given the training data, what are the parameters to be estimated?
- $P(Y)$: Diabetes: 0.8, Hepatitis: 0.2
- $P(X \mid Y_1)$ (Diabetes): the: 0.001, diabetic: 0.02, blood: 0.0015, sugar: 0.02, weight: 0.018
- $P(X \mid Y_2)$ (Hepatitis): the: 0.001, diabetic: 0.0001, water: 0.0118, fever: 0.01, weight: 0.008
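A minimal sketch (not from the slides) of how these estimated parameters would be used at classification time; the numbers simply echo the toy table above, and the unseen-word floor is an assumption standing in for proper smoothing:

  import math

  priors = {"diabetes": 0.8, "hepatitis": 0.2}
  likelihoods = {
      "diabetes":  {"the": 0.001, "diabetic": 0.02,   "blood": 0.0015, "sugar": 0.02, "weight": 0.018},
      "hepatitis": {"the": 0.001, "diabetic": 0.0001, "water": 0.0118, "fever": 0.01, "weight": 0.008},
  }

  def classify(words, unseen=1e-6):
      # argmax_y log P(y) + sum_i log P(w_i | y); `unseen` is an assumed
      # floor for words missing from the toy table
      scores = {}
      for y in priors:
          scores[y] = math.log(priors[y]) + \
              sum(math.log(likelihoods[y].get(w, unseen)) for w in words)
      return max(scores, key=scores.get)

  print(classify(["diabetic", "blood", "sugar"]))  # -> diabetes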
We Simply Want to Count
- To compute the posterior $P(C \mid w_i)$ we prepared $P(w_i \mid C)$ for each class
- We compute $P(w_i \mid C)$ by counting the number of documents the word $w_i$ occurs in and dividing by the total number of documents:
$$\frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$$
- Can we put this in the MapReduce framework easily?
Naïve Bayes for 2 Billion Documents
- Assume you have 2 billion documents: 1 billion in the Hockey class and 1 billion in the Baseball class
- We want to compute the numerator of the equation on the last slide: the number of documents that contain word x, i.e., the sum of x = 1 over documents, where x = 1 if the word occurs in the document
- Multinomial Naïve Bayes version: sum x = N, where N is the number of times the word occurs in a document
- What will be the Map function? What will be the Reduce function? (A rough sketch follows below.)
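One way this might look in Hadoop Streaming style (an illustrative sketch, not the course's solution), assuming each input line is "<class_label><TAB><document text>":

  # mapper.py (sketch): emit ("<class>,<word>", 1) once per document
  import sys
  for line in sys.stdin:
      label, _, text = line.partition("\t")
      for word in set(text.split()):   # set(): count each word once per document
          print("%s,%s\t1" % (label, word))

  # reducer.py (sketch): sum the 1s into per-class document frequencies
  import sys
  counts = {}
  for line in sys.stdin:
      key, val = line.strip().split("\t")
      counts[key] = counts.get(key, 0) + int(val)
  for key in counts:
      print("%s\t%s" % (key, counts[key]))

For the multinomial version, replace set(text.split()) with text.split() so every occurrence is counted.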
Remember Our Word Count Example
- Last lecture we talked about word counts in the MapReduce framework
- What will be the Map and Reduce functions for the word count example?
Remember the MapReduce Picture
[Figure: map tasks turn input pairs (k1, v1) ... (k6, v6) into intermediate pairs such as (a, 1), (b, 2), (c, 3); "Shuffle and Sort" aggregates values by key, e.g. a → {1, 5}, b → {2, 7}, c → {2, 3, 6, 8}; reduce tasks then emit final pairs (r1, s1), (r2, s2), (r3, s3). Picture from [1]]
MapReduce for Word Count
[Figure: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". Map emits (word, 1) pairs; Combiners pre-aggregate per document, e.g. fish 1, fish 1 → fish 2; Shuffle and Sort groups by word; Reduce emits the final counts: cat 1, fish 4, one 1, red 1, blue 1, hat 1, two 1. Picture from [1]]
WordCount Example in AWS MapReduce
- Let's try running Amazon's built-in example for word count
- Log in to your AWS account and follow the instructions given in class
Using AWS
https://stanlpcolumbia.signin.aws.amazon.com/console
- User ID and password emailed to you; access keys emailed to you as well
- Log in
- Create an instance with the LAMP-stack AMI, or whatever AMI you want
- Create certificates and store them in a safe place
- Access the virtual server
- Create an S3 bucket
- Try the MapReduce example
Setting up Your AWS Machine
- We went through this last time
- Set up the LAMP stack if you want, or any other AMI you want; install any software you want (Ubuntu: apt-get; CentOS: yum)
  sudo su
  yum install httpd
  yum install mysql-server mysql
  yum install php php-mysql
  service httpd start
  service mysqld start
  /usr/bin/mysqladmin -u root password 'password'
Basic AWS EC2 Setup
- Add users, groups, etc.
- Edit /etc/ssh/sshd_config, setting PasswordAuthentication yes
- Restart sshd: sudo service sshd restart
- Find online tutorials for more sysadmin-related topics
Setup S3
  wget http://s3tools.org/repo/centos_5/s3tools.repo
  yum install s3cmd
  s3cmd --configure
- Alternatively: http://www.s3fox.net/downloadpage.aspx (Firefox plugin)
The s3cmd --configure dialog:
  Enter new values or accept defaults in brackets with Enter.
  Refer to user manual for detailed description of all options.
  Access key and Secret key are your identifiers for Amazon S3
  Access Key: XXXXXXXXXXXXXXXXXXXX (enter your access key here)
  Secret Key: SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS (enter your secret key here)
  Encryption password is used to protect your files from reading by unauthorized persons while in transfer to S3
  Encryption password: PPPPPPPPPPP (enter a password of your choice here)
  Path to GPG program [/usr/bin/gpg]:
  When using secure HTTPS protocol all communication with Amazon S3 servers is protected from 3rd party eavesdropping. This method is slower than plain HTTP and can't be used if you're behind a proxy
  Use HTTPS protocol [No]:
  On some networks all internet access must go through a HTTP proxy. Try setting it here if you can't connect to S3 directly
  HTTP Proxy server name:
MapReduce Computation
- Split the data into 16-64 MB chunks; start many copies of the program
- The master assigns map tasks to idle workers
- Map is applied; key-value pairs are written to local disk and their locations passed to the master
- Reducers read the data from the remote locations; keys are grouped
- Reduce is applied to the intermediate key-value pairs
MapReduce Framework
- We just write the mapper and the reducer, and hand the details over to the framework:
- How to assign workers to the map tasks
- How to assign workers to the reduce tasks
- Synchronization across many map tasks
- Handling failures and errors
- How to distribute data in the file system
[Figure: MapReduce execution overview: (1) the user program submits the job to the master; (2) the master schedules map and reduce tasks to workers; (3) map workers read the input splits; (4) intermediate files are written to local disk; (5) reduce workers read them remotely; (6) reduce workers write the output files. Adapted from (Dean and Ghemawat, OSDI 2004)]
MapReduce Implementations
- Google has a proprietary implementation in C++, with bindings in Java and Python
- Hadoop is an open-source implementation in Java: development led by Yahoo, used in production, now an Apache project with a rapidly expanding software ecosystem
- Lots of custom research implementations: for GPUs, Cell processors, etc.
Shuffle and Sort
- Can we use locality information for the reducer?
- How about the combiner? Sum vs. mean?
[Picture from [2]]
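A small worked example of why sum vs. mean matters (added for clarity): sums combine safely because addition is associative, e.g. (1 + 2) + (3 + 4 + 5) = 15 = 1 + 2 + 3 + 4 + 5; means do not, since mean(1, 2) = 1.5 and mean(3, 4, 5) = 4 give (1.5 + 4) / 2 = 2.75, while the true mean of 1..5 is 3. A combiner for a mean must therefore emit (sum, count) pairs and let the reducer do the final division.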
Examples
- Distributed grep: Map → <line, 1>; Reduce → <line, 1>
- URL access frequency count: Map → <URL, 1>; Reduce → <URL, total count>
- Reverse web-link graph: Map → <target, source>; Reduce → <target, list(source)>
Examples
- Term vector per host: Map → <host, document tf vector>; Reduce → <host, merged tf vector>
- Inverted index: Map → <word, document id>; Reduce → <word, list(document id)>
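A minimal sketch of the inverted-index pattern (illustrative, not from the slides), assuming input lines of the form "<doc_id><TAB><text>":

  # mapper.py (sketch): emit (word, doc_id) for each distinct word
  import sys
  for line in sys.stdin:
      doc_id, _, text = line.partition("\t")
      for word in set(text.split()):
          print("%s\t%s" % (word, doc_id))

  # reducer.py (sketch): collect each word's posting list
  import sys
  postings = {}
  for line in sys.stdin:
      word, doc_id = line.strip().split("\t")
      postings.setdefault(word, []).append(doc_id)
  for word in postings:
      print("%s\t%s" % (word, ",".join(postings[word])))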
Distributed File System
- We don't want to move data (remember: we can store a 5 TB file in HDFS)
- Put the data on local disks and move the workers to the data
- A distributed file system lets us store data in a distributed fashion, allowing workers to move to where the data chunks reside
- GFS (Google File System) for Google's MapReduce; HDFS (Hadoop Distributed File System) for Hadoop
HDFS File System Design
- Files are stored in fixed-size chunks (64 MB)
- Each chunk is replicated 3 times across different nodes
- Centralized management: a namenode keeps track of the datanodes
- No caching
- HDFS = GFS clone (same basic ideas)
HDFS Architecture
[Figure: the HDFS client asks the namenode for (file name, block id) and gets back (block id, block location), then requests (block id, byte range) directly from a datanode and receives the block data; the namenode holds the file namespace (e.g. /foo/bar → block 3df2) and exchanges instructions and state with the datanodes, each of which stores blocks on its local Linux file system. Adapted from (Ghemawat et al., SOSP 2003); picture from [1]]
Namenode Functions
- Manage the namespace of the filesystem: metadata, permissions, etc.
- Coordinate file operations: which datanode to write to, which datanode to read from
- Maintain overall health: ping datanodes to check that they are up; garbage collection
HDFS and MapReduce
[Figure: the namenode (running the namenode daemon) and the job submission node (running the jobtracker) sit above a set of slave nodes, each running a tasktracker and a datanode daemon over the local Linux file system. Picture from [1]]
MapReduce for Multiple Linear Regression
$$Y = X\theta$$
MapReduce for Multiple Linear Regression
Solve $A\theta = b$, where
$$A = \sum_{i=1}^{m} w_i (x_i x_i^T)$$
- Mappers: each computes $\sum_{\text{subgroup}} w_i (x_i x_i^T)$ over its subgroup of the data (sketched below)
- Reducer: sums the partial results to form $A$
- Compute $b$ similarly
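A minimal numpy sketch of this decomposition (an illustration of the formula above; X and w are assumed to hold one mapper's subgroup of feature rows and weights):

  import numpy as np

  def map_partial_A(X, w):
      # one mapper's contribution: sum_i w_i * (x_i x_i^T) over its subgroup
      return sum(w[i] * np.outer(X[i], X[i]) for i in range(len(X)))

  def reduce_A(partials):
      # the reducer just adds the mappers' partial matrices to form A
      return sum(partials)

$b$ is assembled the same way from partial sums of $w_i x_i y_i$, after which $\theta$ solves $A\theta = b$ (e.g. with numpy.linalg.solve(A, b)).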
MapReduce for Naïve Bayes
- We need to sum over all $x_j$, where $j = 1, \dots, M$ and $x_j = 1$ if $x_j$ occurs in a document $D$ of class 1:
$$\sum_{1}^{M} 1\{x_j = k \wedge y = 1\}$$
- Mappers: each computes $\sum_{\text{subgroup}} 1\{x_j = k \wedge y = 1\}$ over its subgroup
- Reducer: sums the partial counts
- Compute $\sum_{1}^{M} 1\{y = 1\}$ the same way
MapReduce for K-Means
Step 1: keep $\mu_k$ fixed; minimize $J$ with respect to $r_{nk}$
- Optimize for each $n$ separately by choosing the $k$ that gives the minimum $\|x_n - \mu_k\|^2$:
$$r_{nk} = 1 \text{ if } k = \arg\min_j \|x_n - \mu_j\|^2, \quad r_{nk} = 0 \text{ otherwise}$$
- This requires assigning a value of 0 or 1 to each data point
- We just need to split the data across many mappers
- Map: <ClusterID, point>
MapReduce for K-Means
Step 2: keep $r_{nk}$ fixed; minimize $J$ with respect to $\mu_k$
- $J$ is quadratic in $\mu_k$; minimize by setting the derivative with respect to $\mu_k$ to zero:
$$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$$
- Mappers: each computes $\sum_{\text{subgroup}} r_{nk} x_n$ over its subgroup
- Reduce: <ClusterID, mean>
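Putting the two steps together, a minimal sketch of one K-means iteration in map/reduce style (illustrative; points and centroids are assumed numpy arrays, and every cluster is assumed to receive at least one point):

  import numpy as np

  def kmeans_map(points, centroids):
      # Step 1: emit (nearest cluster id, point), i.e. the r_nk assignment
      for x in points:
          k = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
          yield k, x

  def kmeans_reduce(pairs, K, d):
      # Step 2: mu_k = sum_n r_nk x_n / sum_n r_nk, the per-cluster mean
      sums, counts = np.zeros((K, d)), np.zeros(K)
      for k, x in pairs:
          sums[k] += x
          counts[k] += 1
      return sums / counts[:, np.newaxis]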
Speedup with MapReduce: EM
[Figure from Chu, Cheng-Tao, et al.]
Speedup with MapReduce: Naïve Bayes
[Figure from Chu, Cheng-Tao, et al.]
Speedup with MapReduce: K-Means
[Figure from Chu, Cheng-Tao, et al.]
Amazon's Elastic MapReduce
- We want to write the ML algorithms we have learned so far in the MapReduce framework
- Use Amazon's Elastic MapReduce
- Naïve Bayes in the MapReduce framework to classify our hockey and baseball documents
- We need to load the data into Amazon's persistent storage, S3
Naïve Bayes Experiment from Homework 1
  wget http://www.cs.columbia.edu/~smaskey/cs6998-0412/homework/hw1.tar.gz
  s3cmd put baseball_hockey_train.txt s3://srmbucket/exp1/input/baseball_hockey_train.txt
Step Back: Word Count Example (Map)
  import sys

  # emit (word, 1) for every word on every line of standard input
  for line in sys.stdin:
      for word in line.split():
          print("%s\t%s" % (word, 1))
Key = word, Value = 1
Word Count Example (Reduce)
  import sys

  # sum the counts for each word emitted by the mappers
  wordfreq = {}
  for line in sys.stdin:
      key, val = line.strip().split("\t")
      # use the emitted value rather than a constant 1, so the reducer
      # still works if a combiner has pre-aggregated counts
      wordfreq[key] = wordfreq.get(key, 0) + int(val)
  for key in wordfreq:
      print("%s\t%s" % (key, wordfreq[key]))
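Before launching a job flow, the pair can be sanity-checked locally: Hadoop Streaming just pipes lines through the scripts, so cat baseball_hockey_train.txt | python mapper.py | sort | python reducer.py (with sort standing in for the shuffle) approximates one map-reduce pass.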
Running MapReduce-based WordCount in Amazon
- Copy mapper.py and reducer.py to S3:
  s3cmd --recursive put mybin s3://maptest1/exp2/mybin
- Copy the data to S3:
  s3cmd put baseball_hockey_train.txt s3://maptest1/exp2/input/baseball_hockey_train.txt
Setup MapReduce Job Flow
Command-line Interface
  elastic-mapreduce --create --stream --num-instances 1 \
    --mapper s3://yourbucket/bin/mapper.py \
    --reducer s3://yourbucket/bin/reducer.py \
    --input s3://yourbucket/data \
    --name 'My WordCount Job' \
    --output s3://yourbucket/output
References
[1] Jimmy Lin, http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/syllabus.html