MapReduce for Statistical NLP/Machine Learning
|
|
|
- Kelley Gregory
- 10 years ago
- Views:
Transcription
1 MapReduce for Statistical NLP/Machine Learning Sameer Maskey Week 7, October 17,
2 Topics for Today MapReduce Running MapReduce jobs in Amazon Hadoop MapReduce for Machine Learning Naïve Bayes, K-Means, Regression MapReduce with your code in Amazon Hadoop 2
3 Announcement Intermediate Report I due tonight (11:59 pm) Homework 2 due (date to be decided) Please use courseworks to discuss Homework 2 Reading Assignment for next class Neural Networks 5.1 to 5.3 in Bishop Book Graded Homework 1 will be available sometime next week 3
4 MapReduce Programming model to handle embarrassingly parallel problems Programmers specify two functions: Map (k1, v1) list(k2, v2) Applied to all pairs in parallel for input corpus (e.g. all document for example) Reduce (k2, list(v2)) list (v2) Applied to all groups with the same key in parallel (e.g. all words for example) 4
5 Components of MapReduce Program Driver Mapper Reducers May also contain Combiners (often same as Reducers) and Partioners 5
6 Data to Mapper : InputFormats FileInputFormat TextInputFormat Each \n terminated is treated as a value Key is the byte offset KeyValueTextInputFormat Map each \n terminated lines into Key SEPARATOR Value format SequenceFileInputFormat OutputFormat can be set as well 6
7 Partioners and Combiners Partioners decide what key goes to what reducer Default - hash-based For our word count we can have N different reducers for N words Combiners Combines mapper output before sending to reducer Many times similar to reducer 7
8 Jobs and Tasks Job : user submitted map and reduce Task Single mapper Single Reducer Job-cooridnater Manages task delegation, retries, etc Task tracker Manages task in each slave nodes 8
9 Naïve Bayes Experiment from Homework 1 wget We had small data set which we used to build Naïve Bayes Classifier Let us build Naïve Bayes for much larger corpus 9
10 Naïve Bayes Classifier for Text P(Y =y k X 1,X 2,...,X N )= P(Y=y k)p(x 1,X 2,..,X N Y=y k ) j P(Y=y j)p(x 1,X 2,..,X N Y=y j ) = P(Y=y k)π i P(X i Y=y k ) j P(Y=y j)π i P(X i Y=y j ) Y argmax yk P(Y =y k )Π i P(X i Y =y k ) 10
11 Naïve Bayes Classifier for Text Given the training data what are the parameters to be estimated? P(Y) P(X Y 1 ) P(X Y 2 ) Diabetes : 0.8 Hepatitis : 0.2 the: diabetic : 0.02 blood : sugar : 0.02 weight : the: diabetic : water : fever : 0.01 weight :
12 We Simply Want to Count To compute posteriors of P(C wi) we prepared P(wi C) for each class We compute P(wi C) by counting # of documents the word wi occurred in and dividing by the total # of documents Can we put this in MapReduce Framework easily? #D{X i =x ij &Y =y k } #D{Y=y k } 12
13 Naïve Bayes for 2 Billion Documents Assume you have 2 billion documents 1 billion in Hockey class 1 billion in Baseball class We want to compute numerator of the equation in the last slide We want to compute Number of x=1 where x=1 if word x occurs in the document number of documents that has word x Multinomial Naïve Bayes Version : sum of x=n where N is the number of times word occurs in a document What will be the Map function? What will be the Reduce function? 13
14 Remember our Word Count Example Last lecture we talked about word counts in MapReduce framework What will be Map and Reduce function for Wordcount Example? 14
15 Remember MapReduce Picture k 1 v 1 k 2 v 2 k 3 v 3 k 4 v 4 k 5 v 5 k 6 v 6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c reduce reduce reduce r 1 s 1 r 2 s 2 r 3 s 3 Picture from [1] 15
16 MapReduce for Word Count Doc 1 one fish, two fish Doc 2 red fish, blue fish Doc 3 cat in the hat Map one 1 red 1 two 1 blue 1 Combiner fish 1 fish 1 fish 1 fish 2 fish 1 Combiner fish 2 cat hat in the Shuffle and Sort: aggregate values by keys Reduce cat fish one red blue hat two Picture from [1] 16
17 WordCount Example in AWS MapReduce Let s try to run Amazon s in built example for word count Login to your AWS accounts follow instructions in the class 17
18 Using AWS UserID Password ed to you Access Keys ed you to as well LogIn Create Instance with LAMP stack AMI or whatever AMI you want Create certificates and store it in a safe place Access Virtual Server Create S3 bucket Try MapReduce Example 18
19 Setting up your AWS Machine We went through this last time Setup LAMP stack if you want Setup any other AMI you want Install any software you want Ubuntu apt-get CentOS yum sudo su yum install httpd yum install mysql-server mysql yum install php php-mysql service httpd start service mysqld start /usr/bin/mysqladmin -u root password 'password' 19
20 Basic AWS EC2 setup Addusers, groups, etc Edit /etc/ssh/sshd_config setting where PasswordAuthentication yes Restart the sshd sudo service sshd restart Find online tutorials for more sysadmin related topics 20
21 Setup S3 wget yum install s3cmd s3cmd --configure firefox plugin Enter new values or accept defaults in brackets with Enter. Refer to user manual for detailed description of all options. Access key and Secret key are your identifiers for Amazon S3 Access Key: XXXXXXXXXXXXXXXXXXXX (enter your access key here) Secret Key: SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS (enter your secret key here) Encryption password is used to protect your files from reading by unauthorized persons while in transfer to S3 Encryption password: PPPPPPPPPPP (enter a password of your choice here) Path to GPG program [/usr/bin/gpg]: When using secure HTTPS protocol all communication with Amazon S3 servers is protected from 3rd party eavesdropping. This method is slower than plain HTTP and can't be used if you're behind a proxy Use HTTPS protocol [No]: On some networks all internet access must go through a HTTP proxy. Try setting it here if you can't conect to S3 directly HTTP Proxy server name: 21
22 MapReduce Computation Split data into 16-64MB chunks, start many copies of program Master assigns map tasks to idle workers Map is applied Key-value pairs written to disk, locations passed to Master Reducer reads data from remote location keys are grouped Reduce is applied on intermediate key-value pairs 22
23 MapReduce Framework We just write mapper and reducer Hand over many details to the framework How to assign workers to the map tasks? How to assign workers to the reduce tasks? Handle synchronization across many map tasks Handle failures, errors How to distribute data in the file system 23
24 User Program (1) submit Master (2) schedule map (2) schedule reduce worker split 0 split 1 split 2 split 3 split 4 (3) read worker (4) local write (5) remote read worker worker (6) write output file 0 output file 1 worker Input files Map phase Intermediate files (on local disk) Reduce phase Output files Adapted from (Dean and Ghemawat, OSDI 2004) 24
25 MapReduce Implementations Google has a proprietary implementation in C++ Bindings in Java, Python Hadoop is an open-source implementation in Java Development led by Yahoo, used in production Now an Apache project Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc. 25
26 Shuffle and Sort Can we use locality information for Reducer? How about Combiner? Sum vs Mean? Picture from [2] 26
27 Examples Distributed Grep Map <line, 1> Reduce <line, 1> URL Access Frequency Count Map <URL, 1> Reduce <URL, total count> Reverse Web-Link Graph Map <target, source> Reduce <target, list(source)> 27
28 Examples Term-Vector per Host Map <hostdocment, tf vector> Reduce <hostdocument, tf vector> Inverted Index Map <word, document id> Reduce <word, list<document id> 28
29 Distributed File System We don t want to move data (remember : we can store a file of 5TB in HDFS) Put data in local disks and move the workers to the local disk A distributed file system allows to put data in a distributed fashion allowing worker to move where the data chunk resides GFS (Google File System) for Google s MapReduce HDFS (Hadoop Distributed File System) for Hadoop 29
30 HDFS File System Designs Use fixed chunk size Fixed size (64MB) Replicated 3 times across different nodes Centralized management with datanode that keeps track of namenodes No caching HDFS = GFS clone (same basic ideas) 30
31 HDFS Architecture HDFS namenode Application HDFS Client (file name, block id) (block id, block location) File namespace /foo/bar block 3df2 instructions to datanode (block id, byte range) block data datanode state HDFS datanode Linux file system HDFS datanode Linux file system Adapted from (Ghemawat et al., SOSP 2003) Picture from [1] 31
32 Namenode Functions Managing namespace of the filesystem Metadata, permissions, etc File Operations coordinate through namenodes Which datanode to write on Which datanode to read from Maintain overall health Pings to datanodes to check if they are up Garbage collection 32
33 HDFS and MapReduce namenode job submission node namenode daemon jobtracker tasktracker datanode daemon Linux file system slave node tasktracker datanode daemon Linux file system slave node tasktracker datanode daemon Linux file system slave node Picture from [1] 33
34 Map Reduce for Multiple Linear Regression Y X θ 34
35 Map Reduce for Multiple Linear Regression A b A= m i w i (x i x T i ) Mappers subgroup w i(x i x T i ) subgroup w i(x i x T i ) Reducer A Compute b similarly 35
36 MapReduce for Naïve Bayes We need to sum over all xj where j=1 to M and xj=1 if xj occurs in document D of class 1 We need to sum over all xj where j=1 to M and xj=1 if xj occurs in document D of class 1 M 1 1{x j=k y=1} Mappers subgroup 1{x j=k y=1} subgroup 1{x j=k y=1} Reducer Compute M 1 1{y=1} 36
37 MapReduce for K-Means r nk Step 1 Keep µ k fixed Minimize J with respect to r nk Optimize for each n separately by choosing for k that gives minimum x n r nk 2 r nk =1ifk=argmin j x n µ j 2 =0otherwise This requires assigning value of 0 or 1 to each data point We just need to split the data with mapper to many split Map : <ClusterID, point> 37
38 MapReduce for K-Means Minimize J with respect to r nk µ k Step 2 Keep fixed µ k µ k n µ k = r nkx n n r nk J is quadratic in. Minimize by setting derivative w.rt. zero to subgroup r nkx n subgroup r nkx n Reduce : <ClusterID, mean> 38
39 Speedup with MapReduce - EM Chu, Cheng-Tao, et. al 39
40 Speedup with MapReduce Naïve Byes Chu, Cheng-Tao, et. al 40
41 Speedup with MapReduce : KMeans Chu, Cheng-Tao, et. al 41
42 Amazon s Elastic MapReduce We want to write ML algorithms we have learned so far in MapReduce Framework Use Elastic MapReduce of Amazon Naïve Bayes in MapReduce framework to classify our hockey and baseball documents We need to loadup data in Amazon s persistent storage, S3 42
43 Naïve Bayes Experiment from Homework 1 wget s3cmd put baseball_hockey_train.txt s3://srmbucket/exp1/input/baseball_hockey_train.txt 43
44 Step Back : Word Count Example Map lines = sys.stdin for line in lines: words = line.split() for word in words: print word, "\t", 1; Key = wordid Value = 1 44
45 Word Count Example Reduce wordfreq={} lines = sys.stdin for line in lines: words = line.split("\t") key = words[0] val = words[1] if key in wordfreq: wordfreq[key] = wordfreq[key]+1 else: wordfreq[key] = 1 for key in wordfreq.keys(): print key, "\t", wordfreq[key] 45
46 Running MapReduce based WordCount in Amazon Copy over mapper.py and reducer.py to s3 s3cmd --recursive put mybin s3://maptest1/exp2/mybin Copy over data to s3 s3cmd put baseball_hockey_train.txt s3://maptest1/exp2/input/baseball_hockey_train.txt 46
47 Setup MapReduce Job Flow 47
48 Commandline Interface elastic-mapreduce --create --stream --num-instances 1 --mapper s3://yourbucket/bin/mapper.py --reducer s3://yourbucket/bin/reducer.py --input s3://yourbucket/data --name 'My WordCount Job' --output s3://yourbucket/output 48
49 References [1] Jimmy Lin, Spring/syllabus.html 49
MapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Getting Started with Hadoop with Amazon s Elastic MapReduce
Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson [email protected] http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
How To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
University of Maryland. Tuesday, February 2, 2010
Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Cloud Homework instructions for AWS default instance (Red Hat based)
Cloud Homework instructions for AWS default instance (Red Hat based) Automatic updates: Setting up automatic updates: by Manuel Corona $ sudo nano /etc/yum/yum-updatesd.conf Look for the line that says
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Sriram Krishnan, Ph.D. [email protected]
Sriram Krishnan, Ph.D. [email protected] (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 21 Outline
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: [email protected] & [email protected] Abstract : In the information industry,
Single Node Hadoop Cluster Setup
Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13
How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Data-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
MAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
The Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen [email protected] Abstract. The Hadoop Framework offers an approach to large-scale
Lecture 3 Hadoop Technical Introduction CSE 490H
Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated
Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Tutorial for Assignment 2.0
Tutorial for Assignment 2.0 Florian Klien & Christian Körner IMPORTANT The presented information has been tested on the following operating systems Mac OS X 10.6 Ubuntu Linux The installation on Windows
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"
Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster" Mahidhar Tatineni SDSC Summer Institute August 06, 2013 Overview "" Hadoop framework extensively used for scalable distributed processing
Hadoop. History and Introduction. Explained By Vaibhav Agarwal
Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Hadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200
Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Map Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce
CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #13: NoSQL and MapReduce Announcements HW4 is out You have to use the PGSQL server START EARLY!! We can not help if everyone
and HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015
Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 1 MapReduce in More Detail 2 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks.
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Running Kmeans Mapreduce code on Amazon AWS
Running Kmeans Mapreduce code on Amazon AWS Pseudo Code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step 1: for iteration = 1 to MaxIterations do Step 2: Mapper:
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
TP1: Getting Started with Hadoop
TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
What is cloud computing?
Introduction to Clouds and MapReduce Jimmy Lin University of Maryland What is cloud computing? With mods by Alan Sussman This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike
Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software Engineer, @MirantisIT
Hadoop on OpenStack Cloud Dmitry Mescheryakov Software Engineer, @MirantisIT Agenda OpenStack Sahara Demo Hadoop Performance on Cloud Conclusion OpenStack Open source cloud computing platform 17,209 commits
CSCI6900 Assignment 2: Naïve Bayes on Hadoop
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF GEORGIA CSCI6900 Assignment 2: Naïve Bayes on Hadoop DUE: Friday, September 18 by 11:59:59pm Out September 4, 2015 1 IMPORTANT NOTES You are expected to use
Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC [email protected] [email protected] July 10, 2014
Hadoop/MapReduce Workshop Dan Mazur, McGill HPC [email protected] [email protected] July 10, 2014 1 Outline Hadoop introduction and motivation Python review HDFS - The Hadoop Filesystem MapReduce
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
