Hadoop/MapReduce Workshop Dan Mazur daniel.mazur@mcgill.ca Simon Nderitu simon.nderitu@mcgill.ca guillimin@calculquebec.ca August 14, 2015 1
Outline
Hadoop introduction and motivation
Python review
HDFS - The Hadoop Filesystem
MapReduce examples and exercises:
Wordcount
Distributed grep
Distributed sort
Maximum
Mean and standard deviation
Combiners
2
Exercise 0: Login and Setup
Log in to Guillimin
$ ssh -X class##@hadoop.hpc.mcgill.ca
Use the account number and password from the slip you received
Copy workshop files
$ cp -R /software/workshop/hadoop .
Load Hadoop Module
$ module show hadoop
$ module load hadoop python
3
Exercise 1: First MapReduce job
To make sure your environment is set up correctly
Launch an example MapReduce job
$ hadoop jar $HADOOP_EXAMPLES pi 100 100
$HADOOP_EXAMPLES=/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-examples.jar
Final output:
Job Finished in 16.983 seconds
Estimated value of Pi is 3.14080000000000000000
4
Hadoop What is Hadoop? 5
Hadoop
What is Hadoop?
A collection of related software (software framework, software stack, software ecosystem)
Data-intensive computing (Big Data)
Scales with data size
Fault-tolerant
Parallel
Analysis of unstructured, semi-structured data
Cluster
Commodity hardware
Open Source
6
MapReduce What is MapReduce? Parallel, distributed programming model Large-scale data processing Map() Filter, sort, embarrassingly parallel e.g. sort participants by first name Reduce() Summary e.g. count participants whose first name starts with each letter 7
Hadoop Ecosystem Apache Hadoop core HDFS - Hadoop Distributed File System Yarn - Resource management, job scheduling Pig - high-level scripting for MapReduce programs Hive - Data warehouse with SQL-like queries HCatalog - Abstracts data storage filenames and formats HBase - Database Zookeeper - Maintains and synchronizes configuration Oozie - Workflow scheduling Sqoop - Transfer data to/from relational databases 8
Hadoop Motivation
For big data, hard disk input/output (I/O) is a bottleneck
We have seen huge technology improvements in both CPU speeds and storage capacity, but I/O performance has not improved as dramatically
We will see that Hadoop solves this problem by parallelizing I/O operations across many disks
The genius of Hadoop is in how easy this parallelization is from the developer's perspective
9
Hadoop Motivation
Big Data Challenges - The V-words
Volume: Amount of data (terabytes, petabytes). Want to distribute storage across many nodes and analyze it in place.
Variety: Data comes in many formats, not always in a relational database. Want to store data in original format.
Velocity: Rate at which size of data grows, or speed with which it must be processed. Want to expand storage and analysis capacity as data grows.
Data is big data if its volume, variety, or velocity are too great to manage easily with traditional tools
10
How Does Hadoop Solve These Problems? Distributed file system (HDFS) Scalable with redundancy Parallel computations on data nodes Batch (scheduled) processing 11
Hadoop vs. Competition
Hadoop works well for loosely coupled (embarrassingly parallel) problems
[Chart: process coupling vs. parallelism, positioning MPI, databases, and Hadoop]
12
Hadoop vs. Competition
Map or reduce tasks are automatically re-run if they fail
Data is stored redundantly and reproduced automatically if a drive fails
[Chart: fault tolerance vs. parallelism, positioning MPI, databases, and Hadoop]
13
Hadoop vs. Competition
Hadoop makes certain problems very easy to solve
Hardest parts of parallel programming are abstracted away
Today we will write several practical codes that could scale to 1000s of nodes with just a few lines of code
[Chart: developer productivity vs. parallelism, positioning MPI, databases, and Hadoop]
14
Hadoop vs. Competition
Hadoop, MPI, and databases are all improving their weaknesses
All are becoming fault-tolerant platforms for tightly-coupled, massively parallel problems
Hadoop integrates easily into a workflow that includes MPI and/or databases
Sqoop, HBase, etc. for working with databases
Hadoop for post-MPI data analysis
Hadoop for pre-MPI data processing
Hadoop 2 introduced a scheduler, Yarn, that can schedule MPI, MapReduce, and other types of workloads
New tools for tightly-coupled problems: Apache Spark
15
Python Hadoop is implemented in Java Developers can work with any programming language For a workshop, it is important to have a common language 16
Python - for loops
Can loop over individual instances of 'iterable' objects (lists, tuples, dictionaries)
Looped sections use an indented block
Be consistent: use a tab or 4 spaces, not both
Do not forget the colon

mylist = ['one', 'two', 'three']
for item in mylist:
    print(item)
17
Python standard input/output

#!/usr/bin/python
# Load modules for system and comma separated value file functions
import sys
import csv

reader = csv.reader(sys.stdin, delimiter=',')
# Loop over lines in reader
for line in reader:
    data0 = line[0]
    data1 = line[1]
18
Dictionaries
Unordered set of key/value pairs
Keys are unique and can be used as an index

>>> dict = {'key':'value', 'apple':'the round fruit of a tree'}
>>> print dict['key']
value
>>> print dict['apple']
the round fruit of a tree
19
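As a quick illustration (not from the original slides), a dictionary can act as a counter, which is the same idea the word-count reducer will implement later in a streaming fashion:

    # A minimal sketch: using a dictionary as a counter (plain Python, no Hadoop)
    counts = {}
    for word in ['hello', 'world', 'hello']:
        counts[word] = counts.get(word, 0) + 1
    print counts['hello'], counts['world']    # 2 1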
Hadoop Distributed File System (HDFS) Key Concepts: Data is read and written in minimum units ( blocks ) A master node ( namenode ) manages the filesystem tree and the metadata for each file Data is stored on a group of worker nodes ( datanodes ) The same blocks are replicated across multiple datanodes (default replication = 3) 20
HDFS Blocks
[Diagram: a 150 MB file, myfile.txt, is split into blocks 1: 64 MB, 2: 64 MB, and 3: 22 MB, to be stored across a group of datanodes]
21
HDFS Blocks
Data is distributed block-by-block to multiple nodes
[Diagram: blocks 1 (64 MB), 2 (64 MB), and 3 (22 MB) of the 150 MB myfile.txt placed on different datanodes]
22
HDFS Blocks
Data redundancy: default = 3x
If we lose a node, data is available on 2 other nodes and the namenode arranges to create a 3rd copy on another node
[Diagram: each block of the 150 MB myfile.txt stored on three different datanodes]
23
Exercise 2: Using HDFS Put a file into HDFS $ hdfs dfs -put titanic.txt List files in HDFS $ hdfs dfs -ls Output the file contents $ hdfs dfs -cat titanic.txt $ hdfs dfs -tail titanic.txt Get help $ hdfs dfs -help Put the workshop data sets into HDFS $ hdfs dfs -put usask_access_logs $ hdfs dfs -put household_power_consumption.txt 24
MapReduce Roman census approach: Want to count (and tax) all people in the Roman empire Better to go to where the people are (decentralized) than try to bring them to you (centralized) Bring back information from each village (map phase) Summarize the global picture (reduce phase) 25
Roman Census: Mapping
[Diagram: mappers are sent out to each village and bring counts back to the capital, e.g. 287 men, 293 women, 104 children, 854 sheep, ...]
Note: These mappers are also combiners in Hadoop language. We will discuss what this means.
26
Roman Census: Reducing
[Diagram: each reducer receives all of the values for one key, e.g. the sheep counts from every village (854, 34, 206, 91, 545, ...), and sums them into a single total (e.g. 2762 sheep)]
27
MapReduce
[Diagram: input data is split among mappers, each emitting key, value pairs; the sort and shuffle phase groups all values for each key; reducers process each key with all of its values and write the results]
28
Mapper Takes specified data as input Works with a fraction of the data Works in parallel Outputs intermediate records: key, value pairs (recall hash tables or Python dictionaries) 29
Reducer Takes a key or set of keys with all associated values as input Works with all data for that key Outputs the final results 30
MapReduce Word Counting Want to count the frequency of words in a document What are the key, value pairs for our mapper output? A) key=1, value='word' B) key='word', value=1 C) key=[number in hdfs block], value='word' D) key='word', value=[number in hdfs block] E) Something else 31
MapReduce Word Counting Want to count the frequency of words in a document What are the key, value pairs for our mapper output? A) key=1, value='word' B) key='word', value=1 C) key=[number in hdfs block], value='word' D) key='word', value=[number in hdfs block] E) Something else Answer: B. Explanation: We want to sort according to the words, so that is the key. We can generate a pair for each word; we don't need the mapper to keep track of frequencies. 32
MapReduce Word Counting
If our mapper input is hello world, hello!, what will our reducer input look like?
A) hello / world / hello
B) hello 1 / world 1 / hello 1
C) hello 1 / hello 1 / world 1
D) hello 2 / world 1
33
MapReduce Word Counting
If our mapper input is hello world, hello!, what will our reducer input look like?
A) hello / world / hello
B) hello 1 / world 1 / hello 1
C) hello 1 / hello 1 / world 1
D) hello 2 / world 1
Answer: C
Explanation: The reducer receives SORTED key, value pairs. The sorting is done automatically by Hadoop. D is also possible; we will learn about combiners later.
34
MapReduce Word Counting Want to count the frequency of words in a document What are the key, value pairs for our reducer output? A) key=1, value='word' B) key='word', value=1 C) key=[count in document], value='word' D) key='word', value=[count in document] E) Something else 35
MapReduce Word Counting Want to count the frequency of words in a document What are the key, value pairs for our reducer output? A) key=1, value='word' B) key='word', value=1 C) key=[count in document], value='word' D) key='word', value=[count in document] E) Something else Answer: D (key='word', value=[count in document]) 36
Hadoop Streaming
Streaming lets developers use any programming language for mapping and reducing
Use standard input and standard output
The first tab character delimits between key and value
Similar to Bash pipes: $ cat file.txt | ./mapper | sort | ./reducer

HADOOP_STREAM=/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar
hadoop jar $HADOOP_STREAM -files <mapper script>,<reducer script> \
    -input <input dir> -output <output dir> \
    -mapper <mapper script> -reducer <reducer script>
37
mapper.py

#!/usr/bin/env python
# Scripts require a hash bang line

# Import the sys module for stdin
import sys

# Loop over standard input
for line in sys.stdin:
    # split the line into words
    words = line.split()
    # Loop over words
    for word in words:
        # Print tab-separated key, value pairs
        print word, '\t', 1
38
Reducer: Checking for key changes
Often reducers will have to detect when the key changes in the sorted mapper output

prevkey = None
for line in inputreader:
    key = line[0]
    # If the current key is the same as previous key
    if key == prevkey:
        value = ...   # update for current line
    # Else we have started a new group of keys
    else:
        # if not first line of input
        if not prevkey == None:
            # Completed entire key, print value
            print prevkey, '\t', value
        value = ...   # set for first entry of new key
        # Set prevkey to the current key
        prevkey = key

# Output final key, value pair
if prevkey:
    print prevkey, '\t', value
39
reducer.py

#!/usr/bin/env python
import sys

prevword = None
wordcount = 0
word = None

for line in sys.stdin:
    word, count = line.split('\t', 1)
    count = int(count)
    if word == prevword:
        wordcount += count
    else:
        if prevword:
            print prevword, '\t', wordcount
        wordcount = count
        prevword = word

if prevword:
    print prevword, '\t', wordcount
40
Testing map and reduce scripts It is useful to test your scripts with a small amount of data in serial to check for syntax errors $ head -100 mydata.txt | ./mapper.py | sort | ./reducer.py 41
Exercise 3: Word count
Place the directory montgomery into HDFS
$ hdfs dfs -put montgomery
Submit a MapReduce job with your *tested* scripts to count the word frequencies in Lucy Maud Montgomery's books
$ hadoop jar $HADOOP_STREAM \
    -files mapper_wordcount.py,reducer_wordcount.py \
    -mapper mapper_wordcount.py \
    -reducer reducer_wordcount.py \
    -input montgomery \
    -output wordcount
42
Exercise 3: Word count
View the output directory
$ hdfs dfs -ls wordcount
View your results
$ hdfs dfs -cat wordcount/part-00000
View your (sorted) results
$ hdfs dfs -cat wordcount/part-00000 | sort -k 2 -n
43
Storage A Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload two 400GB files and run WordCount on them both. What will happen? A) The data upload fails at the first file B) The data upload fails at the second file C) MapReduce job fails D) None of the above. MapReduce is successful. 44
Storage A Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload two 400GB files and run WordCount on them both. What will happen? A) The data upload fails at the first file B) The data upload fails at the second file C) MapReduce job fails D) None of the above. MapReduce is successful. Answer: D. With 3x replication the cluster can store 1.0TB of data, and two 400GB files total only 0.8TB. 45
Storage A Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload three 400GB files and run WordCount on them all. What will happen? A) The data upload fails at the first file B) The data upload fails at the second file C) The data upload fails at the third file D) MapReduce job fails E) None of the above. MapReduce is successful. 46
Storage A Hadoop cluster has 10 nodes with 300GB of storage per node with the default HDFS setup (replication factor 3, 64MB blocks). Alice wants to upload three 400GB files and run WordCount on them all. What will happen? A) The data upload fails at the first file B) The data upload fails at the second file C) The data upload fails at the third file D) MapReduce job fails E) None of the above. MapReduce is successful. Answer: C. Explanation: The small cluster can only store 1.0TB of data with 3X replication! Alice wants to upload 1.2TB. 47
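A quick back-of-the-envelope check of the capacity numbers used above (a sketch of the arithmetic, not part of the original slides):

    # Rough capacity check for the quiz scenarios (all values in GB)
    raw_capacity = 10 * 300         # total disk across the cluster: 3000 GB
    usable = raw_capacity / 3       # with 3x replication: 1000 GB of data
    two_files = 2 * 400             # 800 GB: fits, so upload and MapReduce succeed
    three_files = 3 * 400           # 1200 GB: exceeds 1000 GB, so the third upload fails
    print usable, two_files, three_files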
Simplifying MapReduce Commands
The native streaming commands are cumbersome
TIP: Create simplifying aliases and functions, e.g.

mapreduce_stream(){
    hadoop jar $HADOOP_STREAM -files $1,$2 \
        -mapper $1 \
        -reducer $2 \
        -input $3 -output $4
}
alias mrs=mapreduce_stream

Place these commands into ~/.bashrc so they are executed in each new bash session (each login)
To avoid confusion, we will only use the native commands today
48
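With these definitions in place, the Exercise 3 word-count job could then be launched with a single short command (argument order follows the function above):

    $ mrs mapper_wordcount.py reducer_wordcount.py montgomery wordcount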
Exercise 4: Hadoop Web UI Hadoop includes a web-based user interface Launch a firefox window $ firefox & Navigate to the Hadoop Job Monitor http://lm-2r01-n02:8088/cluster Navigate to the namenode and filesystem http://lm-2r01-n01:50070/dfshealth.jsp Navigate to the job history http://lm-2r01-n02:19888/jobhistory 49
Accessing Job Logs Through the web UI Through the command line $ yarn logs -applicationId application_1431541293708_0051 51
Example: Distributed Grep
Note: We don't have to write any scripts!
Note: There is no reducer phase
Hadoop Command:
$ hadoop jar $HADOOP_STREAM \
    -D mapreduce.job.reduces=0 \
    -D mapred.reduce.tasks=0 \
    -input titanic.txt \
    -output grepout \
    -mapper "/bin/grep Williams"
View Results:
$ hdfs dfs -cat grepout/part-0000*
52
Household Power Consumption
Dataset: household_power_consumption.txt
From the UCI Machine Learning Repository
9 columns, semicolon-separated (see household_power_consumption.explain)
1. date: dd/mm/yyyy
2. time: hh:mm:ss
3. minute-averaged active power (kilowatts)
4. minute-averaged reactive power (kilowatts)
5. minute-averaged voltage (volts)
6. minute-averaged current (amps)
7. kitchen active energy (watt-hours)
8. laundry active energy (watt-hours)
9. water-heater and A/C active energy (watt-hours)
$ hdfs dfs -put household_power_consumption.txt
53
Problematic Input
In the household_power_consumption data, missing values are specified by '?'
Analysts must decide how to deal with unexpected values in unstructured data
Today, we will ignore it:

for line in sys.stdin:
    try:
        data = float(line.split(';')[2])
    except:
        continue
    ...
54
On which date was the maximum minute-averaged active power? What should the output of our mapper be? A) power 1 B) date 1 C) date power D) something else 55
On which date was the maximum minute-averaged active power? What should the output of our mapper be? A) power 1 B) date 1 C) date power D) something else Answer: C (key=date, value=power); the reducer can then find the maximum power and the date on which it occurred. 56
Working with .csv files in python
We can use the csv module in python to parse .csv files more easily

import sys, csv

reader = csv.reader(sys.stdin, delimiter=';')
for line in reader:
    data = float(line[2])
    ...
57
Exercise 5: Compute the maximum Write a mapper and a reducer to compute the maximum value of the minute-averaged active power (3rd column), as well as the date on which this power occurred 58
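One possible solution sketch for this exercise (file names and details are illustrative; the scripts in the workshop's solutions directory may differ). The mapper emits the date as the key and the power as the value, skipping the header line and missing values; with a single reducer, the reducer can then track the largest power seen and the date it occurred on:

    #!/usr/bin/env python
    # mapper_max.py (sketch)
    import sys

    for line in sys.stdin:
        fields = line.split(';')
        try:
            date = fields[0]
            power = float(fields[2])        # minute-averaged active power
        except (IndexError, ValueError):
            continue                        # skip the header and '?' entries
        print '%s\t%s' % (date, power)

    #!/usr/bin/env python
    # reducer_max.py (sketch): with a single reducer this finds the global maximum
    import sys

    maxpower = None
    maxdate = None
    for line in sys.stdin:
        date, power = line.split('\t', 1)
        power = float(power)
        if maxpower is None or power > maxpower:
            maxpower = power
            maxdate = date
    if maxdate is not None:
        print '%s\t%s' % (maxdate, maxpower)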
Combiners
To compute the max, you may have:
Output a list of all values with a single common key
Had a single reducer compute the maximum value in serial
We would like to do some pre-reduction on the mapper nodes to balance the workload from the reducer to the mappers
To find the maximum, we only need to send the maximum from each mapper through the network, not every value
59
Combiners (maximum)
[Diagram: input lines such as
20/12/2006;02:46:00;1.516;0.262;245.780;6.200;0.000;1.000;18.000
20/12/2006;02:47:00;1.498;0.258;246.060;6.200;0.000;2.000;19.000
20/12/2006;02:48:00;1.518;0.264;246.240;6.200;0.000;1.000;18.000
are mapped to (date, power) pairs, e.g. 16/12/2006: 4.216, 5.360, 5.374, 5.388, 3.666 and 20/12/2006: 1.516, 1.498, 1.518, 1.492, 1.504;
the combiner keeps only the maximum per date from each mapper, e.g. 16/12/2006 5.388 and 20/12/2006 1.518, before the shuffle and sort, reduce step]
60
Combiners To compute the maximum the reducer and the combiner can be the same script max() is associative and commutative max([a,b]) = max([b,a]) max([a, max([b,c])]) = max([max([a,b]),c]) 61
Combiners To test your combiner scripts $ cat data.txt | ./mapper.py | sort | ./combiner.py | ./reducer.py Hadoop sorts locally before the combiner and globally between the combiner and reducer Note that Hadoop does not guarantee how many times the combiner will be run 62
Combiners What is the benefit of reducing the number of key-value pairs sent to the reducer? A) The amount of work done in the Map phase is reduced B) The amount of work done in the Reduce phase is reduced C) The amount of data sent through the network is reduced D) More than one of the above E) None of the above 63
Combiners What is the benefit of reducing the number of key-value pairs sent to the reducer? A) The amount of work done in the Map phase is reduced B) The amount of work done in the Reduce phase is reduced C) The amount of data sent through the network is reduced D) More than one of the above E) None of the above Answer: D (both B and C). 64
Histograms
Histograms 'bin' continuous data into discrete ranges
[Image: example histogram; credit: Mwtoews, 2008]
65
Exercise 6: Histogram
Write a mapper that uses round(power) to 'bin' the minute-averaged active power readings (3rd column)
Output for each reading: [power bin], 1
Write a reducer that creates combined counts
Input: [power bin], count
Output: [power bin], count
This script must also function as a combiner
Submit your tested scripts as a Hadoop job, using the reducer script as a combiner
$ hadoop jar $HADOOP_STREAM ... -combiner reducer_hist.py ...
You may generate a plot (plotting is outside the scope of the workshop)
$ hdfs dfs -cat histogram/part-00000 | solutions/plot_hist.py
66
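A possible sketch for this exercise (script names are illustrative; the workshop's own solutions may differ). The mapper bins each reading with round(); the reducer sums the counts per bin, and because addition is associative and commutative the same script can also serve as the combiner:

    #!/usr/bin/env python
    # mapper_hist.py (sketch): bin the minute-averaged active power with round()
    import sys

    for line in sys.stdin:
        try:
            power = float(line.split(';')[2])
        except (IndexError, ValueError):
            continue                         # skip the header and '?' entries
        print '%d\t%d' % (int(round(power)), 1)

    #!/usr/bin/env python
    # reducer_hist.py (sketch): sum the counts per bin; also usable as the combiner
    import sys

    prevbin = None
    total = 0
    for line in sys.stdin:
        binval, count = line.split('\t', 1)
        count = int(count)
        if binval == prevbin:
            total += count
        else:
            if prevbin is not None:
                print '%s\t%d' % (prevbin, total)
            prevbin = binval
            total = count
    if prevbin is not None:
        print '%s\t%d' % (prevbin, total)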
Histogram 67
Mean and Standard Deviation
We can't easily use combiners to compute the mean
max(max(a,b), max(c,d,e)) = max(a,b,c,d,e)
mean(mean(a,b), mean(c,d,e)) != mean(a,b,c,d,e)
Reducer function can be used as a combiner if
associative: (A*B)*C = A*(B*C)
commutative: A*B = B*A
e.g.: counting, addition, multiplication, ...
Computing the mean and standard deviation means the reducer is stuck with a lot of math
Combiner idea for mean: key = intermediate sum, value = number of entries
68
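One way that idea could look in code (a sketch, and only one of several possible layouts: here the original grouping key, e.g. the weekday, is kept, and the intermediate sum is packed into the value together with the number of entries; it assumes the mapper emits key<TAB>power,1 for each record):

    #!/usr/bin/env python
    # combiner_mean.py (sketch): merge partial sums and entry counts per key
    import sys

    prevkey = None
    total = 0.0
    n = 0
    for line in sys.stdin:
        key, value = line.split('\t', 1)
        part_sum, part_n = value.split(',')
        if key == prevkey:
            total += float(part_sum)
            n += int(part_n)
        else:
            if prevkey is not None:
                print '%s\t%f,%d' % (prevkey, total, n)
            prevkey = key
            total = float(part_sum)
            n = int(part_n)
    if prevkey is not None:
        print '%s\t%f,%d' % (prevkey, total, n)

The reducer would perform the same merge and finally divide: mean = total / n. Extending this to the standard deviation also requires carrying a sum of squares in the value.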
Power consumption by day of week Are there days of the week when more power is consumed, on average? Want to know the mean and standard deviation for each week day Simplification: Compute average of minute-averaged powers, grouped by day of week 69
Python datetime
The datetime module is a powerful library for working with dates and times
We can easily find the day of the week from the date

from datetime import datetime
weekday = datetime.strptime(date, "%d/%m/%Y").weekday()
70
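For example, with a date taken from the sample data earlier in the slides:

    from datetime import datetime

    # 16/12/2006 was a Saturday; weekday() numbers Monday as 0
    print datetime.strptime("16/12/2006", "%d/%m/%Y").weekday()    # prints 5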
Exercise 7: Mean and Standard Deviation Write mapper and reducer code to compute the mean and standard deviation for active power (3rd column) for each of the seven days of the week Test your scripts using serial Bash commands Submit your job to Hadoop Tip: Wikipedia's 'Algorithms for calculating variance' article includes Python code to compute mean and variance in a single pass 71
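A sketch of one possible reducer for this exercise (assuming the mapper emits weekday<TAB>power, and that we accumulate the count, sum, and sum of squares per weekday in a single pass; the Welford method described in the Wikipedia tip is more numerically stable):

    #!/usr/bin/env python
    # reducer_weekday_stats.py (sketch): per-weekday mean and standard deviation
    import sys
    from math import sqrt

    def emit(day, n, s, s2):
        mean = s / n
        var = s2 / n - mean * mean        # naive formula; Welford's method is more stable
        print '%s\t%f\t%f' % (day, mean, sqrt(max(var, 0.0)))

    prevday = None
    n, s, s2 = 0, 0.0, 0.0
    for line in sys.stdin:
        day, power = line.split('\t', 1)
        power = float(power)
        if day == prevday:
            n += 1
            s += power
            s2 += power * power
        else:
            if prevday is not None:
                emit(prevday, n, s, s2)
            prevday = day
            n, s, s2 = 1, power, power * power
    if prevday is not None:
        emit(prevday, n, s, s2)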
Speedup - Mean and St.Dev.
~2 million entries
Serial Bash version: $ cat household_power_consumption.txt | ./mapper.py | sort | ./reducer.py
75 seconds
Hadoop version:
2 mappers, 1 reducer: 50 seconds, Speedup: 1.5X
2 mappers, 2 reducers: 48 seconds, Speedup: 1.6X (-D mapred.reduce.tasks=2)
4 mappers, 4 reducers: 29 seconds, Speedup: 2.6X
72
Choosing numbers of maps/reduces Mappers More mappers increases parallelism Too many mappers increases scheduling overhead Hadoop automatically sets the number of mappers according to the block size and input data size Reducers Too few reducers increases the computational load on each reducer Too many reducers increases shuffle and HDFS overhead Rule of thumb: Each reducer should process 1-10GB of data 73
Iterative MapReducing Many tasks in scientific computing cannot be easily expressed as a single MapReduce job Often, we require iterating over data K-means clustering is an example We will see how it can be implemented in MapReduce We will not implement it, just see how it works with the MapReduce framework A MapReduce K-means clustering is implemented in Mahout (scalable machine learning algorithms) 74
K-means clustering Unsupervised machine learning Divide a data set into k different categories based on the features of that data set Computational hotspot: computing distances between each cluster centroid and each data point, O(n*k) E.g. Clothing manufacturer: based on customer's height and weight data, divide them into 3 or more size categories E.g. Categorize astronomical objects into stars, galaxies, quasars, etc. based on spectral data E.g. Categorize gene expression profiles to study function within similar expressions 75
K-means clustering
Step 1: Randomly generate K locations (circles)
Step 2: Group data points by proximity to locations
Step 3: Update locations to the centroid of each group
Iterate over steps 2 and 3
[Images: I, Weston.pace]
76
MapReduce K-Means
[Diagram: data points and the current centroid locations are fed to the mappers; each mapper calculates centroid distances and assigns data points to the nearest centroid, emitting key: best centroid, value: data point; the reducer computes new centroids, emitting key: old centroid, value: new centroid; if not converged, iterate with the new centroids, otherwise output the final centroid locations]
77
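A rough sketch of the per-iteration map and reduce logic (illustrative only, not the workshop's code; in a streaming implementation the current centroid locations would be distributed to the mappers as a small side file, and the function names here are hypothetical):

    # Sketch of one k-means iteration expressed as map and reduce steps

    def kmeans_map(points, centroids):
        # points and centroids are lists of coordinate tuples
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            best = distances.index(min(distances))     # index of the nearest centroid
            print '%d\t%s' % (best, ','.join(str(x) for x in p))

    def kmeans_reduce(centroid_id, assigned_points):
        # assigned_points: all points (coordinate tuples) assigned to this centroid
        n = float(len(assigned_points))
        new_centroid = [sum(coords) / n for coords in zip(*assigned_points)]
        print '%d\t%s' % (centroid_id, ','.join(str(x) for x in new_centroid))

A driver script would rerun these two steps with the new centroids until they stop moving, as in the diagram.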
Iterative MapReducing To make iterative jobs easier, the Hadoop ecosystem has tools for iterative workloads Twister - Iterative MapReduce framework HaLoop - Iterative MapReduce framework Mahout - Scalable implementations of machine learning algorithms on Hadoop (including k-means) Spark - Framework for in-memory distributed computing, in-memory data sharing between jobs Make use of high-level interfaces to MapReduce for more complex jobs Pig, Mahout, etc. 78
What questions do you have? 79
In the time remaining... Import your own data into HDFS for analysis Your quota is 300GB (after replication) by default Examine some data from Twitter /software/workshop/twitter_data 3.8 million tweets + metadata ~ 11 GB Continue to work with the workshop data sets titanic.txt household_power_consumption.txt usask_access_log Contact us to add your user account to the Hadoop test environment (class accounts deactivate later today) guillimin@calculquebec.ca 80
Keep Learning... Contact us for access to our Hadoop test system guillimin@calculquebec.ca Download a Hadoop virtual machine http://hortonworks.com/products/hortonworks-sandbox View online training materials https://www.udacity.com/course/ud617 http://cloudera.com/content/cloudera/en/training/library.html Calcul Quebec workshop on Apache Spark (French) https://wiki.calculquebec.ca/w/formations 81
Bonus Exercise: Top 10 websites Produce a top 10 list of websites accessed on the University of Saskatchewan website usask_access_logs Be careful: Some lines won't conform to your expectations How to handle? skip? exception? 82
Top 10 List
Mapper output: key 1
Combiner: returns the top 10 for each mapper; output: key count
Reducer: finds the global top 10; output: key count
83
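A sketch of the combiner/reducer logic described above (an in-memory approach that assumes the distinct keys seen by one task fit in memory; the same script could serve as both combiner and reducer, since each simply keeps its local top 10):

    #!/usr/bin/env python
    # top10.py (sketch): usable as both combiner and reducer
    import sys

    counts = {}
    for line in sys.stdin:
        key, count = line.split('\t', 1)
        counts[key] = counts.get(key, 0) + int(count)

    # keep only the 10 largest counts (per mapper in the combiner, globally in the reducer)
    top10 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for key, count in top10:
        print '%s\t%d' % (key, count)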