COMP 598 Applied Machine Learning
Lecture 21: Parallelization methods for large-scale machine learning

Instructor: Joelle Pineau (jpineau@cs.mcgill.ca)
TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca), Ryan Lowe (ryan.lowe@mail.mcgill.ca)
Class web page: www.cs.mcgill.ca/~jpineau/comp598

Big Data by the numbers
- 20+ billion web pages x 20 KB = 400+ TB.
- One computer reads 30-35 MB/sec from disk: ~4 months just to read the web, and ~1000 hard drives just to store it.
- In 2011: an estimated 1M machines used by Google, with an expected 1000 machine failures every day!
- In 2012: an estimated 100 PB of data on Facebook's cluster.
- In 2014: an estimated 455 PB of data stored on Yahoo!'s cluster.
- Copying large amounts of data over a network takes time. It is hard to do something useful with all that data!
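A quick back-of-the-envelope check of the figures above (a minimal sketch; the ~500 GB commodity drive size is an assumption, not from the slides):

```python
# Sanity-check the "read the web" arithmetic from the slide above.
web_size_tb = 400                 # 20e9 pages * 20 KB
read_rate_mb_s = 32.5             # midpoint of 30-35 MB/sec

seconds = web_size_tb * 1e6 / read_rate_mb_s       # TB -> MB, then MB / (MB/s)
print(f"{seconds / (3600 * 24 * 30):.1f} months to read the web")  # ~4.7 months

drives = web_size_tb * 1e3 / 500  # assumes ~500 GB commodity drives
print(f"~{drives:.0f} hard drives to store it")    # ~800, i.e. on the order of 1000
```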

Big Data
Big data means a large number of features, a large number of datapoints, or both.
Major issues for machine learning:
- Need distributed data storage / access.
- Can't use algorithms that are superlinear in the number of examples.
- A large number of features causes overfitting.
- Need parallelized algorithms.
- Often need to deal with unbalanced datasets.
- Need to represent high-dimensional data.

An era of supercomputers
http://top500.org/lists/2014/06/

https://computecanada.ca/

Hadoop
Open-source software framework for data-centric computing: Distributed File System + MapReduce + advanced components.
- MapReduce is the system that assigns jobs to nodes in the cluster.
- Allows distributed processing of large datasets across clusters.
- High degree of fault tolerance: when nodes go down, jobs are re-distributed, and data is replicated multiple times in small chunks.
- Substantial uptake by industry.

Hadoop Distributed File System (HDFS)
- All data is stored multiple times: robust to failure, no need to back up a Hadoop cluster, and multiple access points for any one piece of data.
- Files are stored in 64 MB chunks ("shards"). Sequential reads are fast: a 100 MB/s disk needs 0.64s to read a chunk but only ~0.01s to start reading it.
- Not made for random reads; not efficient for accessing small files.
- No support for file modification.
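A quick check of the chunk-read arithmetic (a minimal sketch; the 0.01s figure is the slide's startup/seek latency):

```python
# Verify the HDFS chunk-read figures quoted above.
chunk_mb, disk_mb_s, startup_s = 64, 100, 0.01

read_s = chunk_mb / disk_mb_s                    # 0.64s per 64 MB chunk
overhead = startup_s / (startup_s + read_s)      # ~1.5% of a sequential read
print(f"sequential read: {read_s:.2f}s, startup overhead {overhead:.1%}")
# Small random reads pay the 0.01s startup cost on every access,
# which is why HDFS favors large sequential reads.
```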

MapReduce
Programming framework, developed by Google, for applying parallel algorithms to large datasets across a cluster of computers.
- Map: partition the data across nodes; perform computation (independently) on each partition.
- Reduce: merge values from nodes to produce the global output.

Example
Task: count word occurrences across many documents.

Word count example

Execution
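The original slides show the word-count map/reduce pseudo-code and its execution diagram; here is a minimal runnable sketch of the same pattern on a single machine (the function names are illustrative, not Hadoop's API):

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in one document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: merge all counts emitted for a given word.
    return (word, sum(counts))

def map_reduce(documents):
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # shuffle: group values by key
    return [reduce_fn(k, v) for k, v in groups.items()]  # reduce phase

docs = ["the cat sat", "the dog sat"]
print(sorted(map_reduce(docs)))
# [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```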

Parallel Execution

Task scheduling
Typically there are many more tasks than machines, e.g. 200,000 map tasks and 5,000 reduce tasks on 2,000 machines.

Fault tolerance
Handled via re-execution.
- Worker failure: detected via periodic checking. Completed and in-progress map tasks are re-executed, as are in-progress reduce tasks.
- Master failure: could be handled, but is highly unlikely.
- Slow workers significantly lengthen completion time: near the end of the job, spawn backup copies of in-progress tasks.

Growing applicability

Properties
Advantages:
- Automatic parallelization and distribution.
- Fault tolerance.
- I/O scheduling.
- Status and monitoring.
Limitations:
- Cannot share data across jobs.

MapReduce for machine learning
Many ML algorithms can be re-written in summation form. Consider linear regression:

    Err(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2
    \hat{w} = (X^T X)^{-1} X^T Y

Let A = X^T X and B = X^T Y. Since

    A = \sum_{i=1}^{n} x_i x_i^T,    B = \sum_{i=1}^{n} x_i y_i,

the computation of A and B can be split between different nodes.
Other algorithms amenable to this form: Naïve Bayes, LDA/QDA, K-means, logistic regression, neural networks, mixture of Gaussians (EM), SVMs, PCA.
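A minimal sketch of this summation form, simulating the map/reduce split over data shards on one machine (the shard count and helper names are illustrative):

```python
import numpy as np

def map_partial_stats(X_shard, y_shard):
    # Map: each node computes partial sufficient statistics on its shard.
    A = X_shard.T @ X_shard        # sum of x_i x_i^T over the shard
    B = X_shard.T @ y_shard        # sum of x_i y_i over the shard
    return A, B

def reduce_solve(partials):
    # Reduce: sum the partial statistics, then solve A w = B.
    A = sum(p[0] for p in partials)
    B = sum(p[1] for p in partials)
    return np.linalg.solve(A, B)   # w = (X^T X)^{-1} X^T y

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.01 * rng.normal(size=1000)

shards = zip(np.array_split(X, 4), np.array_split(y, 4))   # 4 "nodes"
w_hat = reduce_solve([map_partial_stats(Xs, ys) for Xs, ys in shards])
print(np.round(w_hat, 2))          # recovers approximately [1 2 3 4 5]
```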

Parallel linear learning
Given 2.1 TB of data, how can you learn a good linear predictor, and how long does it take?
- 17B examples, 16M parameters, 1K computation nodes.
- Stochastic gradient descent will take a long time going through all this data!
- Answer: with parallelization it can be done in 70 minutes, a throughput of 500M features/second, faster than the I/O bandwidth of a single machine.

Parallel learning for parameter search
One of the most common uses of parallelization in ML.
- Map: create separate jobs, each exploring a different subset of parameter settings; train separate models and measure performance on a validation set.
- Reduce: select the setting with the best validation-set performance.
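A minimal sketch of this map/reduce parameter search, using process-level parallelism on one machine (scikit-learn and the grid of C values are illustrative choices, not from the slides):

```python
from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def map_train_and_score(C):
    # Map: one independent job per hyperparameter setting.
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    return C, model.score(X_val, y_val)

if __name__ == "__main__":
    grid = [0.01, 0.1, 1.0, 10.0]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(map_train_and_score, grid))
    # Reduce: keep the setting with the best validation accuracy.
    best_C, best_acc = max(results, key=lambda r: r[1])
    print(f"best C = {best_C}, validation accuracy = {best_acc:.3f}")
```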

Parallel learning of decision trees
- Usually, all data examples are used to select each subtree split; if the data is not partitioned properly, parallelization may produce poor trees.
- When parallelizing, each processor can handle either a subset of the features (easier to parallelize efficiently) or a subset of the training data (more complex to parallelize properly).
- It is easier to parallelize the learning of Random Forests (true of bagging in general).

PLANET (Panda et al., VLDB '09)
Parallel Learner for Assembling Numerous Ensemble Trees.
- Insight: build the tree level by level.
- Map: consider a number of possible splits on subsets of the data; the intermediate values are partial statistics on the number of points.
- Reduce: determine the best split from the partial statistics. Repeat.
- One MapReduce step builds one level of the tree.
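A simplified sketch of the PLANET idea for choosing one split: map emits partial class counts per candidate threshold from each data shard, and reduce merges them and picks the split with the lowest weighted Gini impurity (one feature, binary labels, and the threshold grid are all simplifying assumptions):

```python
import numpy as np

def map_split_stats(x_shard, y_shard, thresholds):
    # Map: partial class counts on each side of every candidate split.
    stats = {}
    for t in thresholds:
        left = x_shard <= t
        stats[t] = (np.bincount(y_shard[left], minlength=2),
                    np.bincount(y_shard[~left], minlength=2))
    return stats

def gini(counts):
    n = counts.sum()
    return 0.0 if n == 0 else 1.0 - ((counts / n) ** 2).sum()

def reduce_best_split(partials, thresholds):
    # Reduce: merge partial counts, score each split, keep the best.
    best = None
    for t in thresholds:
        left = sum(p[t][0] for p in partials)
        right = sum(p[t][1] for p in partials)
        n = left.sum() + right.sum()
        score = (left.sum() * gini(left) + right.sum() * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y = (x > 0.6).astype(int)                       # true split at 0.6
thresholds = [round(t, 1) for t in np.arange(0.1, 1.0, 0.1)]
shards = zip(np.array_split(x, 4), np.array_split(y, 4))
partials = [map_split_stats(xs, ys, thresholds) for xs, ys in shards]
print(reduce_best_split(partials, thresholds))  # best threshold 0.6, impurity 0.0
```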

Final notes
- Mini-project #2 evaluations are available on CMT.
- Mini-project #3 peer reviews are due Nov. 25.
- Mini-project #4: has everyone started thinking about a topic, dataset, and team? Oral presentations Dec. 2, during class.
- Midterm: Friday, November 20, 11:30am-1pm, split across 2 rooms:
  - Last name A-K: Burnside 1B36
  - Last name L-Z: MAASS 10
  Closed book, no calculator; you may bring 1 page (2-sided) of hand-written notes. Covers lectures 1-21.
- Course evaluations are now available on Minerva. Please fill them out!

Today's lecture
Significant material (pictures, text, equations) for these slides was taken from:
- http://www.stanford.edu/class/cs246
- http://cilvr.cs.nyu.edu/doku.php?id=courses:bigdata:slides:start
- http://research.google.com/archive/mapreduce-osdi04-slides/indexauto-0002.html
- http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_725.pdf