Large Scale Learning


Data hypergrowth: an example
- Reuters-21578: about 10K docs (ModApte) (Bekkerman et al., SIGIR 2001)
- RCV1: about 807K docs (Bekkerman & Scholz, CIKM 2008)
- LinkedIn job title data: about 100M docs (Bekkerman & Gavish, KDD 2011)
[Figure: dataset size on a log scale, 10^4 to 10^8 docs, plotted against year, 2000-2012.]
Slide by R. Bekkerman, M. Bilenko, J. Langford

New age of big data
- The world has gone mobile: 5 billion cellphones produce daily data
- Social networks have gone online: Twitter produces 200M tweets a day
- Crowdsourcing is the reality: labeling of 100,000+ data instances is doable within a week
Slide by R. Bekkerman, M. Bilenko, J. Langford

Size matters
- One thousand data instances
- One million data instances
- One billion data instances
- One trillion data instances
Those are not different numbers, those are different mindsets.
Slide by R. Bekkerman, M. Bilenko, J. Langford

One million data instances
- Currently, the most active zone
- Can be crowdsourced
- Can be processed by a quadratic algorithm, once parallelized
- A 1M-instance data collection cannot be too diverse, but it can be too homogeneous
- Preprocessing / data probing is crucial
Slide by R. Bekkerman, M. Bilenko, J. Langford

Big datasets cannot be too sparse
- 1M data instances cannot belong to 1M classes, simply because it's not practical to have 1M classes
- Here's a statistical experiment in the text domain: 1M documents, each 100 words long, randomly sampled from a unigram language model, with no stopwords
- Result: 245M pairs have a word overlap of 10% or more
- Real-world datasets are denser than random
Slide by R. Bekkerman, M. Bilenko, J. Langford
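
The experiment is easy to reproduce at a smaller scale. The sketch below is an illustrative, scaled-down version, not the authors' original setup: the vocabulary size, document count, and Zipfian unigram model are all assumptions on my part.

```python
import itertools
import random

# Scaled-down version of the slide's experiment (the original used 1M docs
# of 100 words each; those numbers make an O(n^2) pair scan impractical
# here). VOCAB_SIZE, NUM_DOCS, and the Zipfian word distribution are
# illustrative assumptions.
VOCAB_SIZE = 10_000
DOC_LEN = 100
NUM_DOCS = 1_000

random.seed(0)
zipf_weights = [1.0 / rank for rank in range(1, VOCAB_SIZE + 1)]
docs = [frozenset(random.choices(range(VOCAB_SIZE), weights=zipf_weights,
                                 k=DOC_LEN))
        for _ in range(NUM_DOCS)]

# Count document pairs whose word overlap is at least 10% of DOC_LEN.
threshold = 0.10 * DOC_LEN
overlapping = sum(1 for a, b in itertools.combinations(docs, 2)
                  if len(a & b) >= threshold)
total = NUM_DOCS * (NUM_DOCS - 1) // 2
print(f"{overlapping} of {total} pairs overlap by 10% or more")
```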

One billion data instances
- Web-scale
- Guaranteed to contain data in different formats: ASCII text, pictures, JavaScript code, PDF documents
- Guaranteed to contain (near-)duplicates
- Likely to be badly preprocessed
- Storage is an issue
Slide by R. Bekkerman, M. Bilenko, J. Langford

One trillion data instances
- Beyond the reach of modern technology
- The peer-to-peer paradigm is (arguably) the only way to process the data
- Data privacy / inconsistency / skewness issues: the data can't be kept in one location and is intrinsically hard to sample
Slide by R. Bekkerman, M. Bilenko, J. Langford

Not enough (clean) training data?
- Use existing labels as guidance rather than a directive, in a semi-supervised clustering framework
- Or label more data, with a little help from the crowd!
Slide by R. Bekkerman, M. Bilenko, J. Langford

Crowdsourcing labeled data
- Crowdsourcing is a tough business: people are not machines
- Any worker who can game the system will game the system
- A validation framework plus qualification tests are a must
- Labeling a lot of data can be fairly expensive
Slide by R. Bekkerman, M. Bilenko, J. Langford

Let's talk about how we can learn with datasets this large...

Stochastic Gradient Descent

Consider Learning with Numerous Data
Logistic regression objective:
J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \right] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{cost}_\theta(x_i, y_i)
Fit via gradient descent:
\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \big(h_\theta(x_i) - y_i\big)\, x_{ij}
What is the computational complexity in terms of n?
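
As a concrete reference point, here is a minimal NumPy sketch of this objective and a single batch update; the function and variable names are my own. Every step recomputes the sum over all n examples, so one update costs O(nd).

```python
import numpy as np

def h(theta, X):
    """Logistic hypothesis h_theta(x) = sigmoid(theta . x)."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def J(theta, X, y):
    """Average logistic loss over all n examples."""
    p = h(theta, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def batch_gradient_step(theta, X, y, alpha):
    """One batch GD update; touches all n rows of X, hence O(n*d)."""
    grad = X.T @ (h(theta, X) - y) / len(y)
    return theta - alpha * grad
```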

Batch Gradient Descent:
  Initialize \theta
  Repeat {
    \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)\, x_{ij}   for j = 0...d
  }
  (each update is \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta))

Stochastic Gradient Descent:
  Initialize \theta
  Randomly shuffle dataset
  Repeat (typically 1-10x) {
    For i = 1...n, do:
      \theta_j \leftarrow \theta_j - \alpha (h_\theta(x_i) - y_i)\, x_{ij}   for j = 0...d
  }
  (each update is \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}_\theta(x_i, y_i))
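
A minimal sketch of the stochastic variant: one pass over the shuffled data makes n cheap updates instead of one expensive batch update. Names and defaults are mine.

```python
import numpy as np

def sgd(theta, X, y, alpha, num_passes=3, seed=0):
    """Stochastic GD: reshuffle each pass, then update on every example."""
    rng = np.random.default_rng(seed)
    for _ in range(num_passes):            # typically 1-10 passes suffice
        for i in rng.permutation(len(y)):  # visit examples in random order
            p = 1.0 / (1.0 + np.exp(-X[i] @ theta))  # h_theta(x_i), scalar
            theta = theta - alpha * (p - y[i]) * X[i]
    return theta
```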

Batch vs. Stochastic GD
[Figure: two panels contrasting the parameter trajectories of Batch GD and Stochastic GD.]
- Learning rate \alpha is typically held constant
- Can slowly decrease \alpha over time to force \theta to converge, e.g., \alpha = \frac{\text{constant}_1}{\text{iteration number} + \text{constant}_2}
Based on slide by Andrew Ng
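
The decay schedule drops straight into the SGD loop above; the constants here are placeholders that would need tuning per problem.

```python
def decayed_alpha(iteration, constant1=1.0, constant2=50.0):
    """alpha = constant1 / (iteration number + constant2), as on the slide."""
    return constant1 / (iteration + constant2)
```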

Graph- and Data-Parallelism

Map-Reduce
[Diagram: the training set is split across Computer 1 through Computer 4; each processes its portion in parallel, and the results are combined.]
Based on slide by Andrew Ng

Multi-Core Machines
[Diagram: the same pattern on a single machine; the training set is split across Core 1 through Core 4, and the results are combined.]
Based on slide by Andrew Ng

Map-Reduce for Batch GD
Split the dataset into chunks (e.g., with n = 400) to compute \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)\, x_{ij}:
- Machine 1 takes (x_1, y_1), ..., (x_{100}, y_{100}) and computes temp_j^{(1)} = \sum_{i=1}^{100} (h_\theta(x_i) - y_i)\, x_{ij}
- Machine 2 takes (x_{101}, y_{101}), ..., (x_{200}, y_{200}) and computes temp_j^{(2)} = \sum_{i=101}^{200} (h_\theta(x_i) - y_i)\, x_{ij}
- Machine 3 takes (x_{201}, y_{201}), ..., (x_{300}, y_{300}) and computes temp_j^{(3)} = \sum_{i=201}^{300} (h_\theta(x_i) - y_i)\, x_{ij}
- Machine 4 takes (x_{301}, y_{301}), ..., (x_{400}, y_{400}) and computes temp_j^{(4)} = \sum_{i=301}^{400} (h_\theta(x_i) - y_i)\, x_{ij}
Based on example by Andrew Ng

Map-Reduce for Batch GD (continued)
Combine the results: \theta_j \leftarrow \theta_j - \alpha \frac{1}{400} \sum_{k=1}^{4} temp_j^{(k)}
Based on example by Andrew Ng
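
The chunk-and-combine pattern maps directly onto code. The sketch below simulates it with local worker processes; it is my own illustration of the pattern, not an actual Hadoop job, and the function names are invented.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def partial_gradient(args):
    """Map step: partial sum of (h_theta(x_i) - y_i) * x_i over one chunk."""
    theta, X_chunk, y_chunk = args
    p = 1.0 / (1.0 + np.exp(-X_chunk @ theta))
    return X_chunk.T @ (p - y_chunk)

def mapreduce_batch_step(theta, X, y, alpha, num_workers=4):
    """One batch GD update computed as on the slide: 4 maps, 1 reduce."""
    chunks = [(theta, Xc, yc)
              for Xc, yc in zip(np.array_split(X, num_workers),
                                np.array_split(y, num_workers))]
    # Map phase: each worker computes temp^(k) on its own chunk.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        temps = list(pool.map(partial_gradient, chunks))
    # Reduce phase: combine the partial sums into one update.
    return theta - alpha * sum(temps) / len(y)

# Run under `if __name__ == "__main__":` on platforms that spawn workers.
```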

Parallelizing k-means
[Three figure slides illustrating successive stages of parallelizing k-means; no transcribable text.]
Slides by R. Bekkerman, M. Bilenko, J. Langford

k-means on MapReduce
- Mappers read data portions and centroids
- Mappers assign data instances to clusters
- Mappers compute new local centroids and local cluster sizes
- Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
- Reducers write the new centroids
Slide by R. Bekkerman, M. Bilenko, J. Langford
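
A single-process sketch of one such iteration, following the slide's mapper/reducer split. The function names and the in-memory "shuffle" are my own simulation, not a Hadoop API.

```python
import numpy as np
from collections import defaultdict

def mapper(points, centroids):
    """Assign each point to its nearest centroid; emit local sums and sizes."""
    local = defaultdict(lambda: [np.zeros(centroids.shape[1]), 0])
    for x in points:
        k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        local[k][0] += x   # running sum for the local centroid
        local[k][1] += 1   # local cluster size
    return list(local.items())

def reducer(values):
    """Aggregate local centroids, weighted by local cluster sizes."""
    total = sum(s for s, _ in values)
    count = sum(n for _, n in values)
    return total / count   # new global centroid

def kmeans_iteration(partitions, centroids):
    """One MapReduce round: map over partitions, shuffle by key, reduce."""
    grouped = defaultdict(list)
    for part in partitions:                  # map phase
        for k, v in mapper(part, centroids):
            grouped[k].append(v)             # shuffle phase
    new_centroids = centroids.copy()
    for k, vals in grouped.items():          # reduce phase
        new_centroids[k] = reducer(vals)
    return new_centroids
```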

Discussion on MapReduce
- MapReduce is not designed for iterative processing: mappers read the same data again and again
- MapReduce looks too low-level to some people (data analysts are traditionally SQL folks)
- MapReduce looks too high-level to others: a lot of MapReduce logic is hard to adapt (example: grouping documents by words)
Slide by R. Bekkerman, M. Bilenko, J. Langford

GraphLab
- Open-source parallel machine learning
- Developed at Carnegie Mellon Univ.
- Available at www.graphlab.org

For more information...
Scaling Up Machine Learning (Cambridge Univ. Press, released in 2011): 21 chapters covering platforms, algorithms, learning setups, and applications.
Slide by R. Bekkerman, M. Bilenko, J. Langford

Learning Multiple Tasks via Knowledge Transfer

Transfer Learning
Idea: transfer information from one or more source tasks to improve learning on a target task.
[Diagram, Step 1: each source task (Task 1, Task 2, ..., Task N) has its own data and learner, producing a model per task; together the learners yield a body of source knowledge.]
There is plenty of training data for each source task.
Eric Eaton

Transfer Learning (continued)
Idea: transfer information from one or more source tasks to improve learning on a target task.
[Diagram, Step 2: the source knowledge is fed, together with the target task's data, into the learner for the new target task, which outputs the target model.]
There is insufficient training data on the target task.
Eric Eaton
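
One simple way to make this two-step picture concrete is parameter transfer: warm-start the target model from the source models, then fine-tune on the scarce target data. This is a hedged sketch of one possible instantiation, not the specific method on the slide.

```python
import numpy as np

def transfer_init(source_thetas):
    """Step 1 -> Step 2 hand-off: pool the source-task parameters."""
    return np.mean(source_thetas, axis=0)

def fine_tune(theta, X_target, y_target, alpha=0.1, num_passes=20):
    """Fine-tune a warm-started logistic model on the small target set."""
    for _ in range(num_passes):
        p = 1.0 / (1.0 + np.exp(-X_target @ theta))
        theta = theta - alpha * X_target.T @ (p - y_target) / len(y_target)
    return theta

# Usage: theta_new = fine_tune(transfer_init(source_thetas), X_tgt, y_tgt)
```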

Benefits of Transfer in Learning
Primary goal: learning the target task T_new better after first learning related source tasks T_1, ..., T_N. "Better" means some combination of:
- more rapid learning
- improved initial performance
- higher achievable performance
[Three plots of performance vs. # training examples, each comparing learning with transfer to learning without transfer; figures adapted from (DARPA/IPTO, 2005).]
Secondary goal: creating chunks of reusable knowledge.
Eric Eaton

Multi-Task Learning
Idea: learn all task models simultaneously, sharing knowledge (Caruana 1997; Zhang et al. 2008; Kumar & Daumé 2012).
[Diagram: the data for Task 1, Task 2, ..., Task N all feed a single multi-task learner, which outputs a model for each task.]
Eric Eaton
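
One common realization of this sharing, sketched below under my own modeling assumptions, decomposes each task's parameters into a shared component plus a penalized task-specific offset, so that all tasks are fit jointly.

```python
import numpy as np

def multitask_least_squares(tasks, lam=1.0, alpha=0.01, num_iters=500):
    """Joint GD on theta_t = shared + offset_t with an L2 penalty on offsets.

    tasks: list of (X_t, y_t) pairs sharing a common feature dimension d.
    """
    d = tasks[0][0].shape[1]
    shared = np.zeros(d)
    offsets = [np.zeros(d) for _ in tasks]
    for _ in range(num_iters):
        grad_shared = np.zeros(d)
        for t, (X, y) in enumerate(tasks):
            residual = X @ (shared + offsets[t]) - y
            g = X.T @ residual / len(y)
            offsets[t] = offsets[t] - alpha * (g + lam * offsets[t])
            grad_shared += g
        shared = shared - alpha * grad_shared / len(tasks)
    return shared, offsets
```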