Scalable Machine Learning - or what to do with all that Big Data infrastructure



Similar documents
Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big-data Analytics: Challenges and Opportunities

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

L3: Statistical Modeling with Hadoop

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Mammoth Scale Machine Learning!

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

Recommender Systems Seminar Topic : Application Tung Do. 28. Januar 2014 TU Darmstadt Thanh Tung Do 1

Spark and the Big Data Library

Challenges for Data Driven Systems

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Architectures for Big Data Analytics A database perspective

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Distributed Computing and Big Data: Hadoop and MapReduce

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data Analytics. Lucas Rego Drumond

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Table of Contents. June 2010

Azure Machine Learning, SQL Data Mining and R

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Jubatus: An Open Source Platform for Distributed Online Machine Learning

Weekly Sales Forecasting

Journée Thématique Big Data 13/03/2015

MapReduce/Bigtable for Distributed Optimization

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

MADlib. An open source library for in-database analytics. Hitoshi Harada PGCon 2012, May 17th

The Stratosphere Big Data Analytics Platform

Maximize Revenues on your Customer Loyalty Program using Predictive Analytics

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Spark: Cluster Computing with Working Sets

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (

Machine Learning over Big Data

Steven C.H. Hoi School of Information Systems Singapore Management University

Large Scale Learning to Rank

Big Data Analytics CSCI 4030

Big Data Research in Berlin BBDC and Apache Flink

Intelligent Heuristic Construction with Active Learning

Sibyl: a system for large scale machine learning

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Prerequisites. Course Outline

Compact Summaries for Large Datasets

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Massive scale analytics with Stratosphere using R

Big Data Analytics. Lucas Rego Drumond

Play with Big Data on the Shoulders of Open Source

Summary Data Structures for Massive Data

The Need for Training in Big Data: Experiences and Case Studies

Hadoop Ecosystem B Y R A H I M A.

Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications

Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

HIGH PERFORMANCE BIG DATA ANALYTICS

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

CAS CS 565, Data Mining

Similarity Search in a Very Large Scale Using Hadoop and HBase

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Advanced In-Database Analytics

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Collaborative Filtering. Radek Pelánek

Fast Analytics on Big Data with H20

MLlib: Scalable Machine Learning on Spark

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Unified Big Data Processing with Apache Spark. Matei

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

FEATURE 1. HASHING-TRICK

Cloud Computing at Google. Architecture

Hadoop Usage At Yahoo! Milind Bhandarkar

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Turn Big Data to Small Data

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Real Time Analytics for Big Data. NtiSh Nati

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Big Data from a Database Theory Perspective

AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP

Large-Scale Test Mining

SQL Server 2005 Features Comparison

A Professional Big Data Master s Program to train Computational Specialists

Big Data Analytics - Accelerated. stream-horizon.com

Classification using Logistic Regression

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

Scaling Out With Apache Spark. DTL Meeting Slides based on

Introduction to Big Data with Apache Spark UC BERKELEY

B669 Sublinear Algorithms for Big Data

Data De-duplication Methodologies: Comparing ExaGrid s Byte-level Data De-duplication To Block Level Data De-duplication

Learning at scale on Hadoop

Genome Analysis in a Dynamically Scaled Hybrid Cloud

Introduction to Machine Learning Using Python. Vikram Kamath

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Big Data and Data Science: Behind the Buzz Words

Big Data With Hadoop

Transcription:

- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1

Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection Image Classification Recommendation... 2

Click Through Prediction 3

Personalized Spam Detection 4

Supervised Learning in a Nutshell 5

Large Scale Learning 100 millions of examples Millions of (potential) features Linear models: Learn one weight per feature Deep learning: Complex models with several stages of processing 6

ML Pipeline 7

Easily parallelized Data preparation, exactraction, normalization Parallel runs of ML pipelines for evaluation Mass application of data sets 8

What about the training step? Optimization problems Sometimes closed-form solutions ( matrix computations) Most often not, or it is slow 9

Gradient Descent Very generic way to optimize functions Update parameters along line of steepest descent of cost function e.g. logistic regression 10

Gradient Descent in Parallel 11

Stochastic Gradient Descent #1 workhorse for large scale learning Gradient Descent = Error on whole data set Stochastic Gradient Descent = consider one point at a time 1. predict on one point 2. compute gradient for that one point 3. update parameters 12

Why can we do this? Learning is about prediction performance on future data Lots of freedom to find alternative algorithms 13

Ad Click Prediction at Google SIGKDD '13 Variant of SGD With tons of hacks: 16-bit fixed point, sharing memory, etc. No parallelization! 14

SGD is hard to parallelize 15

Parameter Servers Dean et al. Large Scale Distributed Deep Networks, NIPS 2012 Li et al. Communication Efficient Distributed Machine Learning with the Parameter Server, NIPS 2014 16

Parallelized vs. Distributed Search Faster optimization by parallel searches Randomized updates lead to overall better performance 17

Approximation is key Goal is good predictions on future data You can solve an easier problem instead not just for the learning algorithms! 18

Approximating Features with Hashing Millions of features, but often very sparse Use random projections to compress feature space Attenberg et al. Collaborative Email-Spam Filtering with the Hashing-Trick CEAS 2009 19

Counting: Count Min-Sketch update: hash value, increase counters query: hash value, take minimum Cormode, Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 2004 21

Clustering with Count Min-Sketches Aggarwal et al. A Framework for Clustering Massive-Domain Data Streams, ICDE 2009 22

Scalable ML is... NOT: learning algorithms + parallelization INSTEAD: faster learning algorithms approximative optimization feature hashing approximative counting asynchronous distributed search AND parallelization 24

So where are we with Spark and friends Original MR paper: Focus on simple stuff Distributed Collections & Dataflow the right abstraction? Distributed Matrices? Who tells us what the right approximation is? New mode of asynchronous, distributed, approximative computation? OR go with something which is slower but easier to parallelize? 25

Summary Large scale data sets: billions of examples, millions of features A lot of the ML pipeline (feature extraction, data preparation, evaluation) is parallelizable and should be! Large scale learning algorithms are harder to parallelize Approximation, asynchronous distributed computing, etc. 26

Thank you! 27