Sibyl: a system for large scale machine learning


Sibyl: a system for large scale machine learning Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, Yoram Singer

Machine Learning Background
- Use the past to predict the future
- Core technology for internet-based prediction tasks
- Examples of problems that can be solved with machine learning:
  - Classify email as spam or not
  - Estimate relevance of an impression in context: search, advertising, videos, etc.
  - Rank candidate impressions
- The internet adds a scaling challenge: 100s of millions of users interacting every day
- Good solutions require a mix of theory and systems

Overview of Results
- Built a large scale machine learning system:
  - Used recently developed machine learning algorithms
  - Algorithms have provable convergence & quality guarantees
  - Solves internet scale problems with reasonable resources
  - Flexible: various loss functions and regularizations
- Used numerous well known systems techniques:
  - MapReduce for scalability
  - Multiple cores and threads per computer for efficiency
  - GFS to store lots of data
  - Compressed column-oriented data format for performance

Inference and Learning
- Objective: draw reliable inferences from all the evidence in our data
  - Is this email SPAM?
  - Is this webpage porn?
  - Will this user click on that ad?
- Learning: create concise representations of the data to support good inferences

Many, Sparse Features
- Many elementary features: words, etc.
- Complex features: combinations of elementary features
- Most elementary features are infrequent (e.g. from discretization of real-valued features)
- Most complex features don't occur at all
- We want algorithms that scale well with the number of features actually present, not with the number of possible features

Supervised Learning
- Given feature-based representation
- Feedback through a label:
  - Good or Bad
  - Spam or Not-spam
  - Relevant or Not-relevant
- Supervised learning task: given training examples, find an accurate model that predicts their labels

Machine learning overview
- Training data: labeled examples of the form (Label, Feature 1, ..., Feature n)
- Training produces a model: weights such as Feature 1 = 0.2, ..., Feature n = -0.5
- Model + a new example's features (Feature 1, ..., Feature n) => Predicted label

Example: Spam Prediction
- Feedback on emails: Move to Spam, Move to Inbox
- Lots of features:
  - Viagra ∈ Document
  - IP Address of sender is bad
  - Sender's domain @google.com
  - ...
- Feedback returned daily and grows with time
- New features appear every day

From Emails to Vectors
- User receives an email from an unknown sender
- Email is tokenized: ... Viagra ∈ Document, Sudafed ∈ Document, "Find a young wife" ∈ Document, ...
- Compressed instance: x ∈ {0, 1}^n = (0, 0, 1, 0, 1, 0, ..., 0, 0, 1, 0)

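The tokenize-then-compress step above can be sketched as follows; the vocabulary dictionary and feature names here are hypothetical stand-ins, not Sibyl's actual data format:

```python
# Sketch: turn a tokenized email into a compressed binary instance,
# i.e. a sorted list of the active feature indices (x in {0,1}^n).
def email_to_sparse(tokens, vocab):
    """Map each known token to its feature index; unknown tokens are dropped."""
    return sorted({vocab[t] for t in tokens if t in vocab})

# Hypothetical vocabulary: feature string -> index in the instance vector.
vocab = {"viagra_in_document": 2, "sudafed_in_document": 4, "sender_domain_google": 9}

x = email_to_sparse(["viagra_in_document", "sudafed_in_document"], vocab)
# x == [2, 4]: only the active features are stored, not all n positions.
```

Storing only the active indices is what makes the representation scale with the features actually present rather than the number of possible features.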

Prediction Models
- Capture importance of features:
  - Viagra ∈ Document => score +2.0
  - Sudafed ∈ Document => score +0.5
  - Sender's domain @google.com => score -1.0
- Represented as a vector of weights: w = (0, 0, 2.0, -0.1, 0.5, ..., -1.0, ...)
- Scoring the email: w·x = 2.0 + 0.5 - 1.0 = 1.5
- Logistic regression (used for probability predictions): Probability = 1 / (1 + exp(-w·x))

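Scoring can be sketched directly from the example weights on the slide; the feature indices below are made up for illustration:

```python
import math

# Sparse weight vector from the slide: only nonzero weights are stored.
weights = {2: 2.0, 4: 0.5, 9: -1.0}  # Viagra, Sudafed, sender's domain @google.com

def score(active):
    # w.x over a binary instance: sum the weights of the active features.
    return sum(weights.get(j, 0.0) for j in active)

def spam_probability(active):
    # Logistic regression turns the raw score into a probability.
    return 1.0 / (1.0 + math.exp(-score(active)))

s = score([2, 4, 9])              # 2.0 + 0.5 - 1.0 = 1.5
p = spam_probability([2, 4, 9])   # about 0.82
```

Because both the instance and the weights are sparse, scoring costs time proportional to the active features of the email, not to n.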

Parallel Boosting (Collins, Schapire, Singer 2001)
- Iterative algorithm, each iteration improves the model
- Number of iterations to get within ε of the optimum: O(log(m)/ε²)
- Updates are correlated with gradients, but it is not a gradient algorithm
- Self-tuned step size, large when instances are sparse

Boosting: Illustration


Parallel Boosting Algorithm (instances i, features j)

  mistake probability:   q(i) = 1 / (1 + exp(y_i (w · x_i)))
  positive correlation:  μ⁺_j = Σ_{i : y_i = +1, x_ij = 1} q(i)
  negative correlation:  μ⁻_j = Σ_{i : y_i = -1, x_ij = 1} q(i)
  update (step size η):  w_j += η log(μ⁺_j / μ⁻_j)
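The update above can be written as a dense NumPy sketch; the production system works on sparse, sharded data across many workers, so this single-machine version is illustrative only (the small epsilon guarding the log is my addition, standing in for regularization):

```python
import numpy as np

def boosting_step(X, y, w, eta=1.0, eps=1e-12):
    """One parallel boosting iteration.

    X: (m, n) binary feature matrix; y: (m,) labels in {-1, +1}; w: (n,) weights.
    """
    # Mistake probability q(i) = 1 / (1 + exp(y_i (w . x_i))) for every instance.
    q = 1.0 / (1.0 + np.exp(y * (X @ w)))
    # mu+_j / mu-_j: sum of q(i) over positive / negative instances with x_ij = 1.
    mu_pos = X[y == +1].T @ q[y == +1]
    mu_neg = X[y == -1].T @ q[y == -1]
    # Per-feature update; eps avoids log(0) when a feature has one-sided support.
    return w + eta * np.log((mu_pos + eps) / (mu_neg + eps))
```

Note that every feature's update depends only on sums over instances, which is exactly what makes the algorithm embarrassingly parallel across both examples and features.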

Properties of parallel boosting
- Embarrassingly parallel:
  1. Computes feature correlations for each example in parallel
  2. Features are updated in parallel
- We need to shuffle the outputs of Step 1 for Step 2
- Step size is inversely proportional to the number of active features per example, not the total number of features
  - Good for sparse training data
- Needs some form of regularization

Learning w/ L1 Regularization

[Plot: Loss + Regularization vs. Iterations, falling from about 680 to about 520 over 100 iterations]

Implementing Parallel Boosting
- Embarrassingly parallel
- Stateless, so robust to transient data errors
- Each model is consistent, giving a sequence of models for debugging
- 10-50 iterations to converge
- Data + Model i => MapReduce => Model i+1 => MapReduce => ...
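The Data => MapReduce => Model loop can be mimicked in-process to show the shape of one round; `map_phase`, `reduce_phase`, and the toy data layout are illustrative names, not Sibyl's API:

```python
import math
from collections import defaultdict

def map_phase(example, w):
    """Map: for one example, emit (feature, correlation) pairs."""
    y, active = example  # label in {-1, +1}, list of active feature ids
    q = 1.0 / (1.0 + math.exp(y * sum(w.get(j, 0.0) for j in active)))
    for j in active:
        # Contribution to mu+ for positive labels, to mu- for negative labels.
        yield j, (q, 0.0) if y == +1 else (0.0, q)

def reduce_phase(shuffled, w, eta=1.0, eps=1e-12):
    """Reduce: per feature, sum the correlations and update the weight."""
    new_w = dict(w)
    for j, pairs in shuffled.items():
        mu_pos = sum(p for p, _ in pairs)
        mu_neg = sum(n for _, n in pairs)
        new_w[j] = w.get(j, 0.0) + eta * math.log((mu_pos + eps) / (mu_neg + eps))
    return new_w

def one_iteration(data, w):
    """Simulate the shuffle between map and reduce with a dict of lists."""
    shuffled = defaultdict(list)
    for ex in data:
        for j, corr in map_phase(ex, w):
            shuffled[j].append(corr)
    return reduce_phase(shuffled, w)
```

Each iteration is one MapReduce: the mapper reads Model i, the shuffle groups correlations by feature, and the reducer writes Model i+1.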

Some observations
- We typically train multiple models:
  - To explore different types of features (don't read unnecessary features)
  - To explore different levels of regularization
  - Amortize fixed costs across similar models
- Computers have lots of RAM: store the model and training stats in RAM at each worker
- Computers have lots of cores: design for multi-core
- Training data is highly compressible

Design principle: use column-oriented data store
- Column for each field
- Benefits:
  - Learners read much less data
  - Efficient to transform fields
  - Data compresses better
- Each learner only reads relevant columns
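The benefit can be sketched with plain Python containers (these toy structures stand in for Sibyl's actual on-disk format):

```python
# Row store: every record carries every field, so a learner that only
# needs labels and words still scans sender and ip bytes.
row_store = [
    {"label": 1, "words": ["viagra"], "sender": "x@spam.example", "ip": "1.2.3.4"},
    {"label": 0, "words": ["meeting"], "sender": "a@corp.example", "ip": "5.6.7.8"},
]

# Column store: the same data, one list per field; each field can also be
# compressed on its own, which compresses better than mixed rows.
column_store = {
    "label": [1, 0],
    "words": [["viagra"], ["meeting"]],
    "sender": ["x@spam.example", "a@corp.example"],
    "ip": ["1.2.3.4", "5.6.7.8"],
}

def read_columns(store, wanted):
    # Touches only the requested columns; the others are never read.
    return {c: store[c] for c in wanted}

cols = read_columns(column_store, ["label", "words"])
```

A learner exploring only word features would request just those columns, which is the "read much less data" benefit above.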

Design principle: use model sets
- Train multiple similar models together
- Benefit: amortize fixed costs across models
  - Cost of reading training data
  - Cost of transforming data
- Downsides:
  - Need more RAM
  - Shuffle more data

Design principle: integerize features
- Each column has its own dense integer space
- Encode features in decreasing order of frequency
- Variable-length encoding of integers
- Benefits:
  - Training data compression
  - Store in-memory model and statistics as arrays rather than hash tables (compact, faster, less data to shuffle)
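The two tricks combine: frequent features get small ids, and small ids encode to fewer bytes. A sketch, assuming a standard base-128 varint (the slide does not say which variable-length encoding Sibyl uses, and `integerize` is a hypothetical helper):

```python
from collections import Counter

def integerize(feature_lists):
    """Assign dense integer ids in decreasing order of frequency."""
    freq = Counter(f for fl in feature_lists for f in fl)
    return {f: i for i, (f, _) in enumerate(freq.most_common())}

def varint_encode(n):
    """Base-128 varint: 7 payload bits per byte, high bit means 'continue'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

ids = integerize([["viagra", "free"], ["viagra"]])
# "viagra" is most frequent, so it gets id 0 and encodes to a single byte.
```

Since ids below 128 fit in one byte, ordering by frequency means the common features, which dominate the data, take the least space.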

Design principle: store model and stats in RAM
- Each worker keeps in RAM:
  - A copy of the previous model
  - Learning statistics for its training data
- Boosting requires O(10 bytes) per feature
- Possible to handle billions of features

Design principle: optimize for multi-core
- Share model across cores
- MapReduce optimizations: multi-shard combiners
- Share training statistics across cores

Design principle: use combiners to limit communication

[Diagram: standard mapper vs. mapper with combiner, which pre-aggregates (+) before the shuffle]

Design principle: use combiners to limit communication
- Fewer, larger shards mean less shuffling, but possible stragglers when shards fail

[Chart: map shard output to shuffle vs. map shard input size; larger inputs give less shuffling, smaller inputs give faster recovery]

Design principle: use combiners to limit communication
- Solution: multishard combining
  - Multiple threads per worker
  - Many small map shards per thread
  - One accumulator shared across threads
  - One supershard per worker => less shuffling
  - Spread shards from failed workers across the remaining workers => fewer stragglers

Design principle: use combiners to limit communication

[Diagram: four stages of aggregation: standard mapper, mapper with combiner, combiner per map thread, multishard combiner]
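A multishard combiner can be sketched with threads sharing one accumulator; the class and method names below are illustrative, not Sibyl's:

```python
import threading
from collections import defaultdict

class MultishardCombiner:
    """Many small map shards feed one accumulator shared across a worker's
    threads, so only one combined supershard per worker is shuffled."""

    def __init__(self):
        self._totals = defaultdict(float)
        self._lock = threading.Lock()

    def add(self, partials):
        # Called by each map thread with its shard's (key, value) pairs.
        with self._lock:
            for key, value in partials:
                self._totals[key] += value

    def supershard(self):
        # One combined output per worker: far less shuffling than per-shard output.
        return dict(self._totals)

combiner = MultishardCombiner()
shards = [[("f1", 1.0), ("f2", 2.0)], [("f1", 0.5)], [("f2", 1.5)]]
threads = [threading.Thread(target=combiner.add, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
out = combiner.supershard()  # {"f1": 1.5, "f2": 3.5}
```

Because the shards are small, a failed worker's shards can be redistributed in small pieces across the remaining workers, which is what limits stragglers.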

Compression results
- Data Set 1:
  - 3.2x compression (source is unsorted and has medium compression)
  - 2.6x compression (source is sorted and has medium compression)
  - 1.7x compression (source is sorted and has max compression)
  - string -> int map overhead < 0.5%
- Data Set 2:
  - 1.8x compression (default compression options)
  - string -> int map overhead < 0.5%

Performance results (features/second per core)

  Cores | 1 model | 2 models | 3 models | 4 models | 5 models
    80  |  1.8M   |  4.0M    |  4.4M    |  5.4M    |  4.5M
   160  |  1.3M   |  2.4M    |  3.0M    |  4.4M    |  3.5M
   240  |  1.4M   |  2.2M    |  3.0M    |  3.9M    |  3.5M
   320  |  1.2M   |  2.0M    |  2.4M    |  2.9M    |  3.3M
   400  |  1.1M   |  1.7M    |  2.4M    |  2.1M    |  2.7M

Infrastructure challenges
- Sibyl is an HPC workload running on infrastructure designed for the web
- Rapidly opens lots of files: GFS master overload
- Concurrently reads 100s of files per machine:
  - Cluster cross-sectional bandwidth overload
  - Denial of service for co-resident processes
- Random accesses into large vectors: prefetch performance, page-table performance
- MapReduce challenges: multi-shard combiners, column-oriented format
- Column-oriented data format creates lots of small files: outside the GFS sweet spot