Simple big data, in Python. Gaël Varoquaux



Similar documents
Big Data Paradigms in Python

Machine Learning in Python with scikit-learn. O Reilly Webcast Aug. 2014

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Maschinelles Lernen mit MATLAB

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Big-data Analytics: Challenges and Opportunities

Fast Analytics on Big Data with H20

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Parallel and Large Scale Learning with scikit-learn

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Social Media Mining. Data Mining Essentials

Analytics on Big Data

The Scientific Data Mining Process

Machine Learning with MATLAB David Willingham Application Engineer

Machine learning for algo trading

Machine Learning using MapReduce

SURVEY REPORT DATA SCIENCE SOCIETY 2014

Parallel Computing. Benson Muite. benson.

Analysis Tools and Libraries for BigData

Advanced Big Data Analytics with R and Hadoop

Car Insurance. Prvák, Tomi, Havri

Using Data Mining for Mobile Communication Clustering and Characterization

Why is Internal Audit so Hard?

MACHINE LEARNING BRETT WUJEK, SAS INSTITUTE INC.

Unlocking the True Value of Hadoop with Open Data Science

The Internet of Things and Big Data: Intro

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Machine Learning Capacity and Performance Analysis and R

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Machine Learning for Data Science (CS4786) Lecture 1

Spark and the Big Data Library

Bringing Big Data Modelling into the Hands of Domain Experts

Predicting borrowers chance of defaulting on credit loans

RevoScaleR Speed and Scalability

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Apache Hama Design Document v0.6

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Classification Techniques in Remote Sensing Research using Smart Data Analytics

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Introduction to Machine Learning Using Python. Vikram Kamath

Large Scale Electroencephalography Processing With Hadoop

Chapter 6. The stacking ensemble approach

Rackspace Cloud Databases and Container-based Virtualization

What is RAID? data reliability with performance

CS 6290 I/O and Storage. Milos Prvulovic

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski

MS1b Statistical Data Mining

Big Graph Processing: Some Background

A Brief Introduction to Apache Tez

Fall 2012 Q530. Programming for Cognitive Science

CS 40 Computing for the Web

Predicting outcome of soccer matches using machine learning

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

Data Mining. Nonlinear Classification

Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Similarity Search in a Very Large Scale Using Hadoop and HBase

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Microsoft Private Cloud Fast Track

The Artificial Prediction Market

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Data Mining - Evaluation of Classifiers

CSci 538 Articial Intelligence (Machine Learning and Data Analysis)

II. RELATED WORK. Sentiment Mining

Big data in R EPIC 2015

Machine Learning.

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

INTRODUCING AZURE MACHINE LEARNING

Journée Thématique Big Data 13/03/2015

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Hadoop. Sunday, November 25, 12

Azure Machine Learning, SQL Data Mining and R

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

DSS. Diskpool and cloud storage benchmarks used in IT-DSS. Data & Storage Services. Geoffray ADDE

Pattern-Aided Regression Modelling and Prediction Model Analysis

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

Unsupervised Data Mining (Clustering)

R YOU READY FOR PYTHON? Sunday 19th April, 2015

Machine Learning over Big Data

Moving From Hadoop to Spark

Knowledge Discovery and Data Mining

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

8. Machine Learning Applied Artificial Intelligence

Processing of Big Data. Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015

Transcription:

Simple big data, in Python Gaël Varoquaux

Simple big data, in Python Gaël Varoquaux This is a lie!

Please allow me to introduce myself Physicist gone bad I m a man of wealth and taste I ve been around for a long, long year Neuroscience, Machine learning Worked in a software startup Enthought: scientific computing consulting in Python Coder (done my share of mistake) Mayavi, scikit-learn, joblib... Scipy community Chair of scipy and EuroScipy conferences Researcher (PI) at INRIA G Varoquaux 2

G Varoquaux 3 1 Machine learning in 2 words 2 Scikit-learn 3 Big data on a budget 4 The bigger picture: a community

1 Machine learning in 2 words G Varoquaux 4

G Varoquaux 5 1 A historical perspective Artificial Intelligence Building decision rules The 80s Eatable? Tall? Mobile?

G Varoquaux 5 1 A historical perspective Artificial Intelligence Building decision rules Machine learning Learn these from observations The 80s The 90s

1 A historical perspective Artificial Intelligence Building decision rules The 80s Machine learning Learn these from observations The 90s Statistical learning Model the noise in the observations G Varoquaux 2000s 5

G Varoquaux 5 1 A historical perspective Artificial Intelligence Building decision rules Machine learning Learn these from observations Statistical learning Model the noise in the observations Big data Many observations, simple rules The 80s The 90s 2000s today

1 A historical perspective Artificial Intelligence Building decision rules Machine learning Learn these from observations Statistical learning Model the noise in the observations Big data Many observations, simple rules The 80s The 90s 2000s today Big data isn t actually interesting without machine learning Steve Jurvetson, VC, Silicon Valley G Varoquaux 5

1 Machine learning G Varoquaux 6 Example: face recognition Andrew Bill Charles Dave

1 Machine learning? G Varoquaux 6 Example: face recognition Andrew Bill Charles Dave

1 A simple method G Varoquaux 7 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) images, find the image that is most similar. Nearest neighbor method

1 A simple method G Varoquaux 7 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) images, find the image that is most similar. Nearest neighbor method How many errors on already-known images?... 0: no erreurs Test data Train data

G Varoquaux 8 1 1 st problem: noise Signal unrelated to the prediction problem Prediction accuracy 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Noise level

1 2 nd problem: number of variables G Varoquaux 9 Finding a needle in a haystack Prediction accuracy 0.95 0.90 0.85 0.80 0.75 0.70 0.65 1 2 3 4 5 6 7 8 9 10 Useful fraction of the frame

1 Machine learning Example: face recognition Learning from numerical descriptors Difficulties: i) noise, ii) number of features Andrew Bill Charles Dave supervised task: known labels unsupervised task: unknown labels? G Varoquaux 10

G Varoquaux 11 1 Machine learning: regression A single descriptor: one dimension y x

G Varoquaux 11 1 Machine learning: regression A single descriptor: one dimension y y x Which model to prefer? x

1 Machine learning: regression A single descriptor: one dimension y y x Problem of over-fitting Minimizing error is not always the best strategy (learning noise) Test data train data G Varoquaux 11 x

G Varoquaux 11 1 Machine learning: regression A single descriptor: one dimension y y x Prefer simple models = concept of regularization Balance the number of parameters to learn with the amount of data x

1 Machine learning: regression A single descriptor: one dimension Two descriptors: 2 dimensions y y x X_2 X_1 More parameters G Varoquaux 11

1 Machine learning: regression A single descriptor: one dimension Two descriptors: 2 dimensions y y x need more data curse of dimensionality X_2 X_1 More parameters G Varoquaux 11

G Varoquaux 12 1 Supervised learning: classification Predicting categories, e.g. numbers X 2 X 1

1 Unsupervised learning G Varoquaux 13 Stock market structure

1 Unsupervised learning G Varoquaux 13 Stock market structure Unlabeled data more common than labeled data

1 Recommender systems G Varoquaux 14

1 Recommender systems G Varoquaux 14 Andrew Bill Charles Dave Edie Little overlap between users

1 Machine learning Challenges Statistics Computational G Varoquaux 15

1 Petty day-to-day technicalities G Varoquaux 16 Buggy code Slow code Lead data scientist leaves New intern to train I don t understand the code I have written a year ago

1 Petty day-to-day technicalities G Varoquaux 16 Buggy code An in-house data science squad Slow Difficulties code LeadRecruitment data scientist leaves NewLimited intern toresources train (people & hardware) I don t understand the code I have written a year ago Risks Bus factor Technical dept

1 Petty day-to-day technicalities Buggy code An in-house data science squad Slow Difficulties code LeadRecruitment data scientist leaves NewLimited intern toresources train (people & hardware) I don t understand the code I have written a year ago Risks Bus factor Technical dept We need big data (and machine learning) on a tight budget G Varoquaux 16

2 Scikit-learn Machine learning without learning the machinery G Varoquaux c Theodore W. Gray 17

2 My stack G Varoquaux 18 Python, what else? Interactive language Easy to read / write General purpose

2 My stack G Varoquaux 18 Python, what else? The scientific Python stack numpy arrays pandas... It s about plugin things together

2 scikit-learn vision G Varoquaux 19 Machine learning for all No specific application domain No requirements in machine learning High-quality software library Interfaces designed for users Community-driven development BSD licensed, very diverse contributors http://scikit-learn.org

2 A Python library G Varoquaux 20 A library, not a program More expressive and flexible Easy to include in an ecosystem As easy as py from s k l e a r n import svm c l a s s i f i e r = svm.svc() c l a s s i f i e r. f i t ( X t r a i n, Y t r a i n ) Y t e s t = c l a s s i f i e r. p r e d i c t ( X t e s t )

2 Very rich feature set G Varoquaux 21 Supervised learning Decision trees (Random-Forest, Boosted Tree) Linear models SVM Unsupervised Learning Clustering Dictionary learning Outlier detection Model selection Built in cross-validation Parameter optimization

2 Computational performance scikit-learn mlpy pybrain pymvpa mdp shogun SVM 5.2 9.47 17.5 11.52 40.48 5.63 LARS 1.17 105.3-37.35 - - Elastic Net 0.52 73.7-1.44 - - knn 0.57 1.41-0.56 0.58 1.36 PCA 0.18 - - 8.93 0.47 0.33 k-means 1.34 0.79-35.75 0.68 Algorithmic optimizations Minimizing data copies G Varoquaux 22

2 Computational performance 14000 Random Forest fit time scikit-learn mlpy pybrain pymvpa mdp shogun randomforest SVM 5.2 OpenCV-RF 9.47 17.5 11.52 40.48 5.63 12000 R, Fortran OpenCV-ETs OK3-RF LARS 1.17 105.3-37.35 10941.72 OK3-ETs - - Weka-RF Orange 10000 Elastic Net 0.52 R-RF 73.7-1.44 -Python - Orange-RF knn 0.57 1.41-0.56 0.58 1.36 8000 PCA 0.18 - - 8.93 0.47 0.33 k-means6000 1.34 0.79 OpenCV - 35.75 0.68 Fit tim e (s) 4000 Algorithmic optimizations 2000 Minimizing data Scikit-Learn copies 0 Scikit-Learn-RF Scikit-Learn-ETs Python, Cython 203.01 211.53 C++ 4464.65 3342.83 OK3 C Weka 1518.14 1711.94 Java 1027.91 13427.06 Figure: Gilles Louppe G Varoquaux 22

3 Big data on a budget G Varoquaux 23

G Varoquaux 23 3 Big data on a budget Biggish Big data : Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-self computers

G Varoquaux 23 3 Big data on a budget Biggish Big data : Mere mortals: Petabytes... Simple data processing patterns Distributed storage Gigabytes... Computing cluster Python programming Off-the-self computers

Big data, but big how? 2 scenarios: Many observations samples e.g. twitter Many descriptors per observation features e.g. brain scans G Varoquaux 24

3 On-line algorithms G Varoquaux 25 Process the data one sample at a time Compute the mean of a gazillion numbers Hard?

3 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? No: just do a running mean G Varoquaux 25

3 On-line algorithms Converges to expectations Mini-batch = bunch observations for vectorization Example: K-Means clustering X = np.random.normal(size=(10 000, 200)) scipy.cluster.vq. kmeans(x, 10, iter=2) 11.33 s sklearn.cluster. MiniBatchKMeans(n clusters=10, n init=2).fit(x) 0.62 s G Varoquaux 25

3 On-the-fly data reduction G Varoquaux 26 Big data is often I/O bound Layer memory access CPU caches RAM Local disks Distant storage Less data also means less work

3 On-the-fly data reduction Dropping data 1 loop: take a random fraction of the data 2 run algorithm on that fraction 3 aggregate results across sub-samplings Looks like bagging: bootstrap aggregation Exploits redundancy across observations Run the loop in parallel G Varoquaux 26

3 On-the-fly data reduction G Varoquaux 26 Random projections (will average features) sklearn.random projection random linear combinations of the features Fast clustering of features sklearn.cluster.wardagglomeration on images: super-pixel strategy Hashing when observations have varying size (e.g. words) sklearn.feature extraction.text. HashingVectorizer stateless: can be used in parallel

G Varoquaux 26 3 On-the-fly data reduction Example: randomized SVD Random projection sklearn.utils.extmath.randomized svd X = np.random.normal(size=(50000, 200)) %timeit lapack = linalg.svd(x, full matrices=false) 1 loops, best of 3: 6.09 s per loop %timeit arpack=splinalg.svds(x, 10) 1 loops, best of 3: 2.49 s per loop %timeit randomized = randomized svd(x, 10) 1 loops, best of 3: 303 ms per loop linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000 0.0022360679774997738 linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000 0.0022121161221386925

3 Data-parallel computing I tackle only embarrassingly parallel problem Life is too short to worry about deadlocks Stratification to follow the statistical dependencies and the data storage structure Batch size scaled by the relevent cache pool - Too fine overhead - Too coarse memory shortage joblib.parallel G Varoquaux 27

3 Caching G Varoquaux 28 Minimizing data-access latency Never computing twice the same thing joblib.cache

3 Fast access to data Stored representation consistent with access patterns Compression to limit bandwidth usage - CPU are faster than data access Data access speed is often more a limitation than raw processing power joblib.dump/joblib.load G Varoquaux 29

G Varoquaux 30 3 Biggish iron Our new box: 15 ke 48 cores 384G RAM 70T storage (SSD cache on RAID controller) Gets our work done faster than our 800 CPU cluster It s the access patterns! Nobody ever got fired for using Hadoop on a cluster A. Rowstron et al., HotCDP 12

4 The bigger picture: a community Helping your future self G Varoquaux 31

G Varoquaux 32 4 Community-based development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors 12 core contributors 1 full-time INRIA programmer from the start Estimated cost of development: $ 6 millions COCOMO model, http://www.ohloh.net/p/scikit-learn

G Varoquaux 33 4 The economics of open source Code maintenance too expensive to be alone scikit-learn 300 email/month nipy 45 email/month joblib 45 email/month mayavi 30 email/month Hey Gael, I take it you re too busy. That s okay, I spent a day trying to install XXX and I think I ll succeed myself. Next time though please don t ignore my emails, I really don t like it. You can say, sorry, I have no time to help you. Just don t ignore.

G Varoquaux 33 4 The economics of open source Code maintenance too expensive to be alone scikit-learn 300 email/month nipy 45 email/month joblib 45 email/month mayavi 30 email/month Your benefits come from a fraction of the code Data loading? Maybe? Standard algorithms? Nah Share the common code......to avoid dying under code Code becomes less precious with time And somebody might contribute features

4 Many eyes makes code fast L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer G Varoquaux 34

4 Communities increase the knowledge pool Even if you don t do software, you should worry about communities More experts in the package Easier recruitement The future is open Enhancements are possible Meetups, conferences, are where new ideas are born G Varoquaux 35

4 6 steps to a community-driven project http://www.slideshare.net/gaelvaroquaux/ scikit-learn-dveloppement-communautaire G Varoquaux 36 1 Focus on quality 2 Build great docs and examples 3 Use github 4 Limit the technicality of your codebase 5 Releasing and packaging matter 6 Focus on your contributors, give them credit, decision power

4 Core project contributors Credit: Fernando Perez, Gist 5843625 G Varoquaux 37

4 The tragedy of the commons Individuals, acting independently and rationally according to each one s self-interest, behave contrary to the whole group s long-term best interests by depleting some common resource. Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted Hard to fund, less excitement They need citation, in papers & on corporate web pages G Varoquaux 38

Simple big data, in Python Beyond the lie Machine learning gives value to (big) data Python + scikit-learn = - from interactive data processing (IPython notebook) - to crazy big problems (Python + spark) Big data will require you to understand the data-flow patterns (access, parallelism, statistics) Big data community addresses the human factor @GaelVaroquaux