Big Data Paradigms in Python
San Diego Data Science and R Users Group, January 2014
Kevin Davenport
http://kldavenport.com | kldavenportjr@gmail.com | @KevinLDavenport
Thank you to our sponsors:
Setting up your environment
A completely free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing. Spend time writing code, not setting up a system. Create interactive plots in your browser with Bokeh or D3.
http://ipython.org/notebook.html http://docs.continuum.io/anaconda/index.html http://pandas.pydata.org http://scikit-learn.org/stable/
Anaconda
$ conda update conda
$ conda update anaconda
$ conda update numpy
$ conda update bokeh
$ conda update numba
$ conda install ggplot
Wakari
Two Worlds
Big Data: size demands computing clusters and distributed storage.
Everything else: commodity hardware and ordinary Python programming.
Tools
scikit-learn: Python ML library built on NumPy, SciPy, and matplotlib
joblib: pipeline jobs with Python functions
Learn a model from the data: estimator.fit(X_train, y_train)
Predict using the learned model: estimator.predict(X_test)
Test goodness of fit: estimator.score(X_test, y_test)
Apply a change of representation: estimator.transform(X)
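The estimator API above can be sketched on a toy dataset (the data is made up for illustration; any scikit-learn estimator follows the same fit/predict/score pattern):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two well-separated groups
X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])

estimator = KNeighborsClassifier(n_neighbors=1)
estimator.fit(X_train, y_train)            # learn a model from the data
pred = estimator.predict([[0.5], [10.5]])  # predict with the learned model
acc = estimator.score(X_train, y_train)    # goodness of fit
```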
Design
1. Debugging: %debug, %time, %timeit, %lprun, %prun, %mprun, %memit
2. Dependency hell: Homebrew, Anaconda, Enthought
3. Get to NumPy as fast as possible: optimized C drivers/connectors
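Outside IPython, the %timeit magic maps to the stdlib timeit module; a minimal sketch (the expression timed is arbitrary):

```python
import timeit

# Time 1,000 runs of a small expression, as %timeit would
elapsed = timeit.timeit("sum(range(1000))", number=1000)
print(f"{elapsed * 1e6 / 1000:.1f} us per loop")
```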
Efficient Data Handling
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching
Simpler Case for ML
Data-driven work needs ML because of the curse of dimensionality.
Manner of Big
1. Large N (many observations)
2. Large M (many features/descriptors)
Less data = less work
1. Big Data is often I/O bound
2. Layer memory access: CPU cache → RAM → local disks → distant storage
Dropping Data in a Loop
1. Take a random subset/sample of the data
2. Apply the algorithm on the given subset
3. Aggregate results across subsets
- Run the loop in parallel
- Exploit redundancy across observations
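The three steps above can be sketched in pure Python (the dataset, subset size, and number of iterations are invented for illustration):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(100.0, 15.0) for _ in range(100_000)]  # stand-in dataset

estimates = []
for _ in range(20):
    subset = random.sample(data, 500)          # 1. random subset of the data
    estimates.append(statistics.mean(subset))  # 2. apply the algorithm

result = statistics.mean(estimates)            # 3. aggregate across subsets
```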
Bootstrap Aggregating (Bagging)
Resample the sample with replacement.
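Resampling with replacement can be sketched with the stdlib (the sample values are invented):

```python
import random
import statistics

random.seed(1)
sample = [2.0, 3.5, 1.0, 4.2, 2.8, 3.1]

boot_means = []
for _ in range(1000):
    # Resample the sample with replacement (same size as the original)
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

# The bootstrap distribution of the mean centers on the sample mean
center = statistics.mean(boot_means)
```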
Dimension Reduction
Random projections (averaging features): sklearn.random_projection
Fast (sub-optimal) clustering of features: sklearn.cluster.WardAgglomeration
Hashing (observations of varying size, e.g. words): sklearn.feature_extraction.text.HashingVectorizer
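Hashing text into a fixed-width feature space, sketched with made-up strings:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hash tokens into 32 buckets regardless of vocabulary size;
# no vocabulary is stored, so memory stays constant
vec = HashingVectorizer(n_features=32)
X = vec.transform(["big data in python", "python handles big data"])
# X is a sparse matrix with a fixed number of columns
```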
Gaussian Random Projection
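A minimal sketch with synthetic data (shapes chosen for illustration):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(100, 1000)            # 100 observations, 1000 features

# Project onto 50 random Gaussian directions; pairwise distances
# are approximately preserved (Johnson-Lindenstrauss)
proj = GaussianRandomProjection(n_components=50, random_state=0)
X_small = proj.fit_transform(X)
```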
PCA using randomized SVD
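A minimal sketch on synthetic data. At the time of the talk this lived in sklearn.decomposition.RandomizedPCA; current scikit-learn exposes the same idea through PCA(svd_solver="randomized"):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 100)

# Randomized SVD approximates the top components without
# computing the full decomposition
pca = PCA(n_components=5, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)
```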
Efficient Data Handling Schemes
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching
Convergence
1. i.i.d. data converges to expectations of the distribution of interest
2. Mini-batch: process observations in bunches; trade off memory usage against vectorization
Batch
http://scikit-learn.org/stable/modules/clustering.html
Batch vs. Mini-batch
MiniBatch: 15.1 ms
Vanilla: 50.9 ms
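The comparison can be sketched on synthetic blobs (data and parameters invented; your timings will differ from the slide's):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.RandomState(0)
# Two well-separated 2-D blobs
X = np.vstack([rng.randn(500, 2), rng.randn(500, 2) + 10])

# Vanilla k-means sees all observations each iteration
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Mini-batch k-means updates centers from small random bunches
mbk = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=10,
                      random_state=0).fit(X)
```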
Efficient Data Handling Schemes
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching
Embarrassingly Parallel Loops
[Figure: six tasks A-F scheduled on 4 UEs, poor vs. good load balance]
UE (Unit of Execution): a collection of concurrently executing entities, usually either processes or threads
joblib
Running Python functions as pipeline jobs
Don't change your code
No dependencies
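An embarrassingly parallel loop with joblib (the function and inputs are arbitrary):

```python
from math import sqrt
from joblib import Parallel, delayed

# Run sqrt over the inputs in two worker processes; the call site
# stays shaped like a plain comprehension, so the code barely changes
out = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
```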
Return vs. Yield: Quick Review
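A quick sketch of the difference: return materializes the whole result in memory, while yield produces one item at a time, which matters when the data does not fit in RAM.

```python
def squares_return(n):
    # Builds the full list before returning it
    return [i * i for i in range(n)]

def squares_yield(n):
    # Generator: computes each value lazily, on demand
    for i in range(n):
        yield i * i

full = squares_return(5)   # whole list exists in memory
lazy = squares_yield(5)    # generator object; nothing computed yet
first = next(lazy)         # computes only the first value
```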
scikit-learn Integration
Cross-validation: cross_val_score(model, X, y, n_jobs=4, cv=3)
Grid search: GridSearchCV(model, n_jobs=4, cv=3).fit(X, y)
Random forests: RandomForestClassifier(n_jobs=4).fit(X, y), ExtraTreesClassifier(n_jobs=4).fit(X, y)
In-memory: Replicate Dataset, Train Forest Models in Parallel
[Figure: All Data and All Labels replicated to each worker → clf_1, clf_2, clf_3]
Seed each model with a different random state integer: clf = (clf_1 + clf_2 + clf_3)
http://ogrisel.com/
http://scikit-learn.org/dev/modules/ensemble.html#random-forests
Too large for memory: Partition Dataset, Train Forest Models in Parallel
[Figure: All Data split into Data 1-3 with Labels 1-3, one partition per worker → clf_1, clf_2, clf_3]
Seed each model with a different random state integer: clf = (clf_1 + clf_2 + clf_3)
http://ogrisel.com/
http://scikit-learn.org/dev/modules/ensemble.html#random-forests
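The clf = (clf_1 + clf_2 + clf_3) step can be sketched by pooling the fitted trees of independently trained forests. Merging via the estimators_ attribute is a known trick, not an official scikit-learn API, and the data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] > 0.5).astype(int)

# Train one forest per data partition, each with its own random_state
clf_1 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X[:150], y[:150])
clf_2 = RandomForestClassifier(n_estimators=10, random_state=2).fit(X[150:], y[150:])

# Merge by pooling the fitted trees into a single ensemble
clf = clf_1
clf.estimators_ = clf_1.estimators_ + clf_2.estimators_
clf.n_estimators = len(clf.estimators_)
pred = clf.predict(X)
```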
Efficient Data Handling
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching
Memoize
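In-process memoization can be sketched with the stdlib; joblib.Memory offers a disk-backed equivalent better suited to large NumPy arrays. The counter is only there to show the cached call skips the function body:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def slow_square(x):
    calls["count"] += 1   # track how often the body actually runs
    return x * x

slow_square(4)
slow_square(4)            # served from cache; body not re-executed
result = slow_square(4)
```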
Don't underestimate the cost of complexity, whether it be cognitive, maintenance, mutability, portability, etc.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald Knuth
Please Donate to numfocus.org