Big Data Paradigms in Python



1 Big Data Paradigms in Python
San Diego Data Science and R Users Group, January 2014
Kevin Davenport
Thank you to our sponsors:

2 Setting up your environment
Completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing.
Spend time writing code, not working to set up a system.
Create interactive plots in your browser with Bokeh or D3.


4 Anaconda
$ conda update conda
$ conda update anaconda
$ conda update numpy
$ conda update bokeh
$ conda update numba
$ conda install ggplot

5 Wakari

6 Two Worlds
Big Data vs. everything else
Diagram labels: Size, Commodity hardware, Computing clusters, Python programming, Distributed storage

7 Tools
scikit-learn: Python ML library built on NumPy, SciPy, and matplotlib
joblib: pipeline jobs with Python functions
Learn a model from the data: estimator.fit(X_train, y_train)
Predict using the learned model: estimator.predict(X_test)
Test goodness of fit: estimator.score(X_test, y_test)
Apply a change of representation: estimator.transform(X)
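The fit / predict / score convention on this slide can be illustrated without scikit-learn itself. The sketch below is a hypothetical toy estimator (`MeanRegressor` is an illustrative name, not a scikit-learn class) that only mimics the method names and call order; it assumes R^2 as the goodness-of-fit score.

```python
from statistics import mean


class MeanRegressor:
    """Hypothetical toy estimator: always predicts the mean of the training targets."""

    def fit(self, X_train, y_train):
        # Learn a model from the data (here the model is a single number).
        self.mean_ = mean(y_train)
        return self  # returning self allows chaining, as scikit-learn estimators do

    def predict(self, X_test):
        # Predict using the learned model: one prediction per input row.
        return [self.mean_ for _ in X_test]

    def score(self, X_test, y_test):
        # Goodness of fit as R^2 = 1 - SS_res / SS_tot.
        predictions = self.predict(X_test)
        ss_res = sum((y - p) ** 2 for y, p in zip(y_test, predictions))
        y_bar = mean(y_test)
        ss_tot = sum((y - y_bar) ** 2 for y in y_test)
        return 1.0 - ss_res / ss_tot


X_train, y_train = [[1], [2], [3]], [2.0, 4.0, 6.0]
X_test, y_test = [[4], [5]], [3.0, 5.0]

estimator = MeanRegressor().fit(X_train, y_train)
predictions = estimator.predict(X_test)
r2 = estimator.score(X_test, y_test)
```

A constant predictor that happens to match the test mean scores exactly 0 here, which is the usual baseline a real model should beat.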


8 Design
1. Debugging: %debug, %time, %timeit, %lprun, %prun, %mprun, %memit
2. Dependency hell: Homebrew, Anaconda, Enthought
3. Get to NumPy as fast as possible: optimized C drivers/connectors

9 Efficient Data Handling
1. On-the-fly data reduction
2. Online algorithms
3. Parallel processing patterns
4. Caching

10 Efficient Data Handling
1. On-the-fly data reduction
2. Online algorithms
3. Parallel processing patterns
4. Caching

11 Simpler Case for ML
Data-driven work needs ML because of the curse of dimensionality.

12 Manner of Big
1. Large N (many observations)
2. Large M (features, descriptors)

13 Less data = less work
1. Big data is often I/O bound
2. Layered memory access: CPU cache, RAM, local disks, distant storage

14 Dropping Data in a Loop
1. Take a random subset/sample of the data
2. Apply the algorithm to that subset
3. Aggregate results across subsets
- Run the loop in parallel
- Exploit redundancy across observations
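The three steps on this slide can be sketched with the standard library alone. The data here is synthetic and the "algorithm" is just the mean; both are assumptions made for illustration, and `subsample_estimates` is a hypothetical helper name.

```python
import random
from statistics import mean

rng = random.Random(0)
# Synthetic observations standing in for a dataset too large to process at once.
data = [rng.gauss(10.0, 2.0) for _ in range(10_000)]


def subsample_estimates(data, n_subsets, subset_size, rng):
    estimates = []
    for _ in range(n_subsets):
        subset = rng.sample(data, subset_size)  # 1. random subset (no replacement)
        estimates.append(mean(subset))          # 2. apply the algorithm to the subset
    return estimates


estimates = subsample_estimates(data, n_subsets=20, subset_size=200, rng=rng)
aggregate = mean(estimates)                     # 3. aggregate across subsets
```

Each loop iteration is independent of the others, which is what makes the loop easy to run in parallel.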

15 Bootstrap Aggregating (Bagging)
Sample, then resample the sample with replacement.
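Resampling with replacement can be sketched as follows. The statistic (the median), the sample, and the number of resamples are all illustrative assumptions; `bootstrap_resample` is a hypothetical helper.

```python
import random
from statistics import mean, median

rng = random.Random(42)
sample = [rng.uniform(0, 100) for _ in range(500)]  # synthetic sample


def bootstrap_resample(sample, rng):
    # Draw len(sample) items WITH replacement, so some items repeat and others drop out.
    return rng.choices(sample, k=len(sample))


# Compute the statistic on each resample, then aggregate (here: average).
boot_medians = [median(bootstrap_resample(sample, rng)) for _ in range(100)]
bagged_median = mean(boot_medians)
```

In bagging proper, the per-resample statistic is a fitted model and the aggregate is an average or a vote over the models' predictions.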

16 Dimension Reduction
Random projections (averaging features): sklearn.random_projection
Fast (sub-optimal) clustering of features: sklearn.cluster.WardAgglomeration
Hashing (observations of varying size, e.g. words): sklearn.feature_extraction.text.HashingVectorizer
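The hashing idea in the last bullet maps observations of any length to a fixed-width vector. The sketch below is a simplified stand-in, not scikit-learn's implementation: it assumes MD5 as the hash and plain unsigned counts, whereas HashingVectorizer uses a signed MurmurHash3. `hash_features` is a hypothetical helper.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a token list of any length to a fixed-size count vector."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # A stable hash (unlike built-in hash()) gives the same index on every run.
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1
    return vector


short_doc = "big data".split()
long_doc = "big data paradigms in python for big data".split()
short_vec = hash_features(short_doc)
long_vec = hash_features(long_doc)
```

Both documents end up as 16-element vectors regardless of their length, and no vocabulary has to be stored. The price is that distinct tokens can collide in the same slot.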


17 Gaussian Random Projection
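The slide content did not survive extraction, so what follows is only a generic pure-Python sketch of a Gaussian random projection, not the talk's code. It assumes projection entries drawn from N(0, 1/n_components), the same scaling scikit-learn documents for GaussianRandomProjection; the data and helper names are hypothetical.

```python
import math
import random


def gaussian_random_projection(rows, n_components, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    scale = 1.0 / math.sqrt(n_components)  # standard deviation of each entry
    # One random direction per output component.
    projection = [[rng.gauss(0.0, scale) for _ in range(n_features)]
                  for _ in range(n_components)]
    # Each output coordinate is the dot product of a row with one direction.
    return [[sum(w * x for w, x in zip(component, row)) for component in projection]
            for row in rows]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(500)] for _ in range(3)]  # synthetic data
reduced = gaussian_random_projection(rows, n_components=200)

# Pairwise distances should be roughly preserved after projection.
ratio = distance(reduced[0], reduced[1]) / distance(rows[0], rows[1])
```

With 200 components the distance ratio is expected to stay close to 1, which is why a projection that ignores the data can still be useful as a cheap reduction step.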

18 PCA using randomized SVD

19 Efficient Data Handling Schemes
1. On-the-fly data reduction
2. Online algorithms
3. Parallel processing patterns
4. Caching

20 Convergence
1. With i.i.d. observations, estimates converge to expectations under the distribution of interest
2. Mini-batch: bunch observations together; a trade-off between memory usage and vectorization
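A minimal sketch of the mini-batch idea, assuming the quantity of interest is a running mean: the stream is consumed in fixed-size chunks, so memory use is bounded by the batch size while each chunk can still be processed in one go. `mini_batches` and `online_mean` are hypothetical helper names.

```python
def mini_batches(stream, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def online_mean(stream, batch_size):
    count, running_mean = 0, 0.0
    for batch in mini_batches(stream, batch_size):
        batch_mean = sum(batch) / len(batch)
        count += len(batch)
        # Move the running mean toward the batch mean, weighted by batch size.
        running_mean += (batch_mean - running_mean) * len(batch) / count
    return running_mean


result = online_mean(iter(range(1, 101)), batch_size=32)
```

Only one batch is held in memory at a time, and the result matches the mean computed over the whole stream at once.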

21 Batch

22 Batch
Minibatch: 15.1 ms
Vanilla: 50.9 ms

23 Efficient Data Handling Schemes
1. On-the-fly data reduction
2. Online algorithms
3. Parallel processing patterns
4. Caching

24 Embarrassingly Parallel Loops
Tasks: A B C D E F
Figure: poor load balance (4 UEs) vs. good load balance (4 UEs)
UE (Unit of Execution): a collection of concurrently executing entities, usually either processes or threads
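The slide's figure did not survive extraction, so the sketch below only illustrates the general point with made-up numbers. The task durations are hypothetical, and the "good" schedule uses a longest-task-first greedy rule, which is one common heuristic and not necessarily what the figure showed.

```python
import heapq

# Hypothetical task durations (arbitrary time units); not taken from the slide.
durations = {"A": 5, "B": 2, "C": 4, "D": 1, "E": 3, "F": 1}


def round_robin(durations, n_units):
    """Assign tasks to units in arrival order, ignoring their cost."""
    loads = [0] * n_units
    for i, duration in enumerate(durations.values()):
        loads[i % n_units] += duration
    return loads


def longest_first(durations, n_units):
    """Give the longest remaining task to the currently least-loaded unit."""
    heap = [(0, unit) for unit in range(n_units)]  # (load, unit index)
    for _, duration in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, unit = heapq.heappop(heap)
        heapq.heappush(heap, (load + duration, unit))
    return [load for load, _ in sorted(heap, key=lambda entry: entry[1])]


poor = round_robin(durations, n_units=4)
good = longest_first(durations, n_units=4)
```

The total running time is set by the busiest unit, so `max(poor)` against `max(good)` is the comparison that matters.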

25 joblib
Running Python functions as pipeline jobs
Don't change your code
No dependencies
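joblib's documented pattern is `Parallel(n_jobs=...)(delayed(f)(x) for x in inputs)`. So that the example runs without third-party packages, the sketch below swaps in the standard library's `concurrent.futures` as a stand-in; it shows the same "leave the function alone, change only the loop" idea. `slow_square` is a hypothetical placeholder for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    # Placeholder for an expensive, independent computation.
    return x * x


inputs = list(range(8))

# Serial version: an ordinary loop.
serial = [slow_square(x) for x in inputs]

# Parallel version: same function, unchanged; only the loop is replaced.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(slow_square, inputs))
```

Threads are used here only to keep the sketch simple. For CPU-bound pure-Python functions, process-based workers (joblib's default) are usually the better fit.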

26 Return vs Yield Quick Review
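The slide body was not captured, so here is a brief generic reminder of the difference rather than the talk's own example: `return` hands back a finished result, while `yield` produces values one at a time on demand.

```python
def squares_list(n):
    # return: build the whole result in memory, then hand it back at once.
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    # yield: produce one value at a time; execution pauses between values.
    for i in range(n):
        yield i * i


eager = squares_list(5)   # a full list
lazy = squares_gen(5)     # a generator; nothing has been computed yet
first = next(lazy)        # computes only the first value
rest = list(lazy)         # consumes the remaining values
```

Once consumed, the generator is empty, which is why generators suit single-pass streaming over data that does not fit in memory.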

27 scikit-learn Integration
Cross-validation: cross_val_score(model, X, y, n_jobs=4, cv=3)
Grid search: GridSearchCV(model, param_grid, n_jobs=4, cv=3).fit(X, y)
Random forests: RandomForestClassifier(n_jobs=4).fit(X, y)
ExtraTreesClassifier(n_jobs=4).fit(X, y)

28 In-memory
Replicate the dataset; train forest models in parallel
Figure: every worker receives all data and all labels to predict
Models: clf_1, clf_2, clf_3
Seed each model with a different random state integer
clf = (clf_1 + clf_2 + clf_3)

29 Too large for memory
Partition (rather than replicate) the dataset; train forest models in parallel
Figure: all data is split into Data 1, Data 2, Data 3 with matching Labels 1, Labels 2, Labels 3
Models: clf_1, clf_2, clf_3
Seed each model with a different random state integer
clf = (clf_1 + clf_2 + clf_3)
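The partition-train-combine pattern can be sketched with a toy model in place of a random forest. Everything below is an assumption made for illustration: the data is synthetic and one-dimensional, each "model" is a single threshold fitted on a bootstrap of its partition, and the combination step is a majority vote. `train_stump` and `predict` are hypothetical helpers.

```python
import random
from statistics import mean

data_rng = random.Random(0)
# Synthetic 1-D data: class 0 centered at 0, class 1 centered at 4, alternating.
labels = [i % 2 for i in range(300)]
xs = [data_rng.gauss(4.0 * label, 1.0) for label in labels]


def train_stump(xs, labels, seed):
    """Fit a threshold halfway between the class means of a bootstrap resample."""
    rng = random.Random(seed)  # a different seed per model
    idx = [rng.randrange(len(xs)) for _ in xs]
    mean0 = mean(xs[i] for i in idx if labels[i] == 0)
    mean1 = mean(xs[i] for i in idx if labels[i] == 1)
    return (mean0 + mean1) / 2


# Partition the data into three parts; each model sees only its own part.
thresholds = [train_stump(xs[k::3], labels[k::3], seed=k) for k in range(3)]


def predict(x, thresholds):
    votes = sum(x > t for t in thresholds)
    return int(votes > len(thresholds) / 2)  # majority vote across models
```

Each call to `train_stump` is independent of the others, so the three fits could run on separate workers without sharing the full dataset.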

30 Efficient Data Handling
1. On-the-fly data reduction
2. Online algorithms
3. Parallel processing patterns
4. Caching

31 Memoize
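The slide body was not captured. As a generic illustration of memoization, the sketch below uses the standard library's `functools.lru_cache`, which caches results in memory; this is an assumption about what fits here, not the talk's own code. `expensive` is a hypothetical stand-in for a costly function.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive(n):
    # Stand-in for a costly computation whose result depends only on its arguments.
    return sum(i * i for i in range(n))


first = expensive(1000)    # computed and stored in the cache
second = expensive(1000)   # served from the cache, not recomputed
info = expensive.cache_info()
```

`cache_info()` reports one miss (the first call) and one hit (the second). Memoization is only safe for functions without side effects whose arguments are hashable.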

32 Don't underestimate the cost of complexity, whether it be cognitive, maintenance, mutability, portability, etc.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald Knuth

33 Please Donate to numfocus.org
