Big Data Paradigms in Python




Big Data Paradigms in Python
San Diego Data Science and R Users Group, January 2014
Kevin Davenport
http://kldavenport.com kldavenportjr@gmail.com @KevinLDavenport
Thank you to our sponsors:

Setting up your environment
Completely free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing.
Spend time writing code, not working to set up a system.
Create interactive plots in your browser with Bokeh or D3.

http://ipython.org/notebook.html http://docs.continuum.io/anaconda/index.html http://pandas.pydata.org http://scikit-learn.org/stable/

Anaconda
$ conda update conda
$ conda update anaconda
$ conda update numpy
$ conda update bokeh
$ conda update numba
$ conda install ggplot

Wakari

Two Worlds: Big Data vs. everything else
[Diagram labels: Size, Commodity hardware, Computing clusters, Python programming, Distributed storage]

Tools
scikit-learn: Python ML library built on NumPy, SciPy, and matplotlib
joblib: pipeline jobs with Python functions
- Learn a model from the data: estimator.fit(X_train, y_train)
- Predict using the learned model: estimator.predict(X_test)
- Test goodness of fit: estimator.score(X_test, y_test)
- Apply a change of representation: estimator.transform(X, y)
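The four estimator calls listed above can be sketched on a small synthetic problem. The dataset, the choice of `StandardScaler` and `LogisticRegression`, and all variable names are assumptions made for illustration; they are not taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# transform: apply a change of representation (here, standardisation).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# fit / predict / score: learn, predict, and test goodness of fit.
estimator = LogisticRegression(max_iter=1000)
estimator.fit(X_train_s, y_train)
y_pred = estimator.predict(X_test_s)
accuracy = estimator.score(X_test_s, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same four verbs apply to most scikit-learn objects, which is what lets them be chained and swapped freely.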

Design
1. Debugging: %debug, %time, %timeit, %lprun, %prun, %mprun, %memit
2. Dependency hell: Homebrew, Anaconda, Enthought
3. Get to NumPy as fast as possible: optimized C drivers/connectors

Efficient Data Handling
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching


Simpler Case for ML
Data-driven work needs ML because of the curse of dimensionality.

Manner of Big
1. Large N (many observations)
2. Large M (features, descriptors)

Less data = less work
1. Big Data is often I/O bound
2. Layered memory access: CPU cache, RAM, local disks, distant storage

Dropping Data in a Loop
1. Take a random subset/sample of the data
2. Apply the algorithm on the given subset
3. Aggregate results across subsets
- Run the loop in parallel
- Exploit redundancy across observations
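The three steps above can be sketched with NumPy alone. The function name `subsample_estimates`, the use of the median as "the algorithm", and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np


def subsample_estimates(data, n_subsets=20, subset_size=1000, seed=0):
    """Run an estimator on random subsets and return one result per subset."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_subsets):
        # 1. Take a random subset of the data (without replacement).
        idx = rng.choice(len(data), size=subset_size, replace=False)
        # 2. Apply the algorithm on that subset (here: the median).
        estimates.append(np.median(data[idx]))
    return np.array(estimates)


# Synthetic data standing in for a large dataset (assumption).
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

estimates = subsample_estimates(data)
# 3. Aggregate results across subsets.
aggregate = estimates.mean()
print(f"aggregated median estimate: {aggregate:.3f}")
```

Each iteration touches only 1,000 of the 100,000 observations, and the iterations are independent, so the loop body is a natural candidate for parallel execution.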

Bootstrap Aggregating (Bagging)
Resample the sample with replacement.
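A minimal hand-rolled sketch of bagging follows, assuming decision trees as the base model and a majority vote as the aggregation step; neither choice comes from the slides. scikit-learn also provides `BaggingClassifier`, which packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with labels 0 and 1 (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):
    # Resample the sample with replacement: indices may repeat.
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: majority vote over the 25 trees.
votes = np.stack([m.predict(X) for m in models])  # shape (25, 300)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```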

Dimension Reduction
- Random projections (averaging features): sklearn.random_projection
- Fast (sub-optimal) clustering of features: sklearn.cluster.WardAgglomeration
- Hashing (observations of varying size, e.g. words): sklearn.feature_extraction.text.HashingVectorizer
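Two of the reductions listed above, sketched on toy inputs. The example documents and array sizes are made up. Note one substitution: current scikit-learn releases no longer ship `WardAgglomeration`, so this sketch uses `FeatureAgglomeration`, whose default linkage is Ward.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.text import HashingVectorizer

# Hashing: documents of varying length map to a fixed-width sparse matrix.
docs = [
    "big data in python",
    "python for scientific computing",
    "data data data",
]
vectorizer = HashingVectorizer(n_features=2**10)
X_text = vectorizer.transform(docs)  # stateless, so no fit is needed

# Feature clustering: merge 40 features into 8 clusters and average them.
rng = np.random.default_rng(0)
X_dense = rng.normal(size=(50, 40))
agglo = FeatureAgglomeration(n_clusters=8)
X_reduced = agglo.fit_transform(X_dense)

print(X_text.shape, X_reduced.shape)
```

Because the hashing vectorizer keeps no vocabulary, it can process a stream of documents without a first pass over the corpus.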

Gaussian Random Projection
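A short sketch of a Gaussian random projection with `sklearn.random_projection.GaussianRandomProjection`. The input dimensions and the target of 200 components are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# 100 observations with 5,000 features each (synthetic, assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))

# Project onto 200 random Gaussian directions.
projection = GaussianRandomProjection(n_components=200, random_state=0)
X_small = projection.fit_transform(X)
print(X.shape, "->", X_small.shape)
```

The projection matrix is drawn at random rather than learned, so fitting costs almost nothing compared with PCA.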

PCA using randomized SVD
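In current scikit-learn, the randomized solver is selected through the `svd_solver` argument of `PCA` (older releases exposed it as a separate class). The data and the number of components below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 observations, 500 features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))

# Randomized SVD approximates the leading components without a full SVD.
pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_.sum())
```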

Efficient Data Handling Schemes
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching

Convergence
1. With i.i.d. observations, estimates converge to expectations of the distribution of interest
2. Mini-batch: bunch observations together; a trade-off between memory usage and vectorization
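Mini-batch learning can be sketched with an estimator that supports `partial_fit`. The choice of `SGDClassifier`, the batch size of 200, and the helper `iter_minibatches` are assumptions for illustration; in practice the batches would be read from disk or a stream rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)  # partial_fit needs the full label set up front


def iter_minibatches(X, y, batch_size):
    """Yield consecutive (X, y) chunks of at most batch_size rows."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


clf = SGDClassifier(random_state=0)
for X_batch, y_batch in iter_minibatches(X, y, batch_size=200):
    # Only one batch has to be in memory at a time.
    clf.partial_fit(X_batch, y_batch, classes=classes)

score = clf.score(X, y)
print(f"training accuracy after one pass: {score:.2f}")
```

A larger batch size makes better use of vectorized NumPy operations, while a smaller one lowers peak memory use.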

Batch
http://scikit-learn.org/stable/modules/clustering.html

Batch
Minibatch: 15.1 ms
Vanilla: 50.9 ms
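A comparison of this kind can be reproduced roughly as follows, assuming the timings refer to k-means clustering (the slide does not say which algorithm was measured). The dataset, cluster count, and batch size are illustrative, and measured times will differ by machine.

```python
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=3, random_state=0)


def timed_fit(model, X):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X)
    return time.perf_counter() - start


vanilla = KMeans(n_clusters=3, n_init=10, random_state=0)
minibatch = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0)

t_vanilla = timed_fit(vanilla, X)
t_minibatch = timed_fit(minibatch, X)
print(f"vanilla: {t_vanilla * 1e3:.1f} ms, minibatch: {t_minibatch * 1e3:.1f} ms")
```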

Efficient Data Handling Schemes
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching

Embarrassingly Parallel Loops
[Figure: tasks A-F scheduled on 4 UEs, once with poor load balance and once with good load balance]
UE (Unit of Execution): a collection of concurrently-executing entities, usually either processes or threads
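One common way to get a good load balance is a greedy "longest task first" assignment, sketched below. The task durations are hypothetical (the figure's actual values are not recoverable), and the function name `balance` is made up for this example.

```python
import heapq


def balance(durations, n_units=4):
    """Greedily assign each task to the currently least-loaded UE."""
    # Heap entries: (current load, UE index, assigned task names).
    heap = [(0.0, unit, []) for unit in range(n_units)]
    heapq.heapify(heap)
    longest_first = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in longest_first:
        load, unit, tasks = heapq.heappop(heap)
        tasks.append(name)
        heapq.heappush(heap, (load + cost, unit, tasks))
    return sorted(heap, key=lambda entry: entry[1])


# Hypothetical durations for tasks A-F (assumption).
durations = {"A": 4, "B": 2, "C": 3, "D": 1, "E": 2, "F": 1}
schedule = balance(durations)
for load, unit, tasks in schedule:
    print(f"UE {unit}: tasks={tasks} load={load}")
makespan = max(load for load, _, _ in schedule)
```

Assigning the same tasks in fixed pairs (A+B, C+D, E+F) would leave one UE idle and give a makespan of 6 instead of 4.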

joblib
Running Python functions as pipeline jobs
- Don't change your code
- No dependencies
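The "don't change your code" point can be illustrated with joblib's `Parallel` and `delayed`: an ordinary function is dispatched to workers without being rewritten. The choice of `sqrt` and of two workers is arbitrary.

```python
from math import sqrt

from joblib import Parallel, delayed

# Sequential version: [sqrt(i ** 2) for i in range(10)]
# Parallel version: same function, wrapped in delayed(), run on 2 workers.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)
```

Results come back in the order the tasks were submitted, so the parallel call is a drop-in replacement for the list comprehension.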

Return vs Yield Quick Review
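As a brief refresher, the sketch below contrasts a function that returns a full list with a generator that yields one value at a time. The function names are made up for this example.

```python
def squares_list(n):
    """return: build the whole result in memory, then hand it back."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    """yield: produce one value at a time, pausing between values."""
    for i in range(n):
        yield i * i


eager = squares_list(5)  # all five values exist now
lazy = squares_gen(5)    # nothing has been computed yet
first = next(lazy)       # computes only the first value
rest = list(lazy)        # consumes the remaining values
print(eager, first, rest)
```

Generators matter for large data because a consumer can process items as they arrive instead of waiting for, and storing, the full list.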

scikit-learn Integration
- Cross-validation: cross_val_score(model, X, y, n_jobs=4, cv=3)
- Grid search: GridSearchCV(model, param_grid, n_jobs=4, cv=3).fit(X, y)
- Random forests: RandomForestClassifier(n_jobs=4).fit(X, y) and ExtraTreesClassifier(n_jobs=4).fit(X, y)
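The `n_jobs` calls listed above can be run end to end as follows. The dataset, the small `param_grid`, and the use of two workers instead of four are assumptions made to keep the sketch quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# Cross-validation: the 3 folds are evaluated in parallel.
scores = cross_val_score(model, X, y, cv=3, n_jobs=2)

# Grid search: every (parameter setting, fold) pair is an independent job.
param_grid = {"max_depth": [3, None]}  # illustrative grid (assumption)
search = GridSearchCV(model, param_grid, cv=3, n_jobs=2).fit(X, y)

# Forests: individual trees are trained in parallel.
forest = RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=0).fit(X, y)
extra = ExtraTreesClassifier(n_estimators=20, n_jobs=2, random_state=0).fit(X, y)

print(scores.mean(), search.best_params_)
```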

In-memory: Replicate Dataset, Train Forest Models in Parallel
[Diagram: every worker receives All Data and All Labels and trains its own model: clf_1, clf_2, clf_3]
Seed each model with a different random state integer
clf = (clf_1 + clf_2 + clf_3)
http://ogrisel.com/ http://scikit-learn.org/dev/modules/ensemble.html#random-forests

Too large for memory: Partition Dataset, Train Forest Models in Parallel
[Diagram: All Data and All Labels are split into Data 1-3 and Labels 1-3; each partition trains its own model: clf_1, clf_2, clf_3]
Seed each model with a different random state integer
clf = (clf_1 + clf_2 + clf_3)
http://ogrisel.com/ http://scikit-learn.org/dev/modules/ensemble.html#random-forests
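One way to read `clf = (clf_1 + clf_2 + clf_3)` is as a concatenation of the fitted trees of each forest. The sketch below assumes that reading; the helpers `train_forest` and `combine` are hypothetical, and merging `estimators_` by hand is not an official scikit-learn operation. It also assumes every partition contains every class, which the stratified split guarantees here.

```python
import copy

from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def train_forest(X_part, y_part, seed):
    """Train one forest on one partition, seeded with its own integer."""
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return forest.fit(X_part, y_part)


def combine(forests):
    """Merge several fitted forests by concatenating their trees."""
    combined = copy.deepcopy(forests[0])
    for forest in forests[1:]:
        combined.estimators_ += copy.deepcopy(forest.estimators_)
    combined.n_estimators = len(combined.estimators_)
    return combined


X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Partition the dataset into 3 disjoint, class-balanced parts.
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
partitions = [part for _, part in splitter.split(X, y)]

# Train one forest per partition in parallel (threads avoid pickling here).
forests = Parallel(n_jobs=2, prefer="threads")(
    delayed(train_forest)(X[idx], y[idx], seed)
    for seed, idx in enumerate(partitions)
)

clf = combine(forests)
pred = clf.predict(X)
print(len(clf.estimators_), "trees in the combined forest")
```

The in-memory, replicated variant is the same sketch with the full `X` and `y` passed to every call of `train_forest`.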

Efficient Data Handling
1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching

Memoize
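Memoization caches a function's result per set of arguments so that repeated calls skip the computation. The sketch below uses the standard library's `functools.lru_cache`, which is a swapped-in choice: the slide does not say which tool was shown, and joblib's `Memory` class offers a disk-backed alternative suited to large arrays. The function `slow_square` is a made-up stand-in for an expensive computation.

```python
import functools
import time


@functools.lru_cache(maxsize=None)
def slow_square(x):
    """Stand-in for an expensive computation (assumption)."""
    time.sleep(0.05)
    return x * x


first = slow_square(12)   # computed: pays the 0.05 s cost
second = slow_square(12)  # served from the cache
info = slow_square.cache_info()
print(first, second, info)
```

An in-memory cache like this one disappears when the process exits; a disk-backed cache survives across runs at the cost of serialization.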

Don't underestimate the cost of complexity, whether it be cognitive, maintenance, mutability, portability, etc.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald Knuth

Please Donate to numfocus.org