Simple big data, in Python. Gaël Varoquaux
|
|
|
- Frank Earl Bruce
- 10 years ago
- Views:
Transcription
1 Simple big data, in Python Gaël Varoquaux
2 Simple big data, in Python Gaël Varoquaux This is a lie!
3 Please allow me to introduce myself Physicist gone bad I m a man of wealth and taste I ve been around for a long, long year Neuroscience, Machine learning Worked in a software startup Enthought: scientific computing consulting in Python Coder (done my share of mistake) Mayavi, scikit-learn, joblib... Scipy community Chair of scipy and EuroScipy conferences Researcher (PI) at INRIA G Varoquaux 2
4 G Varoquaux 3 1 Machine learning in 2 words 2 Scikit-learn 3 Big data on a budget 4 The bigger picture: a community
5 1 Machine learning in 2 words G Varoquaux 4
6 G Varoquaux 5 1 A historical perspective Artificial Intelligence Building decision rules The 80s Eatable? Tall? Mobile?
7 G Varoquaux 5 1 A historical perspective Artificial Intelligence Building decision rules Machine learning Learn these from observations The 80s The 90s
8 1 A historical perspective Artificial Intelligence Building decision rules The 80s Machine learning Learn these from observations The 90s Statistical learning Model the noise in the observations G Varoquaux 2000s 5
9 G Varoquaux 5 1 A historical perspective Artificial Intelligence Building decision rules Machine learning Learn these from observations Statistical learning Model the noise in the observations Big data Many observations, simple rules The 80s The 90s 2000s today
10 1 A historical perspective Artificial Intelligence Building decision rules Machine learning Learn these from observations Statistical learning Model the noise in the observations Big data Many observations, simple rules The 80s The 90s 2000s today Big data isn t actually interesting without machine learning Steve Jurvetson, VC, Silicon Valley G Varoquaux 5
11 1 Machine learning G Varoquaux 6 Example: face recognition Andrew Bill Charles Dave
12 1 Machine learning? G Varoquaux 6 Example: face recognition Andrew Bill Charles Dave
13 1 A simple method G Varoquaux 7 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) images, find the image that is most similar. Nearest neighbor method
14 1 A simple method G Varoquaux 7 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) images, find the image that is most similar. Nearest neighbor method How many errors on already-known images?... 0: no erreurs Test data Train data
15 G Varoquaux st problem: noise Signal unrelated to the prediction problem Prediction accuracy Noise level
16 1 2 nd problem: number of variables G Varoquaux 9 Finding a needle in a haystack Prediction accuracy Useful fraction of the frame
17 1 Machine learning Example: face recognition Learning from numerical descriptors Difficulties: i) noise, ii) number of features Andrew Bill Charles Dave supervised task: known labels unsupervised task: unknown labels? G Varoquaux 10
18 G Varoquaux 11 1 Machine learning: regression A single descriptor: one dimension y x
19 G Varoquaux 11 1 Machine learning: regression A single descriptor: one dimension y y x Which model to prefer? x
20 1 Machine learning: regression A single descriptor: one dimension y y x Problem of over-fitting Minimizing error is not always the best strategy (learning noise) Test data train data G Varoquaux 11 x
21 G Varoquaux 11 1 Machine learning: regression A single descriptor: one dimension y y x Prefer simple models = concept of regularization Balance the number of parameters to learn with the amount of data x
22 1 Machine learning: regression A single descriptor: one dimension Two descriptors: 2 dimensions y y x X_2 X_1 More parameters G Varoquaux 11
23 1 Machine learning: regression A single descriptor: one dimension Two descriptors: 2 dimensions y y x need more data curse of dimensionality X_2 X_1 More parameters G Varoquaux 11
24 G Varoquaux 12 1 Supervised learning: classification Predicting categories, e.g. numbers X 2 X 1
25 1 Unsupervised learning G Varoquaux 13 Stock market structure
26 1 Unsupervised learning G Varoquaux 13 Stock market structure Unlabeled data more common than labeled data
27 1 Recommender systems G Varoquaux 14
28 1 Recommender systems G Varoquaux 14 Andrew Bill Charles Dave Edie Little overlap between users
29 1 Machine learning Challenges Statistics Computational G Varoquaux 15
30 1 Petty day-to-day technicalities G Varoquaux 16 Buggy code Slow code Lead data scientist leaves New intern to train I don t understand the code I have written a year ago
31 1 Petty day-to-day technicalities G Varoquaux 16 Buggy code An in-house data science squad Slow Difficulties code LeadRecruitment data scientist leaves NewLimited intern toresources train (people & hardware) I don t understand the code I have written a year ago Risks Bus factor Technical dept
32 1 Petty day-to-day technicalities Buggy code An in-house data science squad Slow Difficulties code LeadRecruitment data scientist leaves NewLimited intern toresources train (people & hardware) I don t understand the code I have written a year ago Risks Bus factor Technical dept We need big data (and machine learning) on a tight budget G Varoquaux 16
33 2 Scikit-learn Machine learning without learning the machinery G Varoquaux c Theodore W. Gray 17
34 2 My stack G Varoquaux 18 Python, what else? Interactive language Easy to read / write General purpose
35 2 My stack G Varoquaux 18 Python, what else? The scientific Python stack numpy arrays pandas... It s about plugin things together
36 2 scikit-learn vision G Varoquaux 19 Machine learning for all No specific application domain No requirements in machine learning High-quality software library Interfaces designed for users Community-driven development BSD licensed, very diverse contributors
37 2 A Python library G Varoquaux 20 A library, not a program More expressive and flexible Easy to include in an ecosystem As easy as py from s k l e a r n import svm c l a s s i f i e r = svm.svc() c l a s s i f i e r. f i t ( X t r a i n, Y t r a i n ) Y t e s t = c l a s s i f i e r. p r e d i c t ( X t e s t )
38 2 Very rich feature set G Varoquaux 21 Supervised learning Decision trees (Random-Forest, Boosted Tree) Linear models SVM Unsupervised Learning Clustering Dictionary learning Outlier detection Model selection Built in cross-validation Parameter optimization
39 2 Computational performance scikit-learn mlpy pybrain pymvpa mdp shogun SVM LARS Elastic Net knn PCA k-means Algorithmic optimizations Minimizing data copies G Varoquaux 22
40 2 Computational performance Random Forest fit time scikit-learn mlpy pybrain pymvpa mdp shogun randomforest SVM 5.2 OpenCV-RF R, Fortran OpenCV-ETs OK3-RF LARS OK3-ETs - - Weka-RF Orange Elastic Net 0.52 R-RF Python - Orange-RF knn PCA k-means OpenCV Fit tim e (s) 4000 Algorithmic optimizations 2000 Minimizing data Scikit-Learn copies 0 Scikit-Learn-RF Scikit-Learn-ETs Python, Cython C OK3 C Weka Java Figure: Gilles Louppe G Varoquaux 22
41 3 Big data on a budget G Varoquaux 23
42 G Varoquaux 23 3 Big data on a budget Biggish Big data : Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-self computers
43 G Varoquaux 23 3 Big data on a budget Biggish Big data : Mere mortals: Petabytes... Simple data processing patterns Distributed storage Gigabytes... Computing cluster Python programming Off-the-self computers
44 Big data, but big how? 2 scenarios: Many observations samples e.g. twitter Many descriptors per observation features e.g. brain scans G Varoquaux 24
45 3 On-line algorithms G Varoquaux 25 Process the data one sample at a time Compute the mean of a gazillion numbers Hard?
46 3 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? No: just do a running mean G Varoquaux 25
47 3 On-line algorithms Converges to expectations Mini-batch = bunch observations for vectorization Example: K-Means clustering X = np.random.normal(size=(10 000, 200)) scipy.cluster.vq. kmeans(x, 10, iter=2) s sklearn.cluster. MiniBatchKMeans(n clusters=10, n init=2).fit(x) 0.62 s G Varoquaux 25
48 3 On-the-fly data reduction G Varoquaux 26 Big data is often I/O bound Layer memory access CPU caches RAM Local disks Distant storage Less data also means less work
49 3 On-the-fly data reduction Dropping data 1 loop: take a random fraction of the data 2 run algorithm on that fraction 3 aggregate results across sub-samplings Looks like bagging: bootstrap aggregation Exploits redundancy across observations Run the loop in parallel G Varoquaux 26
50 3 On-the-fly data reduction G Varoquaux 26 Random projections (will average features) sklearn.random projection random linear combinations of the features Fast clustering of features sklearn.cluster.wardagglomeration on images: super-pixel strategy Hashing when observations have varying size (e.g. words) sklearn.feature extraction.text. HashingVectorizer stateless: can be used in parallel
51 G Varoquaux 26 3 On-the-fly data reduction Example: randomized SVD Random projection sklearn.utils.extmath.randomized svd X = np.random.normal(size=(50000, 200)) %timeit lapack = linalg.svd(x, full matrices=false) 1 loops, best of 3: 6.09 s per loop %timeit arpack=splinalg.svds(x, 10) 1 loops, best of 3: 2.49 s per loop %timeit randomized = randomized svd(x, 10) 1 loops, best of 3: 303 ms per loop linalg.norm(lapack[0][:, :10] - arpack[0]) / linalg.norm(lapack[0][:, :10] - randomized[0]) /
52 3 Data-parallel computing I tackle only embarrassingly parallel problem Life is too short to worry about deadlocks Stratification to follow the statistical dependencies and the data storage structure Batch size scaled by the relevent cache pool - Too fine overhead - Too coarse memory shortage joblib.parallel G Varoquaux 27
53 3 Caching G Varoquaux 28 Minimizing data-access latency Never computing twice the same thing joblib.cache
54 3 Fast access to data Stored representation consistent with access patterns Compression to limit bandwidth usage - CPU are faster than data access Data access speed is often more a limitation than raw processing power joblib.dump/joblib.load G Varoquaux 29
55 G Varoquaux 30 3 Biggish iron Our new box: 15 ke 48 cores 384G RAM 70T storage (SSD cache on RAID controller) Gets our work done faster than our 800 CPU cluster It s the access patterns! Nobody ever got fired for using Hadoop on a cluster A. Rowstron et al., HotCDP 12
56 4 The bigger picture: a community Helping your future self G Varoquaux 31
57 G Varoquaux 32 4 Community-based development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors 12 core contributors 1 full-time INRIA programmer from the start Estimated cost of development: $ 6 millions COCOMO model,
58 G Varoquaux 33 4 The economics of open source Code maintenance too expensive to be alone scikit-learn 300 /month nipy 45 /month joblib 45 /month mayavi 30 /month Hey Gael, I take it you re too busy. That s okay, I spent a day trying to install XXX and I think I ll succeed myself. Next time though please don t ignore my s, I really don t like it. You can say, sorry, I have no time to help you. Just don t ignore.
59 G Varoquaux 33 4 The economics of open source Code maintenance too expensive to be alone scikit-learn 300 /month nipy 45 /month joblib 45 /month mayavi 30 /month Your benefits come from a fraction of the code Data loading? Maybe? Standard algorithms? Nah Share the common code......to avoid dying under code Code becomes less precious with time And somebody might contribute features
60 4 Many eyes makes code fast L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer G Varoquaux 34
61 4 Communities increase the knowledge pool Even if you don t do software, you should worry about communities More experts in the package Easier recruitement The future is open Enhancements are possible Meetups, conferences, are where new ideas are born G Varoquaux 35
62 4 6 steps to a community-driven project scikit-learn-dveloppement-communautaire G Varoquaux 36 1 Focus on quality 2 Build great docs and examples 3 Use github 4 Limit the technicality of your codebase 5 Releasing and packaging matter 6 Focus on your contributors, give them credit, decision power
63 4 Core project contributors Credit: Fernando Perez, Gist G Varoquaux 37
64 4 The tragedy of the commons Individuals, acting independently and rationally according to each one s self-interest, behave contrary to the whole group s long-term best interests by depleting some common resource. Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted Hard to fund, less excitement They need citation, in papers & on corporate web pages G Varoquaux 38
65 Simple big data, in Python Beyond the lie Machine learning gives value to (big) data Python + scikit-learn = - from interactive data processing (IPython notebook) - to crazy big problems (Python + spark) Big data will require you to understand the data-flow patterns (access, parallelism, statistics) Big data community addresses the human
Big Data Paradigms in Python
Big Data Paradigms in Python San Diego Data Science and R Users Group January 2014 Kevin Davenport! http://kldavenport.com [email protected] @KevinLDavenport Thank you to our sponsors: Setting up
Machine Learning in Python with scikit-learn. O Reilly Webcast Aug. 2014
Machine Learning in Python with scikit-learn O Reilly Webcast Aug. 2014 Outline Machine Learning refresher scikit-learn How the project is structured Some improvements released in 0.15 Ongoing work for
DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2
DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
Maschinelles Lernen mit MATLAB
Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical
Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto [email protected] @alexcpsec @MLSecProject
Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto [email protected] @alexcpsec @MLSecProject Agenda Security Monitoring: We are doing it wrong Machine Learning
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Big-data Analytics: Challenges and Opportunities
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
Parallel and Large Scale Learning with scikit-learn
Parallel and Large Scale Learning with scikit-learn Data Science London Meetup - Mar. 2013 About me Regular contributor to scikit-learn Interested in NLP, Computer Vision, Predictive Modeling & ML in general
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Machine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
SURVEY REPORT DATA SCIENCE SOCIETY 2014
SURVEY REPORT DATA SCIENCE SOCIETY 2014 TABLE OF CONTENTS Contents About the Initiative 1 Report Summary 2 Participants Info 3 Participants Expertise 6 Suggested Discussion Topics 7 Selected Responses
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Analysis Tools and Libraries for BigData
+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Car Insurance. Prvák, Tomi, Havri
Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Why is Internal Audit so Hard?
Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets
MACHINE LEARNING BRETT WUJEK, SAS INSTITUTE INC.
MACHINE LEARNING BRETT WUJEK, SAS INSTITUTE INC. AGENDA MACHINE LEARNING Background Use cases in healthcare, insurance, retail and banking Eamples: Unsupervised Learning Principle Component Analysis Supervised
Unlocking the True Value of Hadoop with Open Data Science
Unlocking the True Value of Hadoop with Open Data Science Kristopher Overholt Solution Architect Big Data Tech 2016 MinneAnalytics June 7, 2016 Overview Overview of Open Data Science Python and the Big
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning
Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Natalia Vassilieva, PhD Senior Research Manager GTC 2016 Deep learning proof points as of today Vision Speech Text Other Search & information
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
Grab some coffee and enjoy the pre-show banter before the top of the hour!
Grab some coffee and enjoy the pre-show banter before the top of the hour! Think Big: How to Design a Big Data Information Architecture Exploratory Webcast January 22, 2014 Guests Robin Bloor Chief Analyst,
Machine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Apache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
Classification Techniques in Remote Sensing Research using Smart Data Analytics
Classification Techniques in Remote Sensing Research using Smart Data Analytics Federated Systems and Data Division Research Group High Productivity Data Processing Morris Riedel Juelich Supercomputing
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Introduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
Large Scale Electroencephalography Processing With Hadoop
Large Scale Electroencephalography Processing With Hadoop Matthew D. Burns I. INTRODUCTION Apache Hadoop [1] is an open-source implementation of the Google developed MapReduce [3] general programming model
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Rackspace Cloud Databases and Container-based Virtualization
Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many
What is RAID? data reliability with performance
What is RAID? RAID is the use of multiple disks and data distribution techniques to get better Resilience and/or Performance RAID stands for: Redundant Array of Inexpensive / Independent Disks RAID can
CS 6290 I/O and Storage. Milos Prvulovic
CS 6290 I/O and Storage Milos Prvulovic Storage Systems I/O performance (bandwidth, latency) Bandwidth improving, but not as fast as CPU Latency improving very slowly Consequently, by Amdahl s Law: fraction
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski [email protected] Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training
MS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Fall 2012 Q530. Programming for Cognitive Science
Fall 2012 Q530 Programming for Cognitive Science Aimed at little or no programming experience. Improve your confidence and skills at: Writing code. Reading code. Understand the abilities and limitations
CS 40 Computing for the Web
CS 40 Computing for the Web Art Lee January 20, 2015 Announcements Course web on Sakai Homework assignments submit them on Sakai Email me the survey: See the Announcements page on the course web for instructions
Predicting outcome of soccer matches using machine learning
Saint-Petersburg State University Mathematics and Mechanics Faculty Albina Yezus Predicting outcome of soccer matches using machine learning Term paper Scientific adviser: Alexander Igoshkin, Yandex Mobile
Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the
RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING
= + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity:
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray
Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data Željko Ivezić, Andrew J. Connolly, Jacob T. VanderPlas University of Washington and Alex
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. [email protected]
Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang [email protected] http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD
Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,
Microsoft Private Cloud Fast Track
Microsoft Private Cloud Fast Track Microsoft Private Cloud Fast Track is a reference architecture designed to help build private clouds by combining Microsoft software with Nutanix technology to decrease
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
CSci 538 Articial Intelligence (Machine Learning and Data Analysis)
CSci 538 Articial Intelligence (Machine Learning and Data Analysis) Course Syllabus Fall 2015 Instructor Derek Harter, Ph.D., Associate Professor Department of Computer Science Texas A&M University - Commerce
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
Big data in R EPIC 2015
Big data in R EPIC 2015 Big Data: the new 'The Future' In which Forbes magazine finds common ground with Nancy Krieger (for the first time ever?), by arguing the need for theory-driven analysis This future
10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html
10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html
Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
INTRODUCING AZURE MACHINE LEARNING
David Chappell INTRODUCING AZURE MACHINE LEARNING A GUIDE FOR TECHNICAL PROFESSIONALS Sponsored by Microsoft Corporation Copyright 2015 Chappell & Associates Contents What is Machine Learning?... 3 The
Journée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto [email protected] @alexcpsec @MLSecProject
Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto [email protected] @alexcpsec @MLSecProject ** WARNING ** This is a talk about DEFENDING not attacking NO
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
DSS. Diskpool and cloud storage benchmarks used in IT-DSS. Data & Storage Services. Geoffray ADDE
DSS Data & Diskpool and cloud storage benchmarks used in IT-DSS CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it Geoffray ADDE DSS Outline I- A rational approach to storage systems evaluation
Pattern-Aided Regression Modelling and Prediction Model Analysis
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Pattern-Aided Regression Modelling and Prediction Model Analysis Naresh Avva Follow this and
A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML
www.bsc.es A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML Josep Ll. Berral, Nicolas Poggi, David Carrera Workshop on Big Data Benchmarks Toronto, Canada 2015 1 Context ALOJA: framework
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
R YOU READY FOR PYTHON? Sunday 19th April, 2015
R YOU READY FOR PYTHON? Sunday 19th April, 2015 THIS IS NOT A PYTHON VS R TALK credits - https://meetmrholland.wordpress.com/2013/02/03/creative-5-tips-to-make-all-your-meetings-exactly-the-same/ WHO ARE
Machine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou [email protected] Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
8. Machine Learning Applied Artificial Intelligence
8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name
Processing of Big Data. Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015
Processing of Big Data Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015 Acknowledgement Some slides in this set of slides were provided by EMC Corporation and Sandra Avila, University
