Analysis Tools and Libraries for BigData

Similar documents
Scalable Developments for Big Data Analytics in Remote Sensing

How To Understand How Weka Works

Azure Machine Learning, SQL Data Mining and R

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Active Learning SVM for Blogs recommendation

Introduction to Machine Learning Using Python. Vikram Kamath

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Machine learning for algo trading

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Bringing Big Data Modelling into the Hands of Domain Experts

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015

The Scientific Data Mining Process

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence

CHAPTER 6 IMPLEMENTATION OF CONVENTIONAL AND INTELLIGENT CLASSIFIER FOR FLAME MONITORING

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

High-dimensional labeled data analysis with Gabriel graphs

8. Machine Learning Applied Artificial Intelligence

Data Exploration Data Visualization

Server Load Prediction

An Introduction to Data Mining

2015 The MathWorks, Inc. 1

Fast Analytics on Big Data with H20

3F3: Signal and Pattern Processing

High Productivity Data Processing Analytics Methods with Applications

Social Media Mining. Data Mining Essentials

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Machine Learning What, how, why?

Analytics on Big Data

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

Data Mining - Evaluation of Classifiers

Machine Learning Logistic Regression

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Car Insurance. Prvák, Tomi, Havri

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Origins, Evolution, and Future Directions of MATLAB Loren Shure

Maschinelles Lernen mit MATLAB

Final Project Report

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Data Analysis with MATLAB The MathWorks, Inc. 1

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

The Artificial Prediction Market

Introduction Predictive Analytics Tools: Weka

Open-Source Machine Learning: R Meets Weka

Content-Based Recommendation

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Web Document Clustering

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

An Introduction to WEKA. As presented by PACE

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Predicting Flight Delays

Machine Learning Capacity and Performance Analysis and R

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Spark: Cluster Computing with Working Sets

Machine Learning with MATLAB David Willingham Application Engineer

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Distributed Computing and Big Data: Hadoop and MapReduce

COC131 Data Mining - Clustering

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

IT services for analyses of various data samples

CS 40 Computing for the Web

Machine Learning in Spam Filtering

Office: LSK 5045 Begin subject: [ISOM3360]...

CSCI6900 Assignment 2: Naïve Bayes on Hadoop

: Introduction to Machine Learning Dr. Rita Osadchy

Chapter 6. The stacking ensemble approach

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Comparison of K-means and Backpropagation Data Mining Algorithms

The Data Mining Process

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning in Python with scikit-learn. O Reilly Webcast Aug. 2014

Predict Influencers in the Social Network

Data Mining & Data Stream Mining Open Source Tools

Machine Learning Final Project Spam Filtering

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Data Mining. Nonlinear Classification

IST565 M001 Yu Spring 2015 Syllabus Data Mining

Machine Learning. Hands-On for Developers and Technical Professionals

Spark and the Big Data Library

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

NetView 360 Product Description

Supervised Learning (Big Data Analytics)

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations

Big-data Analytics: Challenges and Opportunities

Learning is a very general term denoting the way in which agents:

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

TIETS34 Seminar: Data Mining on Biometric identification

Comparison of Data Mining Techniques used for Financial Data Analysis

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Transcription:

+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale

+ Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I will try to be in my office in this time. n abendale@vast.uccs.edu, abhijitbendale@gmail.com

+ Agenda 3 n Introduction to various Statistics and Machine Learning libraries n Interface Languages: Python, Matlab, R, Java, C++ n Machine Learning Tools : Scikits, Weka, R, Matlab n Case Study with Scikits n Dataset Repositories n Amazon Public Datasets n Government Data Resources n Tips for getting started on your Semester Project

+ Statistics and Machine Learning 4 Tools n What do you want to accomplish? n Basic Statistical Function n Advanced statistical and learning algorithm n Using plotting tools to understand trends in Data n Performance Evaluation Metrics n Using the right tools for right task with right amount of abstraction

+ The speed/flexibility tradeoff 5 Matlab Code Java Code Python Code Flexibility Machine code Digital Hardware Analog Hardware Speed

+ Theory Vs. Practice 6 n Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations. - I prove theorems. n Practitioner: I want a real-time algorithm that performs well on my problem. - I experiment. n Approach for BigData: I want combining algorithms whose performance and speed is guaranteed relative to the performance and speed of their components. - You want to do both, or atleast the latter

+ 7 Tools at hand..! For use with Python GUI based, for use with Java Machine Learning With C++ De-facto Scientific Computing Tool

+ Introduction to Scikits-Learn 8

+ Objectives 9 n Understanding Features and Feature Extraction n Knowing basics of Classification / Regression n Supervised Classification n Unsupervised Classification n Understand Difference between linearly separable and nonlinearly separable data

+ What is Machine Learning? 10 n A sub-field of Artificial Intelligence n Often called as applied statistics n Goal is to learn (understand) nature of given set of observation and build a predictive model for new observation

+ Features and Feature Extraction 11 n Most machine learning algorithms implemented in scikitlearn expect a numpy array as input X. The expected shape of X is (n_samples, n_features). [10 11 25 128 220 245.. ] nrows x ncols

+ Why feature extraction? 12 n Data often unstructured: n Text Documents n Sound n Climate measurements n Images n Videos n Transform into more structured format

+ Feature Space and Linear Decision Boundary 13 Class A Linear Decision boundary Dimension 2 Class B Dimension 1

+ Iris Flower Dataset Iris Setosa Iris Versicolor Measurements: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm 14 Iris Virginica Target Classes 0 - Iris Setosa 1 Iris Versicolor 2 Iris Virginica

+ Iris Flower Dataset Iris Setosa Iris Versicolor 15 Iris Virginica?

+ 2D PCA on Iris Dataset 16 Transforming the data so that it is easy to operate on

+ Re-using existing code-base 17

+ 18

+ 19

+ 20 No Labels..!!!

+ 21

+ 22

+ 23 More details on sklearn website There is ready made code for you to try out..!

+ 24

+ 25 Reading the documentation

+ 26

+ 27

+ 28 Which method shall I use?

+ 29

+ 30

+ Things to do: 31 n Setup your laptop/desktop with tools of choice n Install Machine Learning Library n Try out demo examples given with that library n Try out different algorithms n E.g. Support Vector Machines n Principal Component Analysis n Performance Evaluation Metrics (e.g. Area Under the Curve, Average Precision, Average Classification Error etc.) n Familiarize yourself with reading documentation and understanding parameters

+ Introduction to Weka 32

+ Machine Learning with Weka n Comprehensive set of tools: n Pre-processing and data analysis n Learning algorithms (for classification, clustering, etc.) n Evaluation metrics n Three modes of operation: n GUI n command-line (not discussed today) n Java API (not discussed today) 33

+ Sample database: the sensus data ( adult ) n Binary classification: n Task: predict whether a person earns > $50K a year n Attributes: age, education level, race, gender, etc. n Attribute types: nominal and numeric n Training/test instances: 32,000/16,300 n Original UCI data available at: ftp.ics.uci.edu/pub/machine-learning-databases/adult n Data already converted to ARFF: http://www1.cs.columbia.edu/~galley/weka/datasets/ 34

+ Weka Explorer What we will use today in Weka: I. Pre-process: n Load, analyze, and filter data II. III. n n n Visualize: Compare pairs of attributes Plot matrices Classify: All algorithms seem in class (Naive Bayes, etc.) 35

+ filter load analyze 36

+ visualize attributes 37

+ Linear classifiers n Prediction is a linear function of the input n in the case of binary predictions, a linear classifier splits a high-dimensional input space with a hyperplane (i.e., a plane in 3D, or a straight line in 2D). n Many popular effective classifiers are linear: perceptron, linear SVM, logistic regression (a.k.a. maximum entropy, exponential model). 38

+ Comparing classifiers n Results on adult data n Majority-class baseline: 76.51% (always predict <=50K) weka.classifier.rules.zeror n Naive Bayes: 79.91% weka.classifier.bayes.naivebayes n Linear classifier: 78.88% weka.classifier.function.logistic n Decision trees: 79.97% weka.classifier.trees.j48 39

+ Why this difference? n A linear classifier in a 2D space: n it can classify correctly ( shatter ) any set of 3 points; n not true for 4 points; n we say then that 2D-linear classifiers have capacity 3. n A decision tree in a 2D space: n can shatter as many points as leaves in the tree; n potentially unbounded capacity! (e.g., if no tree pruning) 40

+ Weka Experimenter n If you need to perform many experiments: n Experimenter makes it easy to compare the performance of different learning schemes n Results can be written into file or database n Evaluation options: cross-validation, learning curve, etc. n Can also iterate over different parameter settings n Significance-testing built in. 41

42

43

44

45

46

47

48

49

50

51

+ Beyond the GUI n How to reproduce experiments with the command-line/api n GUI, API, and command-line all rely on the same set of Java classes n Generally easy to determine what classes and parameters were used in the GUI. n Tree displays in Weka reflect its Java class hierarchy. > java -cp ~galley/weka/weka.jar weka.classifiers.trees.j48 C 0.25 M 2 -t <train_arff> -T <test_arff> 52

+ Matlab and BigData 53

+ BigData with Matlab 54 Advantages Disadvantages n Easy to use n Great for data analysis: Really nice plotting tools n Access to wide array of toolboxes n Parallel Computing toolbox n GPU computing n Almost all machine learning algorithms are available n Costly n Parallelizing means need multiple licenses n Rarely used as a backend in production n Memory heavy n Proprietary

+ Machine Learning in Matlab 55

+ Duct-taping external libraries 56 Bit tedious and time consuming. You need lot of experience doing this to get good performance.

+ Matlab Tools for BigData 57 Single Machine Single Processor Single Machine Use cores / processors efficiently Harness power of GPUs Use all processors on a machines: e.g. efficiently using 8-core machine Distribute across the cloud: e.g. using Amazon EC2 cloud

+ Statistics Toolbox in MATLAB 58 n Iris Data Analysis using matlab demo

+ 59

+ 60

+ Recap 61

+ Don t need to know everything..! 62 n Pick language of your choice n Use machine learning tools related to that n Install ML library n Try out examples given with the library n That s it..!

+ Publicly Available Large Datasets 63

+ Amazon Public Datasets 64

+ Amazon Public Datasets 65 n Wide Range of Domains n Astronomy, Biology, Climate, Economics, Mathematics, Encyclopedic, Geographic etc n Wide Range of Sizes n 5 GB to 100s of TB n Wide Range of Variations n Raw Data n Annotated Data

+ Example Datasets 66 Human Microbiome Project University of Florida Sparse Matrix Collection Wikipedia Traffic Statistics OpenStreet Map

+ 67

+ 68 From the same Quora page

+ Tips for Getting Started with your 69 semester Project Data Source Understand Nature of Data Identify Problem Domain Identify dataset you want to work with

+ Tips for Getting Started with your semester Project 70 Identify Problem Domain Data Source Identify dataset you want to work with Understand Nature of Data Develop Analysis Techniques Iterate till you get it right Can start right away Performance Analysis, Conclusions, Deploy

+ Tips for Getting Started with your semester Project 71 Identify Problem Domain Data Source Identify dataset you want to work with Understand Nature of Data Develop Analysis Techniques Iterate till you get it right Get Feedback from instructors BEFORE Project Proposal Presentation Performance Analysis, Conclusions, Deploy

+ Tips for Getting Started with your semester Project 72 Identify Problem Domain Data Source Identify dataset you want to work with Understand Nature of Data Develop Analysis Techniques Iterate till you get it right Develop a detailed design strategy, partial implementation. Get Feedback DURING proposal presentation Performance Analysis, Conclusions, Deploy

+ Tips for Getting Started with your semester Project 73 Identify Problem Domain Data Source Identify dataset you want to work with Understand Nature of Data Develop Analysis Techniques Iterate till you get it right Read data, develop parsing scripts. Understand disk i/o issues related to your data when reading large files Performance Analysis, Conclusions, Deploy

+ 74

+ Tips for Getting Started with your semester Project 75 Identify Problem Domain Data Source Identify dataset you want to work with Understand Nature of Data Develop Analysis Techniques Iterate till you get it right Create a small subset of the data. Quickly prototype your idea Test on small subset. Iterate Deploy it to scale Performance Analysis, Conclusions, Deploy

+ 76 Highly Recommended..!

+ Resources 77 http://snap.stanford.edu/class/cs341-2012/reports/index.html

+ Example Project Report 78

+ Example Project Report 79 Add concrete results from your analysis

+ Write Report in IEEE Conference Format 80

+ Deliverables 81 n During Proposal Presentation n 1 Page Report n Project Proposal Presentation n During Final Presentation n Detailed Report (IEEE Conference Format) n Presentation n You might be asked to show demo, data, results, code

+ 82

+ Use emails aggressively 83 n Terry Boult: tboult@vast.uccs.edu n Abhijit Bendale abendale@vast.uccs.edu, abhijitbendale@gmail.com n We know it is a tough course n Start early n Seek more help earlier in the semester and gain more independence later on n There is not much homework: doesn t mean you should not do anything.. n Use code given in documentation to develop your understanding n For each library/tool there are tons of tutorials/videos available online.