CS535 BIG DATA, W6.B: FAQs, Canopy Clustering with MapReduce, Classifier Evaluation, and Validation
http://www.cs.colostate.edu/~cs535

FAQs

- Please prepare for the last-minute rush.
- Store your output files safely.
- Partial score will be given for output produced from less than 50 GB of input.

Today's Topics

- Running the k-means algorithm using the Canopy algorithm and MapReduce
- Evaluation methods for classification models
- Validation techniques for models

General Canopy Clustering Algorithm

Using two thresholds, T1 (the loose distance) and T2 (the tight distance), where T1 > T2:

1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new canopy.
3. For each point left in the set, assign it to the new canopy if its distance to the canopy center is less than the loose distance T1.
4. If the distance of the point is additionally less than the tight distance T2, remove it from the original set.
5. Repeat steps 2-3-4 until there are no more data points in the set to cluster.

(A minimal code sketch of this procedure follows this section.)

Canopy Clustering Using MapReduce: Generating Input Data (1/2)

- Each mapper performs canopy clustering on the points in its input set (non-overlapping sampled points).
- The reducer clusters the canopy centers to produce the final canopy centers; it performs canopy clustering over the canopy centers.
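To make the two-threshold procedure concrete, here is a minimal single-machine sketch in Python. It assumes Euclidean distance over numeric tuples; the function names and the example thresholds are illustrative, not taken from the course materials.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_clustering(points, t1, t2, distance=euclidean):
    """Group points into canopies using a loose threshold t1 and a tight
    threshold t2 (t1 > t2). Points within t2 of a canopy center are removed
    from further consideration; points within t1 join the canopy but may
    still seed or join other canopies."""
    assert t1 > t2, "loose threshold must exceed tight threshold"
    remaining = list(points)
    canopies = []                              # list of (center, members)
    while remaining:
        center = remaining.pop(0)              # step 2: start a new canopy
        members = [center]
        still_remaining = []
        for p in remaining:                    # steps 3-4
            d = distance(center, p)
            if d < t1:
                members.append(p)              # within loose distance: join canopy
            if d >= t2:
                still_remaining.append(p)      # not within tight distance: keep for later
        remaining = still_remaining            # step 5: repeat until the set is empty
        canopies.append((center, members))
    return canopies

# Example usage with 2-D points (illustrative thresholds):
pts = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.8), (9.0, 0.1)]
for center, members in canopy_clustering(pts, t1=3.0, t2=1.0):
    print(center, "->", members)
```

In the MapReduce formulation above, each mapper would run this routine over its own partition of the data and emit only the canopy centers, and the reducer would run it once more over the collected centers.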

Generating Input Data (2/2)

- Generate samples in two groups: green and red.

Generating Canopy Centers (Red)

- For the red data; performed in a mapper.

Generating Canopy Centers (Green)

- For the green data; performed in a mapper.

Collecting Canopy Centers (Reducer)

Perform Canopy Clustering (Reducer)

Final Canopy Centers
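The mapper/reducer pipeline illustrated by these slides can be imitated on a single machine. The sketch below reuses the hypothetical canopy_clustering helper from the previous sketch; the partition contents and thresholds are made-up examples.

```python
def mapreduce_canopy(partitions, t1, t2):
    """Simulate the MapReduce canopy pipeline on one machine.
    Each 'mapper' clusters its own partition and emits only canopy centers;
    the 'reducer' clusters those centers to produce the final canopy centers.
    Assumes canopy_clustering from the earlier sketch is in scope."""
    emitted_centers = []
    for part in partitions:                    # map phase: one call per partition
        for center, _members in canopy_clustering(part, t1, t2):
            emitted_centers.append(center)
    # Reduce phase: canopy clustering over the collected canopy centers.
    return [center for center, _ in canopy_clustering(emitted_centers, t1, t2)]

# Example: two partitions standing in for the red and green samples.
red = [(0.0, 0.0), (0.4, 0.3), (5.0, 5.1)]
green = [(0.2, 0.1), (5.1, 4.9), (9.0, 0.0)]
print(mapreduce_canopy([red, green], t1=3.0, t2=1.0))
```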

Creating Canopies: Running k-means over Canopies

- Is this good enough?
- Selecting the k-means centroids: the elbow method.
- Perform k-means over each canopy; centroids outside a canopy will not be considered.
- Iterate until the centroid locations converge.
- What if a centroid includes multiple canopies? Your computation should consider the merging and selection process.

Plain Accuracy: Evaluating Classifiers

- Classifier accuracy is a general measure of classifier performance.
- Accuracy = (number of correct decisions made) / (total number of decisions made), i.e., 1 - error rate.
- Pros: very easy to measure.
- Cons: it does not reflect realistic cases (see the unbalanced-class discussion below).

The Confusion Matrix

- A type of contingency table: with n classes, it is an n x n matrix.
- The columns are labeled with actual classes, the rows with predicted classes.
- It separates out the decisions made by the classifier, showing how one class is being confused with another, so different sorts of errors may be dealt with separately.

                 p (actual positive)   n (actual negative)
Y (predicted)    True positive         False positive
N (predicted)    False negative        True negative

Problems with Unbalanced Classes

- Consider a classification problem where one class is rare: sifting through a large population of normal entities to find a relatively small number of unusual ones (e.g., looking for defrauded customers or defective parts).
- The class distribution is unbalanced, or skewed.

Confusion Matrix of A
                 p      n
Y (predicted)  500    200
N (predicted)    0    300

Confusion Matrix of B
                 p      n
Y (predicted)  300      0
N (predicted)  200    500

Which model is better? (A short sketch computing metrics from these matrices follows this section.)
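As a quick check of the two matrices above, here is a small Python sketch that derives plain accuracy (and, anticipating the next slides, precision and recall) from the four confusion-matrix counts; the helper name is illustrative.

```python
def metrics(tp, fp, fn, tn):
    """Compute plain accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Model A: Y-row (500, 200), N-row (0, 300)   -> tp=500, fp=200, fn=0,   tn=300
# Model B: Y-row (300, 0),   N-row (200, 500) -> tp=300, fp=0,   fn=200, tn=500
for name, counts in [("A", (500, 200, 0, 300)), ("B", (300, 0, 200, 500))]:
    acc, prec, rec = metrics(*counts)
    print(f"Model {name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Both models score the same plain accuracy, 0.80, on these counts, which is exactly why the following slides argue that accuracy alone is misleading for unbalanced classes.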

Why Accuracy Is Misleading

[Figure: the error rates of models A and B (confusion matrices repeated from the previous slide) compared under a balanced population (50% / 50%) and under the true, skewed population. Which model is better?]

F-measure (F score)

- Summarizes the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- True positive rate = TP / (TP + FN)
- False negative rate = FN / (TP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F-measure = 2 * (precision * recall) / (precision + recall)
- Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP.
- (A short sketch using these formulas appears at the end of this section.)

Why Validation? (1/2)

Validation techniques: the process for model selection and performance estimation.

Model selection (fitting the model)
- Most pattern recognition techniques have one or more free parameters: the number of neighbors in a k-nearest-neighbor (kNN) classification rule; the network size, learning parameters, and weights in MLPs (multi-layer perceptrons).
- How do we select the optimal parameter(s) or model for a given classification problem?

Why Validation? (2/2)

Performance estimation
- Once we have chosen a model, how do we estimate its performance?
- Performance is typically measured by the true error rate: the classifier's error rate on the entire population.

Challenges (1/2)

- If we had access to an unlimited number of examples, these questions would have a straightforward answer: choose the model that provides the lowest error rate on the entire population; that error rate is, of course, the true error rate.
- In real applications we only have access to a subset of examples, usually smaller than we would like.
- What if we use the entire training data to select our classifier and estimate the error rate? The final model will normally overfit the training data, and we would already have used the test data to train the model.
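Returning to the F-measure slide above, the formula is a one-liner; the counts plugged in below come from the confusion matrices of models A and B shown earlier.

```python
def f_measure(precision, recall):
    """F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Model A: precision = 500/700 ~ 0.714, recall = 500/500 = 1.0
# Model B: precision = 300/300 = 1.0,   recall = 300/500 = 0.6
print("F(A) =", round(f_measure(500 / 700, 1.0), 3))   # ~0.833
print("F(B) =", round(f_measure(1.0, 300 / 500), 3))   # 0.75
```

By F-measure, A (about 0.83) edges out B (0.75) on these counts, even though their plain accuracies are identical.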

Challenges (2/2)

- This problem is more pronounced with models that have a large number of parameters.
- The error rate estimate will be overly optimistic (lower than the true error rate); in fact, it is not uncommon to see 100% correct classification on training data.
- A much better approach is to split the training data into disjoint subsets: the holdout method.

The Holdout Method

- Split the dataset into two groups:
  - Training set: used to train the model.
  - Test set: used to estimate the error rate of the trained model.

  [ Training Set | Test Set ]  spanning the total number of examples

- A typical application of the holdout method is determining a stopping point for the back-propagation error.

Drawbacks of the Holdout Method

- For a sparse dataset, we may not be able to set aside a portion of the dataset for testing.
- Depending on where the split happens, the estimate of the error can be misleading: the sample might not be representative.
- The limitations of the holdout method can be overcome by a family of resampling methods, at the cost of more computational expense: stratified sampling and cross validation (random subsampling, k-fold cross validation, leave-one-out).

Random Subsampling

- K data splits of the dataset.
- Each split randomly selects a (fixed) number of examples without replacement.
- For each data split, retrain the classifier from scratch with the training examples and estimate E_i with the test examples.

  [Figure: experiments 1, 2, and 3 each hold out a different randomly chosen set of test examples from the total number of examples.]

True Error Estimate

- The true error estimate is obtained as the average of the separate estimates E_i.
- This estimate is significantly better than the holdout estimate:

  E = (1/K) * (E_1 + E_2 + ... + E_K)

k-fold Cross-Validation

- Create a k-fold partition of the dataset.
- For each of the K experiments, use K-1 folds for training and the remaining fold for testing. (A minimal sketch follows this section.)

  [Figure: experiments 1 through 4 each hold out a different fold of the total number of examples as the test set.]
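A minimal k-fold cross-validation loop, matching the estimate E = (1/K) * (E_1 + ... + E_K) above, might look like the following; train_fn, error_fn, and the toy majority-class model are placeholders, not part of the course materials.

```python
import random

def k_fold_error(examples, k, train_fn, error_fn, seed=0):
    """Estimate the true error rate as the average of the per-fold errors E_i.
    train_fn(train_set) returns a fitted model; error_fn(model, test_set) returns E_i."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]       # k disjoint folds covering all examples
    errors = []
    for i in range(k):
        test_set = folds[i]                      # the remaining fold is used for testing
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]  # K-1 folds
        model = train_fn(train_set)              # retrain from scratch for each experiment
        errors.append(error_fn(model, test_set))
    return sum(errors) / k                       # average of the separate estimates

# Toy usage: a majority-class "model" over (features, label) pairs.
data = [((i,), i % 3 == 0) for i in range(30)]
train = lambda s: max(set(lbl for _, lbl in s), key=[lbl for _, lbl in s].count)
error = lambda m, s: sum(lbl != m for _, lbl in s) / len(s)
print(k_fold_error(data, k=10, train_fn=train, error_fn=error))
```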

True Error Estimate

- k-fold cross validation is similar to random subsampling.
- The advantage of k-fold cross validation is that all the examples in the dataset are eventually used for both training and testing.
- The true error is estimated as the average error rate:

  E = (1/K) * (E_1 + E_2 + ... + E_K)

Leave-One-Out Cross Validation

- Leave-one-out is the degenerate case of k-fold cross validation: k is chosen as the total number of examples.
- For a dataset with N examples, perform N experiments: use N-1 examples for training and the remaining example for testing.

  [Figure: experiments 1 through N each hold out a single test example from the total number of examples.]

True Error Estimate

- The average error rate on the test examples:

  E = (1/N) * (E_1 + E_2 + ... + E_N)

How Many Folds Are Needed? (1/2)

With a large number of folds:
- The bias of the true error rate estimator will be small; the estimate will be very accurate.
- The variance of the true error rate estimator will be large.
- The computational time will be very large (many experiments).

With a small number of folds:
- The number of experiments is low, so computation time is reduced.
- The variance of the estimator will be small.
- The bias of the estimator will be large.

How Many Folds Are Needed? (2/2)

- The choice of the number of folds depends on the size of the dataset.
- For large datasets, even 3-fold cross validation will be quite accurate.
- For very sparse datasets, you may have to consider leave-one-out to get the maximum number of experiments.
- A common choice for k-fold cross validation is k = 10.

Three-Way Data Splits

- If model selection and true error estimation are computed simultaneously, the data needs to be divided into three disjoint sets:
  - Training set: e.g., to find the optimal weights.
  - Validation set: a set of examples used to tune the parameters of a model, e.g., to find the optimal number of hidden units or determine a stopping point for the back-propagation algorithm.
  - Test set: used only to assess the performance of a fully trained model. After assessing the final model with the test set, you must not further tune the model.
- (A small sketch of this split and its use follows this section.)
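The three-way split, and the rule that the test set is touched only once, can be sketched as follows; the 60/20/20 proportions, the candidates list, and the train_fn/error_fn placeholders are illustrative assumptions.

```python
import random

def three_way_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    """Divide data into disjoint training, validation, and test sets."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    return data[n_test + n_val:], data[n_test:n_test + n_val], data[:n_test]

def select_and_assess(examples, candidates, train_fn, error_fn):
    """candidates: iterable of parameter settings; train_fn(params, data) -> model;
    error_fn(model, data) -> error rate. The validation set picks the best
    parameters; the test set is used exactly once, on the final model."""
    train_set, val_set, test_set = three_way_split(examples)
    # Tune on the validation set.
    best_params = min(candidates,
                      key=lambda p: error_fn(train_fn(p, train_set), val_set))
    # Retrain the chosen model on training + validation data.
    final_model = train_fn(best_params, train_set + val_set)
    # Assess once on the held-out test set; do not tune further after this.
    return final_model, error_fn(final_model, test_set)
```

The seven-step procedure on the next slide spells out the same workflow.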

Why Separate Test and Validation Sets?

- The error rate estimate of the final model on validation data will be biased (smaller than the true error rate), because the validation set is used to select the final model.

Procedure

1. Divide the available data into training, validation, and test sets.
2. Select an architecture and training parameters.
3. Train the model using the training set.
4. Evaluate the model using the validation set.
5. Repeat steps 2 through 4 using different architectures and training parameters.
6. Select the best model and train it using data from the training and validation sets.
7. Assess this final model using the test set.

Term Project Deliverable 1: Proposal

Term Project Proposal Contents

1. Title of your project
2. Problem formulation
3. Your strategy to solve the problem
4. Functions targeted by your software
5. Plan for testing
6. Evaluation method
7. Project timeline (weekly plan)
8. Bibliography

1. Title

- The title should be concise and self-descriptive.

2. Problem Formulation

- The proposal should clearly identify the problem. It should include at least one or two carefully crafted paragraphs that state and highlight the problem.
- The problem formulation should be able to answer the following questions:
  - What is the problem you are solving? (This should also include the background for the problem.)
  - Why is it interesting as a Big Data problem, and who would use it if it were solved?

3. Your Strategy to Solve the Problem

- Describe your proposed approach to solve the problem. The description of the strategy should include:
  - the algorithms/techniques/models you plan to use in this project,
  - the framework you plan to use in this project,
  - the dataset you plan to use in this project.
- Please note that you are also required to produce software as the final output of this project.
- You are NOT ALLOWED to reuse or extend your projects from any other courses (even the 400-level Big Data course).

4. Functions Targeted by Your Software

- Your proposal should include a software design to provide a more specific view of your project. A simple description of the major functions should be enough for this section.
- As your development proceeds, this may be updated to reflect changes to these functions, including additions, modifications, and removals.
- What functions does your software provide to your users? What will be the input and output of each function?

5. Plan for Testing

- Your software should be tested before you provide the final results and presentation.
- What is your plan for testing your software? What will be your test data? What will be your testing scenario? How will you collect your test data? How will you deploy your software?
- This is different from the evaluation of your project; it is functional testing. E.g., create a test file (xyz% of your dataset) using random sampling and test the software on it.

6. Evaluation Method

- The proposal should include an evaluation plan, including the metrics that you will use to determine whether you have succeeded or not.
- If you come up with a metric, also provide an intuitive feel for what this metric captures and why you think it is appropriate.
- For example, if your project involves classification, you can list the accuracy measures that will be used and provide justification. You should also state the target accuracy for your project.

7. Project Timeline (Weekly Plan)

- You should provide a table with a weekly plan to complete the term project.
- If you have a teammate, the plan should also include information about your respective roles.

8. Bibliography

- All references must be cited in the report, including: the authors' names, the titles of the works, the names of the publishers, the date (or year) the copies were published, and the page numbers of your sources (if available).

Submission

- Please submit only one copy per team.
- This document should be 1,200 ~ 1,800 words. Do not exceed the limit.