http://www.cs.colostate.edu/~cs535

CS535 BIG DATA
Computer Science, Colorado State University

FAQs
- Please prepare for the last-minute rush
- Store your output files safely
- Partial score will be given for the output from less than 50GB input

Today's topics
- Running the k-means algorithm using the Canopy algorithm and MapReduce
- Evaluation methods for classification models
- Validation techniques for models

General Canopy Clustering Algorithm
Using two thresholds, T1 (the loose distance) and T2 (the tight distance), where T1 > T2:
1. Begin with the set of data points to be clustered
2. Remove a point from the set, beginning a new canopy
3. For each point left in the set, assign it to the new canopy if its distance to the canopy's first point is less than the loose distance T1
4. If the distance of the point is additionally less than the tight distance T2, remove it from the original set
5. Repeat steps 2 through 4 until there are no more data points in the set to cluster

Canopy Clustering using MapReduce
- Each mapper performs canopy clustering on the points in its input set
- Reducer clusters the canopy centers to produce the final canopy centers
  - Performs canopy clustering over the canopy centers

Generating Input data (1/2)
- Non-overlapping sampled points
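The five steps of the general canopy clustering algorithm above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the course's reference implementation: the function name `canopy_clustering`, the use of Euclidean distance, and the random choice of each new canopy's first point are all assumptions made for the example.

```python
import math
import random


def canopy_clustering(points, t1, t2, seed=0):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose distance and t2 the tight distance, with t1 > t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("the loose distance t1 must exceed the tight distance t2")
    rng = random.Random(seed)
    remaining = list(points)                      # step 1: points still to cluster
    canopies = []
    while remaining:                              # step 5: repeat until the set is empty
        # Step 2: remove one point and start a new canopy around it.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)              # Euclidean distance (assumption)
            if d < t1:
                members.append(p)                 # step 3: inside the loose distance
            if d >= t2:
                still_remaining.append(p)         # step 4: only points within t2 are removed
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0)]
canopies = canopy_clustering(points, t1=2.0, t2=1.0)
```

A point that lies between T2 and T1 of a center joins that canopy but stays in the set, so it can also join, or start, another canopy. That is why canopies may overlap.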
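Generating Input data (2/2)
- Generate samples: green and red
[Figure: sample points divided into a green set and a red set]

Generating Canopy centers (Red)
- For the red data, performed in a Mapper
[Figure: canopy centers of the red data]

Generating Canopy centers (Green)
- For the green data, performed in a Mapper
[Figure: canopy centers of the green data]

Collecting Canopy Centers (Reducer)
[Figure: canopy centers collected from both Mappers]

Perform Canopy Clustering (Reducer)
[Figure: canopy clustering over the collected centers]

Final Canopy centers
[Figure: final canopy centers]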
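The mapper and reducer roles walked through in the figures above can be imitated in plain Python without a Hadoop cluster. The sketch below is an assumption-laden illustration: `canopy_centers`, `map_phase`, and `reduce_phase` are hypothetical names, the two small point lists stand in for the red and green samples, and each new center is simply the first remaining point. Only the tight distance T2 appears because picking centers depends on T2 alone; the loose distance T1 matters later, when points are assigned to canopies.

```python
import math


def canopy_centers(points, t2):
    """Return canopy centers; points within the tight distance t2 of a center are dropped."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)      # first remaining point starts a canopy (assumption)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return centers


def map_phase(partition, t2):
    """Mapper: canopy clustering over the points of one input split."""
    return canopy_centers(partition, t2)


def reduce_phase(collected_centers, t2):
    """Reducer: canopy clustering over the centers emitted by all mappers."""
    return canopy_centers(collected_centers, t2)


# Two non-overlapping samples standing in for the red and green data.
red = [(0.0, 0.0), (0.4, 0.1), (5.0, 5.0)]
green = [(0.2, 0.3), (5.2, 4.9), (9.0, 0.0)]

collected = [c for part in (red, green) for c in map_phase(part, t2=1.0)]
final_centers = reduce_phase(collected, t2=1.0)
```

In this toy run the red mapper emits two centers and the green mapper three. The reducer then merges the nearby pairs, leaving three final canopy centers.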
Creating Canopies
- Is this good enough?
[Figure: canopies created around the final canopy centers]

Running k-means over Canopies
- Selecting k-means centroids
  - Elbow method
- Performs k-means over each canopy
  - Centroids outside the canopy will not be considered
- Iterate until the centroid locations converge
- What if a centroid covers multiple canopies?
  - Your computation should consider the merging and selection process

Evaluating Classifiers

Plain Accuracy
- Classifier accuracy
  - General measure of classifier performance
  - Accuracy = (Number of correct decisions made) / (Total number of decisions made) = 1 - error rate
- Pros
  - Very easy to measure
- Cons
  - Does not account for realistic cases, such as unbalanced classes

The Confusion Matrix
- A type of contingency table
- n classes: n x n matrix
  - The columns are labeled with actual classes
  - The rows are labeled with predicted classes
- Separates out the decisions made by the classifier
  - How one class is being confused for another
  - Different sorts of errors may be dealt with separately

                 p (positive)     n (negative)
Y (predicted)    True positive    False positive
N (predicted)    False negative   True negative

Problems with Unbalanced Classes
- Consider a classification problem where one class is rare
  - Sifting through a large population of normal entities to find a relatively small number of unusual ones
  - Looking for defrauded customers, or defective parts
- The class distribution is unbalanced or skewed

Which model is better?

Confusion Matrix of A        Confusion Matrix of B
       p     n                      p     n
Y    500   200               Y    300     0
N      0   300               N    200   500
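As a quick check on the question above, plain accuracy can be computed for both models. The sketch assumes the two matrices are read with rows as predictions (Y, N) and columns as actual classes (p, n), so the four cells are TP, FP, FN, and TN. The function and variable names are illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Plain accuracy: correct decisions divided by all decisions."""
    return (tp + tn) / (tp + fp + fn + tn)


# Cells read as rows = predicted (Y, N), columns = actual (p, n).
model_a = {"tp": 500, "fp": 200, "fn": 0, "tn": 300}
model_b = {"tp": 300, "fp": 0, "fn": 200, "tn": 500}

acc_a = accuracy(**model_a)   # (500 + 300) / 1000 = 0.8
acc_b = accuracy(**model_b)   # (300 + 500) / 1000 = 0.8

# The two models make different kinds of mistakes.
errors_a = (model_a["fp"], model_a["fn"])   # (200, 0): only false positives
errors_b = (model_b["fp"], model_b["fn"])   # (0, 200): only false negatives
```

Both models score 80% on this population, yet A errs only through false positives and B only through false negatives. A single accuracy number hides that difference.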
Why accuracy is misleading
- Which model is better?
[Figure: errors of models A and B on a balanced population (50% p, 50% n) and on the true population (10% p, 90% n), using the same confusion matrices of A and B as on the previous slide]

F-measure (F score)
- Summarizes the confusion matrix
  - True positives (TP), false positives (FP), true negatives (TN), and false negatives (FN)
- True positive rate = TP / (TP + FN)
- False negative rate = FN / (TP + FN)
- F-measure = 2 (precision x recall) / (precision + recall)
  - precision = TP / (TP + FP)
  - recall = TP / (TP + FN)
- Accuracy = (TP + TN) / (P + N), where P and N are the total numbers of actual positives and negatives

Validation techniques

Why validation? (1/2)
- Process for model selection and performance estimation
- Model selection (fitting the model)
  - Most pattern recognition techniques have one or more free parameters
    - The number of neighbors in a kNN classification rule
    - The network size, learning parameters and weights in MLPs (multi-layer perceptrons)
  - How do we select the optimal parameter(s) or model for a given classification problem?

Why validation? (2/2)
- Performance estimation
  - Once we have chosen a model, how do we estimate its performance?
  - Performance is typically measured by the true error rate
    - The classifier's error rate on the entire population

Challenges (1/2)
- If we had access to an unlimited number of examples, these questions would have a straightforward answer
  - Choose the model that provides the lowest error rate on the entire population
  - Of course, that error rate is the true error rate
- In real applications we only have access to a subset of examples, usually smaller than we wanted
- What if we use the entire training data to select our classifier and estimate the error rate?
  - The final model will normally overfit the training data
  - The data used to estimate the error rate has already been used to train the model
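The last challenge above can be made concrete with a toy experiment. The sketch below is an illustration under stated assumptions, not part of the lecture: it uses synthetic one-dimensional data from two overlapping Gaussian classes, a 1-nearest-neighbor rule, and hypothetical helper names (`make_data`, `predict_1nn`, `error_rate`). It compares the error measured on the training data itself with the error on freshly drawn examples.

```python
import random


def make_data(n, rng):
    """Two overlapping 1-D classes: label 0 ~ N(0, 1), label 1 ~ N(1, 1)."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        data.append((rng.gauss(float(label), 1.0), label))
    return data


def predict_1nn(train, x):
    """Return the label of the training example closest to x."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]


def error_rate(train, examples):
    """Fraction of examples misclassified by 1-NN built on train."""
    wrong = sum(1 for x, y in examples if predict_1nn(train, x) != y)
    return wrong / len(examples)


rng = random.Random(535)
train = make_data(200, rng)
fresh = make_data(1000, rng)      # stand-in for unseen examples from the population

train_error = error_rate(train, train)   # every point is its own nearest neighbor
fresh_error = error_rate(train, fresh)   # error on examples the model has not seen
```

Because each training point is its own nearest neighbor, the error measured on the training data is zero, while the classes overlap enough that fresh examples are regularly misclassified. The training-data estimate is therefore far too optimistic.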
Challenges (2/2)
- This problem is more pronounced with models that have a large number of parameters
- The error rate estimate will be overly optimistic (lower than the true error rate)
  - In fact, it is not uncommon to have 100% correct classification on training data
- A much better approach is to split the training data into disjoint subsets: the holdout method

The Holdout Method
- Split the dataset into two groups
  - Training set: used to train the model
  - Test set: used to estimate the error rate of the trained model
[Figure: the total number of examples divided into a training set and a test set]
- A typical application of the holdout method is determining a stopping point for back propagation, based on the error on the held-out data

Drawbacks of the holdout method
- Drawbacks
  - For a sparse dataset, we may not be able to set aside a portion of the dataset for testing
  - Depending on where the split happens, the estimate of the error can be misleading
    - The sample might not be representative
- The limitations of the holdout method can be overcome with a family of resampling methods, at more computational expense
  - Stratified sampling
  - Cross validation
    - Random subsampling
    - K-fold cross validation
    - Leave-one-out cross validation

Random Subsampling
- Performs K data splits of the dataset
  - Each split randomly selects a (fixed) number of examples without replacement
  - For each data split, retrain the classifier from scratch with the training examples and estimate E_i with the test examples
[Figure: Experiments 1 to 3, each with its test examples marked within the total number of examples]

True Error Estimate
- The true error estimate is obtained as the average of the separate estimates E_i
  - This estimate is significantly better than the holdout estimate

  $E = \frac{1}{K} \sum_{i=1}^{K} E_i$

k-fold Cross-validation
- Create a K-fold partition of the dataset
  - For each of the K experiments, use K-1 folds for training
  - Use the remaining fold for testing
[Figure: Experiments 1 to 4, each with a different fold of test examples within the total number of examples]
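The k-fold procedure above can be sketched as follows. This is an illustrative outline rather than a definitive implementation: `k_fold_splits` and `cross_validate` are hypothetical helpers, and the majority-class "model" and the toy dataset exist only to keep the example self-contained.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs, one per fold, over indices 0..n-1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(examples, k, fit, error_rate):
    """Average the per-fold error estimates E_i over the k experiments."""
    errors = []
    for train_idx, test_idx in k_fold_splits(len(examples), k):
        model = fit([examples[j] for j in train_idx])          # train on K-1 folds
        errors.append(error_rate(model, [examples[j] for j in test_idx]))
    return sum(errors) / k


# Toy stand-in model: always predict the most common training label.
def fit_majority(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)


def majority_error(model, test):
    return sum(1 for _, y in test if y != model) / len(test)


data = [(i, 0) for i in range(30)] + [(i, 1) for i in range(10)]
cv_error = cross_validate(data, k=4, fit=fit_majority, error_rate=majority_error)
```

With 30 examples of class 0 and 10 of class 1, the majority rule always predicts 0, so the averaged estimate comes out at the share of class 1 examples (0.25). Every example is used for testing exactly once and for training in the other three experiments.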
True error estimate
- k-fold cross validation is similar to random subsampling
- The advantage of k-fold cross validation
  - All the examples in the dataset are eventually used for both training and testing
- The true error is estimated as the average error rate

  $E = \frac{1}{K} \sum_{i=1}^{K} E_i$

Leave-one-out Cross Validation
- Leave-one-out is the degenerate case of k-fold cross validation
  - K is chosen as the total number of examples
- For a dataset with N examples, perform N experiments
  - Use N-1 examples for training and the remaining example for testing
[Figure: Experiments 1, 2, ..., N, each with a single test example within the total number of examples]

True error estimate
- The average error rate on test examples

  $E = \frac{1}{N} \sum_{i=1}^{N} E_i$

How many folds are needed? (1/2)
- With a large number of folds
  - The bias of the true error rate estimator will be small
    - The estimate will be very accurate
  - The variance of the true error rate estimator will be large
  - The computational time will be very large
    - Many experiments
- With a small number of folds
  - The number of experiments is low
    - Computation time is reduced
  - The variance of the estimator will be small
  - The bias of the estimator will be large

How many folds are needed?
(2/2)
- The choice of the number of folds depends on the size of the dataset
  - For large datasets, even 3-fold cross validation will be quite accurate
  - For very sparse datasets, you may have to consider leave-one-out
    - To get the maximum number of experiments
- A common choice for k-fold cross validation is k=10

Three-way data splits
- If model selection and true error estimates are computed simultaneously, the data needs to be divided into three disjoint sets
- Training set
  - E.g., to find the optimal weights
- Validation set
  - A set of examples used to tune the parameters of a model
  - To find the optimal number of hidden units or determine a stopping point for the back propagation algorithm
- Test set
  - Used only to assess the performance of a fully-trained model
  - After assessing the final model with the test set, you must not further tune the model
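A three-way split and its use for model selection can be sketched as below. Everything here is an assumption made for illustration: the 60/20/20 proportions, the synthetic one-dimensional data, the kNN classifier, the candidate values of k, and the helper names (`three_way_split`, `knn_predict`, `error_rate`) are not taken from the slides.

```python
import random


def three_way_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle and cut the data into disjoint training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training examples (labels are 0 or 1)."""
    neighbors = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    votes = sum(y for _, y in neighbors)
    return 1 if 2 * votes > k else 0


def error_rate(train, examples, k):
    wrong = sum(1 for x, y in examples if knn_predict(train, x, k) != y)
    return wrong / len(examples)


# Synthetic data: label 0 ~ N(0, 1), label 1 ~ N(1, 1).
rng = random.Random(535)
data = [(rng.gauss(float(y), 1.0), y) for y in (rng.randrange(2) for _ in range(300))]

train, val, test = three_way_split(data)

# Tune the free parameter k on the validation set only.
candidates = (1, 3, 5, 9, 15)
best_k = min(candidates, key=lambda k: error_rate(train, val, k))

# Touch the test set once, for the final assessment of the selected model.
test_error = error_rate(train, test, best_k)
```

The validation error of `best_k` is not reported as the final figure because that set took part in choosing k. Only the single pass over `test` gives an estimate that the selection step has not influenced.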
Why separate test and validation sets?
- The error rate estimate of the final model on validation data will be biased
  - Smaller than the true error rate
  - The validation set is used to select the final model

Procedure
1. Divide the available data into training, validation and test sets
2. Select architecture and training parameters
3. Train the model using the training set
4. Evaluate the model using the validation set
5. Repeat steps 2 through 4 using different architectures and training parameters
6. Select the best model and train it using data from the training and validation sets
7. Assess this final model using the test set

Term Project Deliverable: Proposal

Term Project Proposal
Contents:
1. Title of your project
2. Problem formulation
3. Your strategy to solve the problem
4. Functions targeted by your software
5. Plan for testing
6. Evaluation method
7. Project timeline (weekly plan)
8. Bibliography

1. Title
- The title should be concise and self-descriptive

2. Problem formulation
- The proposal should clearly identify the problem
  - It should include at least one or two carefully crafted paragraphs that state and highlight the problem
- The problem formulation should be able to answer the following questions:
  - What is the problem you are solving?
    - This should also include the background for the problem
  - Why is it interesting as a Big Data problem, and who would use it if it were solved?

3. Your strategy to solve the problem
- Describe your proposed approach to solve the problem
- The description of the strategy should include:
  - The algorithms/techniques/models you plan to use in this project
  - The framework you plan to use in this project
  - The dataset you plan to use in this project
- Please note that you are also required to produce software as the final output of this project
- You are NOT ALLOWED to reuse or extend your projects from any other courses (even the 400-level Big Data course)
4. Functions targeted by your software
- Your proposal should include a software design to provide a more specific view of your project
  - A simple description of the major functions should be enough for this section
  - As your development proceeds, this may be updated to reflect changes to these functions, including additions, modifications, and removals
- What functions does your software provide to your users?
- What will be the input and output of each function?

5. Plan for testing
- Your software should be tested before you provide the final results and presentation
- What is your plan for testing your software?
  - What will be your test data?
  - What will be your testing scenario?
  - How will you collect your test data?
  - How will you deploy your software?
- This is different from the evaluation of your project
  - Functional testing
  - E.g., create a test file (xyz% of your dataset) using random sampling and test the software with it

6. Evaluation method
- The proposal should include an evaluation plan, including the metrics that you will use to determine whether you have succeeded
- If you come up with a metric, also provide an intuitive feel for what this metric captures and why you think it is appropriate
- For example, if your project involves classification, you can list the accuracy measures that will be used and provide justification
  - You should also state the target accuracy for your project

7. Project timeline (weekly plan)
- You should provide a table with a weekly plan to complete the term project
- If you have a teammate, the plan should also include information about the respective roles

8. Bibliography
- All references must be cited in the report
  - The authors' names
  - The titles of the works
  - The names of the publishers
  - The date (or year) the copies were published
  - The page numbers of your sources (if available)

Submission
- Please submit only one copy per team
- This document should be 1,200 ~ 1,800 words
  - Do not exceed the limit