Data Mining III: Numeric Estimation



Computer Science 105, Boston University
David G. Sullivan, Ph.D.

Review: Numeric Estimation
Numeric estimation is like classification learning:
- it involves learning a model that works like this: input attributes -> model -> output attribute/class
- the model is learned from a set of training examples that include the output attribute
In numeric estimation, the output attribute is numeric, and we want to be able to estimate its value.

Example Problem: CPU Performance
We want to predict how well a CPU will perform on some task, given the following information about the CPU and the task:
- CTIME: the processor's cycle time (in nanoseconds)
- MMIN: minimum amount of main memory used (in KB)
- MMAX: maximum amount of main memory used (in KB)
- CACHE: cache size (in KB)
- CHMIN: minimum number of CPU channels used
- CHMAX: maximum number of CPU channels used
We need a model that will estimate a performance score (PERF) for a given combination of values for these attributes.

Example Problem: CPU Performance (cont.)
The data was originally published in a 1987 article in Communications of the ACM by Phillip Ein-Dor and Jacob Feldmesser of Tel-Aviv University. There are 209 training examples. Here are five of them (CTIME through CHMAX are the input attributes; PERF is the class/output attribute):

    CTIME   MMIN   MMAX  CACHE  CHMIN  CHMAX  PERF
      125    256   6000    256     16    128   198
       29   8000  32000     32      8     32   269
       29   8000  32000     32      8     32   172
      125   2000   8000      0      2     14    52
      480    512   8000     32      0      0    67

Linear Regression
The classic approach to numeric estimation is linear regression. It produces a model that is a linear function (i.e., a weighted sum) of the input attributes. For example, for the CPU data:

    PERF = 0.066*CTIME + 0.0143*MMIN + 0.0066*MMAX
           + 0.4945*CACHE - 0.1723*CHMIN + 1.2012*CHMAX - 66.48

This type of model is known as a regression equation. The general format of a linear regression equation is:

    y = w1*x1 + w2*x2 + ... + wn*xn + c

where:
- y is the output attribute
- x1, ..., xn are the input attributes
- w1, ..., wn are numeric weights
- c is an additional numeric constant
Linear regression learns the values of the weights and the constant.

Linear Regression (cont.)
Once the regression equation is learned, it can estimate the output attribute for previously unseen instances. For example, to estimate CPU performance for the instance

    CTIME  MMIN  MMAX  CACHE  CHMIN  CHMAX  PERF
      480  1000  4000      0      0      0     ?

we plug the attribute values into the regression equation:

    PERF = 0.066*480 + 0.0143*1000 + 0.0066*4000
           + 0.4945*0 - 0.1723*0 + 1.2012*0 - 66.48
         = 5.9
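Plugging an instance into the regression equation is just a weighted sum, which can be sketched in plain Python (this sketch is not part of the original slides; the coefficients are the ones given above):

```python
# Coefficients of the regression equation learned for the CPU data.
weights = {
    "CTIME": 0.066, "MMIN": 0.0143, "MMAX": 0.0066,
    "CACHE": 0.4945, "CHMIN": -0.1723, "CHMAX": 1.2012,
}
constant = -66.48

def estimate_perf(instance):
    """Weighted sum of the input attribute values, plus the constant."""
    return sum(weights[a] * instance[a] for a in weights) + constant

# The previously unseen instance from the slide above:
instance = {"CTIME": 480, "MMIN": 1000, "MMAX": 4000,
            "CACHE": 0, "CHMIN": 0, "CHMAX": 0}
print(round(estimate_perf(instance), 1))  # 5.9
```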

Linear Regression with One Input Attribute
Linear regression is easier to understand when there's only one input attribute, x1. In that case:
- the training examples are ordered pairs of the form (x1, y), shown as points in the graph
- the regression equation has the form y = w1*x1 + c, shown as a line in the graph
- w1 is the slope of the line; c is the y-intercept
Linear regression finds the line that "best fits" the training examples.
(Figure: training examples as points and the fitted line y = w1*x1 + c, which crosses the y-axis at c.)

Linear Regression with One Input Attribute (cont.)
(Figure: the same plot, with dotted vertical bars between each point and the line.)
The dotted vertical bars show the differences between:
- the actual y values (the ones from the training examples)
- the estimated y values (the ones given by the equation)
Why do these differences exist?
Linear regression finds the parameter values (w1 and c) that minimize the sum of the squares of these differences.
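For one input attribute, the least-squares values of w1 and c have a well-known closed form. A minimal sketch (not from the slides; `fit_line` is a name chosen for illustration):

```python
# Closed-form least squares for one input attribute:
# w1 = covariance(x, y) / variance(x),  c = mean(y) - w1 * mean(x).
def fit_line(points):
    """points: list of (x1, y) training examples; returns (w1, c)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in points)
          / sum((x - mean_x) ** 2 for x, _ in points))
    c = mean_y - w1 * mean_x
    return w1, c

# Points that lie exactly on y = 2x + 1 recover slope 2 and intercept 1:
w1, c = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(w1, c)  # 2.0 1.0
```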

Linear Regression with Multiple Input Attributes
When there are n input attributes, linear regression finds the equation of a hyperplane (the higher-dimensional analogue of a line) in n+1 dimensions. Here again, it is the one that "best fits" the training examples. The equation has the form we mentioned earlier:

    y = w1*x1 + w2*x2 + ... + wn*xn + c

Here again, linear regression finds the parameter values (the weights w1, ..., wn and the constant c) that minimize the sum of the squares of the differences between the actual and predicted y values.

Linear Regression in Weka
Use the Classify tab in the Weka Explorer. Click the Choose button to change the algorithm; linear regression is in the folder labelled "functions". By default, Weka employs attribute selection, which means it may not include all of the attributes in the regression equation.
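The multi-attribute least-squares fit can be computed directly via the normal equations. Below is a pure-Python sketch of that computation (an illustration of what linear regression minimizes, not of Weka's implementation; `fit_linear` is a name chosen here):

```python
# Least squares for multiple input attributes via the normal equations
# (X^T X) w = X^T y, solved with Gaussian elimination with partial pivoting.
def fit_linear(X, y):
    """X: list of input-attribute rows; y: output values.
    Returns [w1, ..., wn, c]."""
    rows = [list(r) + [1.0] for r in X]   # extra 1 column for the constant c
    m = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(m)]
    for i in range(m):                    # forward elimination
        p = max(range(i, m), key=lambda k: abs(A[k][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for k in range(i + 1, m):
            f = A[k][i] / A[i][i]
            for j in range(i, m):
                A[k][j] -= f * A[i][j]
            b[k] -= f * b[i]
    w = [0.0] * m                         # back substitution
    for i in reversed(range(m)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, m))) / A[i][i]
    return w

# Training data generated from y = 2*x1 + 3*x2 + 1:
X = [(1, 1), (2, 0), (0, 2), (3, 1), (1, 3)]
y = [2 * a + 3 * b + 1 for a, b in X]
print(fit_linear(X, y))  # approximately [2.0, 3.0, 1.0]
```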

Linear Regression in Weka (cont.)
On the CPU dataset with M5 attribute selection, Weka learns the following equation:

    PERF = 0.0661*CTIME + 0.0142*MMIN + 0.0066*MMAX
           + 0.4871*CACHE + 1.1868*CHMAX - 66.60

It does not include the CHMIN attribute. To eliminate attribute selection, you can click on the name of the algorithm and change the attributeSelectionMethod parameter to "No attribute selection". Doing so produces our earlier equation:

    PERF = 0.066*CTIME + 0.0143*MMIN + 0.0066*MMAX
           + 0.4945*CACHE - 0.1723*CHMIN + 1.2012*CHMAX - 66.48

Notes about the coefficients: what do the signs of the coefficients mean? What about their magnitudes?

Evaluating a Regression Equation
To evaluate the goodness of a regression equation, we again set aside some of the examples for testing:
- do not use these examples when learning the equation
- use the equation on the test examples and see how well it does
Weka provides a variety of error measures, which are based on the differences between the actual and estimated y values; we want to minimize them. The correlation coefficient measures the degree of correlation between the actual and estimated values of the output attribute. Its absolute value is between 0.0 and 1.0, and we want to maximize its absolute value.
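The kinds of measures described above can be sketched in plain Python (these are standard definitions of mean absolute error, root mean squared error, and the Pearson correlation coefficient; the test values are made up for illustration, not taken from the CPU dataset):

```python
import math

def mae(actual, est):
    """Mean absolute error between actual and estimated values."""
    return sum(abs(a - e) for a, e in zip(actual, est)) / len(actual)

def rmse(actual, est):
    """Root mean squared error between actual and estimated values."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, est)) / len(actual))

def correlation(actual, est):
    """Pearson correlation coefficient between actual and estimated values."""
    n = len(actual)
    ma, me = sum(actual) / n, sum(est) / n
    cov = sum((a - ma) * (e - me) for a, e in zip(actual, est))
    return cov / math.sqrt(sum((a - ma) ** 2 for a in actual)
                           * sum((e - me) ** 2 for e in est))

actual = [198, 269, 172, 52, 67]   # hypothetical test-set values
est    = [190, 250, 180, 60, 70]   # hypothetical model estimates
print(round(mae(actual, est), 1), round(correlation(actual, est), 3))  # 9.2 0.997
```

A good model drives the error measures toward 0 and the correlation coefficient's absolute value toward 1.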

Simple Linear Regression
The simple linear regression algorithm in Weka creates a regression equation that uses only one of the input attributes, even when there are multiple inputs. Like 1R, simple linear regression can serve as a baseline: compare the models produced by more complex algorithms to the model it produces. It also gives insight into which of the inputs has the largest impact on the output.

Handling Non-Numeric Input Attributes
We employ numeric estimation when the output attribute is numeric. Some algorithms for numeric estimation also require that the input attributes be numeric. If we have a non-numeric input attribute, it may be possible to convert it to a numeric one. For example, if we have a binary attribute (yes/no or true/false), we can convert the two values to 0 and 1. In Weka, many algorithms (including linear regression) will automatically adapt to non-numeric inputs.
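The baseline idea from the Simple Linear Regression slide can be sketched as follows: fit a one-attribute line for each input, and keep the attribute whose line has the smallest sum of squared errors on the training data (a sketch of the general idea, not of Weka's exact implementation; the data below is made up):

```python
# Fit y = w*x + c for a single attribute via closed-form least squares.
def fit_one(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

def simple_linear_regression(X, y, names):
    """Try each input attribute alone; keep the one with the lowest SSE."""
    best = None
    for i, name in enumerate(names):
        xs = [row[i] for row in X]
        w, c = fit_one(xs, y)
        sse = sum((yv - (w * x + c)) ** 2 for x, yv in zip(xs, y))
        if best is None or sse < best[0]:
            best = (sse, name, w, c)
    return best[1:]  # (attribute name, weight, constant)

# y depends strongly on the first attribute, weakly on the second:
X = [(1, 5), (2, 1), (3, 4), (4, 2), (5, 9)]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(simple_linear_regression(X, y, ["x1", "x2"])[0])  # x1
```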

Handling Non-Numeric Input Attributes (cont.)
There are algorithms for numeric estimation that are specifically designed to handle both numeric and nominal attributes.
- One option: build a decision tree and have each classification be a numeric value that is the average of the output values of the training examples in that subgroup. The result is called a regression tree.
- Another option: have a separate regression equation for each classification in the tree, based on the training examples in that subgroup. This is called a model tree.

Regression Tree for the CPU Dataset
(Figure: a regression tree learned from the CPU dataset.)
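The core regression-tree idea (split so that the squared error is minimized, and let each leaf predict the average of its training examples) can be shown with a toy single-split sketch (not from the slides, and far simpler than a real tree learner such as M5P):

```python
# Total squared deviation of a group of output values from their mean.
def sse(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """One split on one numeric attribute, minimizing total SSE.
    Returns (threshold, left-leaf average, right-leaf average)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Two clearly separated groups of output values:
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.8, 20.0, 19.8, 20.2]
print(best_split(xs, ys))  # splits at x <= 3; leaves predict about 5.0 and 20.0
```

A real regression tree applies this splitting recursively over all attributes; a model tree replaces each leaf's average with a regression equation fit to that leaf's examples.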

Regression and Model Trees in Weka
To build a tree for estimation in Weka, select the M5P algorithm in the folder labelled "trees". By default, it builds model trees; you can click on the name of the algorithm and tell it to build regression trees instead.