An Overview and Evaluation of Decision Tree Methodology

1 An Overview and Evaluation of Decision Tree Methodology
ASA Quality and Productivity Conference
Terri Moore, Motorola, Austin, TX
Carole Jesse, Cargill, Inc., Wayzata, MN
Richard Kittler, Yield Dynamics, Inc., Santa Clara, CA

2 Outline
Background
- What are decision trees?
- How are decision trees constructed?
- What are their goals and how are they assessed?
- Pros and Cons over Traditional methods
Software Section
- Data Set Specifics used for the Evaluation
- SAS Enterprise Miner v 4.0 (NT) [sas.com]
- S-PLUS 2000 Professional Release 2 (NT) [insightful.com]
- Yield Dynamics Yield Mine v 2.0 (NT) [yielddynamics.com]
- A comparison of results and features

3 What are Decision Trees?
A flow chart or diagram representing a classification system or a predictive model.
- The tree is structured as a sequence of simple questions. The answers to these questions trace a path down the tree.
- The end product is a collection of hierarchical rules that segment the data into groups, where a decision (classification or prediction) is made for each group.
- The hierarchy is called a tree, and each segment is called a node.
- The original segment contains the entire data set, referred to as the root node of the tree.
- A node with all of its successors forms a branch of the node that created it.
- The final nodes (terminal nodes) are called leaves. For each leaf, a decision is made and applied to all observations in the leaf.
Classification Tree: Categorical Response / Target
Regression Tree: Continuous Response / Target
Predictor variables => Inputs

4 Example: Classification Tree
Root Node: All Data
X1 < 2.8?
  Yes: X2 < -10.0?
    Yes: Leaf 1, class MAX (100% correctly classified)
    No: Leaf 2, class MED (74% correctly classified)
  No: X3 < 0.003?
    Yes: Leaf 3, class MED (69% correctly classified)
    No: Leaf 4, class MIN (100% correctly classified)
Tree Assessment (Proportion Correctly Classified): Overall: 82%
Rules (terminal nodes with classifications):
- if X1 < 2.8 and X2 < -10 then class = MAX
- if (X1 < 2.8 and X2 >= -10) or (X1 >= 2.8 and X3 < 0.003) then class = MED
- if X1 >= 2.8 and X3 >= 0.003 then class = MIN
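The rules on this slide can be read as a small function. The Python sketch below is an illustration only: the name `classify` and its argument names are assumptions, and an observation that falls exactly on a threshold is sent down the "No" branch, as in the tree diagram.

```python
def classify(x1, x2, x3):
    """Apply the example tree's rules; returns 'MAX', 'MED', or 'MIN'."""
    if x1 < 2.8:
        # "Yes" branch of the root: the next question is about X2.
        return "MAX" if x2 < -10.0 else "MED"   # Leaf 1 / Leaf 2
    # "No" branch of the root: the next question is about X3.
    return "MED" if x3 < 0.003 else "MIN"       # Leaf 3 / Leaf 4


print(classify(1.0, -12.0, 0.0))   # MAX
print(classify(1.0, 5.0, 0.0))     # MED
print(classify(3.5, 0.0, 0.001))   # MED
print(classify(3.5, 0.0, 0.01))    # MIN
```

Each path from the root to a leaf corresponds to one conjunction in the rule list; the two MED leaves are why the MED rule is a disjunction of two paths.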

5 How are Decision Trees Constructed?
Basic Elements in Tree Construction:
1. Splitting each node in a tree
   - How to select each split so that descendent subsets are purer than the data in the parent subset?
2. Deciding when a tree is complete (two approaches)
   - How large to grow the tree?
   - What nodes to prune off the tree?
3. Assigning classes (or predicted values) to terminal nodes
   - Which model to use in the leaves?

6 Basic Elements in Tree Construction
1. Splitting each node in a tree (two steps):
Generate a set of candidate splitting rules
- Look at all possible splits for all variables included in the analysis
- Example: a data set with 215 observations and 19 possible input variables would consider up to (215)(19) = 4085 possible splits
Choose the best split from among the candidate set
- Rank order each splitting rule on the basis of some quality-of-split criterion (purity function). The most frequently used ones are:
  - Entropy reduction (nominal / binary targets)
  - Gini-index (nominal / binary targets)
  - Chi-square tests (nominal / binary targets)
  - F-test (interval targets)
  - Variance reduction (interval targets)
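The two steps above (generate candidate splits, then rank them by a purity function) can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm of any of the evaluated packages. The function names and the toy data are assumptions, and candidate thresholds are taken as midpoints between distinct sorted values, so each input contributes at most n - 1 candidates, not n.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index of a set of class labels (0 = pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def entropy(labels):
    """Shannon entropy, in bits, of a set of class labels (0 = pure node)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0


def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of one input."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_split(rows, labels, impurity=gini):
    """Return (impurity reduction, input index, threshold) of the best split."""
    n = len(labels)
    parent = impurity(labels)
    best = None
    for j in range(len(rows[0])):                      # every input variable
        for t in candidate_thresholds([r[j] for r in rows]):
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            children = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            gain = parent - children                   # reduction in impurity
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best


# Hypothetical toy data: two inputs, four observations.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["HI", "HI", "LOW", "LOW"]
print(best_split(rows, labels))            # (0.5, 0, 2.5)
print(best_split(rows, labels, entropy))   # (1.0, 0, 2.5)
```

In this toy case both criteria choose the same split (input 0 at 2.5), which separates the classes perfectly; on real data the criteria can rank candidate splits differently.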

7 Basic Elements in Tree Construction
2. Deciding when a tree is complete (two approaches):
Continue splitting nodes until some goodness-of-split criterion fails to be met.
- When the quality of a particular split falls below the threshold, the tree is not grown further along that branch.
- When all branches from the root reach terminal nodes, the tree is complete.
Grow the tree too large and then prune the nodes off.
- After tree construction stops, create a sequence of subtrees from the original tree.
- Choose one subtree for each possible number of leaves (the subtree chosen with p leaves has the best assessment value of all candidate subtrees with p leaves).
- Once the sequence of subtrees is established, select which subtree to use according to some criterion.
  - best assessment value
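The grow-then-prune approach can be sketched as a selection over candidate subtrees. The Python below is a simplified illustration. It assumes each candidate subtree has already been summarised as a (number of leaves, assessment value) pair, that a higher assessment is better, and that ties are broken in favour of fewer leaves. The function names and numbers are hypothetical.

```python
def best_subtree_per_size(candidates):
    """Keep, for each leaf count p, the best assessment among subtrees with p leaves."""
    best = {}
    for n_leaves, assessment in candidates:
        if n_leaves not in best or assessment > best[n_leaves]:
            best[n_leaves] = assessment
    return sorted(best.items())


def select_subtree(sequence):
    """Pick the leaf count with the best assessment; prefer the smaller tree on ties."""
    top = max(assessment for _, assessment in sequence)
    return min(n_leaves for n_leaves, assessment in sequence if assessment == top)


# Hypothetical candidates: (leaves, proportion correctly classified on validation data).
candidates = [(1, 0.43), (2, 0.55), (2, 0.51), (3, 0.61), (3, 0.58), (4, 0.61), (5, 0.59)]
sequence = best_subtree_per_size(candidates)
print(sequence)                   # [(1, 0.43), (2, 0.55), (3, 0.61), (4, 0.61), (5, 0.59)]
print(select_subtree(sequence))   # 3
```

Assessing the sequence on validation data and not on training data matters here, because on training data the largest tree would almost always look best.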

8 Basic Elements in Tree Construction
3. Assigning classes (or predictions) to terminal nodes:
The plurality rule is one criterion:
- For a classification tree, the group with the greatest representation determines the class assignment.
- For a regression tree, the conditional mean within the node determines the predicted value.
Other criteria modify simple plurality to account for:
- costs of making a mistake in classification.
- over- or under-sampling from certain classes.
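The leaf-assignment rules can be illustrated with a short Python sketch. The function names, the cost-matrix layout `cost[actual][decision]`, and the numbers are assumptions made for the example; the evaluated packages expose these ideas through their own target-profile or cost settings.

```python
from collections import Counter


def plurality_class(labels):
    """Classification leaf: the most frequent class among the leaf's observations."""
    return Counter(labels).most_common(1)[0][0]


def node_mean(values):
    """Regression leaf: the mean response of the leaf's observations."""
    return sum(values) / len(values)


def min_cost_class(labels, cost):
    """Cost-adjusted leaf: the decision with the lowest total misclassification cost.

    cost[actual][decision] is the cost of deciding `decision` when the truth is `actual`.
    """
    counts = Counter(labels)

    def total_cost(decision):
        return sum(n * cost[actual][decision] for actual, n in counts.items())

    return min(sorted(cost), key=total_cost)


# Hypothetical leaf with 6 HI lots and 4 LOW lots.
leaf = ["HI"] * 6 + ["LOW"] * 4
# Hypothetical costs: calling a LOW lot HI is five times worse than the reverse.
cost = {"HI": {"HI": 0, "LOW": 1}, "LOW": {"HI": 5, "LOW": 0}}

print(plurality_class(leaf))        # HI
print(min_cost_class(leaf, cost))   # LOW
print(node_mean([0.5, 0.75, 1.0]))  # 0.75
```

With these costs, deciding HI incurs 4 x 5 = 20 and deciding LOW incurs 6 x 1 = 6, so the cost-adjusted rule overrides the plurality vote.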

9 What are their goals and how are they assessed?
Goal of Decision Trees: To obtain the most accurate prediction possible.
Since these models are not used in isolation but in conjunction with subject matter experts, the most desirable trees are:
1. Accurate (low generalization error rates)
2. Parsimonious (representing and generalizing the relationships succinctly)
3. Non-trivial (producing interesting results)
4. Feasible (time and resources)
5. Transparent and interpretable (providing high level representations of and insights into the data relationships, regularities, or trends)
Tree Assessment: Estimation of the error or misclassification rate is derived from both a Training data set (used to train the model) and an independent Test / Validation data set (used for testing the model).
- Resubstitution Estimate (internal estimate, biased)
- Test Sample Estimation (independent estimate)
- V-fold and N-fold Cross-Validation (resampling techniques)
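The difference between the resubstitution estimate and a cross-validated estimate can be seen in a small Python sketch. Everything here is an assumption made for illustration: the data are made up, and the learner is a stand-in 1-nearest-neighbour rule (which, like a fully grown tree, memorises its training data) so that the example stays short.

```python
def misclassification_rate(actual, predicted):
    """Proportion of observations whose predicted class differs from the actual class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def v_fold_indices(n, v):
    """Partition indices 0..n-1 into v folds; v == n gives N-fold (leave-one-out)."""
    return [list(range(start, n, v)) for start in range(v)]


def fit_nearest_neighbour(xs, ys):
    """Stand-in learner: predict the class of the closest training observation."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def resubstitution_error(xs, ys, fit):
    """Train and test on the same data (internal, optimistically biased estimate)."""
    model = fit(xs, ys)
    return misclassification_rate(ys, [model(x) for x in xs])


def cross_validated_error(xs, ys, fit, v):
    """Hold out each fold in turn, train on the rest, and pool the held-out errors."""
    n = len(ys)
    errors = 0
    for fold in v_fold_indices(n, v):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += sum(model(xs[i]) != ys[i] for i in fold)
    return errors / n


# Hypothetical single-input data set.
xs = [1.0, 1.8, 3.0, 3.9, 5.5, 6.0]
ys = ["HI", "HI", "LOW", "HI", "LOW", "LOW"]

resub_error = resubstitution_error(xs, ys, fit_nearest_neighbour)
loo_error = cross_validated_error(xs, ys, fit_nearest_neighbour, v=len(ys))
print(resub_error)  # 0.0
print(loo_error)    # 0.3333333333333333
```

The resubstitution estimate is zero because every observation is its own nearest neighbour, while the held-out estimate shows that two of the six observations would be misclassified by a model that had not seen them.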

10 Pros and Cons Over Traditional Methods
Pros:
- Ease of interpretation and graphical in nature
- Nonparametric based (does not require specification of functional form)
- Allow more general interactions and non-additive behavior among inputs
- Ability to learn step functions (discontinuities)
- Easier to interpret when predictors are both continuous and categorical
- Robust to the effects of outliers (for both input and target variables)
- Invariant to monotone transformations of input variables
- Can process inputs w/missing values (surrogate splits / alternative splits)
Cons:
- Can become overly complex
- Based on univariate splits
- Require many branches to approximate linear functions accurately (unlike regression or neural nets)
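The invariance to monotone transformations follows from the fact that a univariate split depends only on the ordering of an input's values. The Python sketch below illustrates this with a Gini-based split on one input; the function names and data are hypothetical, and a log transform stands in for any strictly increasing transformation.

```python
from collections import Counter
from math import log


def gini(labels):
    """Gini index of a set of class labels (0 = pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_partition(values, labels):
    """Return the set of row indices sent to the left child by the best Gini split."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    best_score, best_left = None, None
    for k in range(1, n):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / n
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical skewed input and class labels.
x = [0.5, 2.0, 8.0, 40.0, 100.0]
y = ["LOW", "LOW", "HI", "HI", "HI"]

raw = best_partition(x, y)
logged = best_partition([log(v) for v in x], y)
print(sorted(raw), sorted(logged))  # [0, 1] [0, 1]
```

The threshold value changes under the transformation, but the observations on each side of the split, and therefore the fitted tree's structure and predictions on these data, do not.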

11 Software: Data Set Description
Target Response Variable (YCLASS):
- Lot level yield classification lots
- Categorical response: HI, MED, LOW
Input (predictor) variables:
- 70 Electrical test parameters (e.g. P_221342)
- Lot level median values
- All continuous
Questions:
- Which combination of parameters is best at predicting lot level yield?
- What is the process space of this best combination?
- How good is this prediction?

12 SAS Enterprise Miner: GUI Interface
[Screenshot callouts: Project Diagram, Node Types Window, EM Workspace Window, Available Projects Window]
- Process Flow Diagrams (PFD) are built in the workspace.
- These are specific to a particular data mining project.
- This is the flow built for this decision tree analysis.

13 Enterprise Miner: Components of the Flow Diagram
Input Data Source:
- Define data set to be analyzed
Data Set Attributes:
- Define target and input variables
- Define target profile (priors and/or cost/profit functions)
Data partition:
- Define and create training, validation, and test data sets
- Define sampling method used for partitioning
Tree:
- Specify splitting criteria
- Specify tree characteristics (depth, leaf size, branches, etc.)
- Specify model assessment criteria
- Run decision tree analysis, view results
Assessment:
- Additional model assessment measures
- Used to compare models from different analysis types
Reporter:
- Journaling function
- Results and documentation from each node stored in html format with links to data sets, results, graphics, etc.

14 Enterprise Miner: Summary of Decision Tree Analysis Results
[Screenshot panels: Summary Table, Tree Ring, Assessment Table, Assessment Plot, Default Model Fitting Criteria]
Note: YCLASS / YCLASS2: HI = 2, MED = 1, LOW = 0
Callout: This criterion set to 5

15 Enterprise Miner: Summary of Decision Tree Analysis Results
For categorical targets: the selected subtree has the best classification rate for the validation data and the fewest leaves.
Color denotes node purity: Mix (Yellow), Pure (Red)
[Screenshot callouts: proportion correctly classified table; Leaf 3, Leaf 4, Leaf 5; Training Data Statistics; Validation Data Statistics; leaf statistics summary table]

16 Enterprise Miner: Summary of Decision Tree Analysis Results
- Performance Matrix for Training and Validation: Actual vs. Predicted class values
- Diagnostic Charts for Validation Data
- Predictor rankings with surrogate rules

17 Enterprise Miner: Analysis Summary from Reporter Node

18 Enterprise Miner: Summary of Features
Pros:
- No scripting required
- Easy definition of input and target variables
- Easy definition of target profiles (use of priors and cost matrices)
- Option for automatic training and validation data set creation
- Supports both automatic and interactive training of the model
- Has various types of tree and model assessment graphics
- Automatic scripting option available
- Automatic updating of results for different trees
- Generation of surrogate rules
- Calculates overall model assessment values as well as leaf specific assessment values
- Journaling option available in web / html format (Reporter Node)
Cons:
- Available only as part of SAS v.8 or higher
- Requires some knowledge of data acquisition and manipulation in SAS
- Classification Chart was incomplete for the predicted levels
- Does not label final classification in the tree
- Could not figure out how to get level of importance of predictors in the model

19 S-PLUS Tree Model - Model and Results Tabs
Notes:
- "Omit Rows with Missing Values" does not appear to be optional. 358 complete records present in 372.
- Unable to find TrainMod.

20 S-PLUS Tree Model - Plot and Prune/Shrink Tabs
Notes:
- Can specify Validation data.
- Unable to find ValidateRes.

21 S-PLUS Tree Model - Predict Tab
Note: Saved predicted values not easily merged with actual values.

22 S-PLUS: Report Window

23 S-PLUS: Graphical Tree Window
[Figure panels: Uniformly Sized; Proportional to Node Deviance]

24 S-PLUS: Relationship of Report to Tree
1) root MED ( )
  2) P < HI ( )
    4) P < HI ( )
      8) P < LOW ( ) *
      9) P > HI ( )
        18) P < HI ( ) *
        19) P > MED ( ) *
    5) P > MED ( ) *
  3) P > MED ( )
    6) P < MED ( ) *
    7) P > LOW ( ) *
Report callouts:
- Rule for branch
- Nobs
- Branch Class
- Deviance (homogeneity)
- HI's, LOW's, MED's in branch
- Indention reflects branch levels
- * = terminal node
[Tree graphic callouts: nodes 2), 3), 7)]

25 S-PLUS: Model Assessment Graphics
[Figure panels: Pruning Method = misclass; Pruning Method = deviance]
Note: This graphic is based on no restriction on final tree size (number of terminal nodes). We believe the assessment is Training (or Validate?).

26 S-PLUS: Summary of Features
Pros:
- No scripting required
- All relevant summary information succinctly provided in Report window
- Basic tree graphic with option for branch lengths proportional to split significance
- Model Assessment Graphics based on misclass or deviance
- Option for specifying validation data simultaneously to running training data
- Seems to have ability to adjust for cost of misclassification (but not intuitive)
- Calculates overall model assessment value
- File Import Data feature is extensive, most file types recognized
Cons:
- No dynamic linking between Report and Graphics (user intensive interpretation)
- Journaling option for Report Window only
- Automatic scripting option not obvious
- Generation of surrogate rules not obvious
- Unable to find output/results for Validation data
- Unable to find saved Model Object
- Only uses subset of the data with complete records for the independent variables
- Predicted values not easily merged with the data (Training or Validation)

27 Setting up Yield Mine for YCLASS
[Figure (Genesis Yield Mine): Histogram of YIELD_1 partitioned by YCLASS (HI, LOW, MED) with normal density overlay. Count: 372. Norm: No at signif. 5.00E-02 (Chi-Sq test on 6 df).]

28 Yield Mine Results for YCLASS
[Screenshot: Genesis Yield Mine tree display]
Interactive options available at each node (Datatype Dependent)

29 Graphical representation of topmost rule
[Figure (Genesis Yield Mine): Histogram of variable P_ partitioned by YCLASS (HI, LOW, MED). Count: 366. Norm: Yes at signif. 5.00E-02 (Chi-Sq test on 7 df).]
[Figure: Box-Whisker chart of P_ by YCLASS, showing the P_ variable selected as most significant. Counts: HI = 124, LOW = 82, MED = 160.]

30 Genesis Yield Mine: Option to override variable selection by double-clicking on a node

31 Yield Mine: Summary of Features
Pros:
- Easily set up and run by novice user
- Integrated within a full interactive data analysis environment
- No scripting required
- Can be run interactively or in batch mode
- Tree display is an interactive graphic
- Supports linear models at nodes
- Ability to work with missing data
- Options to select a surrogate variable by double-clicking on a node
- Options to view explanatory graphics or run linear models by double-clicking on node
- Options to create a subset of all observations used at a node
- Ability to save a model and apply to classify future datasets
- Ability to control depth of tree from setup panel
- Some ability to look ahead to the next layer and overcome greediness
- Tuned for use in semiconductor manufacturing
Cons:
- Few adjustable parameters available for access by a statistician
- No option for automatic training set creation and use
- No option to specify the minimum number of observations in a terminal node
- Data repair and outlier screening routines must be run manually beforehand
- Available only as part of the Genesis Yield Management software system

32 A Comparison of Models
Columns: Enterprise Miner, Yield Mine, S-PLUS
Final # of predictors
Final # of leaves
Overall Assessment 0.56 / /? 0.61 /?
Combination of parameters for predicting lot level yield class: P P P P P P P P P P
Note: P and P highly correlated; P and P highly correlated

33 A Comparison of Process Spaces
[Figure panels: Enterprise Miner Process Space; Yield Mine Process Space; S-PLUS Process Space. Regions labeled LOW, MED, HI.]

34 Summary of Features
                                             Enterprise Miner   S-PLUS      Yield Mine
General
  Tree Graphics                              excellent          excellent   excellent
  Model Assessment graphics                  excellent          good        good
  Journaling options                         excellent          fair        excellent
  GUI Interface (user friendliness)          excellent          excellent   excellent
  On Line Help                               good               good        good
  Good balance btw ease of use and
    knowledge of methodology                 excellent          excellent   excellent
  Scripting options                          yes                yes         yes
  'on-the-fly' graphics                      fair               fair        excellent
  Licensing Cost                             \$\$\$\$\$\$             \$           \$\$\$\$\$\$
Specifics of Model
  Flexibility of model criteria              excellent          good        fair
  Ranking of Model predictors                fair               good        good
  Surrogate Rules                            yes                no?         yes
  Robustness to Missing Data                 good               poor        good

35 Summary
- Decision tree methods are now well proven as a form of data mining.
- The techniques provide an intuitive and efficient method for discovering relationships across a broad range of data sets.
- The software implementations in SAS Enterprise Miner, S-PLUS, and Genesis Yield Mine were tested on a sample data set from IC manufacturing and produced comparable results.
- The software evaluations were an initial look at the potential each product offers in decision tree methodology.
- The strengths and weaknesses of each package should be weighed according to the audience (engineer or statistician) and the problem domain (semiconductor or other).
- These methods can be a valuable tool in one's arsenal of data analysis tools, and we encourage you to consider their use in your practice.


!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

A fast, powerful data mining workbench designed for small to midsize organizations

FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : msiris@cityu.edu.hk 1 Aims To introduce the basic concepts of data mining

6 Classification and Regression Trees, 7 Bagging, and Boosting

hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification

Classification/Decision Trees (II)

Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

CART 6.0 Feature Matrix

CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window

Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous

Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

Data Mining. SPSS Clementine 12.0. 1. Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Data Mining SPSS 12.0 1. Overview Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Types of Models Interface Projects References Outline Introduction Introduction Three of the common data mining

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

2015 Workshops for Professors

SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

Clustering Via Decision Tree Construction

Clustering Via Decision Tree Construction Bing Liu 1, Yiyuan Xia 2, and Philip S. Yu 3 1 Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053.

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

Better credit models benefit us all

Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

Word Count: Body Text = 5,500 + 2,000 (4 Figures, 4 Tables) = 7,500 words

PRIORITIZING ACCESS MANAGEMENT IMPLEMENTATION By: Grant G. Schultz, Ph.D., P.E., PTOE Assistant Professor Department of Civil & Environmental Engineering Brigham Young University 368 Clyde Building Provo,

Data Mining for Knowledge Management. Classification

1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study

Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study Tongshan Chang The University of California Office of the President CAIR Conference in Pasadena 11/13/2008

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

A Property & Casualty Insurance Predictive Modeling Process in SAS

Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

Decision Trees What Are They?

Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

Interactive Data Mining and Design of Experiments: the JMP Partition and Custom Design Platforms

: the JMP Partition and Custom Design Platforms Marie Gaudard, Ph. D., Philip Ramsey, Ph. D., Mia Stephens, MS North Haven Group March 2006 Table of Contents Abstract... 1 1. Data Mining... 1 1.1. What

Decision-Tree Learning

Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values

DATA ANALYTICS USING R
