Model Validation Techniques
|
|
|
- Caitlin Cameron
- 10 years ago
- Views:
Transcription
1 Model Validation Techniques Kevin Mahoney, FCAS travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost for an account (by line-of-business, peril, etc.) Determine likelihood to defect for an account Determine effectiveness of advertising Identify ripe targets for cross-sell attempts Triage for further treatment (risk engineering, inspection, etc.) Identify claims that may be fraud Identify claims that need experienced adjusters Examples of Techniques Generalized Linear Models Generalized Additive Models Cox Regression Decision Trees Ensemble Methods Text Mining Spatial Models Mixed Models Neural Networks What Models Need to Be Validated? All models need to be validated However, unlike many other statistical diagnostics, THE SAME CONCEPTS APPLY REGARDLESS OF THE TYPE OF MODEL You can apply the concepts you learn today to any of the above, plus just about any other type of predictive model you may encounter 1
2 The Many Meanings of Model Validation Primary Meaning/Use Quantifying Model Performance How well can we expect this model to perform in the future The only objective test is unseen data Secondary Meanings/Uses Using Similar Procedures for Other Goals Looking at out-of-sample data during the modeling process to determine: the right choice of predictor variables [feature selection], and/or the right type of model, and/or The right value of a tuning parameter Important Caution The same out-of-sample data cannot serve both purposes above If used in feature selection, then it influenced the model So need additional out-of-sample data to quantify performance Training Test Hold-Out Important Caution Data terminology Data used in building the model (in-sample) are training data Out-of-sample data used in guiding the modeling process are test data Out-of-sample data against which the predictive power of the ultimately chosen model is tested are validation data or holdout data. It is sometimes important for this data to be outof-time as well. E.g., if you are modeling severities of homeowners losses, you don t want claims from the same storms in the training/test data and the validation data 2
3 Danger Scale High Data- Mining Type of Search Lookup Table Neural Network Adaptive methods: MARS, GAMS, splines Automated feature searches (stepwise) School stats Search by hand Low No search Gains Charts for Direct Marketing Model Gain % 20% 40% 60% 80% 100% Top % of Scores Response Rate 8% 6% 4% 2% 0% 0% 20% 40% 60% 80% 100% Top % of Scores Classic chart to diagnose a response model Use the model to score the validation data I.e., assign the modeled probability of a response SORT by the modeled probability from most likely to least likely Plot the actual response rate in the validation among the highest scoring x% Typically do NOT even plot the training data goal is purely to show objective performance Decile Charts Loss Ratio 90% 80% 70% 60% 50% 40% 30% 20% Training Validation Decile 3
4 Decile Charts Again, you sort by predicted value You show actual value The validation line is key This shows the actual predictive power of the model The discrepancy between the validation and actual lines is useful In modeling (using test rather than final validation data), to diagnose overfit In implementation: If implementing as a rating algorithm, discrepancies between the training line and validation line suggest shrinking extreme estimates Nothing magical about deciles: Use quintiles, vingtiles, whatever your data will support The Many Meanings of Model Validation Primary Meaning Quantifying Model Performance How well can we expect this model to perform in the future The only objective test is unseen data Secondary Meanings Using Similar Procedures for Other Goals Looking at out-of-sample data during the modeling process to determine: the right choice of predictor variables [feature selection], and/or the right type of model, and/or The right value of a tuning parameter Choice of Predictors Do NOT use validation data for this Just training and test Divide dataset into training and test data Check that predictors still show up as significant if you model the test data Or divide the training data into many pieces Say 5 pieces Model each 1/5 (or each 4/5) Only include predictors that were significant in at least 2 (or 3, or 4) of these 5 models 4
5 The Right-sized Model Why not use only seen data but penalize the goodness-of-fit measure for the number of parameters and/or degrees of freedom? The information criteria, AIC, BIC, etc., do this Limitations: The number of parameters may be the wrong basis for the penalty E.g., if using shrinkage techniques, like ridge regression, or credibility, or hierarchical or mixed models, the effective dfs may be much smaller than the number of parameters Even if you have a good way to compute the effective degrees of freedom, that doesn t penalize for the size of the search If you have 20 features, the best 8 feature model implies a search of 125,970 models; an 8 feature model implies a search of *1* model. The Right-sized Model Why not use significance tests to decide whether to include a variable: Yes, but... Raw significance tests also do nothing to adjust for the size of the search Tests that are directionally correct may not function correctly in absolute terms when modeling assumptions are violated (i.e., always) For example, widths of confidence intervals are very sensitive to the scale parameter in most GLMs But the scale parameter has to be estimated from the data and may not itself be very certain Cross-Validation Divide the data into N pieces N=5 or 10 typical; N=2 convenient if hurried 5
6 Cross-Validation Run the model on each 4/5 or each 9/10 This results in N models, each on a high percentage of the data Each datapoint has been left out in building exactly 1 model Compare each actual observed value to the value predicted by the model that didn t see it Use this to compute goodness-of-fit (squared error, misclassification rate, etc.) Use this to compare models of varying complexity Fewer or more predictors Different values of a tuning parameter (e.g., K in a Bayesian credibility setup) Cross-Validation In the data mining and machine learning community, often used to do the objective validation of model power Only works because the model-building process is entirely automated Each 9/10 model and the model on the entire dataset are built without knowledge of the other 10 models Not just the fitting of parameters is independent So is the choice of variables, indeed the entire process If the process was open to alternate feature-selection methods (e.g., CART or MARS) before looking at the first 9/10, technically even that decision must be remade 10x This is not possible when human beings are part of the modeling process A+Bx 2 or A+Bx 2 +Cx 4?
7 Parameter Estimates Estimate Std Dev p-value A+Bx 2 A B E-37 A+Bx 2 +Cx 4 A B E-11 C So you need an x 4 term, right? Hypothesis Test is OK? Or do you need an x 4 term? Do the errors look identically distributed? Or are the data heteroskedastic? So assumptions are violated that may severely impact the hypothesis tests Let s look at another comparison of the two models: 1/3 of the data vs models fit on other 2/ This is done 3 times, once for each 1/3 7
8 Cross-Validation to the Rescue Sum of Squared Errors A+Bx 2 A+Bx 2 +Cx 4 Full Model Fold Cross-Validation Mean Squared Error A+Bx 2 A+Bx 2 +Cx 4 Full Model Fold Cross-Validation Note the optimism of the full model (in-sample) errors Note that the 4 th power term is completely unnecessary Data generated from a quadratic (with added heteroskedastic errors) Simplified Version of Property-Casualty Insurance Data Split into two and added lines 8
9 Zooming on the lines Lesson of the above Insurance data often has a few observations that are outliers But we can t throw them out because they are the observations that matter most Therefore: Choose your model carefully (linear regression without transforming y should be an obviously bad idea with the above) Remember that you don t have as much information as the size of your dataset might indicate Remember that you can overcome optimism in classical confidence intervals using cross-validation Testing on Seen vs Unseen Data In-sample tests Must adjust for degrees of freedom Many tests oriented toward inferential power Tests sensitive to fussy statistical assumptions May need deep statistical knowledge to interpret Difficult to present results to management May require adjustments if observations are correlated Out-of-sample tests No need to adjust for degrees of freedom Tests typically oriented toward predictive power Tests purely empirical; only simple assumptions involved Have commonsense interpretations With modest effort, usually presentable In some cases, may need to be an out-of-time as well as out-ofsample test 9
10 Model Validation Today Model validation is a serious topic Regulators require some financial institutions to have a separate department that validates, for example, consumer creditworthiness models Should there be an actuarial standard of practice addressing validation of statistical models Topics such a standard might address When is out-of-time validation rather than just out-of-sample validation critical? What steps should be taken to ensure knowledge of the validation data has not crept into the model-building process? For instance, split off the validation data before or after EDA? Splitting it too early makes balancing to control-totals difficult Q&A 10
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort [email protected] Session Number: TBR14 Insurance has always been a data business The industry has successfully
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013
Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.
Penalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood
Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection
Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort [email protected] Motivation Location matters! Observed value at one location is
PREDICTIVE MODELLING FOR COMMERCIAL INSURANCE
PREDICTIVE MODELLING FOR COMMERCIAL INSURANCE General Insurance Pricing Seminar 13 June 2008 London James Guszcza, FCAS, MAAA [email protected] General Themes Predictive modelling: 3 Levels of Discussion
Introduction to Predictive Modeling Using GLMs
Introduction to Predictive Modeling Using GLMs Dan Tevet, FCAS, MAAA, Liberty Mutual Insurance Group Anand Khare, FCAS, MAAA, CPCU, Milliman 1 Antitrust Notice The Casualty Actuarial Society is committed
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Predictive Modeling and Big Data
Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Section 6: Model Selection, Logistic Regression and more...
Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building
Cross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Regression Modeling Strategies
Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]
Chapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
Evaluation & Validation: Credibility: Evaluating what has been learned
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
Recall this chart that showed how most of our course would be organized:
Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Elements of statistics (MATH0487-1)
Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -
How To Build A Predictive Model In Insurance
The Do s & Don ts of Building A Predictive Model in Insurance University of Minnesota November 9 th, 2012 Nathan Hubbell, FCAS Katy Micek, Ph.D. Agenda Travelers Broad Overview Actuarial & Analytics Career
Credibility and Pooling Applications to Group Life and Group Disability Insurance
Credibility and Pooling Applications to Group Life and Group Disability Insurance Presented by Paul L. Correia Consulting Actuary [email protected] (207) 771-1204 May 20, 2014 What I plan to cover
SOA 2013 Life & Annuity Symposium May 6-7, 2013. Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting
SOA 2013 Life & Annuity Symposium May 6-7, 2013 Session 30 PD, Predictive Modeling Applications for Life and Annuity Pricing and Underwriting Moderator: Barry D. Senensky, FSA, FCIA, MAAA Presenters: Jonathan
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
Introduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
PREDICTIVE MODELS IN LIFE INSURANCE
PREDICTIVE MODELS IN LIFE INSURANCE Date: 17 June 2010 Philip L. Adams, ASA, MAAA Agenda 1. Predictive Models Defined 2. Predictive models past and present 3. Actuarial perspective 4. Application to Life
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Predictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
Data Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Issues in Credit Scoring
Issues in Credit Scoring Model Development and Validation Dennis Glennon Risk Analysis Division Economics Department The Office of the Comptroller of the Currency The opinions expressed are those of the
An Overview and Evaluation of Decision Tree Methodology
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX [email protected] Carole Jesse Cargill, Inc. Wayzata, MN [email protected]
Insurance Fraud Detection: MARS versus Neural Networks?
Insurance Fraud Detection: MARS versus Neural Networks? Louise A Francis FCAS, MAAA [email protected] 1 Objectives Introduce a relatively new data mining method which can be used as an alternative
Modeling Lifetime Value in the Insurance Industry
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
How To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
Nonparametric statistics and model selection
Chapter 5 Nonparametric statistics and model selection In Chapter, we learned about the t-test and its variations. These were designed to compare sample means, and relied heavily on assumptions of normality.
Session 54 PD, Credibility and Pooling for Group Life and Disability Insurance Moderator: Paul Luis Correia, FSA, CERA, MAAA
Session 54 PD, Credibility and Pooling for Group Life and Disability Insurance Moderator: Paul Luis Correia, FSA, CERA, MAAA Presenters: Paul Luis Correia, FSA, CERA, MAAA Brian N. Dunham, FSA, MAAA Credibility
Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.
Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit
Moderation. Moderation
Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation
Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS
Ridge Regression Patrick Breheny September 1 Patrick Breheny BST 764: Applied Statistical Modeling 1/22 Ridge regression: Definition Definition and solution Properties As mentioned in the previous lecture,
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010
Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different
Credit Risk Analysis Using Logistic Regression Modeling
Credit Risk Analysis Using Logistic Regression Modeling Introduction A loan officer at a bank wants to be able to identify characteristics that are indicative of people who are likely to default on loans,
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
Data Mining Introduction
Data Mining Introduction Bob Stine Dept of Statistics, School University of Pennsylvania www-stat.wharton.upenn.edu/~stine What is data mining? An insult? Predictive modeling Large, wide data sets, often
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America
Application of SAS! Enterprise Miner in Credit Risk Analytics Presented by Minakshi Srivastava, VP, Bank of America 1 Table of Contents Credit Risk Analytics Overview Journey from DATA to DECISIONS Exploratory
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
Cross-validation for detecting and preventing overfitting
Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.
A Short Tour of the Predictive Modeling Process
Chapter 2 A Short Tour of the Predictive Modeling Process Before diving in to the formal components of model building, we present a simple example that illustrates the broad concepts of model building.
Car Insurance. Prvák, Tomi, Havri
Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes
THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon
t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. [email protected] www.excelmasterseries.com
Weight of Evidence Module
Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:
Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
A Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
> plot(exp.btgpllm, main = "treed GP LLM,", proj = c(1)) > plot(exp.btgpllm, main = "treed GP LLM,", proj = c(2)) quantile diff (error)
> plot(exp.btgpllm, main = "treed GP LLM,", proj = c(1)) > plot(exp.btgpllm, main = "treed GP LLM,", proj = c(2)) 0.4 0.2 0.0 0.2 0.4 treed GP LLM, mean treed GP LLM, 0.00 0.05 0.10 0.15 0.20 x1 x1 0.4
Predictive modelling around the world 28.11.13
Predictive modelling around the world 28.11.13 Agenda Why this presentation is really interesting Introduction to predictive modelling Case studies Conclusions Why this presentation is really interesting
CART 6.0 Feature Matrix
CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window
(Refer Slide Time: 01:52)
Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Some Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence
Government of Russian Federation Federal State Autonomous Educational Institution of High Professional Education National Research University «Higher School of Economics» Faculty of Computer Science School
Testing for Lack of Fit
Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit
Descriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.
Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing
Regression 3: Logistic Regression
Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing
Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
Outline. Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test
The t-test Outline Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test - Dependent (related) groups t-test - Independent (unrelated) groups t-test Comparing means Correlation
Multiple Regression: What Is It?
Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in
Data Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Linear Regression Other Regression Models References Introduction Introduction Numerical prediction is
" Y. Notation and Equations for Regression Lecture 11/4. Notation:
Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through
Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data.
Chapter 15 Mixed Models A flexible approach to correlated data. 15.1 Overview Correlated data arise frequently in statistical analyses. This may be due to grouping of subjects, e.g., students within classrooms,
Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research
Innovations and Value Creation in Predictive Modeling David Cummings Vice President - Research ISO Innovative Analytics 1 Innovations and Value Creation in Predictive Modeling A look back at the past decade
TURKISH ORACLE USER GROUP
TURKISH ORACLE USER GROUP Data Mining in 30 Minutes Husnu Sensoy Global Maksimum Data & Information Tech. Founder VLDB Expert Agenda Who am I? Different problems of Data Mining In database data mining?!?
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Using Predictive Analytics to Detect Fraudulent Claims
Using Predictive Analytics to Detect Fraudulent Claims May 17, 211 Roosevelt C. Mosley, Jr., FCAS, MAAA CAS Spring Meeting Palm Beach, FL Experience the Pinnacle Difference! Predictive Analysis for Fraud
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds
Combining Linear and Non-Linear Modeling Techniques: Getting the Best of Two Worlds Outline Who is EMB? Insurance industry predictive modeling applications EMBLEM our GLM tool How we have used CART with
Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD
Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,
Data Mining Lab 5: Introduction to Neural Networks
Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese
