Predicting Defaults of Loans using Lending Club's Loan Data


Oleh Dubno
Fall 2014, General Assembly Data Science
Link to my Developer Notebook (ipynb)

Background and Hypothesis

The data comes from Lending Club (LC), a peer-to-peer lending company headquartered in San Francisco. LC began as an online consumer-lending platform that enables borrowers to obtain a loan funded by individuals and institutions. LC recently made its loans available to small businesses as well. I will be focusing on the former, the consumer loans. The dataset and the associated description of its features are downloadable on the LC site. It comes equipped with 188,127 values and 31 features.

Goal: Discover the features that are indicative of someone paying or defaulting on their loan.

Tools:
Logistic regression, Naïve Bayes, Decision Tree: to determine which features of the data set contribute towards someone repaying or defaulting on his or her loan, using the Decision Tree to see how well the model performs against a test set.
Folium: to map the features of the dataset.

An initial bar chart of the loan statuses reveals seven unique values. The logistic regression requires only two. (See figure below.) The focus is on predicting who repays or defaults on their loan. As a result, the Current category will be removed, the Fully Paid category will remain, and the rest of the categories will be grouped and characterized as Unpaid. This is then converted to Boolean values: Unpaid = 0 and Paid = 1.
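The relabeling described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's actual code: only "Current" and "Fully Paid" are named in the text, so the other status strings and the helper `encode_status` are assumptions made for the example.

```python
# Hypothetical sample of loan statuses. Only "Current" and "Fully Paid"
# are named in the report; the other strings are illustrative assumptions.
statuses = [
    "Fully Paid",
    "Current",
    "Charged Off",
    "Late (31-120 days)",
    "Fully Paid",
    "Default",
]


def encode_status(status):
    """Return 1 for a paid loan, 0 for an unpaid loan, None for a current loan."""
    if status == "Current":
        return None  # current loans are dropped from the analysis
    return 1 if status == "Fully Paid" else 0


encoded = [encode_status(s) for s in statuses]
labels = [value for value in encoded if value is not None]
print(labels)
```

Dropping the current loans before encoding keeps the target strictly binary, which is what the logistic regression needs.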

The data has now been drastically reduced. Given that Current is a heavy hitter, removing it reduces the dataset to 54,419 entries. This is necessary, since the goal is not to focus on current loans.

Data Overview

The average funded amount of an individual loan is $13, The minimum loan given out is $1,00.00, with a median amount of $12,000 and a maximum amount of just $35, The funded amount appears to be normally distributed, with no obvious skew. Good!

The average annual income is $71, with a minimum income of $4,800, a median of $62,000 and a maximum income of $7,141,778. The maximum value is a definite outlier, so the set will be limited to incomes of $200,000 or less.

Not surprisingly, as Annual Income goes up so does the Funded Amount. The sweet spot, after which Annual Income no longer predicts Funded Amount, seems to be at about the mean annual income of $72,000. I suppose the mean annual income of $72,000 matches the cutoff for loans at $35,000 for good reasons. Interestingly, Lending Club seems to have a strict policy of limiting the Amount Funded according to the individual's Annual Income up to $72,000, after which the amount begins to vary.
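The outlier cap and the summary figures quoted above could be computed along these lines. This is a rough sketch using the standard library only. The income list is made up for illustration (only the $4,800 minimum, the $7,141,778 maximum and the $200,000 cap come from the text), and `INCOME_CAP` is a name introduced here.

```python
import statistics

# Made-up annual incomes for illustration; not the Lending Club data.
incomes = [4800, 45000, 62000, 71000, 98000, 250000, 7141778]

INCOME_CAP = 200_000  # the cap described in the text

# Keep only the rows at or below the cap, then summarize what is left.
capped = [income for income in incomes if income <= INCOME_CAP]

summary = {
    "min": min(capped),
    "median": statistics.median(capped),
    "mean": statistics.mean(capped),
    "max": max(capped),
}
print(summary)
```

Comparing the mean and the median before and after the cap is a quick way to see how much a single extreme income pulls the average.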

Let's run an OLS regression using Annual Income (the predictor) to predict Amount Funded (the explained variable). OLS (Ordinary Least Squares) attempts to predict the dependent variable, Amount Funded, using the independent variable, Annual Income. The regression algorithm learns from this data to predict the right Amount Funded given the Annual Income.

The OLS regression with Annual Income set to predict Amount Funded (limiting the dataset to income <= $200,000) shows an R^2 of .201. This means that 20% of the variance in Funded Amount is explained by Annual Income. This, however, is a low R^2. With the assistance of the scatter plot, we do see that Annual Income is suggestive in determining the Funded Amount only up to an Annual Income of $72,000.

Logistic Regression

Next: four logistic regressions determining loan status.

The first logistic regression uses the length of employment and the grade that the loan received from LC to predict loan status. Below is a chart highlighting the coefficients. Coefficients represent the change in the response (for a logistic regression, in the log-odds of the response) for one unit of change in the predictor variable. In other words, a 1 year increase in employment length increases the chance of the loan being paid back by A 2 year increase in employment length increases the chance of the loan being paid back by , and so on.

It would be interesting to see how effective the grade that LC assigns to its loans is at predicting loan status. Some background: the provided grades range from A to G, A being the highest and G the lowest. As a result I mapped 7, the highest value, to A, 6 to B, 5 to C and so on until 1 as G. As the grade increases by 1 grade value the chance of the loan being paid off increases by Given that we're using a binary output of 0 as unpaid and 1 as paid, the closer the product of the grade and the coefficient is to 1, the higher the likelihood of the loan being paid off. Pretty much, if the grade is E or 3 the chance of payback is very high.
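For readers who want to see where an R^2 figure like the one reported for the OLS fit comes from, here is a minimal single-predictor least-squares sketch in plain Python. It is not the notebook's code. The function name `ols_fit` and the toy numbers (in thousands of dollars) are assumptions for illustration.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by least squares and return
    (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy data in thousands of dollars; made up, not the Lending Club data.
annual_income = [30, 50, 70, 90]
funded_amount = [8, 12, 15, 14]

slope, intercept, r_squared = ols_fit(annual_income, funded_amount)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

R^2 is the share of the variance in the funded amount that the fitted line accounts for, which is how a value of .201 translates into "20% of the variance".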

The second logistic regression uses funded amount and annual income to predict loan status. The coefficients for funded amount and annual income are so low because the predictors are dollar amounts in the thousands, while the explained variable, loan status, is binary, ranging from 0 to 1.

Let's look at the amount funded. As the amount funded increases by $10,000 the chance of it getting paid back decreases by = (10,000 x ). Similarly, as annual income increases so does the chance of the loan being paid off. Intuitive, right? This is understandable and supported by the positive coefficient In other words as the annual income increases by $10,000 so does the chance of the loan being paid back by (10,000 x )

The third logistic regression uses home ownership status (Rent, Mortgage, Own, None, Other) to predict loan status. My understanding is that someone who puts OTHER for home ownership on the loan application either did not want to reveal their home ownership situation, is hiding something, or is bad at filling out applications. None could be an honest answer from someone who may be living with their parents. Regardless, it seems that if someone checks off OTHER and gets funded, then there's a very good chance of that individual defaulting on his or her loan.
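The "10,000 x coefficient" arithmetic above deserves a closer look. In a standard logistic regression that product is a change in log-odds, not directly a change in probability. The sketch below shows the conversion under stated assumptions: the coefficient value and the 80% baseline repayment probability are hypothetical placeholders (the report's actual coefficients are not available here), and `probability_after_change` is a helper introduced for this example.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_after_change(base_prob, coef, delta):
    """Shift a baseline probability by coef * delta on the log-odds scale."""
    base_log_odds = math.log(base_prob / (1.0 - base_prob))
    return sigmoid(base_log_odds + coef * delta)


# Hypothetical coefficient for funded amount (per dollar); NOT from the report.
coef_funded = -0.00002
delta = 10_000  # a $10,000 increase in the funded amount

log_odds_change = coef_funded * delta  # change in log-odds
odds_ratio = math.exp(log_odds_change)  # multiplicative change in the odds
new_prob = probability_after_change(0.80, coef_funded, delta)

print(f"log-odds change: {log_odds_change:.2f}")
print(f"odds ratio: {odds_ratio:.3f}")
print(f"repayment probability moves from 0.800 to {new_prob:.3f}")
```

Because the mapping from log-odds to probability is nonlinear, the same coefficient produces a different change in probability depending on the baseline it starts from.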

The fourth logistic regression uses employment length (<1 year to 10+ years) to predict loan status. There doesn't immediately appear to be much variance between the generated coefficients for years employed. It looks like, so long as the person is employed, they will pay back their loan. However, it holds true that if someone is unemployed or has less than a year of employment then they'll have a lower chance of repaying their loan. I didn't investigate what percentage of the <1 year group is employed or unemployed. Interestingly, and probably just a coincidence because the results are really marginal, a person employed for 4 years has the same coefficient for paying back their loan as someone employed for one year or less. Just an observation. I will not be pursuing that point any further.

To conclude the work on logistic regression: the data set still has unexplored features that I lacked the experience and time to explore. From the findings that I got, I can't speak definitively, but I would say avoid giving loans to people who don't specify home ownership, and do give loans to people with higher income.

Decision Tree and The Confusion Matrix

A confusion matrix allows for more detailed analysis than the mere proportion of correct guesses. For instance, 177 paid loans were incorrectly predicted as unpaid. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (177 loans + 31,594 loans) and the total number of incorrect predictions is (177 loans + 8,920 loans). The confusion matrix provides the information needed to determine how well a classification model performs. The performance metric, accuracy, summarizes this information with a single number: .777. Accuracy takes the total number of correct predictions and divides it by the total number of all predictions made.
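The accuracy calculation described above (correct predictions divided by all predictions) can be sketched as follows. The labels are toy values, not the report's test set, and the helper names `confusion_counts` and `accuracy` are introduced here for illustration. Paid is coded as 1 and unpaid as 0, as in the rest of the report.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for binary labels (1 = paid)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1  # paid, predicted paid
        elif a == 0 and p == 0:
            counts["tn"] += 1  # unpaid, predicted unpaid
        elif a == 0 and p == 1:
            counts["fp"] += 1  # unpaid, predicted paid
        else:
            counts["fn"] += 1  # paid, predicted unpaid
    return counts


def accuracy(counts):
    """Correct predictions divided by all predictions."""
    correct = counts["tp"] + counts["tn"]
    return correct / sum(counts.values())


# Toy labels for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 1, 1]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))
```

Only the diagonal cells (paid predicted as paid, unpaid predicted as unpaid) count as correct; the two off-diagonal cells are the errors.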

Mapping Paid and Unpaid Loans

The above map is referred to as a choropleth map, "a thematic map in which areas are shade patterned in proportion to the measurement of the statistical variable being displayed." (Wikipedia) As the intensity of the color increases (gets closer to 1), on average the majority of the people residing in that state have paid off their loan. The number near the point shows the number of loans given in that state. By the looks of the map, Nebraska, Missouri, Oregon, Virginia, Montana, Wyoming and South Dakota are the states that are not too fortunate in repaying their loans. Of course this is an average of individual loans per state, ignoring specific regions within the state, and it is not the best estimate of whether a funded individual in that state is likely to repay their loan. However, maybe the other features could help determine which state is less likely to pay off a loan.
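The per-state values behind a map like this (the average of the 0/1 paid flag and the loan count per state) could be aggregated roughly as below. This is a plain-Python sketch with made-up records. The function name `paid_rate_by_state` is an assumption, and the step that hands these numbers to the mapping library is left out.

```python
from collections import defaultdict

# Made-up (state, paid_flag) records for illustration; 1 = paid, 0 = unpaid.
loans = [("NY", 1), ("NY", 0), ("CA", 1), ("CA", 1), ("MS", 0), ("NY", 1)]


def paid_rate_by_state(records):
    """Return {state: (share of loans paid, number of loans)}."""
    totals = defaultdict(lambda: [0, 0])  # state -> [paid count, loan count]
    for state, paid in records:
        totals[state][0] += paid
        totals[state][1] += 1
    return {state: (paid / n, n) for state, (paid, n) in totals.items()}


rates = paid_rate_by_state(loans)
for state, (rate, n) in sorted(rates.items()):
    print(f"{state}: paid rate {rate:.2f} across {n} loans")
```

Keeping the loan count next to the rate matters when reading the map, since a state with very few loans can show an extreme average by chance.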

Mapping Amount Funded

Understanding that as the amount funded increases so does the chance of the loan not being paid back, we can see that Mississippi is a state with a fairly large funded amount. According to the map on loan status, Mississippi is also a state that doesn't do too well in repaying its loans. On average, individuals receiving a loan in Mississippi are much more likely to default on their loan, as they are also likelier to receive bigger loans. Let's look further.

Mapping Annual Income

Several outliers in annual income have been removed from the data. Before removing the outliers, the income ranges from $33, to $7,241,778, which is an obscene amount. I limit it to $200, The map range reflects annual income up to $120,000. Interestingly, Mississippi is a state with an average income, between 60k and 80k, that has the lowest payback rate and, on average, takes out the highest loans.

Mapping The Grade Assigned to Individual Loans

Keeping on track with Mississippi, a state I'm not too familiar with, it also happens to have a terrible rating for loans according to the data. I could understand why Lending Club, on average, would give a pretty poor grade to loans in Oregon. The average population there has a fairly good income, but I guess it's not too predictive of a good grade. We can see that by looking at the income map presented before.

Mapping Employment Length

Mississippi appears to have fairly good employment. It doesn't appear to be too predictive of their faulty loans.

Conclusion: Avoid Mississippi. Wish I could go further into this. Don't give a loan to someone who doesn't know his or her homeownership status.

Lending Club data download site: https://www.lendingclub.com/info/download-data.action

Using Excel for Statistical Analysis

Using Excel for Statistical Analysis Using Excel for Statistical Analysis You don t have to have a fancy pants statistics package to do many statistical functions. Excel can perform several statistical tests and analyses. First, make sure

More information

Lending Club Interest Rate Data Analysis

Lending Club Interest Rate Data Analysis Lending Club Interest Rate Data Analysis 1. Introduction Lending Club is an online financial community that brings together creditworthy borrowers and savvy investors so that both can benefit financially

More information

Presentation of Data

Presentation of Data ECON 2A: Advanced Macroeconomic Theory I ve put together some pointers when assembling the data analysis portion of your presentation and final paper. These are not all inclusive, but are things to keep

More information

Title: Lending Club Interest Rates are closely linked with FICO scores and Loan Length

Title: Lending Club Interest Rates are closely linked with FICO scores and Loan Length Title: Lending Club Interest Rates are closely linked with FICO scores and Loan Length Introduction: The Lending Club is a unique website that allows people to directly borrow money from other people [1].

More information

Introduction to Regression. Dr. Tom Pierce Radford University

Introduction to Regression. Dr. Tom Pierce Radford University Introduction to Regression Dr. Tom Pierce Radford University In the chapter on correlational techniques we focused on the Pearson R as a tool for learning about the relationship between two variables.

More information

Notes 5: More on regression and residuals ECO 231W - Undergraduate Econometrics

Notes 5: More on regression and residuals ECO 231W - Undergraduate Econometrics Notes 5: More on regression and residuals ECO 231W - Undergraduate Econometrics Prof. Carolina Caetano 1 Regression Method Let s review the method to calculate the regression line: 1. Find the point of

More information

Paying off a debt. Ethan D. Bolker Maura B. Mast. December 4, 2007

Paying off a debt. Ethan D. Bolker Maura B. Mast. December 4, 2007 Paying off a debt Ethan D. Bolker Maura B. Mast December 4, 2007 Plan Lecture notes Can you afford a mortgage? There s a $250,000 condominium you want to buy. You ve managed to scrape together $50,000

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

TAX SALE ARBITRAGE SYSTEM

TAX SALE ARBITRAGE SYSTEM TAX SALE ARBITRAGE SYSTEM TAX SALE ARBITRAGE SYSTEM I am Not an attorney, Not giving legal or tax advise Best efforts to find legal answers Law always changing Giving you more information in the online

More information

MEASURES OF DISPERSION

MEASURES OF DISPERSION MEASURES OF DISPERSION Measures of Dispersion While measures of central tendency indicate what value of a variable is (in one sense or other) average or central or typical in a set of data, measures of

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these

More information

Equifax Risk Score 2.0

Equifax Risk Score 2.0 Equifax Risk Score 2.0 Customer Handbook Prepared by: Larry Macdonald, Sr. Product Manager 10-Jun-2014 Table of Contents Introducing the Equifax Risk Score 2.0... 3 1. Modeling Concepts... 3 1.1 Population

More information

Regression Analysis Using ArcMap. By Jennie Murack

Regression Analysis Using ArcMap. By Jennie Murack Regression Analysis Using ArcMap By Jennie Murack Regression Basics How is Regression Different from other Spatial Statistical Analyses? With other tools you ask WHERE something is happening? Are there

More information

Scatter Plots with Error Bars

Scatter Plots with Error Bars Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each

More information

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

Everything you wanted to know about using Hexadecimal and Octal Numbers in Visual Basic 6

Everything you wanted to know about using Hexadecimal and Octal Numbers in Visual Basic 6 Everything you wanted to know about using Hexadecimal and Octal Numbers in Visual Basic 6 Number Systems No course on programming would be complete without a discussion of the Hexadecimal (Hex) number

More information

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

More information

Understanding. What you need to know about the most widely used credit scores

Understanding. What you need to know about the most widely used credit scores Understanding What you need to know about the most widely used credit scores 300 850 2 The score lenders use. FICO Scores are the most widely used credit scores according to a recent CEB TowerGroup analyst

More information

Statistical Foundations: Measures of Location and Central Tendency and Summation and Expectation

Statistical Foundations: Measures of Location and Central Tendency and Summation and Expectation Statistical Foundations: and Central Tendency and and Lecture 4 September 5, 2006 Psychology 790 Lecture #4-9/05/2006 Slide 1 of 26 Today s Lecture Today s Lecture Where this Fits central tendency/location

More information

Models for Discrete Variables

Models for Discrete Variables Probability Models for Discrete Variables Our study of probability begins much as any data analysis does: What is the distribution of the data? Histograms, boxplots, percentiles, means, standard deviations

More information

Helpful Information for a First Time Mortgage

Helpful Information for a First Time Mortgage Helpful Information for a First Time Mortgage Getting Started Many people buying their first home are afraid lenders don't really want to work with them. But that's simply not true. Without you, there

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

A short primer on residual plots

A short primer on residual plots Chapter 24 A short primer on residual plots Contents 24.1 Linear Regression................................... 1598 24.2 ANOVA residual plots................................. 1599 24.3 Logistic Regression

More information

Prediction of Car Prices of Federal Auctions

Prediction of Car Prices of Federal Auctions Prediction of Car Prices of Federal Auctions BUDT733- Final Project Report Tetsuya Morito Karen Pereira Jung-Fu Su Mahsa Saedirad 1 Executive Summary The goal of this project is to provide buyers who attend

More information

6th Grade Lesson Plan: Probably Probability

6th Grade Lesson Plan: Probably Probability 6th Grade Lesson Plan: Probably Probability Overview This series of lessons was designed to meet the needs of gifted children for extension beyond the standard curriculum with the greatest ease of use

More information

4. Introduction to Statistics

4. Introduction to Statistics Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

More information

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

More information

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

More information

X - Xbar : ( 41-50) (48-50) (50-50) (50-50) (54-50) (57-50) Deviations: (note that sum = 0) Squared :

X - Xbar : ( 41-50) (48-50) (50-50) (50-50) (54-50) (57-50) Deviations: (note that sum = 0) Squared : Review Exercises Average and Standard Deviation Chapter 4, FPP, p. 74-76 Dr. McGahagan Problem 1. Basic calculations. Find the mean, median, and SD of the list x = (50 41 48 54 57 50) Mean = (sum x) /

More information

Relative Risk, Odds, and Fisher s exact test

Relative Risk, Odds, and Fisher s exact test Relative Risk, Odds, and Fisher s exact test I) Relative Risk A) Simply, relative risk is the ratio of p 1 / p 2. For instance, suppose we wanted to take another look at our Seat belt safety data from

More information

Discriminant Function Analysis in SPSS To do DFA in SPSS, start from Classify in the Analyze menu (because we re trying to classify participants into

Discriminant Function Analysis in SPSS To do DFA in SPSS, start from Classify in the Analyze menu (because we re trying to classify participants into Discriminant Function Analysis in SPSS To do DFA in SPSS, start from Classify in the Analyze menu (because we re trying to classify participants into different groups). In this case we re looking at a

More information

JetBlue Airways Stock Price Analysis and Prediction

JetBlue Airways Stock Price Analysis and Prediction JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue

More information

harpreet@utdallas.edu, {ram.gopal, xinxin.li}@business.uconn.edu

harpreet@utdallas.edu, {ram.gopal, xinxin.li}@business.uconn.edu Risk and Return of Investments in Online Peer-to-Peer Lending (Extended Abstract) Harpreet Singh a, Ram Gopal b, Xinxin Li b a School of Management, University of Texas at Dallas, Richardson, Texas 75083-0688

More information

THINGS TO KNOW WHEN SHOPPING FOR STUDENT LOANS BROUGHT TO YOU BY

THINGS TO KNOW WHEN SHOPPING FOR STUDENT LOANS BROUGHT TO YOU BY 10 THINGS TO KNOW WHEN SHOPPING FOR STUDENT LOANS BROUGHT TO YOU BY The College Ave Team WHAT S INSIDE 4 UNDERSTAND HOW LENDING (AND BORROWING) WORKS 5 THERE ARE TWO TYPES OF STUDENT LOANS: FEDERAL AND

More information

CAPSTONE ADVISOR: PROFESSOR MARY HANSEN

CAPSTONE ADVISOR: PROFESSOR MARY HANSEN STEVEN NWAMKPA GOVERNMENT INTERVENTION IN THE FINANCIAL MARKET: DOES AN INCREASE IN SMALL BUSINESS ADMINISTRATION GUARANTEE LOANS TO SMALL BUSINESSES INCREASE GDP PER CAPITA INCOME? CAPSTONE ADVISOR: PROFESSOR

More information

A Sample Portfolio Optimization Program in Excel PRELIMINARY USER INSTRUCTIONS

A Sample Portfolio Optimization Program in Excel PRELIMINARY USER INSTRUCTIONS A Sample Portfolio Optimization Program in Excel PRELIMINARY USER INSTRUCTIONS Copyright 2006 By Robert C. Smithson, Anava Capital Management LLC And Anava Corporation This program is free software; you

More information

The Dummy s Guide to Data Analysis Using SPSS

The Dummy s Guide to Data Analysis Using SPSS The Dummy s Guide to Data Analysis Using SPSS Mathematics 57 Scripps College Amy Gamble April, 2001 Amy Gamble 4/30/01 All Rights Rerserved TABLE OF CONTENTS PAGE Helpful Hints for All Tests...1 Tests

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Chapter 15 Multiple Choice Questions (The answers are provided after the last question.)

Chapter 15 Multiple Choice Questions (The answers are provided after the last question.) Chapter 15 Multiple Choice Questions (The answers are provided after the last question.) 1. What is the median of the following set of scores? 18, 6, 12, 10, 14? a. 10 b. 14 c. 18 d. 12 2. Approximately

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Regression Analysis: Basic Concepts

Regression Analysis: Basic Concepts The simple linear model Regression Analysis: Basic Concepts Allin Cottrell Represents the dependent variable, y i, as a linear function of one independent variable, x i, subject to a random disturbance

More information

SPSS: Descriptive and Inferential Statistics. For Windows

SPSS: Descriptive and Inferential Statistics. For Windows For Windows August 2012 Table of Contents Section 1: Summarizing Data...3 1.1 Descriptive Statistics...3 Section 2: Inferential Statistics... 10 2.1 Chi-Square Test... 10 2.2 T tests... 11 2.3 Correlation...

More information

Credit Scorecards for SME Finance The Process of Improving Risk Measurement and Management

Credit Scorecards for SME Finance The Process of Improving Risk Measurement and Management Credit Scorecards for SME Finance The Process of Improving Risk Measurement and Management April 2009 By Dean Caire, CFA Most of the literature on credit scoring discusses the various modelling techniques

More information

STAB22 section 2.1. Figure 1: Scatterplot of price vs. size for Mocha Frappuccino

STAB22 section 2.1. Figure 1: Scatterplot of price vs. size for Mocha Frappuccino STAB22 section 2.1 2.3 Both ounces and price are quantitative variables, and so we could draw a scatterplot to see how they are related. We might expect that bigger sizes cost more, though a Venti (24

More information

The United States of Obesity

The United States of Obesity The United States of Obesity Matt Malloure Grand Valley State University matt.malloure@gmail.com Diann Reischman Grand Valley State University reischmd@gvsu.edu Mary Richardson Grand Valley State University

More information

Figure 1. An embedded chart on a worksheet.

Figure 1. An embedded chart on a worksheet. 8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the

More information

The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner

The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Paper 3361-2015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,

More information

Inferential Statistics

Inferential Statistics Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are

More information

Week 4: Standard Error and Confidence Intervals

Week 4: Standard Error and Confidence Intervals Health Sciences M.Sc. Programme Applied Biostatistics Week 4: Standard Error and Confidence Intervals Sampling Most research data come from subjects we think of as samples drawn from a larger population.

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Approximately 45 minutes worth of materials for a Y8 9 Citizenship/PSHE lesson on Managing money / Personal finance.

Approximately 45 minutes worth of materials for a Y8 9 Citizenship/PSHE lesson on Managing money / Personal finance. Approximately 45 minutes worth of materials for a Y8 9 Citizenship/PSHE lesson on Managing money / Personal finance. Learning objectives: understanding that some money choices are risky evaluating the

More information

A better way your parents can help you into your first home.
