Geographically Weighted Regression




Geographically Weighted Regression
CSDE Statistics Workshop
Christopher S. Fowler, PhD
February 1st, 2011
Significant portions of this workshop were culled from presentations prepared by Fotheringham, Charlton and Brunsdon and presented at the 2010 Advanced Workshop on Spatial Analysis at the University of California, Santa Barbara.
University of Washington Center for Studies in Demography and Ecology

Outline for the Session
The motivation for GWR: examples from YOUR discipline
Mapping OLS residuals: a good baseline for why we need GWR
GWR: definitions, basic concepts
Running GWR: a straightforward implementation in ArcGIS
GWR and some extensions

Basics of OLS
$y = X\beta + \varepsilon$
Assumes a stationary process: the same stimulus provokes the same response anywhere in the study area

Why might relationships vary spatially?
Sampling variation
Relationships intrinsically different across space (attitudes, preferences, contextual effects)
Model misspecification

Applications: Ecology
GWR works on trees
Could have been differentiated
Sampling pattern creates predictable and changing levels of interaction among observations

Applications: Public Health
Relationships vary systematically
The relationships between mortality and occupational segregation, and between mortality and unemployment, vary across Tokyo

Applications: Sociology/Public Policy
Missing variables (and they may very well be unknowable)
The link between multifamily housing and residential burglaries varies widely even when controlling for numerous socioeconomic and neighborhood factors

Back up
How do we know if we have nonstationarity in our model?
Map the residuals and test them for spatial autocorrelation. If our model errs systematically with a spatial pattern, then we may be on to something.
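As a rough illustration of the residual test described above, the sketch below computes global Moran's I with NumPy. The workshop itself does this in ArcGIS; the function names, the rook-contiguity grid weights, and both residual patterns here are hypothetical and chosen only to show how the statistic behaves.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for a 1-D array and an (n, n) spatial weights matrix."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    w = np.asarray(weights, dtype=float)
    n = z.size
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

def rook_weights_on_grid(rows, cols):
    """Binary contiguity weights for grid cells that share an edge."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:            # neighbor below
                j = (r + 1) * cols + c
                w[i, j] = w[j, i] = 1.0
            if c + 1 < cols:            # neighbor to the right
                j = r * cols + c + 1
                w[i, j] = w[j, i] = 1.0
    return w

w = rook_weights_on_grid(5, 5)
row_idx, col_idx = np.indices((5, 5))

# Hypothetical residuals with a smooth trend from one corner to the other
trend_residuals = (4.0 - row_idx - col_idx).ravel()
# Hypothetical residuals that alternate in sign like a checkerboard
checker_residuals = np.where((row_idx + col_idx) % 2 == 0, 1.0, -1.0).ravel()

i_trend = morans_i(trend_residuals, w)
i_checker = morans_i(checker_residuals, w)
print("Moran's I, trend pattern:", i_trend)
print("Moran's I, checkerboard pattern:", i_checker)
```

A clearly positive value means neighboring residuals resemble each other, which is the systematic spatial error the slide is asking about. A significance test (for example by permutation) would still be needed before drawing conclusions.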

Poverty in the Southern U.S.

Our example
Model: Poverty = f(FemaleHeadedHousehold, Unemployed, Black, 65andOlder, Metro, AtLeastHighSchoolEducation)
Based on the work of Paul Voss and Katherine Curtis
These are all understood to be good predictors of poverty
What kinds of spatial structures influence this data set?

Lab Part 1
Run our OLS model in ArcGIS
Examine model output
Map residuals
Calculate Moran's I and Local Moran's I

Our best aspatial model

So what now?
Add more missing variables and try again
  Repeat the steps from the lab
Accept that there is something about certain places that makes them different (spatial heterogeneity)
  Try GWR
Test variables meant to explore interactions taking place at short distances (spatial dependence)
  Try spatial regression (likely a spatial lag model)
Assume that the correlation is a nuisance and control for it in the error term
  Try spatial regression (likely a spatial error model)

Outline for Part II
What is GWR
Weighting in GWR

Geographically Weighted Regression
Local statistical technique to analyze spatial variations in relationships
We are not content with global averages of spatial data (climate, for example)
Why should we be satisfied with global averages in a statistical analysis?

Put another way... Simpson's Paradox
If we think of these points as our data, grouped into colors by region, we can see that the global and local models differ significantly
Source: Rücker and Schumacher, BMC Medical Research Methodology 2008, 8:34, doi:10.1186/1471-2288-8-34

Basic definitions
Spatial nonstationarity exists when the same stimulus provokes a different response in different parts of the study region
Global models are statements about processes that are assumed to be stationary and, as such, are location independent
Local models are spatial disaggregations of global models, the results of which are location specific
Spatial heterogeneity refers to spatial patterns resulting from broad similarities, usually over time
Spatial dependence refers to spatial patterns that result from interactions among observations

Spatial Heterogeneity and Spatial Dependence

GWR and Spatial Processes
GWR is excellent at picking up broad-scale regional differences (spatial heterogeneity)
Not as effective at dealing with small-scale interaction processes
Too much bias in each local model
That doesn't mean it won't try (and give you misleading results)

GWR in a nutshell
Global model $y = X\beta + \varepsilon$
becomes $y_i = X_i\beta_i + \varepsilon_i$
where $i$ indicates that there is a set of coefficients estimated for every observation in our data set

The Key Difference
We estimate a set of regression coefficients for each observation
To do so, we weight near observations more heavily than more distant ones
We may also estimate coefficients based on some local subset of observations
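The idea of one weighted regression per observation can be sketched in a few lines of NumPy. This is a minimal illustration, not the ArcGIS implementation used in the lab: the function name `gwr_coefficients`, the Gaussian distance weighting, the bandwidth value, and the synthetic data (a slope that grows from west to east) are all assumptions made for the example.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one distance-weighted least-squares regression per observation.

    coords: (n, 2) locations; X: (n, k) design matrix with an intercept
    column; y: (n,) response. Row i of the result holds the coefficients
    estimated at location i.
    """
    n, k = X.shape
    betas = np.empty((n, k))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # nearer observations weigh more
        sw = np.sqrt(w)
        # Weighted least squares solved as ordinary least squares on rescaled rows
        betas[i] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return betas

# Synthetic example on a 10 x 10 grid (hypothetical data)
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.normal(size=100)
true_slope = 0.1 + 0.1 * coords[:, 0]        # relationship strengthens eastward
y = 1.0 + true_slope * x1
X = np.column_stack([np.ones(100), x1])

local_betas = gwr_coefficients(coords, X, y, bandwidth=2.0)
global_beta = np.linalg.lstsq(X, y, rcond=None)[0]

west_mean = local_betas[coords[:, 0] <= 2, 1].mean()
east_mean = local_betas[coords[:, 0] >= 7, 1].mean()
print("global slope:", global_beta[1])
print("mean local slope, west vs east:", west_mean, east_mean)
```

The global fit returns a single slope, while the local fits recover the west-to-east drift that the single slope averages away.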

Some advantages of GWR
Excellent tool for testing model specification
Where does model fit look good, and where are you missing something?
Residuals generally lower and not spatially autocorrelated

Real values for β
.9 .8 .8 .7 .5
.8 .7 .6 .5 .4
.7 .6 .5 .4 .4
.6 .5 .4 .3 .2
.5 .4 .3 .2 .1

Estimated values of β in global model
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5

Residuals from global model
+ + + + 0
+ + + 0 -
+ + 0 - -
+ 0 - - -
0 - - - -
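The sign pattern of the residuals follows directly from subtracting the single global coefficient from the real local ones. The short check below reproduces it, assuming a 5 x 5 row-by-row layout of the slide values and a positive predictor value at every location, so that the residual has the same sign as the coefficient error.

```python
import numpy as np

# Real local coefficients from the slide, read as a 5 x 5 grid (assumed layout)
real_beta = np.array([
    [0.9, 0.8, 0.8, 0.7, 0.5],
    [0.8, 0.7, 0.6, 0.5, 0.4],
    [0.7, 0.6, 0.5, 0.4, 0.4],
    [0.6, 0.5, 0.4, 0.3, 0.2],
    [0.5, 0.4, 0.3, 0.2, 0.1],
])
global_beta = np.full((5, 5), 0.5)   # the global model estimates 0.5 everywhere

# +1 where the global model underpredicts, -1 where it overpredicts, 0 where exact
residual_sign = np.sign(np.round(real_beta - global_beta, 10)).astype(int)
print(residual_sign)
```

The positive residuals cluster in one corner and the negative ones in the opposite corner, which is exactly the kind of spatial pattern a residual map or Moran's I would flag.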

Reasons to use GWR
Identify model misspecification
Identify nonstationarity in relationships
Improved model fit ($R^2$, AIC, etc.)
Reduced spatial autocorrelation
Represent context
Address spatial heterogeneity when precise variables may not exist

You've convinced me, what next?
Run your aspatial model (as we did in the first lab)
We will want the results and diagnostics to compare with what comes next
Decide how you are going to weight your nearby locations
  Fixed bandwidth
  Variable bandwidth
  User-defined bandwidth

It all comes down to how you weight the observations
We can use a fixed bandwidth $h$
$W_{ij} = \exp\left[-\tfrac{1}{2}\left(d_{ij}/h\right)^2\right]$
Number of observations will vary, but the area they represent will remain constant
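A small numeric sketch of the fixed-bandwidth Gaussian weight may help show how quickly influence decays with distance. The function name, the distances, and the bandwidth of 100 (in whatever distance unit the data use) are hypothetical.

```python
import numpy as np

def gaussian_weights(distances, h):
    """Fixed-bandwidth Gaussian kernel: W_ij = exp(-0.5 * (d_ij / h)^2)."""
    d = np.asarray(distances, dtype=float)
    return np.exp(-0.5 * (d / h) ** 2)

distances = np.array([0.0, 50.0, 100.0, 200.0, 400.0])  # hypothetical d_ij values
weights = gaussian_weights(distances, h=100.0)

for d, w in zip(distances, weights):
    print(f"distance {d:6.1f} -> weight {w:.4f}")
```

The weight is 1 at the regression point itself and never reaches exactly zero, so every observation contributes something; the bandwidth only controls how fast the contribution fades.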

Weighting option 2
Or we can employ an adaptive bandwidth
$W_{ij} = \left[1 - \left(d_{ij}^2 / h^2\right)\right]^2$ if $j$ is one of $i$'s $N$ nearest neighbors (and 0 otherwise)
Number of observations will remain fixed, but the area will not be the same
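The adaptive (bisquare) weighting can be sketched as follows. This assumes the local bandwidth is set to the distance from point i to its N-th nearest neighbor, which is one common convention and not necessarily what ArcGIS does internally; under it, the N-th neighbor itself receives weight zero. The function name and the coordinates are hypothetical.

```python
import numpy as np

def adaptive_bisquare_weights(coords, i, n_neighbors):
    """Bisquare weights around point i with a neighbor-count bandwidth.

    Returns the weights for all observations and the local bandwidth h_i.
    Requires n_neighbors < number of observations.
    """
    d = np.linalg.norm(coords - coords[i], axis=1)
    # Sorted distances start with 0 (point i itself), so index N is the N-th neighbor
    h = np.sort(d)[n_neighbors]
    w = (1.0 - (d / h) ** 2) ** 2
    w[d > h] = 0.0        # observations beyond the bandwidth get no weight
    return w, h

# Hypothetical points: a dense cluster near x = 0 and sparse points farther out
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 20.0, 40.0])
coords = np.column_stack([x, np.zeros_like(x)])

w_dense, h_dense = adaptive_bisquare_weights(coords, i=0, n_neighbors=3)
w_sparse, h_sparse = adaptive_bisquare_weights(coords, i=6, n_neighbors=3)
print("dense area:  h =", h_dense, "weights =", np.round(w_dense, 3))
print("sparse area: h =", h_sparse, "weights =", np.round(w_sparse, 3))
```

The bandwidth stretches in the sparse area and shrinks in the dense one, while the number of observations carrying weight stays the same in both.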

Kernels and Weights
Bandwidth specifies the shape of the weights curve
Kernel type tells us whether we will define our bandwidth based on distance (fixed) or number of neighbors (adaptive)
So how do we know what bandwidth to use?

Judging the appropriate bandwidth
A tradeoff between
  Bias: we include observations that are not part of the same spatial group
  Variance: we don't have enough points in our model to say anything with conviction
AICc or CV measure model fit
Optimize fit to obtain the best bandwidth
[Figure: AIC plotted against bandwidth, with regions labeled Variance, Optimum, and Bias]
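One way to make the bandwidth search concrete is a leave-one-out cross-validation (CV) score evaluated over a handful of candidate bandwidths. The sketch below uses CV and not AICc, a Gaussian kernel, and synthetic data; the function name, the candidate list, and the data-generating process are all assumptions for illustration.

```python
import numpy as np

def loo_cv_score(coords, X, y, h):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    sse = 0.0
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)
        w[i] = 0.0                      # leave observation i out of its own fit
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        sse += (y[i] - X[i] @ beta) ** 2
    return sse

# Synthetic data: slope drifts from west to east, plus a little noise
rng = np.random.default_rng(1)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.normal(size=100)
y = 1.0 + (0.1 + 0.1 * coords[:, 0]) * x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones(100), x1])

# The last candidate is so wide that the fit is effectively the global model
candidates = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 1000.0]
scores = {h: loo_cv_score(coords, X, y, h) for h in candidates}
best_h = min(scores, key=scores.get)

for h in candidates:
    print(f"h = {h:7.1f}  CV score = {scores[h]:.3f}")
print("selected bandwidth:", best_h)
```

Very small bandwidths tend to score poorly because each local fit rests on too few points (variance), and very large ones because they blend observations from different regimes (bias). A grid search like this is coarse; a real optimizer would refine the value between the best candidates.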

To sum
Weighting assumptions are very important to outcomes in GWR
Fixed distance kernel is more appropriate when the distribution of your observations is relatively stable across space (e.g. size, number of neighbors)
Adaptive kernel is appropriate when the distribution varies across space (e.g. events are clustered or polygons are heterogeneous)
Once a kernel type is selected, optimization takes some of the guesswork out of it, but robustness checks are still needed

Residuals from the OLS model from last lesson
Looks reasonably good
Moran's I is still 0.22 and highly significant

Lab
Run GWR model
Check residuals
Check variation in coefficients

Further topics/issues in GWR
Where to go for next steps
General troubleshooting
Significance testing
Outlier problems
Poisson and logistic model implementations
Mixed-form models

Other software implementations of GWR
GWR 3.x (4.0 should be out soon)
R (spgwr package)
Stata
Matlab
Perhaps others I haven't heard of

General Troubleshooting
Regional dummies: BAD
  Eliminate them from the model; we are trying to show regional variation, not control for it
Binary and low-probability count variables
  Use caution: lack of variation may cause the model to crash or have trouble finding a workable bandwidth

Significance Testing
How do I know if the variation I see in my coefficients is meaningful?
Could do a t-test, but you will run into problems with multiple (1,387) tests
  Results in lots of false positives
  Standard correction (Bonferroni) will make any significance finding nearly impossible

Best Method: Monte Carlo simulation
Randomly reassign all observation values (dependent and independent variables travel together) to different observation locations
  Each county's data gets assigned randomly to a different county
Re-run GWR and record coefficients
Repeat lots of times (at least 100)
Define a distribution for coefficient values and compare your coefficients to this distribution
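The permutation procedure above might be sketched like this. Several choices here are assumptions, not taken from the slides: the summary statistic compared against the simulated distribution is the standard deviation of the local coefficients, the kernel is Gaussian with a fixed bandwidth, and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def gwr_coefficients(coords, X, y, h):
    """One Gaussian-weighted least-squares fit per location (minimal GWR)."""
    betas = np.empty(X.shape)
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        sw = np.sqrt(np.exp(-0.5 * (d / h) ** 2))
        betas[i] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return betas

def monte_carlo_test(coords, X, y, h, column, n_sim=100, seed=0):
    """Compare observed spread of a local coefficient with permuted data sets."""
    rng = np.random.default_rng(seed)
    observed = gwr_coefficients(coords, X, y, h)[:, column].std()
    simulated = np.empty(n_sim)
    for s in range(n_sim):
        # Shuffle whole observations across locations: X and y travel together
        perm = rng.permutation(len(y))
        simulated[s] = gwr_coefficients(coords, X[perm], y[perm], h)[:, column].std()
    # Share of data sets (including the observed one) with at least this much spread
    p_value = (1 + np.sum(simulated >= observed)) / (n_sim + 1)
    return observed, simulated, p_value

# Synthetic data with a slope that really does vary across space
rng = np.random.default_rng(2)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.normal(size=100)
y = 1.0 + (0.1 + 0.1 * coords[:, 0]) * x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones(100), x1])

observed, simulated, p_value = monte_carlo_test(coords, X, y, h=2.0, column=1)
print("observed spread of local slopes:", observed)
print("mean spread under permutation:", simulated.mean())
print("Monte Carlo p-value:", p_value)
```

Shuffling destroys any link between location and relationship, so the simulated spreads show how much coefficient variation GWR produces from sampling noise alone.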

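Other method: Fotheringham Significance Test
$\alpha_{\text{Fotheringham}} = \dfrac{\alpha}{1 + p_e - \dfrac{p_e}{np}}$
$\alpha$ is the desired overall significance level
$p_e$ is the effective number of parameters
$p$ is the number of parameters
$n$ is the number of observations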

Fotheringham Significance Test
$\alpha_{\text{Fotheringham}} = \dfrac{\alpha}{1 + p_e - \dfrac{p_e}{np}} = \dfrac{.05}{1 + 37.97 - \dfrac{37.97}{1387 \times 8}} = .001283$
In Excel we can find the significant t-statistic using: TINV(.001283,1379)
In R we use: qt(1-(.001283/2),1379)
Either way we get a value of ~3.23
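The arithmetic behind the adjusted significance level can be checked with a few lines of Python. The function name is hypothetical; the inputs are the values shown on the slide, and the degrees of freedom are taken as n minus p to match the 1379 used in the Excel and R calls.

```python
def adjusted_alpha(alpha, p_e, n, p):
    """Adjusted level: alpha / (1 + p_e - p_e / (n * p))."""
    return alpha / (1.0 + p_e - p_e / (n * p))

alpha = 0.05      # desired overall significance level
p_e = 37.97       # effective number of parameters from the GWR output
n = 1387          # number of observations
p = 8             # number of parameters

alpha_adj = adjusted_alpha(alpha, p_e, n, p)
degrees_of_freedom = n - p

print("adjusted alpha:", round(alpha_adj, 6))
print("degrees of freedom:", degrees_of_freedom)
```

The adjusted level is then passed to a two-tailed t quantile function, as in the Excel and R calls above, to get the critical t-value.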

Results: Significant Nonstationarity for Percent Hispanic

Outlier problems
Outliers cause problems for everybody, but their impact is greater for local regressions, particularly when the bandwidth keeps the number of observations low.
In standard OLS:
  Run the model and identify observations with high or low residuals (~ +/- 4)
  Weight these observations less than 1
  Re-run until none of the observations have extreme residuals
Now do your GWR with the weights assigned

Poisson and Logistic model forms
Implementations exist in both R and the GWR 3.x software
Both require much greater care with respect to collinearity and lack of variation

Mixed-form models
What if some of your variables are stationary and others have variation?
Mixed-form models allow you to hold some coefficients constant while allowing others to vary
Not yet implemented in any statistical package, but not that difficult from a technical standpoint

Concluding comments
What comes next?
Spatial regression
Multilevel models