# Regression analysis of probability-linked data

## Transcription

Slide 1: Regression analysis of probability-linked data

- Ray Chambers, University of Wollongong
- James Chipperfield, Australian Bureau of Statistics
- Walter Davis, Statistics New Zealand

Slide 2: Overview

1. Probability linkage overview
2. A statistical framework for linkage errors
3. Linear regression with linked data
4. Extension to estimating equations / logistic regression
5. Linking samples to registers
6. Linking nested registers
7. Application: ABS Census Data Enhancement project
8. Future research

Slide 3: Research questions

- What impact do errors in probability linking of records from different sources have on subsequent statistical analysis of the linked data?
- How should standard statistical methods, in particular regression modelling, be modified in order to minimise this impact?

Slide 4: Background

- Population of $N$ units, indexed by $i = 1, \ldots, N$
- Response: scalar random variable $Y$
- Explanators: vector random variable $X$
- Linear regression model:

$$E_X(Y) = X^T\beta, \qquad \mathrm{Var}_X(Y) = \sigma^2$$

- Logistic regression model:

$$\mathrm{Pr}_X(Y = 1) = \frac{\exp(X^T\beta)}{1 + \exp(X^T\beta)}$$

Slide 5

- Estimation of $\beta$ is straightforward given a random sample of values of $(Y, X)$ for units from this population.
- We do not have such a sample. Instead, data are available from two linked population registers:
  - the Y-register contains values of $Y$
  - the X-register contains values of $X$
- There is no unique identifier to link records from the two registers, so a probabilistic record linkage method is used.

Slide 6: Probabilistic record linkage

Fellegi & Sunter (1969): "Record Linkage is a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events..."

A set of observed variables present in both registers (matching variables) is used to link records in order to maximise the probability that they refer to the same unit.

Slide 7: Applications of record linkage

- Merging of large databases (e.g. before data mining)
- Removing duplicates in registries
- Generating longitudinal records from cross-sectional data
- Combining data sources (e.g. survey and register)

A very active research area, with a focus on:

- Developing efficient and accurate matching algorithms
- Alternatives to the statistical theory of Fellegi-Sunter
- Effects of linkage error on statistical analysis

Slide 8: An early example

Neter et al. (1965): focus on confounding of linkage errors and response errors, with application to Netherlands Validation Study.

[Figure: reported bank balance plotted against bank balance from linked data, showing the observed relationship.]

Slide 9

- Can the observed regression-to-the-mean relationship between reported and linked bank balances be explained by linkage errors rather than response errors?
- Using a simple linkage error model (essentially the same as the one that we use later), they show that Pr(correct link) has to be less than 80% to explain this behaviour, from which they conclude that response errors underpin the discrepancies.

Slide 10: Analysis framework

- Both registers are complete, with no duplications, so each contains $N$ records
- All records on both registers are linked
- Can define a categorical variable $Z$ on both registers that is
  - measured without error on both
  - takes $Q$ distinct values $1, 2, \ldots, Q$
  - has $M_q$ records in each register with $Z = q$ (so $N = \sum_q M_q$)
  - referred to as a blocking variable in what follows

Slide 11: Implications

- Block structure: linkage errors can only occur within blocks. Record A on the Y-register with $Z = z_A$ and record B on the X-register with $Z = z_B$ must correspond to different units if $z_A \neq z_B$, but $z_A = z_B$ does not mean that A and B are the same unit.
- Complete linkage: every record in block $q$ of the Y-register is linked with a unique record in the same block $q$ of the X-register.

Slide 12: Notation

We index the records on the linked data set in exactly the same way as we index the X-register.

- For each block $q$ we have $M_q$ linked data pairs $(y_i^*, x_i)$, where $y_i^*$ denotes the Y-value from the record on the Y-register in block $q$ that is linked to the record on the X-register in block $q$ with value $x_i$
- $y_q^*$ = vector defined by the values $(y_i^*;\ i \in q)$
- $X_q$ = matrix with rows defined by the values $(x_i;\ i \in q)$
- $y_q$ = unknown vector corresponding to the true Y values $(y_i;\ i \in q)$ associated with $X_q$

Slide 13: A model for linkage error

$$y_q^* = A_q y_q$$

- $A_q$ is an unknown random permutation matrix of order $M_q$, i.e. entries of $A_q$ are either zero or one, with a value of one occurring just once in each row and column
- $E_X(A_q) = E_q$
- Non-informative linkage, i.e. records are mis-matched at random given $X$:

$$E_X(A_q y_q) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta$$

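Slide 14: Example

$$y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
\quad\text{while}\quad
y^* = \begin{pmatrix} y_3 \\ y_2 \\ y_5 \\ y_4 \\ y_1 \end{pmatrix}
\quad\Rightarrow\quad
A = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0
\end{pmatrix}$$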
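The permutation in this example can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `permutation_matrix` and the numeric values standing in for $y_1, \ldots, y_5$ are hypothetical and not from the slides.

```python
import numpy as np

def permutation_matrix(source_index):
    """Row i has a single 1 in column source_index[i], so (A @ y)[i] == y[source_index[i]]."""
    m = len(source_index)
    A = np.zeros((m, m), dtype=int)
    A[np.arange(m), source_index] = 1
    return A

# Slide example in 0-based indexing: linked record 1 carries y3,
# record 3 carries y5 and record 5 carries y1; records 2 and 4 are correct.
A = permutation_matrix([2, 1, 4, 3, 0])

y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical true values y1..y5
y_star = A @ y                                # linked values
```

Each row and each column of `A` contains exactly one 1, which is the property the later slides rely on.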

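Slide 15: An exchangeable linkage error model

Since the linkage process maximises the probability that a declared link is a true link, correct linkages should be more likely than incorrect linkages.

- Pr(correct linkage in block $q$) = $\lambda_q$
- Pr(wrong linkage in block $q$) = $\gamma_q$

$$E_q = E_X(A_q) = \begin{pmatrix}
\lambda_q & \gamma_q & \cdots & \gamma_q \\
\gamma_q & \lambda_q & \cdots & \gamma_q \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_q & \gamma_q & \cdots & \lambda_q
\end{pmatrix}$$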

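Slide 16: Implication

Complete linkage within a block, so $A_q 1_q = A_q^T 1_q = 1_q$, where $1_q$ is the vector of $M_q$ ones.

[The slide's numerical example of $A_q 1_q$ did not survive transcription.]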

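Slide 17

It immediately follows that $E_q 1_q = E_q^T 1_q = 1_q$, i.e.

$$\begin{pmatrix}
\lambda_q & \gamma_q & \cdots & \gamma_q \\
\gamma_q & \lambda_q & \cdots & \gamma_q \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_q & \gamma_q & \cdots & \lambda_q
\end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$

or equivalently

$$\lambda_q + (M_q - 1)\gamma_q = 1 \quad\Rightarrow\quad \gamma_q = \frac{1 - \lambda_q}{M_q - 1}$$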

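Slide 18

For example, when $M = 50$ and $\lambda = 0.9$, the off-diagonal value is $\gamma = 0.1/49 \approx 0.002$ and

$$E = \begin{pmatrix}
0.9 & 0.1/49 & \cdots & 0.1/49 \\
0.1/49 & 0.9 & \cdots & 0.1/49 \\
\vdots & \vdots & \ddots & \vdots \\
0.1/49 & 0.1/49 & \cdots & 0.9
\end{pmatrix}$$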
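As an illustration, the exchangeable matrix for this example can be built directly from $\lambda$ and the block size. This is a sketch assuming NumPy; `exchangeable_E` is a hypothetical helper name, not part of the slides or of the LinkReg macro.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Exchangeable linkage matrix: lam on the diagonal and
    gamma = (1 - lam) / (M - 1) everywhere else, so rows and columns sum to 1."""
    if M == 1:
        return np.ones((1, 1))  # a block of size one can only be linked correctly
    gamma = (1.0 - lam) / (M - 1)
    E = np.full((M, M), gamma)
    np.fill_diagonal(E, lam)
    return E

# Slide example: M = 50, lambda = 0.9, hence gamma = 0.1 / 49
E = exchangeable_E(50, 0.9)
row_sums = E.sum(axis=1)
```

The row and column sums equal one, matching the constraint $\lambda_q + (M_q - 1)\gamma_q = 1$.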

Slide 19: Linear regression

Standard approach: estimates $\beta$ as if $y^*$ and $y$ are identical, i.e. as if $A$ is the identity matrix,

$$\hat\beta = (X^T X)^{-1} X^T y^* = \Big[\sum_q X_q^T X_q\Big]^{-1} \sum_q X_q^T A_q y_q$$

Biased if linkage is not perfect:

$$E_X(\hat\beta) = \Big[\sum_q X_q^T X_q\Big]^{-1} \sum_q X_q^T E_q X_q \beta = D\beta \neq \beta$$
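The size of this bias can be inspected by computing $D$ directly. The sketch below assumes NumPy, the exchangeable form of $E_q$ from the earlier slides, and a hypothetical single block with an intercept and one covariate; the function names are illustrative only.

```python
import numpy as np

def exchangeable_E(M, lam):
    """E_q with lam on the diagonal and gamma = (1 - lam)/(M - 1) elsewhere."""
    gamma = (1.0 - lam) / (M - 1)
    E = np.full((M, M), gamma)
    np.fill_diagonal(E, lam)
    return E

def bias_matrix(X_blocks, E_blocks):
    """D = (sum_q X_q'X_q)^{-1} sum_q X_q' E_q X_q, so that E(beta_naive) = D beta."""
    XtX = sum(X.T @ X for X in X_blocks)
    XtEX = sum(X.T @ E @ X for X, E in zip(X_blocks, E_blocks))
    return np.linalg.solve(XtX, XtEX)

rng = np.random.default_rng(0)   # hypothetical data, for illustration only
M, lam = 200, 0.75
X = np.column_stack([np.ones(M), rng.uniform(0.0, 1.0, M)])
D = bias_matrix([X], [exchangeable_E(M, lam)])
slope_shrinkage = D[1, 1]        # equals lam - gamma when there is a single block
```

With a single exchangeable block and an intercept in the model, the slope is attenuated by the factor $\lambda - \gamma$, while $D$ reduces to the identity when $\lambda = 1$.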

Slide 20: Bias-correcting the naive OLS estimator

Motivated by the approach of Scheuren and Winkler (1993). If $E$ is known, $D$ is also known. Provided $D^{-1}$ exists, an unbiased estimator of $\beta$ is

$$\hat\beta_{SW} = D^{-1}\hat\beta = \Big\{\Big[\sum_q X_q^T X_q\Big]^{-1} \sum_q X_q^T E_q X_q\Big\}^{-1} \hat\beta$$

Assuming that $\sum_q X_q^T E_q X_q$ is of full rank,

$$\hat\beta_{SW} = \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T y_q^*\Big)$$
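A minimal sketch of this correction, assuming NumPy and known exchangeable $E_q$; the function names and the two-block data are hypothetical. To keep the check deterministic, the linked response is set to its expectation $E_q X_q \beta$ (no regression noise), in which case the corrected estimator recovers $\beta$ exactly while the naive one does not.

```python
import numpy as np

def beta_naive(X_blocks, y_star_blocks):
    """OLS that treats the linked responses as if they were correctly linked."""
    XtX = sum(X.T @ X for X in X_blocks)
    Xty = sum(X.T @ y for X, y in zip(X_blocks, y_star_blocks))
    return np.linalg.solve(XtX, Xty)

def beta_sw(X_blocks, E_blocks, y_star_blocks):
    """Bias-corrected estimator: (sum_q X_q'E_q X_q)^{-1} (sum_q X_q' y*_q)."""
    XtEX = sum(X.T @ E @ X for X, E in zip(X_blocks, E_blocks))
    Xty = sum(X.T @ y for X, y in zip(X_blocks, y_star_blocks))
    return np.linalg.solve(XtEX, Xty)

beta_true = np.array([1.0, 5.0])
X_blocks, E_blocks, y_star_blocks = [], [], []
for M, lam in [(30, 0.95), (20, 0.75)]:          # hypothetical block sizes and lambdas
    X = np.column_stack([np.ones(M), np.linspace(0.0, 1.0, M)])
    E = np.full((M, M), (1.0 - lam) / (M - 1))   # exchangeable E_q
    np.fill_diagonal(E, lam)
    X_blocks.append(X)
    E_blocks.append(E)
    y_star_blocks.append(E @ X @ beta_true)      # expected linked response

b_naive = beta_naive(X_blocks, y_star_blocks)
b_sw = beta_sw(X_blocks, E_blocks, y_star_blocks)
```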

Slide 21: An unbiased OLS estimator

Lahiri and Larsen (2005):

$$E_X(y_q^*) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta = H_q \beta$$

The OLS estimator based on this corrected model matrix is unbiased:

$$\hat\beta_{LL} = \Big(\sum_q H_q^T H_q\Big)^{-1} \Big(\sum_q H_q^T y_q^*\Big) = \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T y_q^*\Big)$$
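In code, this estimator is ordinary least squares on the stacked corrected model matrices $H_q = E_q X_q$. The sketch below assumes NumPy and a known $E_q$; `beta_ll` and the single-block data are hypothetical.

```python
import numpy as np

def beta_ll(X_blocks, E_blocks, y_star_blocks):
    """Least squares of the linked responses on H_q = E_q X_q, stacked over blocks."""
    H = np.vstack([E @ X for X, E in zip(X_blocks, E_blocks)])
    y_star = np.concatenate(y_star_blocks)
    beta, *_ = np.linalg.lstsq(H, y_star, rcond=None)
    return beta

# Hypothetical single block with exchangeable linkage errors
M, lam = 40, 0.8
X = np.column_stack([np.ones(M), np.linspace(0.0, 1.0, M)])
E = np.full((M, M), (1.0 - lam) / (M - 1))
np.fill_diagonal(E, lam)

beta_true = np.array([1.0, 5.0])
y_star = E @ X @ beta_true + 0.1 * np.sin(np.arange(M))  # deterministic perturbation
b_ll = beta_ll([X], [E], [y_star])
```

Using `np.linalg.lstsq` on the stacked matrix is numerically equivalent to solving the normal equations written on the slide.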

Slide 22: Efficient linear estimation using linked data

Regression errors under the corrected model are not homoskedastic. Their behaviour varies from block to block, reflecting the impact of different amounts of linkage error in different blocks. With $f_q = X_q\beta$,

$$\begin{aligned}
\mathrm{Var}_X(y_q^*) &= E_X\{A_q \mathrm{Var}_X(y_q) A_q^T\} + \mathrm{Var}_X\{A_q E_X(y_q)\} \\
&= E_X\{A_q (\sigma^2 I_q) A_q^T\} + \mathrm{Var}_X(A_q X_q \beta) \\
&= \sigma^2 E_X(A_q A_q^T) + \mathrm{Var}_X(A_q f_q) \\
&= \sigma^2 I_q + V_q = \Sigma_q
\end{aligned}$$
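The step that replaces $\sigma^2 E_X(A_q A_q^T)$ by $\sigma^2 I_q$ uses only the fact that $A_q$ is a permutation matrix. Writing $a_{ik}$ for the entries of $A_q$, a short worked check is:

$$(A_q A_q^T)_{ij} = \sum_k a_{ik} a_{jk} =
\begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
\quad\Rightarrow\quad A_q A_q^T = I_q
\quad\Rightarrow\quad E_X(A_q A_q^T) = I_q$$

The off-diagonal terms vanish because each column of $A_q$ contains a single one, so two different rows cannot both have a one in the same column.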

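Slide 23: Best linear unbiased estimator

$$\hat\beta_{BL} = \Big(\sum_q H_q^T \Sigma_q^{-1} H_q\Big)^{-1} \Big(\sum_q H_q^T \Sigma_q^{-1} y_q^*\Big) = \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} y_q^*\Big)$$

- depends on $\Sigma_q$, and hence on $\sigma^2$ and $\beta$
- substitute $\hat\sigma^2$ for $\sigma^2$ and $\hat V_q$ for $V_q$, then iterate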
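The generalised least squares step can be sketched as follows, assuming NumPy and assuming the caller supplies a symmetric positive definite $\Sigma_q$ for each block. How $\hat V_q$ is formed is not specified on the slide, so the diagonal `Sigma` used here is purely a hypothetical stand-in, as is the function name `beta_bl`.

```python
import numpy as np

def beta_bl(X_blocks, E_blocks, Sigma_blocks, y_star_blocks):
    """GLS on H_q = E_q X_q with block covariance Sigma_q (assumed symmetric)."""
    lhs, rhs = 0.0, 0.0
    for X, E, S, y in zip(X_blocks, E_blocks, Sigma_blocks, y_star_blocks):
        H = E @ X
        SinvH = np.linalg.solve(S, H)   # Sigma_q^{-1} H_q without forming the inverse
        lhs = lhs + H.T @ SinvH         # H_q' Sigma_q^{-1} H_q
        rhs = rhs + SinvH.T @ y         # H_q' Sigma_q^{-1} y*_q (uses symmetry of Sigma_q)
    return np.linalg.solve(lhs, rhs)

# Hypothetical single block
M, lam = 40, 0.9
x = np.linspace(0.0, 1.0, M)
X = np.column_stack([np.ones(M), x])
E = np.full((M, M), (1.0 - lam) / (M - 1))
np.fill_diagonal(E, lam)
Sigma = np.diag(1.0 + x)                # stand-in for sigma^2 I_q + V_q

beta_true = np.array([1.0, 5.0])
y_star = E @ X @ beta_true
b_bl = beta_bl([X], [E], [Sigma], [y_star])
```

In practice this step would sit inside the iteration described on the slide, with $\Sigma_q$ re-estimated from the current $\hat\sigma^2$ and $\hat\beta$ before each pass.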

Slide 24: Estimating the finite population regression parameter

Kovacevic (2008): if we knew $y_q$, our best estimate of $\beta$ would be its OLS estimate

$$B = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T y_q\Big)$$

Look for an estimator based on the linked data that is unbiased for $B$ given the correctly linked population values:

$$E_{YX}(\hat B) = B$$

Note: none of $\hat\beta_{SW}$, $\hat\beta_{LL}$, $\hat\beta_{BL}$ have this finite population property.

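Slide 25

Consider the class of estimators that can be written in the form

$$\hat B = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T K_q y_q^*\Big)$$

With $K_q = E_q^{-1}$,

$$E_{YX}(\hat B) = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T K_q E_q y_q\Big) = B$$

This leads to

$$\hat\beta_{MK} = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^{-1} y_q^*\Big)$$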
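A sketch of this estimator, assuming NumPy and an invertible $E_q$ (for the exchangeable model this requires $\lambda_q \neq \gamma_q$); `beta_mk` and the data are hypothetical. The check sets the linked response to its conditional expectation $E_q y_q$, in which case the estimator reproduces the finite population OLS value $B$.

```python
import numpy as np

def beta_mk(X_blocks, E_blocks, y_star_blocks):
    """(sum_q X_q'X_q)^{-1} sum_q X_q' E_q^{-1} y*_q, using a linear solve for E_q^{-1} y*_q."""
    XtX = sum(X.T @ X for X in X_blocks)
    rhs = sum(X.T @ np.linalg.solve(E, y)
              for X, E, y in zip(X_blocks, E_blocks, y_star_blocks))
    return np.linalg.solve(XtX, rhs)

# Hypothetical single block
M, lam = 40, 0.75
x = np.linspace(0.0, 1.0, M)
X = np.column_stack([np.ones(M), x])
E = np.full((M, M), (1.0 - lam) / (M - 1))
np.fill_diagonal(E, lam)

y_true = 1.0 + 5.0 * x + np.cos(np.arange(M))        # hypothetical correctly linked values
B = np.linalg.lstsq(X, y_true, rcond=None)[0]        # finite population OLS target
b_mk = beta_mk([X], [E], [E @ y_true])               # linked response at its expectation
```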

Slide 26: Some model-based simulation results

- Three blocks, $M_1 = 1500$, $M_2 = 300$ and $M_3 = 200$, with independent exchangeable linkage errors in each block
- Two scenarios:
  - $\lambda_q$ correctly specified ($\lambda_1 = 1.0$, $\lambda_2 = 0.95$, $\lambda_3 = 0.75$)
  - $\lambda_q$ estimated by $\hat\lambda_q = \min\{m_q^{-1}(m_q - 0.5),\ \max(M_q^{-1}, l_q)\}$, with $l_q$ equal to the proportion of correct links in a random sample of $m_q = 20$ linked records in each of blocks 2 and 3
- $y_i = 1 + 5x_i + e_i$, with $x_i \sim U[0,1]$ and $e_i \sim N(0,1)$
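The truncated estimator of $\lambda_q$ can be sketched as below. This assumes that $l_q$ enters the formula as the sample proportion of correct links (the bounds $M_q^{-1}$ and $(m_q - 0.5)/m_q$ are both probabilities), so the function takes the audited count and divides by $m_q$; the function name is hypothetical.

```python
import math

def estimate_lambda(n_correct, m_q, M_q):
    """lambda_hat = min{(m_q - 0.5)/m_q, max(1/M_q, l_q)} with l_q = n_correct / m_q.

    The lower bound 1/M_q is the chance of a correct link under purely random
    linkage within the block; the upper bound keeps the estimate below 1.
    """
    l_q = n_correct / m_q
    return min((m_q - 0.5) / m_q, max(1.0 / M_q, l_q))

# Audit samples of m_q = 20 records, as in the simulation set-up
lam_all_correct = estimate_lambda(20, 20, 300)   # capped at 19.5 / 20
lam_typical = estimate_lambda(15, 20, 200)       # plain sample proportion
lam_none_correct = estimate_lambda(0, 20, 200)   # floored at 1 / 200
```

The truncation keeps $\hat\lambda_q$ strictly inside $(0, 1)$ even when every audited link, or none of them, is correct.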

Slide 27

[Table: relative bias, relative RMSE and coverage for the intercept and slope, for estimators TR, ST, SW, MK, LL and BL, under Scenario 1 (linkage probabilities correctly specified) and Scenario 2 (linkage probabilities estimated from audit sample). The numerical entries did not survive transcription.]

Slide 28: Estimation errors, slope

[Figure not reproduced in the transcription.]

Slide 29: Comments

- When $\lambda_q$ is known (or estimated unbiasedly), all adjusted estimators of $\beta$ are unbiased, with BL (EBLUE) the most efficient and MK (finite population unbiased) the least efficient.
- Estimated variances for the adjusted estimators of $\beta$ include an extra component in the filling of the sandwich estimator. This estimates the additional contribution to variance when $\lambda_q$ is estimated. Without it, CI coverage is not good!

Slide 30: Audit sample issues

- Do smaller audit samples increase bias? There is some evidence of small bias (1-2.5%) for the SW and MK estimators only when the clerical sample is as small as 10; LL and BL are fine. 95% CIs are solid.
- Do smaller audit samples substantially increase the variance? Somewhat, but not substantially (more in a bit).

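Slide 31: What is the impact of the size ($m$) of the clerical sample?

Increase in variance relative to true OLS:

- Known $\lambda_q$: 7-11%
- Estimated $\lambda_q$, for three clerical sample sizes $m_q$ (one of them $m_q = 25$; the other sizes and some percentages did not survive transcription). The surviving entries are "(33% for MK)" and "15-30% (>60% MK)".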

Slide 32: How low can we go?

10 records out of 200 is a 5% sample and, in the real world, nobody wants to undertake a 5% sample of an administrative database.

So for a population size of 10,000 (blocks of 5000, 3000, 2000) and an audit sample of just 10 in each block:

- Estimators still look good in terms of bias (especially LL and BL)
- Variances are 35-45% higher than true OLS (>100% for MK)
- 95% CIs are too wide (97-100% coverage)

Slide 33: Applying in practice: the LinkReg SAS macro

- Code applying the theory was initially developed in R.
- In practice, analysis of probabilistically linked data will involve hundreds of thousands, even millions, of records.
- A means of efficiently fitting linear models to linked datasets of this magnitude is necessary for Statistics New Zealand to take advantage of the theory developed in the project.

Slide 34: The details

`%LinkReg (INDATA=dataset, Y=y-variable, X=x-variable, BLOCK=block-variable, LAMBDA=λ-variable, ESTLAMBDA=0/1, MQSAMP=clerical-sample-size-variable, COVEST=0/1, OUTLIB=output-library)`

A linear model with 10 X variables and a million observations takes less than a minute to fit.

Slide 35: What You Get

Printed output:

- Model specifications
- Three experimental R-square values
- Coefficients, SEs, Z-values and p(z)
- Covariance matrix of the estimates, by request

Output datasets:

- `_linkbeta`, containing coefficients and SEs
- `_covest_??`, containing the covariance matrix

Slide 36: Logistic regression

$Y$ is binary with

$$E(Y \mid X = x) = \Pr(Y = 1 \mid X = x) = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)}$$

Put

$$f_q(\beta) = \{E(y_i \mid x_i);\ i \in q\} = \left\{\frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)};\ i \in q\right\}$$

Then, given $y_q$, we usually estimate $\beta$ as the solution to the ML estimating equation

$$\sum_q X_q^T \{y_q - f_q(\beta)\} = 0$$

Slide 37: Estimating functions with linked data

Unbiased estimating function given correctly linked data:

$$H(\theta) = \sum_{i=1}^{N} G_i(\theta)\{y_i - f_i(\theta)\} = \sum_q G_q(\theta)\{y_q - f_q(\theta)\}$$

When used with probability-linked data, this becomes

$$H^*(\theta) = \sum_q G_q(\theta)\{y_q^* - f_q(\theta)\}$$

Slide 38

The estimating function is no longer unbiased:

$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0)(E_q - I_q) f_q(\theta_0) \neq 0$$

A bias-corrected estimating function:

$$H_{adj}(\theta) = H^*(\theta) - \sum_q G_q(\theta)(E_q - I_q) f_q(\theta) = \sum_q G_q(\theta)\{y_q^* - E_q f_q(\theta)\}$$
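The bias expression follows in two steps from the non-informative linkage assumption. Writing $y_q^* = A_q y_q$ for the linked responses and $E_q = E_X(A_q)$, a short worked derivation is:

$$\begin{aligned}
E_X(y_q^*) &= E_X(A_q)\,E_X(y_q) = E_q f_q(\theta_0) \\
E_X\{H^*(\theta_0)\} &= \sum_q G_q(\theta_0)\{E_X(y_q^*) - f_q(\theta_0)\}
= \sum_q G_q(\theta_0)\{E_q f_q(\theta_0) - f_q(\theta_0)\} \\
&= \sum_q G_q(\theta_0)(E_q - I_q) f_q(\theta_0)
\end{aligned}$$

Subtracting this term at a general $\theta$ gives $\sum_q G_q(\theta)\{y_q^* - f_q(\theta) - E_q f_q(\theta) + f_q(\theta)\} = \sum_q G_q(\theta)\{y_q^* - E_q f_q(\theta)\}$, which has expectation zero at $\theta_0$.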

Slide 39: Application to logistic regression

Estimating equation:

$$\sum_q G_q(\beta)\{y_q^* - E_q f_q(\beta)\} = 0$$

Choosing $G_q(\beta)$:

- M (defines the MLE when data are correctly linked): $G_q(\beta) = X_q^T$
- A (leads to the LL estimator in the linear model): $G_q(\beta) = X_q^T E_q^T$
- C (second-order optimal, leads to the BL estimator in the linear model)
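As an illustration of how such an estimating equation might be solved, the sketch below applies Newton's method to choice M, $G_q(\beta) = X_q^T$. It assumes NumPy and a known exchangeable $E_q$; the function names, starting value and stopping rule are hypothetical choices, and this is not the LinkReg implementation. The check uses the expected linked response $E_q f_q(\beta_0)$, for which the root is $\beta_0$.

```python
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def solve_adjusted_logistic(X_blocks, E_blocks, y_star_blocks, n_iter=50, tol=1e-10):
    """Newton iterations for sum_q X_q' {y*_q - E_q f_q(beta)} = 0."""
    p = X_blocks[0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        score = np.zeros(p)
        jac = np.zeros((p, p))
        for X, E, y in zip(X_blocks, E_blocks, y_star_blocks):
            f = expit(X @ beta)
            score += X.T @ (y - E @ f)
            # Minus the derivative of the score: X' E W X with W = diag{f (1 - f)}
            jac += X.T @ E @ (X * (f * (1.0 - f))[:, None])
        step = np.linalg.solve(jac, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical single block
M, lam = 300, 0.85
X = np.column_stack([np.ones(M), np.linspace(0.0, 1.0, M)])
E = np.full((M, M), (1.0 - lam) / (M - 1))
np.fill_diagonal(E, lam)

beta0 = np.array([0.5, -1.0])
y_star = E @ expit(X @ beta0)          # expected linked response, no binary noise
beta_hat = solve_adjusted_logistic([X], [E], [y_star])
```

Choice A would replace `X.T` by `(E @ X).T` in both the score and the Jacobian; with binary linked responses in place of their expectation, the same iteration applies but the root is only an estimate of $\beta_0$.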

Slide 40: Some simulation results

- Same set-up as in the previous simulation, i.e. three blocks, $M_1 = 1500$, $M_2 = 300$ and $M_3 = 200$, with independent exchangeable linkage errors in each block
- Same $\lambda_q$s as before ($\lambda_1 = 1.0$, $\lambda_2 = 0.95$, $\lambda_3 = 0.75$)
- Results provided for two cases: $\lambda_q$ known, and $\lambda_q$ estimated by $\hat\lambda_q = \min\{m_q^{-1}(m_q - 0.5),\ \max(M_q^{-1}, l_q)\}$
- $\mathrm{logit}\{E(y_i \mid x_i)\} = 1 - 5x_i$, with $x_i \sim \mathrm{Uniform}[0,1]$
- $y_i = I\{U_i \le E(y_i \mid x_i)\}$, with $U_i \sim \mathrm{Uniform}[0,1]$

Slide 41: Focus on estimation of slope parameter

[Table: relative bias, relative RMSE and coverage for estimators TR, ST, M, A and C, under Scenario 1 (linkage probabilities correctly specified) and Scenario 2 (linkage probabilities estimated). The numerical entries did not survive transcription.]

Slide 42: Estimation errors, slope

[Figure not reproduced in the transcription.]

Slide 43: Comments

- All three adjusted estimators are unbiased if $\lambda_q$ is known or estimated unbiasedly, with C the most efficient performer and M the least efficient.
- Differences in efficiency are not as pronounced as in the linear case, most probably because linkage error does not automatically result in measurement error when $Y$ is binary.
- Again, estimated variances include an extra component to allow for estimation of $\lambda_q$. Without this component, coverage is reduced, but the effect is not as pronounced as in the linear case.

Slide 44: Sample to register linkage

- First link the Y- and X-registers (at least conceptually), then take a sample from the X-register (equivalent to sampling from the linked population register), so data = $(y_{sq}^*, x_{sq})$ with sample weights $w_{sq}$
- Sampling and linkage processes are independent
- May also have access to summary statistics from the unlinked registers, so data could also include $\bar y_q$, $\bar x_q$
- Weighted (pseudo-likelihood) methodology investigated (no population summary data)

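Slide 45

$$(X, Y) = \begin{pmatrix}
x_{1R} & y_{1R} \\ x_{2R} & y_{2R} \\ x_{3R} & y_{3R} \\
x_{1S} & y_{1S} \\ x_{2S} & y_{2S} \\ x_{3S} & y_{3S}
\end{pmatrix}
= \begin{pmatrix} X_R & Y_R \\ X_S & Y_S \end{pmatrix}
\quad\text{but}\quad
(X_S, Y_S^*) = \begin{pmatrix}
x_{1S} & y_{3R} \\ x_{2S} & y_{3S} \\ x_{3S} & y_{2S}
\end{pmatrix}$$

$$Y^* = \begin{pmatrix} Y_R^* \\ Y_S^* \end{pmatrix} = AY
= \begin{pmatrix} A_{RR} & A_{RS} \\ A_{SR} & A_{SS} \end{pmatrix}
\begin{pmatrix} Y_R \\ Y_S \end{pmatrix}$$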

Slide 46: Nested linkage

One register is a subset of the other, so only one register can be completely linked:

1. X-register is a subset of the Y-register
2. Y-register is a subset of the X-register

- Sampling and linkage processes are not independent
- Sample to register linkage, where the sample is first selected and then linked to the register, is a special case of nested linkage

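Slide 47

$$X = X_B = \begin{pmatrix} x_{1B} \\ x_{2B} \\ x_{3B} \end{pmatrix},
\quad
Y = \begin{pmatrix} y_{1A} \\ y_{2A} \\ y_{3A} \\ y_{1B} \\ y_{2B} \\ y_{3B} \end{pmatrix}
= \begin{pmatrix} Y_A \\ Y_B \end{pmatrix}
\quad\text{but}\quad
(X_B, Y_B^*) = \begin{pmatrix}
x_{1B} & y_{3A} \\ x_{2B} & y_{3B} \\ x_{3B} & y_{2B}
\end{pmatrix}$$

$$Y^* = \begin{pmatrix} Y_A^* \\ Y_B^* \end{pmatrix} = AY
= \begin{pmatrix} A_{AA} & A_{AB} \\ A_{BA} & A_{BB} \end{pmatrix}
\begin{pmatrix} Y_A \\ Y_B \end{pmatrix}$$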

Slide 48: ABS Census Data Enhancement project

- Statistical Longitudinal Census Dataset (SLCD)
- 5% sample of 2006 Census person records linked to their corresponding 2011 Census records without use of names and addresses
- Substantial opportunity for longitudinal analysis at a relatively small geographical level, while maintaining the ABS's strong commitment to the confidentiality of its Census respondents

Slide 49

- Problem 1: Will the linked records really be a random sample? Who will SLCD data represent in 2011?
- Problem 2: How accurate will the linkage be? Linkage errors are a particular type of measurement error and will induce biases in analysis.

Slide 50: Simulated SLCD

- Census Dress Rehearsal (2005) linked to 2006 Census
- "Gold" linking: name, address, mesh block + census data items (treated as truth)
- "Bronze" linking: mesh block + census data items
- Regression models fitted to gold-linked and bronze-linked data give different results

Slide 51

Can the methodology developed so far significantly reduce this discrepancy?

[Table: naive deviance and adjusted deviance for logistic models for Migration, Employment and Student. The numerical entries did not survive transcription.]

Why?

Slide 52: Main components of Chi-square error for bronze migration model

[Table: naive fit and adjusted fit by source: sample bias, incorrect links (B-B), incorrect links (B-A). The numerical entries did not survive transcription.]

The real problem is that the bronze-linked sample and the gold-linked sample are not representative of the same population.

Slide 53: Ongoing related research

- Extension to multilevel models: ARC research project on linked longitudinal data
- Linking surveys to registers: Canadian health survey database (with Milorad Kovacevic)
- Non-exchangeable models for linkage errors?
- Informative linkage (after blocking)?
- Errors in blocking variables?
- Linkage across multiple databases (longitudinal linkage)?

Slide 54: References

- Fellegi, I.P. and Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64.
- Lahiri, P. and Larsen, M.D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100.
- Neter, J., Maynes, E.S. and Ramanathan, R. (1965). The effect of mismatching on the measurement of response error. Journal of the American Statistical Association, 60.
- Scheuren, F. and Winkler, W.E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19.


### Estimation and Inference in Cointegration Models Economics 582

Estimation and Inference in Cointegration Models Economics 582 Eric Zivot May 17, 2012 Tests for Cointegration Let the ( 1) vector Y be (1). Recall, Y is cointegrated with 0 cointegrating vectors if there

### VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

### Life Table Analysis using Weighted Survey Data

Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using

### Generalized Linear Models

Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

### MATLAB and Big Data: Illustrative Example

MATLAB and Big Data: Illustrative Example Rick Mansfield Cornell University August 19, 2014 Goals Use a concrete example from my research to: Demonstrate the value of vectorization Introduce key commands/functions

### Introduction to Longitudinal Data Analysis

Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction

### Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

### From the help desk: Swamy s random-coefficients model

The Stata Journal (2003) 3, Number 3, pp. 302 308 From the help desk: Swamy s random-coefficients model Brian P. Poi Stata Corporation Abstract. This article discusses the Swamy (1970) random-coefficients

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Models for Longitudinal and Clustered Data

Models for Longitudinal and Clustered Data Germán Rodríguez December 9, 2008, revised December 6, 2012 1 Introduction The most important assumption we have made in this course is that the observations

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### AN INTRODUCTION TO ERROR CORRECTING CODES Part 1

AN INTRODUCTION TO ERROR CORRECTING CODES Part 1 Jack Keil Wolf ECE 154C Spring 2008 Noisy Communications Noise in a communications channel can cause errors in the transmission of binary digits. Transmit:

### We shall turn our attention to solving linear systems of equations. Ax = b

59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

### DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions. Nina Kirschbaum

Dissertation am Fachbereich Statistik der Universität Dortmund Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions Nina Kirschbaum Erstgutachter: Prof. Dr. W.

### Keep It Simple: Easy Ways To Estimate Choice Models For Single Consumers

Keep It Simple: Easy Ways To Estimate Choice Models For Single Consumers Christine Ebling, University of Technology Sydney, christine.ebling@uts.edu.au Bart Frischknecht, University of Technology Sydney,

### L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

### Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

### IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is R-squared? R-squared Published in Agricultural Economics 0.45 Best article of the

### Testing for Lack of Fit

Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit

### Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

### Marketing Mix Modelling and Big Data P. M Cain

1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

### Nominal and ordinal logistic regression

Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

### Applied Multivariate Analysis - Big data analytics

Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of

### Approaches for Analyzing Survey Data: a Discussion

Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata

### An extension of the factoring likelihood approach for non-monotone missing data

An extension of the factoring likelihood approach for non-monotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### To give it a definition, an implicit function of x and y is simply any relationship that takes the form:

2 Implicit function theorems and applications 21 Implicit functions The implicit function theorem is one of the most useful single tools you ll meet this year After a while, it will be second nature to

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### 1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand

### 2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

### Factors affecting online sales

Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes