Regression analysis of probability-linked data


1 Regression analysis of probability-linked data
Ray Chambers, University of Wollongong
James Chipperfield, Australian Bureau of Statistics
Walter Davis, Statistics New Zealand

2 Overview
1. Probability linkage overview
2. A statistical framework for linkage errors
3. Linear regression with linked data
4. Extension to estimating equations / logistic regression
5. Linking samples to registers
6. Linking nested registers
7. Application: ABS Census Data Enhancement project
8. Future research

3 Research questions
What impact do errors in probability linking of records from different sources have on subsequent statistical analysis of the linked data?
How should standard statistical methods, in particular regression modelling, be modified in order to minimise this impact?

4 Background
Population of N units, indexed by $i = 1, \ldots, N$
Response = scalar random variable Y
Explanators = vector random variable X
Linear regression model: $E_X(Y) = X^T\beta$, $\mathrm{Var}_X(Y) = \sigma^2$
Logistic regression model: $\Pr_X(Y = 1) = \dfrac{\exp(X^T\beta)}{1 + \exp(X^T\beta)}$

5 Estimation of β is straightforward given a random sample of values of (Y, X) for units from this population.
We do not have such a sample. Instead, data are available from two linked population registers:
Y-register contains values of Y
X-register contains values of X
There is no unique identifier to link records from the two registers, so a probabilistic record linkage method is used.

6 Probabilistic record linkage
Fellegi & Sunter (1969): "Record Linkage is a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events..."
A set of observed variables present in both registers (matching variables) is used to link records in order to maximise the probability that they refer to the same unit.

7 Applications of record linkage
Merging of large databases (e.g. before data mining)
Removing duplicates in registries
Generating longitudinal records from cross-sectional data
Combining data sources (e.g. survey and register)
A very active research area, with a focus on:
Developing efficient and accurate matching algorithms
Alternatives to the statistical theory of Fellegi-Sunter
Effects of linkage error on statistical analysis

8 An early example
Neter et al. (1965): focus on confounding of linkage errors and response errors, with application to the Netherlands Validation Study.
[Figure: observed relationship between reported bank balance and bank balance from linked data]

9 Can the observed regression-to-the-mean relationship between reported and linked bank balances be explained by linkage errors rather than response errors?
Using a simple linkage error model (essentially the same as the one that we use later), they show that Pr(correct link) has to be less than 80% to explain this behaviour, from which they conclude that response errors underpin the discrepancies.

10 Analysis framework
Both registers are complete, with no duplications, so each contains N records.
All records on both registers are linked.
Can define a categorical variable Z on both registers that is
measured without error on both
takes Q distinct values $1, 2, \ldots, Q$
$M_q$ records in each register have $Z = q$ (so $N = \sum_q M_q$)
Z is referred to as a blocking variable in what follows.

11 Implications
Block structure: linkage errors can only occur within blocks. Record A on the Y-register with $Z = z_A$ and record B on the X-register with $Z = z_B$ must correspond to different units if $z_A \neq z_B$, but $z_A = z_B$ does not mean that A and B are the same unit.
Complete linkage: every record in block q of the Y-register is linked with a unique record in the same block q of the X-register.

12 Notation
We index the records on the linked data set in exactly the same way as we index the X-register.
For each block q we have $M_q$ linked data pairs $(y_i^*, x_i)$, where $y_i^*$ denotes the Y-value from the record on the Y-register in block q that is linked to the record on the X-register in block q with value $x_i$.
$y_q^*$ = vector defined by the values $(y_i^*; i \in q)$
$X_q$ = matrix with rows defined by the values $(x_i; i \in q)$
$y_q$ = unknown vector corresponding to the true Y values $(y_i; i \in q)$ associated with $X_q$

13 A model for linkage error
$$y_q^* = A_q y_q$$
$A_q$ is an unknown random permutation matrix of order $M_q$, i.e. entries of $A_q$ are either zero or one, with a value of one occurring just once in each row and column.
$E_X(A_q) = E_q$
Non-informative linkage, i.e. records are mis-matched at random given X:
$$E_X(A_q y_q) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta$$

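14 Example
$y = (y_1, y_2, y_3, y_4, y_5)^T$ while $y^* = (y_3, y_2, y_5, y_4, y_1)^T$, so that
$$A = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}$$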
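The permutation example on this slide can be checked numerically. The following is a minimal sketch, assuming NumPy; the numeric values assigned to the true responses are purely illustrative and the variable names (`perm`, `y_star`) are not from the slides.

```python
import numpy as np

# Illustrative true Y values y_1..y_5 (not from the slides).
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# perm[i] is the (0-based) index of the true record linked to position i,
# reproducing the slide's linked ordering (y3, y2, y5, y4, y1).
perm = np.array([2, 1, 4, 3, 0])

# Permutation matrix A: row i carries a single 1 in column perm[i].
A = np.eye(5, dtype=int)[perm]

# Linked (possibly mismatched) responses.
y_star = A @ y

# Correct links sit on the diagonal of A, so the trace counts them.
n_correct = int(np.trace(A))
```

Here records 2 and 4 are correctly linked, which is why the trace of `A` equals 2.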

15 An exchangeable linkage error model
Since the linkage process maximises the probability that a declared link is a true link, correct linkages should be more likely than incorrect linkages...
Pr(correct linkage in block q) = $\lambda_q$
Pr(wrong linkage in block q) = $\gamma_q$
$$E_q = E_X(A_q) = \begin{pmatrix} \lambda_q & \gamma_q & \cdots & \gamma_q \\ \gamma_q & \lambda_q & \cdots & \gamma_q \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_q & \gamma_q & \cdots & \lambda_q \end{pmatrix}$$

16 Implication
Complete linkage within a block, so $A_q 1_q = A_q^T 1_q = 1_q$, because each row and each column of $A_q$ contains exactly one entry equal to one.

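17 It immediately follows that $E_q 1_q = E_q^T 1_q = 1_q$, i.e.
$$\begin{pmatrix} \lambda_q & \gamma_q & \cdots & \gamma_q \\ \gamma_q & \lambda_q & \cdots & \gamma_q \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_q & \gamma_q & \cdots & \lambda_q \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
or equivalently
$$\lambda_q + (M_q - 1)\gamma_q = 1 \;\Rightarrow\; \gamma_q = \frac{1 - \lambda_q}{M_q - 1}$$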

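18 For example, when $M = 50$ and $\lambda = 0.9$, $\gamma = 0.1/49 \approx 0.002$ and
$$E = \begin{pmatrix} 0.9 & 0.1/49 & \cdots & 0.1/49 \\ 0.1/49 & 0.9 & \cdots & 0.1/49 \\ \vdots & \vdots & \ddots & \vdots \\ 0.1/49 & 0.1/49 & \cdots & 0.9 \end{pmatrix}$$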
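The exchangeable matrix for this example can be built directly from the block size and the correct-link probability. This is a minimal sketch, assuming NumPy; the helper name `exchangeable_E` is hypothetical and not part of the slides or of the LinkReg macro.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Expected linkage matrix for one block under the exchangeable model:
    lam on the diagonal and gamma = (1 - lam) / (M - 1) everywhere else."""
    if M == 1:
        # A block with a single record can only be linked to itself.
        return np.ones((1, 1))
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

# The slide's example: M = 50, lambda = 0.9.
E = exchangeable_E(50, 0.9)
row_sums = E.sum(axis=1)   # each row should sum to 1
off_diagonal = E[0, 1]     # should equal 0.1 / 49
```

The row sums equal one by construction, which is the constraint $\lambda_q + (M_q - 1)\gamma_q = 1$ in matrix form.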

19 Linear regression
Standard approach: estimates β as if $y^*$ and $y$ are identical, i.e. as if A is the identity matrix:
$$\hat\beta^* = (X^T X)^{-1} X^T y^* = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T A_q y_q$$
Biased if linkage is not perfect...
$$E_X(\hat\beta^*) = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T E_q X_q \beta = D\beta \neq \beta$$
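To see the size of this bias in a simple case, the matrix D can be computed for a single block under the exchangeable model. The sketch below assumes NumPy, one block of 200 records with a correct-link probability of 0.75, and the model intercept 1 and slope 5 used in the simulations later in the talk; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# One block with exchangeable linkage errors (illustrative sizes).
M, lam = 200, 0.75
gamma = (1.0 - lam) / (M - 1)
E = gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

# Design matrix with an intercept and one uniform covariate.
x = rng.uniform(0.0, 1.0, M)
X = np.column_stack([np.ones(M), x])
beta = np.array([1.0, 5.0])

# D = (X'X)^{-1} X'EX, so the naive estimator has expectation D @ beta.
D = np.linalg.solve(X.T @ X, X.T @ E @ X)
expected_naive = D @ beta
```

With an intercept in the model, the expected naive slope works out to $(\lambda - \gamma)$ times the true slope, so the slope is attenuated towards zero.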

20 Bias-correcting the naive OLS estimator
Motivated by the approach of Scheuren and Winkler (1993). If E is known, D is also known. Provided $D^{-1}$ exists, an unbiased estimator of β is
$$\hat\beta_{SW} = D^{-1}\hat\beta^* = \Big\{\Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T E_q X_q\Big\}^{-1} \hat\beta^*$$
Assuming that $\sum_q X_q^T E_q X_q$ is of full rank,
$$\hat\beta_{SW} = \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T y_q^*\Big)$$
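The second form of the bias-corrected estimator is a block-wise accumulation followed by one linear solve. This is a minimal sketch, assuming NumPy and known linkage matrices; the function name `sw_estimator` and the block sizes are hypothetical. To keep the check deterministic, the linked responses are set to their expectation $E_q X_q \beta$, in which case the estimator returns β exactly.

```python
import numpy as np

def exchangeable_E(M, lam):
    # lam on the diagonal, (1 - lam) / (M - 1) off the diagonal.
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

def sw_estimator(blocks):
    """blocks: list of (X_q, y_star_q, E_q) tuples, one per block.
    Returns (sum_q X_q' E_q X_q)^{-1} (sum_q X_q' y_star_q)."""
    p = blocks[0][0].shape[1]
    XEX = np.zeros((p, p))
    Xy = np.zeros(p)
    for X_q, y_star_q, E_q in blocks:
        XEX += X_q.T @ E_q @ X_q
        Xy += X_q.T @ y_star_q
    return np.linalg.solve(XEX, Xy)

rng = np.random.default_rng(1)
beta = np.array([1.0, 5.0])
blocks = []
for M, lam in [(150, 1.0), (30, 0.95), (20, 0.75)]:
    X_q = np.column_stack([np.ones(M), rng.uniform(0.0, 1.0, M)])
    E_q = exchangeable_E(M, lam)
    # Expected linked response, used here instead of a random permutation.
    blocks.append((X_q, E_q @ X_q @ beta, E_q))

beta_sw = sw_estimator(blocks)
```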

21 An unbiased OLS estimator
Lahiri and Larsen (2005):
$$E_X(y_q^*) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta = H_q \beta$$
The OLS estimator based on this corrected model matrix is unbiased...
$$\hat\beta_{LL} = \Big(\sum_q H_q^T H_q\Big)^{-1} \Big(\sum_q H_q^T y_q^*\Big) = \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T y_q^*\Big)$$
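The Lahiri-Larsen form is ordinary least squares with the design matrix in each block replaced by $H_q = E_q X_q$. The following sketch assumes NumPy and known $E_q$; the function name `ll_estimator` and the block sizes are illustrative. As a deterministic check, the linked responses are again set to their expectation $H_q \beta$.

```python
import numpy as np

def exchangeable_E(M, lam):
    # lam on the diagonal, (1 - lam) / (M - 1) off the diagonal.
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

def ll_estimator(blocks):
    """blocks: list of (X_q, y_star_q, E_q) tuples.
    OLS of y_star on the corrected model matrix H_q = E_q X_q."""
    p = blocks[0][0].shape[1]
    HH = np.zeros((p, p))
    Hy = np.zeros(p)
    for X_q, y_star_q, E_q in blocks:
        H_q = E_q @ X_q
        HH += H_q.T @ H_q
        Hy += H_q.T @ y_star_q
    return np.linalg.solve(HH, Hy)

rng = np.random.default_rng(2)
beta = np.array([1.0, 5.0])
blocks = []
for M, lam in [(150, 1.0), (30, 0.95), (20, 0.75)]:
    X_q = np.column_stack([np.ones(M), rng.uniform(0.0, 1.0, M)])
    E_q = exchangeable_E(M, lam)
    blocks.append((X_q, E_q @ X_q @ beta, E_q))

beta_ll = ll_estimator(blocks)
```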

22 Efficient linear estimation using linked data
Regression errors under the corrected model are not homoskedastic. Their behaviour varies from block to block, reflecting the impact of different amounts of linkage error in different blocks. Writing $f_q = X_q\beta$,
$$\begin{aligned} \mathrm{Var}_X(y_q^*) &= E_X\{A_q \mathrm{Var}_X(y_q) A_q^T\} + \mathrm{Var}_X\{A_q E_X(y_q)\} \\ &= E_X\{A_q(\sigma^2 I_q)A_q^T\} + \mathrm{Var}_X(A_q X_q \beta) \\ &= \sigma^2 E_X(A_q A_q^T) + \mathrm{Var}_X(A_q f_q) \\ &= \sigma^2 I_q + V_q = \Sigma_q \end{aligned}$$

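23 Best linear unbiased estimator
$$\hat\beta_{BL} = \Big(\sum_q H_q^T \Sigma_q^{-1} H_q\Big)^{-1} \Big(\sum_q H_q^T \Sigma_q^{-1} y_q^*\Big) = \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} y_q^*\Big)$$
depends on $\Sigma_q$, and hence on $\sigma^2$ and β
substitute $\hat\sigma^2$ for $\sigma^2$ and $\hat V_q$ for $V_q$, then iterate...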

24 Estimating the finite population regression parameter
Kovacevic (2008): if we knew $y_q$, our best estimate of β would be its OLS estimate
$$B = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T y_q\Big)$$
Look for an estimator based on the linked data that is unbiased for B given the correctly linked population values...
$$E_{YX}(\hat B) = B$$
Note: none of $\hat\beta_{SW}$, $\hat\beta_{LL}$, $\hat\beta_{BL}$ have this finite population property.

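25 Consider the class of estimators that can be written in the form
$$\hat B = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T K_q y_q^*\Big)$$
$$K_q = E_q^{-1} \;\Rightarrow\; E_{YX}(\hat B) = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T K_q E_q y_q\Big) = B$$
Leads to
$$\hat\beta_{MK} = \Big(\sum_q X_q^T X_q\Big)^{-1} \Big(\sum_q X_q^T E_q^{-1} y_q^*\Big)$$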
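The finite-population property can be illustrated numerically: if the linked responses are replaced by their conditional expectation $E_q y_q$, the MK form reproduces the OLS fit B to the correctly linked data. This is a minimal sketch, assuming NumPy and invertible $E_q$; the function name `mk_estimator` and the simulated data are illustrative only.

```python
import numpy as np

def exchangeable_E(M, lam):
    # lam on the diagonal, (1 - lam) / (M - 1) off the diagonal.
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

def mk_estimator(blocks):
    """blocks: list of (X_q, y_star_q, E_q) tuples.
    Returns (sum_q X_q' X_q)^{-1} (sum_q X_q' E_q^{-1} y_star_q)."""
    p = blocks[0][0].shape[1]
    XX = np.zeros((p, p))
    XKy = np.zeros(p)
    for X_q, y_star_q, E_q in blocks:
        XX += X_q.T @ X_q
        # Solve E_q z = y_star_q rather than forming the inverse explicitly.
        XKy += X_q.T @ np.linalg.solve(E_q, y_star_q)
    return np.linalg.solve(XX, XKy)

rng = np.random.default_rng(3)
blocks, X_all, y_all = [], [], []
for M, lam in [(150, 1.0), (30, 0.95), (20, 0.75)]:
    x = rng.uniform(0.0, 1.0, M)
    X_q = np.column_stack([np.ones(M), x])
    y_q = 1.0 + 5.0 * x + rng.normal(0.0, 1.0, M)   # true responses
    E_q = exchangeable_E(M, lam)
    blocks.append((X_q, E_q @ y_q, E_q))            # expected linked responses
    X_all.append(X_q)
    y_all.append(y_q)

X_full = np.vstack(X_all)
y_full = np.concatenate(y_all)
B = np.linalg.solve(X_full.T @ X_full, X_full.T @ y_full)  # OLS on true data
beta_mk = mk_estimator(blocks)
```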

26 Some model-based simulation results
Three blocks, $M_1 = 1500$, $M_2 = 300$ and $M_3 = 200$, with independent exchangeable linkage errors in each block.
Two scenarios:
o $\lambda_q$ correctly specified ($\lambda_1 = 1.0$, $\lambda_2 = 0.95$, $\lambda_3 = 0.75$)
o $\lambda_q$ estimated by $\hat\lambda_q = \min\{m_q^{-1}(m_q - 0.5),\ \max(M_q^{-1}, l_q)\}$, with $l_q$ the proportion of correct links (number correct divided by $m_q$) in a random sample of $m_q = 20$ linked records in each of blocks 2 and 3
$y_i = 1 + 5x_i + e_i$, with $x_i \sim U[0,1]$ and $e_i \sim N(0,1)$
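The truncated estimator of the correct-link probability can be written as a small function. This sketch reads $l_q$ as the proportion of correct links in the audit sample (the count divided by $m_q$), which is an assumption about the slide's notation; the function name `lambda_hat` is hypothetical.

```python
def lambda_hat(n_correct, m, M):
    """Estimate of the correct-link probability for one block.

    n_correct: correct links found in the audit sample
    m:         audit sample size
    M:         number of records in the block
    The estimate is kept at or above 1/M and at or below (m - 0.5)/m.
    """
    l = n_correct / m
    upper = (m - 0.5) / m
    lower = 1.0 / M
    return min(upper, max(lower, l))

all_correct = lambda_hat(20, 20, 300)   # capped at 19.5 / 20
typical = lambda_hat(15, 20, 200)       # 15 / 20
none_correct = lambda_hat(0, 20, 200)   # floored at 1 / 200
```

The upper bound keeps the estimate strictly below one when every audited link is correct, so some linkage error is always allowed for; the lower bound keeps it from reaching zero.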

27 [Table: relative bias, relative RMSE and coverage, each reported for intercept and slope, for estimators TR, ST, SW, MK, LL and BL. Scenario 1: linkage probabilities correctly specified. Scenario 2: linkage probabilities estimated from audit sample. Numerical entries were not preserved in this transcription.]

28 [Figure: estimation errors, slope]

29 Comments
When $\lambda_q$ is known (or estimated unbiasedly), all adjusted estimators of β are unbiased, with BL (EBLUE) the most efficient and MK (finite population unbiased) the least.
Estimated variances for the adjusted estimators of β include an extra component in the filling of the sandwich estimator. This estimates the additional contribution to variance when $\lambda_q$ is estimated. Without it, CI coverage is not good!

30 Audit sample issues
Do smaller audit samples increase bias? Some evidence of small bias (1-2.5%) for SW & MK estimators only when the clerical sample is as small as 10; LL and BL are fine. 95% CIs are solid.
Do smaller audit samples substantially increase the variance? Some, but not substantially (more in a bit).

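31 What is the impact of the size (m) of the clerical sample?
Increase in variance relative to true OLS (several table entries were not preserved in this transcription; surviving values are listed in their original order):
Known $\lambda_q$: 7-11%
Estimated $\lambda_q$, $m_q$ = ...: ...%
Estimated $\lambda_q$, $m_q$ = 25: ...
Estimated $\lambda_q$, $m_q$ = ...: ...% (33% for MK); 15-30% (>60% MK)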

32 How low can we go?
10 records out of 200 is a 5% sample and, in the real world, nobody wants to undertake a 5% sample of an administrative database...
So for a population split into blocks of 5000, 3000 and 2000, and an audit sample of just 10 in each block:
Estimators still look good in terms of bias (esp. LL and BL)
Variances are 35 to 45% higher than true OLS (>100% for MK)
95% CIs are too wide (97 to 100% coverage)

33 Applying in practice: the LinkReg SAS macro
Code applying the theory was initially developed in R.
In practice, analysis of probabilistically linked data will involve hundreds of thousands, even millions, of records...
A means of efficiently fitting linear models to linked datasets of this magnitude is necessary for Statistics New Zealand to take advantage of the theory developed in the project.

34 The details
%LinkReg (INDATA=dataset, Y=y-variable, X=x-variable, BLOCK=block-variable, LAMBDA=λ-variable, ESTLAMBDA=0/1, MQSAMP=clerical-sample-size-variable, COVEST=0/1, OUTLIB=output-library)
A linear model with 10 X variables and a million observations takes less than a minute to fit.

35 What you get
Printed output:
Model specifications
Three experimental R-square values
Coefficients, SEs, Z-values and p(z)
Covariance matrix of the estimates by request
Output datasets:
_linkbeta, containing coeffs and SEs
_covest_??, containing the cov matrix

36 Logistic regression
Y is binary with
$$E(Y \mid X = x) = \Pr(Y = 1 \mid X = x) = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)}$$
Put
$$f_q(\beta) = \{E(y_i \mid x_i);\ i \in q\} = \Big\{\frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)};\ i \in q\Big\}$$
Then, given $y_q$, we usually estimate β as the solution to the ML estimating equation
$$\sum_q X_q^T\{y_q - f_q(\beta)\} = 0$$

37 Estimating functions with linked data
Unbiased estimating function given correctly linked data:
$$H(\theta) = \sum_{i=1}^{N} G_i(\theta)\{y_i - f_i(\theta)\} = \sum_q G_q(\theta)\{y_q - f_q(\theta)\}$$
When used with probability-linked data, this becomes
$$H^*(\theta) = \sum_q G_q(\theta)\{y_q^* - f_q(\theta)\}$$

38 Estimating function is no longer unbiased...
$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0)(E_q - I_q)f_q(\theta_0) \neq 0$$
A bias-corrected estimating function:
$$H_{adj}(\theta) = H^*(\theta) - \sum_q G_q(\theta)(E_q - I_q)f_q(\theta) = \sum_q G_q(\theta)\{y_q^* - E_q f_q(\theta)\}$$

39 Application to logistic regression
Estimating equation:
$$\sum_q G_q(\beta)\{y_q^* - E_q f_q(\beta)\} = 0$$
Choosing $G_q(\beta)$:
M (defines MLE when data are correctly linked): $G_q(\beta) = X_q^T$
A (leads to LL estimator in linear model): $G_q(\beta) = X_q^T E_q^T$
C (second-order optimal, leads to BL estimator in linear model)
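Choice M can be solved with a few Newton steps, since the derivative of the adjusted estimating function with respect to β is $-\sum_q X_q^T E_q W_q X_q$ with $W_q = \mathrm{diag}\{f_q(1 - f_q)\}$. The sketch below assumes NumPy and known $E_q$; the function name `fit_adjusted_logistic`, the block sizes and the coefficient values are illustrative, not taken from the slides. For a deterministic check the linked responses are set to their expectation $E_q f_q(\beta)$, so the root of the equation is the generating β.

```python
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def exchangeable_E(M, lam):
    # lam on the diagonal, (1 - lam) / (M - 1) off the diagonal.
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

def fit_adjusted_logistic(blocks, n_iter=50, tol=1e-10):
    """Newton iterations for sum_q X_q' {y_star_q - E_q f_q(beta)} = 0.
    blocks: list of (X_q, y_star_q, E_q) tuples."""
    p = blocks[0][0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        H = np.zeros(p)        # adjusted estimating function
        J = np.zeros((p, p))   # minus its derivative with respect to beta
        for X_q, y_star_q, E_q in blocks:
            f_q = expit(X_q @ beta)
            H += X_q.T @ (y_star_q - E_q @ f_q)
            w_q = f_q * (1.0 - f_q)
            J += X_q.T @ E_q @ (w_q[:, None] * X_q)
        step = np.linalg.solve(J, H)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

beta_true = np.array([1.0, -2.0])   # illustrative coefficients
blocks = []
for M, lam in [(100, 1.0), (60, 0.8)]:
    X_q = np.column_stack([np.ones(M), np.linspace(0.0, 1.0, M)])
    E_q = exchangeable_E(M, lam)
    blocks.append((X_q, E_q @ expit(X_q @ beta_true), E_q))

beta_hat = fit_adjusted_logistic(blocks)
```

With real binary linked responses the same function applies unchanged; only the `y_star_q` entries differ.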

40 Some simulation results
Same set-up as in the previous simulation, i.e. three blocks, $M_1 = 1500$, $M_2 = 300$ and $M_3 = 200$, with independent exchangeable linkage errors in each block.
Same $\lambda_q$ values as before ($\lambda_1 = 1.0$, $\lambda_2 = 0.95$, $\lambda_3 = 0.75$).
Results provided for two cases: $\lambda_q$ known and $\lambda_q$ estimated by $\hat\lambda_q = \min\{m_q^{-1}(m_q - 0.5),\ \max(M_q^{-1}, l_q)\}$
$\mathrm{logit}\{E(y_i \mid x_i)\} = 1 - 5x_i$, with $x_i \sim \mathrm{Uniform}[0,1]$
$y_i = I\{U_i \le E(y_i \mid x_i)\}$, with $U_i \sim \mathrm{Uniform}[0,1]$

41 Focus on estimation of slope parameter
[Table: relative bias, relative RMSE and coverage for estimators TR, ST, M, A and C. Scenario 1: linkage probabilities correctly specified. Scenario 2: linkage probabilities estimated. Numerical entries were not preserved in this transcription.]

42 [Figure: estimation errors, slope]

43 Comments
All three adjusted estimators are unbiased if $\lambda_q$ is known or estimated unbiasedly, with C the most efficient performer and M the least efficient.
Differences in efficiency are not as pronounced as in the linear case, most probably because linkage error does not automatically result in measurement error when Y is binary.
Again, estimated variances include an extra component to allow for estimation of $\lambda_q$. Without this component, coverage is reduced, but the effect is not as pronounced as in the linear case.

44 Sample to register linkage
First link the Y- and X-registers (at least conceptually), then take a sample from the X-register (equivalent to sampling from the linked population register), so data = $(y_{sq}^*, x_{sq})$ with sample weights $w_{sq}$
sampling and linkage processes are independent
may also have access to summary statistics from the unlinked registers, so data could also include $\bar y_q$, $\bar x_q$
weighted (pseudo-likelihood) methodology investigated (no population summary data)

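45 With R denoting non-sampled and S denoting sampled records,
$$(X, Y) = \begin{pmatrix} x_{1R} & y_{1R} \\ x_{2R} & y_{2R} \\ x_{3R} & y_{3R} \\ x_{1S} & y_{1S} \\ x_{2S} & y_{2S} \\ x_{3S} & y_{3S} \end{pmatrix} = \begin{pmatrix} X_R & Y_R \\ X_S & Y_S \end{pmatrix} \quad\text{but}\quad (X_S, Y_S^*) = \begin{pmatrix} x_{1S} & y_{3R} \\ x_{2S} & y_{3S} \\ x_{3S} & y_{2S} \end{pmatrix}$$
$$Y^* = \begin{pmatrix} Y_R^* \\ Y_S^* \end{pmatrix} = AY = \begin{pmatrix} A_{RR} & A_{RS} \\ A_{SR} & A_{SS} \end{pmatrix} \begin{pmatrix} Y_R \\ Y_S \end{pmatrix}$$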

46 Nested linkage
One register is a subset of the other, so only one register can be completely linked:
(1) X-register is a subset of the Y-register
(2) Y-register is a subset of the X-register
sampling and linkage processes are not independent
sample to register linkage, where the sample is first selected and then linked to the register, is a special case of nested linkage...

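47 With B denoting the Y-register records that also appear on the X-register and A the remainder,
$$(X, Y) = \begin{pmatrix} & y_{1A} \\ & y_{2A} \\ & y_{3A} \\ x_{1B} & y_{1B} \\ x_{2B} & y_{2B} \\ x_{3B} & y_{3B} \end{pmatrix} = \begin{pmatrix} & Y_A \\ X_B & Y_B \end{pmatrix} \quad\text{but}\quad (X_B, Y_B^*) = \begin{pmatrix} x_{1B} & y_{3A} \\ x_{2B} & y_{3B} \\ x_{3B} & y_{2B} \end{pmatrix}$$
$$Y^* = \begin{pmatrix} Y_A^* \\ Y_B^* \end{pmatrix} = AY = \begin{pmatrix} A_{AA} & A_{AB} \\ A_{BA} & A_{BB} \end{pmatrix} \begin{pmatrix} Y_A \\ Y_B \end{pmatrix}$$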

48 ABS Census Data Enhancement project
Statistical Longitudinal Census Dataset (SLCD): a 5% sample of 2006 Census person records linked to their corresponding 2011 Census records without use of names and addresses.
Substantial opportunity for longitudinal analysis at a relatively small geographical level while maintaining the ABS's strong commitment to the confidentiality of its Census respondents.

49 Problem 1: Will the linked records really be a random sample? Who will SLCD data represent in 2011?
Problem 2: How accurate will the linkage be? Linkage errors are a particular type of measurement error and will induce biases in analysis.

50 Simulated SLCD
Census Dress Rehearsal (2005) linked to 2006 Census
gold linking: name, address, mesh block + census data items (treated as truth)
bronze linking: mesh block + census data items
Regression models fitted to gold-linked and bronze-linked data give different results.

51 Can the methodology developed so far significantly reduce this discrepancy?
[Table: naive deviance and adjusted deviance for logistic models for Migration, Employment and Student. Numerical entries were not preserved in this transcription.]
Why?

52 Main components of chi-square error for bronze migration model
[Table: naive fit and adjusted fit by source: Sample bias, Incorrect links (B-B), Incorrect links (B-A). Numerical entries were not preserved in this transcription.]
Real problem is that the bronze-linked sample and the gold-linked sample are not representative of the same population...

53 Ongoing related research
Extension to multilevel models: ARC research project on linked longitudinal data
Linking surveys to registers: Canadian health survey database (with Milorad Kovacevic)
Non-exchangeable models for linkage errors?
Informative linkage (after blocking)?
Errors in blocking variables?
Linkage across multiple databases (longitudinal linkage)?

54 References
Fellegi, I.P. and Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64.
Lahiri, P. and Larsen, M.D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100.
Neter, J., Maynes, E.S. and Ramanathan, R. (1965). The effect of mismatching on the measurement of response error. Journal of the American Statistical Association, 60.
Scheuren, F. and Winkler, W.E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19.

Regression Analysis of Probability-Linked Data

Regression Analysis of Probability-Linked Data Official Statistics Research Series, Vol 4, 2009 ISSN 1177-5017; ISBN 978-0-478-31569-1 (Online) Regression Analysis of Probability-Linked Data Ray Chambers Centre for Statistical and Survey Methodology,

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Regression Analysis: Basic Concepts

Regression Analysis: Basic Concepts The simple linear model Regression Analysis: Basic Concepts Allin Cottrell Represents the dependent variable, y i, as a linear function of one independent variable, x i, subject to a random disturbance

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Overview of Methods for Analyzing Cluster-Correlated Data. Garrett M. Fitzmaurice

Overview of Methods for Analyzing Cluster-Correlated Data. Garrett M. Fitzmaurice Overview of Methods for Analyzing Cluster-Correlated Data Garrett M. Fitzmaurice Laboratory for Psychiatric Biostatistics, McLean Hospital Department of Biostatistics, Harvard School of Public Health Outline

More information

Linking methodology used by Statistics New Zealand in the Integrated Data Infrastructure project

Linking methodology used by Statistics New Zealand in the Integrated Data Infrastructure project Linking methodology used by Statistics New Zealand in the Integrated Data Infrastructure project Crown copyright This work is licensed under the Creative Commons Attribution 3.0 New Zealand licence. You

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

More information

From the help desk: Bootstrapped standard errors

From the help desk: Bootstrapped standard errors The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

More information

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC 27709 Abstract

More information

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014. University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

More information

Chapter 3: The Multiple Linear Regression Model

Chapter 3: The Multiple Linear Regression Model Chapter 3: The Multiple Linear Regression Model Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans November 23, 2013 Christophe Hurlin (University of Orléans) Advanced Econometrics

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

More information

SYSTEMS OF REGRESSION EQUATIONS

SYSTEMS OF REGRESSION EQUATIONS SYSTEMS OF REGRESSION EQUATIONS 1. MULTIPLE EQUATIONS y nt = x nt n + u nt, n = 1,...,N, t = 1,...,T, x nt is 1 k, and n is k 1. This is a version of the standard regression model where the observations

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Data Matching Optimal and Greedy

Data Matching Optimal and Greedy Chapter 13 Data Matching Optimal and Greedy Introduction This procedure is used to create treatment-control matches based on propensity scores and/or observed covariate variables. Both optimal and greedy

More information

The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23 (copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Structural Equation Models for Comparing Dependent Means and Proportions. Jason T. Newsom

Structural Equation Models for Comparing Dependent Means and Proportions. Jason T. Newsom Structural Equation Models for Comparing Dependent Means and Proportions Jason T. Newsom How to Do a Paired t-test with Structural Equation Modeling Jason T. Newsom Overview Rationale Structural equation

More information

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

Random effects and nested models with SAS

Random effects and nested models with SAS Random effects and nested models with SAS /************* classical2.sas ********************* Three levels of factor A, four levels of B Both fixed Both random A fixed, B random B nested within A ***************************************************/

More information

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4 4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and

More information

Question 2: How do you solve a matrix equation using the matrix inverse?

Question 2: How do you solve a matrix equation using the matrix inverse? Question : How do you solve a matrix equation using the matrix inverse? In the previous question, we wrote systems of equations as a matrix equation AX B. In this format, the matrix A contains the coefficients

More information

Introduction to Analysis Methods for Longitudinal/Clustered Data, Part 3: Generalized Estimating Equations

Introduction to Analysis Methods for Longitudinal/Clustered Data, Part 3: Generalized Estimating Equations Introduction to Analysis Methods for Longitudinal/Clustered Data, Part 3: Generalized Estimating Equations Mark A. Weaver, PhD Family Health International Office of AIDS Research, NIH ICSSC, FHI Goa, India,

More information

Assignments Analysis of Longitudinal data: a multilevel approach

Assignments Analysis of Longitudinal data: a multilevel approach Assignments Analysis of Longitudinal data: a multilevel approach Frans E.S. Tan Department of Methodology and Statistics University of Maastricht The Netherlands Maastricht, Jan 2007 Correspondence: Frans

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Missing Data Dr Eleni Matechou

Missing Data Dr Eleni Matechou 1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Introducing the Multilevel Model for Change

Introducing the Multilevel Model for Change Department of Psychology and Human Development Vanderbilt University GCM, 2010 1 Multilevel Modeling - A Brief Introduction 2 3 4 5 Introduction In this lecture, we introduce the multilevel model for change.

More information

Internet Appendix to CAPM for estimating cost of equity capital: Interpreting the empirical evidence

Internet Appendix to CAPM for estimating cost of equity capital: Interpreting the empirical evidence Internet Appendix to CAPM for estimating cost of equity capital: Interpreting the empirical evidence This document contains supplementary material to the paper titled CAPM for estimating cost of equity

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information
