Regression analysis of probability-linked data


1 Regression analysis of probability-linked data
Ray Chambers, University of Wollongong
James Chipperfield, Australian Bureau of Statistics
Walter Davis, Statistics New Zealand
2 Overview
1. Probability linkage overview
2. A statistical framework for linkage errors
3. Linear regression with linked data
4. Extension to estimating equations / logistic regression
5. Linking samples to registers
6. Linking nested registers
7. Application: ABS Census Data Enhancement project
8. Future research
3 Research questions
What impact do errors in probability linking of records from different sources have on subsequent statistical analysis of the linked data?
How should standard statistical methods, in particular regression modelling, be modified in order to minimise this impact?
4 Background
Population of $N$ units, indexed by $i = 1, \ldots, N$
Response = scalar random variable $Y$
Explanators = vector random variable $X$
Linear regression model:
$$E_X(Y) = X^T\beta, \qquad \mathrm{Var}_X(Y) = \sigma^2$$
Logistic regression model:
$$\Pr{}_X(Y = 1) = \frac{\exp(X^T\beta)}{1 + \exp(X^T\beta)}$$
5 Estimation of β is straightforward given a random sample of values of (Y, X) for units from this population
We do not have such a sample; instead, data are available from two linked population registers:
  the Y-register contains values of Y
  the X-register contains values of X
There is no unique identifier to link records from the two registers, so a probabilistic record linkage method is used
6 Probabilistic record linkage
Fellegi & Sunter (1969): "Record linkage is a solution to the problem of recognizing those records in two files which represent identical persons, objects, or events..."
A set of observed variables present in both registers (matching variables) is used to link records in order to maximise the probability that they refer to the same unit
7 Applications of record linkage
Merging of large databases (e.g. before data mining)
Removing duplicates in registries
Generating longitudinal records from cross-sectional data
Combining data sources (e.g. survey and register)
A very active research area, with a focus on:
  Developing efficient and accurate matching algorithms
  Alternatives to the statistical theory of Fellegi-Sunter
  Effects of linkage error on statistical analysis
8 An early example
Neter et al. (1965)
Focus on confounding of linkage errors and response errors, with application to the Netherlands Validation Study
[Figure: observed relationship between reported bank balance and bank balance from linked data]
9 Can the observed regression-to-the-mean relationship between reported and linked bank balances be explained by linkage errors rather than response errors?
Using a simple linkage error model (essentially the same as the one that we use later), they show that Pr(correct link) has to be less than 80% to explain this behaviour, from which they conclude that response errors underpin the discrepancies
10 Analysis framework
Both registers are complete, with no duplications, so each contains $N$ records
All records on both registers are linked
Can define a categorical variable $Z$ on both registers that
  is measured without error on both
  takes $Q$ distinct values $1, 2, \ldots, Q$
$M_q$ records in each register with $Z = q$ (so $N = \sum_q M_q$)
$Z$ is referred to as a blocking variable in what follows
11 Implications
Block structure: linkage errors can only occur within blocks. Record A on the Y-register with $Z = z_A$ and record B on the X-register with $Z = z_B$ must correspond to different units if $z_A \neq z_B$, but $z_A = z_B$ does not mean that A and B are the same unit
Complete linkage: every record in block $q$ of the Y-register is linked with a unique record in the same block $q$ of the X-register
12 Notation
We index the records on the linked data set in exactly the same way as we index the X-register
For each block $q$ we have $M_q$ linked data pairs $(y_i^*, x_i)$, where $y_i^*$ denotes the Y-value from the record on the Y-register in block $q$ that is linked to the record on the X-register in block $q$ with value $x_i$
$y_q^*$ = vector defined by the values $(y_i^*; i \in q)$
$X_q$ = matrix with rows defined by the values $(x_i; i \in q)$
$y_q$ = unknown vector corresponding to the true Y values $(y_i; i \in q)$ associated with $X_q$
13 A model for linkage error
$$y_q^* = A_q y_q$$
$A_q$ is an unknown random permutation matrix of order $M_q$, i.e. entries of $A_q$ are either zero or one, with a value of one occurring just once in each row and column
$$E_X(A_q) = E_q$$
Non-informative linkage:
$$E_X(A_q y_q) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta$$
i.e. records are mismatched at random given $X$
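14 Example
$$y = (y_1, y_2, y_3, y_4, y_5)^T \quad\text{while}\quad y^* = (y_3, y_2, y_5, y_4, y_1)^T$$
$$A = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}$$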
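A small numerical sketch of this example may help. The numeric values of y below are made up for illustration; only the ordering of the linked vector comes from the slide. The permutation matrix is built by reordering the rows of an identity matrix.

```python
import numpy as np

# Hypothetical true Y-values y_1, ..., y_5 (illustrative numbers only).
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Linked record i carries the true value with 0-based index perm[i],
# i.e. y* = (y_3, y_2, y_5, y_4, y_1).
perm = [2, 1, 4, 3, 0]

# Row i of A has its single 1 in column perm[i].
A = np.eye(5)[perm]
y_star = A @ y

print(A.astype(int))
print(y_star)
```

Every row and column of `A` contains exactly one 1, which is the permutation-matrix property used in the linkage error model.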
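15 An exchangeable linkage error model
Since the linkage process maximises the probability that a declared link is a true link, correct linkages should be more likely than incorrect linkages...
Pr(correct linkage in block $q$) = $\lambda_q$
Pr(wrong linkage in block $q$) = $\gamma_q$
$$E_q = E_X(A_q) = \begin{pmatrix} \lambda_q & \gamma_q & \cdots & \gamma_q \\ \gamma_q & \lambda_q & \cdots & \gamma_q \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_q & \gamma_q & \cdots & \lambda_q \end{pmatrix}$$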
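16 Implication
Complete linkage within a block, so $A_q 1_q = A_q^T 1_q = 1_q$, where $1_q$ is the vector of $M_q$ ones: each row and each column of $A_q$ contains exactly one entry equal to one, so all row sums and column sums equal one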
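17 It immediately follows that $E_q 1_q = E_q^T 1_q = 1_q$, i.e.
$$\begin{pmatrix} \lambda_q & \gamma_q & \cdots & \gamma_q \\ \gamma_q & \lambda_q & \cdots & \gamma_q \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_q & \gamma_q & \cdots & \lambda_q \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
or equivalently
$$\lambda_q + (M_q - 1)\gamma_q = 1 \quad\Longleftrightarrow\quad \gamma_q = \frac{1 - \lambda_q}{M_q - 1}$$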
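18 For example, when $M = 50$ and $\lambda = 0.9$, we have $\gamma = 0.1/49 \approx 0.002$ and
$$E = \begin{pmatrix} 0.9 & 0.1/49 & \cdots & 0.1/49 \\ 0.1/49 & 0.9 & \cdots & 0.1/49 \\ \vdots & \vdots & \ddots & \vdots \\ 0.1/49 & 0.1/49 & \cdots & 0.9 \end{pmatrix} \approx \begin{pmatrix} 0.9 & 0.002 & \cdots & 0.002 \\ 0.002 & 0.9 & \cdots & 0.002 \\ \vdots & \vdots & \ddots & \vdots \\ 0.002 & 0.002 & \cdots & 0.9 \end{pmatrix}$$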
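The exchangeable matrix is easy to build numerically. The following is a minimal sketch; the helper name `exchangeable_E` is an assumption introduced here and is not taken from the slides or from any library.

```python
import numpy as np

def exchangeable_E(M, lam):
    """Expected permutation matrix under the exchangeable model.

    Diagonal entries equal lam; off-diagonal entries equal
    gamma = (1 - lam) / (M - 1), so every row and column sums to 1.
    """
    if M < 2:
        raise ValueError("a block needs at least two records for linkage error")
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

E = exchangeable_E(50, 0.9)
print(E[0, 0], E[0, 1])      # lam and gamma = 0.1 / 49
print(E.sum(axis=1)[:3])     # first few row sums
```

Storing the full $M \times M$ matrix is only sensible for small blocks; for large blocks the same products can be formed from $\lambda$ and $\gamma$ alone.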
19 Linear regression
Standard approach: estimate β as if $y^*$ and $y$ are identical, i.e. as if $A$ is the identity matrix:
$$\hat\beta^* = (X^T X)^{-1} X^T y^* = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T A_q y_q$$
Biased if linkage is not perfect...
$$E_X(\hat\beta^*) = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T E_q X_q\,\beta = D\beta \neq \beta$$
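To see the size of the bias, the matrix $D$ can be computed directly. This is a sketch for a single block under the exchangeable model; the block size, $\lambda$, the coefficients and the evenly spaced $x$ values are illustrative assumptions, and `exchangeable_E` is a helper defined here. With one block and an intercept in the model, the slope is attenuated by exactly the factor $\lambda - \gamma$.

```python
import numpy as np

def exchangeable_E(M, lam):
    """E_q with lam on the diagonal and (1 - lam)/(M - 1) elsewhere."""
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

M, lam = 200, 0.75                      # illustrative block size and Pr(correct link)
beta = np.array([1.0, 5.0])             # intercept and slope
x = np.linspace(0.0, 1.0, M)
X = np.column_stack([np.ones(M), x])
E = exchangeable_E(M, lam)

# D = (X'X)^{-1} X'EX, so that the naive estimator has expectation D @ beta.
D = np.linalg.solve(X.T @ X, X.T @ E @ X)
expected_naive = D @ beta

gamma = (1.0 - lam) / (M - 1)
print(expected_naive)                   # intercept pushed up, slope pulled down
print((lam - gamma) * beta[1])          # slope attenuation factor times the true slope
```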
20 Bias-correcting the naive OLS estimator
Motivated by the approach of Scheuren and Winkler (1993). If $E$ is known, $D$ is also known. Provided $D^{-1}$ exists, an unbiased estimator of β is
$$\hat\beta_{SW} = D^{-1}\hat\beta^* = \Big\{\Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T E_q X_q\Big\}^{-1} \hat\beta^*$$
Assuming that $\sum_q X_q^T E_q X_q$ is of full rank,
$$\hat\beta_{SW} = \Big(\sum_q X_q^T E_q X_q\Big)^{-1} \sum_q X_q^T y_q^*$$
21 An unbiased OLS estimator
Lahiri and Larsen (2005):
$$E_X(y_q^*) = E_X(A_q)\,E_X(y_q) = E_q X_q \beta = H_q \beta$$
The OLS estimator based on this corrected model matrix is unbiased...
$$\hat\beta_{LL} = \Big(\sum_q H_q^T H_q\Big)^{-1} \sum_q H_q^T y_q^* = \Big(\sum_q X_q^T E_q^T E_q X_q\Big)^{-1} \sum_q X_q^T E_q^T y_q^*$$
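The naive, SW and LL formulas can be compared on a toy data set. This is a sketch under stated assumptions: the function names, the block sizes (scaled down from the simulation slides) and the random $x$ values are illustrative, and the "data" are the noise-free expected linked responses $E_q X_q \beta$, so the adjusted estimators should reproduce β exactly while the naive one does not.

```python
import numpy as np

def exchangeable_E(M, lam):
    """E_q with lam on the diagonal and (1 - lam)/(M - 1) elsewhere."""
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

def naive_ols(blocks):
    """OLS treating the linked responses as if they were correctly linked."""
    XtX = sum(X.T @ X for X, y_star, E in blocks)
    Xty = sum(X.T @ y_star for X, y_star, E in blocks)
    return np.linalg.solve(XtX, Xty)

def sw_estimate(blocks):
    """Bias-corrected estimator: (sum X'EX)^{-1} sum X'y*."""
    XtEX = sum(X.T @ E @ X for X, y_star, E in blocks)
    Xty = sum(X.T @ y_star for X, y_star, E in blocks)
    return np.linalg.solve(XtEX, Xty)

def ll_estimate(blocks):
    """OLS on the corrected model matrix H = EX."""
    HtH = sum((E @ X).T @ (E @ X) for X, y_star, E in blocks)
    Hty = sum((E @ X).T @ y_star for X, y_star, E in blocks)
    return np.linalg.solve(HtH, Hty)

rng = np.random.default_rng(0)
beta = np.array([1.0, 5.0])
blocks = []
for M, lam in [(150, 1.0), (30, 0.95), (20, 0.75)]:
    X = np.column_stack([np.ones(M), rng.uniform(0, 1, M)])
    E = exchangeable_E(M, lam)
    y_star_mean = E @ X @ beta          # expected linked response, no noise
    blocks.append((X, y_star_mean, E))

print("naive:", naive_ols(blocks))
print("SW:   ", sw_estimate(blocks))
print("LL:   ", ll_estimate(blocks))
```

With real linked data, `y_star_mean` would be replaced by the observed linked responses and `E` by its known or estimated value for each block.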
22 Efficient linear estimation using linked data
Regression errors under the corrected model are not homoskedastic. Their behaviour varies from block to block, reflecting the impact of different amounts of linkage error in different blocks
$$\begin{aligned} \mathrm{Var}_X(y_q^*) &= E_X\{A_q \mathrm{Var}_X(y_q) A_q^T\} + \mathrm{Var}_X\{A_q E_X(y_q)\} \\ &= E_X\{A_q (\sigma^2 I_q) A_q^T\} + \mathrm{Var}_X(A_q X_q \beta) \\ &= \sigma^2 E_X(A_q A_q^T) + \mathrm{Var}_X\{A_q f_q\} \\ &= \sigma^2 I_q + V_q = \Sigma_q \end{aligned}$$
where $f_q = X_q\beta$ and $V_q = \mathrm{Var}_X(A_q f_q)$
23 Best linear unbiased estimator
$$\hat\beta_{BL} = \Big(\sum_q H_q^T \Sigma_q^{-1} H_q\Big)^{-1} \sum_q H_q^T \Sigma_q^{-1} y_q^* = \Big(\sum_q X_q^T E_q^T \Sigma_q^{-1} E_q X_q\Big)^{-1} \sum_q X_q^T E_q^T \Sigma_q^{-1} y_q^*$$
Depends on $\Sigma_q$, and hence on $\sigma^2$ and β: substitute $\hat\sigma^2$ for $\sigma^2$ and $\hat V_q$ for $V_q$, then iterate...
24 Estimating the finite population regression parameter
Kovacevic (2008): if we knew $y_q$, our best estimate of β would be its OLS estimate
$$B = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T y_q$$
Look for an estimator based on the linked data that is unbiased for $B$ given the correctly linked population values...
$$E_{YX}(\hat B^*) = B$$
Note: none of $\hat\beta_{SW}$, $\hat\beta_{LL}$, $\hat\beta_{BL}$ has this finite population property
25 Consider the class of estimators that can be written in the form
$$\hat B^* = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T K_q y_q^*$$
With $K_q = E_q^{-1}$,
$$E_{YX}(\hat B^*) = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T K_q E_q y_q = B$$
Leads to
$$\hat\beta_{MK} = \Big(\sum_q X_q^T X_q\Big)^{-1} \sum_q X_q^T E_q^{-1} y_q^*$$
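The finite population property can be checked numerically. In this sketch the population values are simulated under illustrative assumptions (block sizes, seed, coefficients), and the expected linked vector $E_q y_q$ is plugged into the MK and LL forms; the MK form returns the population OLS value $B$, whereas the LL form in general does not.

```python
import numpy as np

def exchangeable_E(M, lam):
    """E_q with lam on the diagonal and (1 - lam)/(M - 1) elsewhere."""
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

rng = np.random.default_rng(1)
beta = np.array([1.0, 5.0])
blocks = []
for M, lam in [(150, 1.0), (30, 0.95), (20, 0.75)]:
    X = np.column_stack([np.ones(M), rng.uniform(0, 1, M)])
    y = X @ beta + rng.normal(0, 1, M)   # true finite-population values
    blocks.append((X, y, exchangeable_E(M, lam)))

XtX = sum(X.T @ X for X, y, E in blocks)

# Finite-population target B: OLS on the correctly linked values.
B = np.linalg.solve(XtX, sum(X.T @ y for X, y, E in blocks))

# Given the population values, the linked vector has expectation E_q y_q.
# Plugging that into the MK form (K_q = E_q^{-1}) gives its expectation.
mk_mean = np.linalg.solve(
    XtX, sum(X.T @ np.linalg.solve(E, E @ y) for X, y, E in blocks)
)

# The same calculation for the LL form, which targets beta rather than B.
ll_mean = np.linalg.solve(
    sum((E @ X).T @ (E @ X) for X, y, E in blocks),
    sum((E @ X).T @ (E @ y) for X, y, E in blocks),
)

print("B:      ", B)
print("MK mean:", mk_mean)
print("LL mean:", ll_mean)
```

`np.linalg.solve(E, v)` is used instead of forming $E_q^{-1}$ explicitly; it requires $E_q$ to be non-singular, which holds under the exchangeable model whenever $\lambda_q \neq \gamma_q$.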
26 Some model-based simulation results
Three blocks, $M_1 = 1500$, $M_2 = 300$ and $M_3 = 200$, with independent exchangeable linkage errors in each block
Two scenarios:
  o $\lambda_q$ correctly specified ($\lambda_1 = 1.0$, $\lambda_2 = 0.95$, $\lambda_3 = 0.75$)
  o $\lambda_q$ estimated by $\hat\lambda_q = \min\{m_q^{-1}(m_q - 0.5),\ \max(M_q^{-1}, l_q)\}$, with $l_q$ equal to the proportion of correct links in a random sample of $m_q = 20$ linked records in each of blocks 2 and 3
$y_i = 1 + 5x_i + e_i$, with $x_i \sim U[0,1]$ and $e_i \sim N(0,1)$
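The audit-sample estimate of $\lambda_q$ in the second scenario can be written as a small function. This is a sketch that reads $l_q$ as the proportion of correct links in the audit sample, so that the bounds $M_q^{-1}$ and $(m_q - 0.5)/m_q$ apply; the function and argument names are assumptions made here.

```python
def lambda_hat(n_correct, m, M):
    """Audit-sample estimate of Pr(correct link) for one block.

    n_correct: number of correct links found in the audit sample
    m: audit sample size (m_q)
    M: block size (M_q)
    """
    l = n_correct / m                      # proportion of correct links
    return min((m - 0.5) / m, max(1.0 / M, l))

print(lambda_hat(20, 20, 300))   # every audited link correct: capped below 1
print(lambda_hat(15, 20, 200))   # interior case: equals the audit proportion
print(lambda_hat(0, 20, 200))    # no audited link correct: floored at 1 / M
```

The cap keeps the estimate strictly below 1 even when the audit finds no errors, and the floor keeps it at or above the value $1/M_q$ that corresponds to linking at random within the block.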
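27 Simulation results for the linear model (numerical entries not preserved in this transcript)
Columns: Estimator; Relative Bias (Intercept, Slope); Relative RMSE (Intercept, Slope); Coverage (Intercept, Slope)
Scenario 1: Linkage probabilities correctly specified
  Rows: TR, ST, SW, MK, LL, BL
Scenario 2: Linkage probabilities estimated from audit sample
  Rows: TR, ST, SW, MK, LL, BL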
28 [Figure: estimation errors for the slope]
29 Comments
When $\lambda_q$ is known (or estimated unbiasedly), all adjusted estimators of β are unbiased, with BL (EBLUE) the most efficient and MK (finite population unbiased) the least
Estimated variances for the adjusted estimators of β include an extra component in the filling of the sandwich estimator. This estimates the additional contribution to variance when $\lambda_q$ is estimated. Without it, CI coverage is not good!
30 Audit sample issues
Do smaller audit samples increase bias?
  Some evidence of small bias (12.5%) for SW & MK estimators only when the clerical sample is as small as 10; LL and BL are fine. 95% CIs are solid
Do smaller audit samples substantially increase the variance?
  Some, but not substantially (more in a bit)
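31 What is the impact of the size (m) of the clerical sample?
Increase in variance relative to true OLS (some table entries not preserved in this transcript):
  Known $\lambda_q$: 7-11%
  Estimated $\lambda_q$, at three clerical sample sizes $m_q$ (one of them $m_q = 25$): surviving entries include "(33% for MK)" and "15-30% (>60% MK)"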
32 How low can we go?
10 records out of 200 is a 5% sample and, in the real world, nobody wants to undertake a 5% sample of an administrative database...
So for a population size of 10,000 (blocks of 5000, 3000, 2000) and an audit sample of just 10 in each block:
  Estimators still look good in terms of bias (esp. LL and BL)
  Variances are 35-45% higher than true OLS (>100% for MK)
  95% CIs are too wide (97-100% coverage)
33 Applying in practice: the LinkReg SAS macro
Code applying the theory was initially developed in R
In practice, analysis of probabilistically linked data will involve hundreds of thousands, even millions, of records...
A means of efficiently fitting linear models to linked datasets of this magnitude is necessary for Statistics New Zealand to take advantage of the theory developed in the project
34 The details
%LinkReg (INDATA=dataset, Y=y-variable, X=x-variable, BLOCK=block-variable, LAMBDA=λ-variable, ESTLAMBDA=0/1, MQSAMP=clerical-sample-size-variable, COVEST=0/1, OUTLIB=output-library)
A linear model with 10 X variables and a million observations takes less than a minute to fit
35 What you get
Printed output:
  Model specifications
  Three experimental R-square values
  Coefficients, SEs, Z-values and p(z)
  Covariance matrix of the estimates by request
Output datasets:
  _linkbeta, containing coeffs and SEs
  _covest_??, containing the cov matrix
36 Logistic regression
$Y$ is binary with
$$E(Y \mid X = x) = \Pr(Y = 1 \mid X = x) = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)}$$
Put
$$f_q(\beta) = \{E(y_i \mid x_i);\ i \in q\} = \left\{\frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)};\ i \in q\right\}$$
Then, given $y_q$, we usually estimate β as the solution to the ML estimating equation
$$\sum_q X_q^T\{y_q - f_q(\beta)\} = 0$$
37 Estimating functions with linked data
Unbiased estimating function given correctly linked data:
$$H(\theta) = \sum_{i=1}^{N} G_i(\theta)\{y_i - f_i(\theta)\} = \sum_q G_q(\theta)\{y_q - f_q(\theta)\}$$
When used with probability-linked data, this becomes
$$H^*(\theta) = \sum_q G_q(\theta)\{y_q^* - f_q(\theta)\}$$
38 Estimating function is no longer unbiased...
$$E_X\{H^*(\theta_0)\} = \sum_q G_q(\theta_0)(E_q - I_q) f_q(\theta_0) \neq 0$$
A bias-corrected estimating function:
$$H_{adj}(\theta) = H^*(\theta) - \sum_q G_q(\theta)(E_q - I_q) f_q(\theta) = \sum_q G_q(\theta)\{y_q^* - E_q f_q(\theta)\}$$
39 Application to logistic regression
Estimating equation:
$$\sum_q G_q(\beta)\{y_q^* - E_q f_q(\beta)\} = 0$$
Choosing $G_q(\beta)$:
  M (defines the MLE when data are correctly linked): $G_q(\beta) = X_q^T$
  A (leads to the LL estimator in the linear model): $G_q(\beta) = X_q^T E_q^T$
  C (second-order optimal, leads to the BL estimator in the linear model)
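The adjusted estimating equation with choice M can be solved by Newton iterations. The sketch below is one possible implementation, not the authors' code: the function names, the step-halving safeguard, the scaled-down block sizes and the hypothetical coefficients (1, -5) are all assumptions. The input is the noise-free expected linked response $E_q f_q(\beta_0)$, so the root of the adjusted equation is $\beta_0$ itself.

```python
import numpy as np

def exchangeable_E(M, lam):
    """E_q with lam on the diagonal and (1 - lam)/(M - 1) elsewhere."""
    gamma = (1.0 - lam) / (M - 1)
    return gamma * np.ones((M, M)) + (lam - gamma) * np.eye(M)

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def h_adj(beta, blocks):
    """Adjusted estimating function with G_q = X_q^T (choice M)."""
    return sum(X.T @ (y_star - E @ expit(X @ beta)) for X, y_star, E in blocks)

def solve_adjusted_logistic(blocks, max_iter=100, tol=1e-10):
    """Newton iterations with step-halving for h_adj(beta) = 0."""
    p = blocks[0][0].shape[1]
    beta = np.zeros(p)
    for _ in range(max_iter):
        h = h_adj(beta, blocks)
        # Minus the Jacobian of h_adj: sum_q X_q^T E_q W_q X_q,
        # where W_q = diag(f_i (1 - f_i)).
        J = np.zeros((p, p))
        for X, y_star, E in blocks:
            f = expit(X @ beta)
            J += X.T @ E @ (X * (f * (1.0 - f))[:, None])
        step = np.linalg.solve(J, h)
        # Halve the step while it fails to reduce the norm of h_adj.
        t = 1.0
        while t > 1e-4 and (np.linalg.norm(h_adj(beta + t * step, blocks))
                            > np.linalg.norm(h)):
            t /= 2.0
        beta = beta + t * step
        if np.linalg.norm(t * step) < tol:
            break
    return beta

rng = np.random.default_rng(2)
beta0 = np.array([1.0, -5.0])            # hypothetical true coefficients
blocks = []
for M, lam in [(150, 1.0), (30, 0.95), (20, 0.75)]:
    X = np.column_stack([np.ones(M), rng.uniform(0, 1, M)])
    E = exchangeable_E(M, lam)
    y_star_mean = E @ expit(X @ beta0)   # expected linked response, no noise
    blocks.append((X, y_star_mean, E))

beta_hat = solve_adjusted_logistic(blocks)
print(beta_hat)
```

Choice A would only change `h_adj` and `J` by premultiplying each block's term by $E_q^T$; with binary linked responses in place of `y_star_mean` the same solver applies unchanged.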
40 Some simulation results
Same set-up as in the previous simulation, i.e. three blocks, $M_1 = 1500$, $M_2 = 300$ and $M_3 = 200$, with independent exchangeable linkage errors in each block
Same $\lambda_q$s as before ($\lambda_1 = 1.0$, $\lambda_2 = 0.95$, $\lambda_3 = 0.75$)
Results provided for two cases: $\lambda_q$ known and $\lambda_q$ estimated by $\hat\lambda_q = \min\{m_q^{-1}(m_q - 0.5),\ \max(M_q^{-1}, l_q)\}$
$\mathrm{logit}\{E(y_i \mid x_i)\} = 1 - 5x_i$, with $x_i \sim \mathrm{Uniform}[0,1]$
$y_i = I\{U_i \le E(y_i \mid x_i)\}$, with $U_i \sim \mathrm{Uniform}[0,1]$
41 Focus on estimation of the slope parameter (numerical entries not preserved in this transcript)
Columns: Estimator; Relative Bias; Relative RMSE; Coverage
Scenario 1: Linkage probabilities correctly specified
  Rows: TR, ST, M, A, C
Scenario 2: Linkage probabilities estimated
  Rows: TR, ST, M, A, C
42 [Figure: estimation errors for the slope]
43 Comments
All three adjusted estimators are unbiased if $\lambda_q$ is known or estimated unbiasedly, with C the most efficient performer and M the least efficient
Differences in efficiency are not as pronounced as in the linear case, most probably because linkage error does not automatically result in measurement error when Y is binary
Again, estimated variances include an extra component to allow for estimation of $\lambda_q$. Without this component, coverage is reduced, but the effect is not as pronounced as in the linear case
44 Sample to register linkage
First link the Y- and X-registers (at least conceptually), then take a sample from the X-register (equivalent to sampling from the linked population register), so data = $(y_{sq}^*, X_{sq})$ with sample weights $w_{sq}$
Sampling and linkage processes are independent
May also have access to summary statistics from the unlinked registers, so data could also include $\bar y_q$, $\bar x_q$
Weighted (pseudo-likelihood) methodology investigated (no population summary data)
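45
$$(X, Y) = \begin{pmatrix} x_{1R} & y_{1R} \\ x_{2R} & y_{2R} \\ x_{3R} & y_{3R} \\ x_{1S} & y_{1S} \\ x_{2S} & y_{2S} \\ x_{3S} & y_{3S} \end{pmatrix} = \begin{pmatrix} X_R & Y_R \\ X_S & Y_S \end{pmatrix} \quad\text{but}\quad (X_S, Y_S^*) = \begin{pmatrix} x_{1S} & y_{3R} \\ x_{2S} & y_{3S} \\ x_{3S} & y_{2S} \end{pmatrix}$$
$$Y^* = \begin{pmatrix} Y_R^* \\ Y_S^* \end{pmatrix} = AY = \begin{pmatrix} A_{RR} & A_{RS} \\ A_{SR} & A_{SS} \end{pmatrix} \begin{pmatrix} Y_R \\ Y_S \end{pmatrix}$$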
46 Nested linkage
One register is a subset of the other, so only one register can be completely linked:
  (1) X-register is a subset of the Y-register
  (2) Y-register is a subset of the X-register
Sampling and linkage processes are not independent
Sample-to-register linkage, where the sample is first selected and then linked to the register, is a special case of nested linkage...
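47
$$(X, Y) = \begin{pmatrix} \cdot & y_{1A} \\ \cdot & y_{2A} \\ \cdot & y_{3A} \\ x_{1B} & y_{1B} \\ x_{2B} & y_{2B} \\ x_{3B} & y_{3B} \end{pmatrix} = \begin{pmatrix} \cdot & Y_A \\ X_B & Y_B \end{pmatrix} \quad\text{but}\quad (X_B, Y_B^*) = \begin{pmatrix} x_{1B} & y_{3A} \\ x_{2B} & y_{3B} \\ x_{3B} & y_{2B} \end{pmatrix}$$
$$Y^* = \begin{pmatrix} Y_A^* \\ Y_B^* \end{pmatrix} = AY = \begin{pmatrix} A_{AA} & A_{AB} \\ A_{BA} & A_{BB} \end{pmatrix} \begin{pmatrix} Y_A \\ Y_B \end{pmatrix}$$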
48 ABS Census Data Enhancement project
Statistical Longitudinal Census Dataset (SLCD)
5% sample of 2006 Census person records linked to their corresponding 2011 Census records without use of names and addresses
Substantial opportunity for longitudinal analysis at a relatively small geographical level while maintaining the ABS's strong commitment to the confidentiality of its Census respondents
49 Problem 1: Will the linked records really be a random sample?
  Who will SLCD data represent in 2011?
Problem 2: How accurate will the linkage be?
  Linkage errors are a particular type of measurement error and will induce biases in analysis
50 Simulated SLCD
Census Dress Rehearsal (2005) linked to 2006 Census
  gold linking: name, address, mesh block + census data items (treated as truth)
  bronze linking: mesh block + census data items
Regression models fitted to gold-linked and bronze-linked data give different results
51 Can the methodology developed so far significantly reduce this discrepancy?
Table: naive deviance and adjusted deviance for logistic models for Migration, Employment and Student (values not preserved in this transcript)
Why?
52 Main components of Chi-square error for bronze migration model
Table: naive fit and adjusted fit by source: sample bias, incorrect links (BB), incorrect links (BA) (values not preserved in this transcript)
Real problem is that the bronze-linked sample and the gold-linked sample are not representative of the same population...
53 Ongoing related research
Extension to multilevel models: ARC research project on linked longitudinal data
Linking surveys to registers: Canadian health survey database (with Milorad Kovacevic)
Non-exchangeable models for linkage errors?
Informative linkage (after blocking)?
Errors in blocking variables?
Linkage across multiple databases (longitudinal linkage)?
54 References
Fellegi, I.P. and Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64.
Lahiri, P. and Larsen, M.D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100.
Neter, J., Maynes, E.S. and Ramanathan, R. (1965). The effect of mismatching on the measurement of response error. Journal of the American Statistical Association, 60.
Scheuren, F. and Winkler, W.E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19.
More informationFinancial Risk Management Exam Sample Questions/Answers
Financial Risk Management Exam Sample Questions/Answers Prepared by Daniel HERLEMONT 1 2 3 4 5 6 Chapter 3 Fundamentals of Statistics FRM99, Question 4 Random walk assumes that returns from one time period
More informationAnswer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade
Statistics Quiz Correlation and Regression  ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements
More informationHLM software has been one of the leading statistical packages for hierarchical
Introductory Guide to HLM With HLM 7 Software 3 G. David Garson HLM software has been one of the leading statistical packages for hierarchical linear modeling due to the pioneering work of Stephen Raudenbush
More informationEstimation and Inference in Cointegration Models Economics 582
Estimation and Inference in Cointegration Models Economics 582 Eric Zivot May 17, 2012 Tests for Cointegration Let the ( 1) vector Y be (1). Recall, Y is cointegrated with 0 cointegrating vectors if there
More informationVI. Introduction to Logistic Regression
VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models
More informationLife Table Analysis using Weighted Survey Data
Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using
More informationGeneralized Linear Models
Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the
More informationMATLAB and Big Data: Illustrative Example
MATLAB and Big Data: Illustrative Example Rick Mansfield Cornell University August 19, 2014 Goals Use a concrete example from my research to: Demonstrate the value of vectorization Introduce key commands/functions
More informationIntroduction to Longitudinal Data Analysis
Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction
More informationUnivariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
More informationFrom the help desk: Swamy s randomcoefficients model
The Stata Journal (2003) 3, Number 3, pp. 302 308 From the help desk: Swamy s randomcoefficients model Brian P. Poi Stata Corporation Abstract. This article discusses the Swamy (1970) randomcoefficients
More informationLOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
More informationModels for Longitudinal and Clustered Data
Models for Longitudinal and Clustered Data Germán Rodríguez December 9, 2008, revised December 6, 2012 1 Introduction The most important assumption we have made in this course is that the observations
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationAN INTRODUCTION TO ERROR CORRECTING CODES Part 1
AN INTRODUCTION TO ERROR CORRECTING CODES Part 1 Jack Keil Wolf ECE 154C Spring 2008 Noisy Communications Noise in a communications channel can cause errors in the transmission of binary digits. Transmit:
More informationWe shall turn our attention to solving linear systems of equations. Ax = b
59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system
More informationDEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
More informationMISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
More informationExtensions of the Partial Least Squares approach for the analysis of biomolecular interactions. Nina Kirschbaum
Dissertation am Fachbereich Statistik der Universität Dortmund Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions Nina Kirschbaum Erstgutachter: Prof. Dr. W.
More informationKeep It Simple: Easy Ways To Estimate Choice Models For Single Consumers
Keep It Simple: Easy Ways To Estimate Choice Models For Single Consumers Christine Ebling, University of Technology Sydney, christine.ebling@uts.edu.au Bart Frischknecht, University of Technology Sydney,
More informationL3: Statistical Modeling with Hadoop
L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...
More informationUsing the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes
Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on
More informationFactorization Theorems
Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization
More informationIAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results
IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is Rsquared? Rsquared Published in Agricultural Economics 0.45 Best article of the
More informationTesting for Lack of Fit
Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit
More informationCross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
More informationMarketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
More informationNominal and ordinal logistic regression
Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome
More informationApplied Multivariate Analysis  Big data analytics
Applied Multivariate Analysis  Big data analytics Nathalie VillaVialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
More informationApproaches for Analyzing Survey Data: a Discussion
Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata
More informationAn extension of the factoring likelihood approach for nonmonotone missing data
An extension of the factoring likelihood approach for nonmonotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #47/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationTo give it a definition, an implicit function of x and y is simply any relationship that takes the form:
2 Implicit function theorems and applications 21 Implicit functions The implicit function theorem is one of the most useful single tools you ll meet this year After a while, it will be second nature to
More information15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
More information1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2
PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand
More information2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system
1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables
More informationFactors affecting online sales
Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More information