Estimation of Variance Components

Estimation of Variance Components
Why?
- Better understanding of the genetic mechanism
- Needed for the prediction of breeding values (Selection Index / BLUP)
- Needed for optimization of breeding programs and prediction of response

Variance Components and Parameters
Variance components: additive genetic, residual, maternal, permanent environment, litter / common full-sib component (c²), dominance, herd
Parameters: heritability, maternal heritability, repeatability, covariances / correlations (phenotypic, genetic)

When to (re-)estimate variance components?
- New trait
- (Co)variances change over time due to environmental and/or genetic change: selection, upgrading
- Change of trait definition

Variance and Covariance
Variance: a measure of differences (their extent)
Covariance: a measure of differences in common (between individuals / between traits)
[figure: individual values in sire families, illustrating degrees of family resemblance]
Family resemblance:     None    Moderate   Full
Var between families:   None    Moderate   Large
Var within families:    Large   Moderate   None

Relating variance components to underlying effects - give it a meaning!
Variance between groups = covariance within groups!
Variance between HS families = covariance among half sibs = ¼ V_A (they share 25% of their genes)
Variance within HS families = residual variance = V_P − ¼ V_A = ¾ V_A + V_E + V_D

Relating variance components to underlying effects - give it a meaning!
Variance between groups = covariance within groups!
Variance between FS families = covariance among full sibs = ½ V_A + V_EC + ¼ V_D (they share 50% of their genes)
Variance within FS families = residual variance = V_P − ½ V_A − V_EC − ¼ V_D = ½ V_A + V_EW + ¾ V_D
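The half-sib and full-sib decompositions above are plain arithmetic and can be checked numerically. The sketch below uses made-up illustrative values for V_A, V_D, V_EC and V_EW (none of them are from the slides):

```python
# Numeric check of the half-sib (HS) / full-sib (FS) decomposition.
# VA, VD, VEC, VEW are illustrative values, not taken from the slides.
VA, VD, VEC, VEW = 40.0, 8.0, 6.0, 26.0
VP = VA + VD + VEC + VEW                  # total phenotypic variance

between_hs = 0.25 * VA                    # covariance among half sibs = 1/4 VA
within_hs = VP - between_hs               # = 3/4 VA + VD + VEC + VEW

between_fs = 0.5 * VA + VEC + 0.25 * VD   # covariance among full sibs
within_fs = VP - between_fs               # = 1/2 VA + 3/4 VD + VEW

print(between_hs, within_hs)              # 10.0 70.0
print(between_fs, within_fs)              # 28.0 52.0
```

In both cases the between- and within-family parts add back up to V_P, which is exactly the "variance between groups = covariance within groups" statement.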

Analysis of Variance - Principle
Detect the importance of different sources of effects
Importance is determined by a source's contribution to variation
Variation is derived from sums of squares and degrees of freedom

Analysis of Variance - Example
Model: y_i = µ + e_i, with µ = mean (fixed) and e_i = residual, random (causes variation)
Var(y) = Σ_i (y_i − ȳ)² / (n − 1)
Calculating the residual sum of squares: SSE = Σ_i e_i²
Equate the SS to its expectation: E(SSE) = (n − 1)·σ_e²

Analysis of Variance - Example
Data: y = [8, 9, 11, 12]
Model: y_i = µ + e_i
Sums of squares:
Total: 8² + 9² + 11² + 12² = 410
Mean: 4 · 10² = 400
Residual: SSE = 10 (= (−2)² + (−1)² + 1² + 2²)

Analysis of Variance - Example
Data: y = [8, 9, 11, 12], in two classes (i = 1, 2)
Model: y_ij = µ + a_i + e_ij
Estimates: µ̂ = 10, â_1 = −1.5, â_2 = +1.5
Sums of squares:
Observed:    8     9    11    12   → 410 (SS Total)
Mean:       10    10    10    10   → 400 (SS Mean)
a-effect:  −1.5  −1.5  +1.5  +1.5  → 9   (SSA)
Residual:  −0.5  +0.5  −0.5  +0.5  → 1   (SSE)

ANOVA table
           SS    df   MS    EMS (expected mean squares)
Mean       400   1
A-effect   9     1    9     σ_e² + 2·σ_a²
Residual   1     2    0.5   σ_e²
Total      410   4
The coefficient 2 in the EMS is the number of records per class.
Note: the a-effect is a classification of the data, e.g. according to sires (half-sib groups). It relates to the variance between groups; the residual relates to the variance within groups.

Group (e.g. sire) differences relate to the variance between groups; residual differences relate to the variance within groups.
From the ANOVA table (groups 8, 9 and 11, 12):
σ̂_a² = (MSA − MSE)/2 = (9 − 0.5)/2 = 4.25, so σ̂_a ≈ 2.1
σ̂_e² = 0.5, so σ̂_e ≈ 0.71
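The worked example can be reproduced in a few lines. This is a sketch under the assumption that the example data are y = [8, 9, 11, 12] split into two (e.g. sire) groups of two records; the function name is arbitrary:

```python
# Balanced one-way ANOVA estimation of variance components,
# following the worked example: y = [8, 9, 11, 12] in two groups
# (an assumption about the partly garbled slide data).

def anova_components(groups):
    """Returns (sigma2_e, sigma2_a) for a balanced one-way classification."""
    n = len(groups[0])                         # records per group
    all_y = [y for g in groups for y in g]
    grand = sum(all_y) / len(all_y)
    group_means = [sum(g) / n for g in groups]
    # Sums of squares between and within groups
    ssa = sum(n * (m - grand) ** 2 for m in group_means)
    sse = sum((y - m) ** 2
              for g, m in zip(groups, group_means) for y in g)
    # Mean squares; expectations: E(MSA) = s2e + n*s2a, E(MSE) = s2e
    msa = ssa / (len(groups) - 1)
    mse = sse / (len(all_y) - len(groups))
    return mse, (msa - mse) / n

s2e, s2a = anova_components([[8, 9], [11, 12]])
print(s2e, s2a)   # 0.5 4.25
```

Equating the observed mean squares to their expectations is exactly the "equate SS to its expectation" step from the earlier slide.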

Summarizing the procedure
Modeling (general): data = fixed effects + random effects
E(y) = fixed-effect means; Var(y) = variance due to the random effects
Interpretation:
Statistically: need sufficient data; need to think about the data structure; sampling conditions need to be fulfilled (random sampling?)
Genetically: translate the components into meaningful parameters (e.g. sire variance = ¼ V_A)

Accuracy: SE of the heritability estimate
Nr. of records   True h² = 0.1   True h² = 0.3
100              0.18            0.30
500              0.08            0.14
1000             0.06            0.10
5000             0.03            0.04

Methods for variance component estimation
- ANOVA - balanced data
- ANOVA - unbalanced data: Henderson's methods (SAS etc.)
- Likelihood methods: Maximum Likelihood, Restricted Maximum Likelihood (REML)
- Bayesian methods: Gibbs sampling

ANOVA in unbalanced data
Same idea as in the balanced case, but use a weighted number for n in: EMS_A = σ_e² + n̄·σ_a²
Matrix notation is needed to work out the SS and EMS (as in linear models)
This is the standard method in computer programs such as SAS, Harvey, SPSS etc.
The most general of these approaches is called Henderson's Method III
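For a one-way classification the weighted group size n̄ has a standard closed form, n̄ = (N − Σn_i²/N) / (s − 1), where N is the total number of records and s the number of groups. A small sketch (function name illustrative):

```python
# Weighted "number per group" for the unbalanced one-way ANOVA:
# E(MS_between) = s2e + k * s2a, with k = (N - sum(n_i^2)/N) / (s - 1).

def weighted_n(sizes):
    """sizes: list of group sizes n_i; returns the EMS coefficient k."""
    N = sum(sizes)
    s = len(sizes)
    return (N - sum(n * n for n in sizes) / N) / (s - 1)

print(weighted_n([10, 10, 10]))  # 10.0 (balanced: k = common group size)
print(weighted_n([4, 10, 16]))   # 8.8  (unbalanced: k < mean group size)
```

For balanced data k reduces to the common group size, so the balanced formula is a special case.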

Likelihood methods
Each observation has a probability density, determined by its distribution:
- expected value (e.g. mean): location parameter
- variance: dispersion parameter
E.g. for y with a normal distribution with mean µ and variance σ²:
f(y) = 1/(σ·√(2π)) · exp(−(y − µ)² / (2σ²))
This is the probability density function (PDF) of the observation: it gives the density of the observation given the parameters µ and σ². Turned around, it gives the likelihood of the parameters given y.

Likelihood methods
We can multiply these densities over the whole data set, accounting for the fact that some observations may be related, i.e. we use a joint distribution.
Data vector y with expected means E(y) = Xb and Var(y) = V. The log-likelihood is:
L(b, V | X, y) = −(N/2)·log(2π) − ½·log|V| − ½·(y − Xb)′V⁻¹(y − Xb)
This expression gives the likelihood of the parameters (b, V) given the data (X, y): the first term is a constant, the second involves the variance structure, and the last is a generalized sum of squares.
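As a minimal illustration, the sketch below evaluates this log-likelihood for the simplest covariance structure, V = σ²I, in which case log|V| reduces to N·log σ² and the quadratic form to a residual sum of squares over σ². The data values are illustrative:

```python
import math

# Gaussian log-likelihood for the special case V = s2 * I:
# L = -N/2*log(2*pi) - 1/2*log|V| - 1/2*(y - Xb)' V^-1 (y - Xb),
# with Xb collapsed to a single scalar mean mu for illustration.

def loglik_iid(y, mu, s2):
    N = len(y)
    rss = sum((yi - mu) ** 2 for yi in y)   # residual sum of squares
    return (-0.5 * N * math.log(2 * math.pi)
            - 0.5 * N * math.log(s2)        # log|V| = N * log(s2)
            - 0.5 * rss / s2)

print(round(loglik_iid([8, 9, 11, 12], mu=10.0, s2=2.5), 3))  # -7.508
```

Maximizing this expression over the parameters is what ML (and, after absorbing the fixed effects, REML) does numerically.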

Restricted Maximum Likelihood (REML)
First correct the data for all fixed effects, then find the maximum likelihood solution for the variance components after these corrections.
Usually an iterative procedure is used to solve the problem; starting values for the parameters are needed to get going.

An example of a REML algorithm (the EM algorithm, for illustration only)
1. Solve the mixed model equations, using a prior value for the variance ratio λ = σ_e²/σ_a²:
   [ X′X   X′Z        ] [ b̂ ]   [ X′y ]
   [ Z′X   Z′Z + λA⁻¹ ] [ â ] = [ Z′y ]
2. Solve the variance components from the MME solutions:
   σ̂_a² = [ â′A⁻¹â + tr(A⁻¹C^aa)·σ̂_e² ] / q
   σ̂_e² = [ y′y − b̂′X′y − â′Z′y ] / (N − r(X))
   where C^aa is the random-effects block of the inverse of the MME coefficient matrix and q is the number of levels of a.
3. Compute a new λ (= σ̂_e²/σ̂_a²) and iterate between steps 1 and 2.
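The two-step EM iteration can be sketched end to end for the simplest possible case: a one-way model y = µ + Za + e with unrelated group effects, so A = I and the A⁻¹ terms drop out. The data and grouping follow the earlier worked example (assumed y = [8, 9, 11, 12] in two groups); this is an illustrative sketch, not production REML code:

```python
# Minimal EM-REML iteration for y = mu + Z a + e with A = I (unrelated
# group effects). Data follow the earlier worked example (an assumption).

def inverse(M):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))  # partial pivot
        A[col], A[p] = A[p], A[col]
        piv = A[col][col]
        A[col] = [x / piv for x in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

def em_reml(y, groups, s2e=1.0, s2a=1.0, rounds=200):
    N, q = len(y), len(set(groups))
    Z = [[float(g == j) for j in range(q)] for g in groups]
    for _ in range(rounds):
        lam = s2e / s2a
        # Step 1: build and solve the MME [[X'X, X'Z], [Z'X, Z'Z + lam*I]]
        M = [[float(N)] + [sum(col) for col in zip(*Z)]]
        for j in range(q):
            row = [sum(Z[i][j] for i in range(N))]
            row += [sum(Z[i][j] * Z[i][k] for i in range(N)) + lam * (j == k)
                    for k in range(q)]
            M.append(row)
        rhs = [sum(y)] + [sum(Z[i][j] * y[i] for i in range(N))
                          for j in range(q)]
        C = inverse(M)
        sol = [sum(C[r][c] * rhs[c] for c in range(q + 1))
               for r in range(q + 1)]
        mu, a = sol[0], sol[1:]
        # Step 2: update the variance components from the MME solutions
        s2e = (sum(v * v for v in y) - mu * rhs[0]
               - sum(a[j] * rhs[1 + j] for j in range(q))) / (N - 1)
        s2a = (sum(v * v for v in a)
               + s2e * sum(C[1 + j][1 + j] for j in range(q))) / q
    return s2e, s2a

s2e, s2a = em_reml([8, 9, 11, 12], [0, 0, 1, 1])
print(round(s2e, 3), round(s2a, 3))   # 0.5 4.25
```

For this balanced toy data set the iteration converges to the ANOVA answers (σ̂_e² = 0.5, σ̂_a² = 4.25), as REML should whenever the ANOVA estimates lie inside the parameter space; the EM algorithm is known to do so slowly, which is why second-derivative methods are listed next.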

REML algorithms
- EM algorithm (first derivatives)
- Fisher scoring (second derivatives)
- Derivative-free
- Average information algorithm
Programs: ASREML, DFREML, VCE

Why is REML better than ANOVA from SAS?
- It is by definition more accurate
- It uses the full mixed model equations, so it can utilize all animal relationships (animal model)
- It therefore has many of the properties of BLUP, e.g. it accounts for selection
- It allows more complicated mixed models (maternal effects, multiple traits etc.), as with BLUP

Further notes on the REML procedure
If an animal model is used, heritability is estimated by naturally combining:
- information between families (HS/FS)
- information from parent-offspring regression
The method and model are very flexible, but it can be hard to evaluate the estimates based on the data and the data structure, e.g. is there a good family structure?

Evaluating the quality of the parameter estimates
Accuracy: look at the SE of the estimates (although these are approximated!); evaluate the effect of the number of records and of the structure (nr. of groups vs. nr. per group)
Unbiasedness: from the data and the possible effects, evaluate whether there was bias from selection or from confounding effects

Example: analysis of weaning weight for White Suffolk, with data on 9700 animals and 5,000 in the pedigree.
Comparison of including or not including the correlation between direct genetic and maternal effects, and the effect of ignoring maternal effects (estimate, with SE in parentheses):

                          Corr. A-M included   No correlation   No maternal effect
Phenotypic variance       3.45                 3.6              3.94
Heritability (direct)     0.5 (0.04)           0.9 (0.03)       0.44 (0.03)
Maternal heritability     0.8 (0.04)           0.8 (0.0)        -
Corr. direct-maternal     -0.44 (0.0)          -                -

Example: analysis of weaning weight for White Suffolk, with data on 9700 animals and 5,000 in the pedigree.
The effect of ignoring or including a permanent environmental effect (PE) of dams (estimate, with SE in parentheses):

                          With PE        Without PE
Phenotypic variance       3.06           3.45
Heritability (direct)     0.5 (0.04)     0.5 (0.04)
Maternal heritability     0.3 (0.04)     0.8 (0.04)
Corr. maternal-direct     -0.50 (0.)     -0.44 (0.0)
Permanent env. (ewe)      0. (0.0)       -

Comparing the different models: the likelihood ratio test
LR = −2 · ln[ Max. likelihood(reduced model) / Max. likelihood(full model) ]
LR follows a chi-squared distribution with df = the difference in the number of parameters between the two models.
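The test is easy to script from two fitted log-likelihoods. The sketch below uses hypothetical log-likelihood values and a small table of 5% chi-squared critical values (3.841 for df = 1, etc.); the function name is illustrative:

```python
# Likelihood ratio test between nested models:
# LR = -2 * ln( L(reduced) / L(full) ), compared with a chi-squared
# critical value whose df is the difference in number of parameters.

def lr_test(loglik_reduced, loglik_full, extra_params):
    lr = -2.0 * (loglik_reduced - loglik_full)
    # 5% chi-squared critical values for df = 1, 2, 3 (table constants)
    crit = {1: 3.841, 2: 5.991, 3: 7.815}[extra_params]
    return lr, lr > crit

# Hypothetical values: e.g. a model with vs. without a maternal effect
lr, significant = lr_test(-105.2, -101.0, extra_params=1)
print(round(lr, 2), significant)   # 8.4 True
```

A significant LR means the extra variance component in the full model (e.g. a maternal or permanent environmental effect, as in the tables above) improves the fit beyond chance.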

Designs needed for variance component estimation
- Additive genetic: family structure / animals
- Permanent environment: repeated measurements
- Maternal: mothers with more progeny
- Maternal genetic: family structure among mothers
- Correlation V_a ~ V_m: records on the dam(?)