CH927 Quantitative Genomics Lecture 2. How can quantitative traits be mapped?

Transcription

1 CH927 Quantitative Genomics Lecture 2 How can quantitative traits be mapped?

2 Lecture objectives
By the end of this lecture you should be able to explain:
- What the main steps in QTL mapping are
- What the different methods for QTL analysis are (Single marker, Interval/Flanking Mapping, Composite Interval Mapping (CIM), Multiple Interval Mapping (MIM)) and under which experimental conditions they should be used
- What the different statistical methods for QTL analysis are (t-test, ANOVA, multiple regression, linear regression) and what they can predict about QTLs

3 The basis for QTL mapping has been known for over 70 years, but a lack of genetic markers prevented widespread use until the mid-1980s. With DNA sequencing, the number and density of markers have grown. More statistically sophisticated mapping methods have also been developed.
1. Score a population for (i) a trait, and (ii) distribution of genome markers
2. Identify regions of the genome containing QTLs based on occurrence of a phenotype-marker association that is significantly more likely than chance

4 Association of phenotypes with markers
[Figure: individuals scored at markers A/a and B/b and for phenotype G/g.]
A/a and B/b = molecular scores; G/g = phenotypic score.
Results from marker A/a: suggest that the gene is very close to the marker.
Results from marker B/b: suggest that the gene is not linked to the marker.

5 This is a generalisation of the principle, but for only one gene. We need to consider multiple Quantitative Trait Loci.
[Figure: the same scoring scheme extended to several loci on two linkage groups (letters A/a, G/g, C/c, H/h and B/b, J/j, K/k, D/d).]

6 Objectives of QTL analysis
1. Score a population for (i) a trait, and (ii) distribution of genome markers
2. Identify regions of the genome containing QTLs based on occurrence of a phenotype-marker association that is significantly more likely than chance
3. Estimate the effects of the QTLs on the quantitative trait:
- many genes with small effect each or few genes with large effect each?
- their effects on the trait: is gene action additive or dominant?
- their positions in the genome: linkage and association, epistasis
- their interaction with the environment
4. Identify candidate genes underlying the QTL and thus the trait

7 QTL analysis can be classified by the type of progeny used
All of the different progenies are derived from the same reference population. From this reference population different progenies can be produced.
M = marker genotype; Q = QTL genotype.
[Figure: crossing scheme. P1 (MMQQ) x P2 (mmqq) gives F1 (MmQq). Selfing the F1 gives the F2; five further generations of selfing give F7 (RILs). The diagram also shows SI lines and testcrosses TC1 to TC4 made with P3 and P4.]

8 Backcrosses and Near Isogenic Lines (NILs)
BC1 (Backcross 1) F1: use for QTL mapping. Rapid generation of material for QTL analysis.
Near Isogenic Lines: isolate part of genome A of interest.
[Figure: backcrossing scheme with the labels F2, x P2, self, BC1 F1, BC1 F2, BC1 F3, BC2 F1, BC2 F2, BC2 F3 and Near Isogenic Lines.]

9 To map a quantitative trait:
1. Make a cross and generate marker data
- Type of mapping population (e.g. RIL)
2. Generate linkage maps
- Genome size, genome coverage
3. Collect phenotypic measurements
- Evaluate in uniform environment
- Evaluate in multiple environments
- Data transformation (approach normal distribution)
[Figure: frequency distributions of trait value for genotypes A1/A1, A1/A2 and A2/A2.]
Total variance $V_T = V_G + V_E$ (genetic variance + environmental variance). Environmental variance comes from a heterogeneous environment, stochastic events and measurement error.
This assumes genes act additively (i.e. no epistasis) and that their effects are not conditional on environment; otherwise $V_T = V_G + V_E + V_{G \times G} + V_{G \times E}$.


11 4. The statistical machinery for QTL mapping
Several analysis frameworks for marker-QTL associations:
- Single marker tests (t-test, F-test or Linear Regression)
- Interval/Flanking Mapping (IM) (pair of markers simultaneously)
- Composite Interval Mapping (CIM) (analysis of a marker interval, flanked by adjacent markers, ML-based)
- Multiple Interval Mapping (MIM)

12 4. The statistical machinery for QTL mapping
Four main analysis techniques:
- Simple t-test: use to evaluate presence of a QTL through statistical differences between two marker genotypes
- ANOVA (marker regression): detects marker differences when there are more than two marker genotypes. Produces a ranking of genotypes, in order of phenotypic effect for the trait of interest, and tests for significant differences between each genotype
- Multiple regression: simple remodelling of the ANOVA technique in regression terms, with the same ranking and testing for differences
- Linear regression: most complex point analysis method, allowing different characteristics of the QTL to be investigated, including dominance effects, additive effects, genotype-environment interactions and epistasis

13 Probabilities and t-tests

14 Basic mapping format: conditional probabilities
The conditional probability that the QTL genotype is $Q_k$ (e.g. Qq), given that the marker genotype is $M_j$ (e.g. Mm):
$\Pr(Q_k \mid M_j) = \Pr(Q_k M_j) / \Pr(M_j)$
Cross: P1 (MM QQ) x P2 (mm qq) gives F1 (Mm Qq), which is selfed to give the F2. Marker genotypes = observed; QTL genotypes = missing.
Calculate this in an F2 from:
- gamete frequencies
- marker genotype probabilities
Consider a QTL linked to a marker (recombination fraction = $c$). For the gametes that form the F2:
$\mathrm{freq}(MQ) = \mathrm{freq}(mq) = (1-c)/2$
$\mathrm{freq}(Mq) = \mathrm{freq}(mQ) = c/2$

15 Basic mapping format: conditional probabilities
For the gametes that form the F2:
$\mathrm{freq}(MQ) = \mathrm{freq}(mq) = (1-c)/2$
$\mathrm{freq}(Mq) = \mathrm{freq}(mQ) = c/2$
Hence:
$\Pr(MMQQ) = \Pr(MQ)\Pr(MQ) = (1-c)^2/4$
$\Pr(MMQq) = 2\Pr(MQ)\Pr(Mq) = 2c(1-c)/4$
$\Pr(MMqq) = \Pr(Mq)\Pr(Mq) = c^2/4$
Since $\Pr(MM) = 1/4$, the conditional probabilities become:
$\Pr(QQ \mid MM) = \Pr(MMQQ)/\Pr(MM) = (1-c)^2$
$\Pr(Qq \mid MM) = \Pr(MMQq)/\Pr(MM) = 2c(1-c)$
$\Pr(qq \mid MM) = \Pr(MMqq)/\Pr(MM) = c^2$
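
The calculation on this slide can be checked numerically. The following is a minimal Python sketch, not part of the original lecture; the function name `f2_qtl_given_mm` and the value c = 0.1 are illustrative assumptions.

```python
def f2_qtl_given_mm(c):
    """Return Pr(QQ|MM), Pr(Qq|MM), Pr(qq|MM) in an F2 for recombination fraction c."""
    freq_MQ = (1 - c) / 2   # parental (non-recombinant) gamete
    freq_Mq = c / 2         # recombinant gamete

    # Joint probabilities of marker genotype MM with each QTL genotype
    joint = {
        "QQ": freq_MQ * freq_MQ,
        "Qq": 2 * freq_MQ * freq_Mq,
        "qq": freq_Mq * freq_Mq,
    }
    pr_MM = sum(joint.values())  # equals 1/4 whatever the value of c

    # Conditional probability = joint probability / marker genotype probability
    return {genotype: p / pr_MM for genotype, p in joint.items()}


probs = f2_qtl_given_mm(0.1)  # c = 0.1 is an arbitrary example value
for genotype, p in probs.items():
    print(genotype, round(p, 4))
```

With tight linkage (small c) almost all MM individuals are QQ, which is why the marker genotype is informative about the unobserved QTL genotype.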

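16 Using a t-test to probe a QTL
e.g. a backcross with two genes: a marker (alleles M, m) and a QTL (alleles Q, q), linked with recombination fraction $c$.
Genotype: MmQq, Mmqq, mmQq, mmqq
Frequency: $(1-c)/2$, $c/2$, $c/2$, $(1-c)/2$
Mean effect: $m+a$, $m$, $m+a$, $m$
Each marker class has total frequency 1/2, so within a marker class the QTL genotypes occur with frequencies $(1-c)$ and $c$.
Mean of marker genotype Mm: $m_1 = (1-c)(m+a) + c\,m = m + (1-c)a$
Mean of marker genotype mm: $m_0 = c(m+a) + (1-c)m = m + ca$
The difference between the marker classes is therefore $m_1 - m_0 = (1-2c)a$.
If the trait mean is significantly different for the genotypes at a marker locus, the marker is linked to a QTL.
A small difference between the marker classes can mean either a small effect with tight linkage or a large effect with loose linkage.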
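
As an illustration of the single-marker t-test, here is a minimal Python sketch using only the standard library. The trait values are made-up numbers, and the function names are assumptions introduced for this example; a pooled-variance two-sample t statistic is used, which is one common choice rather than the only one.

```python
from math import sqrt
from statistics import mean, variance


def expected_marker_difference(a, c):
    """Expected difference between marker classes Mm and mm in a backcross."""
    return (1 - 2 * c) * a


def pooled_t(group1, group0):
    """Two-sample t statistic with pooled variance (d.f. = n1 + n0 - 2)."""
    n1, n0 = len(group1), len(group0)
    pooled_var = ((n1 - 1) * variance(group1) + (n0 - 1) * variance(group0)) / (n1 + n0 - 2)
    return (mean(group1) - mean(group0)) / sqrt(pooled_var * (1 / n1 + 1 / n0))


# Hypothetical trait values for the two marker classes
trait_Mm = [12.1, 11.4, 13.0, 12.6, 11.9]
trait_mm = [10.2, 10.9, 9.8, 10.5, 10.1]

t = pooled_t(trait_Mm, trait_mm)
print("t =", round(t, 3), "on", len(trait_Mm) + len(trait_mm) - 2, "d.f.")

# The same QTL effect looks smaller at a more distant marker
print(expected_marker_difference(a=2.0, c=0.1))
print(expected_marker_difference(a=2.0, c=0.4))
```

The t value is then compared with the tabulated critical value for the stated degrees of freedom.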

17 ANOVA and single marker regression

18 Partitioning of variance: a simple ANOVA model
Partition the variance into genetically determined and environmental components.
The model (there is a QTL linked to a marker) is tested against the null hypothesis of no QTL.
[Figure: trait value plotted by genotype (A1/A1, A1/A2, A2/A2).]

19 Partitioning of variance methodology
Total sum of squares, $SS_T$: calculate the grand mean and the deviation of each individual from it, square each deviation, and sum the squared deviations over the population.
Total mean square, $MS_T = SS_T / \text{d.f.}$, with d.f. $= n-1$; this is the total variance. Here $n = 23$.
[Figure: trait value by genotype (A1/A1, A1/A2, A2/A2) with the grand mean marked.]

20 Partitioning of variance: fitting the model
Calculate the mean for each genotype group.
$SS_R$ = residual sum of squares = sum of (deviation of each individual from its genotype mean)$^2$.
Residual mean square, $MS_R = SS_R / \text{d.f.}$, with d.f. $= (n-1) - (\text{number of genotypes} - 1) = n - \text{number of genotypes}$; this is the variance not explained by the model (i.e. not explained by this QTL).
[Figure: trait value by genotype (A1/A1, A1/A2, A2/A2) with the grand mean and genotype means marked.]

21 Genetic variance and testing the model
Model sum of squares, $SS_M$ = sum over genotypes of (grand mean - genotype mean)$^2$ x (number of individuals with that genotype).
Genetic variance, $MS_M = SS_M / \text{d.f.}$, with d.f. $= 2$ (number of genotypes - 1).
Since $SS_T = SS_M + SS_R$, it is easier to calculate $SS_M = SS_T - SS_R$ and then divide by the degrees of freedom. The sums of squares are additive; the mean squares are not.

22 Genetic variance and testing the model
To test whether the QTL explains a significant amount of the variation, calculate:
- Model to residual variance, F-ratio $= MS_M / MS_R$
- Proportion of variance explained by the QTL $= SS_M / SS_T$
Look up the minimum value of F that is unlikely to have occurred by chance, given 2 d.f. for $MS_M$ and 20 for $MS_R$ ($F = 3.49$ for $p = 0.05$ in this case). If F exceeds this value, we can reject the null hypothesis of no QTL.
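
The partitioning described on slides 19 to 22 can be sketched in a few lines of Python. This is an illustrative sketch with made-up trait values (nine individuals, three per genotype), not the lecture's 23-individual data set; the function name `one_way_anova` is an assumption.

```python
def one_way_anova(groups):
    """Partition trait variation by marker genotype and return the F-ratio."""
    values = [y for ys in groups.values() for y in ys]
    n = len(values)        # number of individuals
    k = len(groups)        # number of genotype classes
    grand_mean = sum(values) / n

    # Total sum of squares: deviations from the grand mean
    ss_t = sum((y - grand_mean) ** 2 for y in values)

    # Residual sum of squares: deviations from each genotype mean
    ss_r = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        ss_r += sum((y - group_mean) ** 2 for y in ys)

    ss_m = ss_t - ss_r     # model sum of squares (sums of squares are additive)
    ms_m = ss_m / (k - 1)  # model mean square
    ms_r = ss_r / (n - k)  # residual mean square
    return {
        "SS_T": ss_t, "SS_R": ss_r, "SS_M": ss_m,
        "F": ms_m / ms_r,
        "df": (k - 1, n - k),
        "explained": ss_m / ss_t,
    }


# Hypothetical trait values grouped by marker genotype
data = {
    "A1/A1": [10.0, 11.0, 12.0],
    "A1/A2": [13.0, 14.0, 15.0],
    "A2/A2": [16.0, 17.0, 18.0],
}
result = one_way_anova(data)
print(result)
```

The F-ratio is compared with the tabulated critical value for the returned degrees of freedom.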

23 This is essentially a least-squares regression
Incorporate terms into the model to estimate:
- The additive effect of the alleles, $a$ = half the difference between the averages for the two homozygotes. It can be positive or negative, depending on which allele is being considered.
- The dominance deviation, $d$ = the difference between the heterozygote average and the mid-point of the homozygotes. It can also be positive or negative.
If $|d| = |a|$, one allele is completely dominant. If $|d| > |a|$, the locus shows over-dominance.

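24 Estimation of additive and dominance effects
Backcross classes as before: MmQq, Mmqq, mmQq, mmqq, with frequencies $(1-c)/2$, $c/2$, $c/2$, $(1-c)/2$ and mean effects $m+a$, $m$, $m+a$, $m$.
Estimating dominance needs three marker classes (as in an F2). Write $m_1$ and $m_0$ for the means of the two homozygous marker classes and $m_2$ for the mean of the heterozygous marker class; $a^*$ = estimated additive effect, $d^*$ = estimated dominance effect.
Additive effect ($a$): $(m_1 - m_0)/2 = a(1-2c) = a^*$
Dominance effect ($d$): $m_2 - (m_1 + m_0)/2 = d(1-2c)^2 = d^*$
Both estimates shrink towards zero as the recombination fraction $c$ between marker and QTL increases.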
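
The effect of recombination on the single-marker estimates can be checked numerically. The sketch below assumes an F2 with QTL genotypic values m + a (QQ), m + d (Qq) and m - a (qq); this parameterisation, the function name and the numbers are assumptions made for illustration.

```python
def f2_marker_means(m, a, d, c):
    """Expected trait mean of each F2 marker class for a QTL at recombination fraction c."""
    # Pr(QQ), Pr(Qq), Pr(qq) conditional on the marker genotype
    conditional = {
        "MM": ((1 - c) ** 2, 2 * c * (1 - c), c ** 2),
        "Mm": (c * (1 - c), (1 - c) ** 2 + c ** 2, c * (1 - c)),
        "mm": (c ** 2, 2 * c * (1 - c), (1 - c) ** 2),
    }
    genotypic_values = (m + a, m + d, m - a)  # QQ, Qq, qq (assumed parameterisation)
    return {
        marker: sum(p * g for p, g in zip(probs, genotypic_values))
        for marker, probs in conditional.items()
    }


m, a, d, c = 10.0, 2.0, 1.0, 0.1  # arbitrary example values
means = f2_marker_means(m, a, d, c)

a_star = (means["MM"] - means["mm"]) / 2
d_star = means["Mm"] - (means["MM"] + means["mm"]) / 2

print("a* =", round(a_star, 4), "vs a(1-2c) =", round(a * (1 - 2 * c), 4))
print("d* =", round(d_star, 4), "vs d(1-2c)^2 =", round(d * (1 - 2 * c) ** 2, 4))
```

Because a single marker gives only a* and d*, the QTL effect and its distance from the marker cannot be separated without more information.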

25 Linear Models for QTL Detection
Uses the linear relationship between the apparent effects of a marker on a quantitative character and the underlying effects of all QTLs linked to that marker:
$y_{mk} = \mu + b_m + e_{mk}$
where $y_{mk}$ is the value of the trait in the $k$th individual of marker genotype $m$, $\mu$ is the overall mean, $b_m$ is the effect of marker genotype $m$ on the trait value, and $e_{mk}$ is the residual.
Differences in the distance between the QTL and the markers alter the terms in this relationship.
Detection: a QTL is linked to the marker if at least one of the $b_m$ is significantly different from zero.
Estimation (QTL effect and position): have to relate the $b_m$ to the QTL effects and map position.

26 Detecting epistasis
One major advantage of linear models is their flexibility.
Test for epistasis between two QTLs: use an ANOVA with an interaction term:
$y = \mu + a_i + b_k + d_{ik} + e$
- $a_i$: effect from the marker genotype at the first marker set (can be more than one locus)
- $b_k$: effect from the marker genotype at the second marker set
- $d_{ik}$: interaction between marker genotype $i$ in the first marker set and $k$ in the second marker set
Interpretation:
- At least one of the $a_i$ significantly different from 0: QTL linked to the first marker set
- At least one of the $b_k$ significantly different from 0: QTL linked to the second marker set
- At least one of the $d_{ik}$ significantly different from 0: interactions between QTL in sets 1 and 2
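
For two markers with two genotypes each, the interaction term reduces to a single contrast between the four cell means. The sketch below only estimates that contrast from made-up data; it does not perform the significance test, and the function name and values are assumptions.

```python
from statistics import mean


def interaction_contrast(cells):
    """Difference between the effect of marker 1 in the two classes of marker 2.

    cells maps (genotype at marker 1, genotype at marker 2) to a list of trait values.
    A value near zero is what a purely additive model predicts.
    """
    cell_mean = {key: mean(values) for key, values in cells.items()}
    effect_with_marker2 = cell_mean[(1, 1)] - cell_mean[(0, 1)]
    effect_without_marker2 = cell_mean[(1, 0)] - cell_mean[(0, 0)]
    return effect_with_marker2 - effect_without_marker2


# Hypothetical data: effects simply add up
additive = {
    (0, 0): [9.5, 10.5], (1, 0): [11.5, 12.5],
    (0, 1): [12.5, 13.5], (1, 1): [14.5, 15.5],
}
# Hypothetical data: the double class is larger than additivity predicts
epistatic = {
    (0, 0): [9.5, 10.5], (1, 0): [11.5, 12.5],
    (0, 1): [12.5, 13.5], (1, 1): [18.5, 19.5],
}

print(interaction_contrast(additive))
print(interaction_contrast(epistatic))
```

In a full analysis the contrast would be tested against the residual variance, as in the ANOVA above.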

27 Interval mapping and marker regression

28 Problems with single marker mapping using ANOVA
If marker density is high, ANOVA with individual marker genotypes is effective: single marker analysis or single marker regression.
Three important weaknesses:
- It does not give separate estimates of QTL location and QTL effect
- Individuals whose genotypes are missing at the marker must be discarded
- When markers are sparse, the QTL may be quite far from all markers, and so the power for QTL detection will decrease

29 Interval mapping
Can use probability estimates for the genotypes in intervals between markers.
Move the QTL position every 2 cM from M1 to M2 and draw the profile of the F value. The peak of the profile corresponds to the best estimate of the QTL position.
[Figure: F-value profile against testing position along markers M1 to M5.]
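
The probability estimates mentioned above can be sketched for a backcross as follows. This is an illustrative Python sketch that assumes no crossover interference and the Haldane map function; the 20 cM interval, the genotype coding (1 for Mm, 0 for mm) and the function names are assumptions.

```python
from math import exp


def haldane(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1 - exp(-2 * distance_cm / 100))


def pr_qtl_het(g1, g2, position_cm, interval_cm):
    """Pr(QTL genotype is Qq | flanking marker genotypes) in a backcross.

    g1, g2: genotypes at the left and right marker (1 = Mm, 0 = mm).
    position_cm: distance of the tested position from the left marker.
    """
    r1 = haldane(position_cm)                # left marker to tested position
    r2 = haldane(interval_cm - position_cm)  # tested position to right marker

    # Probability of the observed marker genotypes if the QTL is Qq ...
    if_qq_het = ((1 - r1) if g1 == 1 else r1) * ((1 - r2) if g2 == 1 else r2)
    # ... and if the QTL is qq (the prior 1/2 for each QTL genotype cancels)
    if_qq_hom = (r1 if g1 == 1 else (1 - r1)) * (r2 if g2 == 1 else (1 - r2))
    return if_qq_het / (if_qq_het + if_qq_hom)


interval = 20  # cM between M1 and M2 (example value)
positions = list(range(0, interval + 1, 2))  # test every 2 cM

# A recombinant individual: Mm at M1, mm at M2
profile = [pr_qtl_het(1, 0, pos, interval) for pos in positions]
for pos, p in zip(positions, profile):
    print(pos, round(p, 3))
```

For a recombinant individual the probability moves from 1 at the left marker to 0 at the right marker, which is the information that lets the scan locate the QTL within the interval.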

30 Interval mapping implementation
Carry out a QTL scan step-wise: once a significant QTL has been identified, other markers are tested for their ability to explain the residual variation.
Known QTL are said to be fixed, or co-factors, in the regression.
[Figure: F-ratio profile from interval mapping by regression (QTL Express), with significant peaks marked by asterisks.]

31 Interval mapping with regression approach
Consider a marker interval M1-M2. We assume that a QTL is located at a particular position between the two markers ($r_1$ and $\theta$ are fixed).
With response variable $y_i$ and independent variable $x^*_i$, a regression model is constructed. The phenotypic value for individual $i$ affected by a QTL can be expressed as:
$y_i = \mu + a^* x^*_i + e_i, \quad i = 1, \ldots, n$ (latent model)
- $\mu$ is the overall mean
- $x^*_i$ is the indicator variable for QTL genotypes: $x^*_i = 1$ for Qq; 0 for qq
- $a^*$ is the additive effect of the putative QTL on the trait
- $e_i$ is the residual error, $e_i \sim N(0, \sigma^2)$
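
Because the QTL genotype is not observed, one common approximation replaces the indicator with its expected value given the flanking markers and fits an ordinary regression at each tested position. The sketch below shows that regression step only; the expected-genotype values and trait values are made up, and the function name is an assumption.

```python
def regress(x, y):
    """Least-squares fit of y = mu + a_star * x; returns mu, a_star, RSS and F (1 and n-2 d.f.)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    a_star = sxy / sxx            # estimated QTL effect
    mu = y_bar - a_star * x_bar   # estimated overall mean
    rss = sum((yi - mu - a_star * xi) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)

    f_ratio = (ss_total - rss) / (rss / (n - 2))
    return mu, a_star, rss, f_ratio


# Hypothetical expected QTL genotypes (probability of Qq) at one tested position
x_expected = [0.97, 0.97, 0.50, 0.50, 0.03, 0.03, 0.97, 0.03]
trait = [12.4, 11.8, 11.3, 10.6, 10.1, 9.7, 12.9, 10.4]

mu, a_star, rss, f_ratio = regress(x_expected, trait)
print("mu =", round(mu, 3), "a* =", round(a_star, 3), "F =", round(f_ratio, 2))
```

Repeating the fit at each tested position and plotting F against position gives the profile described on slide 29.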

32 Advantages and disadvantages of interval mapping
Advantages:
- the position of the QTL can be inferred by a support interval
- the estimated position and effects of the QTL tend to be asymptotically unbiased if there is only one segregating QTL on a chromosome
- the method requires fewer individuals
Disadvantages:
- this is not an interval test: even when there is no QTL within an interval, the likelihood profile on the interval can still exceed the threshold if there is a QTL nearby
- if there is more than one QTL on a chromosome, the test statistic at the position being tested will be affected by all of the QTL, and the estimated positions and effects are likely to be biased
- it is not efficient to use only two markers at a time for testing

33 Flanking methods and Maximum likelihood

34 Flanking marker methods have been the most popular analysis techniques over recent years
This is due to their accuracy and the level of characterisation of the putative QTL: they combine both detection and estimation of QTL effects and position.
Two basic techniques:
- Maximum likelihood
- Maximum likelihood estimation through regression
Three methods for estimating likelihood:
- Single marker maximum likelihood (least power)
- Flanking marker maximum likelihood (most versatile)
- Order restricted interval mapping (most power)

35 LOD score
Estimating the QTL position ($\theta$): likelihood maps. Two views:
- View $\theta$ as a fixed parameter: assume the QTL is located at a particular position
- View $\theta$ as a variable being estimated (derive the log-likelihood equation for the MLE of $\theta$)
$L_0 / L_A$ = ratio of the likelihood of the null hypothesis (no QTL in the marker interval) to the likelihood of the alternative hypothesis (QTL present).
LOD (log of the odds) $= \log_{10}(L_A / L_0) = -\log_{10}(L_0 / L_A)$, so a larger LOD means stronger support for a QTL.
In each method a likelihood map is produced.
[Figure: LOD profile against chromosome position, showing the significance threshold, the estimated QTL location at the peak, and the support interval.]
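
For a regression model with normally distributed residuals, the LOD score can be computed from the residual sums of squares of the null model (no QTL) and the QTL model. The sketch below assumes that setting and maximum-likelihood variance estimates; the function names and numbers are illustrative assumptions.

```python
from math import log, log10, pi


def normal_max_loglik(rss, n):
    """Maximised natural log-likelihood of a normal model with residual sum of squares rss."""
    sigma2 = rss / n  # maximum-likelihood estimate of the residual variance
    return -0.5 * n * (log(2 * pi * sigma2) + 1)


def lod(rss_null, rss_qtl, n):
    """LOD = log10(L_A / L_0), from the two maximised log-likelihoods."""
    return (normal_max_loglik(rss_qtl, n) - normal_max_loglik(rss_null, n)) / log(10)


# Hypothetical residual sums of squares for n = 100 individuals
n = 100
rss_null = 250.0  # model with the overall mean only
rss_qtl = 200.0   # model with a QTL at the tested position

score = lod(rss_null, rss_qtl, n)
print("LOD =", round(score, 3))

# Equivalent closed form under these assumptions: (n / 2) * log10(RSS_0 / RSS_A)
print(round((n / 2) * log10(rss_null / rss_qtl), 3))
```

A LOD of 3, for example, means the data are 1000 times more likely under the QTL model than under the null model.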

36 Composite interval mapping (CIM)
Uses multiple markers as additional factors (marker cofactors).
[Figure: markers i-1, i, i+1, i+2 along a chromosome, with the interval being mapped highlighted.]
Five different types of markers are considered for the regression model, depending on the characteristics of the chromosome region:
- markers surrounding the QTL of interest
- linked & unlinked markers within the QTL region
- linked & unlinked markers outside the QTL region
Method:
- Predict QTL marker genotype every x cM
- Carry out an LR test for QTL effect every x cM
Combines MLE and multiple regression methods.

37 Permutation testing to determine experiment-wide significance thresholds
Multiple testing problem: how often are random QTL effects of a certain magnitude detected in similar datasets?
Method:
- create a large number of random empirical datasets: take your marker data and randomly reassign the phenotypes back to the marker genotypes
- repeat the QTL detection process
- record the highest LR produced for a random QTL anywhere in the map
- repeat the whole process > 500 times
- record the magnitude of the lowest random QTL observed in the top 5% of LR results = threshold
[Figure: distribution of the highest random LR values, split into the lower 95% and the top 5%.]
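
The permutation procedure can be sketched in Python as follows. To keep the example short, the test statistic is the absolute difference between marker-class means rather than an LR, which is a simplification; the data, the function names and the default of 1000 permutations are assumptions.

```python
import random
from statistics import mean


def marker_stat(genotypes, phenotypes):
    """Absolute difference in trait mean between the two marker classes (stand-in for the LR)."""
    class1 = [y for g, y in zip(genotypes, phenotypes) if g == 1]
    class0 = [y for g, y in zip(genotypes, phenotypes) if g == 0]
    return abs(mean(class1) - mean(class0))


def genome_wide_max(markers, phenotypes):
    """Highest statistic found anywhere in the map."""
    return max(marker_stat(g, phenotypes) for g in markers)


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Lowest value among the top alpha fraction of permutation maxima."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign phenotypes to marker genotypes at random
        maxima.append(genome_wide_max(markers, shuffled))
    maxima.sort()
    n_top = max(1, round(alpha * n_perm))
    return maxima[n_perm - n_top]


# Hypothetical backcross data: 3 markers scored on 12 individuals
markers = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
]
phenotypes = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 10.2, 10.9, 9.8, 10.5, 10.1, 10.7]

observed = genome_wide_max(markers, phenotypes)
threshold = permutation_threshold(markers, phenotypes)
print("observed maximum:", round(observed, 3))
print("5% experiment-wide threshold:", round(threshold, 3))
```

Taking the maximum over the whole map in each permutation is what makes the threshold experiment-wide rather than per-marker.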

38 Multiple interval mapping
Uses multiple marker intervals simultaneously. Aims to map multiple QTLs in a single step.
Method:
- Build regression models which include all QTLs (detected first by CIM)
- Use information content (IC) theory to evaluate alternative models
Allows simultaneous detection and estimation of additive, dominance & epistatic effects.

39 Some examples of the final output

40 Genetics and genomics of post harvest senescence in broccoli
Vicky Buchanan-Wollaston and Dave Pink (Warwick HRI)
JoinMap 2010 broccoli linkage map: plant lines, 211 loci (189 SSR, 22 AFLP)
[Figure: linkage map with numbered linkage groups.]

41 QTLs for senescence traits in broccoli
Two major QTL for time to yellowing confirmed on 2010 broccoli map.
REML calculated: 64.4% of line mean variation is genetic.
[Figure: QTL profiles annotated per QTL with chromosome, LOD, p-value, position (cM) and % of variation; the values that remain legible are 30.6% of variation and LOD 3.8.]
MapQTL permutation test, 10,000 iterations.