LD and Haplotype Analysis Tutorial

Release 8.1
Golden Helix, Inc.
March 8, 2014

Contents

1. Generating LD Plots
   A. Open the Project
   B. Generate a -log10 P-value plot
   C. Add LD Plot
2. Computing Haplotype Blocks
   A. Automatically Computing Haplotype Blocks
   B. Manually Manipulating Haplotype Blocks
3. Comparing Multiple LD Plots
4. Haplotype Frequency Tables
   A. Generating Frequency Tables
5. Haplotype Association Tests
   A. Performing Per Haplotype Association Tests
   B. Performing Per Block Haplotype Association Tests
6. Large-Scale Haplotype Association Testing
   A. Calculating Haplotype Blocks for Chromosome 22
   B. Haplotype Association Tests Using Block Definitions
   C. Plotting Haplotype and Single Marker Association Results Together
7. Haplotype Trend Regression
   A. Full Model Regression
   B. Full vs Reduced Model Regression


Updated: March 8, 2014
Level: Intermediate
Packages: SNP Analysis, Power Seat

This tutorial leads you through various LD and haplotype analyses in SVS 8. For demonstration purposes, a simulated dataset is used, consisting of actual Affymetrix 500K genotypes from the four HapMap populations (Phase II) mapped to the hg19 human reference build (GRCh_37), a simulated case/control status, and a simulated quantitative phenotype.

Caution: This tutorial does not cover quality assurance, and no quality assurance steps have been performed on the data. Because it may be appropriate to filter out markers that deviate from Hardy-Weinberg Equilibrium or that have low call rates or low minor allele frequencies, we recommend applying such filters to your own data before performing LD and haplotype analysis.

Requirements

To follow along you will need to download and unzip the following file, which includes several datasets:

Download LD_and_Haplotype_Tutorial.zip

We hope you enjoy the experience and look forward to your feedback.

1. Generating LD Plots

The general workflow outlined in this tutorial is intended to emulate a study in which one performs a whole-genome scan on individual markers and then homes in on significant regions for a more in-depth investigation of LD and haplotypes.

A. Open the Project

Launch Golden Helix SVS and choose File >Open Project. Navigate to the LD and Haplotypes Analysis.ghp file downloaded previously and click Open.

You'll notice a couple of datasets already created in the project, including a joined spreadsheet of phenotype and genotype data for the HapMap samples (Phenotype Dataset + 500K Genotypes) as well as an association test results spreadsheet (Association Tests (Genotypic Tests)).

B. Generate a -log10 P-value plot

Open the Association Tests (Genotypic Tests) spreadsheet. Right-click on the Chi-Squared -log10 P column (column 2) and select Plot Variable in GenomeBrowse. A p-value plot is created. Notice that there are two regions of significance, one on chr14 and the other on chr22. In this part of the tutorial we will focus on the chr22 region.

Before you move on, go back to the Project Navigator and rename the plot node just created to -log10 P + LD (right-click on the node and select Rename Node).

Now, in the -log10 P + LD plot, paste 22:37,284,796-37,342,082 into the Region: text box at the top of the window. Press Enter to complete the zoom. You should now be zoomed into a region on 22q12.3 (Figure 1a).

C. Add LD Plot

You can add an LD plot to an existing graph from any spreadsheet that contains genotype data whose columns are marker mapped. In this case you want to generate an LD plot from the same genotype spreadsheet used to produce the association test results.

From the -log10 P + LD plot, select File >Add and click the Project button. Select the Phenotype Dataset + 500K Genotypes - Sheet 1 spreadsheet and choose LD. Make sure the screen looks like Figure 1b and then click Plot & Close.
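For readers who want to reproduce the zoom step outside the GUI, the following is a minimal Python sketch of how a region string such as the one above can be parsed and used to filter markers, together with the -log10 transform that the plot's y-axis shows. The function names and the three example markers are hypothetical and are not part of SVS.

```python
import math


def parse_region(text):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = text.strip().split(":")
    start_text, end_text = span.split("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return chrom, start, end


def neg_log10(p_value):
    """Transform a p-value so that smaller p-values plot higher."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


def markers_in_region(markers, region):
    """Keep (chrom, position, p_value) tuples that fall inside the region."""
    chrom, start, end = region
    return [m for m in markers if m[0] == chrom and start <= m[1] <= end]


# Hypothetical markers: (chromosome, position, p-value).
markers = [("22", 37300000, 1e-8), ("22", 36000000, 0.2), ("14", 37300000, 1e-6)]
region = parse_region("22:37,284,796-37,342,082")
for chrom, pos, p in markers_in_region(markers, region):
    print(chrom, pos, round(neg_log10(p), 2))
```

Only the first hypothetical marker lies inside the chr22 window, so it is the only one printed.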

Figure 1a. P-value plot initial zoom

Figure 1b. Add LD Plot to Graph

An LD plot will now appear above the p-value plot and an LD node will appear in the Plot Tree (Figure 1c).

Figure 1c. P-value and LD plot. Notice the apparent block of LD (red) in the middle of the plot, interrupted by a single SNP that is uncorrelated (blue) with the other markers.
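To make the red/blue coloring concrete, here is a small Python sketch of one common pairwise LD measure: the squared Pearson correlation between two SNPs' genotype dosages (0, 1, or 2 copies of an allele). This is an illustrative stand-in; the tutorial does not describe how SVS itself computes LD, and the function name and data are hypothetical.

```python
def r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two SNPs' 0/1/2 dosages.

    Samples with a missing call (None) at either SNP are skipped.
    """
    pairs = [(a, b) for a, b in zip(dosages_a, dosages_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two samples with both genotypes")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    if s_aa == 0 or s_bb == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (s_ab * s_ab) / (s_aa * s_bb)


# Hypothetical dosages for three SNPs across six samples.
snp1 = [0, 1, 2, 0, 1, 2]
snp2 = [0, 1, 2, 0, 1, 2]   # always matches snp1
snp3 = [0, 2, 0, 2, None, 1]

print(r_squared(snp1, snp2))
print(r_squared(snp1, snp3))
```

A value near 1 would correspond to a strongly correlated pair, and a value near 0 to a pair like the single uncorrelated SNP noted in Figure 1c.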

2. Computing Haplotype Blocks

In SVS 8 you can compute haplotype blocks manually via the LD interface or automatically using the Gabriel et al. method. This tutorial will lead you through a combination of both.

A. Automatically Computing Haplotype Blocks

In the -log10 P + LD plot, select the LD item in the Plot Tree and, under the Marker Blocks tab, select Visible Markers from the Compute options.

Note: Selecting all markers would compute haplotype blocks across the entire 500K dataset.

The top of the Haplotype Block Detection window tells you how many markers, on how many chromosomes, haplotype blocks will be computed for. In this case it is 22 markers active in 1 chromosome. Use the default parameters and click Run.

Figure 2a. LD plot with haplotype blocks.

The algorithm produces two haplotype blocks, which appear as black outlined pentagons at the top of the LD plot (Figure 2a). One could argue there should be only one block instead of two. For this reason, SVS 8 makes it easy to manually manipulate blocks when needed and then save the block definitions for subsequent analyses.

B. Manually Manipulating Haplotype Blocks

In this step you will manually define a single block from two separate blocks.

Click inside the larger block. Its outline will change to green and details for this block will appear in the Console window. Left-click on the left edge of the block and hold the mouse button down. Then drag the cursor to the left, expanding the larger block over the smaller block. Release the mouse button and a new block will be created.

Note: You can generate haplotype frequencies for the selected block by clicking the Compute Haplotype Tables option under the Marker Blocks tab of the Controls dialog.

Figure 2b. Haplotype frequencies in Data Console.

3. Comparing Multiple LD Plots

This step is included because, in your own study, it may be useful to compare multiple LD plots to understand how the correlation structure in one dataset compares to that of a similar dataset, e.g. comparing a random set of Caucasians in your study with the CEU samples of HapMap. Another useful example is comparing a less dense array (e.g. Affymetrix 500K) with a denser array (e.g. Affymetrix 6.0). Though not a standard practice, for demonstration purposes this tutorial compares the overall LD structure of all HapMap populations with that of only Yorubans.

From the -log10 P + LD plot viewer, right-click on the LD plot, select Edit Title..., and enter LD - All Populations.

Open the Phenotype Dataset + 500K Genotypes - Sheet 1 spreadsheet, right-click on Ethnicity (column 4) and select Activate by Category. Highlight YRI and click OK. This will inactivate all samples of a different ethnicity and create Phenotype Dataset + 500K Genotypes - Sheet 2.

Then, from the -log10 P + LD plot, go to File >Add, click the Project button and select the LD option from the Phenotype Dataset + 500K Genotypes - Sheet 2 spreadsheet.

Right-click on the second LD plot, select Edit Title..., and set the new name to LD - YRI Population.

Zoom into the region around the block defined in the LD - All Populations plot (Figure 3a). In this instance there is a slight difference in LD structure between the two plots. If you observed this in your own data, you would want to investigate why such a difference exists.

Go ahead and delete the LD - YRI Population LD plot by right-clicking its associated node in the Plot Tree and selecting Delete.

Figure 3a. Comparing LD between all HapMap and Yorubans.

4. Haplotype Frequency Tables

Once you define a haplotype block, you can investigate haplotype and diplotype frequency estimates for the entire population (broken down by cases and controls, if applicable) and for each individual sample in the dataset.

A. Generating Frequency Tables

From the -log10 P + LD plot, select the LD - All Populations item in the Plot Tree and click the Compute Haplotype Tables button on the Marker Blocks tab.

Keep the default values, except make sure that Per sample EM, Per sample diplotype, and Overall haplotype frequencies are selected, and click Run.

This will create three tables, one for each option selected in the previous window.

The Block #2 - Haplotype Table contains overall haplotype frequencies for the entire sample set. Notice that only the first marker is listed in the row label column, along with the various alleles represented in the haplotype. To see all the SNPs in the haplotype block, go to the Project Navigator and select the Block #2 - Haplotype Table node. All SNPs are listed in the Node Change Log in addition to other summary statistics (Figure 4a).

The Block #2 - EM Frequencies Table displays the various genotypes for each sample and their respective frequency estimates for each haplotype, calculated with the EM algorithm.

The Block #2 - Diplotype Table displays each sample's haplotype pair, combined as a diplotype, and each diplotype's respective frequency estimate.
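The EM estimation mentioned above can be illustrated with a deliberately small Python sketch for a block of just two biallelic SNPs. It assumes unphased genotypes coded as 0/1/2 dosages with no missing calls, and it returns only overall haplotype frequencies; it is not SVS's implementation, and all names and data are hypothetical. The only phase-ambiguous genotype in the two-SNP case is the double heterozygote, which is split between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
def split_alleles(dosage):
    """Return the two allele codes (0 or 1) implied by a 0/1/2 dosage."""
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[dosage]


def em_haplotype_freqs(genotypes, iterations=100):
    """Estimate two-SNP haplotype frequencies from unphased genotypes."""
    if not genotypes:
        raise ValueError("at least one genotype pair is required")
    haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    fixed = {h: 0.0 for h in haplotypes}  # counts with unambiguous phase
    n_ambiguous = 0                       # double heterozygotes
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_ambiguous += 1
            continue
        for hap in zip(split_alleles(g1), split_alleles(g2)):
            fixed[hap] += 1.0

    total = 2.0 * len(genotypes)
    freqs = {h: 0.25 for h in haplotypes}
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is (0,0)/(1,1)
        # rather than (0,1)/(1,0), given the current frequencies.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        weight = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected counts.
        counts = dict(fixed)
        counts[(0, 0)] += n_ambiguous * weight
        counts[(1, 1)] += n_ambiguous * weight
        counts[(0, 1)] += n_ambiguous * (1.0 - weight)
        counts[(1, 0)] += n_ambiguous * (1.0 - weight)
        freqs = {h: counts[h] / total for h in haplotypes}
    return freqs


# Hypothetical genotypes (dosage at SNP 1, dosage at SNP 2) per sample.
samples = [(0, 0), (2, 2), (1, 1), (1, 0), (2, 1)]
for hap, freq in em_haplotype_freqs(samples).items():
    print(hap, round(freq, 4))
```

Blocks with more markers need the same idea applied to every haplotype pair compatible with each sample's genotypes, which is why per-sample tables like the ones above list several candidate haplotypes with estimated probabilities.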

Figure 4a. List of SNPs in Node Change Log.

5. Haplotype Association Tests

Golden Helix SVS provides two overall methods for haplotype association testing: per-haplotype tests and per-block tests.

A. Performing Per Haplotype Association Tests

Open the -log10 P + LD plot and select the defined block by clicking inside the block on the LD plot. The block boundary will turn green to indicate that it has been selected.

Click the Selected Block button for Subset options on the Marker Blocks tab. This creates a subset spreadsheet (Phenotype Dataset + 500K Genotypes - Marker Block Subset) containing only the markers in the block. The phenotype data is not carried over, so it must be rejoined before proceeding to association testing.

Open Phenotype Dataset + 500K Genotypes - Sheet 1 and go to Select >Column >Inactivate All Columns, then reactivate the first 4 columns by left-clicking once on each column header.

Go to File >Join or Merge Spreadsheets, select Phenotype Dataset + 500K Genotypes - Marker Block Subset and click OK.

On the join dialog, change the New dataset name: to Phenotype + Marker Block Subset, leave all other options at their defaults and click OK.

From Phenotype + Marker Block Subset - Sheet 1, set the C/C phenotype as dependent by clicking once on the column header (turning it magenta), then go to Genotype >Haplotype Association Tests.

The Haplotype Association Tests window appears with a number of parameter settings. In this case we are treating all markers in the subset spreadsheet as a single block. Set the parameters as follows:

- Under Haplotype Block Definition, select Use all markers as single block.
- Under Haplotype Association Tests, select Calculate per haplotype.
- Under Tests, select Chi-squared test and Odds ratio with 95% CI.
- Under Multiple Testing Correction, check only Bonferroni adjustment (on N covariates).
- Under Additional Outputs, check Haplotype frequencies and Output data for P-P/Q-Q plots.

Click Run to finish.

A single spreadsheet (Haplotype Association Tests (Per Haplotype)) is produced with a row for each haplotype and a column for each test statistic selected (Figure 5a). Notice again that only the first marker of the block is represented in the row label column.
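The three statistics selected above can be sketched in plain Python for a single haplotype. The sketch assumes a 2 x 2 table of chromosome counts (cases vs. controls, haplotype vs. all other haplotypes), a Pearson chi-squared test without continuity correction, and a Woolf (log-based) confidence interval; the tutorial does not state which variants SVS uses, so treat these as illustrative choices. The counts are made up.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (1 df) for the table [[a, b], [c, d]].

    a, b: case chromosomes with / without the haplotype
    c, d: control chromosomes with / without the haplotype
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-squared tail, 1 df
    return stat, p_value


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI (Woolf's log method)."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 30 of 100 case chromosomes and 15 of 100
# control chromosomes carry the haplotype; 5 haplotypes were tested.
stat, p = chi_squared_2x2(30, 70, 15, 85)
print(round(stat, 3), round(p, 4), round(bonferroni(p, 5), 4))
print([round(v, 3) for v in odds_ratio_ci(30, 70, 15, 85)])
```

Each row of the per-haplotype results spreadsheet corresponds to one such table, with the remaining haplotypes pooled into the "other" column.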

Figure 5a. Per-haplotype association test results.

B. Performing Per Block Haplotype Association Tests

Another, perhaps more informative, test of association is a per-block test, in which a 2 x N chi-squared table is used, where N is the number of haplotypes represented.

Again, open Phenotype + Marker Block Subset - Sheet 1 and select Genotype >Haplotype Association Tests. Leave all the parameters the same, except this time select Calculate per block under Haplotype Association Tests. Click Run to finish.

A new spreadsheet is created (Haplotype Association Tests (Per Block)) with a single row of data representing the per-block association results.

Figure 5b. Per-block association test results.
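As a rough illustration of the 2 x N table described above, the Python sketch below computes the Pearson chi-squared statistic and its degrees of freedom from per-haplotype chromosome counts in cases and controls. The function name and counts are hypothetical, and details such as how SVS handles rare haplotypes are not covered by this tutorial.

```python
def chi_squared_2xn(case_counts, control_counts):
    """Pearson chi-squared statistic and degrees of freedom for a 2 x N table.

    case_counts[i] and control_counts[i] are the numbers of case and
    control chromosomes carrying haplotype i.
    """
    if len(case_counts) != len(control_counts):
        raise ValueError("both rows need one count per haplotype")
    total_cases = sum(case_counts)
    total_controls = sum(control_counts)
    if total_cases == 0 or total_controls == 0:
        raise ValueError("both cases and controls must be present")
    n = total_cases + total_controls

    stat = 0.0
    used_columns = 0
    for cases, controls in zip(case_counts, control_counts):
        column_total = cases + controls
        if column_total == 0:
            continue  # haplotype not observed at all; skip the column
        used_columns += 1
        for observed, row_total in ((cases, total_cases),
                                    (controls, total_controls)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat, used_columns - 1


# Hypothetical counts for three haplotypes in one block.
stat, df = chi_squared_2xn([30, 50, 20], [15, 55, 30])
print(round(stat, 3), df)
```

With N = 2 the statistic reduces to the ordinary 2 x 2 chi-squared test; for larger N the p-value comes from a chi-squared distribution with N - 1 degrees of freedom.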

6. Large-Scale Haplotype Association Testing

Now that you know how haplotype association testing works on a single haplotype block, you may want to investigate haplotypes on a larger, multi-block scale. For the sake of computation time, this tutorial leads you through haplotype association on chromosome 22 only, though the workflow can be applied directly to the entire genome.

A. Calculating Haplotype Blocks for Chromosome 22

Open the Phenotype Dataset + 500K Genotypes - Sheet 1 spreadsheet and select Select >Activate by Chromosomes. Click Uncheck All, then check 22 and click OK. This will create a new spreadsheet (Phenotype Dataset + 500K Genotypes - Sheet 4) where only genotypes in chromosome 22 are active, along with the phenotype data.

Rename the subset spreadsheet in the Project Navigator to Chr22 Genotypes.

From Chr22 Genotypes select Genotype >Haplotype Block Detection. Keep the defaults and click Run.

A new block definition spreadsheet is created (Haplotype blocks, 1362 markers in 542 groups) with a single column listing the markers and the blocks they belong to (Figure 6a).

B. Haplotype Association Tests Using Block Definitions

Open Chr22 Genotypes and select Genotype >Haplotype Association Tests.

Under Haplotype Block Definition select Use precomputed blocks. Click Choose Sheet. Select the block definition spreadsheet created in the previous step (Haplotype blocks, 1362 markers in 542 groups) and click OK.

Keep the rest of the parameters the same as before (make sure Calculate per block is selected) and click Run.

A new p-value spreadsheet is created, Haplotype Association Tests (Per Block), this time with results for each haplotype block defined across chromosome 22.

In the Project Navigator, rename this spreadsheet to Haplotype Association Tests (Per Block) - Chr22.

Figure 6a. Block definitions spreadsheet.

C. Plotting Haplotype and Single Marker Association Results Together

To see if haplotypes provide additional power in association testing, you can compare haplotype association results side-by-side with single marker association results.

Open the -log10 P + LD plot, select the first Chi-Squared -log10 P node in the Plot Tree and, under the Add tab, select Add Item(s). Click the Project button and, from the Haplotype Association Tests (Per Block) - Chr22 spreadsheet, select Chi-Squared -log10 P, then click Plot & Close.

In the Plot Tree, rename the first Chi-Squared -log10 P graph item to Haplotype -log10 P (right-click, Edit Title) and the second to Single Marker -log10 P.

You can change the attributes of the Haplotype -log10 P graph item to differentiate it more from the Single Marker -log10 P graph item. Select the Haplotype -log10 P graph item and, under the Display tab, change the Connector from None to Drop Line. Increase the weight to 3. Under the Style tab change the color to green and the symbol size to 5.

Zoom into the region surrounding the peak on chromosome 22 by pasting 22:36,372,272-37,995,325 into the location bar at the top of the plot window. The result is shown in Figure 6b.

Figure 6b. Single marker association vs. haplotype association results.

You can add the generated block set to the LD plot from the Plot Tree. Select the LD - All Populations graph item and, under the Marker Blocks tab, click Blocks under the Load options.

Select the Haplotype blocks, 1362 markers in 542 groups spreadsheet and click OK.

You now have a p-value plot with single marker and haplotype association results, along with an LD plot of chromosome 22 with automatically defined haplotype blocks. You can zoom in to any region by left-clicking and dragging in either graph. Left-click on the p-value plot's x-axis and drag from one side of the significant peak in chromosome 22 to the other side (Figure 6c).

Figure 6c. Haplotype vs single marker association zoomed in.

7. Haplotype Trend Regression

New to SVS 8 is the ability to perform haplotype regression analysis using a quantitative phenotype. Haplotype Trend Regression (HTR) takes one or more blocks of genotypic markers and, for each block, estimates the haplotypes for its markers, then regresses their by-sample haplotype probabilities against a dependent variable. Please see the SVS manual for full details on all of the options available for this new tool.

A. Full Model Regression

Open Phenotype Dataset + 500K Genotypes - Sheet 1 and left-click once on the C/C phenotype to inactivate the column. Now set the quantitative variable Pheno as dependent.

Note: For this simulated phenotype, a Corr/Trend Association test using an Additive model shows significant markers in chromosome 14, so to save time we will look only at markers in chromosome 14.

Go to Select >Activate by Chromosome, click Uncheck All, then check 14 and click OK.

First compute the haplotype blocks for chromosome 14 to use in the analysis by selecting Genotype >Haplotype Block Detection. Leave the defaults and click Run.

Then, from the Phenotype Dataset + 500K Genotypes - Sheet 6 spreadsheet, go to Genotype >Haplotype Trend Regression.

Under Haplotype Block Definition select Use precomputed blocks and select the Haplotype blocks, 4105 markers in 1537 groups spreadsheet. Leave the rest of the default settings and click Run (Figure 7a).

The resulting spreadsheet, Haplotype Trend Regression Results, is produced (Figure 7b). The rows of this spreadsheet correspond to the haplotype blocks used, and each row label is the first marker in the block.

Plot the results by right-clicking on the -log10 Full-Model P column and selecting Plot Value in GenomeBrowse.

Zoom into the area around the most significant block (first marker SNP_A-1859412) and add the LD plot: go to File >Add, click the Project button, choose the Phenotype Dataset + 500K Genotypes - Sheet 6 spreadsheet, check LD and click Plot & Close.

Add in the computed marker blocks by selecting the LD node in the Plot Tree, then, under the Marker Blocks tab, select Blocks under the Load options.

The plot should look similar to Figure 7c. The LD plot shows high linkage disequilibrium in the region corresponding to the significant p-value from the haplotype regression analysis. The R-Squared LD values are included in the Haplotype Trend Regression Results spreadsheet.
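The regression idea behind HTR can be illustrated with a heavily simplified Python sketch: an ordinary least-squares fit of a quantitative phenotype on one haplotype's expected dosage per sample (twice its by-sample probability). This is an assumption-laden reduction for illustration only; the full-model test described above considers the haplotypes of a block jointly, and the data below are invented.

```python
def simple_regression(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, residual sum of squares, R-squared).
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need matching inputs with at least three samples")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("predictor has no variation")
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - rss / tss if tss > 0 else 0.0
    return slope, intercept, rss, r_squared


# Hypothetical data: expected copies (0 to 2) of one haplotype per sample,
# e.g. from EM estimates, and a simulated quantitative phenotype.
expected_dosage = [0.0, 0.1, 0.9, 1.0, 1.2, 1.9, 2.0, 0.0]
phenotype = [1.1, 0.9, 2.2, 2.0, 2.6, 3.8, 4.1, 1.2]

slope, intercept, rss, r2 = simple_regression(expected_dosage, phenotype)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

Using expected dosages rather than hard haplotype calls lets uncertainty in phase carry through to the regression, which is the reason by-sample probabilities are used as predictors.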

Figure 7a. Full Model Haplotype Trend Regression Options

Figure 7b. Haplotype Trend Regression Results

Figure 7c. Full Model Haplotype Regression and LD Plot

B. Full vs Reduced Model Regression

Haplotype Trend Regression can also be used to correct for covariates. This tutorial dataset includes several possible covariates: Ethnicity, Age, and Gender.

Open Phenotype Dataset + 500K Genotypes - Sheet 6 and once again select Genotype >Haplotype Trend Regression.

Under Haplotype Block Definition select Use precomputed blocks and choose the Haplotype blocks, 4105 markers in 1537 groups spreadsheet.

Choose the Compute significance of full model vs. reduced model option under Haplotype Trend Regression Options.

Under Reduced Model Fixed Covariates select Add Covariate and choose the Ethnicity column. Click Add, then Close.

Leave the rest of the default settings (Figure 7d) and click Run.

Now we will add these results to the previous plot for comparison.

Open the Plot of Column -log10 Full-Model P from Haplotype Trend Regression Results and select the first -log10 Full-Model P node in the Plot Tree. Then, on the Add tab, click Add Item(s).

Select the second Haplotype Trend Regression Results spreadsheet and check -log10 FvR Model P to add it to the plot. Click Plot & Close.

Change the color of the new points by selecting the -log10 FvR Model P node in the Plot Tree and, under the Style tab, clicking the blue square and changing it to green.

Figure 7d. Full vs. Reduced Model Options

You should see that the same block is still significant after correcting for ethnicity, but not quite as significant (Figure 7e).
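A full-versus-reduced comparison of nested linear models is conventionally summarized by an F statistic, sketched below in Python. The sketch assumes the reduced model contains only the covariates (here, ethnicity) and the full model adds the haplotype terms; whether SVS computes its FvR p-value in exactly this way is not stated in this tutorial, and the numbers are invented.

```python
def full_vs_reduced_f(rss_reduced, rss_full, n_params_reduced,
                      n_params_full, n_samples):
    """F statistic comparing nested linear models.

    rss_*: residual sums of squares of the two fitted models
    n_params_*: number of fitted coefficients, including the intercept
    Returns (F, numerator df, denominator df).
    """
    df_num = n_params_full - n_params_reduced
    df_den = n_samples - n_params_full
    if df_num <= 0 or df_den <= 0:
        raise ValueError("full model must add parameters and leave residual df")
    if rss_full <= 0:
        raise ValueError("full-model residual sum of squares must be positive")
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Hypothetical fit: covariates alone leave RSS = 120; adding two
# haplotype terms lowers it to 100 with 54 samples.
print(full_vs_reduced_f(120.0, 100.0, 2, 4, 54))
```

A large F indicates that the haplotype terms explain phenotype variation beyond what the covariates already explain, which matches the observation that the block stays significant, though less so, after correcting for ethnicity.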

Figure 7e. Addition of Corrected Regression Results