SNP and destroy - a discussion of a weighted distance-based SNP selection algorithm


David A. Hall, Rodney A. Lea

November 14, 2005

Abstract

Recent developments in bioinformatics have introduced a number of methods for quickly typing a large number of Single Nucleotide Polymorphisms (SNPs) in the human genome. Due to time and cost constraints, carrying out similar typing in large populations may not be viable. In addition, because of linkage between SNPs, typing done at one location may be fully predictive of typing carried out at an adjacent SNP. A Java program (SNPBlaster) has been developed that carries out a weighted iterative SNP selection procedure, which may be useful for weeding out SNPs that are unlikely to be useful in a population SNP screen, based on frequency differences between two populations. Included in this discussion is an application of the algorithm to finding SNPs on chromosome 4, approximately 1MB apart, that are likely to be informative for ancestry. While the program is useful in its current form, there are a few cautions that should be taken into account, related to the prototype / test nature of the program.

1 Introduction

An increasingly large number of Single Nucleotide Polymorphisms (SNPs) are being found in the human genome, with the help of large-scale genotyping projects such as HapMap (The International HapMap Consortium, 2003). However, due to cost constraints, most studies will need to select a much smaller number of SNPs, preferably ones with a high information content.

A Java application, SNPBlaster, has been designed that attempts to remove SNPs from a large set, retaining only those SNPs that would be useful or informative. The program combines positional information with an arbitrary (user-defined) information measure, referred to in this paper as a weighting factor. It is assumed that this measure has already been calculated for each SNP; the program uses that measure to determine, within a small region of the chromosome (a window), which SNPs should be removed. This paper gives an example of the use of SNPBlaster using population differences as the measure, with a window size of 1MB.

2 Algorithm

2.1 Summary of the algorithm

Figure 1: This figure indicates a hypothetical situation in which a number of equally weighted markers (labelled a to o) are given to the SNPBlaster program for removal. The SNPBlaster program will remove markers b, c, e, g, i, j, k, l and o with the given window size.

The algorithm begins with a group of SNPs with a high weighting factor ( ), removing those that are closer than a certain distance (the window size) to other SNPs. If there is more than one SNP that may be removed, then the algorithm will attempt to remove the SNP whose removal results in a better distribution of SNPs, such that the variance in distance between SNPs is low. Once this group is dealt with, the algorithm locks the SNPs so that they cannot be removed, then continues with the next group of SNPs, until all SNPs have been tested for potential removal from the marker set.

2.2 Detail

The SNPBlaster program uses an O(n) iterative algorithm to select SNPs from a list for a given window size, given the location of each SNP and an associated weighting factor. The algorithm begins with an empty list of SNPs to be worked on and, at the beginning of each iteration, moves a group of SNPs from the complete list into the working list. Each group consists of every SNP still in the complete list that has a weight factor greater than or equal to the threshold value (which starts at 1.0 and is reduced by 0.1 after each iterative step). After each step, the SNPs that remain in the working list are locked, so that they will not be removed even if another SNP that is more appropriate location-wise is found; the assumption is that a high weight factor is more important than the location of the SNP.

The core part of the algorithm involves a sweep through the chromosome, selecting up to four adjacent SNPs in order to decide which to remove. For the purpose of explanation, they are named p2, p1, c1 and c2 ("previous" and "current"), with the main decision being which of p1 or c1 (if either) to remove. After each decision, markers are shifted as necessary based on the decision that has been made. Looking at figure 1, it may help to note that p2, p1, c1 and c2 start out as a, b, c and d respectively.

There are a few trivial cases in the process that are worth pointing out before discussing more complex cases. Firstly, if p1 and c1 are further apart than the window size and p1 and p2 are also further apart than the window size, then no SNP removal is carried out. Also, no removal is carried out if the closest markers (within a window) are locked, a situation that should only occur if the SNPs are explicitly locked by the user.

In cases where p1 is closer than the window size to either c1 or p2 (and at least one of the pair is unlocked), a SNP will be removed. If one SNP is locked, then the other will be removed (with a decision being made on the p1, c1 pair first, if possible). If c1 and p1 are closer than the window size to each other, and neither is locked, then the SNP that is removed will be the one whose removal results in the most even spread between the remaining three (of the four) markers. In the example in figure 1, b will be the first marker removed, because the distance between a and b is less than the distance between c and d.

3 Application

The following details an example application of the SNPBlaster algorithm to SNPs recorded in the HapMap database (The International HapMap Consortium, 2003).
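The threshold grouping and sweep described in Section 2.2 can be illustrated with a short sketch. This is not the SNPBlaster source, which is written in Java and is not reproduced in this paper. The names `Snp`, `spread`, `choose_removal`, `sweep` and `select` are hypothetical, the sweep examines adjacent pairs rather than reproducing the exact p2/p1/c1/c2 bookkeeping, ties are broken by removing c1 (an arbitrary choice), and no attempt is made to match the reported O(n) behaviour.

```python
from dataclasses import dataclass, replace


@dataclass
class Snp:
    name: str
    pos: int            # base-pair position on the chromosome
    weight: float       # user-defined information measure in [0, 1]
    locked: bool = False


def spread(positions):
    """Unevenness of sorted positions: largest gap minus smallest gap."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return max(gaps) - min(gaps) if gaps else 0


def choose_removal(p2, p1, c1, c2):
    """Return whichever of p1/c1 leaves the more even spacing when removed.

    p2 and c2 may be None at the ends of the chromosome. Ties remove c1,
    which is an assumption of this sketch, not a documented rule.
    """
    without_p1 = [s.pos for s in (p2, c1, c2) if s is not None]
    without_c1 = [s.pos for s in (p2, p1, c2) if s is not None]
    return p1 if spread(without_p1) < spread(without_c1) else c1


def sweep(working, window):
    """One left-to-right pass removing unlocked SNPs closer than `window`."""
    snps = sorted(working, key=lambda s: s.pos)
    i = 1
    while i < len(snps):
        p1, c1 = snps[i - 1], snps[i]
        if c1.pos - p1.pos >= window or (p1.locked and c1.locked):
            i += 1                          # nothing to remove for this pair
            continue
        if p1.locked:
            victim = c1
        elif c1.locked:
            victim = p1
        else:
            p2 = snps[i - 2] if i >= 2 else None
            c2 = snps[i + 1] if i + 1 < len(snps) else None
            victim = choose_removal(p2, p1, c1, c2)
        # Delete by index; the pair now sitting at (i - 1, i) is examined next.
        del snps[i - 1 if victim is p1 else i]
    return snps


def select(snps, window, steps=10):
    """Admit SNPs in descending weight bands, thinning and locking each band."""
    pending = [replace(s) for s in snps]    # copies, so the input is not mutated
    working = []
    for k in range(steps, -1, -1):
        threshold = k / steps               # 1.0, 0.9, ..., 0.0
        working += [s for s in pending if s.weight >= threshold]
        pending = [s for s in pending if s.weight < threshold]
        working = sweep(working, window)
        for s in working:
            s.locked = True                 # survivors cannot be removed later
    return working


panel = [
    Snp("A", 0, 0.95),
    Snp("B", 500_000, 0.95),
    Snp("C", 2_000_000, 0.5),
    Snp("D", 2_300_000, 0.95),
]
print([s.name for s in select(panel, window=1_000_000)])  # ['A', 'D']
```

In this made-up panel, B is dropped because it lies within the window of A in the same weight band, and C is dropped later because it lies within the window of D, which was locked in an earlier band.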

3.1 Preparation

All available non-redundant SNPs for chromosome 4 as at 15 August 2005 were loaded into a database. The difference in genotype frequency for the non-reference ("rare") alleles between the CEU population (Utah residents with ancestry from northern and western Europe) and the CHB population (Han Chinese in Beijing, China) was used to weight the usefulness of the alleles: a value of 1 indicated a 100% difference in allele frequencies, while 0 indicated no difference in allele frequencies (a more rigorous study may take the minimum difference over all population combinations). The rs#, chromosome position and frequency difference for each SNP were exported into a text file, and the SNPBlaster header (with a chromosome length of base pairs) was added to prepare the file for the program.

3.2 Running

Figure 2: A plot of the SNPs on chromosome 4 chosen by SNPBlaster that appear to be informative for ancestry. The measure (given on the y axis) is the allele frequency difference between two populations, labelled CEU and CHB in the HapMap project.

After the input file was prepared, the program was run, loading the prepared input file and setting a window size of 1MB. From an input file of approximately markers, an output file containing 146 markers was generated, giving an average SNP separation distance of about 1.35MB. A graphical representation of the SNPs that were selected is shown in figure 2.

3.3 Speed

Loading all the HapMap information for chromosome 4 into the database took the longest time by far. The process took approximately 17 minutes, but it may be possible to reduce this to around 2-5 minutes depending on the database format and the program used to import the data. In comparison, the SNPBlaster algorithm was significantly faster, typically taking 2-5 seconds for the loading process and a similar time for the iterative algorithm. This suggests that the algorithm is reasonably fast, and is likely to be very useful for carrying out SNP selection tasks in the future.

4 Discussion

4.1 Distance-based information

[Something about crossover etc, centimorgans]

One paper has decided that a crossover frequency of 1% is sufficient for detecting recent population structure. This relates to a base pair distance of approximately one megabase [is it reasonable to say that we can treat the mutation rate as effectively zero? does the mutation rate matter? does it remove information at a SNP, or add information to a SNP?].

Regardless of the approach, it is reasonable to assume that within a certain distance, two SNP mutations are likely to have a high degree of linkage: a specific mutation in one SNP will almost always correspond to a specific mutation in the other SNP. For this reason, it is not useful to record mutations at both of those SNPs in a study. This minimum allowed distance between SNPs will be referred to as the window size.

The more SNPs that are typed, the greater the cost per individual, meaning that fewer individuals can be typed for the same amount of money. With a large window size, the cost per individual will be low, but there is also likely to be a loss of information, because some of the variation will be missed by not typing enough SNPs.
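The trade-off between window size and the number of individuals that can be typed can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `panel_cost`, the budget and the cost per genotype are hypothetical, the chromosome length of 200MB is the approximate figure quoted in Section 4.4, and the default average spacing of 1.5 window sizes is the rough estimate given in Section 4.5.

```python
def panel_cost(chromosome_bp, window_bp, budget, cost_per_genotype,
               spacing_factor=1.5):
    """Estimate panel size and how many individuals a fixed budget can type.

    spacing_factor is the assumed ratio of average SNP spacing to window size.
    """
    n_snps = max(1, round(chromosome_bp / (spacing_factor * window_bp)))
    individuals = int(budget // (n_snps * cost_per_genotype))
    return n_snps, individuals


# Hypothetical budget of 10,000 units at 0.5 units per genotype.
for window in (500_000, 1_000_000, 2_000_000):
    n, people = panel_cost(200_000_000, window,
                           budget=10_000, cost_per_genotype=0.5)
    print(f"window={window:>9,} bp  SNPs={n:>4}  individuals={people}")
```

Under these assumed figures, doubling the window size roughly halves the number of SNPs in the panel and roughly doubles the number of individuals that can be typed, at the price of coarser coverage of the chromosome.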

4.2 Measure-based information

Another approach to determining the information derived from specific SNPs is to apply some function to each SNP that gives an idea of the information content of that SNP. This function, regardless of its method, will typically identify the SNPs that will be the most useful in an investigation. It would be expected that such a function would also be able to indicate which of two SNPs is more appropriate for an investigation.

4.2.1 Population differences

One way to get an estimate of the information content with respect to the ancestry of an individual is to determine differences between populations at specific SNPs. An example of this (the example used in this paper) is one that compares allele frequency differences of a specific mutation of a SNP. The reasoning behind this is as follows: if a certain mutation is always present in one population (or more correctly a small subset of that population), but never present in another population, then that SNP will be informative in determining the proportion of ancestry [or something similar] that an unknown individual has relative to those population groups. In this sense, a SNP with a high frequency difference between populations will be considered useful, and one that has a low difference between populations will not be useful.

Other methods for determining the information content of SNPs are available. A reader who is interested in these may like to read [cite some relevant papers].

4.3 Binocular vision

There are issues involved with choosing only one of these two information procedures in SNP selection. Working purely on distance-based information is likely to mean that some of the SNPs that are chosen will not be informative enough for the investigation, even if a more appropriate SNP is available nearby. Working purely on measure-based information may mean that some variation will be missed, and there may be a lot of unnecessary content.
To further explain this, if many highly informative SNPs were in a single area of the chromosome (none further apart from each other than the window size), then all the SNPs would be expected to carry linked mutations: any one of them could be used to infer the mutations at the other locations. In addition to this, some parts of the chromosome may be missed out, because those parts only have SNPs with a very low measure-based information content.

It is probably worth noting that some SNP information measures will also take into account local distance information. However, the algorithms used to generate these measures may be very processor intensive, because they require recalculation after each SNP removal.

The SNPBlaster algorithm gets around the issue of complexity due to recalculation by grouping SNPs of the same measure together, and from then on using distance methods to select SNPs. While the measure is the key factor in determining which SNP is chosen locally, the window size is typically more important at the whole-chromosome level.

4.4 Similar programs

Another application, CHOISS (Lee and Kang, 2004), is currently available to carry out a similar SNP selection process, choosing SNPs (either by stating an interval, or by stating the required number of SNPs) to minimise variance. When the web-based version of CHOISS was tried with a data set derived from chromosome 4 (approximately 50,000 markers), the web interface timed out at fifteen minutes.

The algorithmic complexity of CHOISS is reported to be O(n^2), while the complexity of SNPBlaster is O(n). For small to medium numbers of SNPs (possibly up to around 5000), this may not make a significant difference, but above that level, the solution time for SNPBlaster is likely to be significantly less than that of CHOISS. However, there may be situations in which CHOISS is more appropriate, most likely those where a rough guess at the best SNPs is not appropriate.

After CHOISS was downloaded, it was noticed that the algorithm did in fact run in a reasonably short period of time (2 minutes, compared with around 5 seconds for SNPBlaster). However, the output was not what was expected.
When SNPs in the input file were in a random order, the CHOISS algorithm selected approximately 2000 SNPs, with a reported average distance of 1MB for a chromosome of total length around 200MB. When the SNPs were sorted, the algorithm did not select any SNPs. It is likely that this behaviour had more to do with range overflows (i.e. numbers being too large) than with problems in the algorithm itself, but it will be difficult to work out for sure without a more thorough analysis of the program. [I should probably contact the authors then]

4.5 Caveats

While the algorithm as described is useful for quick large-scale selection of SNPs, there are a number of factors that may make it less reliable than would be expected.

At the moment, the program will only work for SNPs on a single chromosome. If more chromosomes are desired (as would be expected for a full genome selection process), then it is necessary to work on each chromosome individually.

The algorithm is iterative, but does not have soft divisions between different weight factors. This means that SNPs with a weighting at the low end of a grouping (e.g. 0.91) are considered to be just as important as SNPs at the high end of that grouping (e.g. 0.99). In some cases, this may be alleviated by increasing the number of iterative divisions in the algorithm, but this would remove the linking between similar weights. An alternative procedure would attempt to determine a relationship between the distance between markers and the weighting factor, although it is likely that such a procedure could only be applied to specific types of studies. Such an approach will certainly increase the amount of processing required (as all SNPs within a window will need to be tested, rather than just the closest), but as long as the window size is small, this increase is unlikely to make the algorithm unmanageably slow.

The window size defined in the algorithm refers to the minimum allowed distance between SNPs. If there is enough coverage, then the maximum distance between SNPs will be just under twice the window size. Knowledge of this variation in SNP distance may be important in choosing a window size for the algorithm, because the average SNP distance is likely to be closer to 1.5 times the window size.
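One way to check how a selected panel relates to the chosen window size is to summarise the gaps between adjacent selected SNPs as multiples of the window. The following sketch is a hypothetical helper (`gap_summary` is not part of SNPBlaster), and the positions in the example are invented for illustration.

```python
def gap_summary(positions, window_bp):
    """Summarise gaps between adjacent SNPs as multiples of the window size."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("at least two SNP positions are needed")
    return {
        "min_over_window": min(gaps) / window_bp,
        "mean_over_window": sum(gaps) / len(gaps) / window_bp,
        "max_over_window": max(gaps) / window_bp,
        # Gaps below the window size should not occur unless SNPs were locked.
        "too_close": sum(g < window_bp for g in gaps),
    }


# Invented positions (in base pairs) for a panel selected with a 1MB window.
summary = gap_summary([0, 1_000_000, 2_900_000, 4_500_000], 1_000_000)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A mean well above 1 and a maximum approaching 2 in the output would be consistent with the behaviour described above, and a non-zero `too_close` count would point to locked SNPs or to a problem with the selection.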
An increase in the number of iterative divisions is likely to increase the average distance between SNPs, as it is less likely that two SNPs within one of the weight ranges will be close to each other.

4.6 Other Applications

While the algorithm was designed for the purpose of removing SNPs in order to obtain a panel of useful markers for further studies, attempts have been made to make the algorithm as generic as possible. There is potential for similar processes to be carried out in other areas: anywhere in which a well-distributed selection is required along a linear track, and there are a limited number of choices along that track.

References

Lee, S. and Kang, C. (2004). CHOISS for selection of single nucleotide polymorphism markers on interval regularity, Bioinformatics 20(4):

Liu, Z. and Lin, S. (2005). Multilocus LD measure and tagging SNP selection with generalized mutual information, Genetic Epidemiology ?(?): 1-10?

Sebastiani, P., Lazarus, R., Weiss, S. T., Kunkel, L. M., Kohane, I. S. and Ramoni, M. F. (2003). Minimal haplotype tagging, PNAS 100(17):

The International HapMap Consortium (2003). The International HapMap Project, Nature 426:


More information

Smoking and Lung Cancer in China

Smoking and Lung Cancer in China Smoking and Lung Cancer in China Hongbing Shen, M.D., Ph.D. Professor of Epidemiology Department of Epidemiology & Biostatistics Nanjing Medical University School of Public Health Tel: 86862747; Email:

More information

Genetic Algorithms I N T R O D U C T I O N

Genetic Algorithms I N T R O D U C T I O N Genetic Algorithms I N T R O D U C T I O N Genetic Algorithms are good at taking large, potentially huge search spaces and navigating them, looking for optimal combinations of things, solutions you might

More information

Karyomapping: A Rapid PGD Solution for Single-Gene Disorders

Karyomapping: A Rapid PGD Solution for Single-Gene Disorders Karyomapping: Rapid PD Solution for Single-ene Disorders he ability to decipher DN sequences is providing new insights into human health and helping us enter a new era of genomics-based healthcare. mong

More information

DNA Circles White Paper

DNA Circles White Paper DNA Circles White Paper AncestryDNA 2014 DNA Circles White Paper Last updated November 18, 2014 Identifying groups of descendants using pedigrees and genetically inferred relationships in a large database

More information

Algorithms. Theresa Migler-VonDollen CMPS 5P

Algorithms. Theresa Migler-VonDollen CMPS 5P Algorithms Theresa Migler-VonDollen CMPS 5P 1 / 32 Algorithms Write a Python function that accepts a list of numbers and a number, x. If x is in the list, the function returns the position in the list

More information

Comparison of SNP-based and gene-based. association studies in detecting rare variants using

Comparison of SNP-based and gene-based. association studies in detecting rare variants using Comparison of SNP-based and gene-based association studies in detecting rare variants using unrelated individuals Liping Tong 1,2 *, Bamidele Tayo 2, Jie Yang 3, Richard S Cooper 2 1 Department of Mathematics

More information

LAB 1: INTRODUCTION TO MOTION

LAB 1: INTRODUCTION TO MOTION Name Date Partners V1 OBJECTIVES OVERVIEW LAB 1: INTRODUCTION TO MOTION To discover how to measure motion with a motion detector To see how motion looks as a positiontime graph To see how motion looks

More information

Introduction to Genetic Epidemiology. Erwin Schurr McGill International TB Centre McGill University

Introduction to Genetic Epidemiology. Erwin Schurr McGill International TB Centre McGill University Introduction to Genetic Epidemiology Erwin Schurr McGill International TB Centre McGill University Methods of investigation in humans Phenotype Rare (very severe forms) Common (infection/affection status)

More information

Solving Cryptarithmetic Problems Using Parallel Genetic Algorithm

Solving Cryptarithmetic Problems Using Parallel Genetic Algorithm 2009 Second International Conference on Computer and Electrical Engineering Solving Cryptarithmetic s Using Parallel Genetic Algorithm Reza Abbasian Department of Computer Engineering Shahid Chamran University

More information

Instructions for IMPUTE version 2

Instructions for IMPUTE version 2 Instructions for IMPUTE version 2 Bryan Howie and Jonathan Marchini June 18, 2009 1 Introduction Background IMPUTE version 2 is a new imputation algorithm for large population genetic datasets, such as

More information

The Inheritance of Complex Traits

The Inheritance of Complex Traits Michael R. Cummings Chapter 5 The Inheritance of Complex Traits David Reisman University of South Carolina 5.1 Polygenic Traits Discontinuous variation Phenotypes that fall into two or more distinct, nonoverlapping

More information

Database Analysis. Chapter 20

Database Analysis. Chapter 20 Database Analysis Chapter 20 Local Population Databases Not talking about all of the CODIS database Which has millions of DNA profiles Instead these are local databases Specific for a certain ethnic background

More information

Gene Mapping Techniques

Gene Mapping Techniques Gene Mapping Techniques OBJECTIVES By the end of this session the student should be able to: Define genetic linkage and recombinant frequency State how genetic distance may be estimated State how restriction

More information

Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource

Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource Genotyping and quality control of UK Biobank, a large- scale, extensively phenotyped prospective resource Information for researchers Interim Data Release, 2015 1 Introduction... 3 1.1 UK Biobank... 3

More information

Chapter - III Results and Discussion

Chapter - III Results and Discussion Chapter - III Results and Discussion Chapter - III Results and Discussion In the present genetic epidemiological study, for evaluating association between recurrent pregnancy loss (RPL) and three pro-inflammatory

More information

Genetic Epidemiology RESEARCH ARTICLE. MaCH-Admix: Genotype Imputation for Admixed Populations. Introduction

Genetic Epidemiology RESEARCH ARTICLE. MaCH-Admix: Genotype Imputation for Admixed Populations. Introduction RESEARCH ARTICLE Genetic Epidemiology MaCH-Admix: Genotype Imputation for Admixed Populations Eric Yi Liu, 1 Mingyao Li, 2 Wei Wang, 1,3 and Yun Li 1,4 1 Department of Computer Science, University of North

More information

High-density SNP Genotyping Analysis of Broiler Breeding Lines

High-density SNP Genotyping Analysis of Broiler Breeding Lines AS 653 ASL R2219 2007 High-density SNP Genotyping Analysis of Broiler Breeding Lines Abebe T. Hassen Jack C.M. Dekkers Susan J. Lamont Rohan L. Fernando Santiago Avendano Aviagen Ltd. See next page for

More information

Calculating LOD score: experimental comparison

Calculating LOD score: experimental comparison Calculating LOD score: experimental comparison Natalia Flerova September 17, 2010 1 Introduction The purpose of this experimental work is to compare the performance of three programs capable of calculating

More information

Estimating population size via line graph reconstruction

Estimating population size via line graph reconstruction Estimating population size via line graph reconstruction Bjarni V. Halldórsson 1, Dima Blokh 2, and Roded Sharan 2 (1) School of Science and Engineering, Reykjavík University, Menntavegur 1, 101 Reykjavik,

More information

Phasing the Chromosomes of a Family Group When One Parent is Missing

Phasing the Chromosomes of a Family Group When One Parent is Missing Journal of Genetic Genealogy, 6(1), 2010 Phasing the Chromosomes of a Family Group When One Parent is Missing T. Whit Athey Abstract A technique is presented for the phasing of sets of SNP data collected

More information

IRiS User Manual. April 22, 2011

IRiS User Manual. April 22, 2011 IRiS User Manual April 22, 2011 Contents 1 Introduction 2 2 Phase 1: Detecting Recombinations 4 2.1 Input format............................................. 4 2.2 Command Line Parameters.....................................

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

Supplementary Materials for

Supplementary Materials for www.sciencemag.org/cgi/content/full/338/6114/1627/dc1 Supplementary Materials for Probing Meiotic Recombination and Aneuploidy of Single Sperm Cells by Whole-Genome Sequencing Sijia Lu, Chenghang Zong,

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

GENETIC MAPS. Genetica per Scienze Naturali a.a prof S. Presciuttini

GENETIC MAPS. Genetica per Scienze Naturali a.a prof S. Presciuttini GENETIC MAPS Questo documento è pubblicato sotto licenza Creative Commons Attribuzione Non commerciale Condividi allo stesso modo http://creativecommons.org/licenses/by-nc-sa/2.5/deed.it An early dihybrid

More information

Biostatistics 666 Statistical Models in Human Genetics. Instructor Gonçalo Abecasis

Biostatistics 666 Statistical Models in Human Genetics. Instructor Gonçalo Abecasis Biostatistics 666 Statistical Models in Human Genetics Instructor Gonçalo Abecasis Course Logistics Grading Office Hours Class Notes Course Objective Provide an understanding of statistical models used

More information

BAPS: Bayesian Analysis of Population Structure

BAPS: Bayesian Analysis of Population Structure BAPS: Bayesian Analysis of Population Structure Manual v. 5.3 NOTE: ANY INQUIRIES CONCERNING THE PROGRAM SHOULD BE SENT TO JUKKA CORANDER. EMAIL ADDRESS IS VISIBLE AT THE BAPS WEBPAGE: http://web.abo.fi/fak/mnf//mate/jc/software/baps.html

More information

BAPS: Bayesian Analysis of Population Structure

BAPS: Bayesian Analysis of Population Structure BAPS: Bayesian Analysis of Population Structure Manual v. 6.0 NOTE: ANY INQUIRIES CONCERNING THE PROGRAM SHOULD BE SENT TO JUKKA CORANDER (first.last at helsinki.fi). http://www.helsinki.fi/bsg/software/baps/

More information

Name That Gene, Disease and Protein!

Name That Gene, Disease and Protein! Name That Gene, Disease and Protein! By Sharlene Denos (UIUC) & Kathryn Hafner (Danville High) What You Will Be Doing In this assignment you will be introduced to the field of bioinformatics. This is a

More information

Cloud-Based Big Data Analytics in Bioinformatics

Cloud-Based Big Data Analytics in Bioinformatics Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large

More information

(1-p) 2. p(1-p) From the table, frequency of DpyUnc = ¼ (p^2) = #DpyUnc = p^2 = 0.0004 ¼(1-p)^2 + ½(1-p)p + ¼(p^2) #Dpy + #DpyUnc

(1-p) 2. p(1-p) From the table, frequency of DpyUnc = ¼ (p^2) = #DpyUnc = p^2 = 0.0004 ¼(1-p)^2 + ½(1-p)p + ¼(p^2) #Dpy + #DpyUnc Advanced genetics Kornfeld problem set_key 1A (5 points) Brenner employed 2-factor and 3-factor crosses with the mutants isolated from his screen, and visually assayed for recombination events between

More information

Introduction to Genetic Epidemiology

Introduction to Genetic Epidemiology e of Genetic Epidemiology Introduction to Genetic Epidemiology Dr. Christian Gieger & Dr. Rajesh Rawal e of Genetic Epidemiology Helmholtz Zentrum München Key concepts concepts in in genetic genetic epidemiology

More information

Instruction to run LD estimation and Persistence of Phase calculations from phased

Instruction to run LD estimation and Persistence of Phase calculations from phased Instruction to run LD estimation and Persistence of Phase calculations from phased files Yvonne M. Badke Department of Animal Science Michigan State University East Lansing, Mi, USA email: badkeyvo@msu.edu

More information

Next Generation Sequencing: Technology, Mapping, and Analysis

Next Generation Sequencing: Technology, Mapping, and Analysis Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University gbenson@bu.edu http://tandem.bu.edu/ The Human Genome Project took

More information

Genetic data concepts and tests

Genetic data concepts and tests Genetic data concepts and tests Cavan Reilly September 23, 2013 Table of contents Overview Linkage disequilibrium Quantifying LD Heatmap for LD Hardy-Weinberg equilibrium Overview Prior to conducting tests

More information

Investigating the genetic basis for intelligence

Investigating the genetic basis for intelligence Investigating the genetic basis for intelligence Steve Hsu University of Oregon and BGI www.cog-genomics.org Outline: a multidisciplinary subject 1. What is intelligence? Psychometrics 2. g and GWAS: a

More information

Asexual Versus Sexual Reproduction in Genetic Algorithms 1

Asexual Versus Sexual Reproduction in Genetic Algorithms 1 Asexual Versus Sexual Reproduction in Genetic Algorithms Wendy Ann Deslauriers (wendyd@alumni.princeton.edu) Institute of Cognitive Science,Room 22, Dunton Tower Carleton University, 25 Colonel By Drive

More information

Population Structure. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 9

Population Structure. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 9 Population Structure Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 9 1 / 1 Nonrandom Mating HWE assumes that mating is random in the population Most natural

More information

Fully powered polygenic prediction using summary statistics

Fully powered polygenic prediction using summary statistics Fully powered polygenic prediction using summary statistics Alkes L. Price Harvard T.H. Chan School of Public Health October 7, 015 To download slides of this talk: google Alkes HSPH Summary statistics

More information

Honesty, power and bootstrapping in composite interval quantitative trait locus mapping

Honesty, power and bootstrapping in composite interval quantitative trait locus mapping Open Journal of Genetics, 2013, 3, 127-140 http://dx.doi.org/10.4236/ojgen.2013.32016 Published Online June 2013 (http://www.scirp.org/journal/ojgen/) OJGen Honesty, power and bootstrapping in composite

More information

PRE-LAB PREPARATION SHEET FOR LAB 1: INTRODUCTION TO MOTION

PRE-LAB PREPARATION SHEET FOR LAB 1: INTRODUCTION TO MOTION Name Date PRE-LAB PREPARATION SHEET FOR LAB 1: INTRODUCTION TO MOTION (Due at the beginning of Lab 1) Directions: Read over Lab 1 and then answer the following questions about the procedures. 1. In Activity

More information

Quick Reference Guide

Quick Reference Guide GeneMapper Software Version 4.0 Quick Reference Guide In This Guide This quick reference guide describes example analysis workflows for the GeneMapper Software Version 4.0. It also summarizes the version(s)

More information

Basics of Linkage and Gene Mapping

Basics of Linkage and Gene Mapping Chapter 5 Basics of Linkage and Gene Mapping Julius van der Werf Basics of Linkage and Gene Mapping...45 Linkage...45 Linkage disequilibrium...47 Mapping functions...48 Mapping of genetic markers...50

More information

The 1000 Genomes Project. John Pearson RCPA Short Course in Medical Genetics and Genetic Pathology, Melbourne

The 1000 Genomes Project. John Pearson RCPA Short Course in Medical Genetics and Genetic Pathology, Melbourne The 1000 Genomes Project John Pearson 2011 06 18 RCPA Short Course in Medical Genetics and Genetic Pathology, Melbourne Overview: 1. Introduction to Next Generation Sequencing (NGS) 2. 1000 Genome Project

More information

Genomes and their variation

Genomes and their variation Genomes and their variation Phenotypic variation arises from genetic and environmental variation. Both are usually major contributors, and each influences the other. The genetic variation is encoded by

More information

Benchmarking Student Learning Outcomes using Shewhart Control Charts

Benchmarking Student Learning Outcomes using Shewhart Control Charts Benchmarking Student Learning Outcomes using Shewhart Control Charts Steven J. Peterson, MBA, PE Weber State University Ogden, Utah This paper looks at how Shewhart control charts a statistical tool used

More information

Coancestry in the analysis of complex traits

Coancestry in the analysis of complex traits Coancestry in the analysis of complex traits Elizabeth Thompson University of Washington For: Simons Institute Workshop Berkeley, California 18-21 February, 2014 With acknowledgement to Sharon Browning,

More information

Genomes and SNPs in Malaria and Sickle Cell Anemia

Genomes and SNPs in Malaria and Sickle Cell Anemia Genomes and SNPs in Malaria and Sickle Cell Anemia Introduction to Genome Browsing with Ensembl Ensembl The vast amount of information in biological databases today demands a way of organising and accessing

More information

Linkage and Recombination

Linkage and Recombination Linkage and Recombination v linkage equilibrium ² refers to cases in which the alleles of different genes are in random association ² expected when genes are on different chromosomes ² gamete frequencies

More information

Statistics of Linked Markers in Relationship Testing

Statistics of Linked Markers in Relationship Testing Statistics of Linked Markers in Relationship Testing Max P. Baur Rolf Fimmers Inst. f. Med. Biometry, University of Bonn Peter Schneider Inst. f. Legal Medicine, University of Cologne Population Genetics

More information

11/14/2010 Intelligent Systems and Soft Computing 1

11/14/2010 Intelligent Systems and Soft Computing 1 Lecture 9 Evolutionary Computation: Genetic algorithms Introduction, or can evolution be intelligent? Simulation of natural evolution Genetic algorithms Case study: maintenance scheduling with genetic

More information