SNP and destroy - a discussion of a weighted distance-based SNP selection algorithm

David A. Hall, Rodney A. Lea

November 14, 2005

Abstract

Recent developments in bioinformatics have introduced a number of methods for quickly typing a large number of Single Nucleotide Polymorphisms (SNPs) in the human genome. Due to time and cost constraints, carrying out similar typing in large populations may not be viable. Furthermore, because of linkage between SNPs, typing done at one location may be fully predictive of typing carried out at an adjacent SNP. A Java program (SNPBlaster) has been developed that carries out a weighted iterative SNP selection procedure, which may be useful for weeding out SNPs that are unlikely to be useful in a population SNP screen, based on frequency differences between two populations. Included in this discussion is an application of the algorithm to finding SNPs on chromosome 4, approximately 1MB apart, that are likely to be informative for ancestry. While the program is useful in its current form, a few cautions relating to its prototype / test nature should be taken into account.

1 Introduction

An increasingly large number of Single Nucleotide Polymorphisms (SNPs) are being found in the human genome, with the help of large-scale genotyping projects such as HapMap (The International HapMap Consortium, 2003). However, due to cost constraints, most studies will need to select a much smaller number of SNPs, preferably ones with a high information content.

A Java application, SNPBlaster, has been designed that attempts to remove SNPs from a large set, retaining only those SNPs that would be useful or informative. The program combines positional information with an arbitrary (user-defined) information measure, referred to in this paper as a weighting factor. It is assumed that this measure has already been calculated at each SNP; the program uses that measure to determine, within a small region of the chromosome (the window), which SNPs should be removed. This paper gives an example of the use of SNPBlaster using population differences as the measure, with a window size of 1MB.

2 Algorithm

2.1 Summary of the algorithm

Figure 1: A hypothetical situation in which a number of equally weighted markers (labelled a to o) are given to the SNPBlaster program for thinning. With the given window size, the program will remove markers b, c, e, g, i, j, k, l and o.

The algorithm begins with a group of SNPs with a high weighting factor, removing those that are closer than a certain distance (the window size) to other SNPs. If more than one SNP could be removed, the algorithm attempts to remove the SNP whose removal results in a better distribution of SNPs, such that the variance in distance between SNPs is low. Once this group is dealt with, the algorithm locks the surviving SNPs so that they cannot be removed, then continues with the next group of SNPs, until all SNPs have been tested for potential removal from the marker set.

2.2 Detail

The SNPBlaster program uses an O(n) iterative algorithm to select SNPs from a list for a given window size, given each SNP's location and an associated weighting factor. The algorithm begins with an empty working list of SNPs and, at the beginning of each iteration, moves a group of SNPs from the complete list into the working list. Each group consists of every SNP still in the complete list that has a weight factor greater than or equal to the threshold value (which starts at 1.0, and is reduced by 0.1 after each iterative step). After each step, the SNPs that remain in the working list are locked, so that they will not be removed even if another SNP that is more appropriate in terms of location is found later, the assumption being that a high weight factor is more important than the location of the SNP.

The core part of the algorithm involves a sweep through the chromosome, selecting up to four adjacent SNPs in order to decide which to remove. For the purpose of explanation, they are named p2, p1, c1 and c2 ("previous" and "current"), with the main decision being which of p1 or c1 (if any) to remove. After each decision, the markers are shifted as necessary based on the decision that has been made. Looking at figure 1, it may help to note that p2, p1, c1 and c2 start out as a, b, c and d respectively.

There are a few trivial cases in the process, which are worth pointing out before discussing the more complex cases. Firstly, if p1 and c1 are further apart than the window size, and p1 and p2 are also further apart than the window size, then no SNP removal is carried out. Also, no removal is carried out if the closest markers (within a window) are locked, a situation that should only occur if the SNPs are explicitly locked by the user.

In cases where p1 is closer than the window size to either c1 or p2 (and at least one of the pair is unlocked), a SNP will be removed. If one SNP is locked, then the other will be removed (with a decision being made on the p1, c1 pair first, if possible). If c1 and p1 are closer than the window size to each other, and neither is locked, then the SNP that is removed will be the one whose removal results in the most even spread between the remaining three (of the four) markers. In the example in figure 1, b will be the first marker removed, because the distance between a and b is less than the distance between c and d.

3 Application

The following details an example application of the SNPBlaster algorithm to SNPs recorded in the HapMap database (The International HapMap Consortium, 2003).
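Before turning to the worked example, the procedure of section 2 can be sketched in code. The following Python is only an illustration written for this discussion, not the SNPBlaster source (which is in Java): the names Snp, sweep, unevenness and select_snps are hypothetical, the threshold is assumed to run from 1.0 down to 0.0 in steps of 0.1, and the handling of a close pair at either end of the chromosome (where fewer than four markers are available) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    name: str
    pos: int            # chromosome position in base pairs
    weight: float       # precomputed information measure in [0, 1]
    locked: bool = False


def unevenness(left, mid, right):
    """How unequal the two gaps around `mid` are (0 means evenly spread)."""
    return abs((mid - left) - (right - mid))


def sweep(markers, window):
    """One left-to-right pass removing unlocked markers closer than `window`."""
    m = list(markers)
    i = 1
    while i < len(m):
        p1, c1 = m[i - 1], m[i]
        too_close = c1.pos - p1.pos < window
        if not too_close or (p1.locked and c1.locked):
            i += 1                      # nothing to remove for this pair
        elif p1.locked:
            del m[i]                    # c1 is the only removable marker
        elif c1.locked:
            del m[i - 1]                # p1 is the only removable marker
        elif i >= 2 and i + 1 < len(m):
            # Both unlocked: keep whichever leaves p2, survivor, c2 most even.
            p2, c2 = m[i - 2], m[i + 1]
            without_p1 = unevenness(p2.pos, c1.pos, c2.pos)
            without_c1 = unevenness(p2.pos, p1.pos, c2.pos)
            del m[i - 1 if without_p1 < without_c1 else i]
        else:
            del m[i]                    # assumption: at an end, keep the earlier marker
    return m


def select_snps(snps, window, steps=10):
    """Thin `snps` group by group, from the highest weights to the lowest."""
    remaining = sorted(snps, key=lambda s: s.pos)
    working = []
    for k in range(steps, -1, -1):      # thresholds 1.0, 0.9, ..., 0.0
        threshold = k / steps
        group = [s for s in remaining if s.weight >= threshold]
        remaining = [s for s in remaining if s.weight < threshold]
        working = sweep(sorted(working + group, key=lambda s: s.pos), window)
        for s in working:
            s.locked = True             # survivors cannot be removed later
    return working
```

Calling select_snps with a list of Snp records and a window of 1_000_000 returns the retained markers in position order; note that the sketch sets the locked flag on the records it is given. Deleting from a Python list is not constant time, so this sketch makes no claim to match the running time reported for SNPBlaster.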

3.1 Preparation

All available non-redundant SNPs for chromosome 4 as at 15 August 2005 were loaded into a database. The difference in genotype frequency for the non-reference ("rare") alleles between the CEU population (Utah residents with ancestry from northern and western Europe) and the CHB population (Han Chinese in Beijing, China) was used to weight the usefulness of the alleles: a value of 1 indicated a 100% difference in allele frequencies, while 0 indicated no difference in allele frequencies (a more rigorous study might take the minimum difference over all population combinations). The rs#, chromosome position and frequency difference for each SNP were exported into a text file, and the SNPBlaster header (which gives the chromosome length in base pairs) was added to prepare the file for the program.

3.2 Running

Figure 2: A plot of the SNPs on chromosome 4 chosen by SNPBlaster that appear to be informative for ancestry. The measure (given on the y axis) is the allele frequency difference between two populations, labelled CEU and CHB in the HapMap project.

After the input file was prepared, the program was run on it with a window size of 1MB. From the input markers, an output file containing 146 markers was generated, giving an average SNP separation distance of about 1.35MB. A graphical representation of the SNPs that were selected is shown in figure 2.

3.3 Speed

Loading all the HapMap information for chromosome 4 into the database took by far the longest time. The process took approximately 17 minutes, but it may be possible to reduce this to around 2-5 minutes depending on the database format and the program used to import the data. In comparison, the SNPBlaster algorithm was significantly faster, typically taking 2-5 seconds for the loading process, and a similar time for the iterative algorithm. This suggests that the algorithm is reasonably fast, and is likely to be very useful for carrying out SNP selection tasks in the future.

4 Discussion

4.1 Distance-based information

[Something about crossover etc, centimorgans]

One paper has decided that a crossover frequency of 1% is sufficient for detecting recent population structure. This corresponds to a base pair distance of approximately one megabase [is it reasonable to say that we can treat the mutation rate as effectively zero? does the mutation rate matter? does it remove information at a SNP, or add information to a SNP?].

Regardless of the approach, it is reasonable to assume that within a certain distance, two SNP mutations are likely to have a high degree of linkage: a specific mutation in one SNP will almost always correspond to a specific mutation in the other SNP. For this reason, it is not useful to record mutations at both of those SNPs in a study. This minimum allowed distance between SNPs will be referred to as the window size.

The more SNPs that are typed, the greater the cost per individual, meaning that fewer individuals can be typed for the same amount of money. With a large window size, the cost per individual will be low, but there is also likely to be a loss of information, because some of the variation will be missed by not typing enough SNPs.
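To make this trade-off concrete, the arithmetic can be sketched as follows. This is a rough illustration only: the function name individuals_affordable, the budget and the cost per genotype are hypothetical values chosen for the example and are not taken from the study, and the chromosome length of 200MB is a round figure of the order of chromosome 4.

```python
def individuals_affordable(budget, cost_per_genotype, chrom_length, mean_spacing):
    """Return (panel size, individuals that can be typed) for one chromosome."""
    # Approximate panel size: one SNP per `mean_spacing` base pairs.
    n_snps = max(1, round(chrom_length / mean_spacing))
    # Each individual costs one genotype per SNP in the panel.
    return n_snps, int(budget // (n_snps * cost_per_genotype))


CHROM_LENGTH = 200_000_000   # hypothetical round figure, in base pairs
for spacing in (500_000, 1_000_000, 2_000_000):
    n_snps, people = individuals_affordable(100_000, 0.5, CHROM_LENGTH, spacing)
    print(f"spacing {spacing:>9,} bp: {n_snps:>3} SNPs, {people:>5} individuals")
```

Under these assumed figures, doubling the spacing halves the panel and doubles the number of individuals that can be typed, which is the cost side of the trade-off; the sketch says nothing about how much information is lost.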

4.2 Measure-based information

Another approach to determining the information derived from specific SNPs is to apply some function to each SNP that gives an idea of its information content. This function, regardless of its method, will typically identify the SNPs that will be the most useful in any investigation. It would be expected that such a function would also be able to choose which of two SNPs would be more appropriate for an investigation.

4.2.1 Population differences

One way to estimate the information content with respect to the ancestry of an individual is to determine differences between populations at specific SNPs. An example of this (the example used in this paper) is a comparison of the allele frequency differences of a specific mutation of a SNP. The reasoning behind this is as follows: if a certain mutation is always present in one population (or more correctly a small subset of that population), but never present in another population, then that SNP will be informative in determining the proportion of ancestry [or something similar] that an unknown individual has relative to those population groups. In this sense, a SNP with a high frequency difference between populations will be considered useful, and one that has a low difference between populations will not be useful.

Other methods for determining the information content of SNPs are available. A reader who is interested in these may like to read [cite some relevant papers].

4.3 Binocular vision

There are issues involved with choosing only one of these two information procedures in SNP selection. Working purely on distance-based information is likely to mean that some of the SNPs that are chosen will not be informative enough for the investigation, even if a more appropriate SNP is available nearby. Working purely on measure-based information may mean that some variation will be missed, and there may be a lot of unnecessary content.

To explain this further, if many highly informative SNPs were in a single area of the chromosome (none further apart from each other than the window size), then all the SNPs would be expected to carry linked mutations: any one of them could be used to infer the mutations at the other locations. In addition, some parts of the chromosome may be missed out, because those parts only have SNPs with a very low measure-based information content.

It is probably worth noting that some SNP information measures will also take into account local distance information. However, the algorithms used to generate these measures may be very processor intensive, because they require recalculation after each SNP removal. The SNPBlaster algorithm gets around the issue of complexity due to recalculation by grouping SNPs of similar measure together, and from then on using distance methods to select SNPs. While the measure is the key factor in determining which SNP is chosen locally, the window size is typically more important at the whole-chromosome level.

4.4 Similar programs

Another application, CHOISS (Lee and Kang, 2004), is currently available to carry out a similar SNP selection process, choosing SNPs (either by stating an interval, or by stating the required number of SNPs) to minimise variance. An attempt to use the web-based version of CHOISS with a data set derived from chromosome 4 (approximately 50,000 markers) ended with the web interface timing out at fifteen minutes.

The algorithmic complexity of CHOISS is reported to be O(n^2), while the complexity of SNPBlaster is O(n). For small to medium numbers of SNPs (possibly up to around 5000), this may not make a significant difference, but above that level, the solution time for SNPBlaster is likely to be significantly less than that of CHOISS. However, there may be situations in which CHOISS is more appropriate, most likely those situations where a rough guess at the best SNPs is not appropriate.

After downloading CHOISS, it was noticed that the algorithm did in fact run in a reasonably short period of time (2 minutes, compared with around 5 seconds for SNPBlaster). However, the output was not what was expected. When SNPs in the input file were in a random order, the CHOISS algorithm selected approximately 2000 SNPs, with a reported average distance of 1MB for a chromosome of total length around 200MB. When the SNPs were sorted, the algorithm did not select any SNPs. It is likely that this behaviour had more to do with range overflows (i.e. numbers being too large) than with problems in the algorithm itself, but it will be difficult to work out for sure without a more thorough analysis of the program. [I should probably contact the authors then]

4.5 Caveats

While the algorithm as described is useful for quick large-scale selection of SNPs, there are a number of factors that may make it less reliable than would be expected.

At the moment, the program will only work for SNPs on a single chromosome. If more chromosomes are desired (as would be expected for a full genome selection process), then it is necessary to work on each chromosome individually.

The algorithm is iterative, but does not have soft divisions between different weight factors. This means that SNPs with a weighting at the low end of a grouping (e.g. 0.91) are considered to be just as important as SNPs at the high end of that grouping (e.g. 0.99). In some cases, this may be alleviated by increasing the number of iterative divisions in the algorithm, but this would remove the linking between similar weights. An alternative procedure would attempt to determine a relationship between the distance between markers and the weighting factor, although it is likely that such a procedure could only be applied to specific types of studies. Such an approach will certainly increase the amount of processing required (as all SNPs within a window will need to be tested, rather than just the closest), but as long as the window size is small, this increase is unlikely to make the algorithm unmanageably slow.

The window size defined in the algorithm refers to the minimum allowed distance between SNPs. If there is enough coverage, then the maximum distance between SNPs will be just under twice the window size. Knowledge of this variation in SNP distance may be important in choosing a window size for the algorithm, because the average SNP distance is likely to be closer to 1.5 times the window size.
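As a quick check on this relationship, the spacing of a selected panel can be summarised as multiples of the window size. The sketch below is an illustration only; the helper name gap_summary and the example positions are hypothetical and not SNPBlaster output. For comparison, the chromosome 4 run in section 3.2 reported an average separation of about 1.35MB for a 1MB window.

```python
def gap_summary(positions, window):
    """Summarise gaps between adjacent selected SNPs as multiples of `window`."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "min": min(gaps) / window,                # expected to be at least 1
        "mean": sum(gaps) / len(gaps) / window,   # between 1 and 2 with enough coverage
        "max": max(gaps) / window,                # below 2 with enough coverage
    }


# Hypothetical positions (in base pairs) of four selected markers.
summary = gap_summary([0, 1_000_000, 2_900_000, 4_200_000], window=1_000_000)
print(summary)
```

A maximum ratio of 2 or more in such a summary would indicate a region where no candidate SNP was available, rather than a failure of the minimum-distance rule.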
An increase in the number of iterative divisions is likely to increase the average distance between SNPs, as it is less likely that two SNPs within one of the weight ranges will be close to each other.

4.6 Other Applications

While the algorithm was designed for the purpose of removing SNPs in order to obtain a panel of useful markers for further studies, attempts have been made to make the algorithm as generic as possible. There is potential for similar processes to be carried out in other areas: anywhere a well-distributed selection is required along a linear track, and there are a limited number of choices along that track.

References

Lee, S. and Kang, C. (2004). CHOISS for selection of single nucleotide polymorphism markers on interval regularity, Bioinformatics 20(4).

Liu, Z. and Lin, S. (2005). Multilocus LD measure and tagging SNP selection with generalized mutual information, Genetic Epidemiology ?(?): 1-10?

Sebastiani, P., Lazarus, R., Weiss, S. T., Kunkel, L. M., Kohane, I. S. and Ramoni, M. F. (2003). Minimal haplotype tagging, PNAS 100(17).

The International HapMap Consortium (2003). The International HapMap Project, Nature 426.


UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Genomes and SNPs in Malaria and Sickle Cell Anemia

Genomes and SNPs in Malaria and Sickle Cell Anemia Genomes and SNPs in Malaria and Sickle Cell Anemia Introduction to Genome Browsing with Ensembl Ensembl The vast amount of information in biological databases today demands a way of organising and accessing

More information

Using a Genetic Algorithm to Solve Crossword Puzzles. Kyle Williams

Using a Genetic Algorithm to Solve Crossword Puzzles. Kyle Williams Using a Genetic Algorithm to Solve Crossword Puzzles Kyle Williams April 8, 2009 Abstract In this report, I demonstrate an approach to solving crossword puzzles by using a genetic algorithm. Various values

More information

MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

The Functional but not Nonfunctional LILRA3 Contributes to Sex Bias in Susceptibility and Severity of ACPA-Positive Rheumatoid Arthritis

The Functional but not Nonfunctional LILRA3 Contributes to Sex Bias in Susceptibility and Severity of ACPA-Positive Rheumatoid Arthritis The Functional but not Nonfunctional LILRA3 Contributes to Sex Bias in Susceptibility and Severity of ACPA-Positive Rheumatoid Arthritis Yan Du Peking University People s Hospital 100044 Beijing CHINA

More information

Population 1 Population 2. A a A a p 1. 1-m m m 1-m. A a A a. ' p 2

Population 1 Population 2. A a A a p 1. 1-m m m 1-m. A a A a. ' p 2 Gene Flow Up to now, we have dealt with local populations in which all individuals can be viewed as sharing a common system of mating. But in many species, the species is broken up into many local populations

More information

A very brief introduction to genetic algorithms

A very brief introduction to genetic algorithms A very brief introduction to genetic algorithms Radoslav Harman Design of experiments seminar FACULTY OF MATHEMATICS, PHYSICS AND INFORMATICS COMENIUS UNIVERSITY IN BRATISLAVA 25.2.2013 Optimization problems:

More information

Cloud-Based Big Data Analytics in Bioinformatics

Cloud-Based Big Data Analytics in Bioinformatics Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large

More information

, then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

, then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients ( Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we

More information

Content DESCRIPTIVE STATISTICS. Data & Statistic. Statistics. Example: DATA VS. STATISTIC VS. STATISTICS

Content DESCRIPTIVE STATISTICS. Data & Statistic. Statistics. Example: DATA VS. STATISTIC VS. STATISTICS Content DESCRIPTIVE STATISTICS Dr Najib Majdi bin Yaacob MD, MPH, DrPH (Epidemiology) USM Unit of Biostatistics & Research Methodology School of Medical Sciences Universiti Sains Malaysia. Introduction

More information

An analysis of the 2003 HEFCE national student survey pilot data.

An analysis of the 2003 HEFCE national student survey pilot data. An analysis of the 2003 HEFCE national student survey pilot data. by Harvey Goldstein Institute of Education, University of London h.goldstein@ioe.ac.uk Abstract The summary report produced from the first

More information

10-810 /02-710 Computational Genomics. Clustering expression data

10-810 /02-710 Computational Genomics. Clustering expression data 10-810 /02-710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally,

More information

Cracking the Sudoku: A Deterministic Approach

Cracking the Sudoku: A Deterministic Approach Cracking the Sudoku: A Deterministic Approach David Martin Erica Cross Matt Alexander Youngstown State University Center for Undergraduate Research in Mathematics Abstract The model begins with the formulation

More information

Online Resource 6. Estimating the required sample size

Online Resource 6. Estimating the required sample size Online Resource 6. Estimating the required sample size Power calculations help program managers and evaluators estimate the required sample size that is large enough to provide sufficient statistical power

More information

DNA PHENOTYPING: PREDICTING ANCESTRY AND PHYSICAL APPEARANCE FROM FORENSIC DNA

DNA PHENOTYPING: PREDICTING ANCESTRY AND PHYSICAL APPEARANCE FROM FORENSIC DNA DNA PHENOTYPING: PREDICTING ANCESTRY AND PHYSICAL APPEARANCE FROM FORENSIC DNA Ellen McRae Greytak, PhD* and Steven Armentrout, PhD Parabon NanoLabs, Inc., 11260 Roger Bacon Dr., Suite 406, Reston, VA

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

Research Variables. Measurement. Scales of Measurement. Chapter 4: Data & the Nature of Measurement

Research Variables. Measurement. Scales of Measurement. Chapter 4: Data & the Nature of Measurement Chapter 4: Data & the Nature of Graziano, Raulin. Research Methods, a Process of Inquiry Presented by Dustin Adams Research Variables Variable Any characteristic that can take more than one form or value.

More information

Asexual Versus Sexual Reproduction in Genetic Algorithms 1

Asexual Versus Sexual Reproduction in Genetic Algorithms 1 Asexual Versus Sexual Reproduction in Genetic Algorithms Wendy Ann Deslauriers (wendyd@alumni.princeton.edu) Institute of Cognitive Science,Room 22, Dunton Tower Carleton University, 25 Colonel By Drive

More information

Minitab Guide. This packet contains: A Friendly Guide to Minitab. Minitab Step-By-Step

Minitab Guide. This packet contains: A Friendly Guide to Minitab. Minitab Step-By-Step Minitab Guide This packet contains: A Friendly Guide to Minitab An introduction to Minitab; including basic Minitab functions, how to create sets of data, and how to create and edit graphs of different

More information

Lab 4: 26 th March 2012. Exercise 1: Evolutionary algorithms

Lab 4: 26 th March 2012. Exercise 1: Evolutionary algorithms Lab 4: 26 th March 2012 Exercise 1: Evolutionary algorithms 1. Found a problem where EAs would certainly perform very poorly compared to alternative approaches. Explain why. Suppose that we want to find

More information

High Throughput Testing (HTT) Overview of Pro-Test and Praxis

High Throughput Testing (HTT) Overview of Pro-Test and Praxis High Throughput Testing (HTT) Overview of Pro-Test and Praxis HTT Overview High Throughput Testing (HTT) is a new technology which provides a solution to the problem of excessive test cases and/or poorly

More information

The effect of population history on the distribution of the Tajima s D statistic

The effect of population history on the distribution of the Tajima s D statistic The effect of population history on the distribution of the Tajima s D statistic Deena Schmidt and John Pool May 17, 2002 Abstract The Tajima s D test measures the allele frequency distribution of nucleotide

More information

1 One Dimensional Horizontal Motion Position vs. time Velocity vs. time

1 One Dimensional Horizontal Motion Position vs. time Velocity vs. time PHY132 Experiment 1 One Dimensional Horizontal Motion Position vs. time Velocity vs. time One of the most effective methods of describing motion is to plot graphs of distance, velocity, and acceleration

More information

Assessment Schedule 2013 Biology: Demonstrate understanding of genetic variation and change (91157)

Assessment Schedule 2013 Biology: Demonstrate understanding of genetic variation and change (91157) NCEA Level 2 Biology (91157) 2013 page 1 of 5 Assessment Schedule 2013 Biology: Demonstrate understanding of genetic variation and change (91157) Assessment Criteria with with Excellence Demonstrate understanding

More information

A Robust Method for Solving Transcendental Equations

A Robust Method for Solving Transcendental Equations www.ijcsi.org 413 A Robust Method for Solving Transcendental Equations Md. Golam Moazzam, Amita Chakraborty and Md. Al-Amin Bhuiyan Department of Computer Science and Engineering, Jahangirnagar University,

More information

Genetic diagnostics the gateway to personalized medicine

Genetic diagnostics the gateway to personalized medicine Micronova 20.11.2012 Genetic diagnostics the gateway to personalized medicine Kristiina Assoc. professor, Director of Genetic Department HUSLAB, Helsinki University Central Hospital The Human Genome Packed

More information

Hints for Success on the AP Statistics Exam. (Compiled by Zack Bigner)

Hints for Success on the AP Statistics Exam. (Compiled by Zack Bigner) Hints for Success on the AP Statistics Exam. (Compiled by Zack Bigner) The Exam The AP Stat exam has 2 sections that take 90 minutes each. The first section is 40 multiple choice questions, and the second

More information

Sample Size Determination

Sample Size Determination Sample Size Determination Population A: 10,000 Population B: 5,000 Sample 10% Sample 15% Sample size 1000 Sample size 750 The process of obtaining information from a subset (sample) of a larger group (population)

More information

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1 CAP 5510-8 BIOINFORMATICS Su-Shing Chen CISE 10/5/2005 Su-Shing Chen, CISE 1 Genomic Mapping & Mapping Databases High resolution, genome-wide maps of DNA markers. Integrated maps, genome catalogs and comprehensive

More information

SNP Data Integration and Analysis for Drug- Response Biomarker Discovery

SNP Data Integration and Analysis for Drug- Response Biomarker Discovery B. Comp Dissertation SNP Data Integration and Analysis for Drug- Response Biomarker Discovery By Chen Jieqi Pauline Department of Computer Science School of Computing National University of Singapore 2008/2009

More information

(Refer Slide Time: 00:00:56 min)

(Refer Slide Time: 00:00:56 min) Numerical Methods and Computation Prof. S.R.K. Iyengar Department of Mathematics Indian Institute of Technology, Delhi Lecture No # 3 Solution of Nonlinear Algebraic Equations (Continued) (Refer Slide

More information

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Journal of Al-Nahrain University Vol.15 (2), June, 2012, pp.161-168 Science Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm Manal F. Younis Computer Department, College

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

Assessment Schedule 2014 Biology: Demonstrate understanding of genetic variation and change (91157) Evidence Statement

Assessment Schedule 2014 Biology: Demonstrate understanding of genetic variation and change (91157) Evidence Statement NCEA Level 2 Biology (91157) 2014 page 1 of 5 Assessment Schedule 2014 Biology: Demonstrate understanding of genetic variation and change (91157) Evidence Statement NCEA Level 2 Biology (91157) 2014 page

More information

Genetic Drift Simulation. Experimental Question: How do random events cause evolution (a change in the gene pool)?

Genetic Drift Simulation. Experimental Question: How do random events cause evolution (a change in the gene pool)? Genetic Drift Simulation Experimental Question: How do random events cause evolution (a change in the gene pool)? Hypothesis: Introduction: What is Genetic Drift? Let's examine a simple model of a population

More information

Monotone Partitioning. Polygon Partitioning. Monotone polygons. Monotone polygons. Monotone Partitioning. ! Define monotonicity

Monotone Partitioning. Polygon Partitioning. Monotone polygons. Monotone polygons. Monotone Partitioning. ! Define monotonicity Monotone Partitioning! Define monotonicity Polygon Partitioning Monotone Partitioning! Triangulate monotone polygons in linear time! Partition a polygon into monotone pieces Monotone polygons! Definition

More information

COMPLEX GENETIC DISEASES

COMPLEX GENETIC DISEASES COMPLEX GENETIC DISEASES Date: Sept 28, 2005* Time: 9:30 am 10:20 am* Room: G-202 Biomolecular Building Lecturer: David Threadgill 4340 Biomolecular Building dwt@med.unc.edu Office Hours: by appointment

More information

One-Sample t-test. Example 1: Mortgage Process Time. Problem. Data set. Data collection. Tools

One-Sample t-test. Example 1: Mortgage Process Time. Problem. Data set. Data collection. Tools One-Sample t-test Example 1: Mortgage Process Time Problem A faster loan processing time produces higher productivity and greater customer satisfaction. A financial services institution wants to establish

More information

Application Note. Introduction AN2395/D 12/2002. PC Master Software Usage

Application Note. Introduction AN2395/D 12/2002. PC Master Software Usage Application Note 12/2002 PC Master Software Usage By Milan Brejl and Pavel Kania S 3 L Applications Engineerings MCSL Roznov pod Radhostem Introduction The PC master software is a PC Windows -based application

More information

Confidence Intervals for One Standard Deviation Using Standard Deviation

Confidence Intervals for One Standard Deviation Using Standard Deviation Chapter 640 Confidence Intervals for One Standard Deviation Using Standard Deviation Introduction This routine calculates the sample size necessary to achieve a specified interval width or distance from

More information

CASSI: Genome-Wide Interaction Analysis Software

CASSI: Genome-Wide Interaction Analysis Software CASSI: Genome-Wide Interaction Analysis Software 1 Contents 1 Introduction 3 2 Installation 3 3 Using CASSI 3 3.1 Input Files................................... 4 3.2 Options....................................

More information

Use Excel to Analyse Data. Use Excel to Analyse Data

Use Excel to Analyse Data. Use Excel to Analyse Data Introduction This workbook accompanies the computer skills training workshop. The trainer will demonstrate each skill and refer you to the relevant page at the appropriate time. This workbook can also

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

SOP 3 v2: web-based selection of oligonucleotide primer trios for genotyping of human and mouse polymorphisms

SOP 3 v2: web-based selection of oligonucleotide primer trios for genotyping of human and mouse polymorphisms W548 W552 Nucleic Acids Research, 2005, Vol. 33, Web Server issue doi:10.1093/nar/gki483 SOP 3 v2: web-based selection of oligonucleotide primer trios for genotyping of human and mouse polymorphisms Steven

More information

Phasing the Chromosomes of a Family Group When One Parent is Missing

Phasing the Chromosomes of a Family Group When One Parent is Missing Journal of Genetic Genealogy, 6(1), 2010 Phasing the Chromosomes of a Family Group When One Parent is Missing T. Whit Athey Abstract A technique is presented for the phasing of sets of SNP data collected

More information

Expected values, standard errors, Central Limit Theorem. Statistical inference

Expected values, standard errors, Central Limit Theorem. Statistical inference Expected values, standard errors, Central Limit Theorem FPP 16-18 Statistical inference Up to this point we have focused primarily on exploratory statistical analysis We know dive into the realm of statistical

More information

3 An Illustrative Example

3 An Illustrative Example Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent

More information

Using CrunchIt (http://bcs.whfreeman.com/crunchit/bps4e) or StatCrunch (www.calvin.edu/go/statcrunch)

Using CrunchIt (http://bcs.whfreeman.com/crunchit/bps4e) or StatCrunch (www.calvin.edu/go/statcrunch) Using CrunchIt (http://bcs.whfreeman.com/crunchit/bps4e) or StatCrunch (www.calvin.edu/go/statcrunch) 1. In general, this package is far easier to use than many statistical packages. Every so often, however,

More information

CCCR Outreach FAQ and User Manual

CCCR Outreach FAQ and User Manual CCCR Outreach FAQ and User Manual Q.1 What is the CCCR Outreach Application used for? The CCCR Outreach Application is a Web interface for displaying data. The CCCR Outreach Application Access enables

More information

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Calculate counts, means, and standard deviations Produce

More information

Y Chromosome Markers

Y Chromosome Markers Y Chromosome Markers Lineage Markers Autosomal chromosomes recombine with each meiosis Y and Mitochondrial DNA does not This means that the Y and mtdna remains constant from generation to generation Except

More information

Week 4: Standard Error and Confidence Intervals

Week 4: Standard Error and Confidence Intervals Health Sciences M.Sc. Programme Applied Biostatistics Week 4: Standard Error and Confidence Intervals Sampling Most research data come from subjects we think of as samples drawn from a larger population.

More information