SNP and destroy - a discussion of a weighted distance-based SNP selection algorithm


David A. Hall, Rodney A. Lea
November 14, 2005

Abstract

Recent developments in bioinformatics have introduced a number of methods for quickly typing a large number of Single Nucleotide Polymorphisms (SNPs) in the human genome. Due to time and cost constraints, carrying out similar typing in large populations may not be viable. Furthermore, because of linkage between SNPs, typing done at one location may be fully predictive of typing carried out at an adjacent SNP. A Java program (SNPBlaster) has been developed that carries out a weighted iterative SNP selection procedure, which may be useful for weeding out SNPs that are unlikely to be useful in a population SNP screen, based on frequency differences between two populations. Included in this discussion is an application of the algorithm to finding SNPs on chromosome 4, approximately 1 Mb apart, that are likely to be informative for ancestry. While the program is useful in its current form, a few cautions related to its prototype / test nature should be taken into account.

1 Introduction

An increasingly large number of Single Nucleotide Polymorphisms (SNPs) are being found in the human genome, with the help of large-scale genotyping projects such as HapMap (The International HapMap Consortium, 2003). However, due to cost constraints, most studies will need to select a much smaller number of SNPs, preferably ones with a high information content. A Java application, SNPBlaster, has been designed that attempts to remove SNPs from a large set, retaining only those that would be useful or informative. The program combines positional information with an arbitrary (user-defined) information measure, referred to in this paper as a weighting factor. It is assumed that this measure has already been calculated for each SNP; the program uses it to determine, within a small region of the chromosome (the window), which SNPs should be removed. This paper gives an example of the use of SNPBlaster using population differences as the measure, with a window size of 1 Mb.

2 Algorithm

2.1 Summary of the algorithm

Figure 1: A hypothetical situation in which a number of equally weighted markers (a to o) are given to the SNPBlaster program for thinning. With the given window size, SNPBlaster removes markers b, c, e, g, i, j, k, l and o.

The algorithm begins with a group of SNPs with a high weighting factor ( ), removing those that are closer than a certain distance (the window size) to other SNPs. If more than one SNP could be removed, the algorithm attempts to remove the SNP whose removal gives a better distribution of SNPs, such that the variance in distance between SNPs is low. Once this group has been dealt with, the algorithm locks the remaining SNPs so that they cannot be removed, then continues with the next group of SNPs, until all SNPs have been tested for potential removal from the marker set.

2.2 Detail

The SNPBlaster program uses an O(n) iterative algorithm to select SNPs from a list for a given window size, given each SNP's location and associated weighting factor. The algorithm begins with an empty working list of SNPs and, at the beginning of each iteration, moves SNPs from the complete list into the working list. Each group consists of every SNP still in the complete list that has a weight factor greater than or equal to the threshold value (which starts at 1.0 and is reduced by 0.1 after each iterative step). After each step, the SNPs that remain in the working list are locked, so that they will not be removed even if another SNP that is more appropriate in terms of location is found later; the assumption is that a high weight factor is more important than the location of the SNP.

The core part of the algorithm is a sweep through the chromosome, selecting up to four adjacent SNPs in order to decide which to remove. For the purpose of explanation, they are named p2, p1, c1 and c2 ("previous" and "current"), with the main decision being which of p1 or c1 (if any) to remove. After each decision, the markers are shifted as necessary based on the decision that has been made. Looking at Figure 1, it may help to note that p2, p1, c1 and c2 start out as a, b, c and d respectively.

There are a few trivial cases in the process, which are worth pointing out before discussing the more complex cases. Firstly, if p1 and c1 are further apart than the window size, and p1 and p2 are also further apart than the window size, then no SNP removal is carried out. Also, no removal is carried out if the closest markers (within a window) are locked, a situation that should only occur if the SNPs are explicitly locked by the user.

In cases where p1 is closer than the window size to either c1 or p2 (and at least one of the pair is unlocked), a SNP will be removed. If one SNP is locked, then the other will be removed (with a decision being made on the p1, c1 pair first, if possible). If c1 and p1 are closer than the window size to each other, and neither is locked, then the SNP that is removed will be the one whose removal results in the most even spread between the remaining three (of the four) markers. In the example in Figure 1, b will be the first marker removed, because the distance between a and b is less than the distance between c and d.

3 Application

The following details an example application of the SNPBlaster algorithm to SNPs recorded in the HapMap database (The International HapMap Consortium, 2003).
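
Before turning to the worked example, the procedure of Section 2.2 can be sketched in code. The following is a simplified Python illustration written for this discussion, not the SNPBlaster Java source: the names `Snp`, `unevenness`, `sweep` and `select_snps` are hypothetical, weights are assumed to lie between 0 and 1, and ties (including the case where p2 or c2 does not exist) are resolved by removing c1, which is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class Snp:
    name: str
    pos: int            # chromosome position in base pairs
    weight: float       # precomputed weighting factor, assumed to lie in [0, 1]
    locked: bool = False


def unevenness(left: Optional[Snp], mid: Snp, right: Optional[Snp]) -> int:
    """Difference between the two gaps around `mid`; 0 if a neighbour is missing."""
    if left is None or right is None:
        return 0
    return abs((mid.pos - left.pos) - (right.pos - mid.pos))


def sweep(working: List[Snp], window: int) -> None:
    """Thin `working` (sorted by position) in place.

    Afterwards, any adjacent pair closer than `window` consists of two locked SNPs.
    """
    i = 1
    while i < len(working):
        p1, c1 = working[i - 1], working[i]
        if c1.pos - p1.pos >= window or (p1.locked and c1.locked):
            i += 1                      # nothing to remove for this pair
            continue
        if p1.locked:
            victim = i                  # only c1 may be removed
        elif c1.locked:
            victim = i - 1              # only p1 may be removed
        else:
            p2 = working[i - 2] if i >= 2 else None
            c2 = working[i + 1] if i + 1 < len(working) else None
            # Remove p1 only if keeping c1 gives a strictly more even spread.
            keep_c1 = unevenness(p2, c1, c2)
            keep_p1 = unevenness(p2, p1, c2)
            victim = i - 1 if keep_c1 < keep_p1 else i
        del working[victim]             # the new pair at index i is checked next


def select_snps(snps: List[Snp], window: int, divisions: int = 10) -> List[Snp]:
    """Threshold-grouped selection: high-weight SNPs are placed and locked first."""
    remaining = [replace(s) for s in snps]        # copies; the input is not mutated
    working: List[Snp] = []
    for k in range(divisions + 1):
        threshold = (divisions - k) / divisions   # 1.0, 0.9, ..., 0.0
        working += [s for s in remaining if s.weight >= threshold]
        remaining = [s for s in remaining if s.weight < threshold]
        working.sort(key=lambda s: s.pos)
        sweep(working, window)
        for s in working:
            s.locked = True                       # survivors cannot be removed later
    return working


example = [Snp("a", 0, 0.95), Snp("b", 150, 0.95), Snp("c", 200, 0.95), Snp("d", 400, 0.95)]
print([s.name for s in select_snps(example, window=100)])
```

In this toy input, b and c are closer than the window; removing b leaves a, c and d evenly spaced, so b is the marker dropped, mirroring the reasoning given for Figure 1.
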

3.1 Preparation

All available non-redundant SNPs for chromosome 4 as at 15 August 2005 were loaded into a database. The difference in genotype frequency for the non-reference ("rare") alleles between the CEU population (Utah residents with ancestry from northern and western Europe) and the CHB population (Han Chinese in Beijing, China) was used to weight the usefulness of the alleles: a value of 1 indicated a 100% difference in allele frequencies, while 0 indicated no difference in allele frequencies (a more rigorous study might take the minimum difference over all population combinations). The rs#, chromosome position and frequency difference for each SNP were exported into a text file, and the SNPBlaster header (with a chromosome length of base pairs) was added to prepare the file for the program.

3.2 Running

Figure 2: A plot of the SNPs on chromosome 4 chosen by SNPBlaster that appear to be informative for ancestry. The measure (given on the y axis) is the allele frequency difference between two populations, labelled CEU and CHB in the HapMap project.

After the input file was prepared, the program was run, loading the prepared input file and setting a window size of 1 Mb. From an input file of approximately markers, an output file containing 146 markers was generated, giving an average SNP separation distance of about 1.35 Mb. A graphical representation of the SNPs that were selected is shown in Figure 2.

3.3 Speed

Loading all the HapMap information for chromosome 4 into the database was by far the slowest step. The process took approximately 17 minutes, but it may be possible to reduce this to around 2-5 minutes depending on the database format and the program used to import the data. In comparison, the SNPBlaster algorithm was significantly faster, typically taking 2-5 seconds for the loading process and a similar time for the iterative algorithm. This suggests that the algorithm is reasonably fast, and is likely to be very useful for carrying out SNP selection tasks in the future.

4 Discussion

4.1 Distance-based information

[Something about crossover etc, centimorgans] One paper has decided that a crossover frequency of 1% is sufficient for detecting recent population structure. This corresponds to a base pair distance of approximately one megabase [is it reasonable to say that we can treat the mutation rate as effectively zero? does the mutation rate matter? does it remove information at a SNP, or add information to a SNP?].

Regardless of the approach, it is reasonable to assume that within a certain distance, two SNP mutations are likely to have a high degree of linkage: a specific mutation in one SNP will almost always correspond to a specific mutation in the other SNP. For this reason, it is not useful to record mutations at both of those SNPs in a study. This minimum allowed distance between SNPs will be referred to as the window size.

The more SNPs that are typed, the greater the cost per individual, meaning that fewer individuals can be typed for the same amount of money. With a large window size, the cost per individual will be low, but there is also likely to be a loss of information, because some of the variation will be missed by not typing enough SNPs.

4.2 Measure-based information

Another approach to determining the information derived from specific SNPs is to apply some function to each SNP that gives an idea of the information content of that SNP. This function, regardless of its method, will typically identify the SNPs that will be the most useful in any investigation. It would be expected that such a function would also be able to choose which of two SNPs is more appropriate for an investigation.

4.2.1 Population differences

One way to estimate the information content with respect to the ancestry of an individual is to determine differences between populations at specific SNPs. An example of this (the example used in this paper) is a comparison of the allele frequency differences of a specific mutation of a SNP. The reasoning behind this is as follows: if a certain mutation is always present in one population (or, more correctly, a small subset of that population), but never present in another population, then that SNP will be informative in determining the proportion of ancestry [or something similar] that an unknown individual has relative to those population groups. In this sense, a SNP with a high frequency difference between populations will be considered useful, and one with a low difference between populations will not.

Other methods for determining the information content of SNPs are available. A reader who is interested in these may like to read [cite some relevant papers].

4.3 Binocular vision

There are issues involved with choosing only one of these two information procedures in SNP selection. Working purely on distance-based information is likely to mean that some of the SNPs that are chosen will not be informative enough for the investigation, even if a more appropriate SNP is available nearby. Working purely on measure-based information may mean that some variation will be missed, and there may be a lot of unnecessary content.
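
The contrast between the two single-criterion approaches can be made concrete with a small Python sketch. Both selectors below are deliberately naive stand-ins invented for illustration (they are not part of SNPBlaster), and the tuple layout and toy values are hypothetical.

```python
from typing import List, Tuple

Snp = Tuple[str, int, float]  # (name, position, weight); a hypothetical layout


def select_by_distance(snps: List[Snp], window: int) -> List[Snp]:
    """Keep the first SNP, then each SNP at least `window` past the last kept one.

    Weights are ignored entirely.
    """
    kept: List[Snp] = []
    for snp in sorted(snps, key=lambda s: s[1]):
        if not kept or snp[1] - kept[-1][1] >= window:
            kept.append(snp)
    return kept


def select_by_measure(snps: List[Snp], min_weight: float) -> List[Snp]:
    """Keep every SNP whose weight reaches `min_weight`; positions are ignored."""
    return [s for s in sorted(snps, key=lambda s: s[1]) if s[2] >= min_weight]


toy = [("s1", 0, 0.20), ("s2", 40, 0.90), ("s3", 60, 0.95), ("s4", 300, 0.10)]
distance_only = select_by_distance(toy, window=100)
measure_only = select_by_measure(toy, min_weight=0.9)
print([s[0] for s in distance_only], [s[0] for s in measure_only])
```

With these toy values, the distance-only selector keeps s1 and s4, passing over the two informative SNPs that sit right next to s1, while the measure-only selector keeps s2 and s3, which are only 20 positions apart and leave the far end of the region uncovered.
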
To further explain this, if many highly informative SNPs were in a single area of the chromosome (none further apart from each other than the window size), then all the SNPs would be expected to carry linked mutations: any one of them could be used to infer the mutations at the other locations. In addition, some parts of the chromosome may be missed, because those parts only have SNPs with a very low measure-based information content.

It is probably worth noting that some SNP information measures will also take into account local distance information. However, the algorithms used to generate these measures may be very processor intensive, because they require recalculation after each SNP removal. The SNPBlaster algorithm gets around the complexity of recalculation by grouping SNPs of the same measure together, and from then on using distance methods to select SNPs. While the measure is the key factor in determining which SNP is chosen locally, the window size is typically more important at the whole-chromosome level.

4.4 Similar programs

Another application, CHOISS (Lee and Kang, 2004), is currently available to carry out a similar SNP selection process, choosing SNPs (either by stating an interval, or by stating the required number of SNPs) to minimise variance. When the web-based version of CHOISS was tried with a data set derived from chromosome 4 (approximately 50,000 markers), the web interface timed out at fifteen minutes. The algorithmic complexity of CHOISS is reported to be O(n^2), while the complexity of SNPBlaster is O(n). For small to medium numbers of SNPs (possibly up to around 5000), this may not make a significant difference, but above that level, the solution time for SNPBlaster is likely to be significantly less than that of CHOISS. However, there may be situations in which CHOISS is more appropriate, most likely those where a rough guess at the best SNPs is not sufficient.

After downloading CHOISS, it was noticed that the algorithm did in fact run in a reasonably short period of time (2 minutes, compared with around 5 seconds for SNPBlaster). However, the output was not what was expected.
When SNPs in the input file were in a random order, the CHOISS algorithm selected approximately 2000 SNPs, with a reported average distance of 1 Mb for a chromosome of total length around 200 Mb. When the SNPs were sorted, the algorithm did not select any SNPs. It is likely that this behaviour had more to do with range overflows (i.e. numbers being too large) than with problems in the algorithm itself, but it will be difficult to work out for sure without a more thorough analysis of the program. [I should probably contact the authors then]

4.5 Caveats

While the algorithm as described is useful for quick large-scale selection of SNPs, there are a number of factors that may make it less reliable than would be expected.

At the moment, the program will only work for SNPs on a single chromosome. If more chromosomes are desired (as would be expected for a full genome selection process), then it is necessary to work on each chromosome individually.

The algorithm is iterative, but does not have soft divisions between different weight factors. This means that SNPs with a weighting at the low end of a grouping (e.g. 0.91) are considered to be just as important as SNPs at the high end of that grouping (e.g. 0.99). In some cases, this may be alleviated by increasing the number of iterative divisions in the algorithm, but this would remove the linking between similar weights. An alternative procedure would attempt to determine a relationship between the distance between markers and the weighting factor, although it is likely that such a procedure could only be applied to specific types of studies. Such an approach will certainly increase the amount of processing required (as all SNPs within a window will need to be tested, rather than just the closest), but as long as the window size is small, this increase is unlikely to make the algorithm unmanageably slow.

The window size defined in the algorithm refers to the minimum allowed distance between SNPs. If there is enough coverage, then the maximum distance between SNPs will be just under twice the window size. Knowledge of this variation in SNP distance may be important in choosing a window size for the algorithm, because the average SNP distance is likely to be closer to 1.5 times the window size.
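
The spacing bounds described above can be checked on any selected panel by expressing the gaps between neighbouring SNPs in units of the window size. The helper below is a small Python sketch written for this purpose; the name `gap_summary` and the toy positions are hypothetical.

```python
from typing import Dict, List


def gap_summary(positions: List[int], window: int) -> Dict[str, float]:
    """Smallest, mean and largest gap between neighbouring SNPs, in window units."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("at least two positions are needed")
    return {
        "min": min(gaps) / window,
        "mean": sum(gaps) / len(gaps) / window,
        "max": max(gaps) / window,
    }


# Toy panel: gaps of 100, 150 and 190 with a window of 100.
summary = gap_summary([0, 100, 250, 440], window=100)
print(summary)
```

With sufficient coverage, "min" should be at least 1 and "max" should stay below 2. For comparison, the chromosome 4 run reported earlier (146 markers, 1 Mb window) gave a mean separation of about 1.35 window sizes.
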
An increase in the number of iterative divisions is likely to increase the average distance between SNPs, as it is less likely that two SNPs within one of the weight ranges will be close to each other.

4.6 Other Applications

While the algorithm was designed for the purpose of removing SNPs in order to obtain a panel of useful markers for further studies, attempts have been made to make the algorithm as generic as possible. There is potential for similar processes to be carried out in other areas: anywhere in which a well-distributed selection is required along a linear track, and there are a limited number of choices along that track.

References

Lee, S. and Kang, C. (2004). CHOISS for selection of single nucleotide polymorphism markers on interval regularity, Bioinformatics 20(4):
Liu, Z. and Lin, S. (2005). Multilocus LD measure and tagging SNP selection with generalized mutual information, Genetic Epidemiology ?(?): 1-10?
Sebastiani, P., Lazarus, R., Weiss, S. T., Kunkel, L. M., Kohane, I. S. and Ramoni, M. F. (2003). Minimal haplotype tagging, PNAS 100(17):
The International HapMap Consortium (2003). The International HapMap Project, Nature 426:


GOBII. Genomic & Open-source Breeding Informatics Initiative GOBII Genomic & Open-source Breeding Informatics Initiative My Background BS Animal Science, University of Tennessee MS Animal Breeding, University of Georgia Random regression models for longitudinal

More information

Genomes and SNPs in Malaria and Sickle Cell Anemia

Genomes and SNPs in Malaria and Sickle Cell Anemia Genomes and SNPs in Malaria and Sickle Cell Anemia Introduction to Genome Browsing with Ensembl Ensembl The vast amount of information in biological databases today demands a way of organising and accessing

More information
