SNP and destroy - a discussion of a weighted distance-based SNP selection algorithm

David A. Hall
Rodney A. Lea

November 14, 2005

Abstract

Recent developments in bioinformatics have introduced a number of methods for quickly typing a large number of Single Nucleotide Polymorphisms (SNPs) in the human genome. Due to time and cost constraints, carrying out similar typing in large populations may not be viable. In addition, because of linkage between SNPs, typing done at one location may be fully predictive of typing carried out at an adjacent SNP. A Java program (SNPBlaster) has been developed that carries out a weighted iterative SNP selection procedure, which may be useful for weeding out SNPs that are unlikely to be useful in a population SNP screen, based on frequency differences between two populations. Included in this discussion is an application of the algorithm for finding SNPs on chromosome 4, approximately 1MB apart, that are likely to be informative for ancestry. While the program is useful in its current form, a few cautions related to its prototype / test nature should be taken into account.

1 Introduction

An increasingly large number of Single Nucleotide Polymorphisms (SNPs) are being found in the human genome, with the help of large-scale genotyping projects such as HapMap (The International HapMap Consortium, 2003). However, due to cost constraints, most studies will need to select a much smaller number of SNPs, preferably ones with a high information content.

A Java application, SNPBlaster, has been designed that attempts to remove SNPs from a large set, retaining only those SNPs that would be useful or informative. The program combines positional information with an arbitrary (user-defined) information measure, referred to in this paper as a weighting factor. It is assumed that this measure has already been calculated at each SNP; the program uses that measure to determine, within a small region of the chromosome (a window), which SNPs should be removed. This paper gives an example of the use of SNPBlaster using population differences as the measure, with a window size of 1MB.

2 Algorithm

2.1 Summary of the algorithm

Figure 1: A hypothetical situation in which a number of equally weighted markers (a to o) are given to the SNPBlaster program to thin out. The SNPBlaster program will remove markers b, c, e, g, i, j, k, l and o with the given window size.

The algorithm begins with a group of SNPs with a high weighting factor (0.9-1.0), removing those that are closer than a certain distance (the window size) to other SNPs. If there is more than one SNP that may be removed, the algorithm attempts to remove the SNP that will result in a better distribution of SNPs, such that the variance in distance between SNPs is low. Once this group has been dealt with, the algorithm locks the remaining SNPs so that they cannot be removed, then continues with the next group of SNPs, until all SNPs have been tested for potential removal from the marker set.

2.2 Detail

The SNPBlaster program uses an O(n) iterative algorithm to select SNPs from a list for a given window size, given each SNP's location and an associated weighting factor. The algorithm begins with an empty list of SNPs to be worked on and, at the beginning of each iteration, moves SNPs from the complete list of SNPs into the working list. Each group of SNPs consists of everything still in the complete list that has a weight factor greater than or equal to the threshold value (which starts at 1.0, and is reduced by 0.1 after each iterative step). After each step, the SNPs that remain in the working list are locked, so that they will not be removed even if another SNP that is more appropriate in terms of location is found later; the assumption is that a high weight factor is more important than the location of the SNP.

The core part of the algorithm involves a sweep through the chromosome, selecting up to four adjacent SNPs in order to decide which to remove. For the purpose of explanation, they are named p2, p1, c1 and c2 ("previous" and "current"), with the main decision being which of p1 or c1 (if any) to remove. After each decision, markers are shifted as necessary based on the decision that has been made. Looking at figure 1, it may help to note that p2, p1, c1 and c2 start out as a, b, c and d respectively.

There are a few trivial cases that are worth pointing out before discussing the more complex ones. Firstly, if p1 and c1 are further apart than the window size, and p1 and p2 are also further apart than the window size, then no SNP removal is carried out. Also, no removal is carried out if the closest markers (within a window) are locked, a situation that should only occur if the SNPs are explicitly locked by the user.

In cases where p1 is closer than the window size to either c1 or p2 (and at least one of the pair is unlocked), a SNP will be removed. If one SNP is locked, then the other will be removed (with a decision being made on the p1, c1 pair first, if possible). If c1 and p1 are closer than the window size to each other, and neither is locked, then the SNP that is removed will be the one that results in the most even spread between the remaining three (of the four) markers. In the example in figure 1, b will be the first marker removed, because the distance between a and b is less than the distance between c and d.

3 Application

The following details an example application of the SNPBlaster algorithm to SNPs recorded in the HapMap database (The International HapMap Consortium, 2003).
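As a rough illustration of the p1 / c1 decision described in section 2.2, the following Python sketch covers only that single pair decision; it is not taken from the SNPBlaster source. The function name, the use of a set of positions to represent locks, the tie-breaking rule and the measure of "evenness" (the absolute difference between the two gaps left among the three remaining markers, which orders the choices the same way as the variance of those two gaps) are all assumptions made for the example. The p1 / p2 check and the shifting of markers after each decision are left out.

```python
def choose_removal(p2, p1, c1, c2, window, locked=frozenset()):
    """Return the position (p1 or c1) to remove, or None to keep both.

    p2 < p1 < c1 < c2 are base-pair positions of four adjacent markers.
    `locked` is a set of positions that must not be removed; this is a
    hypothetical representation, not the one used by SNPBlaster.
    """
    if c1 - p1 >= window:
        return None  # p1 and c1 are not closer than the window size

    p1_locked = p1 in locked
    c1_locked = c1 in locked
    if p1_locked and c1_locked:
        return None  # both locked: nothing can be removed
    if p1_locked:
        return c1    # one locked: remove the other
    if c1_locked:
        return p1

    # Neither is locked: remove the marker whose absence leaves the two
    # remaining gaps most nearly equal (assumed evenness measure).
    uneven_without_p1 = abs((c1 - p2) - (c2 - c1))
    uneven_without_c1 = abs((p1 - p2) - (c2 - p1))
    # Ties are resolved by removing p1 (an arbitrary choice for this sketch).
    return p1 if uneven_without_p1 <= uneven_without_c1 else c1


if __name__ == "__main__":
    # Made-up positions: removing 10 leaves gaps of 15 and 25, while
    # removing 15 leaves gaps of 10 and 30, so 10 is removed.
    print(choose_removal(0, 10, 15, 40, window=20))
```

In a full sweep this decision would be applied repeatedly while moving along the chromosome, once per weight group, with the survivors of each group added to the locked set before the next group is processed.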

3.1 Preparation

All available non-redundant SNPs for chromosome 4 as at 15 August 2005 were loaded into a database. The difference in genotype frequency for the non-reference ("rare") alleles between the CEU population (Utah residents with ancestry from northern and western Europe) and the CHB population (Han Chinese in Beijing, China) was used to weight the usefulness of the alleles: a value of 1 indicated a 100% difference in allele frequencies, while 0 indicated no difference in allele frequencies (a more rigorous study may take the minimum difference for all population combinations). The rs#, chromosome position and frequency difference for each SNP were exported into a text file, and the SNPBlaster header (with a chromosome length of 200000000 base pairs) was added to prepare the file for the program.

3.2 Running

Figure 2: A plot of the SNPs on chromosome 4 chosen by SNPBlaster that appear to be informative for ancestry. The measure (given on the y axis) is the allele frequency difference between two populations, labelled CEU and CHB in the HapMap project.

After the input file was prepared, the program was run, loading the prepared input file and setting a window size of 1MB. From an input file of approximately 52000 markers, an output file containing 146 markers was generated, giving an average SNP separation distance of about 1.35MB. A graphical representation of the SNPs that were selected is shown in figure 2.

3.3 Speed

Loading all the HapMap information for chromosome 4 into the database took by far the longest time. The process took approximately 17 minutes, but it may be possible to reduce this to around 2-5 minutes depending on the database format and the program used to import the data. In comparison, the SNPBlaster algorithm was significantly faster, typically taking 2-5 seconds for the loading process, and a similar time for the iterative algorithm. This suggests that the algorithm is reasonably fast, and is likely to be very useful for carrying out SNP selection tasks in the future.

4 Discussion

4.1 Distance-based information

[Something about crossover etc, centimorgans]

One paper has decided that a crossover frequency of 1% is sufficient for detecting recent population structure. This relates to a base pair distance of approximately one megabase [is it reasonable to say that we can treat the mutation rate as effectively zero? does the mutation rate matter? does it remove information at a SNP, or add information to a SNP?].

Regardless of the approach, it is reasonable to assume that within a certain distance, two SNP mutations are likely to have a high degree of linkage: a specific mutation in one SNP will almost always correspond to a specific mutation in the other SNP. For this reason, it is not useful to record mutations at both of those SNPs in a study. This minimum allowed distance between SNPs will be referred to as the window size.

The more SNPs that are typed, the greater the cost per individual, meaning that fewer individuals can be typed for the same amount of money. With a large window size, the cost per individual will be low, but it is also likely that there will be a loss of information, because some of the variation will be missed by not typing enough SNPs.

4.2 Measure-based information

Another approach to determining the information derived from specific SNPs is to apply some function to each SNP that gives an idea of the information content of that SNP. This function, regardless of its method, will typically identify the SNPs that will be the most useful in any investigation. It would be expected that such a function would also be able to choose which of two SNPs would be more appropriate for an investigation.

4.2.1 Population differences

One way to estimate the information content with respect to the ancestry of an individual is to determine differences between populations at specific SNPs. An example of this (the example used in this paper) is one that compares allele frequency differences of a specific mutation of a SNP. The reasoning behind this is as follows: if a certain mutation is always present in one population (or more correctly a small subset of that population), but never present in another population, then that SNP will be informative in determining the proportion of ancestry [or something similar] that an unknown individual has relative to those population groups. In this sense, a SNP with a high frequency difference between populations will be considered useful, and one that has a low difference between populations will not be useful.

Other methods for determining the information content of SNPs are available. A reader who is interested in these may like to read [cite some relevant papers].

4.3 Binocular vision

There are issues involved with choosing only one of these two information procedures in SNP selection. Working purely on distance-based information is likely to mean that some of the SNPs that are chosen will not be informative enough for the investigation, even if a more appropriate SNP is available nearby. Working purely on measure-based information may mean that some variation will be missed, and there may be a lot of unnecessary content.
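The frequency-difference measure of section 4.2.1 can be written down in a few lines. The Python sketch below is illustrative only: the function names are hypothetical, the frequencies in the example are made up rather than taken from HapMap, and the second function implements the "minimum difference for all population combinations" variant that section 3.1 mentions as a possible refinement rather than anything the study itself used.

```python
from itertools import combinations


def frequency_difference(freq_a, freq_b):
    """Absolute allele frequency difference between two populations.

    Returns a weight in [0, 1]: 1 means the allele is fixed in one
    population and absent in the other, 0 means no difference.
    """
    for freq in (freq_a, freq_b):
        if not 0.0 <= freq <= 1.0:
            raise ValueError("allele frequencies must lie between 0 and 1")
    return abs(freq_a - freq_b)


def min_pairwise_difference(freqs):
    """Smallest pairwise difference across all population combinations.

    `freqs` maps a population label to the allele frequency observed in
    that population (hypothetical input format for this sketch).
    """
    if len(freqs) < 2:
        raise ValueError("at least two populations are required")
    return min(
        frequency_difference(a, b)
        for a, b in combinations(freqs.values(), 2)
    )


if __name__ == "__main__":
    # Made-up frequencies for one SNP in two populations.
    print(frequency_difference(0.95, 0.05))
    # Made-up frequencies for one SNP in three populations.
    print(min_pairwise_difference({"pop1": 0.9, "pop2": 0.1, "pop3": 0.5}))
```

Taking the minimum over all pairs is conservative: a SNP only receives a high weight if it separates every pair of populations, not just the two that differ most.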
To further explain this, if many highly informative SNPs were in a single area of the chromosome (none were further apart from each other than the window size), then all the SNPs would be expected to carry linked mutations: any one of them could be used to infer the mutations at the other locations. In addition to this, some parts of the chromosome may be missed out, because those parts only have SNPs with a very low measure-based information content.

It is probably worth noting that some SNP information measures will also take into account local distance information. However, the algorithms used to generate these measures may be very processor intensive, because they require recalculation after each SNP removal.

The SNPBlaster algorithm gets around the issue of complexity due to recalculation by grouping SNPs of the same measure together, and from then on using distance methods to select SNPs. While the measure is the key factor in determining which SNP is chosen locally, the window size is typically more important at the whole-chromosome level.

4.4 Similar programs

Another application, CHOISS (Lee and Kang, 2004), is currently available to carry out a similar SNP selection process, choosing SNPs (either by stating an interval, or by stating the required number of SNPs) to minimise variance. When the web-based version of CHOISS was tried with a data set derived from chromosome 4 (approximately 50,000 markers), the web interface timed out at fifteen minutes. The algorithmic complexity of CHOISS is reported to be O(n^2), while the complexity of SNPBlaster is O(n). For small to medium numbers of SNPs (possibly up to around 5000), this may not make a significant difference, but above that level, the solution time for SNPBlaster is likely to be significantly less than that of CHOISS. However, there may be situations in which CHOISS is more appropriate, most likely those where a rough guess at the best SNPs is not good enough.

After downloading CHOISS, it was noticed that the algorithm did in fact run in a reasonably short period of time (2 minutes, compared with around 5 seconds for SNPBlaster). However, the output was not what was expected.
When SNPs in the input file were in a random order, the CHOISS algorithm selected approximately 2000 SNPs, with a reported average distance of 1MB for a chromosome of total length around 200MB. When the SNPs were sorted, the algorithm did not select any SNPs. It is likely that this behaviour had more to do with range overflows (i.e. numbers being too large) than with problems in the algorithm itself, but it will be difficult to work out for sure without a more thorough analysis of the program. [I should probably contact the authors then]

4.5 Caveats

While the algorithm as described is useful for quick large-scale selection of SNPs, there are a number of factors that may make it less reliable than would be expected.

At the moment, the program will only work for SNPs on a single chromosome. If more chromosomes are desired (as would be expected for a full genome selection process), then it is necessary to work on each chromosome individually.

The algorithm is iterative, but does not have soft divisions between different weight factors. This means that SNPs with a weighting at the low end of a grouping (e.g. 0.91) are considered to be just as important as SNPs at the high end of that grouping (e.g. 0.99). In some cases, this may be alleviated by increasing the number of iterative divisions in the algorithm, but this would remove the linking between similar weights. An alternative procedure would attempt to determine a relationship between the distance between markers and the weighting factor, although it is likely that such a procedure could only be applied to specific types of studies. Such an approach will certainly increase the amount of processing required (as all SNPs within a window will need to be tested, rather than just the closest), but as long as the window size is small, this increase is unlikely to make the algorithm unmanageably slow.

The window size defined in the algorithm refers to the minimum allowed distance between SNPs. If there is enough coverage, then the maximum distance between SNPs will be just under twice the window size. Knowledge of this variation in SNP distance may be important in choosing a window size for the algorithm, because the average SNP distance is likely to be closer to 1.5 times the window size.
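The relationship between window size and spacing described above can be turned into a rough planning calculation. The Python sketch below assumes full coverage, so that neighbouring selected SNPs are at least one window and at most two windows apart, and it ignores edge effects at the chromosome ends; the function names are hypothetical and the result is a back-of-the-envelope bound, not a prediction of what SNPBlaster will output.

```python
def marker_count_bounds(chromosome_length, window):
    """Rough lower and upper bounds on the number of selected markers.

    Assumes every gap between neighbouring selected SNPs lies between
    `window` and 2 * `window` base pairs (full coverage, no edge effects).
    """
    if chromosome_length <= 0 or window <= 0:
        raise ValueError("lengths must be positive")
    fewest = chromosome_length / (2 * window)  # every gap near 2 windows
    most = chromosome_length / window          # every gap near 1 window
    return fewest, most


def average_separation(chromosome_length, marker_count):
    """Average spacing if the markers were spread over the full length."""
    if marker_count <= 0:
        raise ValueError("marker_count must be positive")
    return chromosome_length / marker_count


if __name__ == "__main__":
    # Values from the chromosome 4 example: a 200000000 bp header length,
    # a 1MB window and 146 selected markers.
    print(marker_count_bounds(200_000_000, 1_000_000))
    print(average_separation(200_000_000, 146))
```

With the 200000000 bp header length and the 1MB window from section 3, the bounds come out at 100 and 200 markers, and the 146 markers reported there fall inside that range. Dividing the header length by 146 gives roughly 1.37MB, slightly above the reported 1.35MB, which is expected if the selected SNPs span somewhat less than the full header length.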
An increase in the number of iterative divisions is likely to increase the average distance between SNPs, as it is less likely that two SNPs within one of the weight ranges will be close to each other.

4.6 Other Applications

While the algorithm was designed for the purpose of removing SNPs in order to obtain a panel of useful markers for further studies, attempts have been made to make the algorithm as generic as possible. There is potential for similar processes to be carried out in other areas: anywhere that a well-distributed selection is required along a linear track, and there are a limited number of choices along that track.

References

Lee, S. and Kang, C. (2004). CHOISS for selection of single nucleotide polymorphism markers on interval regularity, Bioinformatics 20(4): 581-582.

Liu, Z. and Lin, S. (2005). Multilocus LD measure and tagging SNP selection with generalized mutual information, Genetic Epidemiology ?(?): 1-10?

Sebastiani, P., Lazarus, R., Weiss, S. T., Kunkel, L. M., Kohane, I. S. and Ramoni, M. F. (2003). Minimal haplotype tagging, PNAS 100(17): 9900-9905.

The International HapMap Consortium (2003). The International HapMap Project, Nature 426: 789-796.