Analysis of DNA sequence data using MEGA and DnaSP

Analysis of two genes from the X and Y chromosomes of plant species from the genus Silene

The first two computer classes will be a set of exercises that you can work through at your own speed, getting as far as seems reasonable. All the software is freely available (web sites are given below) and you will get copies of the sequence alignment so that you can use it as an example file and work on it later, if needed. The questions are intended to show you how to use the software for analysis of diversity within species and divergence between them, and to focus on some of the concepts covered in the lectures.

We shall analyse some data on sequences of two genes from some closely related plants: the white campion, Silene latifolia, the closely related S. dioica, and the more distant relatives S. vulgaris and S. conica. S. latifolia and S. dioica have separate males and females, with an X/Y male sex determining system, while S. vulgaris and S. conica are hermaphroditic species, and this is presumably the ancestral state.

Several genes have been found on the Y chromosomes of S. latifolia. In such cases, it is interesting to test whether the Y-linked gene is evolving as expected for a functional gene, or is showing signs of losing function. One kind of test that can be done is based on analyses of the sequences. If the Y-linked gene has remained functional, its sequence should diverge from other sequences of homologous genes more slowly for non-synonymous (amino acid replacement) sites than for synonymous sites. In other words, the KA value should be lower than KS (the KA/KS ratio < 1). If the Y-linked copy is losing function (or is non-functional), its sequence should accumulate non-synonymous changes more often than the X-linked copy.
Detailed instructions and description of the data files

You will be given a file with 144 sequences of the gene (the sequences are in FASTA format, which can be read by many software packages, including both MEGA and DnaSP). These two programs are very useful for preliminary analysis of sequence data and both are free (see web sites below). The FASTA files were made by aligning the sequences using SeAl2 (a good program for adjusting alignments by hand, available from its web site), then exporting in FASTA format, opening them in BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html) and saving
again as a new FASTA file that other programs are happy with. Because MEGA gives an annoying error when one tries to save the file opened from the FASTA file from either of these programs, I have made a separate file in MEGA format. The file names are:

SileneXYgene4BioEditdata.fas
SileneXYgene4data.meg

You can get the files from the web site (or from a CD which will be provided). Copy the files into the Workspace folder (and give back the CD, if you used that). NOTE that the files will not remain after you log off.

The files contain sets of genomic DNA sequences from four species: two sets (X and Y) from each of S. latifolia and S. dioica, and one sequence from each of the two hermaphrodite species, which are used purely as outgroups.

Analyses using MEGA 3.1 (Molecular Evolutionary Genetics Analysis)

You can get the software for yourself (free) from the MEGA web site.

(1) Find the MEGA software (under the Programs menus) and start it up. In the Start menu, select the following sequence of folders: All programs School Applications Science & Engineering Biological Science Summer School MEGA3.1

(2) From MEGA, open the file for gene 4, SileneXYgene4data.meg (use "activate a data file"). It will ask if it is nucleotide sequence, so click OK. Also click OK to the question about whether it is coding sequence. The text file editor opens, showing your sequence data.

(3) Look at the sequences, using the Display option. You will see the sequence names listed in the left-hand column. The file contains the following sequences (a total of 144):

First are sequences from two dioecious species (X and Y sequences are included from both):
Silene latifolia X (40 sequences, from various populations, labeled X4) and Y (45 sequences, from the same populations, labeled Y4)
Silene dioica X and Y (30 and 27 sequences, respectively, labeled X4D and Y4D)

Then sequences from the two hermaphrodite species (only one sequence, non-sex-linked, per species):
Silene conica
Silene vulgaris

The sequences are partial. They do not include the complete coding sequence, so the sequence does not start with the start codon, but the coding sequence does end with the stop codon (the last part is 3' non-coding sequence). The total length is 1543 nucleotides.

(4) At this point, the program does not know where the coding sequence regions are, since the reading frame has not been specified. Gene 4 has few introns, so it is simple to analyse. In this case, the coding sequence starts with the first nucleotide of the first codon, but the first and last parts of the sequence are non-coding. To give the program this information, you need to use the first and last nucleotide positions of each non-coding region from the following table, which also gives, for each exon, the position in its codon of the first nucleotide. MEGA has a menu to assign domains as coding or non-coding, and also to select the correct number for the position in its codon of the first nucleotide in the sequence (from the table below). Note that the stop codon starts after position 1353 and should be treated as non-coding; if you include it in the coding sequence, some analyses will give an error message saying that a stop codon is found in the coding sequence. The menu is "select & edit genes/domains" (3rd tab from the left). Continue to enter the relevant data for each coding and non-coding sequence. Then close the window.

Gene 4 information

Exon    Exon positions    Intron or non-coding positions    First nucleotide position in codon

It is often a good idea to draw a rough picture of the gene, showing the region sequenced, the length of the region, and the intron and exon positions.

Now notice the labelling along the top of the window. The codons are shown.
Select the menu to translate the sequences, and check that the amino acids look OK and that there are no stop codons (indicated by *; the actual stop codon starts after position 1353 in the alignment). If you made a mistake, it is easy to correct: just open the window again and change the positions, but NOTE that a position cannot be named as part of 2 different domains, so you have to work around this limitation.

Now look at the nucleotide sequence again. Note the numerous indels (gaps). What regions of the sequence are they in?

(5) To identify sets of sequences, you can make groups using the Edit/Select taxa and
groups menu (the 2nd icon from the left in the row of icons at the top of the screen). You will see the S. latifolia X-linked sequences first, then the Y-linked set, then the X- and Y-linked sets for S. dioica. You can make named sets, which appear in the left-hand window, and transfer sequences into these from the right-hand window, using the small arrow. Make the 4 sets for these sequences. There is no need to make sets for the single S. vulgaris and S. conica sequences. Again, if you made a mistake, it is easy to correct it. When you return to the sequence viewer, you will see that the names of the groups of sequences are displayed in the left-hand column.

You can save the file by selecting Write data to file and giving it a new name (e.g. adding your initials or something to indicate that the new version contains the intron-exon and species data).

(6) You can select options to have the program mark with a colour all sites of some particular type, e.g. variable sites, and to output a table showing them, which can be imported into Excel to make a figure. You must of course select the sequences to include, otherwise it will include all of them, and then the variable sites will include both (i) polymorphic sites within either of the species, including X-Y fixed differences, and (ii) differences between the species.

You can select the sequences to include with the Edit/Select taxa and groups menu. Click on the box by any set you want to omit, and the tick mark should go away. This sequence (or set of sequences) will no longer be included in analyses. To reverse this decision, click to get the tick back again.

To make a list of all polymorphic sites within S. latifolia (including X-Y differences), remove all other sequences from the analysis, using the function just described. Then select the function to mark variable sites. NOTE the count of the number of such sites in the bar at the bottom of the data viewer screen.
If it looks OK, you can ask it to Write data to file, choosing the option marked sites only.

(7) Use the Construct Phylogeny function in the main MEGA window to make a tree using the sequences. You are given various options for the type of tree (we can use Neighbour-Joining, or NJ), the site type you want to analyse, the statistics you are interested in, and the region of the sequence you want to consider. It can use all sites, just synonymous sites, etc., and there are several options for evolutionary models, depending on whether the sites are coding or not, and also whether to use the Jukes-Cantor correction, a correction for saturation of the sites that occurs when the sequences are highly divergent. There is an option for whether to display the bootstrap values on the tree figure (choose 1000 bootstraps).

If you see only S. latifolia in your tree, you probably failed to reverse the decision to restrict analyses to just this species (see 6 above); you can do this and re-do the tree. If you did it correctly, it labels the sequences with their names and also shows the group names.

If you want to save the tree, select Copy to clipboard under the Image menu. You can then paste it into PowerPoint, and add text to record the details of what analysis you actually did. You might try a different analysis to see what difference it makes.
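The Jukes-Cantor correction mentioned in (7) can be illustrated with a short calculation. What follows is a minimal Python sketch, not part of MEGA; the function name jukes_cantor is our own. It converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3), under the model's assumption that all nucleotide changes are equally likely.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

# Compare uncorrected and corrected values for a few illustrative p values.
for p in (0.01, 0.05, 0.10, 0.30):
    print(f"p = {p:.2f}  JC distance = {jukes_cantor(p):.4f}")
```

For small p the corrected and uncorrected values are nearly identical, so the correction matters mainly for the more divergent comparisons.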
What do the results tell us? Here are some things to look for.

(i) Is either gene a pseudogene?

(ii) Are Y sequences less variable than X sequences? What do the trees suggest? To find numbers of polymorphisms in the two data sets, go back to (6) above and compare the numbers of X and Y polymorphisms in S. latifolia, and the number of fixed X-Y differences.

Table 1. Numbers of differences between X and Y sequences of S. latifolia, and numbers of variants in each of them.

Chromosome         Number of sequences    Number of variants
X polymorphisms
Y polymorphisms
X-Y fixed

These results and conclusions are helpful, but are only a preliminary analysis, and ideally we want to quantify variability and test the significance of any possible difference. We will see how to do this with DnaSP.

(iii) Do the gene trees of these X- and Y-linked genes agree with the species tree of the species that have sex chromosomes? How do you interpret what you see?
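The counts asked for in Table 1 can be cross-checked outside MEGA. The following Python sketch is only an illustration under stated assumptions: the function names are our own, the sequences are made-up toy data (not the Silene alignment), gaps and ambiguous characters are simply ignored, and a "fixed difference" is taken to be a site that is monomorphic within each group but differs between the groups.

```python
MISSING = set("-N?")  # characters ignored when collecting alleles (an assumption)

def alleles(seqs, col):
    """Set of nucleotides seen in one alignment column, ignoring gaps/missing data."""
    return {s[col] for s in seqs} - MISSING

def variable_sites(seqs):
    """Columns with more than one nucleotide among the sequences."""
    return {c for c in range(len(seqs[0])) if len(alleles(seqs, c)) > 1}

def fixed_differences(group1, group2):
    """Columns monomorphic within each group but different between the groups."""
    fixed = set()
    for c in range(len(group1[0])):
        a1, a2 = alleles(group1, c), alleles(group2, c)
        if len(a1) == 1 and len(a2) == 1 and a1 != a2:
            fixed.add(c)
    return fixed

# Toy aligned sequences, invented for illustration only.
x_seqs = ["ACGTA", "ACGTT", "ACCTA"]
y_seqs = ["TCGTA", "TCGTA"]

print("X polymorphic sites:", sorted(variable_sites(x_seqs)))
print("Y polymorphic sites:", sorted(variable_sites(y_seqs)))
print("X-Y fixed differences:", sorted(fixed_differences(x_seqs, y_seqs)))
print("Variable sites, X and Y combined:", len(variable_sites(x_seqs + y_seqs)))
```

When no site is polymorphic in both groups, the combined count equals the sum of the three separate counts, which is the relationship used to obtain the number of fixed differences by subtraction.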
Analyses using DnaSP

You can get the software for yourself (free) from the DnaSP web site.

(1) Find the DnaSP 4 software and start it up.

(2) From DnaSP, open the file for gene 4: SileneXYgene4BioEditdata.fas (it allows only one file to be open at a time).

(3) Look at the sequences, using the Display option. You will see the same sequences as with MEGA (if you did not do the MEGA exercise, look at part (3) above for an explanation).

(4) To tell the program where the coding sequence regions are, choose Assign coding regions from the Data menu and assign the regions (see table above), following the instructions in the dialog box. Now notice the labelling along the top of the window: the codons and their amino acids should be shown (when introns are present, these sites are labelled N for non-coding). Also look at the alignment.

To save the file after adding this information, use the File menu to save the file under a new name in Nexus format; it will appear as FileName.nex. Choosing this format ensures that these details remain available next time DnaSP opens the file, so you don't have to do all this work over again.

(5) We will first estimate diversity, to test whether the impression of different values for the X and Y genes is correct.

As with MEGA, you must first name the sequence sets, to identify them for analyses. In the Data menu, select the Define sequence sets option. You will see the 144 sequences listed in the left-hand window. First select the S. latifolia Y-linked sequences and define a set for your analyses, and name it. Then define a set of the X-linked sequences from the same species. Then make a third set with just the S. vulgaris sequence (which will be needed later), and also two more sets, one with a single S. latifolia Y sequence and one with a single X sequence.

Use the Polymorphism and divergence function to analyse the diversity.
You are given various options for the site type to analyse, the statistics you are interested in, and the region of the sequence you want to consider. You must of course select the sequences to include. Click on the Data set option. You will see the sets of sequences with the names you gave them. First choose to estimate diversity for S. latifolia X-linked alleles, then Y-linked. If you use the default option to include divergence, the analysis will use only the regions of sequence that are present in the sequence chosen for divergence; thus, if you include divergence between the S. latifolia Y- and X- linked pair, you will have a fair comparison of diversity of each of the two sets. Now click OK. The program will calculate divergence values for different types of sites,
such as synonymous and non-synonymous sites, or intron sites. As you get the results, enter them in the table below. You will find it helpful to use the Pi(a)/Pi(s) ratio option. This gives you all the different types of sites in one output screen. Enter the results in Table 2 (if time is running short, there is no need to complete everything; the point is to understand what the items are, and you can return to the exercise later if you want). The numbers of polymorphisms at synonymous and non-synonymous sites are also given. Non-coding sites include intron and other non-coding sites.

Table 2. Variability within S. latifolia

Type of site        Number of sites    Numbers of variable sites    Within-species diversity (π values)
X-linked
  All site types                                                    π =
  Synonymous                                                        πS =
  Non-synonymous                                                    πA =
  Non-coding                                                        πNon-coding =
  Silent                                                            πSilent =
Y-linked
  All site types
  Synonymous
  Non-synonymous
  Non-coding
  Silent

(i) Does diversity appear to differ between X- and Y-linked samples?

(6) Next, analyse divergence between the sets of sequences: S. latifolia X- and Y-linked versus the hermaphrodite species S. vulgaris, using the groups you defined earlier. Use the analysis you already used, plus DNA divergence between populations. Enter the results in Table 3.

Be sure that you understand how the numbers of synonymous and non-synonymous sites are calculated, and why these are not whole numbers (a brief outline of one method is in the Selection Basics lecture, and Graur and Li's book has an outline on page 81 onwards). Briefly, the reason is that only fourfold degenerate sites have only synonymous changes and only non-degenerate sites have only non-synonymous changes; some changes at twofold degenerate sites are synonymous and some are non-synonymous, and this has to be taken into account. Thus it is not a simple matter of counting sites, but the numbers are estimated and a model is involved, e.g.
some methods assume that any change from one nucleotide to another is equally likely (which is untrue: transition rates in sequences are generally higher than transversion rates).
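To see why the numbers of synonymous and non-synonymous sites are not whole numbers, it helps to work through one codon at a time. The Python sketch below is a simplified illustration in the spirit of the equal-rates counting approach described above, not the exact procedure used by DnaSP or MEGA. Its assumptions are ours: the standard genetic code, all three possible changes at each position treated as equally likely, and changes to stop codons counted as non-synonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_sites(codon):
    """Expected number of synonymous sites (0 to 3) in one codon."""
    total = 0.0
    for pos in range(3):
        # The three codons reachable by a single change at this position.
        mutants = [codon[:pos] + b + codon[pos + 1:] for b in BASES if b != codon[pos]]
        total += sum(CODE[m] == CODE[codon] for m in mutants) / 3
    return total

for codon in ("TTA", "ATG", "GGG", "TTT"):
    s = synonymous_sites(codon)
    print(f"{codon} ({CODE[codon]}): synonymous = {s:.3f}, non-synonymous = {3 - s:.3f}")
```

A fourfold degenerate third position contributes one whole synonymous site, a twofold degenerate position contributes one third, and summing such fractions over all codons gives the non-integer site totals reported by the programs.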
Table 3. Divergence between S. latifolia and S. vulgaris, and between S. latifolia Y and autosomal sequences.

Type of site and comparison          Divergence (KA, KS, etc.)    Numbers of sites analysed    Ka/Ks
S. latifolia X-linked versus Y-linked
  All site types
  Synonymous, KS
  Non-synonymous, KA
  Non-coding, KNon-coding
  Silent, KSilent
X-linked versus S. vulgaris
  All site types
  Synonymous, KS
  Non-synonymous, KA
  Non-coding, KNon-coding
  Silent, KSilent
Y-linked versus S. vulgaris
  All site types
  Synonymous, KS
  Non-synonymous, KA
  Non-coding, KNon-coding
  Silent, KSilent

Some further questions.

(ii) Is the Y-linked gene degenerated?

(iii) Has the Y-linked gene undergone more non-synonymous substitutions than the X? To test this, use estimated numbers of differences at synonymous and non-synonymous sites. You can use the 'Preferred and Unpreferred Synonymous Substitutions' function (also in the Analysis menu) to determine the numbers of substitutions in the S. latifolia X and Y sequences since they started to diverge from the outgroup (S. vulgaris). The analysis window has a diagram that will help you to understand the idea (use the analysis with a near and far outgroup). Enter the results in Table 4.

Table 4. Non-synonymous and synonymous substitutions in the S. latifolia X and Y sequences, using an outgroup (S. vulgaris).

      Non-synonymous    Synonymous
Y
X
McDonald-Kreitman tests

This is a very simple, but important, test that has a good chance of correctly detecting selection even when the population is subdivided (see Graur and Li pages 63-64). There is an item for it under the Analysis menu. This is not the most appropriate data set for applying this test, but I added it because of its importance. There are 2 interesting questions we could use it for.

(a) We might wonder whether the higher diversity of the X-linked locus (see above) could be due to balancing selection at this gene, and this test is one way to examine this. If so, we expect an excess of non-synonymous polymorphisms within S. latifolia against the number expected based on divergence from an outgroup (we can use S. vulgaris). The test uses synonymous or silent sites to take account of possible mutation rate differences (see Answers section).

(b) We could also use this test to see whether it is likely that the Y has undergone an excess of non-synonymous substitutions since the split from S. vulgaris, which might suggest that slightly deleterious mutations are accumulating, due to the low effective size of the Y-linked gene (see Brian Charlesworth's lecture).

HKA tests

(iv) Is the X-Y diversity difference statistically significant? We can do an HKA test in DnaSP, using the results in Tables 2 and 3. The analysis is in the Tools menu (don't use the one in the Analysis menu; it is for comparing 2 parts of a single gene). How about silent sites? Here is a table to enter the results needed to do the HKA test:

                                                All sites        Silent sites
                                                X       Y        X       Y
Intra-specific variability
  Number of segregating sites
  Total number of sites
  Sample size
Inter-species divergence from S. vulgaris
  Average number of differences (DXY)
  Total number of sites

(v) What might explain the low diversity of the Y-linked gene?
(7) Raw versus net divergence. Because there is variation among the sequences in a group, the mean divergence includes two components: (I) the differences between the sequences of the two species, and (II) differences between the S. latifolia sequences. If these latter differences were very high, we would not want to include them in estimates of divergence (since we are interested in the extent of substitutions between the species, or fixed differences). It is therefore reasonable to subtract the mean within-species diversity from the raw divergence, DXY, to get the net divergence

$D_a = K - (k_{\mathrm{species1}} + k_{\mathrm{species2}})/2$

where K is the raw divergence (DXY) and k is the within-species diversity of each species.

To illustrate the difference between the two measures, do a divergence analysis between S. latifolia and S. dioica X- and Y-linked sequences, using the analysis DNA divergence between populations. Enter the results in Table 5.

Table 5. Divergence and net divergence from S. latifolia for all site types.

                    versus S. dioica        versus S. vulgaris
X-linked
  DXY                        (JC=      )             (JC=      )
  Da                         (JC=      )             (JC=      )
Y-linked
  DXY                        (JC=      )             (JC=      )
  Da                         (JC=      )             (JC=      )

The output shows the divergence values and also values with the Jukes-Cantor correction (labelled JC). You might want to write down both versions and consider whether this correction is required for these species.

NOTE that this analysis does not calculate separate divergence values for synonymous and non-synonymous sites, but just deals with all sites. To calculate synonymous and non-synonymous divergence, you have to use the Polymorphism and divergence analysis, which does give you that option (but that analysis doesn't give net divergence; how could you estimate those values?).

A further analysis would be to compare Fst values between the two closely related dioecious species with values between populations within either species.
To do this, you would need to define more sequence sets, using the same menu as before (the sequence names indicate the populations). The analysis is Gene flow and genetic differentiation. To get statistical tests of subdivision, choose the option "Perform the Permutation test" in the dialog box.
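The raw and net divergence calculation described in (7) can be sketched in a few lines. This Python example is a hedged illustration, not DnaSP's implementation: the function names are our own, the sequences are invented toy data, distances are uncorrected proportions of differing sites (no Jukes-Cantor correction), and alignment columns with a gap in either sequence of a pair are skipped.

```python
from itertools import combinations, product

def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def mean_within(group):
    """Mean pairwise distance among sequences of one group (within-species diversity)."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0  # a single sequence carries no information on diversity
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

def mean_between(group1, group2):
    """Mean distance over all between-group pairs (raw divergence, DXY)."""
    pairs = list(product(group1, group2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

def net_divergence(group1, group2):
    """Raw divergence minus the mean of the two within-group diversities (Da)."""
    return mean_between(group1, group2) - (mean_within(group1) + mean_within(group2)) / 2

# Toy aligned sequences, invented for illustration only.
species1 = ["AAAA", "AAAT"]
species2 = ["TTAA", "TTAA"]

print("DXY =", mean_between(species1, species2))
print("Da  =", net_divergence(species1, species2))
```

The difference between the two printed values is the average within-species diversity, so the two measures diverge most when one of the samples is highly polymorphic.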
(vi) Test for recombination in the X-linked genes: is there evidence for recombination?

(vii) Another X-Y gene pair was studied, and the X-Y divergence value was 2%. What could account for the different values for the two genes?

If you have time, you can try other analyses. For instance, you can compare Fst values between the two closely related dioecious species with values between populations within either species.
APPENDIX: Some tips for using MEGA and DnaSP

MEGA should open FASTA files. This is how to do it. Before this, check that each sequence has a different name, and that names are not too long. Another tip is that DnaSP doesn't accept alignments where the first sequence starts with gaps.

(1) Open MEGA.

(2) From the File menu, choose Open and select the file you need. It generally says "error # 5520 line too long on line 1". Ignore this (click OK). Then select the funny little icon (the one to the right of the print icon) with the arrow pointing downwards; this converts to MEGA format (to see text indicating the icons' meanings, move the cursor over the icon and, at the right position, text will appear). A dialog box appears. Select OK (don't try to select data format). The text file then appears in the window in MEGA format. Save it under a new name, so you can tell which version is in MEGA format (ignore the "save as type" selecting window). Close all windows in the text editor. This often crashes MEGA; if so, press the Control, Alt and Delete keys simultaneously to get the Task Manager window, which allows you to quit from MEGA. It will have saved the file OK and you can then re-start the program and open the file to carry on with analyses as below.

(3) Click the text to activate a data file, and select the file you just made. It will ask if it is nucleotide sequence, so click OK, as it normally will be that kind of sequence. Also click OK to the question about whether it is coding sequence. The text file editor opens again, showing your sequence data. Again, it often says error # line XX and the cursor moves to the offending line.

a. Sometimes this means that it thinks 2 different sequences have the same name (because the software doesn't seem to take in the complete names you gave your sequences).
You can edit the name where it stopped, or the one with the duplicate name (often the sequence before the one where it stopped) to change it slightly, or whatever is needed to make it acceptable. Then save and close the file. Repeat the process of activating the data file.

b. Generally, there is nothing wrong with the sequence names, but if you change them slightly (e.g. add x at some position) the software is happy with this sequence next time you activate the file (but it may stop at a different one, and so one sometimes has to change lots of names, each time saving and closing the file, then re-activating it).

When it is all satisfactory to MEGA, it will ask if it is coding or not. Click Yes, unless the entire sequence is non-coding.

(4) From the Data menu, choose Data explorer. This displays the entire set of sequences, and (if cDNA sequences are included) it indicates which parts are coding and which are non-coding, by showing codons at the top of the window (indicated by a faint box around each 3 nucleotides), so you can check that all is as you expected, and you can note the position in the codon of the first base of each exon (needed in the next step).

(5) Enter the exon and intron positions.

a. This property is very helpful, so it is often best, after aligning and importing sequences, to work first in MEGA, to determine exon and intron positions, and note them down before opening data in DnaSP. Clicking on
the base in a column (position) gives its number in small characters at the bottom left of the data explorer window.

b. You can check the translation with the right-hand icon. If you need to change the reading frame, use the Data menu: Select genes and domains to tell it the first site's codon position.

(6) Set up the groups of sequences, using the Data menu. NOTE: take care that a group name is selected before trying to add sequences to a group, otherwise it crashes in a bad way.

(7) Save data to file. Use MEGA format, and give it a different name. Then, after closing the data file, the information will still be there next time you use MEGA to open it.

Now you are ready to do analyses. The menus on the small MEGA window are simple to use, and it is pretty obvious what they do. You can select which sequences to include, and you can un-select some sequences in the Data explorer window, and use the remaining ones to see variable sites, and some other things that can be useful.
ANSWERS

MEGA

Table 1. Numbers of differences between X and Y sequences of S. latifolia, and numbers of variants in each of them.

Chromosome         Number of sequences    Number of variants
X polymorphisms
Y polymorphisms    45                     5
X-Y fixed                                 101

The number 101 is calculated assuming that there are no shared variants, so that the total number of variants when both X and Y sequences are included (216) is the sum of the numbers in the first 2 rows plus the number of X-Y fixed differences. Thus I subtracted the numbers of X and Y polymorphisms from this total to get the 101. It is quite simple to check that the X and Y share no polymorphic sites (if they did, this calculation would give the wrong number of fixed differences); I just listed the 5 Y polymorphic sites, had MEGA display the X polymorphisms, and looked to see if any of those sites is among the 5.

What do the results tell us? Here are some things to look for.

(i) Is either gene a pseudogene?
Probably not: there are no stop codons in the coding sequence, and no frame-shifts.

(ii) Are Y sequences less variable than X sequences?
Yes, but these results are only a preliminary analysis, and ideally we want to quantify variability and test the significance of any possible difference. We will see how to do this with DnaSP.

(iii) Do the gene trees of these X- and Y-linked genes agree with the species tree of the species that have sex chromosomes?
No, the Y-linked sequences do, but not the X. The higher X variability suggests that diversity was high in the ancestor, and thus lineage-sorting occurred, so that different sequences are present in the X sets of both species. An alternative is that introgression has occurred, but that, for some reason, the Y does not introgress.
DnaSP

Table 2. Variability within S. latifolia

NOTE 1: I used the analysis including divergence between S. latifolia X and Y, and entered those results in Table 3; other choices will give different numbers in these tables.

NOTE 2: It is not simple to get numbers of variable sites. The numbers given in the output of this diversity analysis are numbers of polymorphisms, so a site that has 3 different nucleotides may be counted as 2 different polymorphisms (as at least 2 mutations must have occurred). This explains why numbers of variable sites differ when you extract them from other analyses. For example, the number of "substitutions" is 111 in the diversity analysis of X alleles, using all sites, but the analysis says that there are 103 polymorphic sites (so that is the number I put in this table); presumably 8 sites have > 2 different nucleotides. For the coding region sites, the two counts are the same. For non-coding regions, we should thus have 53 polymorphic sites, again suggesting 61-53=8 sites with > 2 different nucleotides (and the McDonald-Kreitman table using all silent sites, not just synonymous ones, says 55). Then for silent sites we should have 35+53=89 (against 96 "substitutions" as counted by the program).

Table 2. Variability within S. latifolia

Type of site        Number of sites          Number of variable sites    Within-species diversity (π values)
X-linked
  All site types                                                         π =
  Synonymous                                                             πS =
  Non-synonymous                                                         πA =
  Non-coding                                                             πNon-coding =
  Silent                                                                 πSilent =
Y-linked
  All site types    As above (see NOTE 1)
  Synonymous                                 0                           0
  Non-synonymous                             0                           0
  Non-coding
  Silent

NOTE 3: πA/πS can be estimated for X, but cannot be estimated for Y (no variants).
Table 3. Divergence between S. latifolia and S. vulgaris, and between S. latifolia Y and autosomal sequences. See NOTE 1 above.

Type of site and comparison          Divergence (KA or KS with JC correction)    Number of sites analysed    Ka/Ks
S. latifolia X-linked versus Y-linked
  All site types
  Synonymous
  Non-synonymous
  Non-coding
  Silent
X-linked versus S. vulgaris
  All site types
  Synonymous
  Non-synonymous
  Non-coding
  Silent
Y-linked versus S. vulgaris
  All site types
  Synonymous
  Non-synonymous
  Non-coding
  Silent

Questions.

(i) Does diversity appear to differ between X- and Y-linked samples?
Yes: X >> Y.

(ii) Is the Y-linked gene degenerated?
Not evidently: Ka << Ks in all comparisons in Table 3 above.

(iii) Has the Y-linked gene undergone more non-synonymous substitutions than the X?
No. According to the analysis using the outgroup S. vulgaris, the numbers of changes are as follows, and the difference is not significant by a 2 x 2 contingency test (see DnaSP Tools menu).
Table 4. Non-synonymous and synonymous substitutions in the S. latifolia X and Y sequences, using an outgroup (S. vulgaris).

      Non-synonymous    Synonymous
Y     6                 20
X     0                 5

McDonald-Kreitman tests

(a) Could the higher diversity of the X-linked locus (see above) be due to balancing selection at this gene? If so, we expect an excess of non-synonymous polymorphisms within S. latifolia against the number expected based on divergence from an outgroup (we can use S. vulgaris). The test uses synonymous or silent sites to take account of possible mutation rate differences. The results are as follows and the test statistic is non-significant.

X vs. S. vulgaris    Fixed    Polymorphic
Synonymous
Non-synonymous       2        15

(b) Is it likely that the Y has undergone an excess of non-synonymous substitutions since the split with S. vulgaris, which might suggest that slightly deleterious mutations are accumulating, due to the low effective size of the Y-linked gene? The results are as follows:

Y vs. S. vulgaris    Fixed    Polymorphic
Synonymous
Non-synonymous       11       0

The test cannot be done, because of the absence of polymorphisms. To try to see if there is any likelihood that the above process might be happening, we could set that value to 1 and use DnaSP's Tools menu to do a 2 x 2 contingency table test. It is non-significant, i.e. there is no evidence for an undue amount of non-synonymous substitution. NOTE that this test is a less good one than the previous one, because all substitutions are included and we cannot separate out those that occurred specifically in the Y-linked lineage.

(iv) Is the difference between X- and Y-linked samples statistically significant?
Yes, by HKA test using divergence from S. vulgaris. According to my results, the numbers are as in the table below, and both sets give significant results after selecting the appropriate type of gene (X- and Y-linked; think about why this is needed).
Columns: All sites (X, Y) | Silent sites (X, Y)

Intra-specific variability
    Number of segregating sites (NOTE 4)
    Total number of sites
    Sample size
Inter-species divergence from S. vulgaris
    Average number of differences (DXY)   (NOTE 5) (NOTE 5)
    Total number of sites   *328 = 51.51 (NOTE 6)   *678 = (NOTE 6)

NOTE 4: From the DNA diversity and divergence analysis, using S. vulgaris for divergence.
NOTE 5: I got these from the DNA Divergence between populations analysis.
NOTE 6: I multiplied divergence per silent site by the number of silent sites (data from Table 3).

(v) What might explain the low diversity of the Y-linked gene? One possibility is that degeneration is occurring, and that hitch-hiking events are reducing its diversity. Another possibility is a much lower effective size for the Y than the X, e.g. due to strong sexual selection such that there is a high variance of male reproductive success. This predicts that the diversity of autosomal genes should be reduced relative to that of X-linked genes (because X-linked genes are carried in males 1/3 of the time, versus 1/2 for autosomal genes).

(7) Raw versus net divergence.

Table 5. Divergence and net divergence from S. latifolia for all site types.

X-linked    versus S. dioica    versus S. vulgaris
DXY         (JC= )              (JC= )
Da          (JC= )              (JC= )

Y-linked    versus S. dioica    versus S. vulgaris
DXY         (JC= )              (JC= )
Da          (JC= )              (JC= )
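The two quantities in Table 5 can be illustrated with a short sketch. It assumes the usual Jukes-Cantor formula, K = -(3/4) ln(1 - 4p/3), and the usual definition of net divergence, Da = DXY minus the mean of the two within-group diversities; whether DNAsp computes its values in exactly this way is an assumption here. The function names and the example numbers are hypothetical and are not taken from the Silene data.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def net_divergence(dxy, pi_x, pi_y):
    """Net divergence Da: raw divergence minus the mean within-group diversity."""
    return dxy - (pi_x + pi_y) / 2.0


# Hypothetical per-site values, for illustration only.
dxy = 0.05    # raw divergence between two sequence sets
pi_x = 0.02   # diversity within the first set (e.g. a polymorphic X sample)
pi_y = 0.0    # diversity within the second set

print("Da =", net_divergence(dxy, pi_x, pi_y))
print("JC-corrected DXY =", jukes_cantor(dxy))
```

When within-species diversity is close to zero, Da and DXY are nearly identical, so the distinction between raw and net divergence matters mainly for samples with high diversity.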
Jukes-Cantor correction is required for the more distant species for the X data (where diversity is high within S. latifolia), but not for Y, where there are few variants within the species.

Compare FST values between the two closely related dioecious species with values between populations within either species. To do this, you would need to define more sequence sets, using the same menu as before (the sequence names indicate the populations). The analysis is "Gene Flow & Genetic Differentiation". To get statistical tests of subdivision, choose the option "Perform the Permutation Test" in the dialog box.

(vi) Test for recombination in the X-linked genes: is there evidence for recombination? The X sequences yield a minimum number of recombination events, Rm = 6 (and Y gives zero). The X data do not fit zero recombination. I did an analysis of diversity within the X sequence set, and then a coalescent simulation for haplotype diversity, given theta. The program takes the results from the diversity analysis and enters them in the relevant boxes. You can then run the simulation and see the 95% confidence intervals, which show that the observed value has a low probability if the simulation is run with zero recombination, although it is not quite significant (other variants of this analysis, e.g. using the number of segregating sites, are highly significant). The Y data set fits zero recombination better (but it also fits recombination > 0, so it is not really conclusive).

(vii) Another X-Y gene pair was studied, and the X-Y divergence value was 2%. What could account for the different values for the two genes? One possibility is that degeneration is occurring, and that hitch-hiking events are reducing its diversity. Another possibility is a much lower effective size for the Y than the X, e.g. due to strong sexual selection such that there is a high variance of male reproductive success. This predicts that the diversity of autosomal genes should be reduced relative to that of X-linked genes (because X-linked genes are carried in males 1/3 of the time, versus 1/2 for autosomal genes).
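The 2 x 2 contingency test on the Table 4 counts (Y: 6 non-synonymous, 20 synonymous; X: 0 non-synonymous, 5 synonymous) can also be checked outside DNAsp. The sketch below uses a two-sided Fisher's exact test written with the Python standard library only. This is a stand-in chosen for illustration: the test that DNAsp's Tools menu actually applies may differ (for example a chi-square or G test), and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    # A small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Table 4: rows are Y and X, columns are non-synonymous and synonymous.
p_value = fisher_exact_two_sided(6, 20, 0, 5)
print(f"two-sided p = {p_value:.3f}")
```

With these counts the p-value is about 0.55, in line with the statement that the difference between the X and Y lineages is not significant.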