DNA Sequence Alignment Analysis

Analysis of DNA sequence data using MEGA and DnaSP

Analysis of two genes from the X and Y chromosomes of plant species from the genus Silene

The first two computer classes will be a set of exercises that you can work through at your own speed, getting as far as seems reasonable. All the software is freely available (web sites are given below), and you will get copies of the sequence alignment so that you can use it as an example file and work on it later, if needed. The questions are intended to show you how to use the software to analyse diversity within species and divergence between them, and to focus on some of the concepts covered in the lectures.

We shall analyse sequences of two genes from some closely related plants: the white campion, Silene latifolia, the closely related S. dioica, and the more distant relatives S. vulgaris and S. conica. S. latifolia and S. dioica have separate males and females, with an X/Y male sex-determining system, while S. vulgaris and S. conica are hermaphroditic species, and this is presumably the ancestral state.

Several genes have been found on the Y chromosomes of S. latifolia. In such cases, it is interesting to test whether the Y-linked gene is evolving as expected for a functional gene, or is showing signs of losing function. One kind of test is based on analyses of the sequences. If the Y-linked gene has remained functional, its sequence should diverge from the sequences of homologous genes more slowly at non-synonymous (amino acid replacement) sites than at synonymous sites. In other words, the KA value should be lower than KS (a KA/KS ratio < 1). If the Y-linked copy is losing function (or is nonfunctional), its sequence should accumulate non-synonymous changes more often than the X-linked copy.
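As a rough illustration of the KA/KS reasoning above, here is a minimal Python sketch. The function names and the example values are hypothetical and are not taken from the Silene data; the sketch only shows how the ratio is formed and read, and does no statistical test.

```python
def ka_ks_ratio(ka, ks):
    """Return KA/KS, or None when KS is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


def describe(ratio):
    """Give a cautious verbal reading of a KA/KS ratio."""
    if ratio is None:
        return "undefined (no synonymous divergence)"
    if ratio < 1:
        return "KA < KS: consistent with selective constraint on the protein"
    if ratio > 1:
        return "KA > KS: more non-synonymous change than synonymous change"
    return "KA = KS: no sign of constraint on amino acid replacements"


# Made-up divergence values, purely for illustration.
print(describe(ka_ks_ratio(0.25, 0.5)))
```

A ratio below 1 for the Y-linked copy would be in line with a gene that is still functional; the tests described later in the exercise are needed to judge whether X and Y differ significantly.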
Detailed instructions and description of the data files

You will be given a file with 144 sequences of the gene (the sequences are in FASTA format, which can be read by many software packages, including both MEGA and DnaSP). These two programs are very useful for preliminary analysis of sequence data and both are free (see web sites below). The FASTA files were made by aligning the sequences using SeAl2 (a good program for adjusting alignments by hand, available from this web site: ), then exporting in FASTA format, opening them in BioEdit ( ) and saving again as a new FASTA file that other programs are happy with. Because MEGA gives an annoying error when one tries to save the file opened from the FASTA file from either of these programs, I have made a separate file in MEGA format.

You can get the files from this web site (or from a CD which will be provided):

The file names are:
SileneXYgene4BioEditdata.fas
SileneXYgene4data.meg

Copy the files into the Workspace folder (and give back the CD, if you used that). NOTE that the files will not remain after you log off.

The files contain sets of genomic DNA sequences from the four species: two sets (X and Y) from each of S. latifolia and S. dioica, and one sequence from each of the two hermaphrodite species, which are used purely as outgroups.

Analyses using MEGA 3.1 (Molecular Evolutionary Genetics Analysis)

You can get the software for yourself (free) from:

(1) Find the MEGA software (under the Programs menus) and start it up. In the Start menu, select the following sequence of folders: All programs School Applications Science & Engineering Biological Science Summer School MEGA3.1

(2) From MEGA, open the file for gene 4: SileneXYgene4data.meg (use "activate a data file"). It will ask if it is nucleotide sequence, so click OK. Also click OK to the question about whether it is coding sequence. The text file editor opens again, showing your sequence data.

(3) Look at the sequences, using the Display option. You will see the sequence names listed in the left-hand column. The file contains the following sequences (a total of 144):

First are sequences from two dioecious species (X and Y sequences are included from both):
Silene latifolia X (40 sequences, from various populations, labeled X4) and Y (45 sequences, from the same populations, labeled Y4)

Silene dioica X and Y (30 and 27 sequences, respectively, labeled X4D and Y4D)

Then sequences from the two hermaphrodite species (only one sequence, non-sex-linked, per species):
Silene conica
Silene vulgaris

The sequences are partial. They do not include the complete coding sequence, so the sequence does not start with the start codon, but the coding sequence does end with the stop codon (the last part is 3' non-coding sequence). The total length is 1543 nucleotides.

(4) At this point, the program does not know where the coding sequence regions are, since the reading frame has not been specified. Gene 4 has few introns, so it is simple to analyse. In this case, the coding sequence starts with the first nucleotide of the first codon, but the first and last parts of the sequence are non-coding. To give the program this information, you need to use the first and last nucleotide positions of each non-coding region from the following tables, which also give, for each exon, the position in its codon of the first nucleotide. MEGA has a menu to assign domains as coding or non-coding, and also to select the correct number for the position in its codon of the first nucleotide in the sequence (from the table below). Note that the stop codon starts after position 1353 and should therefore be treated as non-coding; if you include it, some analyses will give an error message saying that a stop codon is found in the coding sequence. The menu is "select & edit genes/domains" (3rd tab from the left). Continue to enter the relevant data for each coding and non-coding sequence. Then close the window.

Gene 4 information
Exon | Exon positions | Intron or noncoding positions | First nucleotide position in codon

It is often a good idea to draw a rough picture of the gene, showing the region sequenced, the length of the region, and the intron and exon positions.

Now notice the labelling along the top of the window. The codons are shown. Select the menu to translate the sequences, and check that the amino acids look OK and that there are no stop codons (indicated by *; the actual stop codon is at positions in the alignment). If you made a mistake, it is easy to correct: just open the window again and change the positions, but NOTE that a position cannot be named as part of 2 different domains, so you have to work around this limitation.

Now look at the nucleotide sequence again. Note the numerous indels (gaps). What regions of the sequence are they in?

(5) To identify sets of sequences, you can make groups using the Edit/Select taxa and groups menu (the 2nd icon from the left in the row of icons at the top of the screen). You will see the S. latifolia X-linked sequences first, then the Y-linked set, then the X- and Y-linked sets for S. dioica. You can make named sets, which appear in the left-hand window, and transfer sequences into these from the right-hand window, using the small arrow. Make the 4 sets for these sequences. There is no need to make sets for the single S. vulgaris and S. conica sequences. Again, if you made a mistake, it is easy to correct it. When you return to the sequence viewer, you will see that the names of the groups of sequences are displayed in the left-hand column. You can save the file by selecting Write data to file and giving it a new name (e.g. adding your initials or something to indicate that the new version contains the intron-exon and species data).

(6) You can select options to have the program mark with a colour all sites of some particular type, e.g. variable sites, and to output a table showing them, which can be imported into Excel to make a figure. You must of course select the sequences to include, otherwise it will include all of them, and then the variable sites will include both (i) polymorphic sites within either of the species, including X-Y fixed differences, and (ii) differences between the species. You can select the sequences to include with the Edit/Select taxa and groups menu. Click on the box by any set you want to omit, and the tick mark should go away. This sequence (or set of sequences) will no longer be included in analyses. To reverse this decision, click to get the tick back again.

To make a list of all polymorphic sites within S. latifolia (including X-Y differences), remove all other sequences from the analysis, using the function just described. Then select the function to mark variable sites. NOTE the count of the number of such sites in the bar at the bottom of the data viewer screen. If it looks OK, you can ask it to Write data to file, choosing the option "marked sites only".

(7) Use the Construct Phylogeny function in the main MEGA window to make a tree using the sequences. You are given various options for the type of tree (we can use Neighbour-Joining, or NJ), the site type you want to analyse, the statistics you are interested in, and the region of the sequence you want to consider. It can use all sites, just synonymous sites, etc., and there are several options for evolutionary models, depending on whether the sites are coding or not, and also on whether to use the Jukes-Cantor correction, a correction for the saturation of sites that occurs when the sequences are highly divergent. There is an option for whether to display the bootstrap values on the tree figure (choose 1000 bootstraps). If you see only S. latifolia in your tree, you probably failed to reverse the decision to restrict analyses to just this species (see 6 above); you can do this and re-do the tree. If you did it correctly, it labels the sequences with their names and also shows the group names. If you want to save the tree, select Copy to clipboard under the Image menu. You can then paste it into PowerPoint, and add text to record the details of what analysis you actually did. You might try a different analysis to see what difference it makes.
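To make the Jukes-Cantor correction mentioned in step (7) concrete, here is a minimal Python sketch. It assumes two aligned sequences of equal length with "-" for gaps; the function names and the toy sequences are illustrative only and this is not how MEGA itself is implemented. The standard Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3) converts the observed proportion of differing sites p into an estimated number of substitutions per site.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, skipping any site with a gap in either sequence."""
    sites = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; the correction is undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy sequences, not from the Silene alignment.
p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
```

For small p the corrected and uncorrected values are nearly equal, which is why the correction matters mainly for the more distant comparisons.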

What do the results tell us? Here are some things to look for.

(i) Is either gene a pseudogene?

(ii) Are Y sequences less variable than X sequences? What do the trees suggest? To find numbers of polymorphisms in the two data sets, go back to (6) above and compare the numbers of X and Y polymorphisms in S. latifolia, and the number of fixed X-Y differences.

Table 1. Numbers of differences between X and Y sequences of S. latifolia, and numbers of variants in each of them.

Chromosome | Number of sequences | Number of variants
X polymorphisms | |
Y polymorphisms | |
X-Y fixed | |

These results and conclusions are helpful, but are only a preliminary analysis, and ideally we want to quantify variability and test the significance of any possible difference. We will see how to do this with DnaSP.

(iii) Do the gene trees of these X- and Y-linked genes agree with the species tree of the species that have sex chromosomes? How do you interpret what you see?
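The bookkeeping behind Table 1 (X polymorphisms, Y polymorphisms and fixed X-Y differences) can be sketched in Python as follows. This is a simplified illustration and not MEGA's procedure: it assumes two lists of aligned sequences of equal length, skips any site with a gap, and uses a hypothetical function name and toy data.

```python
def classify_sites(x_seqs, y_seqs):
    """Count sites polymorphic only in X, only in Y, in both, or fixed between X and Y."""
    counts = {"x_poly": 0, "y_poly": 0, "both_poly": 0, "fixed": 0}
    for i in range(len(x_seqs[0])):
        x_states = {s[i] for s in x_seqs}
        y_states = {s[i] for s in y_seqs}
        if "-" in x_states or "-" in y_states:
            continue  # skip alignment columns that contain a gap
        x_var = len(x_states) > 1
        y_var = len(y_states) > 1
        if x_var and y_var:
            counts["both_poly"] += 1
        elif x_var:
            counts["x_poly"] += 1
        elif y_var:
            counts["y_poly"] += 1
        elif x_states != y_states:
            counts["fixed"] += 1  # each set is invariant, but they differ
    return counts


# Toy alignment, not the Silene data.
x = ["ACGTA", "ACGTT", "ACGTA"]
y = ["TCGAA", "TCGAA", "TCCAA"]
print(classify_sites(x, y))
```

When no site is polymorphic in both sets, the number of variable sites in the pooled sample equals the sum of the three other counts, so the number of fixed differences can also be obtained by subtraction.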

Analyses using DnaSP

You can get the software for yourself (free) from:

(1) Find the DnaSP 4 software and start it up.

(2) From DnaSP, open the file for gene 4: SileneXYgene4BioEditdata.fas (it allows only one file to be open at a time).

(3) Look at the sequences, using the Display option. You will see the same sequences as with MEGA (if you did not do the MEGA exercise, look at part (3) above for an explanation).

(4) To tell the program where the coding sequence regions are, choose Assign coding regions from the Data menu and assign the regions (see table above) following the instructions in the dialog box. Now notice the labelling along the top of the window: the codons and their amino acids should be shown (when introns are present, these sites are labelled N for non-coding). Also look at the alignment. To save the file after adding this information, use the File menu to save the file under a new name in Nexus format; it will appear as FileName.nex. Choosing this format ensures that these details remain available next time DnaSP opens the file, so you don't have to do all this work over again.

(5) We will first estimate diversity, to test whether the impression of different values for the X and Y genes is correct. As with MEGA, you must first name the sequence sets, to identify them for analyses. In the Data menu, select the Define sequence sets option. You will see the 144 sequences listed in the left-hand window. First select the S. latifolia Y-linked sequences, define a set for your analyses, and name it. Then define a set of the X-linked sequences from the same species. Then make a third set with just the S. vulgaris sequence (which will be needed later), and also two further sets, one with a single S. latifolia Y sequence and one with a single X sequence.

Use the Polymorphism and divergence function to analyse the diversity. You are given various options for the site type to analyse, the statistics you are interested in, and the region of the sequence you want to consider. You must of course select the sequences to include. Click on the Data set option. You will see the sets of sequences with the names you gave them. First choose to estimate diversity for S. latifolia X-linked alleles, then Y-linked. If you use the default option to include divergence, the analysis will use only the regions of sequence that are present in the sequence chosen for divergence; thus, if you include divergence between the S. latifolia Y- and X-linked pair, you will have a fair comparison of the diversity of each of the two sets. Now click OK. The program will calculate divergence values for different types of sites, such as synonymous and non-synonymous sites, or intron sites. As you get the results, enter them in the table below. You will find it helpful to use the Pi(a)/Pi(s) ratio option. This gives you all the different types of sites in one output screen. Enter the results in Table 2 (if time is running short, there is no need to complete everything; the point is to understand what the items are, and you can return to the exercise later if you want). The numbers of polymorphisms at synonymous and non-synonymous sites are also given. Non-coding sites include intron and other non-coding sites.

Table 2. Variability within S. latifolia

Type of site | Number of sites | Number of variable sites | Within-species diversity (π values)
X-linked
All site types | | | π =
Synonymous | | | πS =
Non-synonymous | | | πA =
Non-coding | | | πNon-coding =
Silent | | | πSilent =
Y-linked
All site types | | |
Synonymous | | |
Non-synonymous | | |
Non-coding | | |
Silent | | |

(i) Does diversity appear to differ between X- and Y-linked samples?

(6) Next, analyse divergence between the sets of sequences: S. latifolia X- and Y-linked versus the hermaphrodite species S. vulgaris, using the groups you defined earlier. Use the analysis you already used, plus DNA divergence between populations. Enter the results in Table 3. Be sure that you understand how the numbers of synonymous and non-synonymous sites are calculated, and why these are not whole numbers (a brief outline of one method is in the Selection Basics lecture, and Graur and Li's book has an outline on page 81 onwards). Briefly, the reason is that only fourfold degenerate sites have only synonymous changes, and only non-degenerate sites have only non-synonymous changes; some changes at twofold degenerate sites are synonymous and some are non-synonymous, and this has to be taken into account. Thus it is not a simple matter of counting sites: the numbers are estimated and a model is involved, e.g. some methods assume that any change from one nucleotide to another is equally likely (which is untrue: transition rates are generally higher than transversion rates).
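To see why the numbers of synonymous and non-synonymous sites are not whole numbers, here is a minimal Python sketch of one simple counting scheme, in the spirit of the equal-rates methods mentioned above. It is an assumption that this matches what DnaSP does in detail: the sketch uses the standard genetic code, treats every nucleotide change as equally likely, counts changes to stop codons as non-synonymous, and expects an in-frame, upper-case coding sequence without gaps. The function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_sites(codon):
    """Sum, over the 3 positions, of the fraction of single changes that keep the amino acid."""
    amino_acid = CODON_TABLE[codon]
    total = 0.0
    for pos in range(3):
        synonymous = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                synonymous += 1
        total += synonymous / 3
    return total


def count_sites(coding_seq):
    """Return (synonymous, non-synonymous) site counts for an in-frame coding sequence."""
    usable = len(coding_seq) - len(coding_seq) % 3
    codons = [coding_seq[i:i + 3] for i in range(0, usable, 3)]
    s = sum(synonymous_sites(c) for c in codons)
    return s, 3 * len(codons) - s


print(synonymous_sites("TTA"))  # leucine codon: two positions are partly synonymous
print(count_sites("TTAGGT"))
```

A fourfold degenerate third position (as in GGT, glycine) contributes a whole synonymous site, a non-degenerate position contributes none, and twofold degenerate positions contribute fractions, which is where the non-integer totals come from.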

Table 3. Divergence between S. latifolia and S. vulgaris, and between S. latifolia Y and autosomal sequences.

Type of site and comparison | Divergence (KA, KS, etc.) | Numbers of sites analysed | Ka/Ks
S. latifolia X-linked versus Y-linked
All site types | | |
Synonymous, KS | | |
Non-synonymous, KA | | |
Non-coding, KNon-coding | | |
Silent, KSilent | | |
X-linked versus S. vulgaris
All site types | | |
Synonymous, KS | | |
Non-synonymous, KA | | |
Non-coding, KNon-coding | | |
Silent, KSilent | | |
Y-linked versus S. vulgaris
All site types | | |
Synonymous, KS | | |
Non-synonymous, KA | | |
Non-coding, KNon-coding | | |
Silent, KSilent | | |

Some further questions.

(ii) Is the Y-linked gene degenerated?

(iii) Has the Y-linked gene undergone more non-synonymous substitutions than the X? To test this, use estimated numbers of differences at synonymous and non-synonymous sites. You can use the 'Preferred and Unpreferred Synonymous Substitutions' function (also in the Analysis menu) to determine the numbers of substitutions in the S. latifolia X and Y sequences since they started to diverge from the outgroup (S. vulgaris). The analysis window has a diagram that will help you to understand the idea (use the analysis with a near and far outgroup). Enter the results in Table 4.

Table 4. Non-synonymous and synonymous substitutions in the S. latifolia X and Y sequences, using an outgroup (S. vulgaris).

 | Non-synonymous | Synonymous
Y | |
X | |

McDonald-Kreitman tests

This is a very simple, but important, test that has a good chance of correctly detecting selection even when the population is subdivided (see Graur and Li pages 63-64). There is an item for it under the Analysis menu. This is not the most appropriate data set for applying this test, but I added it because of its importance. There are 2 interesting questions we could use it for.

(a) We might wonder whether the higher diversity of the X-linked locus (see above) could be due to balancing selection at this gene, and this test is one way to examine this. If so, we expect an excess of non-synonymous polymorphisms within S. latifolia over the number expected based on divergence from an outgroup (we can use S. vulgaris). The test uses synonymous or silent sites to take account of possible mutation rate differences (see Answers section).

(b) We could also use this test to see whether it is likely that the Y has undergone an excess of non-synonymous substitutions since the split from S. vulgaris, which might suggest that slightly deleterious mutations are accumulating, due to the low effective size of the Y-linked gene (see Brian Charlesworth's lecture).

HKA tests

(iv) Is the X-Y diversity difference statistically significant? We can do an HKA test in DnaSP, using the results in Tables 2 and 3. The analysis is in the Tools menu (don't use the one in the Analysis menu; it is for comparing 2 parts of a single gene). How about silent sites? Here is a table to enter the results needed to do the HKA test:

 | All sites: X | All sites: Y | Silent sites: X | Silent sites: Y
Intra-specific variability
Number of segregating sites | | | |
Total number of sites | | | |
Sample size | | | |
Inter-species divergence from S. vulgaris
Average number of differences (DXY) | | | |
Total number of sites | | | |

(v) What might explain the low diversity of the Y-linked gene?
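The McDonald-Kreitman test described above compares a 2 x 2 table of counts (synonymous versus non-synonymous, fixed versus polymorphic). One common way to test such a table is Fisher's exact test; the following Python sketch implements a two-sided version with the standard library only. The function name is hypothetical, the counts in the example are made up, and it is an assumption that this matches the exact options offered by DnaSP.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: rows = synonymous, non-synonymous; columns = fixed, polymorphic.
print(fisher_exact_two_sided(8, 2, 1, 5))
```

A small p-value would indicate that the ratio of non-synonymous to synonymous changes differs between fixed differences and polymorphisms; with very small counts the test has little power, which is worth bearing in mind for this data set.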

(7) Raw versus net divergence.

Because there is variation among the sequences in a group, the mean divergence includes two components: (I) the differences between the sequences of the two species, and (II) differences between the S. latifolia sequences. If these latter differences were very high, we would not want to include them in estimates of divergence (since we are interested in the extent of substitutions between the species, or fixed differences). It is therefore reasonable to subtract the mean within-species diversity from the raw divergence, DXY, to get the net divergence:

Da = DXY - (k_species1 + k_species2)/2

where k_species1 and k_species2 are the within-species diversities of the two species.

To illustrate the difference between the two measures, do a divergence analysis between S. latifolia and S. dioica X- and Y-linked sequences, using the analysis DNA divergence between populations. Enter the results in Table 5.

Table 5. Divergence and net divergence from S. latifolia for all site types.

 | versus S. dioica | versus S. vulgaris
X-linked
DXY | (JC= ) | (JC= )
Da | (JC= ) | (JC= )
Y-linked
DXY | (JC= ) | (JC= )
Da | (JC= ) | (JC= )

The output shows the divergence values and also values with the Jukes-Cantor correction (labelled JC). You might want to write down both versions and consider whether this correction is required for these species.

NOTE that this analysis does not calculate separate divergence values for synonymous and non-synonymous sites, but just deals with all sites. To calculate synonymous and non-synonymous divergence, you have to use the Polymorphism and divergence analysis, which does give you that option (but that analysis doesn't give net divergence; how could you estimate those values?).

A further analysis would be to compare Fst values between the two closely related dioecious species with values between populations within either species. To do this, you would need to define more sequence sets, using the same menu as before (the sequence names indicate the populations). The analysis is Gene flow and genetic differentiation. To get statistical tests of subdivision, choose the option "Perform the Permutation test" in the dialog box.
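The relationship between raw and net divergence in (7) can be sketched in Python as follows. This is a minimal illustration without Jukes-Cantor correction, assuming aligned sequences of equal length with "-" for gaps; the function names and toy sequences are hypothetical, and a sample with a single sequence is given a diversity of zero.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in sites) / len(sites)


def diversity(seqs):
    """Mean pairwise difference within one sample (nucleotide diversity, pi)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0  # a single sequence carries no within-sample variation
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def raw_divergence(seqs1, seqs2):
    """Mean pairwise difference between sequences from two samples (DXY)."""
    pairs = list(product(seqs1, seqs2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def net_divergence(seqs1, seqs2):
    """Da = DXY minus the average of the two within-sample diversities."""
    return raw_divergence(seqs1, seqs2) - (diversity(seqs1) + diversity(seqs2)) / 2


# Toy samples, not the Silene data.
species1 = ["AAAA", "AAAT"]
species2 = ["TTAA", "TTAA"]
print(raw_divergence(species1, species2), net_divergence(species1, species2))
```

In this toy case the raw divergence is 0.625 and the net divergence is 0.5, because half of the diversity within the first sample (0.25) is removed; when one sample is highly variable, as expected for the X-linked set, the two measures can differ noticeably.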

(vi) Test for recombination in the X-linked genes: is there evidence for recombination?

(vii) Another X-Y gene pair was studied, and the X-Y divergence value was 2%. What could account for the different values for the two genes?

If you have time, you can try other analyses. For instance, you can compare Fst values between the two closely related dioecious species with values between populations within either species.

APPENDIX: Some tips for using MEGA and DnaSP

MEGA should open FASTA files. This is how to do it. Before this, check that each sequence has a different name, and that names are not too long. Another tip is that DnaSP doesn't accept alignments where the first sequence starts with gaps.

(1) Open MEGA.

(2) From the File menu, choose Open and select the file you need. It generally says "error # 5520 line too long on line 1". Ignore this (click OK). Then select the funny little icon (the one to the right of the print icon) with the arrow pointing downwards; this converts to MEGA format (to see text indicating the icons' meanings, move the cursor over the icon and, at the right position, text will appear). A dialog box appears. Select OK (don't try to select data format). The text file then appears in the window in MEGA format. Save it under a new name, so you can tell which version is in MEGA format (ignore the "save as type" selection window). Close all windows in the text editor. This often crashes MEGA; if so, press the Control/Alt/Delete keys simultaneously to get the Task Manager window, which allows you to quit from MEGA. It will have saved the file OK, and you can then re-start the program and open the file to carry on with analyses as below.

(3) Click the text "activate a data file", and select the file you just made. It will ask if it is nucleotide sequence, so click OK, as it normally will be that kind of sequence. Also click OK to the question about whether it is coding sequence. The text file editor opens again, showing your sequence data. Again, it often says "error # line XX" and the cursor moves to the offending line.

a. Sometimes this means that it thinks 2 different sequences have the same name (because the software doesn't seem to take in the complete names you gave your sequences). You can edit the name where it stopped, or the one with the duplicate name (often the sequence before the one where it stopped), to change it slightly, or do whatever is needed to make it acceptable. Then save and close the file. Repeat the process of activating the data file.

b. Generally, there is nothing wrong with the sequence names, but if you change them slightly (e.g. add x at some position) the software is happy with this sequence next time you activate the file (but it may stop at a different one, so one sometimes has to change lots of names, each time saving and closing the file, then re-activating it).

When it is all satisfactory to MEGA, it will ask if it is coding or not. Click Yes, unless the entire sequence is non-coding.

(4) From the Data menu, choose Data explorer. This displays the entire set of sequences, and (if cDNA sequences are included) it indicates which parts are coding and which non-coding, by showing codons at the top of the window (indicated by a faint box around each 3 nucleotides), so you can check that all is as you expected, and you can note the position in the codon of the first base of each exon (needed in the next step).

(5) Enter the exon and intron positions.

a. This property is very helpful, so it is often best, after aligning and importing sequences, to work first in MEGA to determine exon and intron positions, and to note them down before opening the data in DnaSP. Clicking on the base in a column (position) gives its number in small characters at the bottom left of the data explorer window.

b. You can check the translation with the right-hand icon. If you need to change the reading frame, use the Data menu: Select genes and domains menus to tell it the first site's codon position.

(6) Set up the groups of sequences, using the Data menu. NOTE: take care that a group name is selected before trying to add sequences to a group, otherwise it crashes in a bad way.

(7) Save data to file. Use MEGA format, and give it a different name. Then, after closing the data file, the information will still be there next time you use MEGA to open it.

Now you are ready to do analyses. Menus on the small MEGA window are simple to use, and it is pretty obvious what they do. You can select which sequences to include, and you can un-select some sequences in the Data explorer window and use the remaining ones to see variable sites, and some other things that can be useful.
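The checks recommended at the start of this appendix (unique names, names not too long, no leading gaps in the first sequence) can be automated before opening a FASTA file in MEGA or DnaSP. The following Python sketch is one way to do it; the function names are hypothetical, and the name-length limit of 40 characters is an arbitrary placeholder, since the appendix does not state what the programs actually accept.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_alignment(records, max_name_len=40):
    """Return a list of problems; max_name_len is an assumed limit, not a documented one."""
    problems = []
    seen = set()
    for name, _ in records:
        if name in seen:
            problems.append(f"duplicate name: {name}")
        seen.add(name)
        if len(name) > max_name_len:
            problems.append(f"name longer than {max_name_len} characters: {name}")
    if len({len(seq) for _, seq in records}) > 1:
        problems.append("sequences differ in length, so this is not an alignment")
    if records and records[0][1].startswith("-"):
        problems.append("first sequence starts with a gap")
    return problems


# Toy input; for a real file use: read_fasta(open("SileneXYgene4BioEditdata.fas"))
example = [">X4_a", "--ACGT", ">Y4_a", "TTACGT", ">X4_a", "TTACGA"]
print(check_alignment(read_fasta(example)))
```

Because MEGA may not read the complete names, it can also be worth re-running the duplicate check on names truncated to a shorter length.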

ANSWERS

MEGA

Table 1. Numbers of differences between X and Y sequences of S. latifolia, and numbers of variants in each of them.

Chromosome | Number of sequences | Number of variants
X polymorphisms | |
Y polymorphisms | 45 | 5
X-Y fixed | | 101

The number 101 is calculated assuming that there are no shared variants, so that the total number of variants when both X and Y sequences are included (216) is the sum of the numbers in the first 2 rows + the number of X-Y fixed differences. Thus I subtracted the first two rows from this total to get 101. It is quite simple to check that the X and Y share no polymorphic sites (if they did, this calculation would give the wrong number of fixed differences); I just listed the 5 Y polymorphic sites, had MEGA display the X polymorphisms, and looked to see if any of those sites is among the 5.

What do the results tell us? Here are some things to look for.

(i) Is either gene a pseudogene? Probably not: there are no stop codons in the coding sequence, and no frame-shifts.

(ii) Are Y sequences less variable than X sequences? Yes, but these results are only a preliminary analysis, and ideally we want to quantify variability and test the significance of any possible difference. We will see how to do this with DnaSP.

(iii) Do the gene trees of these X- and Y-linked genes agree with the species tree of the species that have sex chromosomes? No: the Y-linked sequences do, but not the X. The higher X variability suggests that diversity was high in the ancestor, and thus lineage sorting occurred, so that different sequences are present in the X sets of both species. An alternative is that introgression has occurred, but that, for some reason, the Y does not introgress.

DnaSP

NOTE 1: I used the analysis including divergence between S. latifolia X and Y, and entered those results in Table 3 (other choices will give different numbers in these tables).

NOTE 2: It is not simple to get numbers of variable sites. The numbers given in the output of this diversity analysis are numbers of polymorphisms, so a site that has 3 different nucleotides may be counted as 2 different polymorphisms (as at least 2 mutations must have occurred). This explains why numbers of variable sites differ when you extract them from other analyses. For example, the number of "substitutions" is 111 in the diversity analysis of X alleles, using all sites, but the analysis says that there are 103 polymorphic sites (so that is the number I put in this table); presumably 8 sites have > 2 different nucleotides. For the coding region sites, the two counts are the same. For non-coding regions, we should thus have =53 polymorphic sites, again suggesting 61-53=8 sites with > 2 different nucleotides (and the McDonald-Kreitman table using all silent sites, not just synonymous ones, says 55). Then for silent sites we should have 35+53=89 (against 96 "substitutions" as counted by the program).

Table 2. Variability within S. latifolia

Type of site | Number of sites | Number of variable sites | Within-species diversity (π values)
X-linked
All site types | | | π =
Synonymous | | | πS =
Non-synonymous | | | πA =
Non-coding | | | πNon-coding =
Silent | | | πSilent =
Y-linked
All site types | As above | |
Synonymous | (see note 1) | 0 | 0
Non-synonymous | | 0 | 0
Non-coding | | |
Silent | | |

NOTE 3: πA/πS = for X, but cannot be estimated for Y (no variants).

Table 3. Divergence between S. latifolia and S. vulgaris, and between S. latifolia Y and autosomal sequences. See NOTE 1 above.

Type of site and comparison | Divergence (KA or KS with JC correction) | Number of sites analysed | Ka/Ks
S. latifolia X-linked versus Y-linked
All site types | | |
Synonymous | | |
Non-synonymous | | |
Non-coding | | |
Silent | | |
X-linked versus S. vulgaris
All site types | | |
Synonymous | | |
Non-synonymous | | |
Non-coding | | |
Silent | | |
Y-linked versus S. vulgaris
All site types | | |
Synonymous | | |
Non-synonymous | | |
Non-coding | | |
Silent | | |

Questions.

(i) Does diversity appear to differ between X- and Y-linked samples? Yes, X >> Y.

(ii) Is the Y-linked gene degenerated? Not evidently: Ka << Ks in all comparisons in Table 3 above.

(iii) Has the Y-linked gene undergone more non-synonymous substitutions than the X? No. According to the analysis using the outgroup S. vulgaris, the numbers of changes are as follows, and the difference is not significant by a 2 x 2 contingency test (see DnaSP Tools menu).

Table 4. Non-synonymous and synonymous substitutions in the S. latifolia X and Y sequences, using an outgroup (S. vulgaris).

 | Non-synonymous | Synonymous
Y | 6 | 20
X | 0 | 5

McDonald-Kreitman tests

(a) Could the higher diversity of the X-linked locus (see above) be due to balancing selection at this gene? If so, we expect an excess of non-synonymous polymorphisms within S. latifolia over the number expected based on divergence from an outgroup (we can use S. vulgaris). The test uses synonymous or silent sites to take account of possible mutation rate differences. The results are as follows, and the test statistic is non-significant.

X vs. S. vulgaris | Fixed | Polymorphic
Synonymous | |
Non-synonymous | 2 | 15

(b) Is it likely that the Y has undergone an excess of non-synonymous substitutions since the split with S. vulgaris, which might suggest that slightly deleterious mutations are accumulating, due to the low effective size of the Y-linked gene? The results are as follows:

Y vs. S. vulgaris | Fixed | Polymorphic
Synonymous | |
Non-synonymous | 11 | 0

The test cannot be done, because of the absence of polymorphisms. To try to see if there is any likelihood that the above process might be happening, we could set that value to 1 and use the DnaSP Tools menu to do a 2 x 2 contingency table test. It is non-significant, i.e. there is no evidence for an undue amount of non-synonymous substitution. NOTE that this test is less good than the previous one, because all substitutions are included and we cannot separate out those that occurred specifically in the Y-linked lineage.

(iv) Is the difference between X- and Y-linked samples statistically significant? Yes, by an HKA test using divergence from S. vulgaris. According to my results, the numbers are as in the table below, and both sets give significant results after selecting the appropriate type of gene (X- and Y-linked; think about why this is needed).

Columns: All sites (X, Y); Silent sites (X, Y)

Intra-specific variability
  Number of segregating sites (NOTE 4)
  Total number of sites
  Sample size
Inter-species divergence from S. vulgaris
  Average number of differences (DXY) (NOTE 5) (NOTE 5)
  Total number of sites *328 = 51.51 (NOTE 6) *678 = (NOTE 6)

NOTE 4: From the DNA diversity and divergence analysis, using S. vulgaris for divergence.
NOTE 5: I got these from the DNA Divergence between populations analysis.
NOTE 6: I multiplied divergence per silent site by the number of silent sites (data from Table 3).

(v) What might explain the low diversity of the Y-linked gene? One possibility is that degeneration is occurring, and that hitch-hiking events are reducing its diversity. Another possibility is a much lower effective size for the Y than the X, e.g. due to strong sexual selection such that there is a high variance of male reproductive success. This predicts that the diversity of autosomal genes should be reduced relative to that of X-linked genes (because X-linked genes are carried in males 1/3 of the time, versus ½ for autosomal genes).

(7) Raw versus net divergence.

Table 5. Divergence and net divergence from S. latifolia for all site types.

X-linked (columns: versus S. dioica; versus S. vulgaris)
  DXY   (JC= )   (JC= )
  Da    (JC= )   (JC= )
Y-linked (columns: versus S. dioica; versus S. vulgaris)
  DXY   (JC= )   (JC= )
  Da    (JC= )   (JC= )
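The two quantities in Table 5 can be sketched as follows. This is a minimal illustration, assuming the usual Jukes-Cantor formula K = -(3/4) ln(1 - 4p/3) and Nei's net divergence Da = DXY - (pi_X + pi_Y)/2; the function names and all numerical inputs are hypothetical and are not values from the data set.

```python
from math import log

def jukes_cantor(p):
    """Jukes-Cantor corrected divergence for a raw per-site difference p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction")
    return -0.75 * log(1 - 4 * p / 3)

def net_divergence(dxy, pi_x, pi_y):
    """Net divergence: raw divergence minus the mean within-group diversity."""
    return dxy - (pi_x + pi_y) / 2

# Hypothetical inputs, for illustration only (not taken from the handout)
dxy_raw = 0.10        # raw per-site divergence between two species
pi_species_1 = 0.03   # nucleotide diversity within the first species
pi_species_2 = 0.0    # e.g. a single outgroup sequence has no diversity

print(f"JC-corrected DXY: {jukes_cantor(dxy_raw):.4f}")
print(f"Net divergence Da: {net_divergence(dxy_raw, pi_species_1, pi_species_2):.4f}")
```

The JC correction matters more as raw divergence grows, while the net-divergence adjustment matters more when within-species diversity is high.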

Jukes-Cantor correction is required for the more distant species for the X data (where diversity is high within S. latifolia), but not for Y, where there are few variants within the species.

Compare FST values between the two closely related dioecious species with values between populations within either species. To do this, you would need to define more sequence sets, using the same menu as before (the sequence names indicate the populations). The analysis is Gene Flow & Genetic Differentiation. To get statistical tests of subdivision, choose the option Perform the Permutation Test in the dialog box.

(vi) Test for recombination in the X-linked genes: is there evidence for recombination? The X sequences yield a minimum number of recombination events, Rm = 6 (and Y gives zero). The X data do not fit zero recombination. I did an analysis of diversity within the X sequence set, and then a coalescent simulation for haplotype diversity, given theta. The program takes the results from the diversity analysis and enters them in the relevant boxes, and you can run the simulation and see the 95% confidence intervals. These show that the observed value has a low probability if the simulation is run with zero recombination, but it is not quite significant (other variants of this analysis, e.g. using the number of segregating sites, are highly significant). The Y data set fits zero recombination better (but also fits recombination > 0, so it is not really conclusive).

(vii) Another X-Y gene pair was studied, and the X-Y divergence value was 2%. What could account for the different values for the two genes? One possibility is that degeneration is occurring, and that hitch-hiking events are reducing its diversity. Another possibility is a much lower effective size for the Y than the X, e.g. due to strong sexual selection such that there is a high variance of male reproductive success.
This predicts that the diversity of autosomal genes should be reduced relative to that of X-linked genes (because X-linked genes are carried in males 1/3 of the time, versus ½ for autosomal genes).
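The prediction about effective sizes can be illustrated with a short calculation. This is a sketch under simplifying assumptions (neutral variation, random mating, and the standard approximations for effective size with unequal effective numbers of breeding males Nm and females Nf: autosomes 4NmNf/(Nm+Nf), X 9NmNf/(4Nm+2Nf), Y Nm/2). The function name and the numbers are hypothetical, chosen only to show the direction of the effect.

```python
def effective_sizes(n_males, n_females):
    """Return (autosomal, X-linked, Y-linked) effective sizes.

    n_males and n_females are the effective numbers of breeding males and
    females; a high variance in male reproductive success lowers n_males.
    """
    autosomal = 4 * n_males * n_females / (n_males + n_females)
    x_linked = 9 * n_males * n_females / (4 * n_males + 2 * n_females)
    y_linked = n_males / 2
    return autosomal, x_linked, y_linked

# Hypothetical numbers: equal sexes versus strongly skewed male success
for label, nm, nf in [("equal sexes", 500, 500), ("few successful males", 100, 500)]:
    a, x, y = effective_sizes(nm, nf)
    print(f"{label}: A/X = {a / x:.2f}, Y/X = {y / x:.2f}")
```

With equal effective numbers of the two sexes the ratios are the familiar 4/3 and 1/3. Reducing the effective number of males lowers both ratios, i.e. autosomal and Y-linked diversity are both expected to fall relative to X-linked diversity.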