This task sheet contains questions. Please answer these questions in groups of two and write a short report.

Tasks Monday January 21st 2006

Goals:
- To work with public databases on the internet to find gene and protein information.
- To use tools to analyse and compare DNA sequences.
- To find homologous sequences in other organisms and to learn the concept of orthologs and paralogs.
- To make a phylogenetic analysis using ClustalW.
- To analyse genome sequences from multiple organisms using VISTA.

We will make use of public DNA databases (NCBI and UCSC) and protein databases (EBI). The underlying information in the various databases is mostly identical, but visualisation and search options as well as annotation may vary.

Task 1: Homologs of the E. coli photolyase gene

The bacterium E. coli can repair UV-induced DNA damage. UV light can result in the formation of cyclobutane thymine dimers. The enzyme photolyase can repair this damage, but it needs visible light to be activated. The energy of a photon is absorbed by the enzyme and used by FADH to free an electron needed for repair of the DNA damage.

In this task you will search for the E. coli K12 photolyase gene and protein, and you will try to find and compare homologs in other model organisms from other 'kingdoms'. You will collect information for these homologs (e.g. protein size, protein domains present). Using this information, you will try to reconstruct the possible evolution of this gene and how it arose in various organisms.

Find the amino acid sequence of the E. coli photolyase protein at NCBI. Go to http://www.ncbi.nih.gov/ and search all databases for photolyase. Find the protein sequence whose accession starts with NP_, which means that it is the reference sequence for a given organism. How many amino acids does the protein consist of? In the pull-down menu 'display' you can select another view. Select 'FASTA' for a short version of the sequence without extensive annotation. Copy the sequence, including its description, which is preceded by ">", and paste it into your notepad. Save this file for later use.

As depicted in the picture above, the protein contains two important activities. We will now analyse the protein for known 'domains' using the program "Interproscan" (http://www.ebi.ac.uk/interproscan/). Copy the protein sequence into this screen and start the analysis. Which two large protein domains are found in the E. coli photolyase protein? What are the functions of these two domains?

To find homologs in other organisms you can choose to use the complete protein or one of the two domains. We will first search for homologs in the unicellular organism baker's yeast, Saccharomyces cerevisiae. Copy the E. coli photolyase amino acid sequence and find homologs in yeast using Blast (http://www.ncbi.nlm.nih.gov/blast/). Which Blast program would be best suited for this task? Paste the protein sequence into the 'search' window and limit your search to "Saccharomyces cerevisiae" in the options panel. After you have started the Blast comparison, a new screen will pop up, again showing the two domains present in the photolyase protein. Hit the 'format' button to go to the results.
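The notepad file with FASTA records that you saved earlier in this task can also be checked with a few lines of Python, for instance to count how many amino acids each protein consists of. This is only a minimal sketch and not part of the original course material: the function name `read_fasta` is an assumption, and the two example records are made-up toy sequences, not real photolyase entries.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {description: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A description line starts a new record.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines belong to the most recent description line.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical toy records, NOT real photolyase sequences.
example = """>toy_protein_1 made-up example
MTTHLVWFRQ
DLRLHDNLAL
>toy_protein_2 made-up example
MKRTVISSS
"""

lengths = {name: len(seq) for name, seq in read_fasta(example).items()}
for name, n in lengths.items():
    print(f"{name}: {n} amino acids")
```

To use it on your own file, read the file contents into a string and pass that string to `read_fasta` instead of `example`.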
Click on the best hit, preferably again an 'NP_xxxxx' sequence, then retrieve this sequence in FASTA format and copy it to the notepad file that contains the E. coli sequence.

Blast also returns an alignment of the E. coli and yeast protein sequences, but this is a local alignment that only shows those parts that match best. In this case, homology information for the start and end of the protein is missing. To make a global alignment, we will use the program Align (http://www.ebi.ac.uk/emboss/align/index.html). The alignment method is set to global alignment by default. Paste the E. coli and yeast protein sequences into the two input fields. Hit 'run' to see the aligned output. What is the most striking difference between the two sequences?

This part of the protein may play a role in the subcellular targeting of the protein. Go to the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org/) and search for "Phr1", which is the yeast name for this protein. What is the subcellular localisation of the protein? Could this be expected? Assuming that the segment that is present in yeast and not in E. coli is responsible for this localisation, why can E. coli do without this segment?

Now go back and try to find other homologs of the E. coli photolyase in eukaryotes. Include homologs in human, mouse and a plant, and in some organisms of your own choice. Copy all sequences in FASTA format (preferably sequences with NP_xxxxx names) to a new notepad file. Note that organisms may contain multiple different homologs! Collect all of them.

Once you have a nice collection of sequences, we will compare them with each other using the multiple sequence alignment program ClustalW (http://www.ebi.ac.uk/clustalw/). Read the Frequently Asked Questions for more background on this tool. At the bottom of the page you will find the 'Upload a file' field. Select your saved notepad file and run the program. Discuss your findings in your report. You can improve your alignment by removing distantly related sequences. Delete these sequences (e.g. E. coli) from your notepad file and reanalyse your sequences.

The human and mouse genomes both contain two clear photolyase homologs: cryptochrome 1 and 2. Describe which genes are likely to be orthologs and which are paralogs.
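The difference between the local alignment returned by Blast and the global alignment made with Align in this task can be illustrated with a small dynamic-programming sketch. It is an illustration under stated assumptions, not the method used by either tool: the scoring values (match +1, mismatch -1, gap -2) are arbitrary toy values, the function name `alignment_score` is hypothetical, and the two sequences are made up. Real tools use substitution matrices and more refined gap penalties.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), in which
                 every residue of both sequences must be aligned.
    local=True:  local alignment (Smith-Waterman style), in which only
                 the best-matching part of the sequences is scored.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            value = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never drop below 0.
                value = max(value, 0)
            score[i][j] = value
            best = max(best, value)
    return best if local else score[-1][-1]


# Made-up toy sequences: the second one carries an extra segment at its start.
short_seq = "HEAGAW"
long_seq = "PPPPPHEAGAW"

local_score = alignment_score(short_seq, long_seq, local=True)
global_score = alignment_score(short_seq, long_seq, local=False)
print("local:", local_score)    # 6: only the shared part is scored
print("global:", global_score)  # -4: the five unmatched residues cost five gaps
```

The local score ignores the extra segment entirely, which is why a Blast alignment can hide a difference at the start or end of a protein that a global alignment makes visible.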

Task 2: Comparative genome analysis of the human cry2 locus

From Task 1 you have learnt that you can find protein sequences and identify homologs in other organisms. However, sometimes the protein sequence is not available for a given organism, or it may be questionable whether the gene structure is properly predicted from the genome sequence. In this task, you will search for homologous regions in mouse, rat, chimpanzee, fugu, etc. using the comparative genome browser VISTA (http://genome.lbl.gov/vista/index.shtml). You will find various programs on the VISTA home page for specific types of searches.

Go to the VISTA Browser (http://pipeline.lbl.gov/cgi-bin/gateway2) and search in the 'Human July 2003' genome for the human photolyase gene 'cry2' by entering this term in the position field. You will now see a graphical display of the degree of conservation between the homologous human and mouse genome sequences. Try to understand the figure and colouring using the legend. Extend the comparison by adding more organisms using the pull-down menu on the left.

Which parts of the gene are clearly conserved in all organisms? Which organisms are best suited for the identification of these kinds of conserved regions? Which are less suited? Explain why. Which organism would be best suited for finding conserved and potentially functional promoter elements that regulate the expression of this gene?

Zoom out by clicking on the magnification icon with the '-' sign. You will now see a larger genomic region. In the chimpanzee trace you will now see a large gap. What does this mean, and which process underlies it?
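The kind of conservation curve you looked at in this task can be approximated by computing the percent identity in a sliding window along a pairwise alignment. This is a rough sketch of the general idea only: the function name `window_identity`, the window size of 5 and the two aligned toy sequences are assumptions for illustration, and the actual VISTA settings are not reproduced here.

```python
def window_identity(aligned_a, aligned_b, window=5):
    """Percent identity in each sliding window of a pairwise alignment.

    Both inputs must be rows of the same alignment (equal length, with
    '-' for gaps). A column counts as identical only if both residues
    are the same and neither is a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(x == y and x != "-") for x, y in zip(aligned_a, aligned_b)]
    return [
        100.0 * sum(matches[start:start + window]) / window
        for start in range(len(matches) - window + 1)
    ]


# Made-up aligned toy sequences with two mismatching columns.
profile = window_identity("ACGTACGTAC", "ACGTTCGAAC", window=5)
print(profile)  # [80.0, 80.0, 80.0, 60.0, 60.0, 80.0]
```

A larger window gives a smoother curve but blurs short conserved elements, so the window size is a trade-off between noise and resolution.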

Task 3: Identification of functional genomic elements using phylogenetic shadowing

This morning you have read the paper by Boffelli et al. on phylogenetic shadowing. This method is specifically suited for the identification of small conserved elements in a genome or of lineage-specific features. For this task you will be using the sequences from 10 different primates (FASTA format) from the course webpage.

Align these sequences using ClustalW. What can you conclude from this alignment? Clearly, another approach is needed to extract information from these sequences. Use the eShadow tool (http://eshadow.dcode.org/) to analyse these sequences. Play around with the window size settings to get a clear view. What is shown in the graph? How many potentially functional segments are present in this region, and which principle underlies this hypothesis? What is the estimated size of each conserved region?
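The effect of the window size on a variation plot for closely related sequences can be explored with a small sketch. It counts, for every alignment column, whether the sequences differ at all, and then reports the percentage of variable columns per window. The function names `variable_columns` and `percent_variation` and the four made-up toy sequences are assumptions for illustration; eShadow's own statistics are more elaborate and are not reproduced here.

```python
def variable_columns(alignment):
    """Return 1 for each column in which the sequences are not all identical, else 0."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    return [int(len(set(column)) > 1) for column in zip(*alignment)]


def percent_variation(alignment, window):
    """Percentage of variable columns in each sliding window."""
    flags = variable_columns(alignment)
    return [
        100.0 * sum(flags[start:start + window]) / window
        for start in range(len(flags) - window + 1)
    ]


# Made-up toy alignment of four closely related sequences.
toy_alignment = [
    "ACGTACGT",
    "ACGTACGA",
    "ACGTTCGT",
    "ACGTACCT",
]

flags = variable_columns(toy_alignment)
variation = percent_variation(toy_alignment, window=4)
print(flags)      # [0, 0, 0, 0, 1, 0, 1, 1]
print(variation)  # [0.0, 25.0, 25.0, 50.0, 75.0]
```

Each single pair of sequences in the toy alignment differs at only one or two positions, but combining all sequences reveals that the variation accumulates in the second half while the first half stays unchanged.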

Now let's go back to the VISTA home page and see if we can retrieve the same information using other genomes. Use the GenomeVISTA tool with the human sequence from your sequence list to find its genomic coordinates in the human June 2004 genome assembly. Wait until your search is finished and click the 'Vista browser' option. Add all available organisms for comparison. What can you conclude? Which organisms could also be used to identify these individual elements, and which are not very informative? What is the estimated size of each conserved region?

Close the VISTA browser window and select the 'VISTA track' option in the search results window. You are now redirected to the UCSC genome browser, which displays your results along with existing genome annotation. There is another track with conservation information, showing the cumulative conservation using information from 10 different organisms (this is not a pairwise alignment, as you have seen thus far in VISTA, but a graphical representation of a sort of ClustalW multiple alignment). What would you conclude from the 10-way alignment? What is the estimated size of each conserved region?

Under the graph you will find many options that can be displayed as well. Select the 'full' option for the sno/mirna option. Which element(s) reside in the conserved regions? What are their sizes?
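Estimating the size of each conserved region from a conservation profile amounts to finding uninterrupted stretches of positions that score at or above a chosen threshold. The following is a minimal sketch of that idea, assuming a per-position conservation score is available as a list of numbers. The function name `conserved_regions`, the threshold of 70 and the score values are hypothetical and are not taken from VISTA or the UCSC browser.

```python
def conserved_regions(scores, threshold, min_length=1):
    """Return (start, end, length) for each run of scores >= threshold.

    start is 0-based and inclusive, end is exclusive. Runs shorter
    than min_length are discarded.
    """
    regions = []
    start = None
    for pos, value in enumerate(scores):
        if value >= threshold and start is None:
            start = pos  # a new conserved run begins here
        elif value < threshold and start is not None:
            if pos - start >= min_length:
                regions.append((start, pos, pos - start))
            start = None  # the run has ended
    # Close a run that extends to the end of the profile.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores), len(scores) - start))
    return regions


# Made-up per-position conservation scores (in percent).
toy_scores = [10, 80, 90, 85, 20, 30, 75, 95, 40]

regions = conserved_regions(toy_scores, threshold=70, min_length=2)
print(regions)  # [(1, 4, 3), (6, 8, 2)]
```

The estimated sizes depend on the threshold, so it is worth stating which threshold you used when you report region sizes.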