BIOINFORMATICS TUTORIAL

Bio 242 BIOINFORMATICS TUTORIAL

Bio 242 α-Amylase Lab
Sequence Searches: BLAST
Sequence Alignment: Clustal Omega
3d Structure & 3d Alignments

DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.
A pdf of this document is available on the bio 242 website.

Acknowledgements

The Bates Bioinformatics Tutorial was originally developed as part of the Collaborative Technologies Development project. David Asanuma ('09) created the site under the guidance of Nancy Kleckner, Associate Professor of Biology, and Michael Hanrahan, Assistant Director of Research and Curricular Computing. Revision of the content is performed annually by Greg Anderson and Carolyn Lawson to keep the document up to date with the website.

Page 1 of 26 Bioinformatics Tutorial (rev )

Bioinformatics Tutorial

Bioinformatics is the acquisition, storage, arrangement, identification, analysis, and communication of information related to biology. The term was coined in 1990 with the use of computers in DNA sequence analysis. Think of it as the theoretical branch of molecular biology, much as theoretical physics relates to the general field of physics.

Now that you have obtained information about some of the chemical properties of α-amylase, in this exercise you will compare the molecular structure of the enzyme among the three (or more!) species. The tutorial will guide you through finding the gene sequences using both the Entrez search and BLAST tools, and then comparing them using the Clustal Omega tool.

You will be using the online DNA and protein sequence databases that are the core of bioinformatics. There are two general types of sequence databases:

- Primary databases contain experimental results in an accessible format, but their sequences are not a population consensus. DDBJ, EMBL, and GenBank are primary databases.
- Secondary databases are curated to reflect consensus sequences from multiple experiments and usually use the primary databases as their sources.

Abbreviations

DDBJ    DNA Data Bank of Japan
EMBL    European Molecular Biology Laboratory
NCBI    National Center for Biotechnology Information
BLAST   Basic Local Alignment Search Tool

The standard sequence format is called FASTA. All FASTA sequences start with a definition line which consists of:

- a unique identification number (the accession number)
- the version number of the sequence
- the length of the sequence
- molecule type (DNA or mRNA)
- taxonomic division (for instance, INV = invertebrate)
- last release date
- source organism

Every coding sequence also has a unique protein number assigned to it, starting with AA. Reference sequences (which undergo continuing curation) are the most complete and up to date and always start with NT_ for DNA, NM_ for mRNA, or NP_ for protein. Hint: these are the ones you want to use if possible.

Sequence Search Introduction

Entrez

Entrez is a data retrieval system developed by the National Center for Biotechnology Information (NCBI) that provides integrated access to a wide range of data domains, including literature, nucleotide and protein sequences, complete genomes, three-dimensional structures, and more. Entrez includes powerful search features that retrieve not only the exact search results but also related records within a data domain that might not be retrieved otherwise, and associated records across data domains. These features enable us to gather previously disparate pieces of an information puzzle for a topic of interest.
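As a small illustration of the prefix convention described above, the following Python sketch labels an accession number by its first two letters. The function name and the example accession NP_000001 are made up for this illustration, and the rule is deliberately simplified: real databases use many more prefixes than the three named in this tutorial.

```python
# Prefixes taken from the tutorial text: NT (DNA), NM (mRNA), NP (protein).
REFSEQ_PREFIXES = {"NT": "DNA", "NM": "mRNA", "NP": "protein"}


def describe_accession(accession: str) -> str:
    """Return a rough label for an accession number based on its prefix.

    Simplified on purpose: only the prefixes mentioned in the tutorial
    are recognized.
    """
    prefix = accession.strip().upper()[:2]
    if prefix in REFSEQ_PREFIXES:
        return f"reference sequence ({REFSEQ_PREFIXES[prefix]})"
    if prefix == "AA":
        return "protein number (not a reference sequence)"
    return "other record"


# NP_000001 is a hypothetical accession used only to exercise the function.
for acc in ("NP_000001", "AAA50161"):
    print(acc, "->", describe_accession(acc))
```

A check like this is only a first filter; always confirm the record type on the record page itself.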

Effective and powerful use of Entrez requires an understanding of the available data domains, the variety of data sources and types within each domain, and Entrez's advanced search features. This tutorial uses corn (Zea mays) alpha amylase to demonstrate the wide variety of information that we can rapidly gather for a single gene. The numbers noted in the search results will of course change over time as the databases grow. The same techniques shown here can be used for any topic of interest.

The search goals are to:

- separate the wheat from the chaff by identifying a representative, well-annotated mRNA or protein sequence record
- retrieve associated literature
- identify conserved domains within the protein
- identify similar proteins
- find a resolved three-dimensional structure for the protein or, in its absence, identify structures with homologous sequence
- perform VAST alignments of 3d structures of plant and animal amylases to visualize where similarities and differences occur

Let's get started! Go to the NCBI website by entering the URL in the address field of your browser.

After accessing the NCBI website, you can search for corn alpha amylase sequences in either the nucleotide or protein databases by selecting one or the other from the Database drop-down menu. Other points of interest on the NCBI home page are the PubMed link, which allows you to search for journal articles on the structure and function of alpha amylases, and the BLAST link, which allows you to search for nucleotide or protein sequences with similarity to your sequence of interest.

For now, make sure you are at the NCBI home page (click on the NCBI icon in the upper left of the NCBI page to be sure), and choose "Protein" from the search drop-down databases menu. Type "Zea mays alpha amylase" in the search box. These selections are illustrated in Figure 1. Click "Search" to proceed.
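The tutorial performs this search through the web form. For readers curious how the same query might be scripted, here is a minimal Python sketch that only builds the request URL for NCBI's E-utilities "esearch" service and does not contact the network. The helper name build_esearch_url and the default of 20 results are assumptions made for this illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build (but do not send) an esearch URL for the given database and term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Same choices as in the web form: the Protein database and the corn query.
print(build_esearch_url("protein", "Zea mays alpha amylase"))
```

Pasting the printed URL into a browser should return a list of matching record identifiers rather than the formatted results page shown in Figure 2.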

Figure 1. NCBI home page from which Entrez searches can be performed.

Search results: Fig. 2 shows a typical results page for this search. Yours should look similar, but might be a little different depending on what new information has arrived since the screen shot was made. The sequence of interest has the accession number (identifier) AAA50161. It is highlighted in the screen shot. How do you know this is the one you want? Click on the accession number and study the page that comes up. It should be identical to the one shown in Fig. 3.

Figure 2. Typical search results for protein sequences.

Figure 3. Typical record for an accession number.

In Figure 3, take note of the DEFINITION, SOURCE and ORGANISM, the AUTHORS of the sequence, and the TITLE and JOURNAL name of the article published about it. If you don't already have this article, you can retrieve it simply by clicking on the PUBMED number (in the live window) and printing the PDF version. Then find your way back to the results page.

Skip down through the FEATURES and note the ORIGIN section, which gives you the amino acid sequence of your protein. This is the sequence we'll use in a BLAST search, but the default format is not particularly helpful. All further processing of the sequence information requires that the sequence be in FASTA format.

FASTA Format: Conversion of the sequence to a universal format

Scroll to the top of your results page and note the Display drop-down box with "GenPept" selected. The GenPept format is the default setting and gives you all of the information we discussed above. However, the FASTA format is more useful for BLAST searches and alignments of sequences. Select FASTA from the menu as illustrated in Fig. 4. Your results should look like the screen shot in Fig. 5. You now see less information: just the accession number followed by a brief descriptor, and the amino acid sequence preceded by some identifying information.

Figure 4. Click FASTA to convert the sequence to proper format for further searching.

Figure 5. FASTA conversion results.

In the live window, highlight and copy the complete amino acid sequence along with the identifying information (>gi ). From your Start menu, bring up NotePad and paste the FASTA sequence into the window. You will use this sequence in a BLAST search to identify other amino acid sequences in the NCBI databases with similarity to your sequence. Note that many of the relevant analysis tools that can use this sequence information are linked down the right side of the NCBI page. Once you are comfortable using these tools, you can work more efficiently. Minimize NotePad to return to the NCBI website.
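To make the structure of the pasted text concrete, here is a minimal Python sketch of reading FASTA-formatted text: each record starts with a ">" line of identifying information, followed by one or more lines of sequence. The function name parse_fasta and both example records are invented for illustration; they are not real amylase sequences.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each header line (without the leading '>') to its full sequence.

    Blank lines are ignored, and sequence lines belonging to one record
    are joined into a single string.
    """
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Two made-up records separated by one blank line, as you would keep them
# in your NotePad file.
example = """>seq1 hypothetical example protein
MKNTSSLCLL
LLVVLCSLTC

>seq2 another hypothetical protein
MGKHHVTLCC
"""

for header, sequence in parse_fasta(example).items():
    print(header, "| length:", len(sequence))
```

If a record in your own file comes back with length 0 or is missing, the ">" line was probably lost when copying.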

Protein BLAST Introduction

To access the BLAST page, in your live window, click on the NCBI icon in the upper left of the page (this takes you to the home page). Click on BLAST in the Popular Resources menu. Carefully read through the list of programs available under "Basic BLAST" (Fig. 6) and what they can do for you (Table 1) before proceeding.

Figure 6. Basic BLAST search options.

Selecting a BLAST Program

The "Basic BLAST" menu allows you to do either nucleotide or protein BLAST searches of various types. Because our sequence is a protein sequence, we will do a protein-protein BLAST (blastp). Click on this option.

Table 1. Explanation of BLAST program functions for the rest of us.

BLAST PROGRAM / Further details

nucleotide blast (or blastn)
    Compares a nucleotide query sequence against a nucleotide sequence database.

protein blast (or blastp)
    Compares an amino acid query sequence against a protein sequence database.

blastx
    Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

tblastn
    Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

tblastx
    Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
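The choices in Table 1 come down to two questions: what kind of sequence is the query, and what kind of database is searched? The Python sketch below encodes that decision. The function name and the translate_both flag are assumptions made for this illustration; the flag distinguishes blastn from tblastx, which both compare nucleotide sequences.

```python
def choose_blast_program(query: str, database: str,
                         translate_both: bool = False) -> str:
    """Pick a BLAST program name following Table 1.

    query and database are either "nucleotide" or "protein".
    translate_both only matters for nucleotide-vs-nucleotide searches.
    """
    if query == "protein" and database == "protein":
        return "blastp"
    if query == "nucleotide" and database == "protein":
        return "blastx"   # query is translated in all reading frames
    if query == "protein" and database == "nucleotide":
        return "tblastn"  # database is translated in all reading frames
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translate_both else "blastn"
    raise ValueError("query and database must be 'nucleotide' or 'protein'")


# The corn alpha amylase record is a protein searched against proteins.
print(choose_blast_program("protein", "protein"))
```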

BLAST P Search

Paste your copied FASTA sequence into the text box under "Enter Query Sequence" (Fig. 7). Make sure the "Non-redundant protein sequences (nr)" database is selected in the Database drop-down menu under "Choose Search Set". Click on BLAST. You may see a window indicating your query has been added to the BLAST queue. You might have to wait several seconds for your results, during which time you will see a screen like that in Fig. 8. Be patient; remember that your sequence is being compared to thousands of others!

Figure 7. The BLAST search screen.

Figure 8. The initial screen during a BLAST search.

BLAST P Results Part 1

The blastp results page (Fig. 9) shows around 100 "hits", or other protein sequences showing at least some similarity to corn alpha amylase. The illustration with the red bars is a diagrammatic representation of how your sequence (the top red bar) lines up with other sequences in the database along the primary structure of the protein (from 0 to over 400 amino acids). Note that some of the sequences lack the amino terminus of your corn alpha amylase sequence.

Figure 9. BLAST summary of related sequences. The lines show relative alignment of the hit sequences with the query sequence.

Once your results appear, scroll down past the red diagram and you will see a list of accession numbers and descriptors for sequences in order of decreasing similarity to your sequence (Fig. 10). In fact, the first item in the list is (or should be) your sequence (check the accession number to be sure). The two scores at the right (Ident and E value) indicate the degree of similarity. Both are defined in the glossary of terms in this tutorial. You can click on any of these sequences to go to the GenPept page that describes it.

Figure 10. Descriptions of the 100 sequences most closely related to the query sequence.

For now, scroll down to the "Alignment" section of the results (Fig. 11) to see the actual amino acid sequences aligned against yours. Note the amino acid identities to get a measure of how similar the sequences are. The first should be 100% since it is the identical sequence. As you scroll down through the next several sequences, though, the percent identity should get smaller.

Figure 11. Sequence alignment information for the most closely related protein sequences. Oryza is the genus of rice.
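To see why a lower E value means a more significant hit, it can help to compute one by hand. The Python sketch below uses the standard relationships between the raw score S, the bit score S', and the glossary parameters lambda and K: S' = (lambda x S - ln K) / ln 2, and E = m x n x 2^(-S'), where m and n are the query and database lengths. The numeric values of LAMBDA and K, the query length of 400, and the database size are illustrative assumptions, not values read from an actual BLAST report, and real BLAST also applies length corrections that are ignored here.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_length: int, database_length: int) -> float:
    """Expected number of chance alignments scoring at least this well."""
    return query_length * database_length * 2.0 ** (-bits)


# Illustrative parameter values only (assumed for this sketch).
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    bits = bit_score(raw, LAMBDA, K)
    # Hypothetical search space: a 400-residue query, 10 million residues.
    print(raw, round(bits, 1), e_value(bits, 400, 10_000_000))
```

Each additional bit of score halves the E value, which is why hits near the top of the list have E values so close to zero.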

Your immediate goal using BLASTP is to locate sequences for the animal alpha amylases used in your experiments. Scroll back up slowly through the list of "hits". What species do you see? If it is not clear from the brief description, click on the accession number to get the GenPept descriptions. In fact, what you will probably find are mostly sequences from plants, some bacteria, and a few insects. Click on the "Distance tree of results" button at the top of the list of hits (see Fig. 11) to examine which organisms are represented in the list.

If human and oyster (or other bivalve species [Order Pelecypoda]) salivary alpha amylase are not found in this list of BLAST hits, how else might you find those sequences to compare to corn? To broaden your analysis a bit, also search for the sequence for barley (Hordeum vulgare). Design and carry out a strategy to find these sequences, and once you do, copy the FASTA-formatted sequences to the same NotePad file your other sequence is in. Make sure to leave one blank line between the sequences.

How many species?

For this analysis you may find that using more than just the three species we used in lab would be very helpful in seeing larger patterns when comparing plant and animal amylases. We strongly recommend adding at least one more plant (barley is good). Adding a third plant and a third animal amylase would be even better to help bring out the patterns of similarity and difference in amino acid sequences in plants and animals.

ACCESSION NUMBER CHECK

To facilitate a broader comparison of alpha amylase among plants and animals, you should now have four accession numbers: one each for corn (Zea mays), humans (Homo sapiens), Pacific oyster (Crassostrea gigas), and barley (Hordeum vulgare). There are now sequences for amylase from two other clam genera in the databases (Cerastoderma and Corbicula), which could be used as alternatives to the Pacific oyster. Record those accession numbers below and then check with a lab instructor or TA to make sure that you have appropriate sequences before you proceed.

SPECIES                               ACCESSION NUMBER
corn (Zea mays)                       AAA50161
barley (Hordeum vulgare)
humans (Homo sapiens)
Pacific oyster (Crassostrea gigas)
Other animal:
Other plant:

Clustal Omega: A DNA and Protein Multiple Sequence Alignment Tool

URL:

Introduction

Once you have found all four usable sequences, you will want to align them to see how similar they are. We will use the program Clustal Omega to do such an alignment. Be sure to read the information below that describes Clustal Omega and the underlying basis for sequence comparisons. When you are finished, enter the URL shown above to bring up the site that hosts the Clustal Omega program.

Clustal Omega is a general-purpose global multiple sequence alignment program for DNA or proteins, for use when you want to align 3 or more sequences (for aligning 2 sequences, use a pairwise sequence alignment tool instead). Clustal Omega produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities, and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms. Alignment scores are returned as a Percent Identity Matrix.

Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.

Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions), while local alignments can avoid them by aligning regions between gaps. The alignment is progressive and considers the sequence redundancy. Trees can also be calculated from multiple alignments. The program has some adjustable parameters with reasonable defaults.
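The difference between global and local alignment can be made concrete with a small scoring exercise. The Python sketch below computes the best alignment score for two short sequences in both modes using textbook dynamic programming (Needleman-Wunsch for global, Smith-Waterman for local). It is not Clustal Omega's algorithm; the scoring values (match +1, mismatch -1, gap -2) and the toy sequences are assumptions chosen only to keep the arithmetic easy to follow.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of a and b, global by default, local if asked."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


long_seq, short_seq = "AAAGGGTTT", "GGG"  # toy sequences
print("global:", alignment_score(long_seq, short_seq))
print("local: ", alignment_score(long_seq, short_seq, local=True))
```

The global score is dragged down by the six gap positions needed to span the longer sequence, while the local score reflects only the matching GGG region.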
Submission Form

You will use the default settings for all menus that appear at the top of the Submission Form (Fig. 12), so don't change these. Copy all of your sequences, in FASTA format including their first descriptor line, into the open frame on the Submission Form; make sure to leave one blank line between them (Fig. 12). Clustal Omega will attempt to align these amino acid sequences based on their similarities. Click Submit. Your results might take a few seconds.

Figure 12. The Clustal Omega submission form.

Alignment Results

The first screen you'll see shows the alignments of your sequences (Fig. 13a). It will be helpful to click on Show Colors to more easily see locations of similarity and difference among the sequences based on the chemical nature of the amino acid residues.

COLOR     RESIDUES    PROPERTY
RED       AVFPMILW    Small and hydrophobic (including aromatic, except Y)
BLUE      DE          Acidic
MAGENTA   RK          Basic (except H)
GREEN     STYHCNGQ    Hydroxyl + sulfhydryl + amine + G
GREY      others      Unusual amino/imino acids, etc.

The displayed rows (except the last one, with the consensus symbols *, :, and .) are the aligned amino acid sequences; the last one is an indication of consensus, or which amino acids are conserved across the compared sequences. By default, an alignment will display the following consensus symbols denoting the degree of conservation observed in each column. Conserved means the amino acid is replaced by one having similar chemical properties.

Consensus Symbols:

" * " means that the residues, or nucleotides, in that column are identical in all sequences in the alignment.
" : " means that conserved substitutions have been observed: amino acids having strongly similar properties.
" . " means that semi-conserved substitutions are observed, i.e., amino acids having similar shape but otherwise weakly similar properties.

Click on the Results Summary button at the top of the page. A table is returned that allows you to select multiple summaries of information about the analysis. The one you'll want is the last one, the Percent Identity Matrix; this returns the alignment scores for the pairwise comparisons of the sequences you submitted. The matrix lists the sequences by accession number by row and column (we added the red labels). The score at the intersection of a row and column is the alignment score for that pair. To help you understand the alignment score, review the description below from the Clustal Omega site FAQs.

How are pairwise alignment scores calculated?

A pairwise score is calculated for every pair of sequences that are to be aligned. These scores are presented as a matrix in the results. Pairwise scores are calculated as the number of identities (same amino acid residue) in the best alignment divided by the number of residues compared (gap positions are excluded). Thus, they tell us approximately what percentage of the two sequences has functional identity, or similarity.
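The pairwise score described in the FAQ is simple enough to compute by hand. The Python sketch below does so for two rows of an alignment, skipping gap positions, and also marks the fully identical columns with "*". The function names and the three toy rows are invented for this illustration; the ":" and "." symbols are not reproduced because they depend on amino acid similarity groups that this sketch does not encode.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identities divided by residues compared; gap positions are excluded."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # a gap in either row is not a compared position
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_marks(rows: list[str]) -> str:
    """Return '*' under columns where every row has the same residue."""
    marks = []
    for column in zip(*rows):
        first = column[0]
        same = first != "-" and all(residue == first for residue in column)
        marks.append("*" if same else " ")
    return "".join(marks)


# Toy alignment rows (not real amylase sequences); '-' marks a gap.
rows = ["MKT-LLV", "MRTALLV", "MKTAL-V"]
for row in rows:
    print(row)
print(identity_marks(rows))
print(round(percent_identity(rows[0], rows[1]), 1))
```

For the first two rows, six positions are compared and five are identical, which is the kind of value you will find at the intersection of a row and column in the Percent Identity Matrix.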

Figure 13. (A) A portion of a multiple sequence alignment. The number at the end of the row indicates the amino acid number in the last position of that row relative to the entire molecule. (B) Results Summary options. (C) Matrix of alignment scores.

Be sure to copy the alignments output and matrix scores results to your NotePad file. Look through the entire sequence for areas of similarity. How much is there? Can you guess why clam/oyster and human sequences did not appear in the BLAST search with corn alpha amylase? Compare each pair of sequences to see which ones are most similar. You might need to re-run Clustal Omega with the different pairs to determine this most efficiently.

Are there any areas of the sequence that you expect to be more similar between species than others (e.g., the active site)? If you don't know where the important functional domains are, you should run a search of the literature in PubMed to find out. Simply click on the NCBI icon on the active web page and choose PubMed.

Protein Structures

Conserved Domain Database (CDD)

Since you found that there are few similarities in the amino acid sequences for alpha amylase in the three organisms, how do we account for them being functionally similar? We need to take one more step and examine the three-dimensional structure of the enzymes. You can use tools on the NCBI website for this as well. Open the NCBI main page (Fig. 14). Click on Domains and Structure on the left-hand menu bar, and then select Conserved Domain Database (CDD) under the resource tab.

Figure 14. NCBI website homepage.

On the CDD database page, click on "Search Methods" (Fig. 15).

Figure 15. Conserved Domain Database entry page.

Type (or paste) the accession number for human salivary alpha amylase into the big center search window (Fig. 16). Select the CDD database from the pull-down menu. Click on the SUBMIT button.

Figure 16. Conserved domain query submission page.

The results window should confirm that this sequence is for alpha amylase. Click on SEARCH FOR SIMILAR DOMAIN ARCHITECTURE (Fig. 17).

Figure 17. Results page from CDD query.

Note that the graphic identifies the active, catalytic, and calcium-binding site regions. Select the pfam00128 accession number to continue. In the window displaying the results for the pfam00128 group, expand the "[+] Structure" menu, which is collapsed by default. Then click on Structure View (Fig. 18a). If you are using your own computer, click on Download Cn3D to install the viewing program and follow your platform's usual instructions for program installation. On Bates laptops, the program should open the structure file automatically.

Figure 18 (A, B). Accessing the Cn3D display program.

The Cn3D application will open, enabling you to see the structure of your protein (Fig. 19). You can rotate the 3-D structure by dragging it with your mouse. The catalytic active region is shown in red.

Figure 19. 3D rendering of the human salivary amylase molecule.

The color key matches the amino acid sequence information (Fig. 20) in the window that appears below the 3-D representation of your protein. The first row is the query sequence. If you select a portion of the sequence by dragging the mouse, it will be highlighted in yellow on the model. The same works for individual residues.

Figure 20. Amino acid sequences of pfam00128 amylases. The first row is the query sequence.

Change the display format of Cn3D by selecting Style > Rendering Shortcuts > Worms (Fig. 21). Now you should be able to rotate the structure to clearly see the α/β barrel site in the center.

Figure 21. Commands to change the rendering style of the 3d model.

Protein Structures: Comparisons

Now that you know what the catalytic site looks like, you can search for the 3d structure of the enzymes used in this lab and see how they compare.

1. Close the CDD windows and return to the main NCBI website by clicking the NCBI logo in the upper left corner.
2. Click on STRUCTURE at the top of the page.
3. At the Structure Search Entrez, enter Human Salivary Amylase (1SMD) and click GO.
4. Click on VIEW 3D STRUCTURE.
5. Rotate the model of the enzyme. Can you see the characteristic catalytic site? This view does not show the catalytic site in red, but you can highlight a section of the sequence in the lower window, and it will also be highlighted on the model.
6. Minimize the 3D model, and go back one page. Unfortunately, there are no structure models for either corn or clams in the database, but there is one for barley. Before viewing the structure of the barley enzyme, return to the Clustal Omega page and compare the barley and corn sequences to determine if this substitute is valid. Enter barley alpha amylase (1RPK) and click GO. Click on VIEW 3D STRUCTURE.
7. Rotate the model of the enzyme. Can you see the characteristic catalytic site? Maximize the window with the human enzyme model and compare the two side by side.

Comparing Structures with VAST (now this IS cool!!)

While Cn3D does fine with single structures, it's even better suited to displaying structure alignments of multiple proteins, i.e., it enables you to superimpose 3-D structures on top of each other so that differences in structure are readily apparent. NCBI creates and maintains a database of such alignments, called VAST (Vector Alignment Search Tool), for all pairs of proteins from MMDB whose structures have some similar core regions. The VAST tool does two things for each related pair: it calculates an optimal 3-D superimposition for the conserved core, and it constructs a sequence alignment based on the correlation of the 3-D structures.

1. From the NCBI home page, choose the Structure database.
2. Search for human salivary alpha amylase. Somewhere on the hit list should be one with a PDB ID = 1SMD.
3. When you select 1SMD, you should get the Structure Summary page. To see the 3-D structure, click on the view structure button.
4. To compare this structure with other molecules, click the VAST+ button on the right. You now have a list of similar structures. Find the structure for barley alpha amylase (1AMY). Hint: enter 1AMY for the PDB ID and click the Search within Results button.
5. Expand the entry by clicking on the + to the left of 1AMY. Click on the 3-D view button to display the aligned 3-D structures.

6. The default coloring for structure alignments in Cn3D uses magenta and blue for the regions aligned by the VAST algorithm, where residues aligned in 3-D space are magenta, and different residues are blue; unaligned regions are colored gray. Note that because of the way VAST works, the aligned regions tend to correspond to individual or groups of consecutive secondary structure elements (helices and strands), while the loops outside the core vary in length and orientation and are often left unaligned.
7. There are some important differences between structure-based alignments in Cn3D and sequence alignments from common algorithms like BLAST or Clustal Omega, both in the display and the underlying alignment data. In a structure alignment (e.g., from VAST), one residue is aligned with another because their alpha carbons are nearby in space, not because of the residue identity.
8. Try aligning a molecule that is very similar to human alpha amylase: porcine alpha amylase. Search for the PDB ID = 1PIF instead of the barley structure.
9. Alteromonas haloplanctis, the cold-adapted marine organism that Feller et al. wrote about, is in the VAST results too. Search for PDB ID = 1AQH.
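The idea in step 7, that residues are paired because their alpha carbons lie close together after superimposition, can be illustrated with a toy calculation. The Python sketch below is not the VAST algorithm: it assumes the two structures are already superimposed, uses made-up coordinates, applies a hypothetical 3.0 Å cutoff, and simply pairs each residue of one chain with its nearest neighbor in the other chain without enforcing a one-to-one match.

```python
import math

Coordinate = tuple[float, float, float]


def pair_by_proximity(ca_a: list[Coordinate], ca_b: list[Coordinate],
                      cutoff: float = 3.0) -> list[tuple[int, int, float]]:
    """Pair residue indices whose alpha carbons lie within the cutoff.

    Returns (index_in_a, index_in_b, distance) for the closest partner of
    each residue in chain A, if any partner is within the cutoff.
    """
    pairs = []
    for i, point_a in enumerate(ca_a):
        best_j, best_distance = None, cutoff
        for j, point_b in enumerate(ca_b):
            distance = math.dist(point_a, point_b)
            if distance <= best_distance:
                best_j, best_distance = j, distance
        if best_j is not None:
            pairs.append((i, best_j, round(best_distance, 2)))
    return pairs


# Made-up alpha carbon coordinates (in angstroms) for two short chains.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_b = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (20.0, 0.0, 0.0)]

for i, j, distance in pair_by_proximity(chain_a, chain_b):
    print(f"residue {i} of A pairs with residue {j} of B ({distance} A apart)")
```

Notice that the residue letters never enter the calculation; the third residue of chain A is left unaligned purely because nothing in chain B sits nearby, much like the gray loops in the Cn3D display.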


23 Glossary Alignment The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Algorithm A fixed procedure embodied in a computer program. Bioinformatics Bit score BLAST BLOSUM Conservation Domain DUST The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches. Basic Local Alignment Search Tool. (Altschul et al.) A sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search. For additional details, see one of the BLAST tutorials (Query or BLAST) or the narrative guide to BLAST. Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over weighting closely related family members. 
(Henikoff and Henikoff) Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico chemical properties of the original residue. A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function. Page 23 of 26 Bioinformatics Tutorial (rev )

24 E value FASTA Filtering Gap A program for filtering low complexity regions from nucleic acid sequences. Expectation value. The number of different alignents with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k tup" variable which specifies the size of a "word". (Pearson and Lipman) Also known as Masking. The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST. A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acid is also penalized in the scoring of an alignment. Global Alignment The alignment of two nucleic acid or protein sequences over their entire length. H Homology HSP Identity K H is the relative entropy of the target and background residue frequencies. (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished by chance, whereas at lower H values, a longer alignment may be necessary. 
(Altschul, 1991) Similarity attributed to descent from a common ancestor. High scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search. The extent to which two (nucleotide or amino acid) sequences are invariant. Page 24 of 26 Bioinformatics Tutorial (rev )

K
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for search space size. The value K is used in converting a raw score (S) to a bit score (S').

Lambda
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').

Local Alignment
The alignment of some portion of two nucleic acid or protein sequences.

Low Complexity Region (LCR)
Regions of biased composition including homopolymeric runs, short period repeats, and more subtle overrepresentation of one or a few residues. The SEG program is used to mask or filter LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid queries.

Masking
Also known as Filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.

Motif
A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains.

Multiple Sequence Alignment
An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs.

Optimal Alignment
An alignment of two sequences with the highest possible score.

Orthologous
Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

P value
The probability of an alignment occurring by chance with the score in question or better. The P value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.

PAM = Percent Accepted Mutation
A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution which will change, on average, 1% of amino acids in a protein sequence. A PAM(x) substitution matrix is a look up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.

Paralogous
Homologous sequences within a single species that arose by gene duplication.

Profile
A table that lists the frequencies of each amino acid in each position of a protein sequence. Frequencies are calculated from multiple alignments of sequences containing a domain of interest. See also PSSM.

Proteomics
The systematic analysis of protein expression in normal and diseased tissues that involves the separation, identification, and characterization of all of the proteins in an organism.

PSI BLAST
Position Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search, which is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile. Details can be found in this discussion of PSI BLAST. (Altschul et al.)

PSSM = Position specific scoring matrix
The PSSM gives the log odds score for finding a particular matching amino acid in a target sequence.

Query
The input sequence (or other type of search term) with which all of the entries in a database are to be compared.

VAST
Vector Alignment Search Tool. A tool that enables superimposition of multiple 3D structures. The VAST tool does two things for each related pair: it calculates an optimal 3D superimposition for the conserved core, and constructs a sequence alignment based on the correlation of the 3D structures.
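
The K, Lambda, E value, and P value entries in this glossary are linked by the standard Karlin and Altschul relations: S' = (lambda × S − ln K) / ln 2, E = m × n × 2^(−S'), and P = 1 − e^(−E), where m and n are the query and database lengths. The Python below is a minimal sketch of those relations, not BLAST's own code. The function names, the example lengths, and the lambda and K values are assumptions chosen for illustration, and the sketch ignores the effective length corrections that BLAST applies to m and n.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' using lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`.

    Assumption: plain sequence lengths are used as the search space;
    BLAST itself uses adjusted (effective) lengths.
    """
    return query_length * database_length * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance alignment, given E value `e`."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for very small e


if __name__ == "__main__":
    # Illustrative values only; they do not come from the tutorial.
    lam, k = 0.267, 0.041
    s_bits = bit_score(raw_score=120, lam=lam, k=k)
    e = e_value(s_bits, query_length=500, database_length=50_000_000)
    print(f"bit score = {s_bits:.1f}")
    print(f"E value   = {e:.3g}")
    print(f"P value   = {p_value(e):.3g}")
```

For small E values the P value is almost equal to the E value, which is why the two are described as different ways of representing the same significance.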