Cross-references to the corresponding SWISS-PROT entries as well as to matched sequences from the PDB 3D-structure database 2 are also provided.

Size: px
Start display at page:

Download "Cross-references to the corresponding SWISS-PROT entries as well as to matched sequences from the PDB 3D-structure database 2 are also provided."

Transcription

1 Amos Bairoch and Philipp Bucher are group leaders at the Swiss Institute of Bioinformatics (SIB), whose mission is to promote research, development of software tools and databases as well as to provide education, training and service activities within the field of bioinformatics. Amos Bairoch has developed the SWISS-PROT protein and the PROSITE motif databases, whereas Philipp Bucher has developed the generalised profiles used in PROSITE. Christian J. A. Sigrist, Lorenzo Cerutti, Nicolas Hulo, Laurent Falquet and Marco Pagni are researchers working at the SIB. Alexandre Gattiker is a PhD student doing a thesis in the field of bioinformatics. Keywords: pattern, regular expression, profile, weight matrix, database Christian J. A. Sigrist, Swiss Institute of Bioinformatics (SIB), CMU, University of Geneva, 1 rue Michel Servet, CH-1211 Geneva 4, Switzerland Tel: Fax: christian.sigrist@isb-sib.ch PROSITE: A documented database using patterns and profiles as motif descriptors Christian J. A. Sigrist, Lorenzo Cerutti, Nicolas Hulo, Alexandre Gattiker, Laurent Falquet, Marco Pagni, Amos Bairoch and Philipp Bucher Date received (in revised form): 31st May 2002 Abstract Among the various databases dedicated to the identification of protein families and domains, PROSITE is the first one created and has continuously evolved since. PROSITE currently consists of a large collection of biologically meaningful motifs that are described as patterns or profiles, and linked to documentation briefly describing the protein family or domain they are designed to detect. The close relationship of PROSITE with the SWISS-PROT protein database allows the evaluation of the sensitivity and specificity of the PROSITE motifs and their periodic reviewing. In return, PROSITE is used to help annotate SWISS-PROT entries. The main characteristics and the techniques of family and domain identification used by PROSITE are reviewed in this paper. INTRODUCTION PROSITE is an annotated collection of motif descriptors dedicated to the identification of protein families and domains. The motif descriptors used in PROSITE are either patterns or profiles, which are derived from multiple alignments of homologous sequences. This gives to these motif descriptors the notable advantage of identifying distant relationships between sequences that would have passed unnoticed based solely on pairwise sequence alignment. Patterns and profiles have both their own strengths and weaknesses, which define their area of optimum application. The core of the PROSITE database is composed of two text files: PROSITE.DAT is a computerreadable file that contains all the information necessary to programs that make use of PROSITE to scan sequence(s) for the occurrence of patterns or profiles. This file includes, for each of the entry described, statistics on the number of hits obtained while scanning the SWISS-PROT protein database 1 for a pattern or profile. Cross-references to the corresponding SWISS-PROT entries as well as to matched sequences from the PDB 3D-structure database 2 are also provided. PROSITE.DOC contains textual information that fully documents each pattern or profile. Release of PROSITE (August 4, 2002) contains 1147 documentation entries that describe 1567 different motif descriptors. In addition to these entries, a collection of 152 pre-release profiles (see below) is also available. 3 PROSITE PATTERNS In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by pairwise sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs, typically around 10 to 20 amino acids in length, arise & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER

2 Sigrist et al. Patterns are regular expressions matching short sequence motifs usually of biological meaning because specific residues and regions thought or proved to be important to the biological function of a group of proteins are conserved in both structure and sequence during evolution. These biologically significant regions or residues are generally: Patterns are qualitative motif descriptors Enzyme catalytic sites (Figure 1a). Prostethic group attachment sites (heme, pyridoxal-phosphate, biotin, etc.). Amino acids involved in binding a metal ion. Cysteines involved in disulphide bonds. Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein. As the sequence of biologically meaningful motifs is evolutionarily conserved, a multiple alignment of them can be reduced to a consensus expression called a regular expression or pattern. Each position of such a pattern can be occupied by any residue from a specified set of acceptable residues, and in addition can be repeated a variable number of times within a specified range. At strictly conserved positions only one particular amino acid is accepted, whereas at other positions several amino acids with similar physicochemical properties can be accepted. It is also possible to define which amino acid(s) is(are) incompatible with a given position, and conserved residues can be separated by gaps of variable lengths. Finally, the pattern syntax provides features to anchor a pattern either at the beginning or at the end of a sequence (Figure 1b). The complete syntax of a PROSITE pattern is available at tools/scanprosite/scanprosite-doc.html. A regular expression is qualitative; it either does match or does not. There is no threshold above which we consider Figure 1: (a) The ScanProsite tool allows the visualisation of PROSITE motifs on 3D structures. The structure shown (1A46) is that of the serine proteases, trypsin domain (PS50240) of the human prothrombin (P00734). The active signatures are shown in dark and the residues involved in the catalytic mechanism as black balls: I, corresponds to the histidine (H) active site pattern (PS00134); II, to a user defined pattern centred on the aspartate (D) active site; and III, to the serine (S) active site pattern (PS00135). The close proximity between the residues involved in the catalytic mechanism is clearly visible. (b) A hypothetical pattern restricted at the N- terminal of a sequence (,) and translated as Met-any residue-gly-any residue-any residue-any residue-[ile or Val]-[Ile or Val]- any residue-any residue-{any residue but Phe or Trp or Tyr}. If any mismatch occurs at one of these positions, the pattern will not match the match as statistically significant. However, it is possible to evaluate the accuracy of PROSITE patterns thanks to the statistics on the number of hits obtained while scanning the SWISS- PROT database 1 or by scanning randomised databases (see below). Finally, it should be noticed that some 266 & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER 2002

3 PROSITE Profiles are more sensitive than patterns Profiles usually correspond to protein domains Profiles are quantitative motif descriptors families or domains are defined not just by one pattern but by the co-occurrence of two or more patterns of low specificity. The presence of just one of these patterns is not sufficient to assign a protein to a particular family and/or domain. However, the simultaneous occurrence of linked patterns gives good confidence that the matched protein belongs to the set being considered (see for example the serine protease signatures PS00134 and PS00135, Figure 1a). The advantages of patterns are their easy intelligibility for the user and the fact that patterns are directed against the most conserved residues. As these residues often are the more relevant for the biological function of the protein family or domain, further research can concentrate on them. Another advantage of patterns is that the scan of a protein database with patterns can be performed in reasonable time on any computer. PROSITE PROFILES Although patterns largely proved their usefulness, they also have intrinsic limitations in identifying distant homologues as they do not accept any mismatch. Typical examples of important functional domains that contain only a few very well-conserved sequence positions are the globin, the immunoglobulin, and the SH2 and SH3 domains. The enhanced sensitivity of generalised profiles (or weight matrices) allows the detection of such poorly conserved domains or families. Another advantage of profiles over patterns is that they characterise protein domains over their entire length, not just the most conserved parts of it. This advantage is currently used to define automatically the limits of particular domains in SWISS- PROT entries in order to improve the consistency of the annotation. The increased discriminatory power of profiles is due to intrinsic capabilities of the profile descriptor as well as to the sophistication of the profile construction methods. Profiles are quantitative motif descriptors providing numerical weights for each possible match or mismatch between a sequence residue and a profile position. A mismatch at a highly conserved position can thus be accepted provided that the rest of the sequence displays a sufficiently high level of similarity. The automatic procedure used for deriving profiles from multiple alignments is capable of assigning appropriate weights to residues that have not yet been observed at a given alignment position, making for this purpose use of prior knowledge about amino acid substitutability contained in a substitution matrix. In contrast, the procedure by which patterns are usually developed does not allow rational guesses as to which not yet observed residues might be observed in the future. Despite their obvious advantages, profiles are not superior to patterns for all purposes. In fact the two types of descriptors have complementary qualities. Patterns confined to small regions with high sequence similarity are often powerful predictors of protein functions such as enzymatic activities. Profiles covering complete domains are more suitable for predicting protein structural properties. Hence, a profile (PS50240) is able to detect the structural relationship of the non-enzymatic haptoglobin with the trypsin family of serine proteases (Figure 1a), even though the positions corresponding to the proteolytic active site residues of the proteases are occupied by different amino acids in haptoglobin and, as a consequence, no longer detected by the corresponding patterns (PS00134 and PS00135). 4 The generalised profiles 5,6 used in PROSITE are an extension of the sequence profiles introduced by Gribskov and coworkers. 7 They are sequence-like linear structures consisting of alternating match and insert positions. A match position corresponds to a domain position, which is typically occupied by a single amino acid. It provides weights for each residue type occupying this position plus a deletion extension penalty. Insert positions contain weights for insertions & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER

4 Sigrist et al. Patterns and profiles are built from multiple sequence alignments relative to the domain model defined by the sequence of match positions. In addition they provide parameters for the opening and closing of a deletion gap, as well as for the initiation and termination of a partial alignment to the profile. The numerical weights of a profile serve to define a quality score for a profile-sequence alignment. A sequence region that can be aligned to a profile with a score higher than a threshold score is considered a match. Searching for multiple occurrences of a particular domain within the same sequence requires the execution of a dynamic programming algorithm that finds a maximal set of high-scoring profilesequence alignments above a threshold score. Different alignment modes, such as global or local, are defined by profileintrinsic parameters. 6 The profile format used in PROSITE comprises fields for socalled accessory parameters which define the search method to be used for a particular domain. They allow specification of appropriate cut-off values, different score normalisation modes, and instructions as to how to treat partly overlapping matches. Construction of PROSITE profiles Generalised profiles were not exclusively designed for characterising protein domains and can in principle be generated by many different methods. Most of the current profiles in PROSITE were generated by a standard automatic procedure implemented in the program pfmake of the PFTOOLS package. This method is based on Gribskov s original method 8 with modifications published later. 9,10 The development of a profile for a protein domain logically involves several steps as shown in Figure 2. The first and perhaps most critical part is the generation of a good multiple alignment of domains extracted from complete sequences. Whenever possible, we try to use correct alignments based on available 3D structures. The next step consists of attributing weights to individual sequences of the multiple alignment in order to eliminate bias due to over-representation of subfamilies. The program pfw from the PFTOOLS package computes Voronoi weights. 11 Several other methods have been proposed for this purpose (see Durbin et al. 12 ) and apparently perform about equally well. The weighted multiple alignment is then converted into a socalled unscaled profile (see below) with the aid of the program pfmake. This process involves as an intermediate step, the generation of a frequency profile, which is in fact a data structure equivalent to a hidden Markov model (HMM). 13,14 For this purpose, each column of the multiple alignment is mapped to either a match or an insert position of the profile, according to the number of gap characters it contains. By default, columns containing less than 50 per cent gap characters are kept as match positions, the others are assigned to insert positions (see supplemental data available online). The last step in the profile construction process consists of converting the frequency profile (HMM) into a searchable scoring profile equivalent to a profile-hmm. 15 The amino acid frequencies of the match positions are transformed into weights using Gribskov s original formula: " M ij ¼ X # s jj9 f ij j9 where M ij is the match weight for residue j at profile position i, s jj9 the substitution score for residue pair jj9 according to the substitution matrix, and f ij the weighted frequency of residue j at profile position i. The substitution scores are usually taken from a PAM 16 or BLOSUM 17 matrix. The resulting numbers represent the weighted average of the substitution scores of the residue in the query sequence compared to the residues observed in the corresponding column of the multiple alignment. According to tests performed by two groups, 9,10 the BLOSUM45 matrix produces profiles 268 & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER 2002

5 PROSITE Figure 2: Construction of a PROSITE profile entry. The individual steps of the process are described in the text The calibration step adjusts the score to a scale common to all predictors with high sensitivity and selectivity. The so-called transition scores, ie the weights applied to insertions and deletions, are also calculated by a recipe inspired by Gribskov s method. Briefly, the penalty for a gap in the sequence or in the profile is very high at profile positions where no gaps have been observed before. However, it is reduced at positions where gaps do occur in the corresponding multiple alignment in a manner that depends only on the lengths of the gaps observed but not on their frequency. The precise method for scoring gaps used by the program pfmake is explained in the supplemental data available on-line. Profile calibration The method described above leads to an unscaled profile, which assigns a socalled raw score to a potential match. Raw scores are based on arbitrary units and do not lend themselves to a useful biological interpretation. However, the generalised profile format used in PROSITE allows the definition of various mathematical functions to convert raw scores into more sensible normalised scores. When a large protein database is searched for domains with a profile, one is interested in the question whether a match with a given score is likely to occur by chance or not. To estimate this & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER

6 Sigrist et al. Generalised profiles are roughly equivalent to profile-hmms probability the raw scores in PROSITE profiles are converted into so-called log 10 per residue E-values, allowing the computation of the number of expected matches with equal or higher score in a database of a given size. For instance a match with a normalised score of 9.0 or higher is expected to occur about once in a database of one billion residues. The normalised score is related to the raw score by a linear function whose parameters are equivalent to K and º of the BLAST score statistics. 18 In the jargon of PROSITE, a profile containing a normalisation function that defines log 10 per residue E-values is called a scaled profile. An empirical method is used for scaling PROSITE profiles. In the first step of the calibration process, a randomised protein database is searched with the unscaled profile in order to collect high-scoring matches. The cumulative distribution of the 2000 highest scores is then fitted to an extreme value distribution in order to estimate the parameters of the normalisation function. A more detailed description of the numerical recipe can be found at motif_score.shtml (see also Hofmann and Bucher 19 ). Three types of randomised databases are commonly used for this purpose, each one preserving different properties of real protein sequences such as their length distribution, composition, subfamily relationships and internal repetitiveness, making them more or less suited for a particular type of profile: reversed: created by taking the reverse sequence of each individual entry; window20: created by local shuffling of each individual sequence entry using a window width of 20 residues; db_global: created by global shuffling of each individual sequence entry. The reversed database is used in most cases but it is not very appropriate for profiles containing regularly spaced repeated residues such as cysteines in zinc fingers or hydrophobic amino acids in helix loop helix domains, because these features are conserved in the reversed database. One of the other two databases may be used in such cases. A normalised score of 8.5 is typically defined as the default cut-off value in PROSITE profiles. However, in some instances this threshold is not appropriate. There are profiles producing statistically significant matches to members of structurally related protein families, for which a higher cut-off value is indicated. Conversely, for short structural repeats it is sometimes necessary to choose a lower cut-off value to reach satisfactory sensitivity. Such decisions are always based on the match lists for SWISS- PROT presented as quality control information in each entry (see below). The PROSITE profile format allows specification of multiple cut-off levels. The default level zero is used for the classification of matches as true and false positives and negatives, respectively. Usually, a second low cut-off level with a normalised threshold score of 6.5 is defined for weak matches, which must be interpreted with caution. Nevertheless, they can be very useful for gene discovery and the detection of remote homologues. Additional cut-off levels may be added in order to distinguish subfamilies of proteins or domains or to improve the detection of repeats (see below). Generalised profiles vs. profile- HMMs Generalised profiles and profile-hmms are two widely used methods to model protein domains. The generalised profiles used in PROSITE 5 are an extension of the profiles first described by Gribskov et al., 7 while profile-hmms are a particular case of a class of the HMM probabilistic models. 13,14 Although the two models result from a different historical background, their equivalence has been demonstrated. 6 A simple introduction to the HMM probabilistic models can be found in 270 & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER 2002

7 PROSITE The PROSITE and SWISS-PROT databases reciprocally check their quality Eddy. 15,20 The important point of profile- HMMs is that they are finite models that describes a probability distribution over an infinite number of possible sequences. A great advantage of profile-hmms on generalised profiles is that they are formally built on the probability theory (reviewed by Rabiner 21 ). The counterpart is that this theory restricts the flexibility of the models because the sum of the probability distribution over all modelled sequences must by definition equal 1. In other words, in a profile-hmm the probability of one sequence cannot be increased without decreasing the probability of another sequence, which makes the manual editing of the model very difficult. Generalised profiles do not suffer from this restriction, making them very flexible and manipulable in a text editor. For example, it is possible to modify scores in a generalised profile to avoid or force a specific residue or family of residues in a specific position, which may be a way to discriminate two subfamilies of sequence motifs. Modification of the constraints on initiation and termination profile scores provides an easy way to change the behaviour of the model (global, local, semiglobal) and its anchoring to the sequence (no-, left-, right-anchoring). 6 Although generalised profiles and profile-hmms are both very effective in detecting motifs in distantly related sequences, 22,23 generalised profiles may be more interesting for the bioinformatician who wants to modify or add some capacities to the models without changing the base algorithms used for the searches. QUALITY CONTROL OF PROSITE THROUGH SWISS-PROT The accuracy of PROSITE patterns and profiles can be evaluated thanks to the intimate connection of PROSITE with the SWISS-PROT knowledge base. 1 Each time a new motif descriptor is added to PROSITE, it is used to scan SWISS- PROT in order to attribute one of the following match statuses to the SWISS- PROT entries concerned by the motif: True positive: a protein belonging to the set being considered and matched by the motif. False positive: a protein that does not belong to the set being considered but picked up by the motif. False negative: a protein belonging to the set being considered but not detected by the motif. Unknown: a protein that could belong to the set being considered and matched by the motif. Partial: a protein belonging to the set being considered but not detected by the motif (possibly because its sequence is incomplete and the region, which should be detected by the motif, missing). Reciprocally, every new protein entering SWISS-PROT is checked for the occurrence of PROSITE patterns and profiles and a match status for the relevant PROSITE entries is assessed. At every new PROSITE release, these SWISS- PROT match statuses are used to establish statistics for most motifs. These statistics allow the user to evaluate the ability of a motif to detect all or most of the sequences it is designed to describe (sensitivity) as well as its ability to give as few false positive results as possible (specificity). In addition, this process allows motifs to be permanently improved to give a better fit to the increasing number of proteins in SWISS-PROT. Skipping short and degenerate motifs Some PROSITE entries are too short or degenerate to have a biological meaning by their own as they are found in the majority of known protein sequences. These motifs, some of which predict post-translational modification sites (eg & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER

8 Sigrist et al. Some pre-release profiles are not yet in PROSITE, but still available for users Each PROSITE motif descriptor is linked to a document providing biological knowledge N-glycosylation; PS00001), produce matches that are only indicative of a possible function. Independent biological evidence must be considered to confirm the appropriateness of these matches. There is no matchlist and hence statistics provided with these motifs as well as with compositional profiles, which do not characterise biologically defined objects but are directed against sequence regions enriched in a particular amino acid. As these profiles are only defined statistically, it is not possible to speak of true or false matches to these profiles, neither is it possible to assign a false negative status to a sequence. PROSITE motifs belonging to these classes are tagged with the /SKIP-FLAG ¼ TRUE qualifier in their CC lines (for a definition of all the PROSITE lines see PROSITE DOCUMENTATION Each PROSITE motif is linked to a corresponding documentation describing the protein family or domain it detects. The documentation contains a brief description of what is known about this particular protein family or domain: origin of its name, taxonomic occurrence, domain architecture of the proteins, function, 3D structure, main characteristics of the sequence and some references. Recently, for families or domains whose structure is known, a direct link to a representative PDB entry is provided in the documentation, in order to make the description of the 3D structure more comprehensible. All the information providing biological knowledge about a protein family or domain should also be used as an additional quality control for patterns and profiles. If the user has some information about its sequence that makes no sense with the description of the motif detected, the match should be considered with caution. The documentation also contains direct information about the motif descriptors. Hence, for patterns the amino acid residues involved in the catalytic mechanism, metal ion or substrate binding or post-translational modifications are indicated. For profiles, residues are given if they cover the entire domain or protein. Finally, the sensitivity and specificity of the motif are also indicated, as well as an expert to contact if necessary. PRE-RELEASE PROFILES AND DOCUMENTATIONS In addition to PROSITE profiles, there is a collection of pre-release profiles and of their corresponding preliminary documentations (QDOC) that is not integrated in PROSITE but available with InterPro. There can be several reasons why a profile and its documentation are considered as preliminary and not integrated in PROSITE. First, some profiles can be considered as under development and will need to be seriously redefined before their eventual integration in PROSITE. The second class represents profiles and documentation whose quality still needs some improvement in order to reach the PROSITE standards. Finally, the last ones can already be considered as PROSITE profiles. They are just waiting for their integration, which takes some time because of the intimate connection of PROSITE with SWISS-PROT: an integration of a profile not only concerns PROSITE but also implies more or less important changes in SWISS-PROT. FUTURE DEVELOPMENTS Usually it is difficult to detect all repeated copies of a domain in a protein because the most degenerate repeats are generally missed. We plan to improve the detection efficiency of repeats by introducing a new threshold for repeats. Once a repeat above the normal threshold is detected, this new lower threshold would be used to detect the additional degenerate copies of the same repeat. By this way most, if not all, copies of a repeat should be detected. We have already constructed some 272 & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER 2002

9 PROSITE The PROSITE database and programs to use it are available online or for download profiles using structural information and now want to study if this approach improves the sensitivity and specificity of profiles by restricting the position of insertions or deletions to particular position having only little effects on the 3D structure of the protein/domain-like loops. Another project is to improve the automated annotation of SWISS-PROT thanks to PROSITE. Some information contained in PROSITE could help SWISS-PROT annotators in the annotation of domain or catalytic site features. Such an approach would ensure the consistency of SWISS-PROT. HOW TO OBTAIN A LOCAL COPY OF PROSITE A list of servers which distribute PROSITE has been recently published, 3 but please note that though PROSITE is free for academic users, the documentation entries are under copyright regulations. To obtain a licence, commercial users should the Swiss Institute of Bioinformatics: license@isb-sib.ch. HOW TO MAKE USE OF PROSITE Computer programs We provide programs that have been specifically developed to help use PROSITE for both patterns and profiles searches: ps_scan 24, a program used to scan one or several PROSITE motifs against one or several protein sequences. ps_scan is available from ftp://ftp.expasy.org/ databases/prosite/tools/. PFTOOLS, programs used to construct profiles or scan a sequence or a sequence library against a profile or a profile library. PFTOOLS are available from ftp://ftp.expasy.org/databases/ prosite/tools/or ftp://ftp.isrec.isb-sib. ch/sib-isrec/pftools/. Interactive Web access to PROSITE To browse the PROSITE documentation and motif entries, users should go to Web access to PROSITE allows users to benefit from the latest PROSITE updates and from hyperlinks connecting a PROSITE entry to other relevant sources of information. In addition, it has recently been made possible for the user to display the match list of a PROSITE motif as a multiple alignment available in different formats. To scan a sequence for PROSITE motifs, one can make use of the following tools. ScanProsite ScanProsite 24 allows either to scan a protein sequence from SWISS-PROT or provided by the user for the occurrence of PROSITE motifs or to scan the SWISS-PROT, TrEMBL and/or PDB databases for the occurrence of a pattern that can originate from PROSITE or be provided by the user. ScanProsite also allows the user to visualise the position of a PROSITE motif or of his own pattern on the 3D structure (if known) of the matched proteins (Figure 1a). Recently, we added the possibility for the user to evaluate the specificity of a pattern by using it to scan a randomised version of the current SWISS-PROT database. The URL for ScanProsite is scanprosite. ProfileScan ProfileScan allows a protein sequence from SWISS-PROT or provided by the user to be scanned for the occurrence of profiles stored in PROSITE and in the pre-release collection. The new URL for ProfileScan is cgi-bin/pfscan. Acknowledgements PROSITE is supported by grant no from the Swiss National Science Foundation. & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER

10 Sigrist et al. References 1. Bairoch, A. and Apweiler, R. (2000), The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Res., Vol. 28(1), pp Berman, H. M., Westbrook, J., Feng, Z. et al. (2000), The Protein Data Bank, Nucleic Acids Res., Vol. 28(1), pp Falquet, L., Pagni, M., Bucher, P. et al. (2002), The PROSITE database, its status in 2002, Nucleic Acids Res., Vol. 30(1), pp Kurosky, A., Barnett, D. R., Lee, T.-H. et al. (1980), Covalent structure of human haptoglobin: a serine protease homolog, Proc. Natl Acad. Sci. USA, Vol. 77(6), pp Bucher, P. and Bairoch, A. (1994), A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation, in Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, pp Bucher, P., Karplus, K., Moeri, N. and Hofmann, K. (1996), A flexible motif search technique based on generalized profiles, Comput. Chem., Vol. 20(1), pp Gribskov, M., McLachlan, A. D. and Eisenberg, D. (1987), Profile analysis: detection of distantly related proteins, Proc. Natl Acad. Sci. USA, Vol. 84(13), pp Gribskov, M., Lüthy, R. and Eisenberg, D. (1990), Profile analysis, Methods Enzymol., Vol. 183, pp Lüthy, R., Xenarios, I. and Bucher, P. (1994), Improving the sensitivity of the sequence profile method, Protein Sci., Vol. 3(1), pp Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994), Improved sensitivity of profile searches through the use of sequence weights and gap excision, Comput. Appl. Biosci., Vol. 10(1), pp Sibbald, P. R. and Argos, P. (1990), Weighting aligned protein or nucleic acid sequences to correct for unequal representation, J. Mol. Biol., Vol. 216(4), pp Durbin, R., Eddy, S. R., Krogh, A. and Mitchison, G. (1998), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge. 13. Krogh, A., Brown, M., Mian, I. S. et al. (1994), Hidden Markov models in computational biology. Applications to protein modeling, J. Mol. Biol., Vol. 235(5), pp Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M. A. (1994), Hidden Markov models of biological primary sequence information, Proc. Natl Acad. Sci. USA, Vol. 91(3), pp Eddy, S. R. (1996), Hidden Markov models, Curr. Opin. Struct. Biol., Vol. 6(3), pp Dayhoff, M. O., Schwartz, R. M. and Orcutt, B. C. (1978), A model of evolutionary change in proteins, in Dayhoff, M. O., Ed., Atlas of Protein Sequence and Structure, Vol. 5, National Biomedical Research Foundation, Washington, DC, pp Henikoff, S. and Henikoff, J. G. (1992), Amino acid substitution matrices from protein blocks, Proc. Natl Acad. Sci. USA, Vol. 89(22), pp Karlin, S. and Altschul, S. F. (1990), Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Natl Acad. Sci. USA, Vol. 87(6), pp Hofmann, K. and Bucher, P. (1995), The FHA domain: A putative nuclear signalling domain found in protein kinases and transcription factors, Trends Biochem. Sci., Vol. 20(9), pp Eddy, S. R. (1998), Profile hidden Markov models, Bioinformatics, Vol. 14(9), pp Rabiner, L. R. (1989), A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, Vol. 77, pp Hofmann, K. (2000), Sensitive protein comparisons with profiles and hidden Markov models, Brief. Bioinform., Vol. 1(2), pp Karplus, K., Barrett, C. and Hughey, R. (1998), Hidden Markov models for detecting remote protein homologies, Bioinformatics, Vol. 14(10), pp Gattiker, A., Gasteiger, E. and Bairoch, A. (2002), ScanProsite: a reference implementation of a PROSITE scanning tool, Applied Bioinform., Vol. 1(2), pp & HENRY STEWART PUBLICATIONS BRIEFINGS IN BIOINFORMATICS. VOL 3. NO SEPTEMBER 2002

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999 Dr Clare Sansom works part time at Birkbeck College, London, and part time as a freelance computer consultant and science writer At Birkbeck she coordinates an innovative graduate-level Advanced Certificate

More information

Current Motif Discovery Tools and their Limitations

Current Motif Discovery Tools and their Limitations Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.

More information

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

More information

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper

More information

Pairwise Sequence Alignment

Pairwise Sequence Alignment Pairwise Sequence Alignment carolin.kosiol@vetmeduni.ac.at SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What

More information

MASCOT Search Results Interpretation

MASCOT Search Results Interpretation The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually

More information

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web

More information

Guide for Bioinformatics Project Module 3

Guide for Bioinformatics Project Module 3 Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first

More information

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:

More information

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Grid - Enabled Tools For Biologists. Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis

More information

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF

Tutorial for Proteomics Data Submission. Katalin F. Medzihradszky Robert J. Chalkley UCSF Tutorial for Proteomics Data Submission Katalin F. Medzihradszky Robert J. Chalkley UCSF Why Have Guidelines? Large-scale proteomics studies create huge amounts of data. It is impossible/impractical to

More information

Clone Manager. Getting Started

Clone Manager. Getting Started Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software

More information

Consensus alignment server for reliable comparative modeling with distant templates

Consensus alignment server for reliable comparative modeling with distant templates W50 W54 Nucleic Acids Research, 2004, Vol. 32, Web Server issue DOI: 10.1093/nar/gkh456 Consensus alignment server for reliable comparative modeling with distant templates Jahnavi C. Prasad 1, Sandor Vajda

More information

Sequence homology search tools on the world wide web

Sequence homology search tools on the world wide web 44 Sequence Homology Search Tools Sequence homology search tools on the world wide web Ian Holmes Berkeley Drosophila Genome Project, Berkeley, CA email: ihh@fruitfly.org Introduction Sequence homology

More information

Introduction to Bioinformatics 3. DNA editing and contig assembly

Introduction to Bioinformatics 3. DNA editing and contig assembly Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 matthewb@ba.ars.usda.gov

More information

Network Protocol Analysis using Bioinformatics Algorithms

Network Protocol Analysis using Bioinformatics Algorithms Network Protocol Analysis using Bioinformatics Algorithms Marshall A. Beddoe Marshall_Beddoe@McAfee.com ABSTRACT Network protocol analysis is currently performed by hand using only intuition and a protocol

More information

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some

More information

Linear Sequence Analysis. 3-D Structure Analysis

Linear Sequence Analysis. 3-D Structure Analysis Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical properties Molecular weight (MW), isoelectric point (pi), amino acid content, hydropathy (hydrophilic

More information

DBDB : a Disulfide Bridge DataBase for the predictive analysis of cysteine residues involved in disulfide bridges

DBDB : a Disulfide Bridge DataBase for the predictive analysis of cysteine residues involved in disulfide bridges DBDB : a Disulfide Bridge DataBase for the predictive analysis of cysteine residues involved in disulfide bridges Emmanuel Jaspard Gilles Hunault Jean-Michel Richer Laboratoire PMS UMR A 9, Université

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance? Optimization 1 Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance? Where to begin? 2 Sequence Databases Swiss-prot MSDB, NCBI nr dbest Species specific ORFS

More information

Amino Acids and Their Properties

Amino Acids and Their Properties Amino Acids and Their Properties Recap: ss-rrna and mutations Ribosomal RNA (rrna) evolves very slowly Much slower than proteins ss-rrna is typically used So by aligning ss-rrna of one organism with that

More information

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 15: lecture 5 Substitution matrices Multiple sequence alignment A teacher's dilemma To understand... Multiple sequence alignment Substitution matrices Phylogenetic trees You first need

More information

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

BLAST. Anders Gorm Pedersen & Rasmus Wernersson BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

BIOINFORMATICS TUTORIAL

BIOINFORMATICS TUTORIAL Bio 242 BIOINFORMATICS TUTORIAL Bio 242 α Amylase Lab Sequence Sequence Searches: BLAST Sequence Alignment: Clustal Omega 3d Structure & 3d Alignments DO NOT REMOVE FROM LAB. DO NOT WRITE IN THIS DOCUMENT.

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Introduction to Genome Annotation

Introduction to Genome Annotation Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT

More information

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004 Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004 Sequence Analysis Anything connected to identifying higher biological meaning out of raw sequence data. 2 Genomic & Proteomic Data Sequence

More information

Data Mining Techniques and their Applications in Biological Databases

Data Mining Techniques and their Applications in Biological Databases Data mining, a relatively young and interdisciplinary field of computer sciences, is the process of discovering new patterns from large data sets involving application of statistical methods, artificial

More information

Protein Analysis and WEKA Data Mining

Protein Analysis and WEKA Data Mining GenMiner: a Data Mining Tool for Protein Analysis Gerasimos Hatzidamianos 1, Sotiris Diplaris 1,2, Ioannis Athanasiadis 1,2, Pericles A. Mitkas 1,2 1 Department of Electrical and Computer Engineering Aristotle

More information

Sequence Information. Sequence information. Good web sites. Sequence information. Sequence. Sequence

Sequence Information. Sequence information. Good web sites. Sequence information. Sequence. Sequence Sequence information Multiple Pair-wise SRS Entrez Comparisons Database searches Sequence Information Orthologue clusters Sequence Organell localisation Patterns Protein families Membrane attachment Bengt

More information

Discovering Bioinformatics

Discovering Bioinformatics Discovering Bioinformatics Sami Khuri Natascha Khuri Alexander Picker Aidan Budd Sophie Chabanis-Davidson Julia Willingale-Theune English version ELLS European Learning Laboratory for the Life Sciences

More information

Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6

Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 Introduction to Bioinformatics AS 250.265 Laboratory Assignment 6 In the last lab, you learned how to perform basic multiple sequence alignments. While useful in themselves for determining conserved residues

More information

MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788

MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 ISSN 0340-6253 Three distances for rapid similarity analysis of DNA sequences Wei Chen,

More information

Bio-Informatics Lectures. A Short Introduction

Bio-Informatics Lectures. A Short Introduction Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively

More information

GenBank, Entrez, & FASTA

GenBank, Entrez, & FASTA GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,

More information

Genome Explorer For Comparative Genome Analysis

Genome Explorer For Comparative Genome Analysis Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence

More information

ProteinQuest user guide

ProteinQuest user guide ProteinQuest user guide 1. Introduction... 3 1.1 With ProteinQuest you can... 3 1.2 ProteinQuest basic version 4 1.3 ProteinQuest extended version... 5 2. ProteinQuest dictionaries... 6 3. Directions for

More information

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want 1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very

More information

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,

More information

Search Engine Optimization with Jahia

Search Engine Optimization with Jahia Search Engine Optimization with Jahia Thomas Messerli 12 Octobre 2009 Copyright 2009 by Graduate Institute Table of Contents 1. Executive Summary...3 2. About Search Engine Optimization...4 3. Optimizing

More information

Protein annotation and modelling servers at University College London

Protein annotation and modelling servers at University College London Nucleic Acids Research Advance Access published May 27, 2010 Nucleic Acids Research, 2010, 1 6 doi:10.1093/nar/gkq427 Protein annotation and modelling servers at University College London D. W. A. Buchan*,

More information

Data Integration via Constrained Clustering: An Application to Enzyme Clustering

Data Integration via Constrained Clustering: An Application to Enzyme Clustering Data Integration via Constrained Clustering: An Application to Enzyme Clustering Elisa Boari de Lima Raquel Cardoso de Melo Minardi Wagner Meira Jr. Mohammed Javeed Zaki Abstract When multiple data sources

More information

Department of Biochemistry, University of British Columbia, Vancouver, B.C., Canada V6T 1W5

Department of Biochemistry, University of British Columbia, Vancouver, B.C., Canada V6T 1W5 volume 10 Number 11982 Nucleic Acids Research A DNA sequence handling program A.D.Delaney Department of Biochemistry, University of British Columbia, Vancouver, B.C., Canada V6T 1W5 Received 10 November

More information

Module 1. Sequence Formats and Retrieval. Charles Steward

Module 1. Sequence Formats and Retrieval. Charles Steward The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.

More information

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0

More information

T cell Epitope Prediction

T cell Epitope Prediction Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments

More information

Previous lecture: Today:

Previous lecture: Today: Previous lecture: The energy requiring step from substrate to transition state is an energy barrier called the free energy of activation G Transition state is the unstable (10-13 seconds) highest energy

More information

Support Vector Machines with Profile-Based Kernels for Remote Protein Homology Detection

Support Vector Machines with Profile-Based Kernels for Remote Protein Homology Detection TECHNICAL REPORT Report No. CSAI2004-01 Date: August 2004 Support Vector Machines with Profile-Based Kernels for Remote Protein Homology Detection Steven Busuttil John Abela Gordon J. Pace University of

More information

Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1

Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1 Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: Sonia.Casillas@uab.cat

More information

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10 CSC 2427: Algorithms for Molecular Biology Spring 2006 Lecture 16 March 10 Lecturer: Michael Brudno Scribe: Jim Huang 16.1 Overview of proteins Proteins are long chains of amino acids (AA) which are produced

More information

Enhanced data mining analysis in higher educational system using rough set theory

Enhanced data mining analysis in higher educational system using rough set theory African Journal of Mathematics and Computer Science Research Vol. 2(9), pp. 184-188, October, 2009 Available online at http://www.academicjournals.org/ajmcsr ISSN 2006-9731 2009 Academic Journals Review

More information

Hidden Markov models in biological sequence analysis

Hidden Markov models in biological sequence analysis Hidden Markov models in biological sequence analysis by E. Birney The vast increase of data in biology has meant that many aspects of computational science have been drawn into the field. Two areas of

More information

Supplementary Information

Supplementary Information Supplementary Information S1: Degree Distribution of TFs in the E.coli TRN and CRN based on Operons 1000 TRN Number of TFs 100 10 y = 619.55x -1.4163 R 2 = 0.8346 1 1 10 100 1000 Degree of TFs CRN 100

More information

CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/

CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/ CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu liwz@sdsc.edu 1. Introduction

More information

MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using

More information

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each

More information

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources 1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools

More information

A new type of Hidden Markov Models to predict complex domain architecture in protein sequences

A new type of Hidden Markov Models to predict complex domain architecture in protein sequences A new type of Hidden Markov Models to predict complex domain architecture in protein sequences Raluca Uricaru, Laurent Bréhélin and Eric Rivals LIRMM, CNRS Université de Montpellier 2 14 Juin 2007 Raluca

More information

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon Integrating DNA Motif Discovery and Genome-Wide Expression Analysis Department of Mathematics and Statistics University of Massachusetts Amherst Statistics in Functional Genomics Workshop Ascona, Switzerland

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

Enzymes reduce the activation energy

Enzymes reduce the activation energy Enzymes reduce the activation energy Transition state is an unstable transitory combination of reactant molecules which occurs at the potential energy maximum (free energy maximum). Note - the ΔG of the

More information

Searching Nucleotide Databases

Searching Nucleotide Databases Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

ProteinScape. Innovation with Integrity. Proteomics Data Analysis & Management. Mass Spectrometry

ProteinScape. Innovation with Integrity. Proteomics Data Analysis & Management. Mass Spectrometry ProteinScape Proteomics Data Analysis & Management Innovation with Integrity Mass Spectrometry ProteinScape a Virtual Environment for Successful Proteomics To overcome the growing complexity of proteomics

More information

Transcription and Translation of DNA

Transcription and Translation of DNA Transcription and Translation of DNA Genotype our genetic constitution ( makeup) is determined (controlled) by the sequence of bases in its genes Phenotype determined by the proteins synthesised when genes

More information

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006 Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm

More information

Flexible Information Visualization of Multivariate Data from Biological Sequence Similarity Searches

Flexible Information Visualization of Multivariate Data from Biological Sequence Similarity Searches Flexible Information Visualization of Multivariate Data from Biological Sequence Similarity Searches Ed Huai-hsin Chi y, John Riedl y, Elizabeth Shoop y, John V. Carlis y, Ernest Retzel z, Phillip Barry

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Segmentation and Classification of Online Chats

Segmentation and Classification of Online Chats Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat

More information

Global and Discovery Proteomics Lecture Agenda

Global and Discovery Proteomics Lecture Agenda Global and Discovery Proteomics Christine A. Jelinek, Ph.D. Johns Hopkins University School of Medicine Department of Pharmacology and Molecular Sciences Middle Atlantic Mass Spectrometry Laboratory Global

More information

Computational Systems Biology. Lecture 2: Enzymes

Computational Systems Biology. Lecture 2: Enzymes Computational Systems Biology Lecture 2: Enzymes 1 Images from: David L. Nelson, Lehninger Principles of Biochemistry, IV Edition, Freeman ed. or under creative commons license (search for images at http://search.creativecommons.org/)

More information

Software review. Pise: Software for building bioinformatics webs

Software review. Pise: Software for building bioinformatics webs Pise: Software for building bioinformatics webs Keywords: bioinformatics web, Perl, sequence analysis, interface builder Abstract Pise is interface construction software for bioinformatics applications

More information

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative

More information

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching COSC 348: Computing for Bioinformatics Definitions A pattern (keyword) is an ordered sequence of symbols. Lecture 4: Exact string searching algorithms Lubica Benuskova http://www.cs.otago.ac.nz/cosc348/

More information

The Ramachandran Map of More Than. 6,500 Perfect Polypeptide Chains

The Ramachandran Map of More Than. 6,500 Perfect Polypeptide Chains The Ramachandran Map of More Than 1 6,500 Perfect Polypeptide Chains Zoltán Szabadka, Rafael Ördög, Vince Grolmusz manuscript received March 19, 2007 Z. Szabadka, R. Ördög and V. Grolmusz are with Eötvös

More information

Frequently Asked Questions Next Generation Sequencing

Frequently Asked Questions Next Generation Sequencing Frequently Asked Questions Next Generation Sequencing Import These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided

More information

Mass Spectrometry Signal Calibration for Protein Quantitation

Mass Spectrometry Signal Calibration for Protein Quantitation Cambridge Isotope Laboratories, Inc. www.isotope.com Proteomics Mass Spectrometry Signal Calibration for Protein Quantitation Michael J. MacCoss, PhD Associate Professor of Genome Sciences University of

More information

Protein Sequence Analysis - Overview -

Protein Sequence Analysis - Overview - Protein Sequence Analysis - Overview - UDEL Workshop Raja Mazumder Research Associate Professor, Department of Biochemistry and Molecular Biology Georgetown University Medical Center Topics Why do protein

More information

Motif Knowledge Base Based on Deductive Object-Oriented Database Language

Motif Knowledge Base Based on Deductive Object-Oriented Database Language - 10 Motif Knowledge Base Based on Deductive Object-Oriented Database Language MAKOTO HIROSAWAt REIKO TANAKAt MASATO ISHIKAWA hirosawa@sdl.hitachi.co.jp ma-tanak@icot.or.jp ishikawa@icot.or.jp tex-icot,

More information

Integrating Medicinal Chemistry and Computational Chemistry: The Molecular Forecaster Approach

Integrating Medicinal Chemistry and Computational Chemistry: The Molecular Forecaster Approach Integrating Medicinal Chemistry and Computational Chemistry: The Molecular Forecaster Approach Molecular Forecaster Inc. www.molecularforecaster.com Company Profile Founded in 2010 by Dr. Eric Therrien

More information

Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions. Nina Kirschbaum

Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions. Nina Kirschbaum Dissertation am Fachbereich Statistik der Universität Dortmund Extensions of the Partial Least Squares approach for the analysis of biomolecular interactions Nina Kirschbaum Erstgutachter: Prof. Dr. W.

More information

ProteinPilot Report for ProteinPilot Software

ProteinPilot Report for ProteinPilot Software ProteinPilot Report for ProteinPilot Software Detailed Analysis of Protein Identification / Quantitation Results Automatically Sean L Seymour, Christie Hunter SCIEX, USA Pow erful mass spectrometers like

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Web-Based Genomic Information Integration with Gene Ontology

Web-Based Genomic Information Integration with Gene Ontology Web-Based Genomic Information Integration with Gene Ontology Kai Xu 1 IMAGEN group, National ICT Australia, Sydney, Australia, kai.xu@nicta.com.au Abstract. Despite the dramatic growth of online genomic

More information

Activity 7.21 Transcription factors

Activity 7.21 Transcription factors Purpose To consolidate understanding of protein synthesis. To explain the role of transcription factors and hormones in switching genes on and off. Play the transcription initiation complex game Regulation

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS 1. The Technology Strategy sets out six areas where technological developments are required to push the frontiers of knowledge

More information

Data Analysis for Ion Torrent Sequencing

Data Analysis for Ion Torrent Sequencing IFU022 v140202 Research Use Only Instructions For Use Part III Data Analysis for Ion Torrent Sequencing MANUFACTURER: Multiplicom N.V. Galileilaan 18 2845 Niel Belgium Revision date: August 21, 2014 Page

More information

Application of discriminant analysis to predict the class of degree for graduating students in a university system

Application of discriminant analysis to predict the class of degree for graduating students in a university system International Journal of Physical Sciences Vol. 4 (), pp. 06-0, January, 009 Available online at http://www.academicjournals.org/ijps ISSN 99-950 009 Academic Journals Full Length Research Paper Application

More information

Mascot Search Results FAQ

Mascot Search Results FAQ Mascot Search Results FAQ 1 We had a presentation with this same title at our 2005 user meeting. So much has changed in the last 6 years that it seemed like a good idea to re-visit the topic. Just about

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Interaktionen von RNAs und Proteinen

Interaktionen von RNAs und Proteinen Sonja Prohaska Computational EvoDevo Universitaet Leipzig June 9, 2015 Studying RNA-protein interactions Given: target protein known to bind to RNA problem: find binding partners and binding sites experimental

More information

Biological Databases and Protein Sequence Analysis

Biological Databases and Protein Sequence Analysis Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to

More information

Analysis of ChIP-seq data in Galaxy

Analysis of ChIP-seq data in Galaxy Analysis of ChIP-seq data in Galaxy November, 2012 Local copy: https://galaxy.wi.mit.edu/ Joint project between BaRC and IT Main site: http://main.g2.bx.psu.edu/ 1 Font Conventions Bold and blue refers

More information

Error Tolerant Searching of Uninterpreted MS/MS Data

Error Tolerant Searching of Uninterpreted MS/MS Data Error Tolerant Searching of Uninterpreted MS/MS Data 1 In any search of a large LC-MS/MS dataset 2 There are always a number of spectra which get poor scores, or even no match at all. 3 Sometimes, this

More information