BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm377

Vol. 23
Structural bioinformatics

Comparative protein structure modeling by combining multiple templates and optimizing sequence-to-structure alignments

Narcis Fernandez-Fuentes, Brajesh K. Rai, Carlos J. Madrid-Aliste, J. Eduardo Fajardo and András Fiser*

Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA, Institute of Enzymology and Alfred Renyi Institute of Mathematics, Hungarian Academy of Sciences, H-1113 Budapest, Karolina ut 29, Hungary

Received on March 8, 2007; revised on June 20, 2007; accepted on July 14, 2007
Advance Access publication September 6, 2007
Associate Editor: Burkhard Rost

ABSTRACT

Motivation: Two major bottlenecks in advancing comparative protein structure modeling are the efficient combination of multiple template structures and the generation of a correct input target-template alignment.

Results: A novel method, Multiple Mapping Method with Multiple Templates (M4T), is introduced. It implements an algorithm to automatically select and combine Multiple Template structures (MT) and an alignment optimization protocol (Multiple Mapping Method, MMM). The MT module of M4T selects and combines multiple template structures through an iterative clustering approach that takes into account the unique contribution of each template, their sequence similarity among themselves and to the target sequence, and their experimental resolution. MMM is a sequence-to-structure alignment method that optimally combines alternatively aligned regions according to their fit in the structural environment of the template structure. The resulting M4T alignment is used as input to a comparative modeling module. The performance of M4T has been benchmarked on CASP6 comparative modeling target sequences and on a larger independent test set, and it compared favorably with current state-of-the-art methods.
Availability: A web server was established for the method.

Contact: afiser@aecom.yu.edu or andras@fiserlab.org

*To whom correspondence should be addressed.
†Present address: Wyeth Research, CN8000, Princeton, New Jersey, USA.

1 INTRODUCTION

Comparative protein structure modeling relies on detectable similarity spanning most of the modeled sequence and at least one known structure (Marti-Renom et al., 2000). When the structure of one protein in a family has been determined by experiment, the other members of the family can be modeled based on their alignment to the known structure. Comparative modeling approaches usually consist of four major steps: (1) identifying one or more templates, (2) calculating an accurate alignment between the target sequence and template structure(s), (3) modeling the target and (4) evaluating the target model (Fiser and Sali, 2003). Each step determines the success of all subsequent ones. For instance, an incorrect template selection cannot be corrected at the alignment step, and an alignment error cannot be corrected at the model building step. Accordingly, the first two steps are the most critical ones in comparative modeling.

The first step in homology modeling (i.e. the template selection step) is aided by several available methods developed for fold recognition (Domingues et al., 1999; McGuffin et al., 2000; Shi et al., 2001) and profile alignment (Altschul et al., 1997; Li et al., 2000) that allow efficient recognition of remotely related sequences. Using these methods, it is most often possible to identify more than one template structure. This trend is strengthening due to the rapid expansion of the Protein Data Bank (PDB) (Berman et al., 2000) and in particular to worldwide structural genomics efforts (Chance et al., 2004).
However, because optimally selecting and combining multiple templates is a complex problem, currently available modeling programs, and especially the automated servers, typically consider only one template for building a model for a target sequence. Meanwhile, results from CASP experiments, as early as CASP2 in 1996, indicated that multiple templates help to improve the quality of comparative models (Sanchez and Sali, 1997; Venclovas and Margelevicius, 2005). Multiple template structures can be useful in two ways. First, multiple template structures may be aligned with different parts/domains of the target, with little overlap between them, in which case the modeling procedure can construct a homology-based model of the whole target sequence (improving model coverage). Therefore, it is frequently beneficial to include in the modeling process all the templates that have a unique contribution to the target sequence (Fiser, 2004). Second, the template structures may be aligned with the same part of the target, so that the model is built on the locally best template (improving model quality).

Although the idea of combining multiple templates sounds straightforward, its implementation is fairly complex. The real challenge is not the identification of a list of suitable template candidates, but an optimal combination of these. This is because template search methods outperform the needs of comparative modeling, in the sense that they can locate sequences so remotely related that no reliable comparative model can be built for them. The reason for this is that sequence relationships are often established on short conserved segments, while a successful comparative modeling exercise requires an overall correct alignment for the entire modeled part of the protein. The MT module of the M4T algorithm addresses this very important issue.

© The Author. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

The second step in comparative modeling (i.e. the calculation of an accurate alignment of a target sequence to a template structure) remains a bottleneck in producing good quality homology models. A number of alignment methods have been developed and are publicly available [MUSCLE (Edgar, 2004), CLUSTALW (Thompson et al., 1994), Align2d (Madhusudhan et al., 2006), T-coffee (Notredame et al., 2000), FFAS (Jaroszewski et al., 2000) and SATCHMO (Edgar and Sjolander, 2003)]. However, none of these alignment methods consistently produces the best solution in all cases (Prasad et al., 2003; Rai and Fiser, 2006). Furthermore, alignments produced by two different methods are often better in some regions and worse in others when compared to each other. One possible solution to this problem is to consider several alignment methods and combine better-aligned parts into a unique solution (Kosinski et al., 2005; Rai and Fiser, 2006).

M4T has been developed to produce accurate alignments and models by minimizing the errors associated with the first two steps in comparative modeling (recognizing and combining templates and generating an optimal input alignment). In the first step, the MT module uses an iterative clustering approach to select and combine multiple protein structures to serve as templates.
Next, to reduce errors associated with alignments, an iterative implementation of the earlier published Multiple Mapping Method (MMM) (Rai and Fiser, 2006) is used that considers solutions from several alignment methods and combines better-aligned parts into a unique solution.

The performance of M4T has been rigorously tested using various benchmarks. We demonstrate that M4T produces better models when multiple templates are used than when only the single best available template is used; M4T's superior performance stands out in the low sequence identity region, which presents a major challenge to homology modeling. Furthermore, M4T also compares favorably with other competitive approaches and with the performance of expert users at CASP.

2 METHODS

2.1 Template selection method: MT module

The target sequence is used as a query to search for homologous protein structure(s) that could serve as template(s) by running three iterations of PSI-BLAST (Altschul et al., 1997) against PDB (Berman et al., 2000), using an E-value cutoff. Only those hits are selected where the sequence overlap with the target sequence is >60% of the actual SCOP domain length, or more than 75% of the PDB chain length in case of a missing SCOP classification. Next, the hits are clustered using an iterative clustering procedure that identifies the most suitable PDB structures to combine as templates. The goal of the clustering step is to identify the smallest number of templates that can contribute the most to the model. Templates are selected or discarded according to the following procedure [Fig. 1, also Fig. 2 in Fernandez-Fuentes et al. (2007)]:

(1) Cluster initiation. The hit with the smallest E-value is selected and is used to seed a cluster. All hits that align in the same region (within 10 flanking residues of the first selected hit) are added to this cluster.

(2) Sequence identity of hits to query. The sequence identity is calculated between the query and all hits in the cluster according to the PSI-BLAST alignment. If the sequence identity of the best available hit is larger than 50%, only those additional hits are kept in the cluster whose identity is within 20% of the best hit.

(3) Characterize hits as unique and non-unique. A hit is unique if it contains at least one stretch of 8 or more residues aligned to a region of the target sequence that is not covered by any other hit. The current limit of 8 residues approximately corresponds to the upper limit up to which a reliable loop conformation can be built using available approaches, and it is therefore subject to change as loop modeling techniques improve (Fernandez-Fuentes et al., 2006; Fiser et al., 2000). Unique and non-unique attributes are assigned to all hits that form a cluster, and then all hits are ranked within a cluster according to their crystal resolution. Thus, a hit with the best crystal resolution is always unique, and the remaining hits can be unique only if they contribute to a unique region (e.g. to an insertion that is solved in that one structure only and not in any other).

(4) Consolidating the clusters. Once the hits that form the cluster are classified into unique and non-unique, a purging process is started. It has three consecutive qualifying steps and applies to non-unique hits only:

(a) The first step is a sequence identity comparison using a greedy algorithm, where only those non-unique hits that have a sequence identity between 30 and 90% to any unique hit are kept; the rest are discarded. Note that once a non-unique hit is selected, the remaining non-unique hits will be compared against the unique plus the selected non-unique hits. Again, the order of comparisons is set by crystal resolution. The sequence identity is calculated using the alignments between hits and target sequence given by PSI-BLAST. In general, this step ensures that templates that are structurally neither too similar nor too dissimilar will be selected.

(b) Next, a filtering step takes place that consolidates templates with varying crystal resolution. Non-unique hits are discarded if the difference in crystal resolution to the experimentally best-solved unique template is larger than 1.5 Å. This step guarantees that significantly poorer resolution templates are not used. NMR structures are assigned a virtual 4.5 Å resolution, which means that an NMR solution is used only if it is the only template or if a similar X-ray structure has a worse resolution than 3 Å.

(c) The last filter determines if a hit is contributing to an underrepresented part of the target, i.e. a non-unique hit is kept only if it is aligned to a region of 8 or more residues that is covered by two or fewer hits.

(5) Return to point (1) if there are hits that are not assigned to any cluster and iterate again, if necessary by initiating and establishing new clusters.

The result of this iterative clustering process is one or more clusters of templates containing one or more template structures. Next, within each cluster, all templates are aligned to the corresponding target sequence using the iterative-mmm approach (see below). In a last consolidation step, sequence-to-structure alignments of clusters that overlap are combined. The overlapping parts of the templates are superposed and an LGA_S score (Zemla, 2003) is calculated on that superposition. If this score is larger than 70%, then the overlapping clusters are combined using their alignment to the (same) target sequence as reference. If clusters of templates are not overlapping, or the overlap between them cannot be structurally accurately superposed, then individual models are built for each modelable part of the target sequence for each cluster of templates.

2.2 Target to template(s) alignment: MMM module

The target-to-template(s) alignments are calculated using an iterative implementation of the Multiple Mapping Method (Rai and Fiser, 2006). To construct profiles, the sequences of the target and template(s) are independently searched against the non-redundant database [NR (Boeckmann et al., 2003)] of NCBI using five iterations of PSI-BLAST with an E-value cutoff. Next, BlastProfiler (Rai et al., 2007) is run to build sequence profiles for both the target and template sequences. The program parses all iterations of PSI-BLAST outputs, and locates and stores those pairwise alignments between the query and database sequences that meet the filtering criteria. The values specified for filtering are: (i) Lower and upper cutoffs for percent sequence identities between the hit and the query, as reported in the pairwise BLAST alignment; default: 30 and 90%, respectively. (ii) Lower bound for alignment length; default: 30 residues. (iii) Maximal E-value for each hit. (iv) Minimal required coverage of the query in the alignment, in percentage; default: 30%.

Typically, the PSI-BLAST output contains more than one alignment for the same hit sequence, especially when multiple iterations are performed.
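As an illustration of the filtering criteria (i) to (iv) listed above, the following minimal Python sketch applies them to a list of pairwise alignments. It is not the BlastProfiler implementation: the class `PairwiseHit`, the function `passes_filters` and all parameter names are hypothetical, and the inclusive bounds and the definition of coverage (aligned query span divided by query length) are assumptions. The E-value threshold has no default here because its default value is not given in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseHit:
    """One query-to-hit pairwise alignment (hypothetical record layout)."""
    hit_id: str
    percent_identity: float  # 0-100, as reported in the pairwise alignment
    alignment_length: int    # number of aligned positions
    e_value: float
    query_start: int         # 1-based, inclusive
    query_end: int           # 1-based, inclusive


def passes_filters(hit, query_length, max_e_value,
                   min_identity=30.0, max_identity=90.0,
                   min_length=30, min_coverage=30.0):
    """Return True if the hit meets criteria (i)-(iv).

    Bounds are treated as inclusive, which is an assumption.
    max_e_value must be supplied by the caller.
    """
    # Assumed coverage definition: aligned query span / query length.
    coverage = 100.0 * (hit.query_end - hit.query_start + 1) / query_length
    return (min_identity <= hit.percent_identity <= max_identity  # (i)
            and hit.alignment_length >= min_length                # (ii)
            and hit.e_value <= max_e_value                        # (iii)
            and coverage >= min_coverage)                         # (iv)


# Usage with made-up alignments and an arbitrary E-value threshold.
hits = [
    PairwiseHit("seqA", 45.0, 120, 1e-10, 1, 120),  # meets all criteria
    PairwiseHit("seqB", 95.0, 120, 1e-10, 1, 120),  # identity too high
    PairwiseHit("seqC", 45.0, 40, 1e-10, 1, 40),    # coverage too low
]
kept = [h.hit_id for h in hits
        if passes_filters(h, query_length=200, max_e_value=1e-3)]
print(kept)
```

Keeping the thresholds as keyword arguments makes it easy to vary the cutoffs without touching the filtering logic.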
Such alternative alignments may include either the same or different regions of the hit sequence. Alignments to different regions of the target are kept as separate entries. Two alignments that involve the same hit sequence are considered redundant if the overlap is >50%. Because alignments produced in later iterations contain more specific information about the sequence profile, these alignments are preferred over earlier ones in case of overlaps. The second major step in the selection of a set of representative hit sequences is to remove sequence redundancy using the CD-HIT clustering program (Li et al., 2002) at the 40% identity level.

Starting from the collected sequences, three separate profiles are calculated for each template and for the target sequence, namely clustalw_d_profile, clustalw_m_profile and muscle_profile. The clustalw_d_profile and clustalw_m_profile are obtained by aligning the sequences using CLUSTALW (Thompson et al., 1994) with the default gap penalty function (clustalw_d_profile) and with a modified gap penalty function (clustalw_m_profile), and muscle_profile is obtained using MUSCLE (Edgar, 2004). At the end of this step, three alternative profile-to-profile-based sequence alignments are available, which are used as input to MMM (Rai and Fiser, 2006). These three alternative alignments are combined in the following manner: clustalw_d_profile is combined with muscle_profile, generating an MMM alignment, mmm_alignment_1; clustalw_m_profile is combined with muscle_profile, generating mmm_alignment_2. Finally, mmm_alignment_1 and mmm_alignment_2 are used as inputs to MMM for the final MMM alignment (Fig. 1).

2.3 Model building

Models are built with the MODELLER program (Fiser and Sali, 2003; Sali and Blundell, 1993) using the default values for the model.top routine. Selected template(s) and optimized alignment(s) are provided as inputs.

Fig. 1. Flowchart for model building.
General overview of the algorithm: starting from a query sequence, a search is performed using PSI-BLAST, and template(s) are selected in the MT module; subsequently, the MMM module performs the sequence alignment(s), and finally MODELLER builds the protein model(s). See Methods section for further explanations.

2.4 Benchmark sets

Two different test sets were used to benchmark our method. The first benchmark set was composed of sequences used in the CASP6 experiment for comparative modeling assessments. The target sequences were downloaded from edu/casp6/ and only those target sequences that produced a hit against a tailored PDB (Berman et al., 2000) dataset (see below) with PSI-BLAST (Altschul et al., 1997) were kept. In total, 24 targets from 17 target protein sequences were considered (CASP target identifications: T0204, T0229, T0231, T0233, T0240, T0246, T0247, T0264, T0266, T0268, T0269, T0271, T0274, T0275, T0276, T0277 and T0282). The second benchmark set was composed of 765 selected protein sequences with known structures, taken out of 1160 from a previous work (Rai and Fiser, 2006); for each of these selected sequences the MT module returned more than one hit or template. Each query sequence of both benchmark sets was modeled using a tailored PDB (MT module) and a tailored NR database (MMM module). The tailored databases did not contain any structure or sequence that was deposited after the expiration date set by the CASP organizers.

2.5 Measure of model quality

Three measures were used to assess the quality of the models, i.e. the similarity between the generated comparative models and the corresponding experimental structure: RMSDseq, RMSDstr and GDT_TS score. RMSDseq is the root mean square deviation that is calculated on Cα atoms after a sequence-dependent superposition of Cα positions using a 5.0 Å distance cutoff. RMSDstr is the same as RMSDseq but on a sequence-independent superposition (i.e. using the best structural superimposition). Finally, the GDT_TS score, or global distance test total score, was calculated. GDT_TS is a main metric used to evaluate CASP experiments, and it accounts for the structural similarity between the model and the experimental solution structure by measuring the fraction of superposable residues at distance cutoffs of 1.0, 2.0, 4.0 and 8.0 Å. All these measures were calculated using the LGA program (Zemla, 2003).

Table 1. List of CASP6 targets and the accuracy of the comparative models built using a template with the best PSI-BLAST E-value

Target    Template    Nt    Nm    RMSDseq (Å)    RMSDstr (Å)    Nr    GDT_TS
T0204     1HXP_A
T0229_1   1ML8_A
T0229_2   1ML8_A
T0231     1F7S_A
T0233_1   1KHD_D
T0233_2   1KHD_D
T0240     1QXX_A
T0246     1A05_A
T0247_1   1PJ6_A
T0247_2   1PJ6_A
T0247_3   1PJ6_A
T0264_1   1VHV_A
T0264_2   1VHV_A
T0266     1DBU_A
T0268_1   1N2X_A
T0268_2   1N2X_A
T0269_1   1QMV_A
T0269_1   1QQ2_A
T0271     1RLH_A
T0274     1I0R_A
T0275     1MJH_A
T0276     1SOU_A
T0277     1JOG_A
T0282     1PQ3_A

Nt: number of residues in target structure; Nm: number of residues in model; RMSDseq: root mean square deviation of Cα atoms based on a sequence-dependent superposition; RMSDstr: root mean square deviation of Cα atoms based on a structure-dependent superposition; Nr: number of residues considered for RMSD calculation and GDT_TS: global distance test total score (see Methods section for more information).

3 RESULTS

3.1 Performance of M4T

The performance of the method has been benchmarked in two different scenarios.
M4T performance was tested on CASP6 comparative modeling targets and compared to models that were based on the single best template and then to the single best model produced by any group at CASP6. Finally, on a larger independent set, the overall performance of M4T was tested by building models on single and multiple templates for 765 cases.

3.2 Single versus multiple templates at CASP

All comparative modeling targets were tested by building models with M4T using the single best identified template and then by using multiple templates. In this setup, we used the MMM alignment module of M4T to generate input alignments for both cases. For 11 out of 24 CASP comparative modeling targets, it was possible to combine multiple templates. For all cases but one (T0269), the use of multiple templates provides a superior model in terms of RMSDseq, RMSDstr and GDT_TS scores compared with the one based on a single best template (Tables 1 and 2). The most impressive improvement takes place in the case of target sequence T0275, where the GDT_TS score increases when multiple templates are combined. These observations confirm the anecdotal reports of CASP participants that suggested that the use of multiple templates is advantageous (Sanchez and Sali, 1997; Venclovas and Margelevicius, 2005).

Table 2. List of CASP6 target sequences and the accuracy of their prediction using multiple templates

Target    Templates    Nt    Nm    RMSDseq (Å)    RMSDstr (Å)    Nr    GDT_TS
T0204     1HXP_A GUP_A
T0231     1F7S_A M4J_A 1AHQ_- 1AK6_-
T0233_1   1KHD_{A,C,D} BRW_A 1V8G_A
T0233_2   1KHD_{A,C,D} BRW_A 1V8G_A
T0246     1A05_A HQS_A 1CNZ_A 1CM7_A
T0268_1   1N2X_A M6Y_B
T0268_2   1N2X_A M6Y_B
T0269_1   1QMV_A N8J_A
T0269_1   1QQ2_A ST9_A
T0275     1MJH_A JMV_A 1TQ8_A
T0282     1PQ3_A CEV_A

See Table 1 for explanation of headers.

3.3 Comparison with current methods and expert knowledge

M4T also compared well with state-of-the-art methods and human experts in protein modeling. Table 3 shows the performance of M4T as compared with the single best models submitted to CASP6 by any group. These results often differ from the ones reported in the previous section because alignments may be different due to different methods used, different profiles employed or manual editing. Certain users may have used information on multiple structures. In addition, expert users may have attempted side chain and loop modeling in certain parts of the models. An ultimate goal of automated structure prediction is to deliver models with an accuracy competitive with the ones created by expert users, and to do it in a fully automated way and in a short time. In 9 out of 24 cases, M4T outperformed the single best model submitted to CASP (Table 3). As another qualitative comparison, in 9 cases the differences between the best CASP model and M4T were small, and in 5 other cases M4T was significantly better, while in 9 cases CASP models turned out to be more accurate (for one case M4T did not return a model). Out of the 24 best CASP models, the largest number contributed by the same research group was 9, and the second largest was 2. In this simplified comparison, M4T would fare as the second best individual performer, with five models superior to any other submission. While it is true that from a small number of test cases, such as at CASP, it is hard to conclude statistical significance (Marti-Renom et al., 2002), we perceive this performance as encouraging and a sign that automated methods are becoming competitive with the best expert users.

3.4 Benchmarking on an independent test set

The benefit of using multiple templates was also confirmed on an independent benchmarking set consisting of 765 proteins taken from an earlier study (Rai and Fiser, 2006). Two sets of models were built: (a) using multiple templates, and (b) using the single best template.
In Figure 2, RMSDseq is shown versus sequence identity (comparing the quality of models to the sequence identity between the target and the best template). Below 50% sequence identity, models built using multiple templates are more accurate than those built using a single template only, and this trend is accentuated as one moves into more remote target-template pair cases. Meanwhile, the advantage of using multiple templates gradually disappears above 50% target-template sequence identity. This result is also consistent with the performance on the CASP6 set, where hits usually have a low sequence identity with their corresponding query. Besides improving the model quality, the use of multiple templates also increases model coverage, i.e. the resulting models cover a larger fraction of the target sequence, sometimes as much as 50 residues longer (Fig. 3).

3.5 Two examples of models predicted using single and multiple templates

Figure 4 shows the structure prediction of PDB: 1ekx, chain A. After searching in a tailored PDB database, the hit with the best E-value was 9atc (E-value 1E-176). The MT module returned a cluster of three templates: 1acm, 1a1s and 1oth. Both models are very accurate for the core of the protein; however, the model built using multiple templates (red) is more accurate in two regions, marked A and B, than the model built using a single template.

An additional advantage of using multiple templates is that the resulting model is more complete. Figure 5 shows the model for PDB 1hix, chain B. The length of the model built with multiple templates is 187 residues, and it was built using 2bvv, 1enx, 1f5f and 1yna as templates. For comparison, the length of the model using the single best E-value hit, 1xyn, is 167 residues only. The longer model includes an additional supersecondary element, a beta-turn-beta-turn element, which is not present in the model built with the single best template.

Table 3. Comparison of prediction accuracy between the best possible model using our method and the best model submitted to CASP6

          BEST M4T                   BEST CASP6
Target    GDT_TS    RMSDseq (Å)    GROUP    GDT_TS    RMSDseq (Å)
T                                  GINALSKY
T0229_                             CBRC-3D
T0229_                             CHIMERA
T                                  NANOMODEL
T0233_                             ROHL
T0233_                             GINALSKY
T                                  GINALSKY
T                                  HONIGLAB
T0247_                             ALSO-RAN_U
T0247_                             GINALSKY
T0247_                             TOME_U
T0264_                             JONES-UCL
T0264_                             GINALSKY
T                                  SKOLNICK-ZHANG
T0268_                             CASPITA
T0268_                             CBSU
T0269_                             GINALSKY
T0269_2   N/A       N/A            GINALSKY
T                                  GENESILICO-GROUP
T                                  GINALSKY
T                                  VENCLOVAS
T                                  TOME_U
T                                  KOLINSKY&BUJNICKI
T                                  GINALSKY

Fig. 2. RMSD(seq) versus sequence identity. Using a dataset of 765 proteins with known structure, two sets of models were built: (1) using one template (best E-value hit only; light bars), (2) using multiple templates (gray bars). The percentage of sequence identity is calculated between the hit with the best E-value and the query sequence. The error of the mean is shown.

4 DISCUSSION AND CONCLUSIONS

We described a new algorithm, M4T, for fully automated comparative modeling that makes it possible to (1) efficiently select and combine multiple template structures and (2) generate an accurate target-to-template alignment.
For the template selection step, we introduced an iterative clustering approach of potential templates that is driven by a set of filtering and ranking criteria and is based on sequence signal, crystal resolution and the relative sequence novelty contribution to the target. For aligning the selected templates with the target sequence, we used a new version of the MMM method. The novelty comes from employing a sequence profile building module, so that profile-to-profile alignments are used as inputs to MMM instead of pairwise alignments. The other difference from the earlier implementation of MMM is that the input alignments are combined in an automated iterative way, unlike before, when the actual combination required supervision (Rai and Fiser, 2006). The original version of MMM showed a statistically significant improvement over existing methods by reducing alignment errors in the range of 3-17% over the inputs. MMM also compared favorably with two alignment meta-servers tested (Lambert et al., 2002; Prasad et al., 2003). Meanwhile, the iterative version of MMM has been illustrated here to outperform its own earlier implementation (Rai et al., 2007).

We have shown that the fully automated M4T performs as well as or better than the most advanced methods in protein structure modeling in the hands of expert users. M4T also performs better at low sequence identity signal, both in terms of model quality and model coverage.

Fig. 3. Histogram of the increase of model coverage. Each query sequence is modeled using single and multiple template(s). The histogram shows the frequency of the difference between the length of the model built using multiple templates (Lm) and the length of the model built using a single template (Ls).

Fig. 4. Model for pdb 1ekx chain A using single and multiple templates. The X-ray structure, the model with multiple templates, and the model with a single template are shown in gray, red and blue, respectively. Although both models agree very well with the core of the X-ray protein, the model constructed using multiple templates agrees much better in two exposed regions, A and B, than the model built using a single template. Figures 4 and 5 were generated using PyMOL (pymol.sourceforge.net/).

Fig. 5. Model for pdb 1hix chain B using single and multiple templates. The X-ray structure, the model with multiple templates, and the model built with a single template are shown in gray, red and blue, respectively. The combination of multiple templates resulted in a more complete model that includes an extra beta-turn-beta-turn region (20 amino acids), depicted in ribbon in the figure.

4.1 Web-server

M4T is accessible as a web-server at servers/m4t/ (Fernandez-Fuentes et al., 2007). The web-server has a straightforward interface. The user only needs to provide a target sequence, which can be entered in a text box or uploaded as a text file, a short description for the sequence and a valid e-mail address.
The target sequence must be in pure text containing one-letter amino acid codes (without any header). The server will returns a full atom model(s) in PDB format as output, plus the alignment(s) used for modeling. All the jobs are submitted to a queuing system thus the delay in execution depends on the number of active queries. Once the prediction is completed results are sent by in the form of a link pointing to a temporary web page that stores results for 1 month. ACKNOWLEDGEMENT This work was supported by NIH GM Conflict of Interest: none declared. REFERENCES Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28,

Boeckmann,B. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in Nucleic Acids Res., 31, 365.
Chance,M.R. et al. (2004) High-throughput computational and experimental techniques in structural genomics. Genome Res., 14,
Domingues,F.S. et al. (1999) Sustained performance of knowledge-based potentials in fold recognition. Proteins, 37, 112.
Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
Edgar,R.C. and Sjolander,K. (2003) SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics, 19,
Fernandez-Fuentes,N. et al. (2006) A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Res., 14,
Fernandez-Fuentes,N. et al. (2007) M4T: a comparative protein structure modeling server. Nucleic Acids Res.
Fiser,A. (2004) Protein structure modeling in the proteomics era. Expert Rev Proteomics, 1,
Fiser,A. and Sali,A. (2003) Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol., 374, 461.
Fiser,A. et al. (2000) Modeling of loops in protein structures. Protein Sci., 9,
Jaroszewski,L. et al. (2000) Improving the quality of twilight-zone alignments. Protein Sci., 9,
Kosinski,J. et al. (2005) FRankenstein becomes a cyborg: the automatic recombination and realignment of fold recognition models in CASP6. Proteins, 61,
Lambert,C. et al. (2002) ESyPred3D: prediction of proteins 3D structures. Bioinformatics, 18,
Li,W. et al. (2000) Saturated BLAST: an automated multiple intermediate sequence search used to detect distant homology. Bioinformatics, 16,
Li,W. et al. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18,
Madhusudhan,M.S. et al. (2006) Variable gap penalty for protein sequence-structure alignment. Protein Eng. Des. Sel., 19,
Marti-Renom,M.A. et al. (2000) Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct., 29, 291.
Marti-Renom,M.A. et al. (2002) Reliability of assessment of protein structure prediction methods. Structure (Camb.) 10, 435.
McGuffin,L.J. et al. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404.
Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205.
Prasad,J.C. et al. (2003) Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics, 19,
Rai,B.K. and Fiser,A. (2006) Multiple mapping method: a novel approach to the sequence-to-structure alignment problem in comparative protein structure modeling. Proteins, 63,
Rai,B.K. et al. (2007) MMM: a sequence-to-structure alignment protocol. Bioinformatics, 22,
Sali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779.
Sanchez,R. and Sali,A. (1997) Evaluation of comparative protein structure modeling by MODELLER-3. Proteins, (Suppl. 1), 50.
Shi,J. et al. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol., 310, 243.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22,
Venclovas,C. and Margelevicius,M. (2005) Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins, 61,
Zemla,A. (2003) LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31,
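
SUPPLEMENTARY NOTE

The server description states that the target sequence must be plain text in one-letter amino acid codes with no header. The following is a minimal sketch of how a user might prepare such input locally before submission. It is not part of the M4T server: the function name prepare_target_sequence is hypothetical, and restricting input to the 20 standard residue letters is an assumption, since the text does not say which symbols the server accepts.

```python
# Hypothetical client-side helper; not part of the M4T server.
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_target_sequence(text: str) -> str:
    """Return a header-free, whitespace-free, upper-case sequence string."""
    lines = [line.strip() for line in text.splitlines()]
    # Drop empty lines and FASTA-style header lines (those starting with '>').
    body = [line for line in lines if line and not line.startswith(">")]
    # Remove any internal whitespace and normalise to upper case.
    sequence = "".join("".join(line.split()) for line in body).upper()
    if not sequence:
        raise ValueError("no sequence found in input")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("unsupported residue codes: " + "".join(invalid))
    return sequence


if __name__ == "__main__":
    print(prepare_target_sequence(">example header\nmkt ay\nIAK\n"))
```

Rejecting unknown symbols instead of silently dropping them keeps the submitted sequence identical in length to what the user intended, which matters when the returned alignment is compared against the original residue numbering.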


More information

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want 1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very

More information

A QoS-Aware Web Service Selection Based on Clustering

A QoS-Aware Web Service Selection Based on Clustering International Journal of Scientific and Research Publications, Volume 4, Issue 2, February 2014 1 A QoS-Aware Web Service Selection Based on Clustering R.Karthiban PG scholar, Computer Science and Engineering,

More information

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

More information

Systematic assessment of cancer missense mutation clustering in protein structures

Systematic assessment of cancer missense mutation clustering in protein structures Systematic assessment of cancer missense mutation clustering in protein structures Atanas Kamburov, Michael Lawrence, Paz Polak, Ignaty Leshchiner, Kasper Lage, Todd R. Golub, Eric S. Lander, Gad Getz

More information

Error Tolerant Searching of Uninterpreted MS/MS Data

Error Tolerant Searching of Uninterpreted MS/MS Data Error Tolerant Searching of Uninterpreted MS/MS Data 1 In any search of a large LC-MS/MS dataset 2 There are always a number of spectra which get poor scores, or even no match at all. 3 Sometimes, this

More information

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

This document presents the new features available in ngklast release 4.4 and KServer 4.2. This document presents the new features available in ngklast release 4.4 and KServer 4.2. 1) KLAST search engine optimization ngklast comes with an updated release of the KLAST sequence comparison tool.

More information

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA Are Image Quality Metrics Adequate to Evaluate the Quality of Geometric Objects? Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA ABSTRACT

More information

CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10 CSC 2427: Algorithms for Molecular Biology Spring 2006 Lecture 16 March 10 Lecturer: Michael Brudno Scribe: Jim Huang 16.1 Overview of proteins Proteins are long chains of amino acids (AA) which are produced

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information