TITLE PAGE - CURRENT PROTOCOLS IN BIOINFORMATICS

Unit Number:
Unit Title: DALI structural comparison of proteins
Authors: Liisa Holm*, Sakari Kääriäinen, Dariusz Plewczynski (1), Chris Wilton
Address(es): Institute of Biotechnology, University of Helsinki, Viikinkaari 5, P.O. Box 56, Helsinki, FI-00014, Finland; (1) Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Pawinskiego 5a Street, 02-106 Warsaw, Poland.
Telephone: +358-9-19159115
Fax: +358-9-19159079
Email: Liisa.Holm@helsinki.fi
Keywords: classification of protein folds; database searching; distance geometry; pattern recognition; protein structure alignment

Abstract:
The Dali program is widely used for carrying out automatic comparisons of protein structures determined by X-ray crystallography or NMR. The most familiar version is the Dali server, which performs a database search comparing a query structure supplied by the user against the database of known structures (PDB) and returns the list of structural neighbours by e-mail. The more recently introduced DaliLite server compares two structures against each other and visualizes the result interactively. The Dali database is a structural classification based on precomputed all-against-all structural similarities within the PDB. The resulting hierarchical classification can be browsed on the web and is linked to protein sequence classification resources. All Dali resources use an identical algorithm for structure comparison. Users may run Dali using the Web, or the program may be downloaded to be run locally on Linux computers.

UNIT INTRODUCTION

The rapidly growing number of known tertiary structures makes protein structure comparison increasingly important. Biological interest centres on evolutionary relationships inferred from quantifiable similarities between proteins. Sequence similarity searches can detect evolutionary relationships down to a sequence identity of about 25%. Below this level lies the twilight zone of sequence similarity. Comparing structures can extend the detection of evolutionary relationships beyond the border of the twilight zone, because protein structure is much better conserved during evolution than sequence (Chothia and Lesk, 1986). By searching structural databases, molecular biologists can uncover connections between protein families that cannot be seen from sequence alone.

Structure-based prediction of protein function aims at unifying protein families into larger sets (super-families). Functionally divergent families classified into the same super-family typically exploit a conserved mechanical or biochemical mechanism that has been adapted to different cellular processes and substrates (Holm and Sander 1996). Inferring such complex conserved properties is the basic motivation for the systematic structure-structure comparison and classification of the available proteins.

Dali is a tool for both pairwise structure comparison and structure database searching. Its web interface makes it easy to view the results, multiple alignments and three-dimensional superimpositions of structures. The method is fully automated and is highly sensitive in identifying common structural cores and structural resemblances. Dali uses the 3D Cartesian coordinates of the Cα atoms of each protein to calculate residue-residue distance matrices. A similarity score for two structures is defined as a weighted sum over pairs of equivalent intra-molecular distances.
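The distance-matrix idea can be illustrated with a minimal Python sketch. It is not the Dali implementation: the function names are hypothetical, and the elastic form of the score together with the constants `theta = 0.2` and `alpha = 20` Å are assumptions taken from the published description of the method rather than from this unit.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise C-alpha distances (in Angstroms) for a list of (x, y, z) tuples."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def elastic_score(dist_a, dist_b, pairs, theta=0.2, alpha=20.0):
    """Weighted sum over pairs of equivalent intra-molecular distances.

    pairs is a list of (index_in_A, index_in_B) residue equivalences.
    theta (similarity threshold) and alpha (distance damping, in Angstroms)
    are assumed values, not parameters documented in this unit.
    """
    total = 0.0
    for ia, ib in pairs:
        for ja, jb in pairs:
            d_a = dist_a[ia][ja]
            d_b = dist_b[ib][jb]
            mean = 0.5 * (d_a + d_b)
            if mean == 0.0:
                # A residue compared with itself contributes the threshold value.
                total += theta
                continue
            # Long distances are down-weighted; relative deviations are penalized.
            weight = math.exp(-((mean / alpha) ** 2))
            total += (theta - abs(d_a - d_b) / mean) * weight
    return total


# Toy coordinates: a straight three-residue trace and a bent one.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
equivalences = [(0, 0), (1, 1), (2, 2)]

d_straight = distance_matrix(straight)
d_bent = distance_matrix(bent)
self_score = elastic_score(d_straight, d_straight, equivalences)
cross_score = elastic_score(d_straight, d_bent, equivalences)
```

In this toy case the self-comparison scores higher than the comparison with the bent trace, because every deviation between equivalent distances lowers the sum. Finding the set of equivalences that maximizes such a score is the hard part of the problem and is not attempted here.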
As a result, one obtains a scored list of the important structural alignments. The method allows gaps (i.e., insertions or deletions) of any length and detects similarities involving geometrical distortions.

Dali is easily accessible through web servers. Table 1 outlines the relationships of the Dali resources. Use the DaliLite server to compare two known structures to each other and to visualize the superimposition (Basic Protocol 1: Interactive DaliLite server for pairwise comparison). This server requires two sets of atomic coordinates in PDB format as input. The comparison is usually quite fast, and results should be returned after about one minute. A search against all known structures takes much longer and can be performed using the Dali server (Basic Protocol 2: Dali e-mail server for database searching). This server is routinely used by protein crystallographers to compare a newly solved structure against the database of known structures in order to detect possible evolutionary relationships. If you are interested in the structural neighbours of proteins already in the PDB, you can find them in the Dali database. Its web interface allows you to browse the hierarchical classification of protein structures based on all-against-all comparisons of known structures (Basic Protocol 3: Dali database). If you have many query structures, you may wish to download the DaliLite standalone program package. It uses the same comparison algorithms as the Dali web servers but can be run locally on Linux computers (Alternate Protocol 1: Comparing two structures using DaliLite; Alternate Protocol 2: Comparing large sets of structures using DaliLite; Support Protocol: Obtaining DaliLite).

BASIC PROTOCOL 1

Protocol Title: Interactive DaliLite server for pairwise comparison

Introduction: This interactive web server provides a quick, convenient means to check the structural alignment of two known protein structures and to visualize their structural superimposition. You need only know the PDB identifiers of the structures. It is also possible to upload your own structures. A fast server can be accessed at http://www.ebi.ac.uk/dalilite/.

Necessary Resources
Hardware: A computer connected to the Internet.
Software: A web browser (Internet Explorer, Netscape, etc.). Rasmol or another PDB viewer.
Files: None. (User PDB files are optional.)

Protocol Steps:

1. You need two inputs to run this server; they are called the First and Second structures on the submission page. You can either enter PDB entry codes (for known structures) or upload your own coordinate files in PDB format. You can look up the PDB entry codes of known structures for your query protein using NCBI-Entrez, SRS and other similar database cross-linking resources. If you have a structure file containing several chains, you can select a specific chain on the submission page. If no chain is specified, structural comparisons are performed on every chain in the structure file, and it takes much longer to return your results. Size limits for the comparison are at least 30 and at most 1000 amino acid residues per chain.

The results summary page looks like Figure 1. For each chain in the query structure, a table shows the significant hits against each chain of the subject structure, with the best hit for each chain highlighted. Note that the First structure is named mol1, the Second is mol2, chain A of the First structure is mol1a, and so on. Suboptimal alignments are reported; the highest-scoring alignment for each pair of chains is highlighted. The tables show Z-scores, the number of aligned residues, the root-mean-square deviation (RMSD) of alpha-carbon atoms, and the sequence identity between the two chains. Links are then given for:

a. the structural alignment, including DSSP secondary structure information, between the indicated chains (Figure 2);

b. a coordinates file of the superposed alpha-carbon traces for the indicated chains, viewable in Rasmol or another PDB structure viewer (Figure 3). Only the C-alpha coordinates are transmitted, so use the backbone display in Rasmol. Note that the First structure chain is renamed Q and the Second structure chain is renamed S;

c. the First structure file (unchanged), followed by the Second structure file with all ATOM coordinates of the indicated chain rotated and translated to match the First structure. To view the full superposition, either open both files in your structure viewer, or concatenate the two files and view the resulting file.
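The concatenation mentioned in item c can be scripted. The following Python sketch is one possible way to do it, not part of the DaliLite distribution; the function names are hypothetical, and the optional chain filter relies only on the fixed-column PDB convention that column 22 of an ATOM or HETATM record holds the chain identifier.

```python
def select_chain_records(pdb_text, chain_id=None):
    """Return ATOM/HETATM lines, optionally restricted to one chain."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Column 22 (string index 21) holds the chain identifier.
            if chain_id is None or (len(line) > 21 and line[21] == chain_id):
                kept.append(line)
    return kept


def concatenate_structures(first_text, second_text,
                           first_chain=None, second_chain=None):
    """Join the coordinate records of two PDB files into one PDB-style text."""
    records = select_chain_records(first_text, first_chain)
    records.append("TER")
    records.extend(select_chain_records(second_text, second_chain))
    records.extend(["TER", "END"])
    return "\n".join(records) + "\n"


# Tiny hand-made example with two one-atom chains (not real coordinates files).
atom_a = "ATOM  %5d  CA  ALA %s%4d" % (1, "A", 1)
atom_b = "ATOM  %5d  CA  GLY %s%4d" % (2, "B", 1)
example = atom_a + "\n" + atom_b + "\n"
merged = concatenate_structures(example, example, "A", "B")
```

With real files, read the First and (rotated) Second structure files into strings and write the returned text to a new file for viewing. If both structures use the same chain identifier, some viewers may treat them as one chain, so colouring by file or renaming chains may be needed.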

You can build a superimposition of multiple Second structures onto the same First structure. This is useful for studying a large superfamily that has many distantly related known structures. Essential and variable structural elements are easily seen in the multiple superimposition. This option preserves ligands that might have been co-crystallised with the protein and also shows quaternary structure interactions. Note that only the indicated chains are superposed (e.g., mol1a with mol2b). Any other chains are still contained in the structure files, so you may wish to remove unwanted chains using a text editor before viewing the structures.

The following files can also be viewed: the rotation/translation matrices for each alignment, a list of structurally equivalent residue ranges, and a log file indicating all the steps taken by the DaliLite application. These are included for completeness but are uninformative to most users.

Finally, at the bottom of the results page, a summary of your two inputs is given, including header information and a report of the chains found within each structure file. If these data are not as expected, the file upload (rather than the program itself) has probably failed.

BASIC PROTOCOL 2

Protocol Title: Dali e-mail server for database searching

Introduction: The Dali server is an easy-to-use network service for comparing protein structures. It is routinely used by structural biologists to compare a newly solved structure against previously known structures. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. You submit the coordinates of a query protein structure, and Dali compares them against those in the Protein Data Bank. A multiple alignment of structural neighbours is mailed back to you. If you want to know the structural neighbours of a protein already in the Protein Data Bank, you can find them in the Dali database (Basic Protocol 3). The Dali server is hosted by the EBI (http://www.ebi.ac.uk/dali).

Necessary Resources
Hardware: A computer connected to the Internet.

Software: A web browser (Internet Explorer, Netscape, etc.).
Files: Atomic coordinates of a protein structure in PDB format.

Protocol Steps:

1. Structure submission can be done either interactively or by e-mail.

a. Upload your coordinate file through the web page http://www.ebi.ac.uk/dali/interactive.html and press the Submit button. The results are sent to the e-mail address that you provide, so type it carefully.

b. E-mail a message containing the PDB entry to dali@ebi.ac.uk. The message must be plain text; encoded messages (e.g. MIME or BinHex) are rejected by the server.

2. You will receive an e-mail with the results. Expect a reply within a few days of submission; in case of longer delays, please notify dali-help@ebi.ac.uk. The comparison is carried out against a representative subset of PDB structures. The set is constructed so that the sequence identity between any two chains in the set is less than 25%. The summary of structural neighbours looks like Figure 7.

3. Use the DaliLite server for pairwise comparison (Basic Protocol 1) to visualize interesting pairs of structures.

BASIC PROTOCOL 3

Protocol Title: Dali database

Introduction: The Dali database is based on exhaustive all-against-all 3D structure comparison of the protein structures currently in the Protein Data Bank (PDB). The classification and alignments are automatically maintained and continuously updated using the Dali search engine. The database currently (Sep 2005) contains 10,562 representative structures. The set of representative structures is called PDB90; it contains all polypeptide chains from the PDB with less than 90% sequence identity to each other. The representative structures are decomposed into 14,020 domains. Hierarchical clustering reveals 3,107 fold types. Fold types are defined as clusters of structural neighbours in fold space with average pairwise Dali Z-scores above 2. The threshold has been chosen empirically and groups together structures that have topological similarity. Higher Z-scores correspond to structures that agree more closely in architectural detail.

The Fold Index lists all chains in PDB90 ordered by structural similarity. The order is that of a dendrogram derived in the hierarchical clustering. Fold types are indexed. A heavier branch with more members is listed above a branch with fewer members. Domains that are structural neighbours are found next to each other. Fold types with similar structural motifs are also found next to each other.

Necessary Resources
Hardware: A computer connected to the Internet.
Software: A web browser.
Files: None.

Protocol Steps:

1. Browsing: The Dali database is accessed from http://www.bioinfo.biocenter.helsinki.fi/dali/start. You can enter the fold classification from the Fold Index or by querying for a text term that occurs in the COMPND records of the PDB entries (Figure 4). More sophisticated queries should be performed using specialized search engines such as NCBI-Entrez or SRS.

Figure 5 shows the result of a query for estradiol receptor. The leftmost column shows that there are two PDB entries for estradiol receptor, namely 1qkt and 1qku. The latter has three chains named A, B and C. The second column indicates that chain 1qkuA is the representative of all of them. The third column shows that 1qkuA belongs to domain fold class 342. Clicking on the fold link shows a section of the Fold Index. Here, you see all members of the fold class at a glance (Figure 6).

Domains in the Fold Index are annotated by the sequence family that they belong to. Sequence families are defined in the Adda database (Heger & Holm, 2003) based on shared sequence motifs. Adda unifies many structural neighbours that have little overall sequence similarity in terms of percent identity. As can be seen from Figure 6, the nuclear receptors are unified by Adda into one family.

The interact link shows details about the structural neighbours of each domain. The list of neighbours of estradiol receptor is shown in Figure 7. Structural alignments between estradiol receptor and its neighbours can be displayed as 1D alignments or in 3D superimposition. Select a few structures by clicking on their check-boxes. The Structure Alignment button shows a multiple structure alignment similar to a sequence alignment. Secondary structure definitions are shown below the amino acid sequences. Typically, secondary structure assignments agree very well even though sequence identity is low (Figure 8).

The Structure/Sequence alignment button augments the structural alignment with related sequences, which are detected by PSI-Blast and stored in the Adda database (Heger & Holm 2003). This view is useful for checking sequence patterns that are conserved across distantly related protein families. Conserved functional sites are a strong hint of common evolutionary origin. In the alignment, residues are coloured if the frequency of the amino acid type in the column is above 50%.

The superimposed C-alpha traces of the selected structures can be viewed in 3D using Rasmol or another PDB viewer. The 3D superimposition button launches a Rasmol script if your browser is appropriately configured. Use the PDB format button to download the C-alpha coordinates of the selected neighbours superimposed onto the query structure.

2. External links: External sites may link directly to the query engine of the Dali database. To make a link from a PDB identifier to the database, use the call http://www.bioinfo.biocenter.helsinki.fi:8080/daliquery?search_term, where search_term is a PDB identifier (e.g. 2kau or 2kauC).

3. Data downloads: For non-interactive use, we provide comprehensive computer-readable database dumps for large-scale studies. These are accessed from the Downloads link on the home page of the Dali database.

ALTERNATE PROTOCOL 1

Protocol Title: Comparing two structures using DaliLite

Introduction: This simple protocol is the command-line version of the comparison performed online by the DaliLite server for pairwise structure comparison (Basic Protocol 1). The inputs are two protein structures in PDB format. The output is a set of HTML files, which should be viewed in a browser. A pairwise comparison takes roughly from a few seconds up to tens of seconds.

Necessary Resources
Hardware: Linux workstation (Sun, Alpha, Silicon Graphics, PC).
Software: DaliLite program, Perl interpreter, web browser (Netscape, Internet Explorer, Opera, etc.).
Files: Two protein structures in PDB format files.

Protocol Steps:

To run the comparison, use DaliLite -pairwise <pdbfile1> <pdbfile2>, where the arguments <pdbfile1> and <pdbfile2> should be replaced by the PDB file names, for instance:

    Linux-prompt> perl DaliLite -pairwise /pdb/1wsy.brk /pdb/2kau.brk > log
    Linux-prompt> netscape index.html

The program computes the structural alignments for all chains in pdbfile1 against all chains in pdbfile2 and creates a set of HTML pages linked from the top page 'index.html'. The first structure is called 'mol1' and the second 'mol2'. All data are stored in the current working directory, overwriting any previous results generated using this option. The output is identical to that from Basic Protocol 1 (Figures 1-3).

ALTERNATE PROTOCOL 2

Protocol Title: Comparing large sets of structures using DaliLite

Introduction: This is a more advanced protocol that allows the systematic comparison of large sets of structures. It performs the structural comparisons between all pairs of structures from two user-provided lists. The results are stored in an internal alignment format that can be processed by computer programs for further statistical analysis. There is an option to re-format the results as human-readable output.

Necessary Resources
Hardware: Linux workstation (Sun, Alpha, Silicon Graphics, PC).
Software: DaliLite program, Perl interpreter.
Files: Protein structures in PDB format files.

Protocol Steps:

1. All structures that you want to compare must first be prepared using the -readbrk option. The resulting structural data are stored in a DAT subdirectory under the DaliLite home directory. You must supply a unique identifier for the structure as the second argument. The identifier must be PDB-style, i.e., four characters long.

    Linux-prompt> perl DaliLite -readbrk <pdbfile> <pdbid>

Examples:

    DaliLite -readbrk 3ubp.brk 3ubp
    DaliLite -readbrk /data/pdb/3ubp.brk 3ubp
    DaliLite -readbrk /data/pdb/pdb3ubp.ent 3ubp

The program automatically generates a data file for each chain in the PDB entry. In the above examples, 3ubpA.dat, 3ubpB.dat and 3ubpC.dat are created in the DAT subdirectory. The system uses the DSSP program by Kabsch and Sander (included in the distribution package) to parse the information out of the PDB file. DSSP requires that the complete backbone (N, CA, C, O atoms) is present; otherwise it skips the residue. The MaxSprout server (http://www.ebi.ac.uk/maxsprout) can be used to build full coordinates from a C-alpha trace.

The DAT file includes information about the CA coordinates, primary structure, secondary structure elements (by program DSSP, Kabsch & Sander 1986) and putative folding pathway of the protein (by program PUU, Holm & Sander 1994). The first line of a properly formed DAT file looks like this:

    >>>> 1xg8A 108 7 3 4 EHEHEEH

    Field      Meaning
    1xg8A      chain identifier
    108        number of residues
    7          total number of secondary structure elements
    3          number of helices (H)
    4          number of beta-strands (E)
    EHEHEEH    order of secondary structure elements

If reading the coordinates failed for any reason, you find only zeros on the first line of the DAT file.
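When many structures are imported, it can be convenient to check the first line of each DAT file automatically. The Python sketch below is an illustration only: the function and field names are hypothetical, and the field order is read off the example header shown above (chain identifier, residues, total elements, helices, strands, element order) on the assumption that the fields are separated by whitespace.

```python
from collections import namedtuple

# Field names are illustrative; they are not defined by DaliLite itself.
DatHeader = namedtuple(
    "DatHeader",
    "chain_id n_residues n_sse n_helices n_strands sse_order",
)


def parse_dat_header(line):
    """Parse the first line of a DAT file into a DatHeader."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != ">>>>":
        raise ValueError("not a DAT header line: %r" % line)
    # The element-order string is absent when there are no elements.
    order = fields[6] if len(fields) > 6 else ""
    return DatHeader(fields[1], int(fields[2]), int(fields[3]),
                     int(fields[4]), int(fields[5]), order)


def coordinates_were_read(header):
    """A header reporting zero residues suggests that reading failed."""
    return header.n_residues > 0


def counts_are_consistent(header):
    """Check that the H/E counts agree with the element-order string."""
    return (header.sse_order.count("H") == header.n_helices
            and header.sse_order.count("E") == header.n_strands
            and len(header.sse_order) == header.n_sse)


header = parse_dat_header(">>>> 1xg8A 108 7 3 4 EHEHEEH")
```

For the example header, the three helices and four strands reported in the counts match the letters of `EHEHEEH`, which is a quick sanity check that the file was written correctly.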

2. Generate structural alignments. There are options for pairwise, one-against-many, and many-against-many comparisons. The structures are specified using the unique identifiers that were introduced in the previous step when reading in PDB structures with the -readbrk option. Pairwise alignments of two structures are generated using exhaustive search (Parsi method). If the query structure has few secondary structure elements, the Soap method is used. Monte Carlo optimization is used for refinement (see Table 2).

Alignment data are output to <code>.dccp files. An optimal and a number of suboptimal structural alignments are reported for each pair of structures. Similarities with a Z-score below zero are omitted from the output. The format is explained below:

    DCCP 1 93.9 1.8 33 3.6 39 1 1ppt 1bba
    alignment
    1 33
    1 33

    Field      Meaning
    93.9       raw similarity score
    1.8        root mean square deviation, in Angstroms, of CAs
    33         number of structurally equivalent residues
    3.6        Z-score
    39         sequence identity
    1          number of aligned blocks
    1ppt       first structure
    1bba       second structure

The first line after the alignment keyword lists the start and end residues of each aligned block in the first structure; the next line lists the start and end residues of each aligned block in the second structure.

If you want to construct a similarity matrix of a large set of proteins, you can extract the DCCP lines from the alignment data files (*.dccp). Note that several alternative alignments may be reported per protein pair.

DaliLite has four options for alignment. The simplest is pairwise alignment (-align option), which takes two chain identifiers as arguments, for example:

    Linux-prompt> perl DaliLite -align 3ubpC 1gkpA

The arguments are the unique identifier with the chain identifier appended. Output (alignment data) is automatically appended to the alignment file <code>.dccp.

You may also prepare a list of chain identifiers in a file, and the program will perform a pairwise comparison of the query to each structure in the list. For example, the list file mylist may have the following contents:

    1bf6A
    1j79A
    1a4mA
    1k70A
    3ubpC

The command to compare 3ubpC against each entry in the list file is then:

    Linux-prompt> perl DaliLite -list 3ubpC mylist

There is also an option for all-against-all comparison:

    Linux-prompt> perl DaliLite -AllAll mylist

The database search option (-search) uses the same shortcuts as the Dali server. Note that this option depends on an up-to-date list of representative structures and the complete database of pre-computed structural alignments. This database resides in the DCCP/ subdirectory. Updates of the database are available for download: click the Downloads link on the home page of the Dali database, http://www.bioinfo.biocenter.helsinki.fi/dali/start.

3. Convert the alignment file to a readable format using the -format option. The output of the alignment options is in DaliLite's internal format (files with the extension .dccp). The arguments to the -format option are the identifier of the query structure, the alignment data file, a list file of valid identifiers, and the name of the output file. Only comparisons to structures listed in the list file are output. For example:

    Linux-prompt> perl DaliLite -format 3ubpC 3ubpC.dccp representatives.list 3ubpC.html

The output file is in HTML format. It contains the list of structural neighbours and links to the structural alignments (similar to Figure 2).

SUPPORT PROTOCOL

Protocol Title: Obtaining the DaliLite standalone program

Introduction: DaliLite is a stand-alone program package that helps researchers compare large numbers of protein structures efficiently and locally for specialized projects. The DaliLite distribution is a self-contained package of scripts and programs written in Perl and Fortran77. It has been tested on the Linux operating system (RedHat distribution, version 6.0) and on Cygwin, a Linux-like environment for Microsoft Windows (http://cygwin.com).

The program code is distributed to academic users. Commercial use is prohibited.

Necessary Resources
Hardware: Linux workstation.
Software: Fortran-77 compiler, Perl5 interpreter.
Files: None.

Protocol Steps:

1. Download the academic licence agreement from http://www.bioinfo.biocenter.helsinki.fi/dali_lite/downloads, print it, sign it and fax it to the address indicated.

2. Download the DaliLite program package by clicking on the link at the top of the above web page. The current distribution version (spring 2005) is 2.4.1.

3. Complete instructions for compilation and installation are available in the INSTALL file included in the DaliLite distribution. Instructions on where to obtain the necessary software resources are also included in the INSTALL file. Test examples are included in the distribution package. In brief overview, the installation proceeds as follows:

    ... Unpack the distribution package:
    Linux-prompt> tar -zxvf DaliLite_2.4.1.tar.gz
    Linux-prompt> cd ./dalilite_2.4.1/bin
    ... If you are using cygwin (Linux emulator for Windows):
    Linux-prompt> mv -f Makefile_cygwin Makefile
    ... Use a text editor to set proper HOMEDIR and ESCAPED_HOMEDIR in Makefile
    Linux-prompt> make clean
    Linux-prompt> make install
    Linux-prompt> make test
    Linux-prompt> cd ../
    Linux-prompt> ./DaliLite -help

GUIDELINES FOR UNDERSTANDING RESULTS:

As in sequence analysis, the goal of structural database searching is usually to identify homologous proteins that might provide clues to the function of the query protein. Homology means descent from a common ancestor. We can infer homology from sequence or structural similarities that are so strong that they would not be expected to have arisen by chance.

The structural neighbours reported by Dali are ranked in order of decreasing structural similarity (Z-score). The Z-score is the most important measure of the quality of the structural alignment. Homologous proteins cluster at the top of the ranked list, but the boundary between homologous and unrelated proteins varies from one family to another. As a general rule, a Z-score above 20 means the two structures are definitely homologous, between 8 and 20 means the two are probably homologous, between 2 and 8 is a grey area, and a Z-score below 2 is not significant. The size of the proteins influences Z-scores: small structures tend to have small Z-scores, whereas a medium Z-score for very large structures need not imply a biologically interesting relationship. Fold type also has an effect: α/β proteins usually have higher Z-scores than all-β proteins.

Homologous proteins often share significant functional similarities. You should try to place the query structure in the context of a fold similarity dendrogram like Figure 6 before transferring function. There is always a best hit. Reciprocal nearest neighbours suggest more similar functions than if your query joins a whole branch of functionally diverse proteins. For example, in the receptor dendrogram (Figure 6), sex hormone receptors form one sub-cluster, while the orphan receptor is about equidistant from all the other receptors.

RMSD is a measure of the average deviation in distance between aligned alpha-carbons. For sequences sharing 50% identity, this should be around 1.0 Å.
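The Z-score rule of thumb can be applied mechanically to DCCP records from Alternate Protocol 2, for example when screening many comparisons. The Python sketch below is illustrative only. The function names are hypothetical, the field order (raw score, RMSD, equivalent residues, Z-score, sequence identity, aligned blocks, first structure, second structure) is an assumption based on the example record in that protocol, and the handling of values exactly on a threshold is an arbitrary choice because the text does not specify it.

```python
def parse_dccp_line(line):
    """Split one 'DCCP ...' summary line into named fields (assumed order)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "DCCP":
        raise ValueError("not a DCCP summary line: %r" % line)
    return {
        "raw_score": float(fields[2]),
        "rmsd": float(fields[3]),            # Angstroms, over aligned C-alphas
        "n_equivalent": int(fields[4]),      # structurally equivalent residues
        "z_score": float(fields[5]),
        "identity_percent": float(fields[6]),
        "n_blocks": int(fields[7]),
        "first": fields[8],
        "second": fields[9],
    }


def interpret_z_score(z):
    """Map a Z-score onto the rule-of-thumb categories given in the text."""
    if z > 20:
        return "definitely homologous"
    if z >= 8:
        return "probably homologous"
    if z >= 2:
        return "grey area"
    return "not significant"


record = parse_dccp_line("DCCP 1 93.9 1.8 33 3.6 39 1 1ppt 1bba")
verdict = interpret_z_score(record["z_score"])
```

For the example record, the Z-score of 3.6 falls in the grey area, which is consistent with the remark that small structures tend to have small Z-scores. Such a label is only a first filter; the caveats about protein size and fold type still apply.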
Dali maximizes a geometrical similarity score which is defined in terms of similarities of intra-molecular distances and is thus not primarily aiming to generate alignments with low RMSD. The RMSD and the number of equivalent residues (NE) are reported because they are traditional measures. Note that an alignment is better if it has both smaller RMSD and larger NE. If both RMSD and NE are smaller, or both are larger, it is not possible to establish an order between the alignments.

It is generally assumed that if two sequences share over 40% identity, then they are unambiguously homologous. However, two distantly-related proteins may share very low sequence identity but still be homologous, and conversely, two sequences may locally share as much as 30% identity but be unrelated. Therefore, the percentage of sequence identity is only a guide.

Beyond the numbers, it is often informative to inspect, using Rasmol or another graphics program, whether the structurally equivalent regions form a continuous, compact structural core. If there are many structures known in a super-family, you can see secondary structure elements line up consistently in the multiple structure alignment view (Figure 8). Check especially for the conservation of known active site residues. You can study conservation profiles in multiple sequence alignments of protein families in sequence classification databases such as ADDA (http://www.bioinfo.biocenter.helsinki.fi/sqgraph/pairsdb) or PFAM (http://www.sanger.ac.uk/pfam). Enzyme super-families have sharp signatures, but binding domains can have very little sequence similarity. Without a sequence signature, it is harder to establish homology.

COMMENTARY

Background Information

Improved methods of protein engineering, crystallography and NMR spectroscopy have led to a surge of new protein structures deposited in the Protein Data Bank (PDB). At the end of 2004, the PDB contained over 28,000 protein structures, and the structural genomics initiative aims to provide a structure for each major protein family within a decade. This wealth of data needs to be organised and correlated using automated methods. Nearly all proteins have structural similarities to other proteins. General similarities arise from principles of physics and chemistry that limit the number of ways in which a polypeptide chain can fold into a compact globule. Evolutionary relationships result in surprising similarities (which are even stronger than similarity due to convergence caused by physical principles).
Because structure tends to diverge more conservatively than sequence during evolution, structure alignment is a more powerful method than pairwise sequence alignment for detecting homology and aligning the sequences of distantly-related proteins. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences and may help to infer functional properties of hypothetical proteins. Automatic methods enable exhaustive all-against-all structure comparisons. As a result, each structure in the PDB can be represented as a node in a graph where similar structures are neighbours of each other and structurally unrelated proteins are not neighbours. Clustering the graph at different levels of granularity removes redundancy and aids navigation in protein space. At long range, the overall distribution of folds is dominated by secondary structure composition (for example, all-alpha or alternating alpha/beta). At intermediate range, clusters are related by shape similarity that does not necessarily reflect similarity of biological function (for example, globins and colicin A). At close range, clusters represent protein families related through strong functional constraints (for example, hemoglobin and myoglobin). Evolutionary relationships can be recovered by searching for continuous neighbourhoods (Dietmann & Holm 2001).

In order to identify natural groupings of any set of objects, one needs a measure of distance or similarity. Structure comparison programs derive a structural alignment, which maximizes similarity or minimizes distance. The alignment defines a one-to-one correspondence of amino acid residues (sequence positions) in two proteins. This is analogous to sequence alignment, except that the notion of (dis)similarity is much more complex between three-dimensional objects than between linear strings. For example, the conformation of a point mutant differs from that of the wild-type protein only locally and only by a few tenths of an Ångström. Much larger deviations are observed in pairs of homologous proteins: with increasing sequence dissimilarity, small shifts in the relative orientations of secondary structure elements accumulate and reach several Ångströms and tens of degrees. At the largest evolutionary distances, only the topology of the fold or folding motif is conserved; topology here means the relative location of helices and strands and the loop connections between these. Deviations can be even larger and qualitatively different when structural similarity is the result of convergent rather than divergent evolution.
In particular, convergent evolution may result in similar 3D folds that differ in the topology of loop connections. The modular architecture of proteins presents another complication. Large proteins can be decomposed into semi-autonomous, globular folding units called domains. Domains are often evolutionarily mobile modules and may carry specific biological functions. Because a common domain may be surrounded by completely unrelated domains, most structure comparison methods search for local similarities.

Given a measure of similarity or distance, the algorithmic problem is to find the set of corresponding points in two structures that optimises this target function. Just as there is much latitude in the formulation of the structure comparison problem, many different types of optimization algorithm have been employed. Similarity measures of the sum-of-pairs form and subgraph isomorphism formulations of the structure comparison problem belong to the NP-complete class of problems, and one has to resort to heuristics for practical algorithms. Heuristic approaches do not aim for provably correct solutions, gaining computational performance at the potential cost of accuracy or precision. Many programs use a hierarchical approach, where promising seeds for alignment are identified using local criteria based on dynamic programming, distance difference matrices, maximal common subgraph detection, fragment matching, geometric hashing, unit vector comparison or local geometry matching (reviewed by Sierk & Kleywegt 2004). The initial set of correspondences is then optimised globally using methods such as double dynamic programming, Monte Carlo algorithms or simulated annealing, a genetic algorithm or combinatorial searching. Recently, it has been proved that brute-force exhaustive scanning of the six degrees of freedom from rotations and translations in rigid-body superimposition leads to a polynomial-time approximation algorithm for the problem of determining the maximum number of C-alpha atom pairs that can be superimposed within a given RMSD at a given error. However, this solution is too computationally demanding for practical application (Kolodny & Linial 2004).

The Dali method is based on a sensitive measure of geometrical similarities defined as a weighted sum of similarities of intra-molecular distances (see Appendix for details). 3D shape is described with a matrix of all intramolecular distances between the C-alpha atoms. Such a distance matrix is independent of coordinate frame but contains more than enough information to reconstruct the 3D coordinates, except for overall chirality, by distance geometry methods. Imagine sliding a (transparent) distance matrix on top of another one. Depending on the register of the two matrices, similar substructures will stand out as submatrices with similar patterns.
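
The two properties of the distance matrix mentioned above (independence of the coordinate frame, and blindness to overall chirality) can be checked numerically. The following Python sketch is not part of the Dali code: the function names and the four coordinates are made up for illustration, and only the standard library is used.

```python
import math


def distance_matrix(coords):
    """Return the full matrix of pairwise distances between C-alpha positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rotate_z_and_shift(coords, angle, shift):
    """Apply a rigid-body move: rotation about the z axis, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


def max_abs_difference(m1, m2):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


# Hypothetical C-alpha coordinates in Angstrom, invented for this example.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.1, 4.2, 1.3)]
moved = rotate_z_and_shift(ca, angle=1.1, shift=(10.0, -4.0, 2.5))
mirrored = [(x, y, -z) for x, y, z in ca]  # mirror image: opposite chirality

reference = distance_matrix(ca)
diff_moved = max_abs_difference(reference, distance_matrix(moved))
diff_mirrored = max_abs_difference(reference, distance_matrix(mirrored))

if __name__ == "__main__":
    print("rigid-body move:", diff_moved)
    print("mirror image:   ", diff_mirrored)
```

Both differences vanish up to floating-point rounding. The mirror image has the same distance matrix as the original, which is why chirality cannot be recovered from distances alone.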
Structurally equivalent regions can be filtered out with a fixed cutoff on acceptable differences of intramolecular distances or, as we prefer, with a continuous function defined in terms of relative distance deviations. The common structure is revealed when two distance matrices are brought into register by keeping only the rows and columns corresponding to the structurally equivalent residues (Figure 9).

The Dali program has a modular architecture, where the structure alignment / database searching problem is approached by a cascade of algorithms. The Dali package consists of many Fortran programs and Perl5 scripts. The program flow is controlled by a Perl wrapper script that calls other programs as needed. Each program implements pairwise structure comparisons using different algorithms. References for these programs are given in Table 2. The goal of a database search is to find all structures that are significantly similar to the query. A conceptual map of fold space is determined by the pre-computed all-against-all structural alignments between all representative structures. Based on this map, the database search by the Dali server tries shortcuts to quickly place the query structure in a known location of fold space. If a strong match is found to one database structure, then the search can be restricted to the pre-computed neighbourhood of this structure. Fast but approximate methods can quickly find obvious structural resemblances. Slower but more sensitive algorithms then need only be applied to a smaller set of candidates. DaliLite has the core algorithmic functionality of the Dali server. The DaliLite programs perform systematic pairwise comparisons without shortcuts and can therefore be run independently of database updates.

Applications

The exponential growth in the number of newly solved protein structures makes correlating and classifying the data an important task. Dali is now used routinely by crystallographers world-wide to screen the database of known structures for similarity to newly-determined structures. The application of Dali to newly released structures led to a string of discoveries of unexpected distant evolutionary relationships. For example, a remarkably diverse set of distant relatives of urease was identified based on structural and sequence analysis (Holm & Sander 1997); several blind fold predictions have since been verified by experimental structure determination.

Comparison to other techniques

Dali was ranked at the top among seven protein structure comparison methods and two sequence comparison programs that were evaluated on their ability to detect either protein homologues or domains with the same topology (fold) as defined by the CATH structure database (Novotny et al. 2004).
Critical Parameters

The Dali program has been run successfully with default parameters since its inception (Holm & Sander 1993). The results usually agree quite well with the assessments of human experts. For example, the dendrogram of structural similarities by Dali has a similar topology to the SCOP hierarchical classification based on visual analysis and biological knowledge (Dietmann & Holm 2001).

While we strongly advise against changing parameter values from their default values, a description of the numerical parameters that go into the algorithms is given in the Appendix.

Troubleshooting

Similarity not reported. The Dali system reports only similarities above an empirically chosen threshold of Z=2. This captures most cases of topological similarity of globular domains. In some fold types, though, structural similarities between parts of globular domains also score above this threshold.

Known similarity not reported. The Dali server currently reports similarities only to PDB25 representatives. The purpose of using PDB25 is to suppress the redundancy of output due to multiple structure determinations of mutants or of the same protein in slightly differing conditions. Thus, a particular PDB entry, which you know to be structurally similar to the query, might appear to be missing from the output list only because the representative structure is a different PDB entry. The Dali database reports similarities between PDB90 representatives. The PDB90 representatives for any PDB entry can be found by using the search functionality on the homepage of the Dali database (http://www.bioinfo.biocenter.helsinki.fi/dali).

Empty result. The Dali database includes all peptide chains from the PDB, except Cα-only entries and chains that are shorter than 30 residues. DaliLite requires that the backbone atoms (N, CA, C, O) be complete. You can build a complete backbone model from the CA-trace using the MaxSprout Server. The Dali server runs MaxSprout automatically if only a CA-trace is submitted. The submission to the Dali server will fail unless the message is plain text, as encoded messages (e.g. MIME or BinHex) are rejected by the server.

Complex comparison. Each chain is compared separately. For example, similarities to structural units made up of a dimer of two different chains (say, A and B) will not be detected. There is a way around this limitation, which requires manual editing of the PDB entry by the user: renumber the residues in sequential order and give all chains the same chain identifier.

Multidomain proteins. It is advisable to break a multidomain query structure into its constituent domains, because the Dali server is designed to report all matches only to the first-found structural neighbourhood. That is, if the query protein has one common domain that is found by the fast filters, the search termination criteria are satisfied without a more unique domain in the same query being tested systematically.
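
The manual edit described under "Complex comparison" (sequential residue numbers and a single chain identifier) can also be scripted. The following Python sketch is one possible way to do it and is not part of the Dali distribution. It assumes standard fixed-column PDB ATOM/HETATM records (chain identifier in column 22, residue number in columns 23-26, insertion code in column 27), and it drops TER records, which is an assumption made so that the result reads as one chain. The helper `atom_line` is hypothetical and only builds minimal records for the demonstration.

```python
def merge_chains(pdb_lines, chain_id="A"):
    """Relabel every ATOM/HETATM record with one chain identifier and
    renumber residues sequentially, starting from 1."""
    merged = []
    previous_key = None
    residue_number = 0
    for line in pdb_lines:
        record = line[:6]
        if record in ("ATOM  ", "HETATM"):
            # Fixed PDB columns: 22 = chain, 23-26 = residue number,
            # 27 = insertion code (0-based slices 21, 22:26 and 26).
            key = (line[21], line[22:27])
            if key != previous_key:
                residue_number += 1
                previous_key = key
            # The insertion code is blanked because numbering is now unique.
            line = f"{line[:21]}{chain_id}{residue_number:>4d} {line[27:]}"
        elif record.startswith("TER"):
            # Assumption: chain terminators are dropped in the merged entry.
            continue
        merged.append(line)
    return merged


def atom_line(serial, name, res_name, chain, res_seq):
    """Hypothetical helper: build a minimal ATOM record for the demo."""
    return (f"ATOM  {serial:>5d} {name:<4s} {res_name:>3s} {chain}{res_seq:>4d}"
            f"    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


example = [
    atom_line(1, " N  ", "MET", "A", 10),
    atom_line(2, " CA ", "MET", "A", 10),
    "TER",
    atom_line(3, " CA ", "GLY", "B", 5),
    atom_line(4, " CA ", "SER", "B", 6),
]
merged = merge_chains(example)

if __name__ == "__main__":
    for line in merged:
        print(line)
```

In the demonstration, the two atoms of residue 10 in chain A become residue 1, and the two residues of chain B become residues 2 and 3 of the same chain. Real entries with alternate locations or more than 9999 residues would need extra care that this sketch does not attempt.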

Which Z-score threshold implies homology? This varies for each protein family (Dietmann & Holm 2001). The topology of the fold dendrogram (hierarchical clustering of domains based on structure similarity) represents evolutionary relationships fairly faithfully, so that homologous structures are found collected in one branch of the tree, but the borders of the homologous families might lie at Z-scores around 4 (helix-turn-helix DNA-binding domains) or around 14 (TIM barrels).

Technical failures. The Dali server at the EBI is running automatically with minimal human administrative effort. The assumption that the fold space graph is complete is critical to exhaustive database searching but can sometimes be violated for the following reasons: unpredictable failure of the database update (black-outs, computer crashes, network failures, over-running disk space, etc.), failure to process the PDB entry (for example, chains longer than 1000 residues are not handled well), and program bugs. Please report unexpected behaviour to dalihelp@ebi.ac.uk.

LITERATURE CITED

Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5, 823-826.
Dietmann S, Holm L (2001) Identification of homology in protein structure classification. Nature Structural Biology 8, 953-957.
Heger A, Holm L (2003) Exhaustive enumeration of protein domain families. J Mol Biol 328, 749-767.
Holm L, Sander C (1994) Parser for protein folding units. Proteins 19, 256-268.
Holm L, Sander C (1996) Mapping the protein universe. Science 273, 595-602.
Holm L, Sander C (1997) An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins 28, 72-82.
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.
Kolodny R, Linial N (2004) Approximate protein structural alignment in polynomial time. PNAS 101, 12201-12206.
Novotny M, Madsen D, Kleywegt GJ (2004) Evaluation of protein fold comparison servers. Proteins 54, 260-270.
Sierk ML, Kleywegt GJ (2004) Deja vu all over again: finding and analyzing protein structure similarities. Structure 12, 2103-2111.

Key References (optional)

Holm L, Sander C (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol 233, 123-138.
The original Dali reference.

Holm L, Sander C (1996) Mapping the protein universe. Science 273, 595-602.
Reviews structure comparison methodology, key results and implications.

Holm L, Park J (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16, 566-567.
The main DaliLite reference, which should be cited in any publication in which you use DaliLite results.

Internet Resources (optional)

http://www.ebi.ac.uk/dalilite
The interactive DaliLite server for comparing two structures to each other and visualizing the structural superimposition.

http://www.ebi.ac.uk/dali
The Dali e-mail server for comparing a new structure against the database of known structures.

http://www.bioinfo.biocenter.helsinki.fi/dali
The Dali database for browsing structural and sequence neighbours of proteins.

http://www.bioinfo.biocenter.helsinki.fi/sqgraph/pairsdb
The ADDA classification assigns every residue of known protein sequences into a domain family and interactively visualizes the sequence neighbours of any query protein in a multiple alignment.

http://srs.ebi.ac.uk
http://www.ncbi.nlm.nih.gov/
SRS at the EBI and Entrez at NCBI are comprehensive search engines that cross-reference the PDB identifier of a protein to many other databases.

FIGURE LEGENDS

1. Results summary page of the DaliLite server.
2. Structural alignment by the DaliLite server.
3. Click on the Superimposed C-alpha traces link to view the superimposition in Rasmol (stereo view).
4. Clicking on the browse link in Figure 3 leads to the list of structural neighbours of estradiol receptor. Hits 1-21 are members of the same fold class comprising nuclear receptors. The last hit (number 22) has a much lower Z-score than the nuclear receptors and represents a biologically non-interesting hit that matches in a helical bundle motif.
5. Home page of the Dali database. The user has typed in Estradiol receptor in the query box.
6. The result of the query for estradiol receptor structures.
7. A large number of nuclear receptors belong to the same fold class as estradiol receptor. Where a sequence-structure-domain mapping is available, they have all been classified into the same Adda domain family (numbered 523).
8. Multiple structure alignment of estradiol receptor and selected structural neighbours. Notation: three-state secondary structure definitions by DSSP (reduced to H=helix, E=sheet, L=coil) are shown above the amino acid sequence.
9. Left: Distance matrix representation of two different proteins, one in the upper and the other in the lower triangle. Right: Structural alignment identifies a one-to-one correspondence between a subset of residues. The respective sub-matrices of the distance matrix display similar contact patterns.

Table 1: Overview of Dali resources and their relations.

Dali server
Input: One PDB structure
Steps: Database search using cascaded algorithms
Output: Structure neighbours of query
Protocol: Basic Protocol 2

DaliLite
Input: Two (lists of) PDB structures
Steps: Pairwise structure comparison
Output: Structure neighbours of query
Protocol: Basic Protocol 1, Alternate 1-2, Support protocol

Dali database
Input: All PDB structures
Steps:
- Remove redundancy
- All-against-all structure comparison
- Domain decomposition
- Clustering
Output: Protein fold classification
Protocol: Basic Protocol 3

Adda database
Input: NRDB (all protein sequences)
Steps:
- Remove redundancy
- All-against-all sequence comparison
- Domain decomposition
- Clustering
Output: Protein family classification
Protocol: Linked to Dali database

Table 2: Program modules of the Dali suite.

Program: DSSP (dsspcmbi)
Purpose: Parse PDB entry. Define secondary structure elements.
Reference: Kabsch W & Sander C (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-2637.

Program: Puu
Purpose: Derive a tree of compact substructures to guide alignment.
Reference: Holm L & Sander C (1994). Parser for protein folding units. Proteins 19:256-268.

Program: Wolf
Purpose: Very fast filter to identify obvious similarities.
Reference: Holm L & Sander C (1995). 3-D lookup: fast protein structure database searches at 90% reliability. ISMB'95:179-187.

Program: Soap
Purpose: Used to align structures with little secondary structure.
Reference: Falicov A & Cohen FE (1996). A surface of minimum area metric for the structural comparison of proteins. J Mol Biol 258(5):871-92.

Program: Parsi
Purpose: Sensitive branch-and-bound alignment algorithm.
Reference: Holm L & Sander C (1996). Mapping the protein universe. Science 273:595-602.

Program: Dalicon
Purpose: All alignments generated by the above methods (with different objective functions) are refined using a Monte Carlo algorithm that maximizes the Dali score.
Reference: Holm L & Sander C (1993). Protein structure comparison by alignment of distance matrices. J Mol Biol 233:123-138.

APPENDIX

A. OBJECTIVE FUNCTION

Here we describe the objective function of the Dali algorithm and the normalization of structural similarity scores to obtain the Z-score. Let us consider two proteins labeled A and B. The match of two substructures is evaluated using an additive similarity score S of the form:

Equation 1: $S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i,j)$,

where i and j label residues, L is the number of matched pairs (the size of each substructure), and φ is a similarity measure based on some pairwise relationship, here on the Cα-Cα distances $d^A_{ij}$ and $d^B_{ij}$. Unmatched residues do not contribute to the overall score. For a given functional form of φ(i,j), the largest value of S corresponds to the optimal set of residue equivalences. Structure comparison here searches for the largest common substructure between two proteins, so one needs to define a similarity measure that balances two contradictory requirements: maximizing the number of equivalenced residues and minimizing structural deviations. The use of relative rather than absolute deviations of equivalent distances is tolerant to the cumulative effect of gradual geometrical distortions. In Dali, the residue-pair score φ has the form of Equation 2:

Equation 2: $\phi(i,j) = \left( \theta - \frac{|d^A_{ij} - d^B_{ij}|}{d^*_{ij}} \right) w(d^*_{ij})$,

where $d^*_{ij}$ is the average of $d^A_{ij}$ and $d^B_{ij}$, θ is the similarity threshold, and w is an envelope function. Dali uses a value of θ equal to 0.2. Since pairs in the long distance range are abundant but less discriminative, their contribution is weighted down by the envelope function $w(r) = \exp(-r^2/\alpha^2)$, where α = 20 Å, calibrated on the size of a typical domain. We report alignments generated using the similarity measure of Equation 2, imposing the constraint of strictly sequential alignment. The resulting raw Dali score describing the structural similarity is given by Equation 3:

Equation 3: $S(A,B) = \sum_{i \in \mathrm{core}} \sum_{j \in \mathrm{core}} \left( 0.2 - \frac{|d^A_{ij} - d^B_{ij}|}{d^*_{ij}} \right) \exp\left( -\left( \frac{d^*_{ij}}{20\,\text{Å}} \right)^2 \right)$,

where we have explicitly inserted the values of the constants. The core is defined as a set of equivalences between residues in proteins A and B, which is analogous to a sequence alignment.

For random pairwise comparisons, the expected Dali score (Equation 3) increases with the number of residues in the compared proteins. In order to describe the statistical significance of a pairwise comparison score S(A,B), the Dali server uses the Z-score defined as

Equation 4: $Z(A,B) = \frac{S(A,B) - m(L)}{0.5\, m(L)}$,

where the denominator is an estimate of the average standard deviation of scores for various lengths of protein chains. The approximate empirical relation between the mean score m and the average length $L = \sqrt{L_A L_B}$ (with L<400) of two proteins is given by:

Equation 5: $m(L) \approx 7.95 + 0.71 L + 2.59 \times 10^{-4} L^2 - 1.92 \times 10^{-6} L^3$.

The Z-score is computed for every possible pair of domains, and the highest value is reported as the Z-score of the protein pair. Possible domains are determined by the Puu algorithm (parser for Protein Unfolding Units). The algorithm recursively cuts a structure into smaller compact substructures at the weakest interface. A number of post-processing rules were introduced to supplement numerical criteria. The whole procedure is fully described in the original publication (Holm & Sander 1994).

B. PROGRAM PARAMETERS

The following parameters are set at the top of the main Perl script. The default values, as used by the Dali server, are indicated. These parameters mainly affect the pruning of search space in the database search.

- $MINLEN=30. Structures with fewer residues are excluded from comparison. Dali was designed to detect similarities at the level of globular domain folding patterns that involve several secondary structure elements. It is not designed to compare conformations of short peptides.

- $MINSSE=2. The Wolf and Parsi methods reduce the complexity of the structural comparison by representing structures (partly) as secondary structure elements. If there are fewer than $MINSSE secondary structure elements in the protein, then the Soap method is used.
- $cut0=20.0; $cut1=4.0; $cut2=2.0. The database search by the Dali server uses a set of rules to prune search space after a strong similarity has been found. If a similarity has been found that is above a Z-score equal to $cut0, then the search is stopped completely, because the query is structurally almost identical to the best hit. If similarities have been found with Z-scores above $cut1, then the search list is restricted to the first neighbour shells of all hits. If the best Z-score lies between $cut1 and $cut2, then the search list is restricted to the second neighbour shells of all hits.
- $nbest=1. This parameter controls the number of hits in output. All hits with a Z-score above 2, or at least $nbest hits, will be reported.
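
The scoring of Appendix A and the pruning rules controlled by $cut0, $cut1 and $cut2 can be sketched in a few lines of Python. This is an illustration written for this text, not the Fortran or Perl code of the Dali package, and all function names are hypothetical. Three assumptions are made and marked in the comments: the diagonal terms (i = j) contribute θ, because Equation 2 is undefined when the mean distance is zero; the mean-score polynomial is refused outside the stated range L<400 instead of being extrapolated; and a best Z-score at or below $cut2 leaves the search list unrestricted, which the parameter description does not spell out.

```python
import math

THETA = 0.2   # similarity threshold of Equation 2
ALPHA = 20.0  # width of the envelope function, in Angstrom


def dali_score(dist_a, dist_b):
    """Raw Dali score (Equation 3) for two aligned cores.

    dist_a[i][j] and dist_b[i][j] hold the C-alpha distances between the
    i-th and j-th aligned residues in protein A and protein B.
    """
    n = len(dist_a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                total += THETA  # assumption: diagonal term equals theta
                continue
            d_mean = 0.5 * (dist_a[i][j] + dist_b[i][j])
            deviation = abs(dist_a[i][j] - dist_b[i][j]) / d_mean
            total += (THETA - deviation) * math.exp(-(d_mean / ALPHA) ** 2)
    return total


def mean_score(length):
    """Expected score m(L) of Equation 5, stated for L < 400 only."""
    if not 0 < length < 400:
        raise ValueError("Equation 5 is only given for 0 < L < 400")
    return 7.95 + 0.71 * length + 2.59e-4 * length**2 - 1.92e-6 * length**3


def z_score(raw_score, len_a, len_b):
    """Z-score of Equation 4, with L the geometric mean of both lengths."""
    m = mean_score(math.sqrt(len_a * len_b))
    return (raw_score - m) / (0.5 * m)


def search_scope(best_z, cut0=20.0, cut1=4.0, cut2=2.0):
    """Pruning rule of the database search as described for $cut0-$cut2."""
    if best_z > cut0:
        return "stop"
    if best_z > cut1:
        return "first neighbour shells"
    if best_z > cut2:
        return "second neighbour shells"
    return "unrestricted"  # assumption: not stated in the text


if __name__ == "__main__":
    # Self-comparison of estradiol receptor 1qkuA from the neighbour list:
    # raw score 3994.6 over 250 residues, reported Z-score 44.6.
    z = z_score(3994.6, 250, 250)
    print(round(z, 1), search_scope(z))
```

With the rounded coefficients of Equation 5, the self-comparison of 1qkuA comes out within a few tenths of the Z-score listed for it in the structural neighbour list, which is a useful consistency check on the signs in the polynomial.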

FIGURE 1: SNAPSHOT FROM THE RESULTS PAGE OF DALILITE SERVER FOR THE COMPARISON OF 1F0KA TO 1F6DA.

DaliLite Results

SUBMISSION PARAMETERS
Structure 1: 1QKU
Structure 2: 1K4W

Results of Structure Comparison

Each chain of mol1 is compared structurally to each chain of mol2 using the DaliLite program. The Dali method optimises a weighted sum of similarities of intramolecular distances. Sequence identity and the root-mean-square deviation of C-alpha atoms after rigid-body superimposition are reported for your information only; they are ignored by the structural alignment method. Suboptimal alignments do not overlap the optimal alignment or each other. Suboptimal alignments detected by the program are reported if the Z-score is above 2; they may be of interest if there are internal repeats in either structure. In the C-alpha traces, the chains of the first and second structure are renamed 'Q' and 'S', respectively. The best match to each chain in the second structure is highlighted in the table below. Z-scores below 2 are not significant.

First Structure & Chain: mol1a
No.: 1
Second Structure & Chain: mol2a
Z-Score: 22.2
Aligned Residues: 217
RMSD [Å]: 2.4
Seq. Identity [%]: 18
Structural Alignment: click here
Superimposed C-alpha Traces: CA_1.pdb
PDB Files (mol2 is rotated / translated to mol1 position): mol1_original.pdb, mol2_1.pdb

Additional data
- Rotation-translation matrices for superimposition
- Listing of structurally equivalent residue ranges
- View the log (this is only informative to experts)

FIGURE 2: STRUCTURAL ALIGNMENT BETWEEN 1QKUA AND 1K4WA.

No 1: Query=mol1A Sbjct=mol2A Z-score=22.2

DSSP  lllllllllllhhhhhhhhhhhl..llll...llllllll..lllhhhhhh
Query skknslalsltadqmvsalldae..ppil...yseydptr..pfseasmmg 44
ident
Sbjct ...tMSEIDRIAQNIIKSHleTCQYtmeelhqlawqthtyeEIKAyqSKSREALWQ 53
DSSP  ...lHHHHHHHHHHHHHHHhhLLLLlhhhhhllllllllhhHHHHhhLLLHHHHHH

DSSP  HHHHHHHHHHHHHHHHHHHLLLHHHLLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLEELL
Query LLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEILMIGLVWRSMEHPGKLLF 104
ident
Sbjct QCAIQITHAIQYVVEFAKRITGFMELCQNDQILLLKSGCLEVVLVRMCRAFNPLNNTVLF 113
DSSP  HHHHHHHHHHHHHHHHHHLLHHHHLLLHHHHHHHHHHHHHHHHHHHHHHHEELLLLEEEE

DSSP  LlLLLEELLHHHHLLlHHHHHHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHLLLLLLLll
Query ApNLLLDRNQGKCVEgMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLss 164
ident
Sbjct E.GKYGGMQMFKALG.SDDLVNEAFDFAKNLCSLQLTEEEIALFSSAVLISPDRAWLL.. 169
DSSP  L.LEEELHHHHHHHL.LHHHHHHHHHHHHHHHLLLLLHHHHHHHHHHHHLLLLLLLLL..

DSSP  llhhhhhhhhhhhhhhhhhhhhhhhhhlllllhhhhhhhhhhhhhhhhhhhhhhhhhhhh
Query tlksleekdhihrvldkitdtlihlmakagltlqqqhqrlaqlllilshirhmsnkgmeh 224
ident
Sbjct ...EPRKVQKLQEKIYFALQHVIQKNHLD...DETLAKLIAKIPTITAVCNLHGEK 219
DSSP  ...LHHHHHHHHHHHHHHHHHHHHHLLLL...LLHHHHHHLLHHHHHHHHHHHHHH

DSSP  HHHHHHLL...llLLLHHHHHLLLlllll
Query LYSMKCKN...vvPLYDLLLEMLDahrlh 250
ident
Sbjct LQVFKQSHpdivntLFPPLYKELFN... 244
DSSP  HHHHHHHLhhhhhhLLLHHHHHHHL...

FIGURE 3: SUPERIMPOSED C-ALPHA TRACES OF 1QKUA AND 1K4WA, RASMOL STEREO VIEW.

FIGURE 4: HOME PAGE OF DALI DATABASE

Dali fold classification

Reference: L. Holm and C. Sander (1996) Mapping the protein universe. Science 273:595-602.

The Dali database is based on exhaustive all-against-all 3D structure comparison of protein structures currently in the Protein Data Bank (PDB). The classification and alignments are automatically maintained and continuously updated using the Dali search engine. This is a preliminary test version dated May 2003.

FOLD CLASSIFICATION
- Fold index: complete list of structural domains in PDB90 ordered by similarity. From the Fold index, you can browse the list of structural neighbours and alignments of each representative.
- Fold tree: a postscript picture

SEARCH PDB CODES OR PROTEIN NAMES
Enter PDB code or protein name to search for: estradiol receptor

DOWNLOADS
HELP

L. Holm, Sep 2003

FIGURE 5: TEXT QUERY RESULT

Dali database query: estradiol receptor

Click on the Repres. link to browse the structural neighbours and alignments of the representative. Click on the Fold link to view all members of the fold class.

PDB chain     Repres.   Fold   Compound
1qktA/4-250   1qkuA_1   342    ESTRADIOL RECEPTOR
1qkuA/1-250   1qkuA_1   342    ESTRADIOL RECEPTOR
1qkuB/4-250   1qkuA_1   342    ESTRADIOL RECEPTOR
1qkuC/4-250   1qkuA_1   342    ESTRADIOL RECEPTOR

FIGURE 6: FOLD QUERY RESULT

Dali fold query: 342

Fold index       PDB code   Adda   Browse     Compound
342.1.1.1.1.1    1qkuA_1    523    interact   ESTRADIOL RECEPTOR
342.1.1.1.2.1    1kv6A_1    523    interact   ESTROGEN-RELATED RECEPTOR GAMMA
342.1.1.1.3.1    1l2jA_1    523    interact   ESTROGEN RECEPTOR BETA
342.1.1.1.4.1    1qknA_1    523    interact   ESTROGEN RECEPTOR BETA
342.1.1.1.5.1    1e3gA_1    523    interact   ANDROGEN RECEPTOR
342.1.1.1.5.1    1a28A_1    523    interact   PROGESTERONE RECEPTOR
342.1.1.1.6.1    1nhzA_0           interact   GLUCOCORTICOID RECEPTOR
342.1.1.1.7.1    1hg4A_1    523    interact   ULTRASPIRACLE
342.1.1.1.7.1    1g2nA_1    523    interact   ULTRASPIRACLE PROTEIN
342.1.1.1.8.1    1lv2A_1    523    interact   HEPATOCYTE NUCLEAR FACTOR 4-GAMMA
342.1.1.1.9.1    1lbd_1     523    interact   RETINOID X RECEPTOR
342.1.1.1.10.1   1gwxB_1    523    interact   PPAR-DELTA
342.1.1.1.10.1   1fm9D_1    523    interact   RETINOIC ACID RECEPTOR RXR-ALPHA
342.1.1.1.10.1   1kkqA_1    523    interact   PEROXISOME PROLIFERATOR ACTIVATED RECEPTOR
342.1.1.1.11.1   1k4wA_1    523    interact   NUCLEAR RECEPTOR ROR-BETA
342.1.1.1.11.1   1n83A_1    523    interact   NUCLEAR RECEPTOR ROR-ALPHA
342.1.1.1.12.1   1dkfB_1    523    interact   RETINOID X RECEPTOR-ALPHA
342.1.1.1.12.1   2lbd_1     523    interact   RETINOIC ACID RECEPTOR GAMMA
342.1.1.1.13.1   1nq2A_0           interact   THYROID HORMONE RECEPTOR BETA-1
342.1.1.1.14.1   1ie9A_1    523    interact   VITAMIN D3 RECEPTOR
342.1.1.1.15.1   1m13A_0           interact   ORPHAN NUCLEAR RECEPTOR PXR

FIGURE 7: STRUCTURAL NEIGHBOUR LIST FOR ESTRADIOL RECEPTOR

1qkuA: Structural Neighbours in PDB90 and structural alignments

PDB90 is a representative subset of PDB chains that are less than 90 % sequence identical to each other.

No: the top 50 alignments, sorted by Z-score, are shown
Chain: PDB entry code plus chain identifier
raw-score: the sum of weighted similarities of intramolecular distances that Dali maximizes
Z-score: normalized score that depends on the size of the structures
%id: percentage of identical amino acids over all structurally equivalent residues
lali: number of structurally equivalent residues
rmsd: root-mean-square deviation of C-alpha atoms in the least-squares superimposition of the structurally equivalent C-alpha atoms
Description: the COMPND record from the PDB entry

No  Chain  raw-score  Z-score  %id  lali  rmsd  Description
1   1qkuA  3994.6     44.6     100  250   0.0   ESTRADIOL RECEPTOR
2   1qknA  2686.9     30.4     57   219   1.3   ESTROGEN RECEPTOR BETA
3   1kv6A  2629.9     30.0     36   222   1.6   ESTROGEN-RELATED RECEPTOR GAMMA
4   1l2jA  2525.5     28.3     54   224   1.9   ESTROGEN RECEPTOR BETA
5   1e3gA  2536.2     27.6     21   229   0.0   ANDROGEN RECEPTOR
6   1a28A  2504.8     27.2     21   229   1.9   PROGESTERONE RECEPTOR
7   1lv2A  2080.4     23.2     26   207   2.2   HEPATOCYTE NUCLEAR FACTOR 4-GAMMA
8   1nhzA  2111.3     23.0     26   209   2.0   GLUCOCORTICOID RECEPTOR
9   2lbd   2082.4     22.7     22   217   2.6   RETINOIC ACID RECEPTOR GAMMA
10  1k4wA  2053.3     22.2     18   217   2.4   NUCLEAR RECEPTOR ROR-BETA
11  1ie9A  2091.7     22.2     20   222   2.7   VITAMIN D3 RECEPTOR
12  1nq2A  2055.9     21.8     16   214   2.5   THYROID HORMONE RECEPTOR BETA-1
13  1n83A  2042.8     21.8     17   215   2.4   NUCLEAR RECEPTOR ROR-ALPHA
14  1hg4A  1999.7     21.7     26   203   2.5   ULTRASPIRACLE
15  1g2nA  1990.3     21.3     26   204   3.0   ULTRASPIRACLE PROTEIN
16  1dkfB  1924.4     21.1     20   208   2.6   RETINOID X RECEPTOR-ALPHA
17  1fm9D  2015.2     20.9     18   218   2.5   RETINOIC ACID RECEPTOR RXR-ALPHA
18  1gwxB  1972.3     20.4     15   221   2.9   PPAR-DELTA
19  1m13A  1967.7     20.3     16   210   2.8   ORPHAN NUCLEAR RECEPTOR PXR
20  1lbd   1777.4     19.0     29   194   2.9   RETINOID X RECEPTOR
21  1kkqA  1793.9     18.1     15   214   2.9   PEROXISOME PROLIFERATOR ACTIVATED RECEPTOR
22  1n81A  530.1      4.6      5    114   4.0   PLASMODIUM FALCIPARUM GAMETE ANTIGEN 27/25

FIGURE 8: MULTIPLE STRUCTURAL ALIGNMENT OF ESTRADIOL RECEPTOR AND SELECTED NEIGHBOURS.

FIGURE 9: DISTANCE MATRIX ALIGNMENT