Contemporary approaches to protein structure classification


Mark B. Swindells,1* Christine A. Orengo,2 David T. Jones,3 E. Gail Hutchinson,2 and Janet M. Thornton2,4

1 Helix Research Institute, Kisarazu, Japan.
2 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, United Kingdom.
3 Department of Biological Sciences, Warwick University, Warwick, United Kingdom.
4 Department of Crystallography, Birkbeck College, London, United Kingdom.
*Correspondence to: Mark B Swindells, Inpharmatica Ltd., 60 Charlotte Street, London W1P 2AX, UK.

Summary

Just as sequences can be compared by database searching, it is also possible to compare three-dimensional protein structures. Such methods can be extremely useful because a structural similarity may represent a distant evolutionary relationship that is undetectable by sequence analysis. In this review, we summarise the most popular structure comparison methods, show how they can be used for database searching, and then describe some of the most advanced attempts to develop comprehensive protein structure classifications. With such data, it is possible to identify distant evolutionary relationships, provide libraries of unique folds for structure prediction, estimate the total number of folds that exist, and investigate the preference for certain types of structures over others. BioEssays 20: , John Wiley & Sons, Inc.

Introduction

Like all public bioinformatics resources, the main repository for experimentally determined protein structures, the Protein Data Bank (1) (or PDB, as it is better known), is growing at an enormous rate. There are currently over 7,000 structures in the PDB, most of which have been determined by X-ray crystallographic and nuclear magnetic resonance (NMR) spectroscopic techniques. However, there are also a small number of structures that have been determined by neutron diffraction and electron microscopy, and even some hypothetical models. Currently, new coordinates are being deposited at a rate of 150 per month, which is a far cry from a decade ago, when there were only about 600 structures in total.

To make these data more accessible to the user, the PDB has recently created a relational database that can be interrogated over the web. Using this, one can select proteins based on key words (such as Perutz or hemoglobin) or experimental criteria (such as resolution and R-factor in the case of crystal structures), as well as by a sequence search against all PDB entries. But it is now known that there are many distant relationships that can only be identified through structure comparison, and it is this development that is the subject of this review.

By considering the possibility of structure-based searches in a manner that is comparable to a sequence search, it becomes possible to move cleanly through the barriers that prevent sequence searches from identifying all distant homologues. For instance, although a search that uses a globin sequence is not sufficiently sensitive to detect all the globins in the PDB, a structure-based search would identify not only all the globins but also proteins with similar structures, such as phycocyanin, which may be a distant ancestor, (2) and colicin, whose evolutionary relationships are less certain. (3) Another example, shown in Figure 1, reveals a clear similarity between several DNA-binding proteins that could not be identified on the basis of sequence alone. (4) This type of structure has, so far, only been seen in DNA-binding proteins, and in each case the same helix is used to bind the DNA. In the case of the replication terminator protein (5) and histone H5, (6) residues important for interacting with the DNA also appear to be conserved.
As there are now many such cases to be identified, it is desirable to describe all relationships between known structures in a systematic and ultimately quantitative manner.

BioEssays 20: , 1998 John Wiley & Sons, Inc.

TABLE 1. Databases and Programs with Internet Links
Databases and programs:
Protein data bank
CATH and CATHserver
SCOP
FSSP and DALI
MMDB and VAST
3Dee
Homstrad and ddbase
3D-ali
Cosec
ssap
sarf
Internet links: argos orengo nicka/prerun.html

Figure 1. a: Replication terminator protein, a homodimeric structure (32) whose monomeric chains have structural similarities to a large number of winged helix DNA-binding proteins. (33) Colored regions from (a) are shown in detail (b), and the structural similarity (c) to histone H5 (34) is emphasised.

Such data would be useful, not only for detecting evolutionary relationships but also for making libraries of unique structures that could be used in structure prediction methods known as threading (7) and other fold recognition (8-10) techniques. In a manner that is analogous to the comparison of two sequences, such algorithms can thread a sequence through the coordinates of a library structure and, through the application of a statistically derived energy function, identify its most suitable alignment on the structure. By repeating this procedure on each library structure, it is hoped that the threading with the lowest energy will identify both the correct structure and alignment, though in practice the former is more likely than the latter. Describing such prediction methods in any more detail would be a review in itself, (11,12) but the point to appreciate here is that thorough organisation of the structures plays a significant role, for both the derivation of the potential function and the library of structures to be searched.

Comparing protein structures

Of course, the discussion above gives no indication of how such structural similarities might be identified. The simplest way is by making comparisons by eye, and indeed, this can be a very successful approach, (13-15) particularly when the coordinates of a structure are not available. However, there are few people who are able to do this.
As a result, algorithms which automatically identify structural relationships have become a hot topic for research in recent years. Inevitably, they are more complex and computationally intensive than those used for sequence comparison, as one has to find similarities between three-dimensional objects. In principle, one could condense three-dimensional information into a linear string, (16,17) but the increase in speed is more than offset by the loss of discriminatory power. Assuming that the coordinates for both structures are available, there are now a number of approaches for comparing protein structures, which use techniques as diverse as least-squares superposition, (18-20) double dynamic programming, (21-23) simulated annealing, (24) graph theory, (25) distance matrix comparison, (26) and geometric hashing. (27) Some of these will be discussed further in our review.

Finding similarities to a newly determined structure

In an analogous manner to sequence database searching, it is possible to take a probe structure and compare it with every structure in the PDB. If the structure of interest is not already in the PDB, the easiest way is to use the DALI server or CATHserver (Table 1), which automatically compare the query structure with those already in the database. Alternatively, if one does not want to send the coordinates over the web and the PDB is mirrored, one can download a structure alignment program (Table 1) and run the search locally.

Comprehensive comparisons of PDB structures

In theory at least, if we can find similarities between pairs of structures, we should be able to describe all similarities in the PDB just by running a program many times. However, there are problems with this. Because most structure alignment procedures are central processing unit (CPU)-intensive and there is significant redundancy in the PDB, a huge amount of computing time can be wasted performing unnecessary comparisons. For instance, there are over 200 lysozyme entries with over 95% sequence identity. As these relationships can be identified on the basis of sequence alone, it is pointless to compare them all. Therefore, if one is interested in generating a comprehensive classification, it would be prudent to first identify obvious relationships using sequence alignment and then use a representative from each cluster in the second, more time-consuming phase of structure comparison.

Preliminary attempts at providing comprehensive classifications (28,29) adopted such a two-step procedure. First, all-by-all sequence alignments were calculated, and then, based on the alignment score and the overlap between each pair of sequences, these were clustered into families. From each cluster a representative was selected and a similar procedure performed, except that this time structure comparison was used instead of sequence comparison. These lists proved to be popular but still had two problems: no consideration was given to the deconvolution of multidomain structures into their constituents, and new data could not be publicised without writing another paper.

More recent approaches address the first problem directly by splitting all of the representatives into domains before performing the structure comparison step. Five groups have independently published procedures for splitting proteins into their constituent domains, (30-34) and these can be used to help process the representatives before performing the structure comparison stage. In structural terms, a domain can be thought of as a globular structure, which under normal circumstances would contain a hydrophobic core and be expected to be stable. However, it may also include small proteins that are held together by disulphide bridges (such as epidermal growth factor). The second problem has been helped by web technology, as data can now be distributed over the internet and updated at regular intervals.
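The first step of the two-step procedure described above (cluster by sequence, then keep one representative per cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in references 28 and 29: it assumes the sequences are already aligned to equal length, uses percent identity alone rather than alignment score and overlap, and clusters greedily against the first member of each cluster. The function names and the example entry names are hypothetical.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption: '-' marks a gap; gapped positions are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def cluster_by_identity(seqs: Dict[str, str], threshold: float) -> List[List[str]]:
    """Greedy clustering: each entry joins the first cluster whose first
    member (its representative) it matches at or above the threshold."""
    clusters: List[List[str]] = []
    for name, seq in seqs.items():
        for cluster in clusters:
            if percent_identity(seq, seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append([name])
    return clusters


def representatives(clusters: List[List[str]]) -> List[str]:
    """One representative per cluster, passed on to structure comparison."""
    return [cluster[0] for cluster in clusters]


if __name__ == "__main__":
    # Hypothetical toy entries, not real PDB sequences.
    toy = {"lysA": "ACDEFGHIKL", "lysB": "ACDEFGHIKV", "globX": "MNPQRSTVWY"}
    print(representatives(cluster_by_identity(toy, 90.0)))
```

Only the representatives would then enter the slower structure comparison stage, which is where the saving in computing time comes from.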
If one has a database of all-by-all comparisons, it is also possible, though not essential, to create a classification of the structural similarities. This is one of the main differences between the most popular databases. Some groups use all-by-all comparisons to find interesting new relationships but are not as concerned with the construction of a formal classification, whereas others have classification as an important goal of their project. In addition, one of the most widely used classifications, SCOP, is made by hand and does not have the kind of database common to the computationally based methods, though it could be viewed quite validly as a database of binary relationships. Each approach is equally valid and merely reflects the different ideas that each group has on tackling this important area. In the next part of this review, we will concentrate on the work of four groups: SCOP, (35) a manually constructed hierarchical classification; FSSP (36) and MMDB, (37) which are made in a totally automated fashion; and CATH, (38) the semiautomatic classification developed by ourselves. In our opinion, these four are currently the most comprehensive, firstly, because they include most PDB structures and, secondly, because they combine these data with advanced methods of display. However, readers interested in more detail should also look at 3dee, (32) a database of domains that has been clustered using two different hierarchical procedures, and 3D-ali, (39) an occasionally updated version of the list originally provided by the same authors. (28) Web links for all data and programs are shown in Table 1. Finally, there is the Homologous Structure Alignment Database (Homstrad), which, instead of trying to describe all protein structures, concentrates on a smaller number of families (currently release 5b has 130) but provides highly annotated structure-based sequence alignments (Table 1). 
A further novelty of Homstrad is that each sequence is written in a form that contains a considerable amount of structural information (beyond the normal summary of secondary structure elements), such as solvent accessibility and hydrogen-bonding groups, by using a variety of typesetting procedures. (40)

SCOP

The first classification of PDB structures to be made available on the web was SCOP. (35) This is a complete classification of all proteins in the PDB as well as additional structures that have not been deposited. The latter is possible because, as mentioned earlier, SCOP is an entirely manual classification. SCOP takes structure, function, and to some extent cellular location into account at various levels of the hierarchy. The top level is Class, which currently holds 10 classes: all-α, all-β, α/β, and α+β (following the classification of Levitt and Chothia), (41) multidomain (containing structures that have not been split into domains), as well as five other classes that deal with membrane-associated proteins and peptides. Below each Class there are two more levels that are based on structural rather than sequence data: Fold, which groups together similar proteins purely on structural criteria, and Superfamily, which clusters proteins on the basis of a similar structure and function. In this manner, proteins belonging to the same superfamily are expected to be evolutionarily related, even though their sequences may be quite different. Below the Superfamily level, all clusters correspond to relationships that could be identified by conventional sequence alignment.

The clear advantage of SCOP is that classification at the superfamily level has been performed extremely carefully using the authors' detailed knowledge of both structures and biological processes. In addition, the web site is user-friendly and it is very easy for a researcher to browse through the classification.
A potential disadvantage is that the absence of a comparison algorithm means that users cannot compare a new structure to the classification, nor can structure-based sequence alignments of classified families be generated automatically.

TABLE 2. Glossary of Terms
Class: Overall description of a protein in terms of its regular secondary structure (α, β, or αβ) content.
Architecture: Gross shape that results from packing regular secondary structure elements. Names used often describe this appearance in terms of well-known shapes such as barrels and trefoils. No consideration is given to the polarity of the secondary structural elements nor the sequential order in which they are joined together.
Topology: Considers not only architecture, but the polarity of each secondary structural element as well as the order in which they are encountered in a sequence.
Homologous superfamily: Essentially the same as topology but requires a higher degree of structural similarity coupled with a functional similarity that is suggestive of a homologous relationship.
Fold: In this paper, we use this word interchangeably with topology. The definition of this word can vary between laboratories.
Superfold: A topology which contains more than one Homologous superfamily.
Structural similarity: General phrase often used to describe the degree of structural complementarity resulting from least squares superposition. Hence, high structural similarity, etc.

FSSP

FSSP is the database that results from using the DALI program to compare all PDB structures. It is a completely automatic approach that relies entirely on a set of algorithms (including DALI) to first process data into domains (30,42) and then use distance matrix comparison to compare all domain-level structures with no significant sequence similarity. Each structural similarity, as determined by a system-defined score, is stored in the database, and through this automated procedure, the authors have detected several interesting relationships.
(3,43,44) FSSP data can be accessed over the web, by either searching for a PDB structure of interest or submitting the coordinates of a new structure and requesting DALI to perform a custom search against the FSSP database. The latter is a very popular approach for groups involved with protein structure determination, as they can ascertain whether any interesting similarities exist immediately after the structure has been solved. A large number of the structure reports published these days include the observation of a distant evolutionary relationship detected on the basis of a DALI search against the FSSP data. In many ways, the FSSP and SCOP approaches are extremely complementary, in that a similarity first detected by a DALI search can provide a route into the SCOP classification. Although the FSSP database is not a hierarchical classification and is not designed to be viewed by eye, similar structures are clustered together into a tree of folds so that it is easy to analyse a particular family. However, the major advantage of FSSP is the additional availability of structure-based sequence alignments, which can be tailored to the preferences of each researcher. FSSP also contains other interesting features, such as a file of folds that are so far unique in the database.

MMDB

This database is created at the National Center for Biotechnology Information (NCBI) using the Vector Alignment Search Tool (VAST). The target for MMDB is to provide a structural link with sequence and literature databases maintained by the NCBI. As one might envisage, this creates the exciting possibility of starting with a protein of interest and exploring increasingly distant relationships, while always being able to check the literature as the search progresses.
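As an aside on the distance matrix comparison mentioned for FSSP: the idea rests on the fact that the distances between residues within a structure do not change when the structure is rotated or translated, so two structures can be compared without first superimposing them. The sketch below illustrates only that principle and is not the DALI algorithm. It assumes that the two coordinate sets have the same length and that residue i in one is already equivalenced to residue i in the other, whereas a real method must search for those equivalences. The function names and toy coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(coords: Sequence[Point]) -> List[List[float]]:
    """All pairwise intramolecular distances for one set of coordinates."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def mean_distance_difference(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean absolute difference between equivalent intramolecular distances.

    Assumption: a[i] is equivalent to b[i]; no alignment search is done.
    """
    if len(a) != len(b):
        raise ValueError("coordinate sets must have the same length")
    if len(a) < 2:
        raise ValueError("at least two points are needed")
    da, db = distance_matrix(a), distance_matrix(b)
    n = len(a)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(da[i][j] - db[i][j])
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Toy coordinates (hypothetical), and the same points rotated 90 degrees about z.
    original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 1.0)]
    rotated = [(-y, x, z) for (x, y, z) in original]
    print(mean_distance_difference(original, rotated))
```

A value near zero indicates that the internal geometry is the same whatever the orientation; larger values indicate that equivalent distances differ.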
Data for all structural similarities are accessed in a manner similar to other areas of Entrez, where the user is given the opportunity to investigate a set of nearest neighbours that have steadily decreasing similarities to the query. To make MMDB, domains are first identified automatically on the basis of compactness. (45) Similarities between domains are then determined by comparing sets of secondary structure vectors and assessing the significance of each hit with a statistical scoring scheme. By requesting structural neighbours using tools available at the NCBI, one is effectively asking for structures whose scores with the target are highest. However, as no extra assessments of functional similarities are made, the onus is on the user to judge the relevance of each hit.

CATH

Our own approach to structure classification has developed from two earlier pieces of work. (29,46) Although we wanted to automate the system as far as possible, we also wanted to try to differentiate between homologous proteins and those with merely a common fold. Therefore, we expected that manual intervention would be needed. We also wanted a classification system that could be browsed like a book as well as searched by a structural probe. The result of our approach is a classification called CATH, whose construction has been explained in detail elsewhere. (22) CATH stands for Class, Architecture, Topology, and Homologous superfamily (see glossary in Table 2) and describes four levels of increasingly detailed structural similarity for each protein domain in the PDB. In this manner, proteins which share the same CATH numbers will also have structures which are globally similar, even though their sequences may be quite different. Figure 2 shows a simple example of these four CATH levels.

Figure 2. Simple example of the four CATH levels. For the αβ Class, the two-layer sandwich architecture is shown, in which one layer is formed by a β-sheet and the other is formed by the α-helices. Two distinct topologies are given for this architecture, and the difference is emphasised by colouring the ribbons from blue to red. For the first topology, we show two separate Homologous superfamilies, which in this case are represented by acylphosphatase and a domain of aspartate transcarbamoylase.

Figure 3. CATH pyramid of numbers showing how the structural entries cluster from 12,899 domain entries to three main classes.

To deal with redundancy in the PDB, we also have hierarchical divisions below the H level which cluster on the basis of sequence similarity. These are Sequence (where identity ≥ 35%), Nearly identical (≥ 95%), and Identical (100%). Version 1.4 of CATH contains nearly 13,000 domains, and the number of entries at each CATH level is summarised in Figure 3. The considerable redundancy in the PDB is emphasised by the S level (≥ 35% identity), where the data reduce to only 1,316 distinct families. To cluster these S levels, we introduce structure comparison, and in the following section we will show how these 1,316 Sequence families are distributed in our CATH hierarchy.
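To make the idea of shared CATH numbers concrete, the sketch below reports the deepest of the four levels at which two domains agree. It assumes, purely for illustration, that each domain carries a dotted code of the form C.A.T.H; the function name and the example codes are hypothetical and are not taken from the CATH release described here.

```python
from typing import Optional

# The four CATH levels, from least to most detailed.
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Return the most detailed level at which two dotted C.A.T.H codes agree.

    Levels are compared from left to right and the comparison stops at the
    first mismatch, because a deeper number is only meaningful within its
    parent. Returns None if even the Class differs.
    """
    shared = None
    for level, x, y in zip(LEVELS, code_a.split("."), code_b.split(".")):
        if x != y:
            break
        shared = level
    return shared


if __name__ == "__main__":
    # Hypothetical codes used only to exercise the function.
    print(deepest_shared_level("3.30.70.100", "3.30.70.140"))
    print(deepest_shared_level("3.30.70.100", "1.10.8.10"))
```

Two domains that agree down to Topology but not Homologous superfamily share a fold without evidence of common ancestry, which is the distinction the classification sets out to capture.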
Class and Architecture

There are only three main classes in CATH: all-α, αβ, and all-β (plus a miscellaneous category for small protein chains having no discernible secondary structure). Within each class, the number of architectures (Table 2) varies considerably. Currently, there are three in the α class, 10 in the αβ class, and 18 in the β class (Fig. 3). This difference arises because helices can interact with a wide range of angles, (47) despite a known preference for 35 and 25 angles, (48,49) and in combination these lead to complex structures that are difficult to divide in an intuitive manner. In contrast, those formed by β-strands have more distinct shapes, such as barrels and sheets, as hydrogen bonds between the strands limit their orientation.

Topology and Homologous superfamily

These two terms (Table 2) require structures to have a high degree of structural similarity (as determined by the SSAP score). The main difference is that Homologous superfamilies also require evidence that the proteins are related. We use a variety of criteria for this; evidence for a similarity in function (e.g., proteinase activity, DNA binding) is clearly the most useful, but in the absence of such data, we can also use our knowledge of how many distinct functions are associated with each fold. As mentioned earlier (Fig. 1), some topologies appear to be exclusively associated with a particular function. Distinguishing homologues from analogues is extremely difficult, and despite detailed research into methods for dividing these automatically, (50,51) there is currently no method sufficiently reliable to be applied in a routine manner.

Figure 4. CATHerine wheel, showing how the 757 H-level families distribute among the C, A, and T levels. Protein Class is shown in red, green, and yellow for α, β, and αβ, respectively. Within each class, the angle subtended reflects the number of H-level families which belong to each (inner circle) architecture and (outer circle) topology. Twelve superfolds having at least five Homologous superfamilies each are indicated in paler colours, and the structures of selected superfolds are shown around the circumference. Labels in the figure: DNA binding, Doubly wound, Up Down, αβ plaits, TIM Barrel, Complex, 3 layer sandwich, 2 layer sandwich, Roll, Barrel, Non bundle, Sandwich, Bundle, Ribbon, Barrel, UB rolls, Jelly roll, Single sheet, OB folds, IG like.

Some folds occur more than others

Our 1,316 S-level clusters occupy 757 Homologous superfamilies and 527 Topologies (Fig. 3). Looking at the outer ring of the CATHerine wheel (Fig. 4), two different situations clearly exist: either there is only one H-level within a T-level or there is more than one. Some years ago, we defined the term superfold in order to deal with the observation that certain topologies had noticeably more Homologous superfamilies than others.
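The distinction drawn above, between topologies holding a single Homologous superfamily and those holding several, amounts to a simple count. The sketch below shows one way to make that count, assuming that each family is labelled with a dotted C.A.T.H code; the function names and the toy codes are hypothetical and the numbers they produce are not the CATH statistics quoted in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def superfamilies_per_topology(codes: Iterable[str]) -> Dict[str, int]:
    """Count distinct H-levels under each topology (C.A.T prefix).

    Duplicate codes, for example several sequence families belonging to the
    same Homologous superfamily, are counted once.
    """
    h_by_topology = defaultdict(set)
    for code in codes:
        c, a, t, h = code.split(".")  # raises ValueError on malformed codes
        h_by_topology[f"{c}.{a}.{t}"].add(h)
    return {topology: len(hs) for topology, hs in h_by_topology.items()}


def multi_superfamily_topologies(counts: Dict[str, int]) -> List[str]:
    """Topologies that contain more than one Homologous superfamily."""
    return sorted(topology for topology, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical codes for illustration only.
    toy_codes = ["3.30.70.1", "3.30.70.2", "3.30.70.2", "3.40.50.1", "1.10.8.1"]
    counts = superfamilies_per_topology(toy_codes)
    print(counts)
    print(multi_superfamily_topologies(counts))
```

Sorting the resulting counts in descending order would give the kind of skewed distribution that motivates singling out a few heavily populated topologies.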
We required a superfold to have at least three H-levels, and at that time, it meant that 10 out of the 131 topologies identified (7%) were classified as such. Although the database is now an order of magnitude larger, this trend is reinforced, with only 4% of the topologies (23 out of 527) having three or more H-levels (Fig. 5). But more significantly, even the data within this small set of topologies are highly skewed, and in one case, the doubly wound topology, there are as many as 52 Homologous superfamilies (Fig. 5). This suggests that there are a small number of popular topologies occupied by many proteins with no apparent evolutionary relationship and that these are presumably the ones with simple folding pathways. More recently, Brenner et al. (52) have looked at highly populated families in the SCOP database, describing them as frequently occurring domains (FODs). A FOD requires the presence of two SCOP superfamilies, and a recent review described 42 FODs in SCOP that represent about 12% of their classified folds.

Figure 5. Superfold graph showing the current data for topologies with more than one superfamily.

1,000 or so folds

Chothia was the first to estimate the number of folds that may exist in protein structure space. (53) His popular, though not necessarily correct, estimate is 1,000, calculated on the basis of how many folds we know, how many distinct sequence families are known in databases such as Swissprot, and what fraction of all sequences are currently available. Recently, the same group has used a simpler method, which compares the ratio of novel and previously observed folds during a particular year to the total number of folds known in the preceding year. (52) This year-on-year result can vary somewhat but always has the same order of magnitude. Some time ago, we provided a larger estimate of around 6,000. (29) This was based on a slightly different calculation, which tried to correct for the redundancy inherent in all databases and compensate for the presence of superfolds. However, the inability to cluster long sequences on the basis of their constituent domains and to predict how many sequence families will ultimately belong to a superfold leads to an overestimation of folds by this calculation. The correct number will probably lie somewhere between the two. However, in all structure classification work, perhaps the biggest obstacle to determining the number of folds is the problem of how to actually define a fold.

What is a fold?

Take, for example, the all- proteins in Figure 6. Although the two structures at each extreme are sufficiently different to be assigned to separate folds by most current definitions, there are many proteins with intermediate degrees of similarity.
On the basis of known variations in truly homologous proteins, one could easily imagine neighbouring structures as belonging to the same fold. As a result, when links such as these are allowed to form, they lead to large numbers of structures being clustered together. We call this the Russian doll effect. Although the effect can occur between proteins of any size, it is predominant in small domains because they are more likely than large structures to have most of their helices and strands superimposed by chance. At the moment, we have no automated way of dealing with these problems, so when they exist, we increase the requirements for a common fold by limiting the number of nonequivalenced helices and strands that are allowed. However, this will clearly have a knock-on effect on the contents of superfolds, as different criteria for defining topologies will change the number of Homologous superfamilies that a superfold contains.

Figure 6. Russian doll effect as illustrated by a selection of all- structures.

Conclusion

Although protein structure classification is an area that remains under development, the results described here emphasise the scale of progress that has been achieved in the past few years. That we can successfully classify the majority of proteins into a manageable number of families is a significant step forwards for both the analysis of protein structure per se and its application to prediction programs such as threading and the new field of genome analysis. Our knowledge of evolution and the way it adapts gene products to new functions is the main reason that structure comparison and classification are possible. The ultimate goal is to relate structure to function, and current classifications represent one such route. As we have shown in the latter part of this review, though, proteins do not always have the researchers' interests in mind when they evolve, and they create discrepancies that often make divergent and convergent evolutionary paths difficult to distinguish. Structure comparison and classification help us to appreciate both the variety of folds available to globular proteins and the limitations that result from having to form a compact structure. It is likely that the exponential rise in determined structures, together with the sequencing of complete genomes, will deepen our understanding, allowing some old problems to be solved and other, currently unanticipated problems to take their place.

REFERENCES

1. Bernstein FC, Koetzle TF, Williams GJB, Meyer EF Jr, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, and Tasumi M (1977) The Protein Data Bank: A computer based archival file for macromolecular structures. J. Mol. Biol. 112:
2. Pastore A, and Lesk AM (1990) Comparison of globins and phycocyanins: Evidence for evolutionary relationship. Proteins 8:
3. Holm L, and Sander C (1993) Globin fold in a bacterial toxin. Nature 361:
4. Swindells MB (1995) Identification of a common fold in the replication terminator protein suggests a possible mode for DNA binding. Trends Biochem. Sci. 20:
5. Bussiere DE, Bastia D, and White SW (1995) Crystal structure of the replication terminator protein. Cell 80:
6. Ramakrishnan V, and Finch JT (1993) Crystal structure of globular domain of histone H5 and its implications for nucleosome binding. Nature 362:
7. Jones DT, Taylor WR, and Thornton JM (1992) A new approach to fold recognition. Nature 358:
8. Bowie JU, Lüthy R, and Eisenberg D (1991) A method to identify protein sequences that fold into a known three dimensional structure. Science 253:
9. Nishikawa K, and Matsuo Y (1993) Development of pseudoenergy potentials for assessing protein 3D-1D compatibility and detecting weak homologies. Prot. Eng. 6:
10. Flockner H, Braxenthaler M, Lackner P, Jaritz M, Ortner M, and Sippl MJ (1995) Progress in fold recognition. Proteins 23:
11. Fischer D, Rice D, Bowie JU, and Eisenberg D (1995) Assigning amino acid sequences to 3-dimensional protein folds. FASEB J. 10:
12. Sippl MJ, and Flockner H (1996) Threading thrills and threats. Structure 4:
13. Murzin AG, and Chothia C (1992) Protein architecture: new superfamilies. Curr. Opin. Struct. Biol. 2:
14. Swindells MB (1992) Structural similarity between transforming growth factor-β2 and nerve growth factor. Science 258:
15. Murzin AG (1996) Structural classification of proteins: new superfamilies. Curr. Opin. Struct. Biol. 6:
16. Karpen ME, de Haseth PL, and Neet KE (1989) Comparing short protein substructures by a method based on backbone torsion angles. Proteins 6:
17. Matsuo Y, and Kanehisa M (1993) An approach to systematic detection of protein structural motifs. Comput. Appl. Biosci. 9:
18. Rossmann MG, and Argos P (1976) Exploring structural homology of proteins. J. Mol. Biol. 105:
19. Vriend G, and Sander C (1991) Detection of three dimensional substructures in proteins. Proteins 11:
20. Alexandrov NN, Takahashi K, and Go N (1992) Common spatial arrangements of backbone fragments in homologous and nonhomologous proteins. J. Mol. Biol. 225:
21. Taylor WR, and Orengo C (1989) Protein structure alignment. J. Mol. Biol. 208:
22. Orengo CA, Brown NP, and Taylor WR (1992) Fast structure alignment for protein databank searching. Proteins 14:
23. Orengo CA, and Taylor WR (1996) SSAP: Sequential structure alignment program for protein structure comparison. Methods Enzymol. 266:
24. Sali A, and Blundell TL (1990) The definition of general topological equivalence in protein structures: A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. 212:
25. Mitchell EM, Artymiuk PJ, Rice DW, and Willett P (1989) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212:
26. Holm L, and Sander C (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233:
27. Nussinov R, and Wolfson HJ (1991) Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques. Proc. Natl. Acad. Sci. 88:
28. Pascarella S, and Argos P (1992) A data bank merging related protein structures and sequences. Prot. Eng. 2:
29. Orengo CA, Flores TP, Taylor WR, and Thornton JM (1993) Identification and classification of protein fold families. Prot. Eng. 6:
30. Holm L, and Sander C (1994) Parser for protein folding units. Proteins 19:
31. Swindells MB (1995) A procedure for detecting structural domains in proteins. Prot. Sci. 4:
32. Siddiqui AS, and Barton GJ (1995) Continuous and discontinuous domains: An algorithm for the automatic generation of reliable protein domain definitions. Prot. Sci. 4:
33. Sowdhamini R, Rufino SD, and Blundell TL (1996) A database of globular protein structural domains: Clustering of representative family members into similar folds. Folding and Design 1:
34. Islam SA, Luo J, and Sternberg MJE (1995) Identification and analysis of domains in proteins. Prot. Eng. 8:
35. Murzin AG, Brenner SE, Hubbard T, and Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:
36. Holm L, and Sander C (1996) The FSSP database of structurally aligned protein fold families. Nucl. Acids Res. 24:
37. Gibrat JF, Madej T, and Bryant SH (1996) Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6:
38. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, and Thornton JM (1997) CATH: a hierarchic classification of protein domain structures. Structure 5:
39. Pascarella S, Milpetz F, and Argos P (1996) A databank (3D-ali) collecting related protein sequences and structures. Prot. Eng. 9:
40. Overington J, Johnson MS, Sali A, and Blundell TL (1990) Tertiary structural constraints on protein evolutionary diversity; Templates, key residues and structure prediction. Proc. Roy. Soc. Lond. B241:
41. Levitt M, and Chothia C (1976) Structural patterns in globular proteins. Nature 261:
42. Holm L, and Sander C (1996) Mapping the protein universe. Science 273:
43. Holm L, and Sander C (1995) DNA polymerase beta belongs to an ancient nucleotidyltransferase superfamily. Trends Biochem. Sci. 20:
44. Holm L, and Sander C (1997) Enzyme HIT. Trends Biochem. Sci. 22:
45. Madej T, Gibrat JF, and Bryant SH (1995) Threading a database of protein cores. Proteins 23:
46. Orengo CA, Jones DT, and Thornton JM (1994) Protein superfamilies and domain superfolds. Nature 372:
47. Bowie JU (1997) Helix packing angle preferences. Nature Struct. Biol. 4:
48. Chothia C, Levitt M, and Richardson D (1981) Helix to helix packing in proteins. J. Mol. Biol. 145:
49. Reddy B, and Blundell T (1993) Packing of secondary structural elements in proteins. Analysis and prediction of inter-helix distances. J. Mol. Biol. 233:
50. Russell RB, Saqi MA, Sayle RA, Bates PA, and Sternberg MJ (1997) Recognition of analogous and homologous protein folds: Analysis of sequence structure conservation. J. Mol. Biol. 269:
51. Rost B (1997) Protein structures sustain evolutionary drift. Folding and Design 2:S19
52. Brenner SE, Chothia C, and Hubbard TJP (1997) Population statistics of protein structures: Lessons from structural classifications. Curr. Opin. Struct. Biol. 7:
53. Chothia C (1992) One thousand families for the molecular biologist. Nature 357: