Predict protein structure through evolution - Rostlab · PDF filePredict protein structure...

1

Predict protein structure through evolutionBurkhard Rost 1, 2, 3, *,

Jinfeng Liu1,4, Dariusz Przybylski1,5, Rajesh Nair1,5, Kazimierz O.Wrzeszczynski1, Henry Bigelow1 & Yanay Ofran 1,6 *

In 'Chemoinformatics - From Data to Knowledge' J Gasteiger & T Engel (eds.), Wiley

* 1 CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th StreetBB217, New York, NY 10032, USA; 2 Columbia University Center for Computational Biology and Bioinformatics(C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA; 3 North East StructuralGenomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650West 168th Street BB217, New York, NY 10032, USA; 4 Dept. of Pharmacology, Columbia Univ., 630 West 168th

Street, New York, NY 10032, USA; 5 Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY10027, USA; 6Dept. of Medical Informatics, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA; *Corresponding author: [email protected], http://cubic.bioc.columbia.edu/, Tel: +1-212-305-3773, fax: +1-212-305-7932

Abbreviations used: 3D, three-dimensional; 3D structure, three-dimensional (co-ordinates of protein structure);2D, two-dimensional; 2D structure , two-dimensional (e.g. inter-residue distances); 1D, one-dimensional; 1Dstructure, one-dimensional (e.g. sequence or string of secondary structure); U, protein sequence of unknown 3Dstructure (e.g. search sequence); T, target used for homology modeling (protein of known 3D structure); PDB,Protein Data Bank of experimentally determined 3D structures of proteins [1]; SWISS-PROT, data base of proteinsequences [2]; TrEMBL, translation of the EMBL-nucleotide database coding DNA to protein sequences [2].

Abstract

The ultimate goal of protein structureprediction is to extend our knowledge andunderstanding of the structures andfunctions of proteins beyond that which ispossible by experiment. Virtually alltechniques, including 1D, 2D, and 3Dstructure prediction, and diverse kinds offunction prediction use profiles ratherthan single sequences as the ‘informationobject’ for prediction. Database methodsrely on structural information to evaluatethe fitness of a protein sequence for agiven structure according to a statisticalmodel. Energetic methods derivepredictions by calculating the fitnessaccording to thermodynamic and kineticprinciples. The two approaches havetheir limitations: database methods sufferfrom sparse statistics and therefore oftenover-fit the data, while energetic methodsmust vastly simplify theory to be tractablewith the limited computational poweravailable. The best predictors almostalways use a combination of both in anintelligent way.

1 Introduction

Proteins are the machinery of life. T h einformation for life is stored by a four-letteralphabet in the genes (DNA). Proteins are,among others, the macromolecules thatperform all important tasks in organisms, suchas catalysis of biochemical reactions, transportof nutrients, recognition, and transmission ofsignals. Thus, genes are the blueprints orlibrary, and proteins are the machinery of life.Proteins are formed by joining amino acids bypeptide bonds into an unbranched chain. Thisprotein sequence comprises a translation ofthe four-letter DNA alphabet into a 20-letteralphabet of native amino acids. Proteins differin length (from 30 to over 30,000 amino acids),and in the arrangement of the amino acids(dubbed residues, when joined in proteins). Inwater, the chain folds up into a unique three-dimensional (3D) structure. The main drivingforce is the need to pack residues for which acontact with water is entropically unfavorable(hydrophobic residues) into the interior of themolecule. A detailed analysis of the underlyingchemistry shows that this is only possible if the

2

protein forms regular patterns of a substructurecalled secondary structure (Fig. 1; for anexcellent introduction into protein structure: [3];for a short review of the basic principles offolding: [4]).

Sequence determines structure determinesfunction. Protein three-dimensional (3D)structure (i.e. the co-ordinates of all atoms)determines protein function. The hypothesisthat structure (also referred to as 'the fold') isuniquely determined by the specificity of thesequence has been verified for many proteins[5]. While it is now known that particularproteins (chaperones) often play a role in thefolding pathway, and in correcting misfolds [6],it is still generally assumed that the finalstructure is at the free-energy minimum. Thus,all information about the native structure of aprotein is coded in the amino acid sequence,plus its native solution environment. The codecould by deciphered, i.e. 3D structurepredicted, from physico-chemical principlesusing, for example, molecular dynamicsmethods [7; 8]. However, in practice suchapproaches are frustrated by two principleobstacles [9]. Firstly, energy differencesbetween native and unfolded proteins areextremely small (order of 1 kcal/mol).Secondly, the high complexity resulting fromthe cooperativity of protein folding requiresseveral orders of magnitude more computingtime than we anticipate to have over the nextdecades. Thus, the inaccuracy inexperimentally determining the basicparameters, and the limited computingresources become fatal for predicting proteinstructure from first principles. All successfulstructure prediction tools are knowledge-based: they use a combination of statisticaltheory and empirical rules.

The sequence-structure gap is rapidlyincreasing. Databases for protein sequencesare expanding rapidly, largely due to large-scale genome sequencing projects. We nowknow the sequences for about a millionproteins [2]. Only seven years after the firstorganism was entirely sequenced, we nowhave the entire genomes for over 50organisms [10; 11] from all kingdoms of life,amongst them six multi-cellular eukaryotes(worm, fly, weed, human, and mouse). Thisimplies that the explosion of genome, andhence, protein sequences is supposedly theonly field outgrowing the speed in developmentof computer hardware. It also implies that

despite significant improvements of structuredetermination techniques the gap between thenumber of proteins for which structure isdeposited in public databases (PDB [1]), andthe number of proteins for which sequencesare known is increasing.

No general prediction of structure fromsequence, yet. John Moult (CARB,Washington DC) initiated the biannualexperiment for a critical assessment ofstructure prediction (CASP): those whodetermine protein structures submitted thesequences of proteins for which they wereabout to solve the structure to a 'to-be-predicted' database; for each entry in thatdatabase predictors could send in theirpredictions before a given deadline (the publicrelease of the structure); finally, the resultswere compared and discussed during aworkshop (in Asilomar, California). Four suchexperiments have been completed andpublished in special issues of the journalProteins (Vol. 23 1995, Suppl. 1 1997, Suppl. 21999, Suppl. 3 2001); the prediction-season forCASP5 is this summer, and the meeting will beheld in Dec 2002. Most groups that everworked on protein structure prediction methodshave participated in this large-scale enterprise.Undoubtedly, CASP has influenced the field inmany ways. Many types of methods haveimproved significantly since 1994 (CASP1).However, while a few predictions weresurprisingly close to the experimental structureat CASP4, overall we still cannot reliablypredict protein structure from sequence.

In this review, focus will be laid on predictionmethods that actually contribute to bridging thesequence-structure gap with a view toanalyzing entire genomes and proteomes. Thefirst section summarizes briefly where we aretoday in protein structure prediction. Thefollowing chapters sketch the problems andsome of the solutions in database searches,and the prediction of protein structure in 1D,2D, and 3D (Fig. 1).

2 Protein structure predictionfor comparative genomics

2.1 Synopsis

Many more protein sequences are known(~1,000,000) than protein structures (~20,000).Three major prediction techniques are used to

3

Fig. 1: Protein structure in 1D, 2D and 3D.Representation of HIV-1 protease monomer (ProteinData Bank code 1HHP) in one, two and threedimensions. Each of the representations gives riseto a different type of prediction problem.1D - predict secondary structure and solventaccessibility. From left to right: amino acids for thefirst 33 residues (one letter code, first column);alignment exemplified by 5 sequences (secondcolumn); secondary structure [93] (H, helix; E,strand; blank, other: third column), solventaccessibility (measured in square Å, fourth column,[93]), and a typical prediction by the neural networkprogram PHD [37] for secondary structure andsolvent accessibility (in italics, fifth and sixthcolumn).

2D - predict contact map. The 3D structure can beprojected onto a two-dimensional matrix of inter-residue distances or contacts (as shown here). Theentry at position ij of the matrix gives the contactstrength between residue i and residue j; thestronger a contact, the darker the marker. Horizontaland vertical lines give borders of secondarystructure segments. Graph made with CONAN [94].3D - predict three-dimensional coordinates. Thetrace of the protein chain in 3D is plottedschematically as a ribbon Calpha-trace. Strands areindicated by arrows; the short helix is visible on theright towards the end (C-term) of the protein. Graphmade with MOLSCRIPT [95]. Predictions notshown.

extend this knowledge to sequences:homology modeling, threading, and 1Dprediction. The results of many predictionefforts are examined closely to produce a non-redundant target list for structural genomics tocomplete our knowledge of protein structuremissed by predictions. Because many costlyresearch decisions are based on predictions,evaluating and fairly comparing predictionaccuracies is crucial. The main pitfalls are

associated with sparse or different datasets,and inappropriate measures given theintended use of the prediction.

2.2 Bridging the gap betweensequence and structure forproteomes

Bridging the sequence-structure gap for 5 -40% of all sequences. The gap between the

4

number of known sequences (~ 1,000,000)and the number of known structures (~ 20,000)is widening rapidly. The most successfultheoretical approach to bridging this gap ishomology modeling: Although each nativeprotein sequence adopts a unique structure,different sequences can adopt the same fold.In other words, proteins with similar sequencestend to fold into similar structures. Indeed, for apair of naturally evolved proteins that havemore than 33 in 100 pairwise identical residueswe can infer that the two proteins fold intosimilar structures (note: for shorter alignmentsthe level of significant pairwise sequenceidentity is much higher; for very longalignments, it approaches a value around 20%[12], Fig. 4). Thus, if a sequence of unknownstructure (U) has significant sequencesimilarity to a protein of known structure (T), itis possible to build an approximate 3D modelfor U based on the assumption that U hasbasically the same structure as T . Thistechnique is referred to as homology orcomparative modeling. It effectively raises thenumber of 'known' 3D structures by more than10-fold. For example, for 30 of the about 100completely sequenced organisms, thepercentage of proteins with experimentallydetermined structures remains below 2% [13;10]. Comparative modeling at very high levelsof accuracy already more than doubles thisnumber (Fig. 2). We can predict the basic foldcorrectly for about 38% of all proteins inentirely sequenced proteomes (Fig. 2).

Widening the bridge by threading.Comparative modeling allows prediction of 3Dstructure for 5-40% of all protein sequences.However, there is evidence that most pairs ofproteins with similar structure are remotehomologues with less than 25% pairwisesequence identity [14]. These remotehomologues cannot usually be recognized byconvent ional sequence al ignments.Consequently, searching for remotehomologues is similar to the task of finding aneedle in a haystack. Techniques to managethis difficult task are referred to as 'threadingtechniques' [15]. Most of these techniques areapplicable if and only if the remote homologueto U has known structure. Once a remotehomology is detected, remote homologymodeling may be used to construct a 3Dmodel. This could potentially reduce thesequence-structure gap by an additional tenpercentage points (Fig. 2).

Fig. 2: Scope of structure prediction. For theentire proteomes of 30 organisms, we investigatedthe structural coverage by experiment andcomparative modeling: black bars give high-accuracy models (2-3Å), gray bars accurate models(2-5Å), and striped bars give the coverage with low-accuracy models, i.e. models that give the basic foldcorrectly. The lowest threshold for which we canlearn about structure through comparative modellingis a PSI-BLAST E-value of 10-3. At this level, about38% of all proteins have similarity to knownstructures. Threading techniques may bridge thesequence-structure gap by another 5-10 percentagepoints. However, for the remaining half of allproteins we can currently not predict 3D structure.The only general methods available for this half arepredictions in 1D and 2D. Note that the graph showsthe coverage in terms of proteins. The coverage ofresidues for which we have structural knowledge isabout 10 percentage points lower [13].

Accurate prediction for 1D aspects of 3Dstructure. If neither a close nor a remotehomologue can be detected for U, we areforced to simplify the prediction problem. Thereis a pay-off from making this simplification:using the rich diversity of information in current

5

databases, it is possible to make very accurate1D predictions from the sequence alone.Automatic prediction services are readilyavailable for secondary structure, solventaccessibility, location and topology fortransmembrane helices, and the location ofhelices for the special class of coiled-coilproteins. These simplified predictions are theonly means that we have today to compareentire proteomes [16]. A few results are thatmost organisms have a similar percentage ofhelical transmembrane proteins (15-30%), andthat eukaryotes have significantly higherfractions of coiled-coil proteins (10%) and ofproteins with long regions lacking regularsecondary structure (20-30%) than all otherkingdoms [10; 13; 17].

3 Sequence alignments

3.1 Synopsis

For every protein structure nature hasinvented, it has produced a myriad ofsequence variants. Sequence alignment is ourprimary tool for both discriminating differentstructures and estimating residue structuralneighbors among these variants. The HSSPcurve, a calibration of alignment quality againststructural knowledge, enables us to inferstructural similarity without knowing thestructure. A ‘profile’ calculated from a multiplesequence alignment extracts signal (conservedresidues) from noise (non-conservedresidues).

3.2 Alignment methods in practice

Basic concept. The principle problem ofsequence alignments is to find the 'optimal'superposition between two strings of amino (ornucleic) acid sequences, i.e. to optimally alignthe two strings. One mathematical criterion for'optimal' is the percentage of pairwise identicalresidues in the final superposition. A dynamicprogramming algorithm is guaranteed to findthe optimal solution for a given 'objectivefunction' (here pairwise identity) in analgorithmic time quadratic in the length of both

sequences [18] (Fig. 3). For protein sequencesthis simple approach is not sufficient; findingthe best alignment usually requiresintroduction of gaps in one sequence, orinsertions in the other [19] (Fig. 3: rather thanplacing two dots into sequence T , residues Aand E could be deleted in sequence U ; note:the gap increases the score from 2 to 4identical residues). Usually, gaps areintroduced mathematically by adding aconstant (gap open penalty) to the final score(here number of identical residues). However,to sensitively align protein sequences, this isstill not sufficient. The major addition to thesimple approach described so far is toevaluate scores that are not based on residueidentities but based on biochemical propertiesof the amino acids. For example, aligning twohydrophobic residues (I and L) is morebeneficial than aligning a hydrophobic and acharged residue (L and K; note: when treatinghydrophobic residues as identical, the scorefor the best gapped alignment in Fig. 3increases from 4 to 6).

Evolution distinguishes signal from noise. Atthe level of protein molecules, selectivepressure results from the need to maintainfunction, which in turn requires maintenance ofthe specific 3D structure. This evolutionaryhistory is the basis for the success in aligningprotein (or nucleotide) sequences. Accordingly,conservation and mutation patterns observedin alignments contain very specific informationabout 3D structure. How much variation istolerated without loss of structure or function?Obviously, it is more likely to find two strings often residues with five identical residues bychance than to find two strings of hundredresidues with fifty identical residues. Therefore,pairwise sequence identity alone is not enoughto separate between proteins of similar anddissimilar structure. Reinhard Schneider andChris Sander pioneered the HSSP thresholdthat describes the line of similarity below whichwe cannot infer similarity in structure throughsimilarity in sequence [20]. The details of thecurve originally proposed did not sustain theflood of new structures in the 90's, however,the basic shape remains correct [12]. For veryshort alignments, we cannot infer anythingabout structural similarity (Fig. 4A). For verylong alignments, 20% pairwise sequence iden-

6

Fig. 3: Concept of dynamic programming. Theidea of dynamic programming is the following. Thetwo sequences to be aligned (U and T) are writteninto a matrix. Starting from the first element in thematrix, identities are counted and summed along

the diagonal. The two best paths are marked bygray lines. The two best alignments match only twoidentical residues for the example given. However, ifinsertions (marked by dots) were allowed, the bestalignment, actually, matches four residues.

tity implies structural similarity (Fig. 4A). Thetransition from the 'safe' zone in which we caninfer structural similarity without errors into thetwilight zone in which the sequence signalbegins to fade, is characterized (1) by anincrease in structural homologues by morethan an order of magnitude (explosion of truepairs, Fig. 4B), and (2) by an even moredramatic explosion of pairs that are structurallyunrelated (false pairs, Fig. 4B). This simpleobservation highlights the struggle of today'ssequence analysis: on the one hand, we wantto intrude as far as possible into the twilightzone in order to unravel distant similarities. Onthe other hand, for each percentage point thatwe advance, the false pairs increase by somefactor. For example, at HSSP-distances of 0(marked by curve in Fig. 4A), there are no falsepositives; at distance of -2.5 already half of thenew relations found are false (dissimilarstructures), at -5 the false pairs dominate by4:1, and at -7.5 they dominate 9:1 (Fig. 4B).

Routine database searches by simplifiedprocedures. Any sequence analysis starts withdatabase searches: all known databases arescanned by sequence alignment procedures

for proteins homologous to the searchsequence U . When the pairwise sequenceidentity between U and a putative homologueH is high (Fig. 4), alignment procedures areusually straightforward [21; 22]. For less similarprotein pairs, alignments may fail. Aligning twosequences by dynamic programming is amatter of seconds on a modern workstation.However, database searches require repeatingthis many times, and since the databasesgrow, CPU time becomes a constraint ineveryday sequence analysis. This bottleneck isopened by methods that start to find 'identicalwords' (sub-strings), and then grow thealignment around such blocks. The mostwidely used programs of this sort are BLASTand FASTA [22; 23]. In practice, advancedalignment algorithms typically first run a fastscan with BLAST and/or FASTA and thenapply the full dynamic programming algorithm.This is also implemented in the fast and goodalignment bestseller PSI-BLAST [24]. Forusers who want to fine-tune the finalalignment, ClustalW and its newerimplementation ClustalX provide an excellenttool [25].

7

Fig. 4: Thresholds of significant sequencesimilarity. (A) The HSSP-curve [12; 20] describesthe transition between the region in which we knowfrom sequence similarity that two proteins havesimilar structures and the region in which we do notknow. The HSSP-distance simply measures thedistance from the curve according to the followingformula:

HSSPdistance=PIDE

100

480⋅L-0.32.{1+e-L / 1000}

19.5

, for L £ 11, for L £ 450, for L > 450

Ï

Ì Ô Ô

Ó Ô Ô

where PIDE is the percentage of pairwise identicalresidues between a pair of proteins over the

alignment length of L residues (the two thin linesillustrate HSSP distances of 10 and -10). (B) Thetwilight zone is the region in which the signal fromsequence starts to fade, i.e. in which we can nolonger safely infer structural similarity fromsequence similarity. The transition from the regionabove the HSSP curve into the twilight zone ischaracterized by an explosion of pairs of proteinswith similar structure (thick black line with open plussigns) and by an even more dramatic explosion ofpairs with dissimilar structure (thick gray line withopen triangles). Note: the thin line with open circlesshows the accuracy, i.e. the ratio of true (similar3D)/true + false (dissimilar 3D) that is pulled in at

8

given distance (data compiled in steps of one). (C)The midnight zone is characterized by our inabilityto find any true relation (similar 3D) without anexcess of false positives. When we structurallysuperpose all unique PDB proteins against oneanother and plot the distribution of pairwisesequence identity as measured by the structuralalignment (here from FSSP [96]), we see that themajority of proteins of similar structure populate themidnight zone; this finding becomes more extreme

the less conservative the cut-off for 'similarity instructure' (the higher the threshold in the FSSP z-score, the more similar the two structures). (D+E)The curves illustrate that thresholds for sequencesimilarity implying structural similarity are not theidentical to those implying functional similarity. Twoaspects of function are compared to structuralsimilarity: identity of enzymatic activity [97] andidentity of sub-cellular localization .

tity implies structural similarity (Fig. 4A). Thetransition from the 'safe' zone in which we caninfer structural similarity without errors into thetwilight zone in which the sequence signalbegins to fade, is characterized (1) by anincrease in structural homologues by morethan an order of magnitude (explosion of truepairs, Fig. 4B), and (2) by an even moredramatic explosion of pairs that are structurallyunrelated (false pairs, Fig. 4B). This simpleobservation highlights the struggle of today'ssequence analysis: on the one hand, we wantto intrude as far as possible into the twilightzone in order to unravel distant similarities. Onthe other hand, for each percentage point thatwe advance, the false pairs increase by somefactor. For example, at HSSP-distances of 0(marked by curve in Fig. 4A), there are no falsepositives; at distance of -2.5 already half of thenew relations found are false (dissimilarstructures), at -5 the false pairs dominate by4:1, and at -7.5 they dominate 9:1 (Fig. 4B).

Intruding into the twilight zone by profile-basedalignments. It was also recognized very earlyon that information from the position-specificevolutionary exchange profile of a particularprotein family facilitates discovering moredistant members of that family [26]. Automaticdatabase search methods successfully usedposition-specific profiles for searching [27].However, the breakthrough to large-scaleroutine searches has been achieved by thedevelopment of PSI-BLAST [24] and HiddenMarkov models [28; 29]. In particular, thegapped, profile-based, and iterated search toolPSI-BLAST continues to revolutionize the fieldof protein sequence analysis through itsunique combination of speed and accuracy.More distant relationships are found throughiteration starting from the safe zone ofcomparisons and intruding deeply and reliablyinto the twilight zone.

4 Prediction in 1D

4.1 Synopsis

A 1D prediction assigns a state from a discreteset, e.g. {helix, strand, loop}, defined byphysical or functional criteria, to each residuein the sequence. Prediction is achieved bytraining, in most cases, a Neural Network (NN)on fixed-size sequence windows classified bythe state of the central residue. Thus thepredictive information is sequence-local. Themain improvement in accuracy was due tousing family-derived profiles for both trainingand input. The most widely used 1Dpredictions today are secondary structure,solvent accessibility, transmembrane strandsand helices, and recently, regions of structuralswitches.

4.2 Secondary structure

Basic concept. The principal idea underlyingmost secondary structure prediction methodsis the fact that segments of consecutiveresidues have preferences for certainsecondary structure states [3]. Thus, theprediction problem becomes a pattern-classification problem tractable by patternrecognition algorithms. The goal is to predictwhether the residue at the center of a segmentof typically 13-21 sequence-consecutiveresidues is in a helix, strand or in none of thetwo (no regular secondary structure, oftenreferred to as the 'coil' or 'loop' state). The firstmethods in the field used simple statisticalcorrelations between the amino acid type andits most typical secondary structure state topredict [30]. Since, many different algorithmshave been applied to tackle this simplestversion of the protein structure prediction

9

problem [31-33]: physico-chemical principles,rule-based devices, expert systems, graphtheory, linear and multi-linear statistics,nearest-neighbor algorithms, moleculardynamics, and neural networks. In particular,some of the concepts originally worked out inthe early applications of neural networks [34;35] have since been improved through avariety of refinements [36; 37].

Evolutionary information key to significantlyimproved predictions. On the one hand, about75 out of 100 residues can be exchanged in aprotein without changing structure. On theother hand, exchanges of 1-5 residues canalready destabilize a protein structure. Theexplanation for this contradiction is thatevolution has explored exactly the unlikelyexchanges of particular amino acids atparticular positions that do not changestructure/function to survive. Thus, the residueexchange patterns extracted from multiplealignments are indicative of specific structuraldetails for that family. The first method thatreached a sustained level of a three-stateprediction accuracy above 70% (percentage ofresidues correctly predicted in one of the three-states helix, strand, other) was the profile-based neural network system PHD which usesexactly such evolutionary information derivedfrom multiple sequence alignments as input[37]. The currently best methods (PROFsec[36] and PSIPRED [38]) reach a level of 76%three-state per-residue accuracy [36]. Thisconstitutes a sustained level more than fourpercentage points above last century's bestmethod not using diverged profiles. About twothirds of the improvement achieved by today'sbest methods comes from larger databasesand more diverged profiles [36].

Secondary structure predictions now useful, inpractice. Prediction accuracy varies betweendifferent proteins [37]. For applications thisimplies that predictions can be as good as >95%, but also as bad as <54%. A few methodssuccessfully use reliability indices that indicateresidues for which predictions are moreaccurate. For PSIPRED and PROFsec thecorrelation between such a reliability index andaccuracy is linear [37]. Thus, the reliabilityindex effectively becomes a means to predictprediction accuracy, and hence to assess towhich class a protein of unknown structure (U)belongs: to the well predicted, or to the badlypredicted ones. Young, Kirshenbaum, Dill &Highsmith [39] have unraveled an impressive

correlation between local secondary structurepredictions and global conditions. The authorsmonitor regions for which secondary structureprediction methods give equally strongpreferences for two different states. Suchregions are processed combining simplestatistics and expert-rules. The final method istested on 16 proteins known to undergostructural rearrangements, and on a number ofother proteins. The authors report no falsepositives, and identify most known structuralswitches. Subsequently, the group applied themethod to the myosin family identifyingputative switching regions that were not knowbefore, but appeared reasonable candidates[40].

4.3 Solvent accessibility

Basic concept. It has been argued that we canpredict 3D structure by simply trying differentarrangements of predicted secondary structuresegments in space [41]. One criterion forassessing each arrangement could be to usepredictions of residue solvent accessibility. Theprincipal goal is to predict the extent to which aresidue embedded in a protein structure isaccessible to solvent. Solvent accessibility canbe described in several ways [42]. Thesimplest is a two-state descriptiondistinguishing between residues that areburied (relative solvent accessibility < 16%)and exposed (relative solvent accessibility ≥16%). The classical method to predictaccessibility is to assign either of the twostates, buried or exposed, according to residuehydrophobicity. However, a neural networkprediction of accessibility has been shown tobe superior to simple hydrophobicity analyses[43]. More recently, other advanced methodshave been applied successfully to predictaccessibility [44]. A particular twist comes frommethods that predict the contact environmentof a residue [45]: rather than predicting thatresidue X is n% accessible, these methodspredict that X is in contact with less than mother residues.

Evolutionary information improves predictionaccuracy. Solvent accessibility at each positionof the protein structure is evolutionarilyconserved within sequence families. This facthas been used to develop methods forpredicting accessibility using multiplealignment information [46]. Prediction accuracyis about 75±7%, four percentage points higher

10

than for methods not using alignmentinformation. Recently, PHDacc has beensuperseded by PROFacc, a neural networkbased method that improves predictionaccuracy significantly. PROFacc is particularlyaccurate in predicting surface residues reliably:almost 80% of the residues on the surface(>16% accessible) are correctly predicted.Predictions of solvent accessibility have alsobeen used successfully for prediction-basedthreading, as a second criterion towards 3Dprediction by packing secondary structuresegments according to upper and lowerbounds provided by accessibility predictions,and as basis for predicting functional sites [46].More recently, predictions of accessibility werealso used successfully to predict sub-cellularlocation (below).

4.4 Transmembrane helices

Basic concept. Even in the optimistic scenariothat in the near future most protein structureswill be experimentally determined, one class ofproteins will still represent a challenge forexperimental determination of 3D structure:transmembrane proteins. The major obstaclewith these proteins is that they do notcrystallize, and are hardly tractable by NMRspectroscopy. Consequently, for this class ofproteins structure prediction methods are evenmore needed than for globular water-solubleproteins. Fortunately, the prediction task issimplified by strong environmental constraintson transmembrane proteins: the lipid bilayer ofthe membrane reduces the degrees offreedom to such an extent that 3D structureformation becomes almost a 2D problem. Twomajor classes of membrane proteins areknown: proteins that insert helices into the lipidbilayer, and proteins that form pores by abarrel of an even number of anti-parallel ß-strands (below). For helical membraneproteins, elaborate combinations of expert-rules, hydrophobicity analysis and statisticswas believed to yield a two-state per-residueaccuracy of about 90% (residues predictedcorrectly as either transmembrane helix, orother; recent reviews in [47-49]).

Evolutionary information improves predictionaccuracy. For two methods the use of multiplealignment information is reported to clearlyimprove the accuracy of predictingtransmembrane helices [37; 50]. In order topredict the orientation of the helices (i.e. the

topology) a simple rule is applied: positivelycharged residues occur more often in intra-cytoplasmic than in extra-cytoplasmic regions.The advanced neural network system hasbeen improved significantly by adding adynamic programming algorithm to the neuralnetwork output. The principle idea is to use theneural network output as an energy landscapeand to find the optimal path through thislandscape [51]. TMAP was another earlyapplication of multiple sequence alignments todetermine membrane-spanning segments [50].More recently, Hidden Markov Models used acombination of intelligence and evolutionaryinformation (below).

Grammatical rules reflect global aspects ofmembrane regions. The lipid bilayerconstrains the structure of the membrane-passing regions of proteins in many ways.TMHMM pioneered building models ofpredicted membrane proteins considering avariety of such constraints in one consistentmethodology [52]. A similar concept wasimplemented in HMMTOP [53]. TMHMMimplements a cyclic HMM with seven states fortransmembrane-helix (TMH) core, TMH-capson the N- and C-terminal sides, non-membraneregions on the cytoplasmic side, two non-membrane regions on the non-cytoplasmicside, and a globular domain state in the middleof each non-membrane region. In contrast,HMMTOP is an HMM representing thefollowing five structural states: inside non-membrane region, inside TMH-cap, membranehelix, outside TMH-cap, and outside non-membrane region. Conceptually, this model issimilar to the one used in MEMSAT [54]. Itdiffered in the placement and interpretation ofTMH-caps, which Tusnady et al. interpret asnot being in the membrane [53].

4.5 Transmembrane strands

Basic concept. Beta-barrel membrane proteinsare found in the outer membranes (OMs) ofGram-negative bacteria, mitochondria andchloroplasts. In prokaryotes, they mediate non-specific, passive transport of ions and smallmolecules or can selectively pass moleculessuch as maltose and sucrose [55]. Ineukaryotic organelles, beta-barrel membraneproteins have been suggested to be involvedin voltage-dependent anion channels. Thiswide range of functions is associated with awide range of structural variants: barrels with

11

as few as 8 and as many as 22 strands. Porinsalso contain a central channel that is partiallyblocked by a loop that folds inwardly and isattached to the inner side of the barrel wall.Currently, high-resolution structures are onlyavailable for bacterial OM proteins. Unlike forhelical membrane proteins, there are no simplelow-resolution experiments that yield largeamounts of data for beta-barrel membraneproteins. This has constrained the ability todevelop prediction methods. Many beta-strands contain alternating hydrophobic andhydrophilic side-chains. However, this simplerule usually does not suffice to identifymembrane strands . All early attempts topredict membrane strands employed theamphipathicity and hydrophobicity of beta-strands. Unfortunately, membrane strandshave no long stretch of consecutivehydrophobic residues. In fact, the overallhydrophobicity for beta-barrel membraneproteins is similar to that of soluble proteins.Cowan and colleagues suggested to use themean hydrophobicity of one side of a strand byaveraging over hydrophobic moments [56] ofevery second residue within a sliding window[57]. Gromiha and colleagues combined aminoacid preferences for beta-strands with thesurrounding hydrophobicity of the respectiveresidues to predict beta-strands [58].

Non-linear statistics enables to predictmembrane beta-strands. Diederichs andcolleagues proposed to use a neural networkto predict the topology of the bacterial OMbeta-barrel proteins and to locate residuesalong the axes of the pores [59]. The neuralnetwork predicts the z-coordinate of C-alphaatoms in a coordinate frame with the outermembrane in the xy-plane, such that low z-values indicate periplasmic turns, medium z-values indicate transmembrane strands, andhigh z-values indicate extracellular loops. Mostrecently, Jacoboni and colleagues applied amethod combining neural networks anddynamic programming to predict the location ofmembrane strands [60]. The networks usedalignment information as input and predictedwhether or not a particular residue is part of amembrane strand. In the second step, themethod simply finds the optimal path throughthe network prediction, much like the methodsapplied to predict membrane helical proteins[54; 37]. Finally, the topology is assignedbased on the location of the longest loop thatis taken to be exterior. The authors estimatedthat their system correctly predicts about 93%

of all known membrane-strands. It is not clearwhether or not the estimates from Diederichset al. and Jacoboni et al. will hold true for allbeta-strand membrane proteins.

5 Prediction in 2D

5.1 Synopsis

It is observed that sequence distant mutationsin a structural family are found to have highercorrelation if neighboring in structure. Thissecond order statistic is used to predict inter-residue contacts (2D prediction or contactprediction). A special case, inter-strandcontacts, is a simpler prediction problem.Overall, the field of contact prediction is in itsbeginning, and accuracy still rather low.However, such predictions can refine many 1Dand 3D prediction algorithms, and their usesare actively being explored.

5.2 Inter-residue contacts

Prediction problem is a hard one, but thestakes are high. Given all inter-residuecontacts or distances (Fig. 1), 3D structure canbe reconstructed by distance geometry ormolecular dynamics. This is used for thedetermination of 3D structures by nuclearmagnetic resonance (NMR) spectroscopywhich produces experimental data of distancesbetween protons [61]. Can inter-residuecontacts be predicted? Obviously, somefraction of these contacts can be: helices andstrands can be assigned based on hydrogen-bonding pattern between residues. Thus, asuccessful prediction of secondary structureimplies a successful prediction of some fractionof all the contacts. However, contactspredicted from secondary structure assignmentare short-ranged, i.e., between residuesnearby in sequence. For a successfulapplication of distance geometry, long-rangecontacts have to be predicted, i.e., contactsbetween residues far apart in sequence. A fewmethods have been proposed for theprediction of long-range inter-residue contacts.Two questions surround such methods: (1)how accurate are these prediction methods onaverage and (2) are all the important contactspredicted? The answers appear to havedeterred many researchers from this importantsub-field of protein structure prediction:

12

accuracy is low and only some of the importantcontacts are correctly predicted. Nevertheless,contact predictions help to predict folds [62;63] and can be used to predict protein-proteininteractions [64].

Correlated mutations and neural networks. Insequence alignments, some pairs of positionsappear to co-vary in a physico-chemicallyplausible manner, i.e., a 'loss of function' pointmutation is often rescued by an additionalmutation that compensates for the change[65]. One hypothesis is that compensationswould be most effective in maintaining astructural motif if the mutated residues werespatial neighbors. Attempts have been madeto quantify such a hypothesis and to use it forcontact predictions. In general, predictionaccuracy is rather poor, with a direct trade-offbetween predicting enough contacts, andpredicting only correct ones, e.g., taking 5% ofthe best-predicted long-range contacts(sequence separation above 10 residues) theaccuracy prediction is about 50% [66]. Thesignal from correlated mutations is somehowindependent of the signal from evolutionaryconservation [67]. Another obvious approach isto apply neural networks to the problem.Recently, neural networks were combined withcorrelated mutations to yield the currently mostaccurate prediction tool [68].

5.3 Inter-strand contacts

Identifying the correct b-strand alignment. Theonly method published for predicting inter-strand contacts is based on potentials ofmean-force [69] similar to those used in theevaluation of strand-strand threading.Propensities are compiled by database countsfor 2 ¥ 2 ¥ 2 classes (parallel/anti-parallel, H-bonded/not H-bonded, N-/C-terminal). Each ofthe eight classes is divided further into fivesub-classes in the following way. Suppose thetwo strand residues at positions i and j are inclose in space. Then the following five residuepairs are counted in separate tables: i/j-2, i/j-1,i/j, i/j+1, i/j+2. Such pseudo-potentials identifythe correct b-strand alignment in 35-45% of thecases.

Using evolutionary information to predict inter-strand contacts. Even if the locations ofstrands in the sequence are known exactly, thepseudo-potentials cannot predict the correctinter-strand contacts in most cases [69].

However, when using multiple alignmentinformation, the signal-to-noise ratio increasessuch that inter-strand contacts have beenpredicted correctly for most of the strandsinspected in some test cases. For the purposeof reliable contact prediction, this result isinadequate, but in some cases is still highenough to be useful for approximate modelingof 3D structure. Implementing rules improvesthe accuracy [70]; elaborate recurrent neuralnetworks improve the prediction of beta-strandcontacts even further [71]. However, all thesemethods do not systematically combineclusters of predicted contacts. This task wassolved systematically by implementing aHopfield-like neural network [72]: expert-rulebased constraints (e.g. each strand can onlybe contacting two other strands) are optimizedto predict the entire contact map for beta-strand proteins. Unfortunately, none of theseapparently helpful prediction methods ispublicly available at the moment.

6 Prediction in 3D

6.1 Synopsis

The ultimate goal of structure prediction is anatomic resolution (3D) model of a givensequence. This goal can only be achievedwhen the sequence has a close homologue ofknown structure. Both comparative modelingand threading combine statistical informationand simplified models for the energy andentropy of folding to evaluate the fitness of asequence to a particular structure. The majorhurdles for both endeavors are (1) identifyhomologue of known structure, (2) align thehomologue, and (3) adjust the final structure tooptimize its fitness. If no homologue of knownstructure exists, ab initio prediction is the onlyoption. However, most ab initio methods arestill being developed and trained on proteinswith homologues of known structure.

6.2 Known folds: comparativemodeling

Basic concept. Proteins with similar sequencehave similar structure (Fig. 4). The basictopology of a 3D structure is often describedthrough a simple sketch (Fig. 1) that isintuitively referred to as 'the fold'. Experts canrecognize the similarity of two structures by

13

visually classifying them into folds (as e.g.done in the SCOP database [73]; note that theconcept of 'fold similarity' is at the base of thedata presented in Fig. 4). This is the pillar forthe success of comparative modeling [74-77].The principal idea is to model the structure ofU (protein of unknown structure) based on thetemplate of a sequence homologue of knownstructure (T). Consequently, the preconditionfor comparative modeling is that a sequencehomologue of known structure is found in PDB.Since comparative modeling is currently theonly theoretical means to successfully predict3D structure, this has two implications. First,comparative modeling is applicable to 'only' 5-40% of the known protein sequences (Fig. 2).Second, as the template of a homologue isrequired, no unique 3D structure can bepredicted by comparative modeling, i.e., nostructure that has no similarity to anyexperimentally determined 3D structure.

High level of sequence identity: atomicreso lu t i on . The basic assumption ofcomparative modeling is that U and T haveidentical backbones (main chain C-alpha). Themodeling task then becomes to correctly placethe side chains of U into the backbone of T.For very high levels of sequence identitybetween U and T (ideally differing by oneresidue only), side chains can be 'grown'during molecular dynamics simulations [78].For slightly lower levels (still of high sequencesimilarity), side chains are built based onsimilar environments in known structures [79].Rotamer libraries (libraries containing all side-chain orientations observed in knownstructures) are used. Over the whole range ofsequence identity between U and T for whichcomparative modeling is applicable, theaccuracy of the model drops with decreasingsimilarity. For levels of at least 60% sequenceidentity, the resulting models are quiteaccurate [79; 80], for even higher values, themodels are as accurate as is experimentalstructure determination.

Low level of sequence identity: loop regionssometimes correct. With decreasing sequenceidentity between the known structure H and thequery protein U the number of loops that haveto be inserted to align the two grows. Anaccurate modeling of loop regions, however,implies solving the structure predictionproblem. The problem is simplified in twoways. First, loop regions are often relativelyshort and can thus be simulated by molecular

dynamics (note the CPU time required formolecular dynamics simulations growsexponentially with the number of residues ofthe polypeptide to be modeled). Second, theends of the loop regions are fixed by thebackbone of the template structure. Variousmethods are employed to model loop regions.The best have the orientation of the loopregions correct in some cases [80]. Thisillustrates the current limitations of moleculardynamics: not even short loop regions can bepredicted from sequence. Furthermore, forexperimental structure refinement (use ofmolecular dynamics to improve consistency,and accuracy of experimental data) moleculardynamics is successfully applied to find abetter solution when starting from an almostcorrect structure. However, for comparativemodeling, molecular dynamics refinementusually reduces prediction accuracy [80].Below about 40% sequence identity theaccuracy of the sequence alignment used asbasis for comparative modeling becomes anadditional problem. Nevertheless, even downto levels of 25-30% sequence identity,comparative modeling produces coarse-grained models for the overall fold of proteinsof unknown structure.

6.3 Known folds: remote homologymodeling (threading)

Basic concept. As noted above, the majority ofprotein pairs with similar structure populate themidnight zone in which the sequence signalalone does not suffice to detect the relation(Fig. 4). If we knew that a protein of unknownstructure (U ) has a similar structure as aprotein of known structure (T ) that has nodetectable sequence similarity to U, we couldbuild the 3D structure of U by (remote)homology modeling based on the template ofT . Thus, remote homology modeling mustsolve three tasks: (1) The remote homologue(T) has to be detected, (2) U and T have to becorrectly aligned, and (3) the homologymodeling procedure has to be tailored to theharder problem of extremely low sequenceidentity (with many loop regions to bemodeled). Methods that address these goalsare often referred to as 'threading' or 'foldrecognition' techniques. Most methodsdeveloped over the last decade have beenprimarily optimized to solve the first twoproblems. The basic idea is to thread the

14

sequence of U into the known structure of Tand to evaluate the fitness of sequence forstructure by some kind of environment-basedor knowledge-based potential [81; 15].Threading is in some respects a harderproblem than is the prediction of 3D structure.However, the stakes are high: solving thethreading problem could enable the predictionof thousands of protein structures. Indeed,threading has evolved to become one of themost active fields in the arena of proteinstructure prediction. The field of threading hasgrown beyond what can be covered in anydepth in this review. Thus, we onlysummarized some of the major findings.

Remote homologues can often be detected.First the good news: since the differentthreading approaches which have beenproposed capture different aspects of proteinstructure, the correct remote homologue islikely to be found by at least one of them [82].Now the bad news: so far, no single methodhas been able to detect the correct remotehomologue for more than half of all test cases[82; 66]. For the methods that have beentested rigorously evaluated, the correct remotehomologue is detected in less than about 40%of all cases [82; 66]. However, thisperformance is clearly superior to that oftraditional sequence alignments at this lowlevel (<25%) of sequence identity.

3D prediction by threading is still not reliable.Detecting the remote homology is only the firstof the three obstacles. It appears that thesecond obstacle (correct alignment) is muchmore difficult and, unfortunately, there is nogeneral solution so far [82; 66]. Thus the finalstep, building a 3D model, usually fails sincethe modeling procedures available todaycannot correct the mistakes in the alignments.Currently, the successful use of threadingmethods requires skeptical, expert userintervention to spot wrong hits and falsealignments.

6.4 Unknown folds: ab initioprediction of structure?

Recent breakthrough in structure prediction?In the 1994 Asilomar meeting, none of the 3Dab initio methods were able to predict thecorrect protein structure [80]. Since that time,new methods have been proposed whichindicate possible directions for the future.

Several groups have obtained promisingresults using distance geometry methods [46].Simplified force-fields in combination withdynamic optimization strategies have yieldedpromising, but still relatively inaccurate results[83; 84]. Srinivasan and Rose have reportedvery encouraging results with their hierarchicalsearch method [85]. Most recently methodshave successfully predicted some coarse-grained features of 3D structures byassembling fragments taken from the largedatabase of protein structures [86-89].

Accurate prediction of 3D structure for coiled-coil proteins. A particular class of proteins arecoiled-coils. These are proteins can be definedby a rather simple geometry of long helices, ofwhich two or more wind around one another[90]. Nilges and Brünger [91] have achievedatomic-accuracy in an ab initio prediction of theGCN4 leucine zipper using a hybrid moleculardynamics/simulated annealing search strategy.Recently, equally accurate models for threeleucine zippers were obtained with fastercalculations based on mean-force-potentials[92].

Recognizing incorrect structures. The singlemost important theoretical advance in 3Dprediction in recent years may have been thedevelopment of mean-force-potentials. Beforethese potentials, structure prediction wasnormally done with 'physical' potentials, i.e.,bonds, angles, torsion angles, and van derWaals as well as electrostatic non-bondedterms describing the internal energy of themolecule. In contrast, the mean-force-potentials, derived from databases of proteinstructure, attempt to describe the free energyof the molecule. The physical potentials havebeen used very successfully to refineexperimentally determined structures [61].However, these terms cannot distinguishbetween a native fold and a grossly misfoldedstructure. In contrast, mean-force-potentials ofpairwise residue distances are quite successfulin fold recognition, as well as remote homologymodeling [15]. It remains to be seen how bestto combine these two different potentials. Inone pilot study on the use of mean-force-potentials for 3D structure prediction, bestresults where obtained by combining bothpotentials (O'Donoghue and Nilges, privatecommunication).

15

7 Conclusions

Native 3D structures of proteins are encodedby a linear sequence of amino acid residues.To predict 3D structure from sequence is atask challenging enough to have occupied ageneration of researchers. Have they finallysucceeded in their goal? The bad news is: no,we still cannot predict structure for anysequence. The good news are: we have comecloser, and growing databases facilitate thetask.

Sequence alignment. Database comparisonsare at the heart of any method attempting tomake use of today's data. In particular, manymethods addressed at predicting aspects ofprotein structure and function implementsequence alignment algorithms. Although thebasic algorithms for alignments wereintroduced decades ago, recent advancementshave improved the performance of thesealgorithms substantially. One challenge thatstill remains is the development of a statisticalsystem to assess alignment and databasesearching tools. This is of particular importancein the so-called twilight zone.

Predictions in 1D: significant improvement bylarger databases. The rich informationcontained in the growing sequence andstructure databases has been used to improvethe accuracy of predictions of some aspects ofprotein structure. Evolutionary information issuccessfully used for predictions of secondarystructure, solvent accessibil i ty, andtransmembrane helices. These predictions ofprotein structure in 1D are significantly moreaccurate and more useful than five years ago.Some methods have indicated that 1Dpredictions can be useful as an intermediatestep on the way to predicting 3D structure(inter-strand contacts; prediction-basedthreading). Another advantage of predictions in1D is that they are not very CPU-intensive, i.e.,1D structure can be predicted for the proteinsequence of, for example, entire yeastchromosomes overnight.

Predictions in 2D and 3D: so far of limitedsuccess. The prediction accuracy of chain-distant inter-residue contacts is so far relativelylimited. Analysis of correlated mutations can beused to distinguish between alternative models(e.g. for threading techniques). The prediction

of inter-strand contacts appears to be useful insome cases. An accurate method for theautomatic prediction of contacts betweenresidues not close in sequence remains to bedeveloped. Most breakthroughs in proteinstructure prediction were achieved over thelast six years. Thus, although we still cannotsolve the general prediction problem, progresshas been made. In particular, in the last fewyears the idea of replacing, or at leastcombining, the physical-theoretical energycalculations with knowledge derived ones,yielded exciting results.

8 Acknowledgements

Thanks to Phil Carter and Hedi Hegyi (bothColumbia University) for helpful comments.Particular thanks to Volker Eyrich and IngridKoh (Columbia) for programming andmaintaining most of the immensely valuablesoftware that runs the EVA and META-PredictProtein servers, and to the EVA teamsfrom Rockefeller University (Marc Marti-Renom, Andrej Sali and group) and fromMadrid (Sito Pazos and Alfonso Valencia). Thiswork was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from theNational Institute of Health (NIH), and the grantDBI-0131168 from the National ScienceFoundation (NSF). Last, not least, thanks to allthose who deposit their experimental data inpublic databases, and to those who maintainthese databases.

9 References

1. H.M. Berman; J. Westbrook; Z. Feng; G.Gillliland; T.N. Bhat; H. Weissig; I.N. Shindyalovand P.E. Bourne, Nucl. Acids Res. 2000, 28,235-242.

2. A. Bairoch and R. Apweiler, Nucl. Acids Res.2000, 28, 45-48.

3. C. Brändén and J. Tooze, Introduction toProtein Structure, Garland Publ., New York,London, 1991.

4. E.E. Lattman and G.D. Rose, Proc. Natl. Acad.Sci. U.S.A. 1993, 90, 439-441.

5. C.B. Anfinsen, Science 1973, 181, 223-230.6. M.E. Gottesman and W.A. Hendrickson, Curr.

Opin. Microbiol. 2000, 3, 197-202.7. M. Levitt and A. Warshel, Nature 1975, 253,

694-698.8. A.T. Hagler and B. Honig, Proc. Natl. Acad. Sci.

U.S.A. 1978, 75, 554-558.

16

9. B. Honig and F.E. Cohen, Folding & Design1996, 1, R17-R20.

10. J. Liu and B. Rost, Prot. Sci. 2001, 10, 1970-1979.

11. D. Frishman and Mewes, TIGS 1997, 13, 415-416.

12. B. Rost, Prot. Engin. 1999, 12, 85-94.13. J. Liu and B. Rost, Bioinformatics 2002, 18,

922-933.14. B. Rost, Folding & Design 1997, 2, S19-S24.15. M.J. Sippl, Curr. Opin. Str. Biol. 1995, 5, 229-

235.16. B. Rost, Curr. Opin. Str. Biol. 2002, 12, 409-

416.17. J. Liu; H. Tan and B. Rost, J. Mol. Biol. 2002,

322, 53-64.18. S.B. Needleman and C.D. Wunsch, J. Mol. Biol.

1970, 48, 443-453.19. T.F. Smith and M.S. Waterman, J. Mol. Biol.

1981, 147, 195-197.20. C. Sander and R. Schneider, Proteins 1991, 9,

56-68.21. D.G. Higgins; J.D. Thompson and T.J. Gibson,

Meth. Enzymol. 1996, 266, 383-402.22. S.F. Altschul and W. Gish, Meth. Enzymol.

1996, 266, 460-480.23. W.R. Pearson, Meth. Enzymol. 1996, 266, 227-

258.24. S. Altschul; T. Madden; A. Shaffer; J. Zhang; Z.

Zhang; W. Miller and D. Lipman, Nucl. AcidsRes. 1997, 25, 3389-3402.

25. D.G. Higgins, Adv. Prot. Chem. 2000, 54, 99-135.

26. R.E. Dickerson; R. Timkovich and R.J. Almassy,J. Mol. Biol. 1976, 100, 473-491.

27. G.J. Barton, Protein sequence alignment anddatabase scanning in Protein structureprediction, M.J.E. Sternberg (Eds), Oxford Univ.Press, Oxford, 1996.

28. S.R. Eddy, Bioinformatics 1998, 14, 755-763.29. K. Karplus; C. Barrett and R. Hughey,

Bioinformatics 1998, 14, 846-856.30. P.Y. Chou and G.D. Fasman, Annu. Rev.

Biochem. 1978, 47, 251-276.31. G.D. Fasman, The development of the

prediction of protein structure in Prediction ofprotein structure and the principles of proteinconformation, G.D. Fasman (Eds), PlenumPress, New York, London, 1989.

32. G.E. Schulz, Annu. Rev. Biophys. Biophys.Chem. 1988, 17, 1-21.

33. B. Rost and C. Sander, Methods in MolecularBiology 2000, 143, 71-95.

34. H. Bohr; J. Bohr; S. Brunak; R.M.J. Cotterill; B.Lautrup; L. Nørskov; O.H. Olsen and S.B.Petersen, FEBS Lett. 1988, 241, 223-228.

35. N. Qian and T.J. Sejnowski, J. Mol. Biol. 1988,202, 865-884.

36. B. Rost, J. Struct. Biol. 2001, 134, 204-218.37. B. Rost, Meth. Enzymol. 1996, 266, 525-539.38. D.T. Jones, J. Mol. Biol. 1999, 292, 195-202.

39. M. Young; K. Kirshenbaum; K.A. Dill and S.Highsmith, Prot. Sci. 1999, 8, 1752-1764.

40. K. Kirshenbaum; M. Young and S. Highsmith,Prot. Sci. 1999, 8, 1806-1815.

41. F.E. Cohen and S.R. Presnell, Thecombinatorial approach in Protein structureprediction, M.J.E. Sternberg (Eds), Oxford Univ.Press, Oxford, 1996.

42. B.K. Lee and F.M. Richards, J. Mol. Biol. 1971,55, 379-400.

43. J.A. Cuff and G.J. Barton, Proteins 2000, 40,502-511.

44. H. Naderi-Manesh; M. Sadeghi; S. Arab andA.A. Moosavi Movahedi, Proteins 2001, 42,452-459.

45. P. Fariselli and R. Casadio, Bioinformatics2001, 17, 202-204.

46. B. Rost and S.I. O'Donoghue, CABIOS 1997,13, 345-356.

47. I. Simon; A. Fiser and G.E. Tusnady, BiochimBiophys Acta 2001, 1549, 123-136.

48. S. Möller; D.R. Croning and R. Apweiler,Bioinformatics 2001, 17, 646-653.

49. C.P. Chen; A. Kernytsky and B. Rost, Prot. Sci.2002, 11, 2774-2791.

50. B. Persson and P. Argos, Prot. Sci. 1996, 5,363-371.

51. B. Rost; R. Casadio and P. Fariselli, Prot. Sci.1996, 5, 1704-1718.

52. A. Krogh; B. Larsson; G. von Heijne and E.L.Sonnhammer, J. Mol. Biol. 2001, 305, 567-580.

53. G.E. Tusnady and I. Simon, J Chem Inf ComputSci 2001, 41, 364-368.

54. D.T. Jones; W.R. Taylor and J.M. Thornton,Biochem. 1994, 33, 3038-3049.

55. J.E.W. Meyer; M. Hofnung and G.E. Schulz, J.Mol. Biol. 1997, 266, 761-775.

56. D. Eisenberg; E. Schwartz; M. Komaromy andR. Wall, J. Mol. Biol. 1984, 179, 125-142.

57. T. Schirmer and S.W. Cowan, Prot. Sci. 1993,2, 1361-1363.

58. M.M. Gromiha; R. Majumdar and P.K.Ponnuswamy, Prot. Engin. 1997, 10, 497-500.

59. K. Diederichs; J. Freigang; S. Umhau; K. Zethand J. Breed, Prot. Sci. 1998, 7, 2413-2420.

60. I. Jacoboni; P.L. Martelli; P. Fariselli; V. DePinto and R. Casadio, Prot. Sci. 2001, 10, 779-787.

61. M. Nilges, Curr. Opin. Str. Biol. 1996, 6, 617-623.

62. A.R. Ortiz; A. Kolinski; P. Rotkiewicz; B.Ilkowski and J. Skolnick, Proteins 1999, Suppl3, 177-185.

63. F. Pazos; B. Rost and A. Valencia,Bioinformatics 1999, 15, 1062-1063.

64. A. Valencia and F. Pazos, Curr. Opin. Str. Biol.2002, 12, 368-373.

65. D. Altschuh; A.M. Lesk; A.C. Bloomer and A.Klug, J. Mol. Biol. 1987, 193, 693-707.

66. V. Eyrich; M.A. Martí-Renom; D. Przybylski; A.Fiser; F. Pazos; A. Valencia; A. Sali and B.Rost, Bioinformatics 2001, 17, 1242-1243.

17

67. O. Olmea; B. Rost and A. Valencia, J. Mol. Biol.1999, 293, 1221-1239.

68. P. Fariselli; O. Olmea; A. Valencia and R.Casadio, Prot. Engin. 2001, 14, 835-843.

69. T.J.P. Hubbard, Use of b-strand interactionpseudo-potential in protein structure predictionand modelling in 27th Hawaii InternationalConference on System Sciences, L. Hunter(Eds), IEEE Society Press, Maui, Hawaii, USA,1994.

70. H.A. Nagarajaram; B.V. Reddy and T.L.Blundell, Prot. Engin. 1999, 12, 1055-1062.

71. P. Baldi; G. Pollastri; C.A. Andersen and S.Brunak, Ismb 2000, 8, 25-36.

72. M. Asogawa, Ismb 1997, 5, 48-51.73. L. Lo Conte; S.E. Brenner; T.J. Hubbard; C.

Chothia and A.G. Murzin, Nucl. Acids Res.2002, 30, 264-267.

74. R.L. Dunbrack Jr, Proteins 1999, 37, 81-87.75. A.S. Yang and B. Honig, J. Mol. Biol. 2000, 301,

679-689.76. P.A. Bates and M.J. Sternberg, Proteins 1999,

37, 47-54.77. F. Melo; R. Sanchez and A. Sali, Prot. Sci.

2002, 11, 430-448.78. M. Karplus and G.A. Petsko, Nature 1990, 347,

631-639.79. A.C.W. May and T.L. Blundell, Curr. Opin.

Biotech. 1994, 5, 355-360.80. J. Moult; J.T. Pedersen; R. Judson and K.

Fidelis, Proteins 1995, 23, ii-iv.81. M.J. Sippl; P. Lackner; F.S. Domingues; A.

Prlic; R. Malik; A. Andreeva and M.Wiederstein, Proteins 2001, Suppl, 55-67.

82. J.M. Bujnicki; A. Elofsson; D. Fischer and L.Rychlewski, Prot. Sci. 2001, 10, 352-361.

83. A. Elofsson; S.M. Le Grand and D. Eisenberg,Proteins 1995, 23, 73-82.

84. J.T. Pedersen and J. Moult, Curr. Opin. Str.Biol. 1996, 6, 227-231.

85. R. Srinivasan and G.D. Rose, Proteins 1995,22, 81-99.

86. D. Baker and A. Sali, Science 2001, 294, 93-96.87. A.M. Lesk; L. Lo Conte and T.J.P. Hubbard,

Proteins 2001, 45 Suppl 5, 98-118.88. D.T. Jones, Proteins 2001, 45 Suppl 5, S127-

S132.89. C. Bystroff; V. Thorsson and D. Baker, J. Mol.

Biol. 2000, 301, 173-190.90. A. Lupas, TIBS 1996, 21, 375-382.91. M. Nilges and A.T. Brünger, Proteins 1993, 15,

133-146.92. S.I. O'Donoghue and M. Nilges, Folding &

Design 1997, 2, S47-S52.93. W. Kabsch and C. Sander, Biopolymers 1983,

22, 2577-2637.94. M. Scharf, CONAN (CONtact ANalysis),

Heidelberg, 1988.95. P. Kraulis, J. Appl. Cryst. 1991, 24, 946-950.96. L. Holm and C. Sander, Nucl. Acids Res. 1999,

27, 244-247.97. B. Rost, J. Mol. Biol. 2002, 318, 595-608.

Predict protein structure through evolution - Rostlab · PDF filePredict protein structure...

Documents

Transcript of Predict protein structure through evolution - Rostlab · PDF filePredict protein structure...