
Homology Modeling

Introduction

What is Homology?

General Procedures

Identifying Homologues

Aligning Sequences

Identification of Structurally Conserved and Structurally Variable Regions

Generating Coordinates for the Unknown Structure

Databases of Structures from Homology Modeling

Automated Web-Based Homology Modeling

Evaluation and Refinement of the Structure

Class-Directed Structure Determination

On-line Resources

Printed References

Introduction
With the development of techniques in molecular biology that allow rapid identification, isolation, and sequencing of genes, we are now able to infer the sequences of many proteins. However, obtaining the three-dimensional structures of these proteins is still a time-consuming task. A major goal of structural biology is to predict the three-dimensional structure from the sequence, a goal that has not yet been realized. Thus, alternative strategies are being applied to develop models of protein structure when the constraints from X-ray diffraction or NMR are not yet available.

One method that can be applied to generate reasonable models of protein structures is homology modeling. This procedure, also termed comparative modeling or knowledge-based modeling, develops a three-dimensional model from a protein sequence based on the structures of homologous proteins. Several reviews on this topic have appeared [1-5]. In the description that follows, some aspects of homology modeling that you may find useful in this course and in your research are discussed.


What is Homology?
Care must be used in applying the term "homology modeling." In fact, as noted above, some authors prefer alternative names for the procedure. One must recognize that homology does not necessarily imply similarity. Homology has a precise definition: having a common evolutionary origin [6,7]. Thus, homology is a qualitative description of the nature of the relationship between two or more things, and it cannot be partial. Either there is an evolutionary relationship or there is not. An assertion of homology usually must remain a hypothesis. Supporting data for a homologous relationship may include sequence or three-dimensional similarities, which, unlike homology itself, can be described in quantitative terms.

An observation of importance in homology modeling is that for a set of proteins that are hypothesized to be homologous, their three-dimensional structures are conserved to a greater extent than are their primary structures. This observation has been used to generate models of proteins from homologues with very low sequence similarities. Thus, in homology modeling, we are attempting to develop models of an unknown from homologous proteins. These proteins will have some measure of sequence similarity but we are relying on the conservation of folds among homologues to guide us as well.

General Procedures
The steps to creating a homology model are as follows:

  • identify homologous proteins and determine the extent of their sequence similarity with one another and the unknown
  • align the sequences
  • identify structurally conserved and structurally variable regions
  • generate coordinates for core (structurally conserved) residues of the unknown structure from those of the known structure(s)
  • generate conformations for the loops (structurally variable) in the unknown structure
  • build the side-chain conformations
  • refine and evaluate the unknown structure.

Identifying Homologues
Several computerized search methods are available to assist in identifying homologues. In most cases of homology modeling, we have the sequence of a protein for which we want to model the three-dimensional structure (the unknown). We then apply sequence search methods to identify proteins with which the unknown has some degree of sequence similarity and for which the three-dimensional structures are available. We then assume that these proteins are homologous with our unknown and use the three-dimensional structures of these proteins to develop a model of the structure of our unknown. Ideally, one will have several homologues with which to develop a homology model, but modeling can be done with only one known structure.

Sequence comparison also is applied when attempting to identify possible functions of an uncharacterized protein for which the sequence has been deduced from a DNA sequence. For example, one can search for motifs that distinguish a protein family, such as residues critical to binding or catalysis. The PROSITE database contains many protein patterns that are characteristic of particular families of proteins.
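A motif search of this kind can be sketched in a few lines. The example below is only an illustration, not the PROSITE scanning software: it assumes the common PROSITE pattern conventions (elements separated by "-", "x" for any residue, "[ST]" for alternatives, "{P}" for exclusions, and "(n)" or "(n,m)" for repeats), does not handle the "<" and ">" terminus anchors, and uses a made-up test sequence. The function names are hypothetical.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate an optional repeat count such as "(2)" or "(2,4)".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # x matches any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {P} means any residue except P
        # [ST] is already valid regex syntax; single letters pass through.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every match, overlaps included."""
    # A lookahead with a capture group reports overlapping occurrences.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# N-glycosylation site pattern applied to a made-up sequence.
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```

A short pattern such as this one matches many unrelated proteins by chance, so a hit is a hint about function rather than evidence of it.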

Although less common, some cases do arise in which the three-dimensional structure of a protein is known and one wants to identify homologues. In these cases, searches of three-dimensional databases are performed. Because structural folds are conserved to a greater extent than sequence, one may identify homologues with very little sequence similarity. An example of a program that provides this type of database searching is Dali. One submits coordinates of a protein structure, and the program performs a multiple structural alignment with proteins that are in the Protein Data Bank.

Aligning Sequences
A critical step in the development of a homology model is the alignment of the unknown sequence with the homologues. Many methods are available for sequence alignment, and sometimes the most perplexing task is deciding which methods to apply. Access to the methods and databases for sequence alignment has been simplified with the development of programs such as the Biology Workbench. The challenge to the researcher is to understand the options that are applied in alignment so that correct interpretation of results is possible.

Factors to be considered when performing an alignment are (1) which algorithm to use for sequence alignment, (2) which scoring method to apply, and (3) whether and how to assign gap penalties.

Algorithms for Alignments


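Sequence alignments generally are based on the dynamic programming algorithm of Needleman and Wunsch [8], which produces global alignments. Current methods include Smith-Waterman, a dynamic programming variant for local alignments, and the faster heuristic search programs FASTA and BLASTP. The original BLASTP differed from the other two in not allowing gaps, although later versions of BLAST produce gapped alignments.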
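The dynamic programming idea can be illustrated with a minimal global alignment routine. This is a textbook-style sketch rather than the procedure exactly as published in [8]: it assumes a simple match/mismatch score and a linear gap penalty, and the default values are arbitrary choices for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),  # align residues
                              score[i - 1][j] + gap,                   # gap in b
                              score[i][j - 1] + gap)                   # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```

In practice the match/mismatch score would be replaced by a lookup in a 20x20 substitution matrix, and the gap penalty would usually distinguish opening a gap from extending one.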

Scoring Alignments


Scoring of alignments typically involves construction of a 20x20 matrix in which identical amino acids and those of similar character (i.e., conservative substitutions) may be scored higher than those of different character. Four general types of scoring have been applied to alignments:

  • Identity: considers only identical residues
  • Genetic Code: considers the number of base changes in DNA or RNA to interconvert the codons for the amino acids
  • Chemical Similarity: considers the physico-chemical properties (e.g., polarity, size, charge) with greater weight given to alignment of similar properties
  • Observed Substitutions: considers substitution frequencies observed in alignments of sequences.


The substitution schemes are generally considered to be the best methods for scoring alignments. These methods are based on an analysis of the frequency with which a given amino acid is observed to be replaced by other amino acids among proteins for which the sequences can be aligned.

PAM Matrices
One of the first substitution scoring schemes to be developed was the Dayhoff mutation data matrix. Dayhoff and co-workers [9-11] developed this method during analysis of the evolution of proteins. The mutation probability matrix that they derived gives the probability of one amino acid mutating to a second amino acid within a particular evolutionary time. The scoring schemes are denoted PAM (Point Accepted Mutation, sometimes expanded as Percent Accepted Mutations) followed by a number that gives the evolutionary distance in accepted point mutations per 100 residues. For example, if alignments were scored using PAM40 and PAM250, the lower PAM matrix would recognize short alignments of highly similar sequences and the higher PAM matrix would find longer, weaker local alignments. PAM250 corresponds to an evolutionary distance at which only about 20% of the amino acids are expected to remain unchanged, which is near the limit at which sequences can still be recognized as related.
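The relationship between PAM matrices of different numbers can be illustrated by raising a one-step mutation probability matrix to a power. The sketch below uses a hypothetical three-letter alphabet with made-up, equal mutation probabilities; the real Dayhoff matrices are 20x20 and use observed mutation and residue frequencies, so the numbers printed here have no biological meaning.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = matrix
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Hypothetical one-step matrix for a toy alphabet of three residue types:
# each residue is kept with probability 0.99 and changes to either of the
# other two with probability 0.005 (one accepted change per 100 residues).
TOY_PAM1 = [[0.99, 0.005, 0.005],
            [0.005, 0.99, 0.005],
            [0.005, 0.005, 0.99]]

def fraction_unchanged(distance):
    """Probability that a residue is the same after `distance` toy PAM steps."""
    m = mat_pow(TOY_PAM1, distance)
    return sum(m[i][i] for i in range(3)) / 3  # equal residue frequencies assumed

for distance in (1, 40, 250):
    print(distance, round(fraction_unchanged(distance), 3))
```

The fraction of unchanged positions falls as the distance grows but never reaches zero, because repeated and reverse mutations at the same site are counted in the distance without producing a visible difference.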

BLOSUM Matrices
The substitution matrices derived by Dayhoff and co-workers were based on substitution frequencies from global alignments of very similar sequences. Henikoff and Henikoff [12] extended this approach by developing substitution matrices using local multiple alignments of more distantly related sequences. A database was assembled that contained multiple alignments (without gaps) of short regions of related sequences. These sequences were clustered into groups (blocks) based on their similarity at some threshold value of percentage identity. Blocks substitution matrices (BLOSUM) were derived based on substitution frequencies for all pairs of amino acids within a group. The different BLOSUM matrices were obtained by varying the threshold. For example, a BLOSUM80 matrix is derived using a threshold of 80% identity.
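The core calculation behind a blocks-type matrix, a log-odds score comparing observed pair frequencies with those expected by chance, can be sketched as follows. This is a simplified illustration: the clustering step that distinguishes the different BLOSUM matrices is omitted, the block is a made-up toy example, and the scaling to half-bit units is an assumption taken from common practice.

```python
import math
from collections import Counter
from itertools import combinations

def log_odds_scores(block):
    """Log-odds scores (rounded, half-bit units) from one ungapped block.

    `block` is a list of equal-length aligned sequences. Pairs that are
    never observed get no score, because the logarithm of zero is undefined.
    """
    # Count every residue pair within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        expected = background[x] * background[y] * (1 if x == y else 2)
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores

toy_block = ["ASAC", "ASAC", "ASAC", "ASSC"]  # made-up data
for pair, score in sorted(log_odds_scores(toy_block).items()):
    print(pair, score)
```

Pairs seen more often than chance predicts receive positive scores and pairs seen less often receive negative scores, which is the property an alignment program relies on.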

Of current interest is the development of scoring matrices based on alignments derived from three-dimensional structures. One example is that of Johnson and Overington (JO matrices) [13]. These investigators aligned the three-dimensional structures in 65 homologous sets of proteins. From these structures, 207,795 amino acid replacements were tabulated. The proteins in each homologous set had 15-40% sequence identity, so this substitution matrix should provide a sensitive basis for scoring sequence alignments. They demonstrated that their substitution matrix performed well relative to other matrices.

Choosing a Scoring Matrix
It is not possible to choose one best scoring system for all alignment problems you might undertake. As noted above, Johnson and Overington compared results obtained with 12 different scoring matrices, and Pearson [14] also published a comparison of several scoring methods. In general, different scoring matrices may perform better than others depending on the problem being studied and the conditions used for alignments. In any case, you will need to select the alignment algorithm, scoring matrix, and gap penalty when doing your alignments. You also will need to decide if you want to do local and/or global alignments. One advantage of local alignments is that they do not make the assumption that the unknown protein and the database sequence are of similar length.
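The effect of sequence length on global and local scores can be seen in a small comparison. The sketch below computes scores only (no traceback) and assumes a simple match/mismatch score with a linear gap penalty; the sequences and parameter values are made up for illustration.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best global (end-to-end) alignment score of a and b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score: cells are floored at zero, so an
    alignment may start and end anywhere in either sequence."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

short = "HEAGAW"                 # made-up query
long_ = "MKKLLPTHEAGAWSVVRRQ"    # made-up database sequence containing the query
print("global:", global_score(short, long_))
print("local: ", local_score(short, long_))
```

The global score is dominated by the gaps needed to span the longer sequence, whereas the local score reflects only the matching segment.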

Evaluating the Alignment
A final aspect of sequence alignment that should be considered is evaluation of the accuracy of the alignment. The best way to assess the accuracy is to compare alignments from sequence comparisons with alignments from protein three-dimensional structures. Of course this assessment is possible only if you are working with a family of proteins for which three-dimensional structures are known for at least two members of the family. The alignment obtained by including tertiary structural features provides a set of test alignments against which sequence-only alignments can be compared. Similar conditions can then be applied to sequences from the family for which three-dimensional structures are not available.
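One simple way to quantify such a comparison is the fraction of residue pairs in the structure-based alignment that the sequence-only alignment reproduces. The sketch below assumes that both alignments are given as pairs of gapped strings over the same two sequences; the function names and the example alignments are hypothetical.

```python
def aligned_pairs(row_a, row_b):
    """Set of (index in a, index in b) for residues placed in the same column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs

def fraction_correct(test_alignment, reference_alignment):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference = aligned_pairs(*reference_alignment)
    if not reference:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(*test_alignment) & reference) / len(reference)

structure_based = ("ACDEFG", "AC-EFG")   # made-up reference alignment
sequence_only = ("ACDEFG", "ACE-FG")     # made-up test alignment
print(fraction_correct(sequence_only, structure_based))
```

Alignment parameters (scoring matrix, gap penalties) that give a high value on family members of known structure are reasonable candidates for aligning the members without structures.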

In fact, this approach to evaluation of alignments can be applied during the alignment process. For example, Greer [3] advocates alignment based on superimposition of the three-dimensional structures of the homologues. To extend this approach to the unknown structure, one must use protein structure prediction algorithms to identify possible secondary structural elements.

Identification of Structurally Conserved and Structurally Variable Regions
After the known structures are aligned, they are examined to identify the structurally conserved regions (SCRs) from which an average structure, or framework, can be constructed for these regions of the proteins. Variable regions (VRs), in which each of the known structures may differ in conformation, also must be identified because special techniques must be applied to model these regions of the unknown protein.

When only one known structure is available for homology modeling, it is more difficult to identify the SCRs. Based on analyses of other homologues for which multiple structures are available, we know that the SCRs generally correspond to the elements of secondary structure, such as alpha-helices and beta-sheets, and to ligand- and substrate-binding sites. Thus, these regions are used as the SCRs in the cases where only one structure is available. The VRs usually lie on the surface of the proteins and form the loops where the main chain turns.
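With a single known structure, candidate SCRs can be listed directly from a per-residue secondary structure assignment. The sketch below assumes a string of one-letter codes in which "H" marks helix and "E" marks strand, as in Kabsch and Sander style assignments; the minimum segment length and the example string are arbitrary assumptions, and ligand-binding residues would have to be added separately.

```python
from itertools import groupby

def conserved_segments(ss, core_states="HE", min_length=4):
    """Return (first residue, last residue, state) for each helix or strand run.

    Residue numbers are 1-based positions in the string `ss`.
    """
    segments = []
    position = 0
    for state, run in groupby(ss):
        length = len(list(run))
        if state in core_states and length >= min_length:
            segments.append((position + 1, position + length, state))
        position += length
    return segments

example = "CCHHHHHHCCCEEEEECCHHC"  # made-up secondary structure string
print(conserved_segments(example))
```

Everything outside the returned segments would be treated, at least initially, as variable region.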

Once the known structures are aligned and the SCRs have been identified, one aligns the unknown sequence. Alignment based solely on sequence may be used, though other structural features also may be taken into account. In Quanta, multiple sequence alignment algorithms are available that may be used both when aligning sequences of the known structures and when aligning the sequence of the unknown with the known structures. Four scoring systems are available, each of which may be evaluated during an alignment so that relative statistical weights may be assigned. The four scoring methods are:

  • Sequence homology, in which the scoring system is the same as for conventional Needleman-Wunsch alignment.
  • Secondary structure homology, in which scoring is based on the probability of replacing one secondary structure type by another type. Secondary structural features are based on hydrogen-bonding patterns and main chain torsions, with the definitions of secondary structure types being those of Kabsch and Sander [15]. When applying this method with the unknown, its secondary structure must first be predicted using one of the three methods (Momany, GOR, and Holley/Karplus) available in Quanta.
  • Residue accessibility homology, in which scoring depends on the difference in fractional accessibility between aligned residues.
  • CA-CA distance homology, in which scoring is based on the interatomic distances between the alpha carbon atoms of the aligned residues.

Generating Coordinates for the Unknown Structure
When generating coordinates for the unknown structure, one needs to model main chain atoms and side chain atoms, both in SCRs and VRs. For the SCRs, it is straightforward to generate the coordinates of the main chain atoms of the unknown structure from those of the known structure(s). Side chain coordinates are copied if the residue type in the unknown is identical or very similar to that in the known homologues. For other side chain coordinates one can apply a side chain rotamer library in a systematic approach to explore possible side chain conformations. It may be desirable to weight the contribution of each homologue in each SCR based on the extent of similarity with the unknown. In the event that some coordinates in the unknown are undefined in the SCRs, regularization can be used to build and relax both main chain and side chain atoms in those regions. Note that this procedure should be used only if the region of undefined atoms is one or two residues in length.
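Weighting the homologues when building the framework can be sketched as a weighted average of equivalent atoms. The example assumes that the known structures have already been superimposed and trimmed to the same set of equivalent SCR atoms, and that the weights (for instance, percent identity with the unknown in that SCR) are supplied by the user; the function name is hypothetical.

```python
def weighted_framework(structures, weights):
    """Weighted average of equivalent atom positions.

    structures: one list of (x, y, z) tuples per homologue, all of the
                same length and in the same atom order, already superimposed.
    weights:    one non-negative number per homologue.
    """
    if len(structures) != len(weights) or not structures:
        raise ValueError("need one weight per structure")
    n_atoms = len(structures[0])
    if any(len(s) != n_atoms for s in structures):
        raise ValueError("structures must contain the same number of atoms")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [tuple(sum(w * s[k][axis] for s, w in zip(structures, weights)) / total
                  for axis in range(3))
            for k in range(n_atoms)]

homologue_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]  # made-up coordinates
homologue_2 = [(0.4, 0.2, 0.0), (3.6, 0.4, 0.2)]
print(weighted_framework([homologue_1, homologue_2], [0.75, 0.25]))
```

Averaged coordinates do not in general have ideal bond lengths and angles, so a framework built this way still needs the regularization and refinement steps described in these notes.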

For the VRs, a variety of approaches may be applied in assigning coordinates to the unknown. Recall that these regions will correspond most often to the loops on the surface of the protein. If a loop in one of the known structures is a good model for that of the unknown, then the main chain coordinates of that known structure can be copied. Side chain coordinates of residues that are similar in length and character also may be copied. Rotamer libraries can be used to define other side chain coordinates.

When a good model for a loop cannot be found among the known structures, one can search fragment databases for loops in other proteins that may provide a suitable model for the unknown. A residue range is chosen to include the undefined loop as well as a few residues (e.g., three) on either side of the loop for which coordinates have been defined. Fragments are examined for their ability to fit in the undefined region without making bad contacts with other atoms and to overlap well with the residues on either side of the loop. The loop may then be subjected to conformational searching to identify low energy conformers if desired. Coordinates for side chain atoms in these loop regions may be copied if residues are similar, though it is likely that considerable application of side chain rotamer libraries will be required to define coordinates in these regions.
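A first screen of candidate fragments can compare the geometry of the flanking (anchor) residues without superimposing anything, by comparing distances between alpha carbons. The sketch below is an illustrative assumption rather than the method of any particular modeling package: it ranks fragments by the root-mean-square difference of all pairwise CA-CA distances among the anchor residues, and all coordinates are made up.

```python
import math
from itertools import combinations

def distance_rms_deviation(coords_a, coords_b):
    """RMS difference between corresponding pairwise distances of two atom sets."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two equally sized sets of at least two atoms")
    squared = [(math.dist(coords_a[i], coords_a[j]) -
                math.dist(coords_b[i], coords_b[j])) ** 2
               for i, j in combinations(range(len(coords_a)), 2)]
    return math.sqrt(sum(squared) / len(squared))

def rank_fragments(anchor_ca, fragments):
    """Sort candidate fragments by how well their flanking CA atoms match.

    anchor_ca: CA coordinates of the defined residues on both sides of the loop.
    fragments: dict mapping a fragment name to the CA coordinates of its own
               flanking residues, in the same order as anchor_ca.
    """
    return sorted((distance_rms_deviation(anchor_ca, ca), name)
                  for name, ca in fragments.items())

anchors = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 1.0, 0.0),
           (12.0, 5.0, 2.0), (14.0, 8.0, 3.0), (15.0, 11.0, 5.0)]
candidates = {
    "good": [(x + 1.0, y - 2.0, z + 3.0) for x, y, z in anchors],  # translated copy
    "bad": [(1.5 * x, y, z) for x, y, z in anchors],               # stretched copy
}
for deviation, name in rank_fragments(anchors, candidates):
    print(name, round(deviation, 3))
```

Because distances cannot distinguish a structure from its mirror image, fragments that pass this screen still need to be superimposed on the anchors and checked for bad contacts.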

Databases of Structures from Homology Modeling
Databases are now available that contain large numbers of protein structures that have been obtained by comparative (homology) modeling. Two of these databases are listed here:

ModBase was created by Sali and co-workers, using their program Modeller, which creates models based on the satisfaction of spatial restraints [16]. That is, restraints are identified from the alignments of homologues of known structure, and these restraints are then applied to the unknown sequence. Restraints can include distances between alpha carbons, other distances within the main-chain, and main-chain and side-chain dihedral angles. Routines to satisfy the restraints optimally include conjugate gradient minimization and molecular dynamics with simulated annealing.
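The idea of scoring a model by how well it satisfies restraints can be sketched with simple distance restraints. This is not Modeller's objective function: it assumes only harmonic penalties on distances between listed atom pairs, with a target distance and a tolerance for each, and the coordinates and restraints are made up.

```python
import math

def restraint_violation(coords, restraints):
    """Sum of squared, tolerance-scaled deviations from target distances.

    coords:     list of (x, y, z) tuples, one per atom.
    restraints: list of (i, j, target_distance, tolerance) tuples, where
                i and j are 0-based indices into coords.
    """
    total = 0.0
    for i, j, target, tolerance in restraints:
        distance = math.dist(coords[i], coords[j])
        total += ((distance - target) / tolerance) ** 2
    return total

model = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 6.0)]  # made-up CA positions
restraints = [(0, 1, 5.0, 0.5),   # satisfied exactly
              (1, 2, 5.0, 0.5)]   # violated by 1.0 distance unit
print(restraint_violation(model, restraints))
```

An optimizer such as conjugate gradient minimization would then adjust the coordinates to reduce this value.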

3DCrunch is a large scale modeling project that aims to submit all entries from protein sequence databases to SWISS-MODEL. Currently the database contains 64,000 entries.

Automated Web-Based Homology Modeling

Web-based tools are now available to generate models of protein 3-dimensional structures using comparative modeling techniques.

  • SWISS-MODEL is available through Glaxo Wellcome Experimental Research in Geneva, Switzerland.
  • WHAT IF, available on EMBL servers, includes three components, one to generate the homology models, one to evaluate the quality of the homology models, and one to evaluate models of proteins for which the structure is already known, thereby providing for evaluation of the quality of the modeling program.

Evaluation and Refinement of the Structure
For a homology model from any source, it is important to demonstrate that the structural features of the model are reasonable in terms of what is known about protein structures in general. That is, basic principles of protein structure and folding have been derived from analyses of known three-dimensional structures, and a model should be consistent with them. Several programs are available to assist in this analysis of correctness of a homology model.

The criteria for analysis of correctness can include:

  • main chain conformations in acceptable regions of the Ramachandran map
  • planar peptide bonds
  • side chain conformations that correspond to those in the rotamer library
  • hydrogen-bonding of polar atoms if they are buried
  • proper environments for hydrophobic and hydrophilic residues
  • no bad atom-atom contacts
  • no holes inside the structure.
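The first criterion requires the main chain torsion angles phi and psi, which can be computed from backbone coordinates. The sketch below implements a standard dihedral angle calculation; the atom ordering used for phi (C of the previous residue, N, CA, C) and psi (N, CA, C, N of the next residue) follows the usual definitions. Deciding whether a (phi, psi) pair lies in an acceptable region requires a reference map and is not attempted here.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180 to 180) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone torsions of one residue from five backbone atom positions."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)

# Made-up points giving a dihedral angle of +90 degrees.
print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```

The same function can be applied to the omega angle (CA, C, N of the next residue, CA of the next residue) to check that peptide bonds are planar.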

Programs that provide structure analysis along with output that is useful for publication include PROCHECK and 3D-Profiler [17,18]. PROCHECK is based on an analysis of (phi,psi) angles, peptide bond planarity, bond lengths, bond angles, hydrogen-bond geometry, and side-chain conformations of known protein structures as a function of atomic resolution. Thus, the expected values of these parameters are known and can be compared to a modeled structure based on the atomic resolution of the structures from which the model was developed.

3D-Profiler compares a homology model to its sequence using a 3D profile. The profile is based on the statistical preferences of each of the 20 amino acids for particular environments within the protein. Each residue position in a 3D model can be characterized by its environment. Preferred environments for amino acids are derived from known three-dimensional structures and are defined by three parameters: (1) the area of each residue that is buried, (2) the fraction of side-chain area that is covered by polar atoms (i.e., O and N), and (3) the local secondary structure. Based on these environment variables, a 3D structure is converted into a 1D profile that describes each residue in the folded protein structure. Examination of these profiles reveals which regions of a sequence appear to be folded correctly and which do not.

Once any irregularities have been resolved, the entire structure may then be subjected to further refinement. This process may consist of energy minimization with restraints, especially for the SCRs. The restraints then may be gradually removed for subsequent minimizations. It also may be advantageous to apply molecular dynamics in conjunction with energy minimization. For any of these refinement procedures, the structure should be solvated, using for example crystallographic waters from the known homologues, a solvent shell, or a periodic box of pre-equilibrated water molecules.

Class-Directed Structure Determination

The need to be able to determine 3-dimensional structures more quickly to keep up with the rapidly increasing amount of sequence information has prompted some investigators to suggest that a new strategy should be applied to structural analysis [19]. That strategy is called class-directed structure determination. The premise behind this strategy is that we should expend our resources for structure determination on those structures that will be most informative rather than those that are simply of interest to a few investigators. It is proposed that the structures that should be determined are those that will assist in developing a more complete classification scheme for known protein structures. A basic concept in this strategy is that there are a finite number of protein folds, and efforts should be directed to identifying all of those folds and determining structures of representative members of each of the fold classes.

Current importance of class-directed structure determination

  • Classification is now possible on a large scale because of a large amount of sequence information and an increasing amount of structural information.
  • Structural information can be obtained more readily if the structure of any member of the class is sufficient.
  • It is necessary to discover all of the types of unique folds so that a complete classification system can be developed.

Steps in class-directed structure determination

  • Identifying the most important targets.
  • Classifying proteins based on the target.
  • Determining 3-dimensional structures of representative proteins.
  • Analyzing and correlating the structural, functional, and sequence information.

Classification schemes

Several methods are being developed for the classification of proteins based on their structural and functional features.

  • SCOP (Structural Classification of Proteins)
  • CATH (Class, Architecture, Topology, and Homology)
  • HOMSTRAD (HOMologous STRucture Alignment Database)

Why define clusters of protein structures and determine structures of select members of each group?

  • Structural information from one member could be applied to another member using homology modeling or threading approaches.
  • If a classification scheme generates clusters of proteins of similar function, then classification will expedite the prediction of function based on sequence and structural information.
  • Classification will facilitate the design of mutagenesis studies among members of clusters.
  • Classification will facilitate the development of methods for predicting protein structure.

On-line Resources
Biology Workbench provides a Web interface to major sequence databases and the tools to search those databases. Included is a Tutorial to assist in learning and using the workbench.

Sequence Analysis Tools includes tools for sequence similarity searches; analysis of primary sequences for physicochemical parameters, signal peptides, sequence repeats, glycosylation sites, etc.; secondary structure prediction; binary and multiple sequence alignments.

Computational Analysis of DNA and Protein Sequences is a chapter from Genome Analysis: A Laboratory Manual. It provides basic Internet information as well as a summary of sequence analysis, multiple alignments, structure prediction and protein modeling, and many other related topics.

A Guide to Sequence Searching provides a description of several biosequence comparison topics, including substitution matrices.

Dali performs multiple structural alignments of proteins. Coordinates of protein structures are compared against those in the Protein Data Bank, and a multiple alignment of structural neighbors is returned.

Sequence Analysis Bibliographic Reference Database provides a comprehensive list of papers dealing with sequence analysis.

SCOP (Structural Classification of Proteins) for classification of proteins.

CATH (Class, Architecture, Topology, and Homology) for classification of proteins.

HOMSTRAD (HOMologous STRucture Alignment Database) for alignment of proteins.

ModBase is a database of protein structures obtained by homology modeling.

3DCrunch is a project to create databases of protein structures by homology modeling.

SWISS-MODEL is a Web-based tool for generating homology models.

WHAT IF is a Web-based tool for generating homology models.

Printed References
[1] Blundell, T.L., Sibanda, B.L., Sternberg, M.J.E., and Thornton, J.M. (1987) Knowledge-Based Prediction of Protein Structures and the Design of Novel Molecules. Nature 326: 347-352.

[2] Fetrow, J.S. and Bryant, S.H. (1993) New Programs for Protein Tertiary Structure Prediction. Bio/Technology 11: 479-484.

[3] Greer, J. (1991) Comparative Modeling of Homologous Proteins. Meth. Enzymol. 202: 239-252.

[4] Johnson, M.S., Srinivasan, N., Sowdhamini, R., and Blundell, T.L. (1994) Knowledge-Based Protein Modeling. Crit. Rev. Biochem. Mol. Biol. 29: 1-68.

[5] Sali, A., Overington, J.P., Johnson, M.S., and Blundell, T.L. (1990) From Comparisons of Protein Sequences and Structures to Protein Modelling and Design. Trends Biochem. Sci. 15: 235-240.

[6] Lewin, R. (1987) When Does Homology Mean Something Else? Science 237: 1570.

[7] Reeck, G.R. et al. (1987) "Homology" in Proteins and Nucleic Acids: A Terminology Muddle and a Way out of It. Cell 50: 667.

[8] Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol. 48: 443-453.

[9] Dayhoff, M.O. and Eck, R.V. (1968) A Model of Evolutionary Change in Proteins. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), vol. 3, pp. 33-41, National Biomedical Research Foundation, Washington, D.C.

[10] Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. (1978) A Model for Evolutionary Change. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), vol. 5, suppl. 3, pp. 345-358, National Biomedical Research Foundation, Washington, D.C.

[11] Dayhoff, M.O., Barker, W.C., and Hunt, L.T. (1983) Establishing Homologies in Protein Sequences. Meth. Enzymol. 91: 524-545.

[12] Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.

[13] Johnson, M.S. and Overington, J.P. (1993) A Structural Basis for Sequence Comparisons - An Evaluation of Scoring Methodologies. J. Mol. Biol. 233: 716-738.

[14] Pearson, W.R. (1995) Comparison of Methods for Searching Protein Sequence Databases. Protein Sci. 4: 1145-1160.

[15] Kabsch, W. and Sander, C. (1983) Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 22: 2577.

[16] Sali, A. and Blundell, T.L. (1993) Comparative Protein Modelling by Satisfaction of Spatial Restraints. J. Mol. Biol. 234: 779-815.

[17] Luthy, R., Bowie, J.U., and Eisenberg, D. (1992) Assessment of Protein Models with Three-Dimensional Profiles. Nature 356: 83-85.

[18] Bowie, J.U., Luthy, R., and Eisenberg, D. (1991) A Method to Identify Protein Sequences That Fold into a Known Three-Dimensional Structure. Science 253: 164-170.

[19] Terwilliger, T.C., Waldo, G., Peat, T.S., Newman, J.M., Chu, K., and Berendzen, J. (1998) Class-directed Structure Determination: Foundation for a Protein Structure Initiative. Protein Sci. 7: 1851-1856.


Copyright © 1997-2003 David R. Bevan
All Rights Reserved
Dept. of Biochemistry
Virginia Tech
Comments to
drbevan@vt.edu
Last Update: 3/14/03