
| Home Page | Topics | Evaluation | Assignments | Resources | News |
Introduction
With the development of techniques in molecular
biology that allow rapid identification, isolation, and sequencing of genes, we
are now able to infer the sequences of many proteins. However, it is still a
time-consuming task to obtain the three-dimensional structures of these
proteins. A major goal of structural biology is to predict the
three-dimensional structure from the sequence, a pursuit that has not yet been
realized. Thus, alternative strategies are being applied to develop models of
protein structure when the constraints from X-ray diffraction or NMR are not
yet available.
One method that can be applied to generate reasonable models of protein
structures is homology modeling. This procedure, also termed comparative
modeling or knowledge-based modeling, develops a three-dimensional model from a
protein sequence based on the structures of homologous proteins. Several
reviews on this topic have appeared [1-5]. In the description that follows,
some aspects of homology modeling that you may find useful in this course and
in your research are discussed.
What is Homology?
Care must be used in applying the term "homology modeling." In fact, as
noted above, some authors prefer
alternative names for the procedure. One must recognize that homology does not
necessarily imply similarity. Homology has a precise definition: having a
common evolutionary origin [6,7].
Thus, homology is a qualitative description of the nature of the relationship
between two or more things, and it cannot be partial. Either there is an
evolutionary relationship or there is not. An assertion of homology usually
must remain a hypothesis. Supporting data for a homologous relationship may
include sequence or three-dimensional similarities, the relationships between
which can be described in quantitative terms.
An observation of importance in homology modeling is that for a set of proteins
that are hypothesized to be homologous, their three-dimensional structures are
conserved to a greater extent than are their primary structures. This
observation has been used to generate models of proteins from homologues with
very low sequence similarities. Thus, in homology modeling, we are attempting
to develop models of an unknown from homologous proteins. These proteins will
have some measure of sequence similarity but we are relying on the conservation
of folds among homologues to guide us as well.
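General Procedures
The steps in creating a homology model are as follows: (1) identify
homologues of known structure, (2) align the sequences, (3) identify
structurally conserved and structurally variable regions, (4) generate
coordinates for the unknown structure, and (5) evaluate and refine the
structure.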
Identifying Homologues
Several computerized search methods are
available to assist in identifying homologues. In most cases of homology modeling,
we have the sequence of a protein for which we want to model the
three-dimensional structure (the unknown). We then apply sequence search
methods to identify proteins with which the unknown has some degree of sequence
similarity and for which the three-dimensional structures are available. We
then assume that these proteins are homologous with our unknown and use the
three-dimensional structures of these proteins to develop a model of the
structure of our unknown. Ideally, one will have several homologues with which
to develop a homology model, but modeling can be done with only one known
structure.
Sequence comparison also is applied when
attempting to identify possible functions of an uncharacterized protein for
which the sequence has been deduced from a DNA sequence. For example, one can
search for motifs that distinguish a protein family, such as residues critical
to binding or catalysis. The PROSITE
database contains many protein patterns that are characteristic of particular
families of proteins.
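As a rough illustration of motif searching, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names and the test sequence are hypothetical, and only the basic pattern elements (x, [..], {..}, repeat counts, and the < and > end anchors) are handled. The pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping occurrences be reported as separate hits.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer(f"(?=({regex}))", sequence)]


# Made-up sequence used only for illustration.
print(find_motifs("N-{P}-[ST]-{P}.", "MKNGSAANPTQNLTV"))
# [(3, 'NGSA'), (12, 'NLTV')]
```

The candidate at position 8 (NPTQ) is rejected because proline follows the asparagine, which is exactly what the {P} element excludes.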
Although less common, some cases do arise in which the three-dimensional
structure of a protein is known and one wants to identify homologues. In these
cases, searches of three-dimensional databases are performed. Because
structural folds are conserved to a greater extent than sequence, one may
identify homologues with very little sequence similarity. An example of a
program that provides this type of database searching is Dali. One submits
coordinates of a protein structure, and the program performs a multiple
structural alignment with proteins that are in the Protein Data Bank.
Aligning Sequences
A critical step in the development of a
homology model is the alignment of the unknown sequence with the homologues.
Many methods are available for sequence alignment, and sometimes the most
perplexing task is deciding which methods to apply. Access to the methods and
databases for sequence alignment has been simplified with the development of
programs such as the Biology Workbench.
The challenge to the researcher is to understand the options that are applied
in alignment so that correct interpretation of results is possible.
Factors to be considered when performing an alignment
are (1) which algorithm to use for sequence alignment, (2) which scoring method
to apply, and (3) whether and how to assign gap penalties.
Algorithms for Alignments
Sequence alignments generally are based on the
dynamic programming algorithm of Needleman and Wunsch [8]. Current methods
include FASTA, Smith-Waterman, and BLASTP. The original version of BLASTP
differed from the first two in not allowing gaps; later versions of BLAST do
produce gapped alignments.
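To make the dynamic programming idea concrete, here is a minimal sketch of a Needleman-Wunsch style global alignment. It assumes a simple match/mismatch score and a linear gap penalty; the function name and the default values are illustrative and are not taken from any of the programs named above, which use substitution matrices and more elaborate gap models.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalized in a global alignment.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of substitution, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
# (1, 'ACGT', 'A-GT')
```

Three matches and one gap give 3 x 1 - 2 = 1. When several alignments share the best score, the traceback order (diagonal first) decides which one is reported.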
Scoring Alignments
Scoring of alignments typically involves
construction of a 20x20 matrix in which identical amino acids and those of
similar character (i.e., conservative substitutions) may be scored higher than
those of different character. Four general types of scoring have been applied
to alignments:
Identity: considers only identical residues
Genetic Code: considers the number of base changes in DNA or RNA to interconvert the codons for the amino acids
Chemical Similarity: considers the physico-chemical properties (e.g., polarity, size, charge), with greater weight given to alignment of similar properties
Observed Substitutions: considers substitution frequencies observed in alignments of sequences
The substitution schemes are generally considered to be the best methods for
scoring alignments. These methods are based on an analysis of the frequency
with which a given amino acid is observed to be replaced by other amino acids
among proteins for which the sequences can be aligned.
PAM Matrices
One of the first substitution scoring schemes to be developed was the Dayhoff
mutation data matrix. Dayhoff and co-workers [9-11] developed this method
during analysis of the evolution of proteins. The mutation probability matrix
that they derived gives the probability of one amino acid mutating to a second
amino acid within a particular evolutionary time. The scoring schemes are
denoted PAM (Point Accepted Mutation; one PAM unit corresponds to one accepted
point mutation per 100 residues) followed by a number.
For example, if alignments were scored using PAM40 and PAM250, the lower PAM
matrix would recognize short alignments of highly similar sequences and the
higher PAM matrix would find longer, weaker local alignments. At a distance of
250 PAMs, only about 20% of the amino acids are expected to remain unchanged,
which is close to the limit at which sequences can still be recognized as
related.
BLOSUM Matrices
The substitution matrices derived by Dayhoff and co-workers were based on
substitution frequencies from global alignments of very similar sequences.
Henikoff and Henikoff [12] extended this approach by developing substitution
matrices using local multiple alignments of more distantly related sequences. A
database was assembled that contained multiple alignments (without gaps) of
short regions of related sequences. These sequences were clustered into groups
(blocks) based on their similarity at some threshold value of percentage
identity. Blocks substitution matrices (BLOSUM) were derived based on
substitution frequencies for all pairs of amino acids within a group. The
different BLOSUM matrices were obtained by varying the threshold. For example,
a BLOSUM80 matrix is derived using a threshold of 80% identity.
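The log-odds calculation behind this kind of matrix can be sketched on a toy block. The code below counts residue pairs within each column of one ungapped block, converts the counts to observed and expected pair frequencies, and reports scores in half-bit units. It is a simplified illustration only: the clustering of sequences at an identity threshold is omitted, the block is invented, and the function name is hypothetical. Pairs that never occur in the block receive no score.

```python
from collections import Counter
from itertools import combinations
from math import log2


def block_log_odds(block):
    """Half-bit log-odds scores from the columns of one ungapped block."""
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        if x == y:
            expected = background[x] ** 2
        else:
            expected = 2 * background[x] * background[y]
        scores[(x, y)] = round(2 * log2(freq / expected))
    return scores


# Invented four-sequence block, three columns wide.
print(block_log_odds(["AAS", "AAS", "ASS", "SAS"]))
# {('A', 'A'): 1, ('A', 'S'): -1, ('S', 'S'): 1}
```

In this toy block identities are seen more often than chance predicts and A/S replacements less often, so the identities score positive and the substitution scores negative.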
Of current interest is the development of
scoring matrices based on alignments derived from three-dimensional structures.
One example is that of Johnson and Overington (JO matrices) [13]. These
investigators aligned the three-dimensional structures in 65 homologous sets of
proteins. From these structures, 207,795 amino acid replacements were
tabulated. The proteins in each homologous set had 15-40% sequence identity, so
this substitution matrix should provide a sensitive basis for scoring sequence
alignments. They demonstrated that their substitution matrix performed well
relative to other matrices.
Choosing a Scoring Matrix
It is not possible to choose one best scoring system for all alignment problems
you might undertake. As noted above, Johnson and Overington compared results
obtained with 12 different scoring matrices, and Pearson [14] also recently
published a comparison of several scoring methods. In general, different
scoring matrices may perform better than others depending on the problem being
studied and the conditions used for alignments. In any case, you will need to
select the alignment algorithm, scoring matrix, and gap penalty when doing your
alignments. You also will need to decide if you want to do local and/or global
alignments. One advantage of local alignments is that they do not make the
assumption that the unknown protein and the database sequence are of similar
length.
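The point about sequence length can be illustrated with a minimal Smith-Waterman style local score. This is a sketch under simple assumptions (match/mismatch scoring, a linear gap penalty, hypothetical function name and default values); it returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at zero lets an alignment start anywhere in either sequence.
            cell = max(0,
                       previous[j - 1] + s,
                       previous[j] + gap,
                       current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# A short segment embedded in a much longer, otherwise unrelated sequence.
print(smith_waterman_score("ACDEF", "WWWWWACDEFWWWWWWWW"))
# 10
```

The five matching residues score 5 x 2 = 10 regardless of the unmatched flanking residues, whereas a global alignment of the same pair would be charged gap penalties for every unmatched position.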
Evaluating the Alignment
A final aspect of sequence alignment that should be considered is evaluation of
the accuracy of the alignment. The best way to assess the accuracy is to
compare alignments from sequence comparisons with alignments from protein
three-dimensional structures. Of course this assessment is possible only if you
are working with a family of proteins for which three-dimensional structures
are known for at least two members of the family. The alignment obtained by
including tertiary structural features provides a set of test alignments
against which sequence-only alignments can be compared. Similar conditions can
then be applied to sequences from the family for which three-dimensional
structures are not available.
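One simple way to express such a comparison numerically is the fraction of residue pairs in the structure-based alignment that the sequence-only alignment reproduces. The sketch below assumes both alignments are given as pairs of gapped strings with "-" as the gap character; the function names and the example alignments are hypothetical.

```python
def aligned_pairs(row_a, row_b):
    """Set of (index in a, index in b) for columns where both rows have a residue."""
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def alignment_agreement(test, reference):
    """Fraction of reference residue pairs that also occur in the test alignment."""
    reference_pairs = aligned_pairs(*reference)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(*test) & reference_pairs) / len(reference_pairs)


structure_based = ("ACDEFG", "AC-EFG")   # invented reference alignment
sequence_only = ("ACDEFG", "ACE-FG")     # invented test alignment
print(alignment_agreement(sequence_only, structure_based))
# 0.8
```

Here the sequence-only alignment places the gap one column too late, so one of the five reference pairs is lost.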
In fact, this approach to evaluation of
alignments can be applied during the alignment process. For example, Greer [3] advocates
alignment based on superimposition of the three-dimensional structures of the
homologues. To extend this approach to the unknown structure, one must use
protein structure prediction algorithms to identify possible secondary
structural elements.
Identification of Structurally Conserved and Structurally Variable Regions
After the known structures are aligned,
they are examined to identify the structurally conserved regions (SCRs) from
which an average structure, or framework, can be constructed for these regions
of the proteins. Variable regions (VRs), in which each of the known structures
may differ in conformation, also must be identified because special techniques
must be applied to model these regions of the unknown protein.
When only one known structure is available
for homology modeling, it is more difficult to identify the SCRs. Based on
analyses of other homologues for which multiple structures are available, we
know that the SCRs generally correspond to the elements of secondary structure,
such as alpha-helices and beta-sheets, and to ligand- and substrate-binding
sites. Thus, these regions are used as the SCRs in the cases where only one
structure is available. The VRs usually lie on the surface of the proteins and
form the loops where the main chain turns.
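When two or more known structures have already been superimposed, one simple way to propose SCRs is to look for runs of aligned positions whose alpha carbons lie close together. The sketch below is an assumption-laden illustration, not the procedure of any particular program: the 1.0 Angstrom cutoff, the minimum run length of three residues, and the function name are all hypothetical choices.

```python
from math import dist


def conserved_regions(coords_a, coords_b, cutoff=1.0, min_length=3):
    """Runs of aligned positions whose CA atoms lie within cutoff of each other.

    coords_a and coords_b hold (x, y, z) CA coordinates for equivalent
    positions of two structures that are already superimposed.
    Returns a list of (first_index, last_index) tuples, 0-based and inclusive.
    """
    n = min(len(coords_a), len(coords_b))
    regions = []
    start = None
    for index in range(n):
        close = dist(coords_a[index], coords_b[index]) <= cutoff
        if close and start is None:
            start = index                       # a new run begins
        elif not close and start is not None:
            if index - start >= min_length:     # keep only sufficiently long runs
                regions.append((start, index - 1))
            start = None
    if start is not None and n - start >= min_length:
        regions.append((start, n - 1))
    return regions


# Invented coordinates: positions 3 and 4 diverge by 3 Angstroms.
structure_a = [(3.8 * i, 0.0, 0.0) for i in range(8)]
structure_b = [(3.8 * i, 3.0 if i in (3, 4) else 0.2, 0.0) for i in range(8)]
print(conserved_regions(structure_a, structure_b))
# [(0, 2), (5, 7)]
```

Positions that fall outside the reported runs would be treated as candidate variable regions.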
Once the known structures are aligned and the
SCRs have been identified, one aligns the unknown sequence. Alignment based solely
on sequence may be used, though other structural features also may be taken
into account. In Quanta, multiple sequence alignment algorithms are available
that may be used both when aligning sequences of the known structures and when
aligning the sequence of the unknown with the known structures. Four scoring
systems are available, each of which may be evaluated during an alignment so
that relative statistical weights may be assigned. The four scoring methods
are:
Generating Coordinates for the Unknown Structure
When generating coordinates for the unknown
structure, one needs to model main chain atoms and side chain atoms, both in
SCRs and VRs. For the SCRs, it is straightforward to generate the coordinates
of the main chain atoms of the unknown structure from those of the known
structure(s). Side chain coordinates are copied if the residue type in the
unknown is identical or very similar to that in the known homologues. For other
side chain coordinates one can apply a side chain rotamer library in a
systematic approach to explore possible side chain conformations. It may be
desirable to weight the contribution of each homologue in each SCR based on the
extent of similarity with the unknown. In the event that some coordinates in
the unknown are undefined in the SCRs, regularization can be used to build and
relax both main chain and side chain atoms in those regions. Note that this
procedure should be used only if the region of undefined atoms is one or two
residues in length.
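The weighting of homologues mentioned above can be sketched as a weighted average of equivalent atom positions within one SCR. The example assumes that the known structures are already superimposed and that the weights are some measure of similarity to the unknown (for instance, percent identity in that region). The function name and the numbers are hypothetical.

```python
def weighted_framework(structures, weights):
    """Weighted average of equivalent atom positions from superimposed homologues.

    structures: one list of (x, y, z) tuples per homologue, all of equal length.
    weights:    one non-negative weight per homologue.
    """
    total = sum(weights)
    framework = []
    for atom in range(len(structures[0])):
        framework.append(tuple(
            sum(w * s[atom][axis] for s, w in zip(structures, weights)) / total
            for axis in range(3)
        ))
    return framework


# Two invented homologues, weighted 3:1 in favor of the first.
homologue_1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
homologue_2 = [(1.0, 1.0, 0.0), (4.0, 0.0, 0.0)]
print(weighted_framework([homologue_1, homologue_2], [3, 1]))
# [(0.25, 0.25, 0.0), (2.5, 0.0, 0.0)]
```

Averaged coordinates generally do not have ideal bond lengths and angles, which is one reason the model is subsequently regularized and refined.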
For the VRs, a variety of approaches may be applied in assigning coordinates to
the unknown. Recall that these regions will correspond most often to the loops
on the surface of the protein. If a loop in one of the known structures is a
good model for that of the unknown, then the main chain coordinates of that
known structure can be copied. Side chain coordinates of residues that are
similar in length and character also may be copied. Rotamer libraries can be
used to define other side chain coordinates.
When a good model for a loop cannot be found among the known structures, one
can search fragment databases for loops in other proteins that may provide a
suitable model for the unknown. A residue range is chosen to include the
undefined loop as well as a few residues (e.g., three) on either side of the
loop for which coordinates have been defined. Fragments are examined for their
ability to fit in the undefined region without making bad contacts with other
atoms and to overlap well with the residues on either side of the loop. The
loop may then be subjected to conformational searching to identify low energy
conformers if desired. Coordinates for side chain atoms in these loop regions
may be copied if residues are similar, though it is likely that considerable
application of side chain rotamer libraries will be required to define
coordinates in these regions.
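A crude first filter for a fragment search can be sketched as follows: keep only those fragments whose end-to-end alpha carbon distance is close to the distance between the two anchor residues that flank the undefined loop. This is a simplification introduced here for illustration. A real search would also superimpose the flanking residues and check for bad contacts, as described above. The function name, the fragment library, and the 1.0 Angstrom tolerance are hypothetical.

```python
from math import dist


def loop_candidates(anchor_start, anchor_end, fragments, tolerance=1.0):
    """Names of fragments whose end-to-end CA distance matches the gap to be closed.

    anchor_start, anchor_end: (x, y, z) of the CA atoms flanking the loop.
    fragments: dict mapping a fragment name to its list of CA coordinates.
    Candidates are returned best fit first.
    """
    gap = dist(anchor_start, anchor_end)
    hits = []
    for name, ca_trace in fragments.items():
        span = dist(ca_trace[0], ca_trace[-1])
        if abs(span - gap) <= tolerance:
            hits.append((abs(span - gap), name))
    return [name for _, name in sorted(hits)]


# Invented fragment library.
library = {
    "f1": [(0, 0, 0), (3, 2, 0), (6, 2, 0), (9.5, 0, 0)],
    "f2": [(0, 0, 0), (3, 0, 0), (6, 0, 0)],
    "f3": [(1, 1, 1), (4, 3, 1), (8, 3, 1), (11, 1, 1)],
}
print(loop_candidates((0, 0, 0), (10, 0, 0), library))
# ['f3', 'f1']
```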
Databases of Structures from Homology Modeling
Databases are now available that contain large numbers
of protein structures that have been obtained by comparative (homology)
modeling. Two of these databases are listed here:
ModBase was created by Sali and co-workers,
using their program Modeller, which
creates models based on the satisfaction of spatial restraints [16]. That is,
restraints are identified from the alignments of homologues of known structure,
and these restraints are then applied to the unknown sequence. Restraints can
include distances between alpha carbons, other distances within the main-chain,
and main-chain and side-chain dihedral angles. Routines to satisfy the
restraints optimally include conjugate gradient minimization and molecular
dynamics with simulated annealing.
3DCrunch is a large scale modeling project
that aims to submit all entries from protein sequence databases to SWISS-MODEL.
Currently the database contains 64,000 entries.
Web-based tools are now available to generate
models of protein 3-dimensional structures using comparative modeling
techniques.
Evaluation and Refinement of the Structure
For a homology model from any source, it is
important to demonstrate that the structural features of the model are
reasonable in terms of what is known about protein structures in general. That
is, researchers have analyzed three-dimensional structures of proteins from
which basic principles of protein structure and folding have been developed.
Several programs are available to assist in this analysis of correctness of a
homology model.
The criteria for analysis of correctness can
include:
Programs that provide structure analysis
along with output that is useful for publication include PROCHECK
and 3D-Profiler [17,18]. PROCHECK is based on an analysis of (phi,psi) angles,
peptide bond planarity, bond lengths, bond angles, hydrogen-bond geometry, and
side-chain conformations of known protein structures as a function of atomic
resolution. Thus, the expected values of these parameters are known and can be
compared to a modeled structure based on the atomic resolution of the
structures from which the model was developed. 3D-Profiler compares a homology
model to its sequence using a 3D profile. The profile is based on the
statistical preferences of each of the 20 amino acids for particular
environments within the protein. Each residue position in a 3D model can be
characterized by its environment. Preferred environments for amino acids are
derived from known three-dimensional structures and are defined by three parameters:
(1) the area of each residue that is buried, (2) the fraction of side-chain
area that is covered by polar atoms (i.e., O and N), and (3) the local secondary structure. Based on these
environment variables, a 3D structure is converted into a 1D profile that
describes each residue in the folded protein structure. Examination of these
profiles reveals which regions of a sequence appear to be folded correctly and
which do not.
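The (phi,psi) angles that underlie this kind of stereochemical check are ordinary dihedral angles, so they are easy to compute from a model's backbone coordinates. The sketch below is a generic dihedral calculation, not code from PROCHECK; the function names are hypothetical. Phi is the dihedral C(i-1)-N(i)-CA(i)-C(i) and psi is N(i)-CA(i)-C(i)-N(i+1).

```python
from math import atan2, degrees, sqrt


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return degrees(atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone (phi, psi) of one residue from five atom positions."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)


# Invented points giving a right-angle twist about the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
# 90.0
```

Angles computed this way for every residue of a model can then be compared with the distributions observed in known structures.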
Once any irregularities have been resolved,
the entire structure may then be subjected to further refinement. This process
may consist of energy minimization with restraints, especially for the SCRs.
The restraints then may be gradually removed for subsequent minimizations. It
also may be advantageous to apply molecular dynamics in conjunction with energy
minimization. For any of these refinement procedures, the structure should be
solvated, using for example crystallographic waters from the known homologues,
a solvent shell, or a periodic box of pre-equilibrated water molecules.
The need to be able to determine
3-dimensional structures more quickly to keep up with the rapidly increasing
amount of sequence information has prompted some investigators to suggest that
a new strategy should be applied to structural analysis [19]. That strategy is
called class-directed structure determination. The premise behind this strategy
is that we should expend our resources for structure determination on those
structures that will be most informative rather than those that are simply of
interest to a few investigators. It is proposed that the structures that should
be determined are those that will assist in developing a more complete
classification scheme for known protein structures. A basic concept in this strategy
is that there are a finite number of protein folds, and efforts should be
directed to identifying all of those folds and determining structures of
representative members of each of the fold classes.
Several methods are being developed for the
classification of proteins based on their structural and functional features.
On-line Resources
Biology Workbench provides a Web interface to major sequence databases and the tools to search those databases. Included is a Tutorial to assist in learning and using the workbench.
Sequence Analysis Tools includes tools for sequence similarity searches; analysis of primary sequences for physicochemical parameters, signal peptides, sequence repeats, glycosylation sites, etc.; secondary structure prediction; binary and multiple sequence alignments.
Computational Analysis of DNA and Protein Sequences is a chapter from Genome Analysis: A Laboratory Manual. It provides basic Internet information as well as a summary of sequence analysis, multiple alignments, structure prediction and protein modeling, and many other related topics.
A Guide to Sequence Searching provides a description of several biosequence comparison topics, including substitution matrices.
Dali performs multiple structural alignments of proteins. Coordinates of protein structures are compared against those in the Protein Data Bank, and a multiple alignment of structural neighbors is returned.
Sequence Analysis Bibliographic Reference Database provides a comprehensive list of papers dealing with sequence analysis.
SCOP (Structural Classification of Proteins) for classification of proteins.
CATH (Class, Architecture, Topology, and Homology) for classification of proteins.
HOMSTRAD (HOMologous STRucture Alignment Database) for alignment of proteins.
ModBase is a database of protein structures obtained by homology modeling.
3DCrunch is a project to create databases of protein structures by homology modeling.
SWISS-MODEL is a Web-based tool for generating homology models.
WHAT IF is a Web-based tool for generating homology models.
Printed References
[1] Blundell, T.L., Sibanda, B.L., Sternberg,
M.J.E., and Thornton, J.M. (1987) Knowledge-Based Prediction of Protein
Structures and the Design of Novel Molecules. Nature 326: 347-352.
[2] Fetrow, J.S. and Bryant, S.H. (1993) New Programs for Protein Tertiary
Structure Prediction. Bio/Technology 11: 479-484.
[3] Greer, J. (1991) Comparative Modeling of Homologous Proteins. Meth.
Enzymol. 202: 239-252.
[4] Johnson, M.S., Srinivasan, N., Sowdhamini, R., and Blundell, T.L. (1994)
Knowledge-Based Protein Modeling. Crit. Rev. Biochem. Mol. Biol. 29: 1-68.
[5] Sali, A., Overington, J.P., Johnson, M.S., and Blundell, T.L. (1990) From
Comparisons of Protein Sequences and Structures to Protein Modelling and
Design. Trends Biochem. Sci. 15: 235-240.
[6] Lewin, R. (1987) When Does Homology Mean Something Else? Science 237: 1570.
[7] Reeck, G.R. et al. (1987) "Homology" in Proteins and Nucleic
Acids: A Terminology Muddle and a Way out of It. Cell 50: 667.
[8] Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the
Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol.
Biol. 48: 442-453.
[9] Dayhoff, M.O. and Eck, R.V. (1968) A Model of Evolutionary Change in
Proteins. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), vol. 3, pp. 33-41, National
Biomedical Research Foundation, Washington, D.C.
[10] Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. (1978) A Model for
Evolutionary Change. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), vol. 5, suppl. 3, pp. 345-358,
National Biomedical Research Foundation, Washington, D.C.
[11] Dayhoff, M.O., Barker, W.C., and Hunt, L.T. (1983) Establishing Homologies
in Protein Sequences. Meth. Enzymol. 91: 524-545.
[12] Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices
from Protein Blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.
[13] Johnson, M.S. and Overington, J.P. (1993) A Structural Basis for Sequence
Comparisons - An Evaluation of Scoring Methodologies. J. Mol. Biol. 233:
716-738.
[14] Pearson, W.R. (1995) Comparison of Methods for Searching Protein Sequence
Databases. Protein Sci. 4: 1145-1160.
[15] Kabsch, W. and Sander, C. (1983) Dictionary of Protein Secondary
Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features.
Biopolymers 22: 2577.
[16] Sali, A. and Blundell, T.L. (1993) Comparative Protein Modelling by
Satisfaction of Spatial Restraints. J. Mol. Biol. 234: 779-815.
[17] Luthy, R., Bowie, J.U., and Eisenberg, D. (1992) Assessment of Protein
Models with Three-Dimensional Profiles. Nature 356: 83-85.
[18] Bowie, J.U., Luthy, R., and Eisenberg, D. (1991) A Method to Identify
Protein Sequences That Fold into a Known Three-Dimensional Structure. Science
253: 164-170.
[19] Terwilliger, T.C., Waldo, G., Peat, T.S.,
Newman, J.M., Chu, K., and Berendzen, J. (1998) Class-directed Structure
Determination: Foundation for a Protein Structure Initiative. Protein Sci. 7:
1851-1856.
Copyright © 1997-2003 David R. Bevan
All Rights Reserved
Dept. of Biochemistry
Virginia Tech
Comments to drbevan@vt.edu
Last Update: 3/14/03