Yale University Lecture Notes



Bioinformatics describes the large-scale computational analysis of gene sequences, protein structures, and expression datasets. Specific topics include sequence alignment, biological database design, geometric analysis of protein structure, and macromolecular simulation.










Basic Alignment via Dynamic Programming

Suboptimal Alignment

Gap Penalties

Similarity (PAM) Matrices

Multiple Alignment

Profiles, Motifs, HMMs

Local Alignment

Probabilistic Scoring Schemes

Rapid Similarity Search: FASTA

Rapid Similarity Search: BLAST

Practical Suggestions on Sequence Searching

Transmembrane Helix Prediction

Secondary Structure Prediction: Basic GOR

Secondary Structure Prediction: Other Methods

Assessing Secondary Structure Prediction

Features of Genomic DNA Sequence

Sequence Alignment Required Reading

[1] Chapter 3 from Gribskov, M. and Devereux, J. (1992). Sequence Analysis Primer. New York, Oxford University Press.
(Focus on the dynamic programming section of this chapter.)

[2] Needleman, S. B. and Wunsch, C. D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J. Mol. Biol. 48: 443-453.
(The original paper. Still pretty easy to read. Will be used in class.)

[3] Smith, T. F. and Waterman, M. S. (1981). "Identification of common molecular subsequences." J. Mol. Biol. 147: 195-197
(The original paper on local alignment. Not quite as easy to read, but introduces this important concept.)

[4] Altschul et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25(17): 3389-3402.

Scoring Required Reading

[5] Altschul, S. F., Boguski, M. S., Gish, W. and Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genetics. 6(2): 119-29.
(Most important. A short overall review.)

[6] M Levitt & M Gerstein (1998). A Unified Statistical Framework for Sequence Comparison and Structure Comparison. Proceedings of the National Academy of Sciences USA 95: 5913-5920
(Understand the concept of P-value and the framework for deriving scoring statistics.)

[7] Pearson, W. R. (1996). Effective Protein Sequence Comparison. Meth. Enz. 266: 227-259.
(Understand how the FASTA e-value is derived.)

Multiple Alignment Required Reading

[8] Eddy, S. R. (1996). "Hidden Markov models," Curr. Opin. Struc. Biol. 6, 361-365.

[9] Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). "Using CLUSTAL for multiple sequence alignments," Methods Enzymol 266, 383-402.

Secondary Structure Prediction Required Reading

[10] Garnier, J., Gibrat, J. F. & Robson, B. (1996). "GOR method for predicting protein secondary structure from amino acid sequence," Methods Enzymol 266, 540-53.

[11] King, R. D. & Sternberg, M. J. E. (1996). "Identification and application of the concepts important for accurate and reliable protein secondary structure prediction," Prot. Sci. 5, 2298-2310




Structuring Information in Tables
Keys and Joins
Complex RDB encoding
Indexes and Optimization
Forms and Reports
Clustering & Trees
Function Classification and Orthologs
The Genomic vs. Single-molecule Perspective
Folds in Genomes, shared & common folds
Genome Trees
Bulk Structure Prediction
Extent of Fold Assignment: the Bias Problem
Correcting for Biases with Sampling
Cross-tabulation, folds and functions
Analysis of Expression Data
Analysis of Other Whole Genome Datasets

Databases Required Reading

[12] M Gerstein & W Krebs (1998). "A Database of Macromolecular Movements," Nuc. Acid. Res. 26:4280-4290.

[13] Korth & Silberschatz, Database System Concepts. (CS book on databases. Read pages 1 to 65 [sections 1.0 to mid-3.2] and pages 97 to 108 [part of section 4.1]. Some of the information on SQL is available from the on-line link below.)

Genome Surveys Required Reading

[14] Fred Tekaia, Antonio Lazcano & Bernard Dujon (1999). "The Genomic Tree as Revealed from Whole Proteome Comparisons," Genome Res. 9:550-557

[15] H Hegyi & M Gerstein (1999). "The Relationship between Protein Structure and Function: a Comprehensive Survey with Application to the Yeast Genome," J Mol. Biol. 288: 147-164.

[16] M Gerstein & H Hegyi (1998). "Comparing Microbial Genomes in terms of Protein Structure: Surveys of a Finite Parts List," FEMS Microbiology Reviews 22: 277-304.



What Structures Look Like
RMS Superposition
Structural Alignment by Iterated Dynamic Programming
Scoring Structural Similarity
Fold Library
Relation of Sequence Similarity to Structural and Functional Similarity
Protein Geometry
Calculation of Surface Area
Calculation of Volume
Standard Volumes and Radii

Structure Alignment Required Reading

[17] Holm, L. and Sander, C. (1993). Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol. 233: 123-138.
(A different method of structural alignment, which differs more from sequence alignment.)

[18] M Gerstein & M Levitt (1998). "Comprehensive Assessment of Automatic Structural Alignment against a Manual Standard, the Scop Classification of Proteins," Protein Science 7: 445-456.
(Understand the method, not results, in this paper OR in Gerstein & Levitt (1996), below)

Geometry Required Reading

[19] J Tsai, R Taylor, C Chothia & M Gerstein (1999). "The Packing Density in Proteins: Standard Radii and Volumes," J. Mol. Biol. 290: 253-266.

[20] M Gerstein & F M Richards (2000). "Protein Geometry: Volumes, Areas, and Distances," chapter 22 of volume F of the International Tables for Crystallography ("Molecular Geometry and Features" in "Macromolecular Crystallography")



Basic Forces: Electrostatics
VDW Forces
Bonds as Springs
Energy Minimization
Monte Carlo
Molecular Dynamics
Energy and Entropy
Parameter Sets
Number Density
Poisson-Boltzmann Equation
Lattice Models and Simplification

Simulation Required Reading

[21] M Gerstein & M Levitt (1998). "Simulating Water and the Molecules of Life," Scientific American 279: 100-105.

[22] McCammon, J. A. & Harvey, S. C. (1987). Dynamics of Proteins and Nucleic Acids. Cambridge UP.

[23] Honig, B. & Nicholls, A. (1995). Classical electrostatics in biology and chemistry. Science 268, 1144-9.