Computational methods for biological sequence analysis are playing an increasingly important role in biology and medicine. The key problem addressed by these methods is discovering the function of a protein or gene. It is well known that the function of a protein is dictated by its amino acid sequence, since the sequence determines the structure of the protein and thus its interaction with the environment.
Proteins are the building blocks of life, supporting a variety of functions which are essential for cell life. These include protection from infections or cancers, gene regulation, survival in different conditions, growth, differentiation, regeneration and others. In fact, the function of every cell in a living organism (whether microbial or human) is determined by which proteins (genes) are expressed in the cell and how they interact in the particular cell environment.
The area of protein function is particularly timely because the new technology of high-throughput genomics generates thousands of hypothetical genes that have not been assigned a putative function. There are numerous commercial applications. Classifying new genes into categories opens many opportunities for new medical treatments. Genes are often used as drugs directly (e.g., insulin), or drug targets (e.g., attacking a particular gene in a microbial organism). Other applications include the design of pesticides, design of new crops, gene therapies and rational drug design.
Proteins are macromolecules found in living organisms that play many roles essential to sustaining life (e.g., forming the physical framework of the organism and acting as enzymes to promote chemical reactions). A protein is composed of a sequence of amino acids, typically several hundred. Proteins are created in living cells by translating the coding regions (genes) of the DNA sequence. Different proteins are expressed in different cells, and the level of expression of the different proteins determines the cell function. Because proteins are long, linear, complex molecules, they “fold” into a 3D shape. Biologists have identified four levels of structure which can influence the protein's function:
1. Primary structure: the sequence of amino acids.
2. Secondary structure: the presence or absence of small “sub-folds”. These are regular patterns formed by local folding of the protein (e.g., helices and sheets).
3. Tertiary structure: the final 3D shape.
4. Quaternary structure: complexes formed with other proteins.
Given one level of structure, it is not necessarily a trivial task to predict the next level. Hence, function prediction from the primary structure alone is difficult. Therefore, techniques other than sequencing are needed to determine the 3D structure and ultimately the protein function.
The traditional and still most reliable way to determine protein structure is to use laboratory-based techniques such as X-ray crystallography. However, recent years have seen the development of software-based solutions. One such technique is to use dynamic programming-based alignment tools such as “BLAST” to match the new sequence to previously labeled protein sequences (Altschul et al., 1990, Basic Local Alignment Search Tool, JMB 215:403–410). Alternatively, statistical techniques such as Hidden Markov Models (HMMs) can be used to build a model for each labeled class (E. Sonnhammer, S. Eddy and R. Durbin, “Pfam: A Comprehensive Database of Protein Families Based on Seed Alignments,” Proteins, 1997, pages 405–420; A. Krogh, M. Brown, I. Mian, K. Sjolander and D. Haussler, “Hidden Markov Models in Computational Biology: Applications to Protein Modeling,” J. of Molecular Biology, 1994, Volume 235, 1501–1531). Still another alternative is to learn the boundaries between protein classes rather than a model for the class itself (Jaakkola, Diekhans, Haussler, “Using the Fisher kernel method to detect remote protein homologies,” in Proceedings of ISMB '99). The first two approaches use the protein sequence itself directly to perform classification. The last one uses an HMM to compute the gradient of the log-likelihood of the protein being produced by the HMM with respect to each of the parameters of the HMM. In summary, none of these methods uses the sensitivity of parts of the protein to motifs to build a feature vector.
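As an illustration of the alignment-based approach described above, the following is a minimal sketch of local sequence alignment by dynamic programming (the Smith-Waterman recurrence), followed by labeling a new sequence with the class of its best-scoring labeled match. It is not the BLAST algorithm itself, which adds heuristics for speed. The function names, the simple match/mismatch/gap scores, and the toy sequences and labels are assumptions chosen for illustration only; practical tools use amino acid substitution matrices and more elaborate gap penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Smith-Waterman recurrence with a linear gap penalty. The scoring
    values are illustrative assumptions, not a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def classify_by_best_match(query, labeled):
    """Return the label of the labeled sequence that aligns best to query.

    `labeled` is a list of (sequence, label) pairs (hypothetical data).
    """
    best_seq, best_label = max(
        labeled, key=lambda item: local_alignment_score(query, item[0])
    )
    return best_label


if __name__ == "__main__":
    # Toy sequences and class names, invented for this example.
    labeled = [("MKTAYIAKQR", "class_A"), ("GGGGSSSS", "class_B")]
    print(classify_by_best_match("KTAYIAK", labeled))
```

Each query-to-reference comparison fills a table whose size is the product of the two sequence lengths, so the cost of this kind of classification grows with both the length of the sequences and the number of labeled sequences searched.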
Lab-based techniques, such as X-ray crystallography, are expensive and time-consuming. In addition, X-ray crystallography requires relatively large amounts of the protein. It cannot work with just a primary description of the protein (i.e., the sequence of amino acids in a file). Finally, certain proteins (e.g., membrane-spanning proteins) cannot be crystallized at all.
BLAST and other dynamic programming methods are more time-consuming and less accurate than statistical techniques.