1. Field of the Invention
The present invention relates to a method of classifying biological elements into functional families and identifying biologically active regions of the biological element.
2. Description of the Prior Art
Genomes carry all information of life from one generation to the next for every organism on earth. Each genome, which is a collection of DNA molecules, can be represented as a series of strings composed of four letter symbols. Today, the genomes of a worm known as C. elegans, the fruit fly, the human and a weed known as Arabidopsis, as well as several dozen microbial genomes, are available. Most of these data are accessible free of charge, encouraging their exploration. However, it is not the genes, but the proteins they encode, that actually perform the functions of living cells. A search for protein function requires that each protein and its structure be identified and characterized, and that every protein-protein interaction be characterized.
Classification of Proteins
Proteins are molecules constructed from linear sequences of smaller molecules called amino acids. There are twenty naturally occurring amino acids and they can be represented in a protein sequence as a string of alphabetic symbols. Protein molecules fold to form specific three dimensional shapes which specify their particular chemical function.
Analysis of protein sequences can provide insights into function and can also lead to knowledge regarding biologically active sites of the protein. While analysis of protein sequences is often performed directly on the symbolic representation of the amino acid sequence, patterns in the sequence are often too weak to be detected as patterns of symbols.
Alternative sequence analysis techniques can be performed by assigning numerical values to the amino acids in a protein. The numerical values are derived from the physico-chemical properties of the amino acid such as hydrophobicity, bulkiness, or electron-ion interaction potential (EIIP) and are relevant to structural folding or biological activity.
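The symbolic-to-numeric conversion described above can be sketched as follows. This is a minimal illustration only: it uses the Kyte-Doolittle hydropathy index as a stand-in hydrophobicity scale (an assumption made here; an EIIP table or any other per-residue property table would be substituted in the same way), and the function name `encode` and the peptide `MKVLA` are hypothetical.

```python
# Kyte-Doolittle hydropathy index, used here only as a stand-in
# per-residue property scale (assumption; not the EIIP index).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Map a one-letter amino acid string to a list of numeric values."""
    try:
        return [scale[residue] for residue in sequence.upper()]
    except KeyError as err:
        # Refuse ambiguous or non-standard residue codes instead of guessing.
        raise ValueError(f"no value for residue {err.args[0]!r}") from None


toy_peptide = "MKVLA"  # hypothetical sequence, for illustration only
signal = encode(toy_peptide)
print(signal)
```

The resulting list is the numerical sequence on which any subsequent signal processing step would operate.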
It has been recognized that proteins of a given family have a common characteristic frequency component related to their function which may be used to classify proteins into functional families.
Frequency Analysis Methods
The Resonant Recognition Model is an attempt to use frequency analysis to determine the characteristic frequency components of a family of proteins.
The Resonant Recognition Model, or RRM, is described by I. Cosic in "Macromolecular bioactivity: Is it resonant interaction between macromolecules? - theory and applications," IEEE Transactions on Biomedical Engineering, vol. 41, December 1994. The RRM is a physico-mathematical model that analyzes the interaction of a protein and its target using digital signal processing methods. One application of this model involves prediction of a protein's biological function. In this technique a Fourier transform is applied to a numerical representation of a protein sequence and a peak frequency is determined for a particular protein's function. The aim of this method is to determine a single parameter that correlates with a biological function of genetic sequences. To determine such a parameter it is necessary to find common characteristics of sequences with the same biological function. The cross-spectral function determines common frequency components of two signals. For a discrete series, the cross-spectral function is defined as:
$S_n = X_n Y_n^*, \quad n = 1, 2, \ldots, N/2$
where $X_n$ are the Discrete Fourier Transform (DFT) coefficients of the series $X(n)$ and $Y_n^*$ are the complex conjugates of the DFT coefficients of the series $Y(n)$. Peak frequencies in the cross-spectral function define common frequency components for analyzed sequences. The common frequency components for a group of protein sequences can be defined as follows:
$|M_n| = |X_{1,n}| \, |X_{2,n}| \cdots |X_{M,n}|, \quad n = 1, 2, \ldots, N/2$
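The two definitions above can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the RRM implementation of the cited reference: the series are synthetic stand-ins for encoded sequences, the mean is removed before the transform (an assumption made here), a naive DFT is used so that only the standard library is needed, and the names `dft`, `cross_spectrum`, `consensus_spectrum` and `peak_bin` are hypothetical.

```python
import cmath
import math
import random


def dft(x, n_points):
    """Naive O(N^2) DFT of a real series, mean-removed and zero-padded.

    Assumes len(x) <= n_points.
    """
    mean = sum(x) / len(x)
    padded = [v - mean for v in x] + [0.0] * (n_points - len(x))
    return [
        sum(v * cmath.exp(-2j * math.pi * k * i / n_points)
            for i, v in enumerate(padded))
        for k in range(n_points)
    ]


def cross_spectrum(x, y, n_points):
    """S_n = X_n * conj(Y_n) for n = 1 .. N/2 (list index 0 holds n = 1)."""
    big_x, big_y = dft(x, n_points), dft(y, n_points)
    return [big_x[n] * big_y[n].conjugate()
            for n in range(1, n_points // 2 + 1)]


def consensus_spectrum(series, n_points):
    """|M_n| = product over all series of |X_{i,n}| for n = 1 .. N/2."""
    spectra = [dft(s, n_points) for s in series]
    return [math.prod(abs(s[n]) for s in spectra)
            for n in range(1, n_points // 2 + 1)]


def peak_bin(values):
    """Return the frequency index n (1-based) with the largest magnitude."""
    mags = [abs(v) for v in values]
    return mags.index(max(mags)) + 1


# Synthetic stand-ins for encoded sequences: a shared component at n = 8
# with a different phase and independent noise in each series.
random.seed(0)
N = 64
series = [
    [math.sin(2 * math.pi * 8 * i / N + phase) + random.gauss(0.0, 0.3)
     for i in range(N)]
    for phase in (0.0, 1.0, 2.0)
]

pair_peak = peak_bin(cross_spectrum(series[0], series[1], N))
family_peak = peak_bin(consensus_spectrum(series, N))
print(pair_peak, family_peak)
```

Because the product is taken over magnitudes, a component must be present in every series to survive as a peak, which is why the shared component dominates regardless of its phase in each series.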
This methodology can be illustrated via an example. Fibroblast growth factors (FGF) constitute a family of proteins that affect the growth, differentiation, and survival of certain cells. The symbolic representations of two FGF amino acid sequences are shown below:
Symbolic representations, such as these, can be translated into numerical sequences using the EIIP index, described by K. Tomii and M. Kanehisa in "Analysis of amino acids and mutation matrices for sequence comparison and structure prediction of proteins," Protein Engineering, vol. 9, January 1996.
V. Veljkovic, I. Cosic, B. Dimitrjevic, and D. Lalovic, in "Is it possible to analyze DNA and protein sequences by the methods of digital signal processing?," IEEE Transactions on Biomedical Engineering, vol. 32, May 1985, have shown that the EIIP correlates with certain biological properties.
The graphical representation of the corresponding numerical sequences for the FGF proteins (SEQ ID NO:1 and SEQ ID NO:2), obtained by replacing every amino acid with its EIIP value, can be seen in FIGS. 1A and 1B. A DFT is performed on each numerical sequence. The resulting spectra are shown in FIGS. 2A and 2B. The cross-spectral function of the two FGF spectra generates the consensus spectrum shown in FIG. 3. For the spectrum plots the x-axis represents the RRM frequencies and the y-axis represents the normalized intensities. The prominent peak denotes the common frequency component for this family of proteins.
The presence of a peak frequency in a consensus spectrum implies that all the analyzed sequences have one frequency component in common. This frequency is related to the biological function provided the following conditions are met:
one peak only exists for a group of protein sequences sharing the same biological function;
no significant peak exists for biologically unrelated protein sequences;
peak frequencies are different for different biological functions.
However, since frequency analysis alone contains no spatial information, there is no indication as to which residues contribute to the frequency components. The RRM technique lacks the ability to reliably identify the individual amino acids that contribute to that peak frequency.
Spatial Analysis Methods
Frequency analysis alone cannot handle the transitory nature of non-stationary signals. However, a time-frequency representation of a signal (synonymously known in the art as a space-frequency representation; see Leon Cohen, Time-Frequency Analysis, Prentice Hall, 1995, p. 113) provides information about how the spectral content of the signal evolves with time (or space) and therefore provides a tool to analyze non-stationary signals.
In an attempt to provide spatial information relating to the proteins, Q. Fang and I. Cosic in "Prediction of active sites of fibroblast growth factors using continuous wavelet transforms and the resonant recognition model," Proceedings of The Inaugural Conference of the Victorian Chapter of the IEEE EMBS, 1999, describe a method using a continuous wavelet transform to analyze the EIIP representations of protein sequences. The continuous wavelet transform (CWT) is one of the time-frequency or space-frequency representations. Because the CWT provides the same time/space resolution for each scale, the CWT can be chosen to localize individual events such as active site identification. The amino acids that comprise the active site(s) are identified as the set of local extrema of the coefficients in the wavelet transform domain. The energy concentrated local extrema are the locations of sharp variation points of the EIIP and are proposed by Fang and Cosic as the most critical locations for a protein's biological function.
Experiments have shown that the potential cell attachment sites of FGFs are between residues 46-48 and 88-90. FIG. 4 is a gray scale plot of a CWT spectrogram (a time-frequency representation) of the FGF protein (SEQ ID NO:1 and SEQ ID NO:2) of Example 1. This plot was produced using an intensity plot routine in MATLAB, available from MathWorks, Natick, Mass. The gray scale on this plot represents the amplitude of the data, the lightest gray being the highest amplitude and the black being the lowest amplitude. For clarity of illustration the background, which would otherwise be completely black, has been rendered white. It can be observed that there are two bright regions at the higher frequencies from scale 1.266 to scale 2.062, which correspond to the amino acids at the active sites. These regions are enclosed in white rectangular boxes and are labeled with the reference numerals 100 and 200, respectively.
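The idea of locating sharp variation points as local extrema of wavelet coefficients can be sketched as follows. This is a simplified illustration and not the procedure of Fang and Cosic: the Mexican-hat (Ricker) wavelet, the scale values, the 50% magnitude floor and the toy input are all assumptions chosen here for clarity, and the names `ricker`, `cwt_row` and `local_extrema` are hypothetical.

```python
import math


def ricker(t, a):
    """Mexican-hat (Ricker) wavelet at offset t for scale a (unnormalized shape
    divided by sqrt(a))."""
    u = t / a
    return (1.0 - u * u) * math.exp(-0.5 * u * u) / math.sqrt(a)


def cwt_row(signal, a):
    """Wavelet coefficients of `signal` at one scale, one value per position."""
    half = int(math.ceil(5 * a))  # truncate the wavelet support at +/- 5a
    n = len(signal)
    row = []
    for b in range(n):
        acc = 0.0
        for k in range(max(0, b - half), min(n, b + half + 1)):
            acc += signal[k] * ricker(k - b, a)
        row.append(acc)
    return row


def local_extrema(row, fraction=0.5):
    """Positions where |coefficient| is a local maximum and at least
    `fraction` of the largest magnitude in the row."""
    mags = [abs(c) for c in row]
    floor = fraction * max(mags)
    return [
        i for i in range(1, len(mags) - 1)
        if mags[i] >= floor and mags[i] > mags[i - 1] and mags[i] >= mags[i + 1]
    ]


# Toy numeric sequence: flat except for two sharp variations.
toy = [0.0] * 100
toy[40] = 1.0
toy[75] = -1.0

sites = {a: local_extrema(cwt_row(toy, a)) for a in (1.0, 1.5, 2.0)}
print(sites)
```

In this toy case the same two positions are reported at every scale examined; on real encoded sequences the extrema would generally differ between scales and would need to be interpreted accordingly.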
While the wavelet transform technique shows promise for identifying amino acids at potential biologically active sites, it does not reveal the characteristic frequency component of the Resonant Recognition Model. The spectrogram of the CWT can often be difficult to interpret. It is the weaknesses of these prior art methods described above that are overcome by the present method.
Classification of Genes
While the study of proteins leads to the understanding of the functions within organisms, it is still necessary to understand how genes in organisms are regulated, since it is this regulation which influences the production of the proteins under the correct environmental conditions. Within the last few years miniaturized laboratory analysis technology using substrates containing an array of samples has become available. These sample arrays are commonly known as "microarrays" or "gene chips". Microarray technology is revolutionizing functional genomics research by allowing scientists to measure the expression level of thousands of genes simultaneously from a single experiment. The discovery of sets of genes with similar expression patterns has a variety of uses such as: finding genes that might be involved with a particular disease by comparing their patterns with genes that are known to be associated with the disease; characterizing the function of an unknown gene by comparing it to a class of genes of a known class; and finding genes with similar patterns of behavior over time.
Linear Classification Methods
However, a standard protocol for microarray data analysis has not yet been established. Many data mining techniques are currently in use for microarray data analysis. Eisen et al. in "Cluster analysis and display of genome-wide expression patterns", Proceedings of the National Academy of Sciences, Vol. 95, pp. 14863-14868, show that standard linear correlation coefficients can be calculated for gene pairs. This information can then be passed to a hierarchical clustering software package to visualize relationships amongst the genes. Eisen's data set contains over six thousand genes and can be found at http://rana.stanford.edu/clustering.
For simplicity of illustration of Eisen's method, a randomly selected subset of the yeast genes from four functional families of Eisen's data set was selected. This subset was clustered using Eisen's standard correlation coefficients and the results passed to the hierarchical clustering algorithm known as "PHYLIP" (Phylogeny Inference Package), version 3.57c (1995), distributed by the author, Joseph Felsenstein, at http://evolution.genetics.washington.edu/phylip.html.
FIG. 10 illustrates the results of this prior art technique. As may be appreciated from FIG. 10, the functional families "Glycolysis" and "Protein Degradation" do not form tight clusters. In fact, these functional families are somewhat disjoint. This deficiency is attributed to the use of a linear correlation metric on what is inherently a nonlinear relationship of the gene expression and the functional families.
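The limitation of a linear correlation metric can be illustrated with a short sketch. The plain Pearson coefficient is used here as an assumption (the exact similarity metric of the cited reference may differ in detail), and the expression profiles are invented toy values rather than data from Eisen's data set.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy expression profiles (hypothetical values, one per experiment).
profile = [-3, -2, -1, 0, 1, 2, 3]
linear_partner = [2 * v + 1 for v in profile]   # linearly dependent on profile
nonlinear_partner = [v * v for v in profile]    # fully determined, but nonlinear

r_linear = pearson(profile, linear_partner)
r_nonlinear = pearson(profile, nonlinear_partner)
print(r_linear, r_nonlinear)
```

The second partner is completely determined by the first profile, yet its linear correlation with it vanishes, so a clustering step driven only by this metric would treat the two as unrelated.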
In view of the foregoing it is believed advantageous to use a nonlinear time-frequency transform to identify the frequency components that classify biological elements into functional families, while simultaneously retaining spatial information involving secondary structure and biologically active sites.
The relative terms "biological element:biological subelement" as used herein are meant to express biological entities related in next-adjacency in a hierarchy, with the "biological element" occupying the higher level in the hierarchy with respect to the "biological subelement".
For example, in a first hierarchy:
protein sequence
amino acid
DNA sequence
nucleotide
the first member of the following hierarchically adjacent pairs of entities is the biological element while the second member of the pair is the biological subelement, thus:
protein sequences:amino acid
amino acid:DNA sequences
DNA sequences:nucleotide.
As a further example, in a second hierarchy:
gene expression experiment
gene
gene expression value
a "gene expression value" is a biological subelement of a "gene" (the biological element) in expression experiments across genes. A "gene" can be a biological subelement of a "gene expression experiment" (the biological element). A gene expression value, for example, might be a physico-chemical property measurement of an amino acid.
The term "functional family" refers to biological elements exhibiting similar behaviors under the same environmental conditions. The term includes:
Proteins with a common biological function;
DNA sequences with common regions;
Genes with related expression behavior; and
Cell line similarity based on gene expression.
The present invention is a method of classifying a biological element comprised of biological subelements into a functional family, wherein each family is represented by a cluster of data points around a common frequency characteristic of a time-frequency transform, the method comprising the steps of:
a) converting a symbolic representation of a sequence of biological subelements to a numeric representation of that sequence;
b) performing a time-frequency transform on the numeric representation;
c) identifying a cluster of data having a common frequency characteristic in the time-frequency domain,
thereby to identify a biological element in the functional family corresponding to that cluster.
The present invention may be implemented to classify a protein into a functional family, wherein each family is represented by a cluster of data points around a common frequency characteristic of a time-frequency transform, the method comprising the steps of:
a) converting a symbolic representation of a primary amino acid sequence data to a numeric representation of that sequence;
b) performing a time-frequency transform on the numeric representation; and
c) identifying clusters of data having a common frequency characteristic in the time-frequency domain, thereby to identify proteins of a common functional family.
The resulting transformed data may be plotted in the time-frequency domain using commercially available plotting routines.
The preferred time-frequency transform is the Wigner-Ville time-frequency transform. Since this transform is quadratic in nature, cross-terms representing unwanted interference are produced. Accordingly, the method may further include the step of filtering the interference terms from the transformed data. The preferred filter method is the center affine filter method.
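The origin of the interference terms mentioned above can be seen in a small sketch of a discrete Wigner-Ville distribution. This is a textbook-style illustration under stated assumptions, not the implementation of the present method: the input is assumed to be a complex (analytic) sequence, the test signals are synthetic complex tones, no interference filtering is applied, and the names `wigner_ville` and `tone` are hypothetical.

```python
import cmath
import math


def wigner_ville(x, n_freq):
    """Discrete Wigner-Ville distribution of a complex (analytic) sequence.

    Returns W[n][k]; bin k corresponds to k / (2 * n_freq) cycles per sample,
    because the lag between x[n + m] and x[n - m] is 2m.
    """
    n_samples = len(x)
    tfr = []
    for n in range(n_samples):
        # Largest symmetric lag that stays inside the sequence.
        max_lag = min(n, n_samples - 1 - n, n_freq // 2 - 1)
        row = []
        for k in range(n_freq):
            acc = 0j
            for m in range(-max_lag, max_lag + 1):
                kernel = x[n + m] * x[n - m].conjugate()  # quadratic in x
                acc += kernel * cmath.exp(-2j * math.pi * k * m / n_freq)
            row.append(acc.real)  # imaginary part cancels by symmetry in m
        tfr.append(row)
    return tfr


def tone(freq, n_samples):
    """Complex exponential at `freq` cycles per sample."""
    return [cmath.exp(2j * math.pi * freq * n) for n in range(n_samples)]


N_SAMPLES, N_FREQ = 64, 32
mid = N_SAMPLES // 2

# One component: energy concentrates at bin 2 * 0.125 * 32 = 8.
single = wigner_ville(tone(0.125, N_SAMPLES), N_FREQ)
single_peak = max(range(N_FREQ), key=lambda k: single[mid][k])

# Two components (bins 4 and 12): a cross-term appears midway, at bin 8,
# and oscillates in sign along the sequence.
two_tones = [a + b for a, b in zip(tone(0.0625, N_SAMPLES),
                                   tone(0.1875, N_SAMPLES))]
double = wigner_ville(two_tones, N_FREQ)
print(single_peak, double[mid][4], double[mid][8], double[mid + 4][8])
```

In the two-component case the value at the midway bin is not a property of either component; it arises from their product and changes sign with position, which is the behavior an interference filter is intended to suppress.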
In general, each numeric representation may be either a scalar representation of a characteristic of each biological subelement or a vector representation of multiple characteristics of each biological subelement. The vector representation may be reduced to a minimal set of dimensions which preserves the functionally important features of the biological element.
When classifying a biological element each numeric representation may be either a scalar representation of a physico-chemical property of each biological subelement or a vector representation of multiple physico-chemical properties of each biological subelement. The vector representation may be reduced to a minimal set of dimensions which preserves the functionally important features of the biological element.
Biologically active regions are identified by relatively high amplitude clusters of data points. Accordingly, the method may further include the step of identifying clusters of data points whose amplitude exceeds a predetermined threshold, thereby to identify biologically active regions in the biological element.
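The thresholding step can be sketched as follows, assuming the transformed data are available as a table of amplitudes indexed by position and frequency bin. The grouping of adjacent positions into runs, the threshold value and the toy amplitudes are assumptions made for illustration, and the name `active_regions` is hypothetical.

```python
def active_regions(tfr, threshold):
    """Return inclusive (start, end) position runs in which at least one
    frequency bin has an amplitude above `threshold`.

    tfr[n][k] is the amplitude at position n and frequency bin k.
    """
    hot = [max(row) > threshold for row in tfr]
    regions, start = [], None
    for position, flag in enumerate(hot):
        if flag and start is None:
            start = position                         # a run begins
        elif not flag and start is not None:
            regions.append((start, position - 1))    # a run ends
            start = None
    if start is not None:                            # run reaches the last position
        regions.append((start, len(hot) - 1))
    return regions


# Toy amplitude table: 9 positions by 3 frequency bins (hypothetical values).
toy_tfr = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.3],
    [0.9, 0.4, 0.2],
    [0.3, 1.1, 0.2],
    [0.2, 0.1, 0.1],
    [0.1, 0.3, 0.2],
    [0.2, 0.2, 0.1],
    [0.1, 0.2, 1.4],
    [0.3, 0.1, 0.2],
]

regions = active_regions(toy_tfr, threshold=0.8)
print(regions)
```

Each reported run is a candidate region only; how the threshold is predetermined is left open here, as it is in the text above.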