As the number of species whose genome sequences have been determined has increased, so-called genome comparison has been performed extensively. Genome comparison aims at finding new facts based on gene differences among species, for example, finding genes involved in evolution, finding a collection of genes considered to be common to all species, or, conversely, studying characteristics unique to specific species.
The recent development of infrastructures such as DNA chips and DNA microarrays has shifted the interest in the art of molecular biology from interspecies information to intraspecies information, namely coexpression analysis, and has broadened the scope of study from extraction of information to correlation of information, including the conventional comparison between species.
For example, if an unknown gene has an expression pattern identical to that of a known gene, the unknown gene can be assumed to have a function similar to that of the known gene. Such functional meanings of genes and proteins are studied as function units or function groups. The interactions between the function units or function groups are also analyzed by correlating them with known enzymatic reaction data or metabolism data, or more directly, by knocking out or overexpressing a specific gene to eliminate or accelerate its expression and then studying the direct and indirect influences on the gene expression patterns of a whole collection of genes.
Herein, an expression pattern of a gene is represented as a curve (or a line graph) of successive expression levels obtained from a series of experiment cases performed on the gene, where the horizontal and vertical axes represent experiment cases and expression levels, respectively. The expression pattern is not limited to that of a gene but may be an expression pattern of another biopolymer such as DNA, cDNA, RNA, a DNA fragment or a protein. Herein, expression patterns of genes are used as examples for describing the present invention. Specific examples of the experiment cases along the horizontal axis include experiments in a time course, body parts of an organism, species, parts of a nucleotide sequence, and genes.
One exemplary analysis of expression patterns in which experiments in a time course are taken as the horizontal axis is the expression analysis of yeast by the group of P. Brown et al. at Stanford University (Michael B. Eisen et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. (1998), December 8; 95(25): 14863-8). They used a cdc6 mutant strain to obtain expression data of the gene upon experiments in a time course. The expression data include a time sequential expression pattern obtained with centrifugation, a time sequential expression pattern during the budding period, a time sequential expression pattern obtained with a shock by high temperature, a time sequential expression pattern obtained with a shock by low temperature and a time sequential expression pattern obtained with the diauxic shift method. These expression data were combined to cluster the expression patterns, thereby succeeding in specifying the function of the gene.
According to one method for analyzing gene expressions, genes having patterns similar to that of a selected gene (a reference gene) shown in FIG. 24B are extracted from the expression patterns of a group of numerous genes (candidate genes) shown in FIG. 24A. The extracted genes are potential members of a function group or a function unit to which the reference gene belongs.
FIGS. 24C and 24D schematically show the searching process and the results thereof according to this conventional method, respectively. According to this conventional method, genes whose expression patterns are similar to the expression pattern of the reference gene along the entire pattern are extracted. Specifically, the expression pattern of each gene is taken as a single vector (a vector in a multidimensional space whose independent axes represent the multiple experiment cases and whose components are the expression levels). Then, the vectors are compared to give the similarity (or dissimilarity) of the genes. Alternatively, genes can be extracted based on curve data (expression pattern data) selected by a user instead of actual gene data. The expression level along the vertical axis represents the proportion of the number of amplified genes. Actual measurement values depend on the experiment process, and the index of the expression levels may be, for example, fluorescent intensities from fluorescent labels labeling genes hybridized on a DNA chip, chemiluminescent intensities from chemiluminescent labels, values obtained by detecting, with an electrode, electric signals induced by chemical reactions of genes attached on a DNA chip, or mass spectrographic values obtained by measuring the time of flight of gasified hybridized genes.
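The whole-pattern comparison described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure and a fixed cutoff; the source does not name a specific measure, and the function names, gene names, threshold and expression values below are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Dissimilarity between two expression patterns treated as vectors.

    Each component is the expression level for one experiment case, so both
    patterns must cover the same experiment cases in the same order.
    """
    if len(a) != len(b):
        raise ValueError("patterns must cover the same experiment cases")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_similar(reference, candidates, threshold):
    """Return (name, distance) pairs for candidates within `threshold`
    of the reference pattern, ordered from most to least similar."""
    hits = []
    for name, pattern in candidates.items():
        d = euclidean_distance(reference, pattern)
        if d <= threshold:
            hits.append((name, d))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical expression levels over five experiment cases.
reference = [1.0, 2.0, 4.0, 3.0, 1.0]
candidates = {
    "geneA": [1.1, 2.1, 3.9, 3.0, 0.9],  # close to the reference everywhere
    "geneB": [4.0, 3.0, 1.0, 1.0, 2.0],  # different shape
    "geneC": [1.0, 2.0, 4.0, 3.0, 1.0],  # identical to the reference
}

hits = extract_similar(reference, candidates, threshold=0.5)
for name, distance in hits:
    print(f"{name}: distance {distance:.3f}")
```

Because every experiment case contributes to the distance, a candidate is extracted only when it tracks the reference over the entire pattern.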
According to this conventional method, however, the only genes extracted are those whose expression patterns are similar to that of the selected reference gene along the entire expression pattern (i.e., for all experiment cases).
For example, the conventional method cannot recognize an expression pattern similar to that of the reference gene when the expression pattern data of the candidate gene contains a measurement error, as shown in FIG. 22A, caused by a difference between experimental environments. Moreover, when a plurality of genes have similar functions, and hence similar expression patterns, in one segment of the time course (for a part of consecutive experiment cases) but have different functions, and hence different expression patterns, in another segment, their curves are similar only in a particular segment as shown in FIG. 22B. According to the conventional method, such a group of curves that are similar only in a particular segment cannot be extracted.
An expression regulatory effect of genes consists of a series of cascades where expression of one gene induces or inhibits expression of another gene. The term “cascades” as used herein refers to chain expressions of multiple genes as schematically shown in FIG. 23 where gene 1 induces expression of gene 2, gene 2 in turn induces expression of gene 3, and gene 3 in turn induces expression of gene 4. A further complicated network is formed by a combination of such cascades. In such gene cascades, the expression peaks of the multiple genes are arranged in sequence along the time axis while their expression patterns have very similar shapes. FIG. 22D also shows a part of such cascades. Such gene cascades cannot be detected by the conventional method.
In addition, the conventional method cannot detect, for example, a gene expression pattern shown in FIG. 22E where the expression of the gene is repressed, a gene expression pattern shown in FIG. 22C where the expression of the candidate gene has the same pattern as the expression of the reference gene but with a large and constant difference, or a gene expression pattern shown in FIG. 22F where the expression of the candidate gene has the same pattern as the expression of the reference gene but is stretched under a constant magnification.
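A toy illustration of these limitations, under stated assumptions, is given below. It assumes Euclidean distance with a fixed cutoff as the whole-pattern comparison. It also models repression as a sign inversion and magnification as a scaling of the expression levels; these are simplifying assumptions and are not taken from the figures. All values are synthetic.

```python
import math


def euclidean_distance(a, b):
    """Whole-pattern dissimilarity: every experiment case contributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Synthetic reference pattern with a single expression peak.
reference = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]

# Candidates related to the reference by a simple transformation.
variants = {
    "constant offset": [x + 5.0 for x in reference],
    "peak shifted by one experiment case": [0.0] + reference[:-1],
    "inverted (modeling repression)": [-x for x in reference],
    "constant magnification": [x * 3.0 for x in reference],
}

threshold = 1.0  # hypothetical similarity cutoff

distances = {
    label: euclidean_distance(reference, pattern)
    for label, pattern in variants.items()
}
missed = [label for label, d in distances.items() if d > threshold]

for label, d in distances.items():
    status = "missed" if d > threshold else "extracted"
    print(f"{label}: distance {d:.2f} ({status})")
```

In this sketch every variant falls outside the cutoff even though each is derived directly from the reference, which is the kind of relationship a whole-pattern comparison overlooks.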