As genome sequences are determined for an increasing variety of species, much attention is being paid to genome comparison, a method aimed at discovering new information from the genetic differences between species. Genome comparison seeks to identify the genes responsible for the development of individual species, either to find groups of genes believed to be common to all living organisms or, conversely, to estimate characteristics specific to individual species.
Recent years have witnessed the development of an infrastructure in the form of DNA chips and DNA microarrays (hereinafter referred to as ‘biochips’). As a result, the interests of molecular biologists are turning from inter-species data to intra-species data; in other words, they are focusing on the analysis of genes expressed simultaneously in a particular cell. Thus, alongside the more conventional comparisons between species, there is an increasing number of ways in which data is extracted and used.
For instance, if a previously unknown gene is discovered and found to exhibit the same expression pattern as a known gene, it may be inferred to have a function similar to that of the known gene. The functional significance of such genes and of the proteins themselves is being studied in terms of functional units and groups. Meanwhile, as far as interactions between them are concerned, the direct and indirect effects of a given gene are being analysed by comparison with known enzyme reaction data or metabolic data, or more directly by destroying the gene or overexpressing it, thus eliminating its expression or expressing it in quantity, and then studying the expression patterns of all genes. An example of success in this field is provided by an expression analysis of yeast performed by a group led by P. Brown of Stanford University in the USA (Michael B. Eisen et al.; Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95 (25); 14863–8, 8 Dec. 1998). This group used DNA microarrays to hybridise genes extracted from cells in a time series, representing the degree of gene expression (the intensity of the hybridised fluorescent signals) numerically. By allocating colours to the numerical values, they then displayed the expression processes of the individual genes in a manner which was easy to understand. They then clustered genes whose expression patterns followed similar courses over the cell cycle (those with similar degrees of expression at a given point in time).
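The allocation of colours to numerical expression values described above can be illustrated with a minimal sketch. It assumes a linear scale between a chosen minimum and maximum and an 8-bit density value. The function name, the scale bounds and the sample values are hypothetical and are not taken from the cited work.

```python
def expression_to_density(value, low, high):
    """Map an expression value to a colour density from 0 (palest) to 255 (densest).

    Assumption: a linear scale between `low` and `high`; values outside
    that range are clamped to the nearest end of the scale.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = (value - low) / (high - low)
    fraction = min(1.0, max(0.0, fraction))  # clamp to [0, 1]
    return round(fraction * 255)


# Hypothetical expression values for one gene across four experiment cases.
row = [expression_to_density(v, 0.0, 4.0) for v in [0.0, 1.0, 4.0, 6.0]]
print(row)
```

Under this scheme a denser colour denotes a higher degree of expression; any monotonic mapping would serve the same purpose.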
FIG. 27 is an example of how the result of a standard cluster analysis of gene expression is displayed according to this method. The experiment cases are arranged in the horizontal direction, and the genes in the vertical direction. The degree of expression of each gene in each experiment case is denoted by the density of colour, with denser colours representing higher degrees of expression. A dendrogram is displayed on the left of the drawing. The dendrogram shows which two closest clusters were merged at each step of the clustering process, while the length of each branch corresponds to the relative distance between the two clusters at the time of merging.
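The merging process that such a dendrogram records can be sketched as follows. This is an illustration only, under stated assumptions: the distance between two genes is taken as one minus the Pearson correlation of their expression profiles, and the distance between two clusters as the average over all pairs of their genes. The text above does not specify either choice. The gene names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(a, b):
    """Return 1 - Pearson correlation of two equal-length profiles (assumed metric)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # assumption: a flat profile is treated as uncorrelated
    return 1.0 - cov / sqrt(var_a * var_b)


def agglomerate(profiles):
    """Repeatedly merge the two closest clusters.

    `profiles` maps a gene name to its list of expression values.
    Returns the merges in order as (left_genes, right_genes, distance);
    the distance corresponds to the branch length in a dendrogram.
    """
    def linkage(c1, c2):
        # Average pairwise distance between the genes of two clusters.
        total = sum(pearson_distance(profiles[g], profiles[h])
                    for g in c1 for h in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2),
                          key=lambda pair: linkage(*pair))
        merges.append((left, right, linkage(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# Hypothetical profiles: geneA and geneB rise together, geneC falls.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
merges = agglomerate(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The exhaustive search over all cluster pairs is adequate for a handful of genes but would be too slow for the thousands of genes observed on a biochip.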
FIG. 28 is an example of another display representing the similarity of gene expression patterns. Observed data on individual genes is arranged on the right, while the dendrogram displayed on the left has been prepared on the basis of these gene expression patterns.
With developments in biology, the functions of genes are gradually being clarified, and biologists are attempting to analyse them by combining expression data with known information. Analysis by dendrogram allows biologists to look for biologically meaningful clusters (groups of genes). In other words, if the expression patterns of the individual genes in a cluster are similar and many of those genes are of known function and share the same pattern, the cluster is extracted as a meaningful one. Such clusters are herein referred to as function clusters. Vertical bars 2801 and 2802 in FIG. 28 are examples of such function clusters. For instance, if there is a gene of unknown function within a function cluster, it is possible to infer that it possesses a function similar to that of the genes of known function in the same cluster. What is more, by examining the expression pattern of a function cluster, it is possible to discover the expression process specific to the function.
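The reasoning above can be sketched in code. This is an illustration under stated assumptions, not the method of the invention: a cluster is treated as a function cluster when its most common known function is carried by at least a minimum number of genes and a minimum share of the cluster. Both thresholds, the function name, the gene names and the annotations are hypothetical.

```python
from collections import Counter


def find_function_cluster(genes, annotations, min_share=0.5, min_count=2):
    """Decide whether a cluster of genes qualifies as a function cluster.

    `annotations` maps a gene name to its known function, or to None
    when the function is unknown. Returns (function, unknown_genes) if
    the cluster qualifies, where unknown_genes are candidates inferred
    to share that function; returns None otherwise.
    Assumption: `min_share` and `min_count` are illustrative thresholds.
    """
    known = [annotations[g] for g in genes if annotations.get(g)]
    if not known:
        return None
    function, count = Counter(known).most_common(1)[0]
    if count >= min_count and count / len(genes) >= min_share:
        unknown = [g for g in genes if not annotations.get(g)]
        return function, unknown
    return None


# Hypothetical annotations; g4 and g6 have no known function.
annotations = {
    "g1": "ribosome", "g2": "ribosome", "g3": "ribosome", "g4": None,
    "g5": "glycolysis", "g6": None,
}
result = find_function_cluster(["g1", "g2", "g3", "g4"], annotations)
weak = find_function_cluster(["g5", "g6"], annotations)
print(result, weak)
```

In the first cluster three of the four genes share a known function, so the unannotated gene becomes a candidate for that function. The second cluster has too few annotated genes to support any inference.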
A huge amount of gene data needs to be handled in the actual analysis of gene expression patterns. This is because biochips make it possible to observe on the order of several thousand to several tens of thousands of genes at the same time. With developments in biochip technology, the number of genes which can be observed simultaneously is set to increase by leaps and bounds, lending powerful support to the process of clarifying the mechanism of life.
As the number of such genes increases in this manner, it becomes extremely difficult to comprehend the workings of all of them. A dendrogram will contain thousands or tens of thousands of genes, and even the subtrees illustrated in FIGS. 27 and 28 will be very complex and include many fine branches, making it difficult to decide what sort of classification has been carried out.
Researchers will have to spend much time and effort choosing function clusters from these dendrograms. Some commercially available gene expression clustering tools have display functions for dendrograms and gene names, but none gives any suggestion as to which clusters merit attention.
In view of the above problems with conventional technology, it is a first object of the present invention to take the results of clustering, extract from them groups of genes having the same function and genes having similar expression patterns to the groups of genes, and provide a function and display for re-analysing these genes. This makes it possible to assist in discovering specific expression patterns for gene functions, surmise unknown gene functions, and infer whether or not genes known to have one function have other functions as well.
It is a second object of the present invention to provide a means of automatically sorting clusters of genes having similar expression patterns and the same function, and displaying them in a form in which it is easy for researchers to understand their characteristics.