(1) Technical Field
The present invention relates to creating and analyzing spatial-expression patterns, wherein the spatial-expression pattern involves integrating the expression data with the spatial distribution of expression information.
(2) Discussion
The bioinformatics field, which, in a broad sense, includes any use of computers in solving information problems in the life sciences, and more particularly, the creation and use of extensive electronic databases on genomes, proteomes, etc., is currently in a stage of rapid growth.
In order to understand some of the concepts in the bioinformatics field, it is important to understand some of the basic principles of cells. A cell relies on proteins for a variety of its functions. Producing energy, biosynthesizing all component macromolecules, maintaining cellular architecture, and acting upon intra- and extra-cellular stimuli are all protein-dependent activities. Almost every cell within an organism contains the information necessary to produce the entire repertoire of proteins that the organism can specify. This information is stored as genes within the organism's DNA genome. Different organisms have different numbers of genes to define them. The number of human genes, for example, is estimated to be approximately 25,000.
Genetic information of all life forms is encoded by the four basic nucleotides, denoted by the symbols A, G, C, and T. The makeup of all life forms is determined by the sequence of these nucleotides. DNA is the molecule that encodes this sequence of nucleotides. The DNA molecule usually contains a large number of genes. Each gene provides biochemical instructions on how to construct a particular protein. The earlier view that each gene produces exactly one protein has recently been revised. In some cases multiple genes are required to create a single protein, and commonly multiple proteins can be produced from a single gene through alternative splicing and post-transcriptional modification. Recently, much attention has also been directed at “Copy Number Variations” (CNVs) or “Copy Number Polymorphisms” (CNPs). Scientists have determined that the genomes of different people have small deletions and insertions of sequences of various sizes scattered through the DNA. Many of these so-called CNVs or CNPs do not produce obvious phenotypes. However, it is suspected that such changes can create a predisposition to various conditions. These CNVs can be in any area of the genome, including between genes, within the intronic areas of genes, or even in exons. An example of a genome is depicted in FIG. 1, wherein the genes 103, 105 are defined as segments of nucleotide sequence on the DNA genome 101. As depicted in FIG. 1, the letter “N” represents any of the four DNA nucleotides: “A,” “C,” “G,” or “T.”
Only a portion of the genome is composed of genes, and the set of genes expressed as proteins varies between cell types. Some of the proteins present in a single cell are likely to be present in all cells because they serve functions required in every type of cell. These proteins can be thought of as “housekeeping” proteins. Other proteins serve specialized functions that are only required in particular cell types. Such proteins are generally produced only in limited types of cells. Given that a large part of a cell's specific functionality is determined by the genes that it is expressing, it is logical that transcription, the first step in the process of converting the genetic information stored in an organism's genome into protein, would be highly regulated by the control network that coordinates and directs cellular activity.
Genes are activated, or expressed, in a very specific fashion and to a specific level at any given moment in time to achieve a desired state. The regulation of transcription is readily observed in studies that scrutinize activities evident in cells configuring themselves for a particular function (specialization into a muscle cell) or state (active multiplication or quiescence). As cells alter their state, coordinate transcription of the protein sets required for the change of state can be observed. As a window both on cell status and on the system controlling the cell, detailed, global knowledge of the transcriptional state could provide a broad spectrum of information useful to biologists. For instance, knowledge of when and in what types of cell the protein product of a gene of unknown function is expressed would provide useful clues as to the likely function of that gene. Furthermore, determining gene expression patterns in normal cells could provide detailed knowledge of how the control system achieves the highly coordinated activation and deactivation required to develop and differentiate a single fertilized egg into a mature organism. Also, comparing gene expression patterns in normal and pathological cells could provide useful diagnostic “fingerprints” and help identify aberrant functions that would be reasonable targets for therapeutic intervention.
The current approaches to studying gene expression patterns attempt to understand the differences in the expression patterns of genes under different conditions (either a pair of conditions or a series of conditions) by comparing the level of expression of various genes one by one. FIG. 2 illustrates a simplified example in which two different gene expression patterns 201, 203, each being composed of five distinct genes 205-209, are compared. Assume that Condition1 pattern 201 is for a healthy cell and Condition2 pattern 203 is for a diseased cell. Using an arbitrary expression scale of zero to ten, the example matrix of expression values is constructed as shown in FIG. 2.
Using the matrix of FIG. 2, the typical approach would compare the genes one by one and draw conclusions such as the following: there are no significant differences in Gene4 208, so it probably does not contribute to the differences observed between Condition1 201 and Condition2 203. However, Gene3 207 exhibits a large difference and might indicate a gene having some relation to the observed differences between Condition1 201 and Condition2 203. The common way to calculate the difference between Condition1 201 and Condition2 203 is to take a difference or ratio between the expression values gene by gene. For example, the Euclidean distance or the Pearson correlation value between the two expression vectors may be calculated.
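By way of illustration only, the gene-by-gene comparison described above may be sketched as follows. The expression values are hypothetical and are not the values of FIG. 2; they are chosen merely so that Gene3 differs strongly and Gene4 does not. The function names are likewise assumptions made for this sketch.

```python
import math

# Hypothetical expression values on the arbitrary 0-10 scale.
# These are NOT the values shown in FIG. 2; they are illustrative only.
GENES = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5"]
condition1 = [3.0, 7.0, 2.0, 5.0, 6.0]  # e.g. healthy cell
condition2 = [4.0, 6.0, 9.0, 5.0, 6.0]  # e.g. diseased cell


def per_gene_differences(a, b):
    """Difference in expression (b minus a) for each gene, taken one by one."""
    return [y - x for x, y in zip(a, b)]


def euclidean_distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


diffs = dict(zip(GENES, per_gene_differences(condition1, condition2)))
distance = euclidean_distance(condition1, condition2)
correlation = pearson_correlation(condition1, condition2)

for gene, diff in diffs.items():
    print(f"{gene}: difference = {diff:+.1f}")
print(f"Euclidean distance:  {distance:.3f}")
print(f"Pearson correlation: {correlation:.3f}")
```

With these assumed values, Gene3 shows the largest per-gene difference and Gene4 shows none, mirroring the kind of conclusion described above.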
These prior art approaches do not take into account a very important piece of information: the spatial distribution of genes, or of the sequence of nucleotides, along the genome. The observation of an extensive number of copy number changes in the genome (affecting the local position of genes) is further evidence of the need to examine and process the genome from a spatial arrangement perspective. Thus, what is needed is an apparatus, method, and computer program product that takes into account not only the expression levels of the genes but also their spatial distribution.
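The limitation noted above can be illustrated with a minimal sketch, again using hypothetical expression values rather than those of FIG. 2. Rearranging the order of the genes, as though they occupied different positions along the genome, leaves a gene-by-gene measure such as the Euclidean distance unchanged, which shows that such a measure carries no positional information. The particular reordering used is an arbitrary assumption.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical expression values (illustrative only, not taken from FIG. 2).
condition1 = [3.0, 7.0, 2.0, 5.0, 6.0]
condition2 = [4.0, 6.0, 9.0, 5.0, 6.0]

# An arbitrary reordering of the five genes, standing in for a different
# spatial arrangement of the same genes along the genome.
new_order = [4, 2, 0, 3, 1]
reordered1 = [condition1[i] for i in new_order]
reordered2 = [condition2[i] for i in new_order]

original_distance = euclidean_distance(condition1, condition2)
reordered_distance = euclidean_distance(reordered1, reordered2)

# The two distances agree: the measure is blind to gene position.
print(math.isclose(original_distance, reordered_distance))
```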