A large number of methods have been developed for the analysis of microarray gene expression data. This reflects the tremendous complexity of transforming digital information on the expression levels of over 20,000 genes into meaningful biological insights. Many microarray data analysis approaches are based on a case-control study design, for example comparing treated and untreated cells or matched disease and control tissues. In other cases, characteristic subsets of genes or classifiers are built and tested for specific purposes, such as the differential diagnosis of diseases. In most cases, substantial numbers of samples from both the case and control groups are required in order to arrive at a statistically significant interpretation of differentially expressed genes. Interpretation of data from individual samples is often not possible with these approaches. For example, samples from disease tissues, such as tumors, are often readily available, whereas the corresponding normal tissue samples may be much harder to obtain. In other cases, an appropriate control group is hard to define and challenging to acquire, particularly from human tissues. For example, in studies of stem cells, differentiation should be followed in comparison to multiple differentiated cell and tissue types in order to provide a comprehensive understanding of the differentiation patterns of the cells.
Recently, there have been major efforts to develop large-scale databases from publicly available microarray datasets (e.g. GeneSapiens, Oncomine, Connectivity Map, Gene Expression Omnibus, ArrayExpress) in order to analyze and mine the enormous quantities of microarray data that have been published by the biomedical community. Indeed, analyses of such pooled data are increasingly recognized as a powerful means to study gene networks and gene regulation, and to identify tissue- or disease-specific gene expression patterns. The availability of these microarray databases also provides an opportunity to use a comprehensive collection of reference samples as a means of guiding the interpretation of new microarray data produced by investigators from test samples. This is particularly appealing for the analysis and interpretation of data from individual samples. However, there are currently no tools available for such comparisons. The microarray data analysis community would therefore need a tool similar to the simple, yet highly powerful and versatile, sequence comparison program BLAST [Altschul et al., Basic local alignment search tool, J Mol Biol, 1990], which matches an unknown test DNA sequence against a comprehensive reference database of previously sequenced samples.
Today, the amount of genetic information, including both DNA sequence data and functional gene expression data, is increasing rapidly. This is especially the case in oncology: cancer is a genetic disease at the cellular level, and should be diagnosed and treated as such.
A very large number of publications describe various methods for classifying gene expression profiles into a priori defined classes. For clarification, these methods are usually divided into two types, unsupervised and supervised: the former are more commonly known as clustering methods, whereas the latter are more commonly known as classifiers. The fundamental difference is that unsupervised methods merely organize the data based on its features; sorting numbers is perhaps the simplest unsupervised approach, and hierarchical and k-means clustering are the most commonly applied ones. Stratifying cancer diagnostic tests available today (e.g. OncotypeDX, MammaPrint, TargetNow) are based on unsupervised methods in which a group of pre-defined gene expression values, possibly among other sample analysis techniques, is used to diagnose cancer, typically using a dedicated chip manufactured solely to measure a pre-set panel of 80-100 genes. Supervised methods use machine learning: the computer is taught to recognize certain features of the training data and is subsequently able to classify novel data based on these features.
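The distinction between the two families can be illustrated with a minimal sketch. It is not taken from any of the cited tests or publications: the two-gene profiles, the class labels, the fixed initial centroids and the choice of a nearest-centroid classifier as the supervised method are all assumptions made purely for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Toy expression profiles (two hypothetical genes per sample); values are invented.
profiles = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]

def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Unsupervised: group points around centroids without using any labels."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(c) if c else old for c, old in zip(clusters, centroids)]
    return clusters

def train_nearest_centroid(points, labels):
    """Supervised: learn one centroid per known class label."""
    by_label = {}
    for p, label in zip(points, labels):
        by_label.setdefault(label, []).append(p)
    return {label: mean_point(pts) for label, pts in by_label.items()}

def classify(model, sample):
    """Assign a new sample to the class with the closest learned centroid."""
    return min(model, key=lambda label: dist(sample, model[label]))

# Unsupervised: only the data and a starting guess for the centroids are given.
clusters = kmeans(profiles, centroids=[profiles[0], profiles[3]])

# Supervised: labels (assumed here) are given for training, then a new sample is classified.
model = train_nearest_centroid(profiles, ["normal"] * 3 + ["tumour"] * 3)
prediction = classify(model, (4.7, 5.3))
```

In this sketch the clustering step never sees the labels, whereas the classifier depends on them; which of the two is appropriate depends on whether reliable class annotations exist for the reference samples.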
In order to better understand the significance of an expression profile, a biologically meaningful comparison to known gene expression profiles should be made possible. Methods for comparing gene expression samples to each other are known, but they usually fail on one or both of the following: (i) the ability to compare a single sample against multiple samples (one-versus-one or many-versus-many comparisons are more feasible); and (ii) the ability to extract biologically sensible information as to which features (i.e. genes) are especially responsible for the similarity found.
Cancer is a very personalized disease at the genetic level. Every cancer is different, with an enormous number of potential gene mutations and gene expression anomalies, and combinations thereof, across all of the approximately 23,000 human genes. It has been shown, e.g. by tumour sequencing projects, that one tumour may have numerous different mutations, and that the same cancer type (such as breast cancer or prostate cancer) may have significantly different genetic profiles in different individuals.
Currently, cancer diagnosis is performed by pathologists who visually inspect the histology of a biopsy. Even though this is an indispensable part of the diagnostic procedure, it is subject to errors, and in some cases visual features cannot reveal the exact nature of the cancer. More advanced methods are based on measuring pre-determined genes identified in prior research, and on prescribing medication according to diagnoses derived from those specific genes.
One problem with the current diagnostic methodologies is that, e.g. because a number of genes are omitted from the scope of the method, they lose information that may be needed for diagnostic and treatment decisions, and may even lead to a wrong diagnosis if the wrong genes are measured. As a result, the diagnostic process is inefficient and may produce only partial, or even wrong, results.
A further problem with the current diagnostic methodologies is that they are not particularly suitable for identifying the primary tumour of a metastasized cancer.
PCT application WO2008045389 teaches an improved computerized decision support system and apparatus incorporating bioinformatics software for selecting the optimum treatment for a cancerous condition in a human patient. The system comprises a PCR kit or a gene chip, an integrated detector, a detector for accepting receipt of the gene chip toward analyzing the patient's genotype, a database describing the correlation of patient genotypes and the efficacy and toxicity of various anti-cancer drugs used in treating patients with a particular cancerous condition and a computerized decision support system.
PCT application WO2009131710 teaches a method for identifying genomic signatures linked to survival specific for a disease. The method comprises performing data analysis comprising bioinformatics and computational methodology to identify copy number abnormalities and altered expression of disease candidate genes.
PCT application WO2006135904 teaches a method for producing an improved gene expression profile (GEP) for one or more cell samples. The method involves determining one or more particular gene (PG) improved results (IR) for the cell sample, and compiling the PG IR values to produce one or more forms of improved GEP for the cell sample.
PCT application WO2007137187 teaches a method involving performing a test for a gene and a test for a gene expressed protein from a biological sample of a diseased individual. A determination is made to detect which genes and/or gene expressed proteins exhibit a change in expression compared to a reference. A drug therapy used to interact with the genes and/or gene expressed proteins that exhibited a change in expression, and that is not restricted to a single disease, is identified from an automated review of an extensive literature database and data generated from clinical trials.
PCT application WO2009132928 teaches a method for predicting an outcome of a patient suffering from or at risk of developing a neoplastic disease. The method comprises the steps of quantifiably determining the gene expression levels of genes, thus obtaining a pattern of expression levels of the genes, comparing the pattern of expression levels with known, pre-defined reference patterns of expression levels indicative of the outcomes and predicting an outcome of a patient from the comparison using a mathematical function to determine the similarity of the pattern of expression levels with the first reference pattern and the second reference pattern. The method depends on disease candidate genes as the starting point of forming the prediction.
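The general idea of comparing an expression pattern with two reference patterns by means of a similarity function can be sketched as follows. The mathematical function actually used in WO2009132928 is not described here, so the Pearson correlation below is an assumption, as are the outcome names, the four-gene patterns and the helper names `pearson` and `predict_outcome`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def predict_outcome(pattern, references):
    """Return the name of the most similar reference pattern and all similarity scores."""
    scores = {name: pearson(pattern, ref) for name, ref in references.items()}
    return max(scores, key=scores.get), scores

# Hypothetical reference patterns over the same four pre-selected genes.
references = {
    "good": [2.0, 4.0, 6.0, 8.0],
    "poor": [8.0, 6.0, 4.0, 2.0],
}
patient = [2.5, 3.5, 6.5, 7.0]  # invented expression levels for one patient

outcome, scores = predict_outcome(patient, references)
```

Note that such a comparison is only defined over the genes present in the reference patterns, which is why an approach of this kind depends on the prior selection of candidate genes.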
PCT application WO2009125065 teaches a computer-implemented method for correcting data sets from measurements of properties of biological samples. The method comprises the steps of determining first and second property-specific distribution parameters for each property, determining a property-specific correction element for each version of the parallel measurement device based on the discrepancy between the property-specific distribution parameters, correcting the property value and outputting the property's corrected property value to a physical memory and/or display.
PCT application WO2008066596 discloses a gene expression barcode for normal and diseased tissue classification. The computer-based method includes the step of determining a threshold of active gene expression for each gene across a collection of reference categories, each consisting of a plurality of samples. The gene-specific thresholds are then used to characterize which genes are in an active or inactive state in each of the reference categories. These states are defined as the gene expression barcodes of the reference categories. The method is unable to identify which genes are the most significant ones in the process of identifying a tissue type. It merely identifies genes whose expression level exceeds the threshold value for the gene. The number of those genes may be very high, which makes the interpretation of the result very difficult and reduces its reliability. Additionally, the method relies on a predefined set of genes, the barcode, for tissue classification. Overall, the method assumes that each gene has only two informative expression states, which further limits its predictive potential.
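The two-state limitation can be illustrated with a minimal sketch of threshold-based binarization. The gene names, threshold values and expression levels are hypothetical, and the way the cited method derives its thresholds is not reproduced here; the thresholds are simply taken as given.

```python
# Hypothetical per-gene activity thresholds (assumed values, for illustration only).
thresholds = {"GENE_A": 5.0, "GENE_B": 3.0, "GENE_C": 7.0}

def barcode(expression, thresholds):
    """Map each gene to 1 (active) if its level exceeds the gene's threshold, else 0."""
    return {gene: int(expression[gene] > limit) for gene, limit in thresholds.items()}

# Two invented samples with clearly different expression levels.
sample_1 = {"GENE_A": 5.1, "GENE_B": 1.0, "GENE_C": 9.0}
sample_2 = {"GENE_A": 12.0, "GENE_B": 2.9, "GENE_C": 7.2}

code_1 = barcode(sample_1, thresholds)
code_2 = barcode(sample_2, thresholds)
```

Both samples receive the same barcode although GENE_A is barely above its threshold in the first and far above it in the second, so any information carried by the magnitude of expression is discarded by the binarization.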
None of the methods known in the art teaches a way to analyse and characterize a tissue without first making some assumption about the tissue or limiting the number of genes involved in the process.