This invention relates generally to the field of functional genomics. The invention enables the direct correlation of genomic DNA with rapidly quantifiable protein expression levels, yielding a protein expression profile for a particular cell. This profile can then be compared with those of reference cells to identify differences in protein expression patterns that are responsible for differentiation, disease states, age, or any other temporal or spatial protein expression difference in particular cells, for use in diagnosis, pathway regulation, or the identification of drug target candidates.
The last quarter of a century has been marked by a relentless drive by molecular biologists to decipher first genes and then entire genomes. Genomics, the use of genetic and molecular biology techniques to develop complete genome maps, as well as underlying genomic sequences for different organisms, has provided an explosion of information about the underlying genes which make up all living things. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, 7 archaea, 1 fungus, 2 animals and 1 plant (The Human Gene Consortium, Nature, 409:860-921 (2001)). A significant milestone in the field of genomics was reached recently with the announcement that the entire human genome had been sequenced.
The most important application of this sequence data, however, will be the ultimate identification of protein coding genes. Proteins are not produced directly from DNA; instead, information in the form of DNA is transcribed to form messenger RNA (mRNA) molecules. These mRNA molecules function as templates for protein synthesis (translation). Each cell in the body contains the entire genome of the organism; however, only a portion of any cell's genome is expressed at any given time. Differences in expression profiles account for the different types of cells and tissues within an organism and for a cell's varying response to stress or disease.
Thus, cells in different tissues of the human body are unique because they express different genes. For example, blood cells and muscle cells not only look different, they also perform different functions. Blood cells supply oxygen to organs and protect us from disease, while muscle cells enable us to move and digest food. These differences are due to specific gene products (proteins) that are unique to blood or muscle cells. The presence of different proteins within the same cell is likewise the result of the function of different genes. An example could be the generation of antibody (Ab) diversity or viruses.
Functional genomics is aimed at discovering the biological function of particular genes and uncovering the means by which sets of genes and their products work together in health and disease. According to the Human Gene Consortium group, there appear to be about 30,000 to 40,000 protein coding genes in the human genome. Remarkably, this is only about twice as many as in the nematode C. elegans or the fruit fly D. melanogaster. Thus the vast complexity of the human must be due to more complicated use of the existing genes, through alternative splicing, rather than simply to an increased number of genes.
If genes encode multiple proteins, then the architect of the biological complexity distinguishing our genetic material from that of a worm is RNA, the molecule that directs the production of proteins from DNA. Unlike genes in bacteria, genes in plant and animal cells are not arranged as continuous DNA but as coding exons interspersed with noncoding introns. This arrangement makes it possible to transcribe one gene into several different products, as each mRNA is spliced together to form combinations of exons and bits and pieces of introns.
Previous estimates were that around 20% of human genes are transcribed in more than one alternative variant, but recent research puts the number closer to 50%, and even this estimate has been criticized as conservative. For example, a team of American researchers studying genes that control brain development in Drosophila melanogaster reviewed calculations indicating that the Neurexin genes can give rise to 35,000 different possible protein products from alternative splicing alone. If the possibilities for RNA editing and post-translational modification are added, one could potentially end up with millions of different gene products. In fact, studies of fly species that have evolved separately for millions of years show that the sequences of many alternative splice sites are strictly conserved, indicating that they are in fact used.
Thus the desired endpoint for the description of a biological system is not the analysis of mRNA transcript levels alone but also the accurate measurement of protein expression levels and their respective activities. Quantitative analysis of global mRNA levels is the current method for the analysis of the state of cells and tissues (Fraser, et al., 1997, "Strategies for whole microbial genome sequencing and analysis," Electrophoresis 18:1207-1216). Several methods have been refined to provide absolute or relative mRNA levels in comparative analysis. mRNA-based genomics, however, has several inherent limitations. For example, gene (mRNA) expression levels may not always accurately predict protein expression levels. Therefore gene expression analysis, such as with microarrays, may not provide definitive information on certain targets. In fact, Gygi et al. recently concluded that the correlation between mRNA and protein expression levels for all yeast proteins was less than 0.4. Indeed, for some genes, while the mRNA levels were of the same value, the protein levels varied by more than 20-fold. Conversely, invariant steady-state levels of certain proteins were observed with respective mRNA transcript levels that varied by as much as 30-fold. Gygi et al., "Correlation between Protein and mRNA Abundance in Yeast," Molecular and Cellular Biology, March 1999, pp. 1720-1730.
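The kind of correlation figure quoted above can be illustrated with a minimal sketch. The code below computes a Pearson correlation coefficient between paired mRNA and protein abundance values. The function name `pearson` and the five data points are hypothetical and chosen only for illustration; they are not data from Gygi et al., and the cited study may have used a different statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical abundances (arbitrary units) for five genes; NOT real data.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [3.0, 1.0, 5.0, 2.0, 4.0]

r = pearson(mrna, protein)
print(round(r, 2))  # 0.3 for this toy data set
```

A coefficient this far below 1 means that ranking genes by mRNA level would give a poor prediction of their ranking by protein level, which is the limitation the paragraph above describes.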
Further, post-translational modifications of proteins such as proteolytic cleavage, glycosylation, phosphorylation, prenylation, myristoylation, ubiquitination, and N- and C-terminal processing can affect protein activity and half-life. These modifications cannot be determined solely from gene sequence or expression data. Some proteins are active only when they are complexed with other molecules or proteins, or when they are at a particular subcellular location within a cell. Again, these factors cannot be determined from gene sequence or expression data. FIG. 19 is a diagram demonstrating the layers of information which may be assayed to identify the real state of a cell (outermost circle). Those who assay DNA and raw sequence data determine gene function based on sequence similarity, gene structure, and evolutionary relationships. Missing from these data is any mRNA or translational modification data. Those who assay mRNA gain a prediction of a protein profile based on the assumption that protein levels are directly proportional to mRNA levels, an assumption which is proving to be erroneous. Closest of all these methods to the real cell state is the method of the invention, which detects actual cellular protein levels by direct measurement.
The field of study of proteomics has gained increasing importance as functional genomics attempts to assign functions to the mass of information from the human genome. Proteomics includes the science and processes of analyzing and cataloging all the proteins encoded by a genome (a proteome).
Complete descriptions of proteins, including sequence, structure, and function, will substantially aid the current pharmaceutical approach to therapeutics development. The specific structural and functional aspects of a particular protein can be used to design better proteins or small molecule ligands that serve as activators or inhibitors of protein function, and thus to develop drugs. Genome sequence information alone, because of the multitude of steps between gene transcription and corresponding protein function, is often insufficient to explain disease mechanisms.
Multiple genes may be involved in a single disease process. Identifying all the genes involved in a particular disease based on DNA sequence data may be possible, but learning how these genes function in health and disease (and how therapeutic interventions can be designed for them) requires proteomics.
Disease may be caused by changes in gene expression, protein expression, or post-translational modification of proteins. Many proteins are the targets of drugs, and drug-related changes in gene expression levels are often an indirect result of the drug's interaction with the protein. Cells and their proteomes are dynamic. One genome may yield multiple proteomes as a result of changes in differentiation, stress, or disease condition. Proteomics can be used to identify serum-based biomarkers, which can be valuable as clinical markers or used as the basis of a diagnostic.
As can be seen, a need exists in the art for identifying proteins and their concomitant coding sequences that are directly or indirectly regulated and involved in differential expression patterns associated with disease states, different tissue types, or other alternative cell states.
It is an object of the present invention to provide an immediate linking of protein information to its corresponding genome sequence to provide information for diagnostic protocols, pathway elucidation, or targets for drug design.
It is another object of the present invention to identify a protein expression profile of any particular cell, whether plant, bacterial, animal, etc. in origin, and to quantify relative levels of expression of those proteins associated with a particular population.
It is yet another object of the present invention to provide a library of functional genomic data that may be used to develop human therapeutics. Most researchers involved in the field of functional genomics rely on machine-based analysis for protein structure or function to assign functions to proteins. Those approaching the task from a sequencing objective assign function by analyzing and comparing genomic data developed by comparisons between disease and normal tissues. Typically this is accomplished through the use of gene chips or direct sequencing.
It is yet another object of the invention to provide an immediate link between genomic sequence and proteomic information using molecular biology techniques. The information obtained according to the invention can support the development of new therapeutic targets. Individual variations in protein expression levels between normal and aberrant tissues will lead to the direct identification of new therapeutic targets. If an unidentified protein is either higher or lower in expression level within malignant cells compared to normal cells, it provides a probable target for further study and identifies a potential drug intervention site. Unique protein targets will be identified according to the invention.
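The comparison of expression levels between normal and malignant cells described above can be sketched as a simple fold-change screen. This is an illustrative sketch only, not the applicants' analysis procedure: the function `flag_candidates`, the threshold, the pseudocount, and the profile values are all assumptions introduced here.

```python
from math import log2


def flag_candidates(normal, malignant, min_abs_log2_fc=1.0, pseudocount=1.0):
    """Return {protein: log2 fold change} for proteins whose level differs
    between the two profiles by at least the given threshold.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a protein is undetected in one of the two cell types.
    """
    flagged = {}
    for protein in sorted(set(normal) | set(malignant)):
        n = normal.get(protein, 0.0) + pseudocount
        m = malignant.get(protein, 0.0) + pseudocount
        lfc = log2(m / n)  # positive: higher in malignant; negative: lower
        if abs(lfc) >= min_abs_log2_fc:
            flagged[protein] = lfc
    return flagged


# Hypothetical tag signal levels (arbitrary units); NOT measured data.
normal_profile = {"tagA": 99.0, "tagB": 49.0, "tagC": 7.0}
malignant_profile = {"tagA": 99.0, "tagB": 199.0, "tagC": 1.0}

candidates = flag_candidates(normal_profile, malignant_profile)
for name, lfc in candidates.items():
    print(name, lfc)
```

With these toy values, tagA is unchanged and is not flagged, while tagB (four-fold higher in the malignant profile) and tagC (four-fold lower) are both reported, matching the text's point that a protein either higher or lower in expression is a candidate for further study.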
It is yet another object of the invention to provide information about entire pathways of protein regulation, providing new target development when multiple proteins are involved in a particular state. Most cancer and other therapeutic drug development focuses on a single protein target at one time. However, complex interactions between proteins result in the malignant or disease state in almost all cells. According to the invention, applicants' method identifies protein expression levels for an entire pathway of active protein targets and can evaluate expression of multiple proteins simultaneously, providing for analysis of patterns of protein co-expression in malignant or disease state cells as compared to normal cells.
The present invention relates generally to methods and compositions for the identification of differential protein expression patterns and concomitantly the active genetic regions that are directly or indirectly involved in different cell types, tissue types, disease states, or other cellular differences desirable for diagnosis or for drug therapy targets.
According to the invention, a method for obtaining a protein profile in a cell is disclosed which uses a genetic integration polynucleotide encoding a tag protein that may be actively detected. The polynucleotide construct comprises a marker gene or tag which is introduced into the genome of an organism using any vector insertion method known in the art, developed in the future, or described herein. The marker gene is not operably linked to any promoter sequence in the construct (promoterless), and the construct thus relies upon integration within an active transcription unit within the cell for expression. The activity of the tag is then measured to sort and preferably quantify protein expression patterns for the cell. Once an expression profile is obtained, molecular biology techniques are employed to ascertain the particular genetic locus which is expressed. This information elucidates diagnostic profiles for disease or other cellular states or types, as well as potential target sites for drug intervention and alternative gene forms (SNPs). FIG. 3 depicts a general overview of the process of the invention as applied to a cancer cell versus a normal cell.
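The logic of the promoterless tag can be illustrated with a toy model. The sketch below is a deliberate simplification and not the disclosed construct or protocol: it assumes a tag produces signal only when it lands inside a transcription unit that is active in the cell, and it ignores orientation, reading frame, splicing, and protein stability. All names (`TranscriptionUnit`, `tag_signal`, `profile`) and all coordinates and activity values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionUnit:
    name: str
    start: int       # first genomic position of the unit (inclusive)
    end: int         # last genomic position of the unit (inclusive)
    activity: float  # relative expression level in this cell; 0 = silent


def tag_signal(position, units):
    """Model the signal from a promoterless tag integrated at `position`.

    Because the tag carries no promoter of its own, it is modeled as
    reporting the activity of the unit it lands in, or 0 elsewhere.
    """
    for unit in units:
        if unit.start <= position <= unit.end:
            return unit.name, unit.activity
    return None, 0.0


def profile(integration_sites, units):
    """Collect detectable tag signals, keyed by the locus that drives them."""
    result = {}
    for pos in integration_sites:
        name, signal = tag_signal(pos, units)
        if name is not None and signal > 0:
            result[name] = max(signal, result.get(name, 0.0))
    return result


# Hypothetical loci and integration sites; NOT real genomic coordinates.
units = [
    TranscriptionUnit("geneX", 100, 200, 5.0),
    TranscriptionUnit("geneY", 300, 400, 0.0),  # silent in this cell
    TranscriptionUnit("geneZ", 500, 600, 1.5),
]
sites = [150, 250, 350, 550]

expression_profile = profile(sites, units)
print(expression_profile)  # {'geneX': 5.0, 'geneZ': 1.5}
```

In this model, the integration at position 250 (between units) and the one at 350 (inside a silent unit) yield no signal, so only the loci that are actively transcribed in the cell appear in the profile. The detected signal therefore points back to the expressed locus, which is the link between tag activity and genomic sequence that the paragraph above describes.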
Polynucleotides for achieving the methods of the invention are disclosed including expression constructs, molecular biology techniques, transformed cells, vectors, and methods of design of the same which are intended to be within the scope of the invention.