The present invention relates to a method for analyzing expression frequencies of genes. More precisely, the present invention relates to a method that enables analysis of the types and amounts of mRNA expressed from all genes coding for proteins in a cell, even with a small amount of biosample, in order to elucidate dynamic changes in gene expression.
The total number of protein-coding genes on the genome is expected to be about 100,000 for human. For yeast, whose total genomic structure has already been elucidated, the number of protein-coding genes is estimated to be about 5000.
In recent years, public gene databanks have been established mainly in Europe, the United States and Japan. An enormous amount of gene information from all over the world has been registered at these databanks, and further information is added every day. The human genome project is currently being pursued on a worldwide scale with the aim of elucidating all the genes of the human genome by the year 2005, and the gene information obtained in that project is also being registered at the databanks. By querying such a databank with a certain gene sequence, one can learn whether any gene having the same sequence as, or a sequence analogous to, the query has already been registered and, if so, obtain information concerning the sequence, including the designation and function of the gene, related references and so forth. Such a search is called a homology search. Several kinds of software are available for performing homology searches. However, when a large number of samples must be searched, BLAST is usually used because its search time is short.
Usually, not all of the genes contained in a cell are transcribed into mRNA to produce proteins, and it is estimated for human that about 15000 genes are expressed in one cell. Thus, in a cell, many kinds of genomic genes are expressed, and a corresponding number of types of messenger RNA (henceforth referred to as “mRNA”) are produced. However, the types and amounts of expressed genes (also referred to as “genetic expression frequency information” hereinafter) may vary depending on the types and conditions of cells. For example, when a blood stem cell differentiates into a lymphocyte precursor cell, pre-B cell, B cell, and then activated B cell, each cell shows entirely different gene expression, although there are also genes that are commonly expressed in them.
Measurement of such genetic expression frequency information as described above is called genetic expression profile analysis. The substances responsible for cellular life activities are mainly proteins, and it is therefore important in genetic expression analysis to analyze the types and amounts of proteins translated from mRNAs. However, it is at present technically difficult to obtain profiles for the total proteins. On the other hand, measurement of the total types of mRNA has already become possible.
The first method reported for genetic expression profile analysis is the Body Map method (Gene, 174, 151-158 (1996)). The outline of the Body Map method is as follows. A poly(T) sequence on a vector is annealed to the poly(A) tail at the 3′ end of each mRNA, and a cDNA is synthesized by using the vector poly(T) sequence as a primer. Further, the cDNA is digested with the restriction enzyme MboI. Since an MboI site exists once in every 300 base pairs of cDNA on average, the cDNA on the vector is digested into a length of 300 base pairs on average. At this time, the cDNA fragment nearest the poly(A) tail remains ligated to the vector. Each vector having such a cDNA fragment is cyclized and introduced into Escherichia coli to prepare a cDNA library. About 1000 clones are arbitrarily selected from the library, and the nucleotide sequence of each clone, 300 base pairs on average, is determined. The clones are divided into groups of clones having the same sequence, and the type and occurrence frequency of each sequence are calculated to obtain a genetic expression profile. A homology search (BLAST search) is performed in a databank for each cDNA sequence, and clones containing genes having the same sequence as known genes are given the names of those genes. When the sequence is not registered at the databank, it is considered that no gene corresponding to the sequence exists.
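The grouping and counting step of the Body Map outline above can be illustrated with a minimal sketch. The function name `expression_profile` and the toy clone sequences are hypothetical and are not taken from the cited reference; real clone reads would be about 300 base pairs long and would need tolerance for sequencing errors, which this sketch omits.

```python
from collections import Counter

def expression_profile(clone_sequences):
    """Group identical clone sequences and return (sequence, count, fraction) rows.

    Rows are ordered from the most to the least frequent sequence.
    """
    counts = Counter(clone_sequences)
    total = sum(counts.values())
    return [(seq, n, n / total) for seq, n in counts.most_common()]

# Toy input: five hypothetical clone reads (shortened for readability).
clones = ["GATCAAGT", "GATCTTGA", "GATCAAGT", "GATCCCGA", "GATCAAGT"]
profile = expression_profile(clones)

for seq, n, frac in profile:
    print(seq, n, f"{frac:.2f}")
```

Each row pairs a sequence type with its occurrence frequency, which is the information the profile is meant to capture; assigning gene names by homology search would be a separate step.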
In order to perform a homology search by BLAST, information for at least 11 base pairs is required. There are about 1,000,000 possible sequences consisting of 10 nucleotides, and this number is far beyond the number of gene types expected to exist in human, i.e., 100,000. That is, if there is information for 11 base pairs, a gene having the sequence can be identified and thus the genetic expression profile analysis is possible. Body Map requires much sequencing, so its efficiency could be increased as follows: the cDNA fragments of about 300 base pairs used in Body Map are further shortened into fragments of 11 base pairs or more (called “tags”), many of these fragments are ligated and inserted into a vector to prepare a library of ligated tags, about 1000 clones are arbitrarily selected as in Body Map, and the DNA sequences of the ligated tags are determined. It is expected that more genetic expression information can then be obtained with the same labor compared with Body Map. Each tag represents a gene sequence, and the occurrence frequency of the tag indicates the expression frequency of the gene. Since the length of DNA sequence that can be determined in a single sequencing run is usually about 600 base pairs, the DNA sequences of about 50 tags at most can be determined in a single sequencing run. That is, it becomes possible to perform the genetic expression profile analysis with efficiency about 50 times higher than that of the Body Map method.
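The arithmetic behind these figures can be checked with a short sketch. The constant names are illustrative; the values (11 base pairs per tag, about 600 base pairs per sequencing run, 100,000 expected human genes) are those stated above.

```python
TAG_LENGTH_BP = 11               # minimum tag length stated in the text
READ_LENGTH_BP = 600             # approximate length of one sequencing run
EXPECTED_HUMAN_GENES = 100_000   # gene count expected in the text

# Four possible nucleotides at each position.
distinct_10mers = 4 ** 10        # 1,048,576, i.e. about 1,000,000
distinct_11mers = 4 ** 11        # 4,194,304

# Upper bound on tags per run, ignoring any sequence between tags.
tags_per_read = READ_LENGTH_BP // TAG_LENGTH_BP

print(distinct_10mers, distinct_11mers, tags_per_read)
print(distinct_10mers / EXPECTED_HUMAN_GENES)  # sequence types per expected gene
```

With 11-base-pair tags, at most 54 tags fit into 600 base pairs, consistent with the figure of about 50 tags per run once any sequence joining the tags is allowed for.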
As a method for genetic expression profile analysis based on the aforementioned concept, there is the method of serial analysis of gene expression (SAGE, U.S. Pat. Nos. 5,695,937 and 5,866,330, European Patent Publication No. 0761822 A). In SAGE, cDNA is produced by using as a primer a poly(T) whose 3′ end is bonded with biotin, the cDNA is digested with a restriction enzyme such as MboI (called an “anchoring enzyme”) as in Body Map, cDNA fragments containing the 3′ end to which biotin is bonded are adsorbed on avidin beads, the beads are divided into two portions, and one of two kinds of linkers (A or B) is ligated to the cDNA fragments (about 13 bp) adsorbed on each of the two portions of the beads. Each linker contains a site for a Class II restriction enzyme such as BsmFI (called a “tagging enzyme”). Each cDNA fragment is excised from the beads with the tagging enzyme, the excised end is blunt-ended, and the tags ligated to the linker A and the linker B are connected. The product of the connection is called a “ditag”. The ditag is amplified by PCR using primers that recognize the linker A and the linker B. A large number of amplified ditags are ligated, inserted into a vector, and sequenced. About 50 tag sequences can be obtained in a single sequencing run. Genetic expression frequencies are calculated from this tag sequence information.
Further, as other methods for analyzing expression frequencies of genes, there are the gene chip method and the gene microarray method. Both methods use gene fragments fixed in an array on a suitable plate (usually a glass slide) at an extremely high density (about 10 fragments/mm² or more). The gene fragments on this chip are hybridized with fluorescence-labeled mRNAs to determine the types and amounts of mRNAs.
As described above, several methods have been developed for analyzing expression frequencies of genes, and fair results have been obtained. Currently, the SAGE method is the most effective means for measuring expression frequencies of the total genes of all eukaryotic organisms. However, when this method was actually practiced, it encountered many problems, and the SAGE method could not be reproduced in most research facilities. That is, the techniques required for the SAGE method are complicated and they can be performed only by specially trained persons. Further, about 1 μg of mRNA is required for the measurement, and therefore it is substantially impossible to perform the measurement with a sample that can be obtained only in a small amount, for example, a clinical biopsy material, or to measure differences of genetic expression in micro tissue portions. Furthermore, the method theoretically causes considerable measurement errors.
In the SAGE method, it is extremely important to accurately determine the sequence of the tag. This is because the tag is short (about 13 bp) and therefore, if even a single nucleotide is erroneously determined, identical sequences may be judged to be different, or different sequences may be judged to be the same. However, such an erroneous determination is likely to occur in the SAGE method. This is because, in the SAGE method, two tags are connected to form a ditag and the border between the tags becomes indefinite. The tag is a short gene fragment excised by a restriction enzyme such as BsmFI or FokI. However, the cleavage sites of these enzymes are not always stable, and the lengths of the excised tags vary. Thus, if tags of different lengths are intermingled when they are connected to form ditags, it becomes indefinite from which tag the nucleotides at the ligation site have been derived. As a result, it becomes impossible to obtain a correct sequence of the tag. Thus, the SAGE method suffers from a theoretically inevitable drawback. Furthermore, the SAGE method involves an operation of collecting DNA using avidin and biotin beads. However, it is in fact extremely difficult to collect DNA by using avidin and biotin beads without inviting contamination, and it is extremely difficult to obtain correct data by following the protocol as written. Moreover, the SAGE method requires a large amount of mRNA in order to obtain data. Therefore, for a sample whose amount is limited, for example, a clinical sample, a sufficient amount of mRNA cannot be obtained and thus the SAGE method cannot be performed.
Furthermore, in the gene chip method and the gene microarray method, the measurement can be performed only for genes whose structure is known, unlike the Body Map method or the SAGE method. Therefore, at present, expression frequencies of the total genes of all organisms cannot be measured by these methods.
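The border ambiguity described above can be illustrated with a small sketch. It assumes, purely for illustration, that each tag may be 12, 13 or 14 nucleotides long, and it ignores the fact that the second tag of a real ditag is read in the opposite orientation. The function name `possible_splits` and the example sequence are hypothetical.

```python
def possible_splits(ditag, tag_lengths=(12, 13, 14)):
    """List every (left_tag, right_tag) pair consistent with a ditag.

    A split is accepted when both resulting tags have an allowed length.
    The allowed lengths are an assumption made for this illustration.
    """
    splits = []
    for left_len in tag_lengths:
        right_len = len(ditag) - left_len
        if right_len in tag_lengths:
            splits.append((ditag[:left_len], ditag[left_len:]))
    return splits

# A hypothetical 26-nucleotide ditag.
ditag = "ACGTACGTACGTA" + "TTGCATTGCATTG"
for left, right in possible_splits(ditag):
    print(len(left), left, len(right), right)
```

Under these assumptions, a 26-nucleotide ditag admits three splits (12+14, 13+13 and 14+12), so the nucleotides at the junction cannot be assigned to either tag with certainty.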
Currently, the SAGE method is the most effective means for measuring expression frequencies of the total genes of all eukaryotic organisms. However, when this method was actually practiced, it encountered many problems, and the SAGE method could not be reproduced in most research facilities. The drawbacks of the SAGE method are as follows: (1) the techniques required for the method are complicated and they can be performed only by specially trained persons; (2) about 1 μg of mRNA is required for the measurement, and therefore it is substantially impossible to perform the measurement with a sample that can be obtained only in a small amount, for example, a clinical biopsy material, and it is similarly impossible to measure differences of genetic expression in micro tissue portions; and (3) the method theoretically causes considerable measurement errors because a ditag is measured.
The present invention was accomplished in view of the aforementioned current situation, and an object thereof is to provide a method that can be easily performed by ordinary researchers and enables accurate genetic expression frequency analysis with a micro amount of a specimen.
The inventors of the present invention assiduously studied in order to achieve the aforementioned object. As a result, they found that expression frequencies of genes could be efficiently analyzed with high precision by synthesizing each cDNA from each mRNA using a vector primer having a poly(T) sequence, converting each cDNA sequence to a tag on the vector, forming a concatemer by ligating the obtained tags via a sequence that enables recognition of the ends of the tags, and analyzing the nucleotide sequence of the concatemer, and thus accomplished the method of the present invention, designated “MAGE (Micro-analysis of Gene Expression)”.
That is, the present invention provides the following.
(1) A method for analyzing expression frequencies of genes, which comprises the following steps:
(a) a step of forming a vector primer to which each cDNA is ligated, by annealing the vector primer with each mRNA derived from a cell whose gene expression frequencies are to be analyzed, and synthesizing the cDNA, said vector primer comprising a linear plasmid vector having a single-stranded poly(T) sequence at one 3′ end, a recognition sequence for a first restriction enzyme in an inner position from the poly(T) sequence, a recognition sequence for a second restriction enzyme near the other end, and a recognition sequence for a type IIS restriction enzyme in an inner position from the recognition sequence for the second restriction enzyme,
(b) a step of digesting the vector primer to which the cDNA is ligated, with the second restriction enzyme and a third restriction enzyme that does not digest the vector primer and forms a digested end of the same shape as a digested end obtained with the second restriction enzyme, to excise an upstream region of the cDNA, and cyclizing the vector primer,
(c) a step of digesting the cyclized vector primer with the first restriction enzyme and the type IIS restriction enzyme to excise a downstream region of the cDNA so that a tag consisting of a part of the cDNA is left, and cyclizing the vector primer again,
(d) a step of performing polymerase chain reaction (PCR) by using the vector primer as a template and oligonucleotides having nucleotide sequences corresponding to the respective flanking regions on both sides of the tag contained in the vector primer as primers to amplify the tag,
(e) a step of ligating the amplification products to form a concatemer of the tags, and
(f) a step of determining the nucleotide sequence of the concatemer and investigating types and frequencies of tags occurring in the nucleotide sequence.
(2) The method according to (1), wherein the ligation reaction in the step (e) is performed in the presence of an adaptor having one end of the same shape as an end of the tag to arrange the adaptor at each end of the concatemer, and the concatemer is amplified by performing PCR using an oligonucleotide having a sequence corresponding to the sequence of the adaptor as a primer.
(3) The method according to (1) or (2), wherein, after the step (e), the concatemer is cloned in a cloning vector for nucleotide sequencing, and then the nucleotide sequence of the concatemer is determined.
(4) The method according to any one of (1) to (3), wherein the recognition sequence for the third restriction enzyme consists of four nucleotides.
(5) The method according to any one of (1) to (4), wherein the vector primer has a recognition sequence for a fourth restriction enzyme of which digestion point is in the same position as or an inner position from the digestion point of the recognition sequence for the second restriction enzyme, which is not excised from the vector primer by the digestion with the type IIS restriction enzyme;
the primer for the downstream side of the tag among the primers used in the step (d) has a recognition sequence for a fifth restriction enzyme that forms an end of the same shape as the end digested with the fourth restriction enzyme; and
the concatemer is formed after the amplification products are digested with the fourth restriction enzyme and the fifth restriction enzyme.
(6) The method according to (5), wherein the vector primer has a nucleotide sequence different from the recognition sequence for the fifth restriction enzyme by one nucleotide in an inner position from the recognition sequence for the first restriction enzyme, and the nucleotide sequence different by one nucleotide is converted to the recognition sequence for the fifth restriction enzyme by PCR using the primer for the downstream side of the tag.
(7) The method according to (6), wherein the third, fourth and fifth restriction enzymes are identical to one another.
(8) The method according to any one of (1) to (7), wherein the vector primer is formed by ligating a linear plasmid obtained by digesting a plasmid having a multicloning site at two sites in the multicloning site, and a partially double-stranded DNA having an end of the same shape as one end of the linear plasmid and a single-stranded poly(T) sequence.
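The counting in step (f) can be illustrated with a minimal sketch. It assumes, for illustration only, that the tags in the sequenced concatemer are joined by a known separator sequence (here GATC, the MboI recognition sequence) and that no tag contains that separator internally. The function name `count_tags` and the example tags are hypothetical and do not represent the actual sequences used in the method.

```python
from collections import Counter

def count_tags(concatemer, separator="GATC"):
    """Split a concatemer sequence at the separator and count each tag.

    Assumes (hypothetically) that the separator never occurs inside a tag.
    Empty pieces produced by leading or trailing separators are discarded.
    """
    tags = [piece for piece in concatemer.split(separator) if piece]
    return Counter(tags)

# Hypothetical concatemer built from three 11-nucleotide tags.
example_tags = ["AACCTTGGAAC", "TTGGCCAATTG", "AACCTTGGAAC"]
concatemer = "GATC" + "GATC".join(example_tags) + "GATC"

tag_counts = count_tags(concatemer)
for tag, n in tag_counts.most_common():
    print(tag, n)
```

Because each tag is delimited by a recognizable sequence, every tag is recovered with unambiguous ends, and its count gives the occurrence frequency used to estimate the expression frequency of the corresponding gene.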