The present invention relates to automatic assembly of mRNA sequences from databases containing large numbers of partial cDNA sequences.
In human cells, genetic material is stored as DNA in the nucleus of the cell. When a certain protein is needed by the cell, a portion of the DNA is transcribed as mRNA, which is transported to the cytoplasm of the cell. In the cytoplasm, ribosomes create proteins, using the mRNA as a template. Generally, the mRNA comprises a long sequence of bases, each triplet (codon) of which encodes a specific amino acid. Thus, a sequence of triplets encodes a sequence of amino acids, which form a protein.
Cell function can, theoretically, be analyzed by determining the types of, and ratios between, the proteins in the cell. However, proteins are very delicate materials, which are difficult to analyze. mRNA, which controls the creation of the proteins, is easier to separate and analyze. Although several different mRNA sequences may encode similarly acting proteins, each mRNA sequence encodes only a single protein. In addition, there is usually a good correlation between the relative amount of different types of mRNA and the relative amounts of protein. It is thus possible to analyze cell function by analyzing the mRNA in a cell.
It should be noted that mRNA contains two types of information which are not evident from DNA. First, the relative concentration of the mRNA indicates the abundance of a particular protein. Second, in the process of transcribing DNA, changes, especially deletions, are made to the nucleotide sequence.
Differential analysis is used to generate standardized databases of human cellular activity by determining differences between gene expression in sick cells and healthy cells and between cells from different tissues. The result of a differential analysis between two cells is the difference in the type and expression level of mRNA sequences. In some cells, for example cancer cells, there is a higher concentration of certain proteins than in healthy cells of the same tissue. Determining these differences can help researchers determine how a cancer cell functions differently from healthy cells. Analysis of mRNA is currently being used to generate drug leads. For example, by selectively blocking those proteins which are more common in cancer cells, using designer pharmaceuticals, it may be possible to disrupt the functioning of cancer cells, without significantly affecting the functionality of regular cells. Also, when developing pharmaceuticals for bacterial, prion and viral infections, it is useful to design a pharmaceutical which selectively blocks proteins which are necessary for the life and/or reproduction of the disease agent, but which does not block proteins necessary for human cell survival.
Thus, it can easily be appreciated why pharmaceutical companies, research institutes and biotechnology companies maintain large databases of partial mRNA sequences. Such sequences, known as ESTs (Expressed Sequence Tags), often have associated information, such as the tissue type and/or disease type where the EST is expressed and/or the expression level of the EST in these situations. Some databases include complete mRNA sequences. In some cases, a genomic database can be analyzed to yield mRNA sequences, if the introns are correctly identified.
ESTs are generated using the following (greatly simplified) process: a cell is selected and disrupted; proteins and other cell structures are selectively disintegrated; mRNA sequences are isolated and converted to cDNA sequences; cDNA sequences are inserted into host cells, which can be cultured; individual host cells are disrupted; and a segment of DNA which includes the cDNA or original mRNA sequence at a known location thereof is located and read out.
Unfortunately, the art of reading mRNA sequences is not yet completely developed. The error rate of the reading increases with increasing length of the mRNA sequence. The common errors are insertion or deletion of bases, and errors in the identification of individual bases. At a certain sequence length, the error rate increases to a point where further reading is not possible. As a result, most ESTs are only 200-600 bases long, while an average mRNA sequence is typically 1000-3000 bases long.
In addition, EST databases contain many other types of errors, which may accumulate during the complicated process of EST generation, as well as features inherent in the mRNA which make the assembly difficult. These causes of difficulty include:
(a) Chimeric sequences. During the process of extracting and replicating the mRNA and cDNA, chimeric sequences may be inadvertently inserted into the nucleotide sequences. Such chimeric sequences include ribosomal RNA, junk sequences from the extraction and replication process, contamination from external sources, such as human cells, and contamination from the host cells.
(b) Intron Contamination. Introns are portions of the DNA which are not expressed in the final mRNA product and are usually removed from the mRNA during the middle of the transcription process (splicing). However, since the cell is disrupted in the middle of its normal activity, the transcription process may be incomplete or otherwise disrupted, for example by introns being incorporated in the mRNA sequences.
(c) Broken and respliced sections. During the process of extraction and replication the mRNA sequences may be broken and, in some cases, may be reconnected, not necessarily correctly. In addition, whole sections of mRNA sequences may be inadvertently removed.
(d) Alternative splicing. This is not an error in the ESTs but it is an important cause of mismatch between ESTs. The transcription of DNA to mRNA does not follow a one-to-one correspondence. Depending on various conditions in the cell, a single DNA sequence may be transcribed as several different mRNA sequences. The different transcriptions, named alternative splice variants, are usually achieved by certain segments of the DNA being selectively spliced out. Thereafter, selected portions of the mRNA, named alternative spliced regions, are selectively spliced out of the mRNA sequence. As a result, there may be two mRNA sequences which do not exactly match, even though they originate from the same DNA sequence and contain no errors.
(e) Redundancy Level. The process of extracting the ESTs includes replication of mRNA sequences and there is usually more than one copy of each mRNA in a living cell. In addition, as most databases contain ESTs extracted in many experiments, many ESTs can be expected to appear in several experiments. As a result, there is a high redundancy of ESTs in the raw database. However, due to the errors in reading out the ESTs, the ESTs will not exactly match. Also, even though there may be significant overlap between two or more ESTs, they will usually have different start and end points and different lengths. This lack of consistency makes the task of assembly more difficult.
As an end result, EST databases generally contain only short ESTs, which must then be correctly associated and assembled into the original mRNA sequences. However, due to the above-described problems, it is very difficult to correctly match up the ESTs. In general, the limiting factor in this field is information analysis, rather than information volume.
If the ESTs are correctly matched, the discovery and/or development of new pharmaceuticals is made easier and faster. For example, assuming 20 ESTs are determined by differential analysis to be found in a cancer cell rather than a healthy cell, 20 leads must be pursued to find a drug which may disrupt the cancer cell. However, if the 20 ESTs are combined to form 2 complete mRNA sequences, only 2 leads need to be pursued, reducing the volume of work by a factor of 10.
It is an object of some embodiments of the present invention to provide a method of mRNA assembly which reduces existing raw EST databases, removes errors therefrom and facilitates the creation of longer and/or complete mRNA sequences. The desired end result is a reduced database in which each mRNA sequence and/or EST encodes a different protein. At least, the ratio between the number of ESTs and the number of proteins should be reduced as much as possible. Two types of errors should preferably be avoided and/or corrected: incorrect mRNA sequences and errors of omission, where a real difference between two mRNA sequences is lost, due to the method of reducing the raw database.
It is another object of some embodiments of the present invention to provide a method of discovering hitherto unknown complete mRNA sequences and/or genes.
It is another object of some embodiments of the present invention to provide a method of modeling and discovering alternatively spliced mRNA sequences.
It is another object of some embodiments of the present invention to provide a method of EST association and/or assembly which has a lower computational complexity than existing methods and is therefore suitable for the analysis of huge databases of ESTs.
In accordance with a preferred embodiment of the present invention, a process of database reduction and/or analysis includes:
(a) correcting obvious errors in ESTs;
(b) clustering ESTs which appear to originate from the same mRNA sequence;
(c) assembling ESTs into mRNA sequences;
(d) comparing the assembled mRNA sequences to protein databases; and
(e) comparing the assembled mRNA sequences to genome databases.
The order of (a)-(e) is not fixed. For example, error correction may be performed at any stage. Further, the process is preferably iterative, with later steps affecting earlier steps.
One aspect of some embodiments of the present invention relates to using a method that directly compares a database with a database, rather than a method that compares an individual EST with a database. As a result, a more efficient analysis algorithm can be developed. In accordance with a preferred embodiment of the invention, an algorithm is provided whose complexity is near O(k(N) x N), where N is the number of ESTs and k is a slowly increasing function of N, rather than O(N^2). In huge EST databases, this difference is extremely important and may pave the way to using mRNA analysis of cells from biopsies to diagnose individuals, in a short time.
Another aspect of some embodiments of the present invention relates to a method of clustering ESTs. Rather than force a long segment of one EST to match a second segment of a second EST, only certain annotated portions of the ESTs are matched. In a preferred embodiment of the invention, short segments, preferably 9 bases long, are used for the matching. An index is generated which lists, for each 9 base sequence (n-group), all the ESTs which contain that sequence. The list associated with each indexing n-group may then be treated as an individual (smaller) database. If the component database is small enough, it may be preferred to use brute force methods to find matches within the component database. Alternatively or additionally, at least larger ones of the component databases may be reindexed using the same method. Preferably, during such reindexing additional limitations are applied, for example, that the order of appearance of the n-groups is the same in the matched ESTs or by indexing (and matching) only the n-groups which are either consecutive, 1 or 2 bases away from the indexing n-group. Typically, ESTs are clustered when they contain 4 matching n-groups.
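By way of non-limiting illustration only, the n-group indexing described above may be sketched as follows. The sketch assumes that ESTs are supplied as plain strings keyed by an identifier; the function names, the union-find grouping and the brute-force pair counting within each list are assumptions made for the example and are not taken from the description. Re-indexing of large lists and the ordering and distance limitations are not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_ngroup_index(ests, n=9):
    """Map each n-base substring (n-group) to the set of EST ids containing it."""
    index = defaultdict(set)
    for est_id, seq in ests.items():
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(est_id)
    return index


def count_shared_ngroups(index):
    """Count, for each pair of ESTs, how many distinct n-groups they share.

    Each index list is treated as a small component database and searched
    by brute force (an assumption; large lists would be re-indexed instead).
    """
    shared = defaultdict(int)
    for est_ids in index.values():
        for a, b in combinations(sorted(est_ids), 2):
            shared[(a, b)] += 1
    return shared


def cluster_ests(ests, n=9, min_shared=4):
    """Group ESTs that share at least min_shared distinct n-groups."""
    parent = {est_id: est_id for est_id in ests}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    shared = count_shared_ngroups(build_ngroup_index(ests, n))
    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for est_id in ests:
        clusters[find(est_id)].add(est_id)
    return sorted(sorted(c) for c in clusters.values())
```

In this sketch two ESTs are joined as soon as they share the required number of n-groups, and clusters are formed transitively; a practical implementation would also need to handle very common n-groups, whose lists grow large.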
It should be appreciated that the size of the indexing base sequence may be a number other than 9, although 9 appears to be suitable for raw databases of 100,000-1,000,000 ESTs of an average length of 400 bases. The length of the n-group may also be different for different iterations of the method. It should be appreciated that, in general, longer indexing sequences are more sensitive to errors in the reading of the mRNA sequences; however, they provide better matches. Further, the number of n-groups that must match is also a parameter, which may vary depending on the original database size, error rate and redundancy level. Further, the number of bases allowed between two consecutive n-groups is also a parameter, which may vary responsive to the database characteristics and the efficiency of the algorithm. In a preferred embodiment of the invention, each EST is graded as to its suitability to be included in a certain cluster. In some cases, an EST may be suitable for two clusters, especially if the two clusters are really a single cluster. In addition, externally provided data, such as the information that two ESTs are probably from the same mRNA sequence, can also affect the grade. Also, the number of detected and/or corrected errors in a particular EST and/or in the original database as a whole may affect the grading process.
Another aspect of some embodiments of the present invention relates to a method of assembly of clustered ESTs into mRNA sequences using graphs. Each unique segment of an EST is associated with a node in a directed graph. The allowed transitions between nodes are restricted based on the "transitions" found in the ESTs that comprise the cluster. In accordance with a preferred embodiment of the invention, the resulting graph is analyzed to determine errors. For example, if there is more than one end node in the graph, this may be indicative of a chimeric sequence. An end node which is too close (in number of bases) to a start node is also usually indicative of a problem. End nodes may be defined as nodes whose segments contain stop codons and/or as nodes which have no transitions thereafter. Alternative paths in the graph, in which both a direct transition and an indirect transition between two nodes are available, usually identify alternative spliced regions. In a preferred embodiment of the invention, mRNA sequences with one, two, three, four or even more alternative spliced regions are correctly identified. Thus, a large number of possible alternative spliced variants, for a single mRNA sequence, may be identified in a single tissue type. Generally, the larger the ratio between ESTs and mRNA sequences, the better the identification of alternative spliced regions (and of errors in the sequence). Further, some preferred embodiments of the invention can also identify exclusive alternative splices, where each alternative spliced variant of the mRNA sequence contains a segment that does not appear in other variants.
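By way of non-limiting illustration only, the graph described above may be sketched as follows. The sketch assumes that each EST of a cluster has already been divided into an ordered list of segment identifiers; that representation and all function names are assumptions made for the example. A pair of nodes joined by both a direct transition and an indirect path is reported as a candidate alternative spliced region.

```python
def build_segment_graph(segmented_ests):
    """Build a directed graph whose nodes are segments and whose edges are
    transitions between consecutive segments observed in at least one EST."""
    graph = {}
    for segments in segmented_ests:
        for seg in segments:
            graph.setdefault(seg, set())
        for u, v in zip(segments, segments[1:]):
            graph[u].add(v)
    return graph


def has_indirect_path(graph, start, goal):
    """True if goal is reachable from start through at least one other node."""
    stack = [node for node in graph[start] if node != goal]
    seen = set(stack) | {start}
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def alternative_spliced_candidates(graph):
    """Pairs (u, v) with both a direct transition and an indirect path."""
    return sorted(
        (u, v)
        for u in graph
        for v in graph[u]
        if has_indirect_path(graph, u, v)
    )
```

For example, ESTs segmented as A-B-C-D and A-B-D yield the candidate pair (B, D), suggesting that segment C is present in one variant and absent from the other. Exclusive alternative splices are not detected by this simplified test.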
Another aspect of some embodiments of the present invention relates to using feedback from one step of the above-described process to affect a different step. In one example, an error in the assembly step, such as the discovery of a chimeric sequence, may be used to change the clustering, by disallowing all matches based on the identified chimeric sequence. A chimeric sequence may be identified by matching the assembled mRNA sequence to a database of known contaminants. Preferably, only suspected chimeric sequences are tested by comparison to a database of contaminants, at the assembly stage. Suspicious sequences are preferably determined from the morphology of the graph. Another example is correcting errors in ESTs based on the assembly. Such corrected errors may also be propagated back to the clustering step.
Another aspect of the present invention relates to using an mRNA assembly method as a part of a diagnostic device. Such a device will receive as an input a readout of ESTs, assemble the ESTs into mRNA sequences, correct errors in the sequences and then analyze the resulting mRNA expression spectra and/or compare it to known disease templates to diagnose the disease. Such an input may, in some cases, be of relatively low quality.
Another aspect of some embodiments of the present invention is related to diagnosing diseases and cellular dysfunction based on an analysis of relative expression levels of alternative spliced variants in a single tissue type.
Another aspect of the present invention relates to DNA chip design. Correct selection of DNA sequences to place on a DNA chip is limited by the uncertainty of the relative importance and association of different ESTs. Once the ESTs are assembled into mRNA sequences, it is possible to select one or more sets of DNA segments which will be most useful for the DNA matching task. The high degree of automation possible with, and the quality of, mRNA sequence determination, in accordance with preferred embodiments of the present invention, make such an analysis for DNA chip design a reality. Such a set can also take into account alternative splicing and/or the types and distributions of different errors in the EST database. Thus, a DNA chip can be made more robust for a particular application. In one preferred embodiment of the invention, the indexing method is used to generate an index of all the short segments of nucleotides in the mRNA sequences of interest. The length of the short segments is determined based on the design constraints of the DNA chip. The number of short segments necessary to correctly identify a single mRNA sequence (or DNA sequences, in genomic applications) can be determined by the number of re-indexing steps required to isolate that sequence in a database. The utilization of a DNA chip can be maximized by selecting only mRNA sequences which can be identified using a minimal number of short DNA sequences.
There is therefore provided in accordance with a preferred embodiment of the invention, a method of obtaining an mRNA sequence having alternative spliced variants from a database of ESTs, comprising:
providing a raw database comprising a plurality of ESTs; and
assembling ones of said ESTs into mRNA sequences, wherein said assembling includes identifying alternative spliced regions.
Preferably, the method includes clustering ESTs which have matching segments, wherein said assembling comprises assembling ESTs which are clustered together.
Alternatively or additionally, the method includes correcting errors in said ESTs.
There is also provided in accordance with a preferred embodiment of the invention, an mRNA sequence determined by the above described processes. Preferably, the sequence comprises at least two alternative spliced regions. Alternatively, the sequence comprises at least three alternative spliced regions. Alternatively, the sequence comprises at least four alternative spliced regions. Alternatively or additionally, the sequence represents at least two alternative spliced variants of mRNA sequence, each variant utilizing at least one mutually exclusive alternative splice region. Alternatively, the sequence represents at least three alternative spliced variants of mRNA, each variant utilizing at least one mutually exclusive alternative splice region. Alternatively or additionally, the sequence represents at least four alternative spliced variants of mRNA, each variant utilizing at least one mutually exclusive alternative splice region. Alternatively, or additionally, the mRNA sequence is obtained from a single tissue type.
There is also provided in accordance with a preferred embodiment of the invention, a method of tissue analysis comprising:
providing a biological sample;
determining relative expression levels of different variants of mRNA sequences in the biological sample which contain alternative spliced regions, to determine a spectra of relative expression of alternative spliced variants; and
analyzing said spectra to determine disease in the sample.
Preferably, analyzing comprises comparing said spectra against predetermined spectra. Alternatively or additionally, determining relative expression levels comprises:
analyzing said sample to detect ESTs; and
assembling said ESTs into mRNA sequences having alternative spliced regions.
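By way of non-limiting illustration only, comparing a spectrum of relative expression levels against predetermined spectra may be sketched as follows. The description does not specify a comparison measure; the Euclidean distance, the variant and template names and the function names used here are assumptions made for the example.

```python
import math


def relative_spectrum(variant_counts):
    """Convert raw expression counts per variant into relative levels."""
    total = sum(variant_counts.values())
    if total == 0:
        raise ValueError("no expression detected in the sample")
    return {variant: count / total for variant, count in variant_counts.items()}


def spectrum_distance(a, b):
    """Euclidean distance between two spectra (assumed measure);
    variants missing from a spectrum are treated as zero."""
    variants = set(a) | set(b)
    return math.sqrt(sum((a.get(v, 0.0) - b.get(v, 0.0)) ** 2 for v in variants))


def closest_template(spectrum, templates):
    """Name of the predetermined spectrum nearest to the sample spectrum."""
    return min(templates, key=lambda name: spectrum_distance(spectrum, templates[name]))
```

A sample in which two variants are expressed at counts of 30 and 70 has a relative spectrum of 0.3 and 0.7, which can then be compared with each stored template. Any practical use would depend on validated templates and a validated decision threshold, neither of which is shown.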
There is also provided in accordance with a preferred embodiment of the invention a diagnostic device comprising:
an input for receiving EST expression levels; and
a spectra generator which generates a spectra of mRNA expression levels responsive to said EST input.
Preferably, the spectra generator generates a spectra of relative expression levels of different variants of mRNA sequences containing alternative spliced regions. Alternatively or additionally, the device comprises a database containing expression spectra corresponding to a plurality of disease states. Preferably, the device comprises a comparator which compares the generated spectra with spectra in the database to determine a disease state in the tissue which originated the ESTs.
There is also provided in accordance with a preferred embodiment of the invention, a method of clustering a plurality of ESTs, comprising:
indexing n-groups in the ESTs, to generate lists of ESTs which contain each particular n-group indexed; and
matching ESTs within each list to generate clusters.
Preferably, matching ESTs comprises indexing n-groups in each of said lists to generate secondary lists. Preferably the method comprises recursively applying said indexing until recursively created secondary lists include ESTs containing at least three n-group matches. Alternatively, the method comprises recursively applying said indexing until recursively created secondary lists include ESTs containing at least four n-group matches. Alternatively the method comprises recursively applying said indexing until recursively created secondary lists include ESTs containing at least five n-group matches. Alternatively or additionally, recursively applying said indexing comprises recursively indexing only n-groups which are distanced from the first indexed n-group less than a certain number of bases.
Preferably, the number of bases is less than five. Alternatively, the number of bases is less than four. Alternatively, the number of bases is less than three.
Alternatively or additionally, matching comprises correlating said ESTs using an SW (Smith-Waterman) algorithm, modified to include detection of long-gaps.
Alternatively, matching comprises correlating said ESTs using an SW (Smith-Waterman) algorithm.
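By way of non-limiting illustration only, the score computation of the standard Smith-Waterman local alignment algorithm may be sketched as follows. The scoring values and the linear gap penalty are assumptions made for the example, and the modification for detecting long gaps is not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Standard Smith-Waterman recurrence with a linear gap penalty,
    keeping only the previous row of the scoring matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                         # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,             # gap in b
                curr[j - 1] + gap,         # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Two ESTs within a list may be treated as matching when this score exceeds a chosen threshold; the threshold is a further parameter not specified here.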
In a preferred embodiment of the invention, said indexing comprises ignoring certain n-groups.
Preferably, the indexed n-groups are 9 bases long. Preferably, the n-groups are between 5 and 15 bases long.
In a preferred embodiment of the invention the clustering method includes merging clusters. Preferably, merging clusters comprises merging responsive to an assumed error distribution in said ESTs.
There is also provided in accordance with a preferred embodiment of the invention, a method of mRNA assembly from a plurality of ESTs, comprising:
determining a correspondence between segments in each EST; and
generating a directed graph in which each node represents a single segment, and each transition between two nodes represents the existence of an EST in which the two corresponding segments are consecutive.
Preferably, the method comprises clustering said ESTs into clusters of associated ESTs, wherein said determining a correspondence is performed on individual clusters of ESTs. Alternatively or additionally, the method comprises identifying alternative spliced regions from said graph based on the morphology of the graph. Alternatively or additionally, the method comprises correcting errors in said ESTs based on the morphology of said graph. Preferably, the method comprises repeating said clustering responsive to said corrected errors.
There is also provided in accordance with a preferred embodiment of the invention a method of identifying errors in mRNA sequences, comprising:
generating a graph which represents the assembly of segments of ESTs into an mRNA sequence; and
analyzing said graph to determine unusual configurations of said graph.
Preferably, said analyzing comprises identifying multiple end-nodes in said graph.
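By way of non-limiting illustration only, identifying multiple end-nodes may be sketched as follows. The sketch assumes the graph is held as a mapping from each node to the set of nodes it transitions to, and defines an end node only as a node with no outgoing transitions; the stop-codon definition is not shown. The function names are assumptions made for the example.

```python
def end_nodes(graph):
    """Nodes that have no outgoing transitions."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    return sorted(node for node in nodes if not graph.get(node))


def has_multiple_end_nodes(graph):
    """True if the graph has more than one end node, an unusual
    configuration that may merit further checking."""
    return len(end_nodes(graph)) > 1
```

A graph flagged in this way is only a candidate for further examination, for example comparison against a database of known contaminating sequences.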
There is also provided in accordance with a preferred embodiment of the invention a method of tuning a database reduction process, comprising:
applying the database reduction process, with a certain value for at least one parameter, to a sample database;
determining a reduction ratio in the database; and
reapplying said method with a new value for said at least one parameter if said reduction ratio is not achieved.
Preferably, the at least one parameter comprises the length of n-groups used in matching two ESTs.
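By way of non-limiting illustration only, the tuning loop may be sketched as follows. The sketch assumes that the reduction ratio is the number of input ESTs divided by the number of entries after reduction, and that candidate parameter values are tried in a supplied order; both are assumptions. The function reduce_by_prefix is a toy stand-in for a real reduction process and is included only so that the example can be run.

```python
def reduce_by_prefix(ests, n):
    """Toy stand-in for a reduction process: keep one EST per distinct
    n-base prefix. Not a real clustering or assembly method."""
    representatives = {}
    for est in ests:
        representatives.setdefault(est[:n], est)
    return list(representatives.values())


def tune_parameter(reduce_database, sample, candidate_values, target_ratio):
    """Apply the reduction with each candidate value in turn and return the
    first (value, ratio) whose reduction ratio reaches target_ratio,
    or (None, None) if no candidate does."""
    for value in candidate_values:
        reduced = reduce_database(sample, value)
        if not reduced:
            continue  # nothing left to compare against; try the next value
        ratio = len(sample) / len(reduced)
        if ratio >= target_ratio:
            return value, ratio
    return None, None
```

With the toy reducer, a sample of four ESTs reduced to two entries gives a reduction ratio of 2.0.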
There is also provided in accordance with a preferred embodiment of the invention a method of iterative clustering of ESTs, comprising:
clustering ESTs;
assembling clustered ESTs; and
re-clustering the ESTs responsive to errors detected in the ESTs after said clustering.
There is also provided in accordance with a preferred embodiment of the invention, a method of iterative clustering of ESTs, comprising:
deciding if two ESTs match, responsive to predetermined error probabilities of errors in said ESTs;
clustering said ESTs responsive to said match;
correcting said predetermined error probabilities, responsive to further processing of said ESTs; and
repeating said deciding and said clustering responsive to said corrected error probabilities.
There is also provided in accordance with a preferred embodiment of the invention a method of EST database processing, comprising:
analyzing said ESTs to detect errors;
further processing said ESTs to create mRNA sequences;
determining, responsive to said further processing, corrections for said errors; and
correcting said errors.
Preferably, said further processing comprises assembling said ESTs into mRNA sequences.
There is also provided in accordance with a preferred embodiment of the invention, a method of designing a DNA chip based on an EST set determined by differential analysis of two biological samples, comprising:
reducing said EST set to a set of mRNA sequences;
analyzing said set of mRNA sequences to determine short mRNA sequences which maximally differentiate said mRNA sequences from mRNA sequences found in both biological samples; and
designing a DNA chip which detects said short mRNA sequences.
There is also provided in accordance with a preferred embodiment of the invention, a method of designing a DNA chip to detect relative expression levels of different variants of mRNA sequences having alternative spliced regions, comprising:
reducing an EST database to determine an mRNA sequence having alternative spliced regions;
enumerating short DNA sequences which are only included in the alternative spliced regions of said different variants; and
designing a DNA chip which detects said short DNA sequences.
There is further provided in accordance with a preferred embodiment of the invention a DNA chip constructed based on the above design methods.
There is also provided in accordance with a preferred embodiment of the invention, an mRNA sequence comprising at least two alternative spliced variants, for a single tissue type. Preferably, the sequence comprises at least three alternative variants. Preferably, the sequence comprises at least four alternative variants.
There is also provided in accordance with a preferred embodiment of the invention, an mRNA sequence comprising at least three alternative spliced regions. Preferably, the sequence comprises at least four alternative spliced regions. Preferably, the sequence comprises at least five alternative spliced regions. Alternatively or additionally, the mRNA sequence comprises different variants including mutually exclusive regions.
There is also provided in accordance with a preferred embodiment of the invention, a method of designing a DNA chip, comprising:
indexing an mRNA database to determine the indexing of short DNA sequences in the mRNA database, which short DNA sequences are of a length suitable for detection by a DNA chip;
determining from said indexing a set of short DNA sequences which uniquely identify a desired mRNA sequence; and
designing a DNA chip which detects said set of short DNA sequences.
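By way of non-limiting illustration only, determining short sequences which uniquely identify a desired mRNA sequence may be sketched as follows. The sketch assumes mRNA sequences are supplied as plain strings keyed by name and that the short-sequence length k is fixed by the chip design; the function names are assumptions made for the example. Reverse complements, hybridization properties and read errors are not considered.

```python
from collections import defaultdict


def index_short_sequences(mrnas, k):
    """Map each k-base subsequence to the set of mRNA names containing it."""
    index = defaultdict(set)
    for name, seq in mrnas.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def unique_short_sequences(mrnas, target, k):
    """Short sequences that occur in the target mRNA and in no other."""
    index = index_short_sequences(mrnas, k)
    return sorted(kmer for kmer, names in index.items() if names == {target})
```

Any one of the returned short sequences distinguishes the target from the other sequences in the supplied set; where none exists for a given k, a combination of several short sequences would be needed, which this sketch does not compute.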
There is also provided in accordance with a preferred embodiment of the invention an mRNA sequence substantially as described and shown in mRNA transcripts included in the instant application.
There is also provided in accordance with a preferred embodiment of the invention an mRNA sequence having alternative spliced variants, substantially as described and shown in the instant application.