The present invention relates generally to a computer system and method for correlating data, and more particularly, to correlating tagged data to information associated with the tagged data.
The identification of the sequence of bases that constitute an oligonucleotide (ODN) is commonly performed by Sanger sequencing, which is named after Dr. Fred Sanger, who introduced the technique in 1976. As currently practiced, Sanger sequencing employs radioactively labeled molecules to determine the sequence of a sample oligonucleotide. The radioactively labeled molecules are used in the enzymatic synthesis of radioactively labeled oligonucleotide fragments. The fragments have base sequences that are identical to the sequence of the sample oligonucleotide. In order to determine the sequence of the sample oligonucleotide, the radioactive fragments generated by Sanger sequencing are separated using gel electrophoresis. Gel electrophoresis creates a two-dimensional map that, upon analysis, yields information about the base sequence of the sample oligonucleotide.
Although Sanger sequencing is used in laboratories around the world, it has significant shortcomings. One shortcoming is that the two-dimensional gel electrophoresis map is processed by analyzing the radioactivity of the fragments. Radioactive materials raise health concerns for many people who work in this area. Another shortcoming is that gel electrophoresis only provides maps of limited size. In particular, it is very difficult to create a single map that provides sequence information for an oligonucleotide formed from more than about 1,000 bases. Yet another shortcoming is that it is very difficult to automate the analysis of the two-dimensional gel electrophoresis map.
A current approach, which partially overcomes these shortcomings, is the use of fluorescently labeled molecules, rather than radioactively labeled molecules, to create the fragments. As commonly practiced, four unique labels are used, one for each unique base (i.e., A, T, C, and G). This use of fluorescently labeled molecules allows column chromatography, rather than gel electrophoresis, to be used to separate the labeled fragments. Column chromatography is typically a more efficient technique than gel electrophoresis for separating fragments and is more amenable to automation. Despite the avoidance of radioactivity and the efficiency of the separation, the use of fluorescently labeled molecules in conjunction with Sanger sequencing is not widespread. The primary problem with the use of fluorescent labels with Sanger sequencing is that only a few fluorescent labels can be detected in a single assay (always fewer than 8, usually only 4), which limits the DNA sequencing throughput to one or two samples per lane or per column. Devices that use fluorescent labeling to automatically sequence fragments are currently commercially available. These devices, however, are rather expensive and still have this primary problem.
Accordingly, there is a need in the art for improved approaches to basic Sanger sequencing. Preferably, the improved approach would avoid the use of radioactively labeled molecules, be amenable to automation, utilize equipment that is commonly available in research and development laboratories, be highly accurate even for long oligonucleotide sequences, and be efficient in allowing many oligonucleotide samples to be analyzed simultaneously.
The present invention provides a method and system for correlating characteristics (e.g., type of nucleotide) of biomolecules (e.g., DNA) to molecular tags with unique molecular weights that are associated with the biomolecule. In one embodiment, the molecular tags are applied to primers used when synthesizing the biomolecule. The system initially receives a mapping of each characteristic of the biomolecules to the corresponding molecular weight of the molecular tag. The system also receives an indication of the molecular weights detected when analyzing the biomolecules to which the molecular tags have been associated. For each molecular weight detected, the system determines, based on the received mapping, the characteristic corresponding to the detected molecular weight. The system then indicates that the analyzed biomolecule has the determined characteristic.
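By way of illustration only, the correlation of detected molecular weights to characteristics may be sketched as follows. The sketch is not taken from the specification: the function name correlate, the tag weights, and the matching tolerance are hypothetical and are chosen solely to make the example concrete. In particular, the use of a tolerance when comparing a detected weight to a tag weight is an assumption about how a practical system might accommodate measurement error.

```python
from typing import Dict, List, Optional

# Hypothetical mapping of each characteristic (here, a nucleotide type)
# to the molecular weight of its tag. The weights are illustrative values,
# not weights of real tags.
TAG_WEIGHT_BY_BASE: Dict[str, float] = {
    "A": 250.0,
    "C": 275.0,
    "G": 300.0,
    "T": 325.0,
}


def correlate(
    detected_weights: List[float],
    mapping: Dict[str, float],
    tolerance: float = 0.5,  # assumed allowance for measurement error
) -> List[Optional[str]]:
    """Return, for each detected weight, the characteristic whose tag
    weight lies within the tolerance, or None if no tag weight matches."""
    results: List[Optional[str]] = []
    for weight in detected_weights:
        match: Optional[str] = None
        for characteristic, tag_weight in mapping.items():
            if abs(weight - tag_weight) <= tolerance:
                match = characteristic
                break
        results.append(match)
    return results


if __name__ == "__main__":
    # Two weights near known tags and one weight that matches no tag.
    print(correlate([250.1, 324.8, 410.0], TAG_WEIGHT_BY_BASE))
```

In this sketch a detected weight that matches no tag is reported as None rather than being assigned to the nearest tag, so that an unrecognized weight is not silently attributed to a characteristic.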