Deoxyribonucleic acid (“DNA”) sequencing is the process of determining the order of the nucleotide bases adenine (A), guanine (G), cytosine (C), and thymine (T) in a DNA molecule. Knowledge of DNA sequences is invaluable in basic biological research and in numerous applied fields, including but not limited to medicine, health, agriculture, livestock, population genetics, social networking, biotechnology, forensic science, security, and other areas of biology and the life sciences.
Sequencing has been done since the 1970s, when academic researchers began using laborious methods based on two-dimensional chromatography. Because of these early difficulties, cost and speed could be measured in scientist-years per nucleotide base as researchers set out to sequence the first restriction endonuclease site, which contained just a handful of bases.
Thirty years later, the entire 3.2 billion bases of the human genome had been sequenced, with the first complete draft produced at a cost of about three billion dollars. Since then, sequencing costs have decreased rapidly. Many now expect that sequencing a human genome will cost hundreds of dollars or less in the near future, with results available in minutes, much like a routine blood test.
As the cost of sequencing the human genome continues to decrease, the number of individuals having their DNA sequenced for medical and other purposes will likely increase significantly. Currently, the nucleotide base sequence data collected from DNA sequencing operations are stored in multiple different formats in a number of different databases. Such databases also contain scientific information related to the DNA sequence data, including, for example, information concerning single nucleotide polymorphisms (SNPs), gene expression, and copy number variations. Transcriptomic and proteomic data are likewise spread across multiple formats and multiple databases. This fragmentation makes it impractical to exchange and process DNA sequence data and related information collected at different locations, which hampers scientific discovery and advancement.
Bioinformatic processing of DNA sequence data currently involves aligning lengthy strings of such sequence data and comparing them so as to identify sequence similarities. Although this process has been able to accommodate the processing of limited quantities of DNA sequence data, it is believed to be inadequate to handle the massive amounts of DNA sequence data expected to be generated in coming years using next-generation DNA sequencing machines. For example, processing of hundreds or thousands of complete human genome sequences using conventional approaches would not be practical in view of the enormous computational resources required by such approaches.
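To make the scaling concern concrete, the following is a minimal sketch of one conventional string-alignment approach, a Needleman-Wunsch style global alignment score computed by dynamic programming. It is an illustration only, not the method of any particular tool: the function name `global_alignment_score` and the scoring values (match, mismatch, gap) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Illustrative scoring values (assumptions): +1 match, -1 mismatch, -2 gap.
    Only two rows of the dynamic-programming table are kept in memory, but
    every one of the len(a) * len(b) cells is still computed.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a prefix of a against an empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or introduce a gap in one of the two sequences.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

The work grows with the product of the two sequence lengths. Applied naively to two sequences of 3.2 billion bases each, this kind of full comparison would require on the order of 10^19 cell updates, which illustrates why comparing hundreds or thousands of complete genomes this way is not practical.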