This disclosure relates in general to data management and analysis. Embodiments of this disclosure relate to management and analysis of single molecule data, i.e., data on individual macromolecules such as nucleic acid and protein molecules. Further, embodiments of this disclosure provide computer database systems for storing, processing, displaying, and analyzing single molecule data such as data from optical mapping of single molecules. Hence, embodiments described herein enable comprehensive data analysis in genomics and proteomics involving a wide array of differently formatted single molecule data.
The availability of whole genome sequences of an increasing number of species brings genomics and proteomics to the forefront of the modern biomedical sciences and marks a new era of research and development in the healthcare, food, and cosmetic industries, among others. While promising unprecedented potential in understanding the genetic makeup of different species and the mechanisms of life, the massive amount of sequence data poses a significant challenge of data management, analysis, and knowledge extraction to biomedical researchers. One notable factor contributing to this challenge is the diversity of data formats. These data are derived from a variety of technology platforms, including, e.g., automated sequencing, full length DNA cloning, in-situ hybridization of nucleotide and peptide molecules, nucleic acid and protein arrays, and microfluidics. Even with the same platform, different device models and/or different protocols operated in various laboratories or institutions often produce data with different formats and different resolution or sensitivity, which calls for different annotative and interpretative approaches for further processing. Effective methods and systems are therefore needed to manage these different kinds of data and to enable discovery of interrelations among them, such that useful knowledge can be extracted on the individual and collective functions of the proteins and nucleic acids of interest. Ultimately, such systems and methods are required for the realization of the promises held by the genomics and proteomics technologies.
In recent years, methodologies and instruments have been developed that permit study of individual macromolecules, i.e., DNA, RNA, and proteins. Single molecule optical mapping is one such effective approach for close and direct analysis of single molecules. See U.S. Pat. No. 6,294,136, the disclosure of which is fully incorporated herein by reference. Any data generated from such studies, e.g., by manipulating and observing single molecules, constitute single molecule data. The single molecule data thus comprise, among other things, single molecule images, physical characteristics such as lengths and shapes of single molecules, sequences of the single molecules, and restriction maps of single molecules. Such single molecule data complement genomics and proteomics data generated from the other technology platforms and provide new insights into the structure and functionalities of genomes and their constitutive functional units.
The usefulness of the single molecule data is accompanied by the heightened challenge of its management and analysis. This is due, in part, to the image processing and restriction map construction that single molecule images require. For example, typically in optical mapping, visible gap sites on a single DNA molecule may be recorded, and the mass of each DNA fragment defined by these gaps is determined by integrated fluorescence measurements and length measurements through image analysis; subsequently, a restriction map may be derived. Such a restriction map provides a road map for understanding the structure and function of the DNA of interest and may be used to compare with and validate the results of physical mapping, among other things.
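By way of illustration only, the derivation of a restriction map from per-fragment measurements may be sketched as follows. The sketch assumes that the total size of the molecule is known and that each fragment's size is proportional to its share of the integrated fluorescence and of the measured length; averaging the two estimates is an illustrative choice and is not taken from this disclosure. The names `Fragment`, `fragment_sizes_kb`, and `gap_positions_kb` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """One fragment between adjacent gap sites (hypothetical record layout)."""
    fluorescence: float  # integrated fluorescence intensity, arbitrary units
    length_um: float     # apparent length from image analysis, micrometres


def fragment_sizes_kb(fragments: List[Fragment], total_kb: float) -> List[float]:
    """Estimate fragment sizes in kb, ordered as they appear along the molecule.

    Assumption: size is proportional to the fragment's share of total
    fluorescence and of total length; the two estimates are averaged.
    """
    total_f = sum(f.fluorescence for f in fragments)
    total_l = sum(f.length_um for f in fragments)
    if total_f <= 0 or total_l <= 0:
        raise ValueError("fragments must have positive total fluorescence and length")
    sizes = []
    for f in fragments:
        by_fluorescence = total_kb * f.fluorescence / total_f
        by_length = total_kb * f.length_um / total_l
        sizes.append((by_fluorescence + by_length) / 2.0)
    return sizes


def gap_positions_kb(sizes: List[float]) -> List[float]:
    """Gap-site positions: cumulative fragment sizes, excluding the molecule end."""
    positions = []
    position = 0.0
    for size in sizes[:-1]:
        position += size
        positions.append(position)
    return positions


# Usage with made-up measurements for a molecule assumed to be 100 kb long.
fragments = [Fragment(100.0, 1.0), Fragment(300.0, 3.0), Fragment(600.0, 6.0)]
sizes = fragment_sizes_kb(fragments, total_kb=100.0)
gaps = gap_positions_kb(sizes)
```

In this sketch the ordered list of fragment sizes, together with the gap positions computed from it, stands in for the restriction map. A practical system would additionally have to handle measurement noise, missed or spurious gaps, and the combination of maps from many molecules.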
There is therefore a need for systems capable of storing, processing, and analyzing image data of single molecules, which systems at the same time are capable of processing and analyzing other types of single molecule data, as well as other kinds of biomedical data. Such systems should support comprehensive genomic and proteomic data analysis across technology platforms, allowing image data to be correlated with non-image data. Robustness and flexibility in handling diverse data formats are desired, as is fast user response.