Biological sequencing is the process of determining the precise order of nucleotides within a biomolecule. Such biomolecules can include DNA, RNA, mRNA, protein sequences and other biopolymers. The rapid development of sequencing methods and instruments has significantly advanced biological and medical research and has led to an increase in medical discoveries. As a result, biological sequencing has become a critical tool for researchers and diagnosticians alike in the medical field (e.g., personalized medicine, fertility screening, lifestyle choices, and health/lifespan predictions). Other fields, such as national/border security, forensic science, virology, pathogen detection and systems biology, can also depend upon biological sequencing. These sequencing capabilities drive scientific and medical inquiry while also providing early warning of emerging health threats, such as pandemics or, in the future, those posed by engineered biology.
The rapid development of sequencing methods and instruments over the last three decades has also produced high-throughput sequencing technologies that have significantly improved the speed and precision with which biological sequences can be analyzed. Researchers are now able to collect massive quantities of high-precision data in very short times. However, processing this data quickly and accurately requires significant computing power. Thus, multiple computers are often utilized to analyze the data using parallel or distributed processing, simultaneously analyzing thousands, millions or even billions of nucleotide sequences. While parallel and/or distributed processing can yield fast, precise sequencing results, the huge amount of data that must be processed and transmitted between the parallel or distributed computers can lead to inefficiencies in both the transmission of the data and the processing thereof. Similar issues exist for proteomic data generated from mass spectrometers. The data files involved can be on the order of hundreds of gigabytes per sample and terabytes per run. As new technologies continue to be devised to read genetic, epigenetic and proteomic data, this problem will be further compounded.
Further, the parallel and/or distributed processing computers may not always be in close physical proximity to each other or connected via a secure network. Some may be connected over an open network topology accessible over the World Wide Web. The use of an open link between processing computers can result in data privacy issues, including hacking, data spillage, theft, or improper control of the data. In some instances, the data being transmitted can be medically privileged, as it may be an individual's DNA, RNA, epigenetic DNA methylated phenotype, or proteomic data, or even a part of the medical record. This data could also be privileged because it can contain data about other bloodline relatives and/or members of a tribal/national group, from which identity, whereabouts, health, age, personality traits, and disease susceptibility can be determined. In other instances, the data may be a new or novel discovery that would need to be protected to maintain a trade secret or other business advantage, or even because of national security concerns. National security concerns can arise because compromised genetic/epigenetic data can be exploited to develop pathogens and synthetic biology that can be harmful at the individual or group level. An individual's location, genealogy, susceptibilities and phenotypes can be tracked and/or identified with increasing ease once genetic/epigenetic data is known. Currently, the data transmitted over open networks between parallel and/or distributed processing computers can potentially be intercepted and decoded. Moreover, the transmitted data may not be encrypted or enciphered to a level adequate to mitigate the risks posed by hackers or by state or non-state cyber actors with interests in acquiring sensitive data on specific individuals, ethnic groups, tribes, nations, or racial communities.
Additionally, biological, bioinformation, personally identifiable, general personal and personal medical data can be further transmitted to local computers, such as those at a doctor's office or hospital, or to local computers in academic or governmental centers, for evaluation by clinicians and for display to patients. This data can also be used for analysis by researchers in an academic setting for the further development of biological knowledge.
Another challenge can be how the data can be “purged” from computer and transmission systems after it has been analyzed and read by a patient or medical practitioner, so as to prevent its retrieval by an unauthorized actor at a later date. Recent reports show that directed “hacks” of medical information have resulted in unauthorized access to tens of millions of confidential and private personal medical records in the United States. The repercussions and consequences of such “hacks” in a genomic setting can be neither quantified nor imagined at present. The reason is that the genomic and epigenetic information of a subject not only determines and characterizes the subject's phenotype, medical susceptibilities/strengths and known locations, but also characterizes, to a substantial percentage, the genotype of all of the subject's direct or secondary descendants. This can further jeopardize the subject's private personal information.