Biological sequencing is the process of determining the precise order of nucleotides or amino acid residues within a biomolecule. Such biomolecules include, for example, DNA, RNA, mRNA, proteins and other biopolymers. The rapid development of sequencing methods and instruments has significantly advanced biological and medical research and has led to an increase in medical discoveries. As a result, biological sequencing has become a critical tool for researchers and diagnosticians alike, both in the medical field (e.g. personalized medicine, fertility screening, lifestyle choices, and health/lifespan predictions) and in fields such as forensic science, virology and systems biology.
The rapid development of sequencing methods and instruments over the last three decades has also produced high-throughput sequencing technologies that have significantly improved the speed and precision with which sequences are analyzed. Researchers are therefore able to collect massive quantities of high-precision data in very short times. However, processing this data quickly and accurately requires significant computing power. Thus, multiple computers are often used to analyze the data with parallel or distributed processing, simultaneously analyzing thousands, millions or even billions of nucleotide sequences. While parallel and/or distributed processing can yield fast, precise sequencing results, the huge amount of data that must be processed and transmitted between the parallel or distributed computers can lead to inefficiencies in both the transmission of the data and the processing thereof. Similar issues exist for proteomic data generated from mass spectrometers. These data files can be on the order of hundreds of gigabytes per sample and terabytes per run. As new technologies continue to be devised to read genetic, epigenetic and proteomic data, this problem will be further compounded.
Further, the parallel and/or distributed processing computers may not always be in close proximity to each other or connected via a secure network. Some may be connected over an open network topology accessible over the World Wide Web. This open link between processing computers can result in data privacy issues. In some instances, the data being transmitted can be medically privileged, as it may be an individual's DNA, RNA, epigenetic DNA methylation phenotype, or proteomic data, or even part of the individual's medical record. This data could also be privileged because it can contain information about bloodline relatives, from which identity, whereabouts, health, age, personality traits, and disease susceptibility can be determined. In other instances, the data may be a new or novel discovery that needs to be protected to maintain a trade secret or other business advantage, or because of national security concerns. Currently, data transmitted over open networks between parallel and/or distributed processing computers can potentially be intercepted and decoded, and may not be encrypted or enciphered to a level that adequately mitigates the risks posed by hackers and by state or non-state cyber actors with interests in acquiring sensitive data on specific individuals or groups.