The present disclosure relates generally to the field of data related to biological samples, such as sequence data. More particularly, the disclosure relates to techniques for analyzing and/or storing data generated by a sequencing device in a cloud computing environment.
Genetic sequencing has become an increasingly important area of genetic research, promising future uses in diagnostic and other applications. In general, genetic sequencing involves determining the order of nucleotides for a nucleic acid such as a fragment of RNA or DNA. Relatively short sequences are typically analyzed, and the resulting sequence information may be used in various bioinformatics methods to logically fit fragments together to reliably determine the sequence of much more extensive lengths of genetic material from which the fragments were derived. Automated, computer-based examinations of characteristic fragments have been developed and have been used more recently in genome mapping, identification of genes and their function, and so forth. However, existing techniques are highly time-intensive, and resulting genomic information is accordingly extremely costly.
A number of alternative sequencing techniques are presently under investigation and development. In several techniques, single nucleotides or strands of nucleotides (oligonucleotides) are typically introduced and permitted or encouraged to bind to the template of genetic material to be sequenced. Sequence information may then be gathered by imaging the sites. In certain current techniques, for example, each nucleotide type is tagged with a fluorescent tag or dye that permits the nucleotide attached at a particular site to be determined by analysis of image data. Although such techniques show promise for significantly improving throughput and reducing the cost of sequencing, further progress in speed, reliability, and efficiency of data handling is needed.
For example, in certain sequencing approaches that use image data to evaluate individual sites, large volumes of image data may be produced during sequential cycles of sequencing. In systems relying upon sequencing by synthesis (SBS), for example, dozens of cycles may be employed for sequentially attaching nucleotides to individual sites. Images formed at each step result in a vast quantity of digital data representative of pixels in high-resolution images. These images are analyzed to determine what nucleotides have been added to each site at each cycle of the process. Other images may be employed to verify de-blocking and similar steps in the operations.
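By way of illustration only, the per-cycle analysis described above may be sketched as follows. This is a simplified sketch under stated assumptions, not the method of any particular sequencing system: it assumes that image analysis has already reduced each site to one intensity value per nucleotide type for each cycle, and the names `CHANNELS`, `call_base`, and `call_read`, as well as the "purity" score, are hypothetical and introduced here solely for the example.

```python
from typing import Dict, List, Tuple

# Assumption: one fluorescence channel per nucleotide type.
CHANNELS = ("A", "C", "G", "T")


def call_base(intensities: Dict[str, float]) -> Tuple[str, float]:
    """Return (base, purity) for one site in one cycle.

    The base is the channel with the highest intensity. The purity is that
    intensity divided by the total over all channels; it is a hypothetical
    stand-in for a real quality metric.
    """
    total = sum(intensities[ch] for ch in CHANNELS)
    if total <= 0:
        # No usable signal at this site for this cycle.
        return "N", 0.0
    base = max(CHANNELS, key=lambda ch: intensities[ch])
    return base, intensities[base] / total


def call_read(cycles: List[Dict[str, float]]) -> Tuple[str, List[float]]:
    """Call one base per cycle for a single site and collect the purities."""
    calls = [call_base(cycle) for cycle in cycles]
    sequence = "".join(base for base, _ in calls)
    purities = [purity for _, purity in calls]
    return sequence, purities


# Illustrative (made-up) intensities for one site over three cycles.
example_cycles = [
    {"A": 90, "C": 5, "G": 3, "T": 2},
    {"A": 4, "C": 6, "G": 80, "T": 10},
    {"A": 0, "C": 0, "G": 0, "T": 0},
]
sequence, purities = call_read(example_cycles)
```

In this sketch, each cycle contributes a single character and a single score per site, which is far smaller than the pixel data from which the intensities would be derived.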
The image data is important for determining the proper sequence data for each individual site. While the image data may be discarded once the individual nucleotides in a sequence are identified, certain information about the images, such as information related to image or fluorescence quality, may be maintained to allow researchers to confirm base identification or calling. The image quality data, in combination with the base identities for the individual fragments that make up a genome, will become unwieldy as systems become capable of more rapid and large-scale sequencing. There is a need, therefore, for improved techniques in the management of such data during and after the sequencing process.