Bioinformatics analyses are complex, multi-stage analyses (or, alternatively, workflows) comprising multiple software applications. Most of the applications specific to high throughput (“HT”) sequencing data have been developed in the last decade, and these are often incredibly sophisticated and intricate in their statistical and algorithmic approaches. For example, over 11,600 tools, mostly open-source, have been developed for the analysis of genomic, transcriptomic, proteomic, and metabolomic data.
One problem plaguing bioinformatics analyses is the handling of intermediate data, including temporary and log files. Such files contribute to the massive expansion of data during processing, and are often stored in obscure folders within each bioinformatics application. For example, high throughput sequencing (“HTS”) experiments generate massive raw data files known as FASTQ files, which are text-based files containing nucleotide sequences and quality score information. These files are usually considered the ‘raw’ data. To generate useful knowledge, the raw data needs to be trimmed and cleaned, and then subjected to secondary analysis, usually including alignment to a reference genome, de novo assembly, or k-mer counting. Such analyses generate equally massive secondary and intermediate files describing the alignment, assembly, or quantification of the raw data. In turn, these derived files may be sorted, filtered, annotated, or analyzed in any number of ways that generate even more data. All of this data (whether stored as files, objects, or elements in a database) amounts to a massive expansion (in some cases up to a five-fold expansion) in the number and total footprint of the initial data, resulting in significant storage management challenges.
As another example, the amount of data generated by deep sequencing technologies is growing at an exponential rate. According to published estimates, over 40 exabytes (“EB”) of raw genomic data are expected to be generated annually from genomic (e.g., DNA) sequencers by 2025. Taking into account the 3-5 times data expansion, this portends that 120 EB or more of data could be generated annually from these deep sequencing technologies.
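The projection above is simple arithmetic, and a short sketch makes the implied range explicit. The 40 EB figure and the 3-5 times expansion factor come from the text; the function name and its parameters are illustrative assumptions only.

```python
def projected_footprint_eb(raw_eb, expansion_low, expansion_high):
    """Return the (low, high) total data footprint, in EB, after workflow expansion."""
    return raw_eb * expansion_low, raw_eb * expansion_high


# Figures taken from the text: 40 EB of raw data, 3-5 times expansion.
low, high = projected_footprint_eb(40, 3, 5)
print(f"{low} to {high} EB per year")
```

Under these figures, 120 EB is the lower bound of the projection, and the upper bound is 200 EB.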
All too often, research institutions lack a comprehensive policy and data tracking mechanism to ascertain at any time how much computer storage space is being utilized by these files. It is not uncommon for research institutions to have terabytes of HTS data scattered across hundreds of directories and folders, while the only backup strategy consists of removing old hard drives and placing them on a bookshelf. At the other extreme, teams of researchers are required to meet weekly to assess which files can be deleted from overloaded data volumes. It is, therefore, paramount to incorporate data management systems and retention policies into traditional bioinformatics pipelines to track analyses and other pertinent information so that users and administrators of the system can find data, preserve analysis provenance, and enable consistent reproducibility of results.
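As a minimal illustration of the kind of tracking mechanism described above, the following sketch totals the storage consumed under a directory tree, grouped by file extension. It is not part of any described system; the function name `storage_by_extension` and the choice to group by extension are assumptions made for illustration.

```python
import os
from collections import defaultdict


def storage_by_extension(root):
    """Walk `root` and total file sizes in bytes, grouped by file extension."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so linked data is not counted twice
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext] += os.path.getsize(path)
    return dict(totals)
```

A report of this kind only answers how much space each file type consumes; deciding which files may be removed still requires a retention policy.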
Although genetic data may be readily available, storing, analyzing, and sharing vast quantities of genetic data can be challenging and inefficient, often obscuring significant genomic insights. For example, single-patient genetic data can become mired in a complex and unintegrated data pipeline. There is still a need in current technologies for an improved method for managing complex data workflows.
Systems and methods have been put forth to solve the aforementioned problem by tracking data individually selected for deletion and manually deleted by the user. For example, U.S. Pat. No. 8,140,814 discloses a data management apparatus and system in which a volume deletion metadata recorder records metadata for one or more volumes of data deleted by a user. After deletion, the system initiates rule-based data storage reclamation for the deleted data volumes according to a predetermined rule.
Further, U.S. Pat. No. 9,235,476 (“'476”) discloses a method and system providing object versioning in a storage system that supports the logical deletion of stored objects. When a user requests the deletion of an object, the system verifies whether the deletion is prohibited and allows or denies the deletion based on the result of said verification. The utility of this logical deletion is the safeguarding of stored objects from unintentional deletion.
Currently available systems and methods, such as the above, generally do not practice deletion of data files produced by workflow stages. If deletion is practiced, it is typically carried out manually. Thus, the current technology does not address the previously discussed issue of managing the massive expansion of the initial data in complex bioinformatics workflows, as users are required to identify data files ready to be deleted and must then manually carry out the deletion. The onerous task of examining the massively expanded data in order to determine which data files are ready for deletion therefore remains. The present invention resolves this problem by automating the identification and deletion of data files. Current systems do not provide for identifying data files that are ready to be deleted and automating said deletion. On the contrary, users are still required to sift through a substantial number of data files, determine if each of said files is dispensable, and then perform the required deletions.
Moreover, deletion of intermediate data is counterintuitive to the present field of study since this data is the output of a processing stage. Therefore, some, if not all, of the intermediate data is perceived to be a highly valuable product of the data processing. It is perceived in the field that most, or all, of these files are necessary to store, especially for the verification or audit of the final data results. Thus, the prior art teaches away from the objective of the present invention, which is the automatic identification and deletion of data files that are unnecessary in the processing of genomics data by a workflow.
Additionally, the present invention automatically deletes or compresses data files produced at a given stage of the bioinformatics workflow based on information, or policies, stored in metadata descriptive of the produced data file. This automated deletion or compression occurs before the next stage of the workflow begins. In this way, the expansion of initial data into the workflow is managed, thus reducing memory requirements of hardware executing the workflow. The automated deletion can also occur after the completion of the workflow at a predetermined time, as defined in a data retention policy, or in response to a present need to reduce storage consumption.
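The policy-driven cleanup described above might be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the `FileMetadata` record, the policy values "keep", "compress", and "delete", the use of gzip for compression, and the function names are all assumptions introduced for the example.

```python
import gzip
import os
import shutil
from dataclasses import dataclass


@dataclass
class FileMetadata:
    path: str
    policy: str  # hypothetical policy values: "keep", "compress", "delete"


def apply_policy(meta):
    """Apply the action named in the metadata; return the surviving path or None."""
    if meta.policy == "delete":
        os.remove(meta.path)
        return None
    if meta.policy == "compress":
        gz_path = meta.path + ".gz"
        with open(meta.path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(meta.path)  # remove the original only after the copy succeeds
        return gz_path
    if meta.policy == "keep":
        return meta.path
    raise ValueError(f"unknown policy: {meta.policy!r}")


def clean_up_stage(outputs):
    """Run after a stage finishes and before the next begins; return surviving paths."""
    surviving = []
    for meta in outputs:
        result = apply_policy(meta)
        if result is not None:
            surviving.append(result)
    return surviving
```

In this sketch a workflow runner would call `clean_up_stage` on the outputs of each stage once the stage that consumes them has what it needs, so that dispensable files never accumulate across stages.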
The prevailing thought in the present field is to maintain the latest version of the applications employed by each stage of the workflow. Results are then reproduced by reprocessing input data via the latest applications, inevitably leading to inconsistent and unreproducible results, whereas consistent and reproducible results are required for clinical diagnostics and regulatory policies for drug development. The present invention maintains records of all parameters, application versions, and containers via metadata files. In this way, the specific application version is referenced, via information stored in the metadata file, and accessed for accurately reprocessing input data. A significant issue in the field is producing consistent output and results, especially since the final data could include the diagnosis of a disease or confirmation of drug efficacy.
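A metadata record of the kind described above could, for illustration, take the following form. The record layout, the JSON serialization, and every name in the sketch (`StageRecord`, `resolve_for_reprocessing`, and the installed-versions mapping) are hypothetical and are not drawn from the specification.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class StageRecord:
    application: str  # name of the application run at this stage
    version: str      # exact application version that produced the output
    container: str    # container image reference used for the run
    parameters: dict = field(default_factory=dict)


def to_metadata_json(record):
    """Serialize a stage record for storage in a metadata file."""
    return json.dumps(asdict(record), sort_keys=True)


def from_metadata_json(text):
    """Rebuild a stage record from its stored JSON form."""
    return StageRecord(**json.loads(text))


def resolve_for_reprocessing(record, installed_versions):
    """Return the recorded version if available; never substitute the latest."""
    available = installed_versions.get(record.application, ())
    if record.version not in available:
        raise LookupError(
            f"{record.application} {record.version} is not available"
        )
    return record.version
```

The design point is that reprocessing resolves the version from the stored record rather than from whatever is newest, so a rerun fails loudly when the recorded version is missing instead of silently producing different results.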
Any feature or combination of features described herein is included within the scope of the present invention provided that the features included in any such combination are not mutually inconsistent as will be apparent from the context, this specification, and the knowledge of one of ordinary skill in the art. Additional advantages and aspects of the present invention are apparent in the following detailed description and claims.