The disclosure relates to generating queries. More particularly, the methods and systems described herein relate to generation, by a visual query builder, of a query of a genomic data store.
Conventional models for genomic data analysis typically require downloading large genomic data sets, incorporating new locally generated data, and then performing computational analyses on this data using local hardware. This model has been successfully employed by researchers for many years, but has recently become a bottleneck given the enormous growth in the size of biomedical data sets. Large-scale scientific programs using next-generation sequencing technology, such as the 1000 Genomes Project, the Human Microbiome Project (HMP), and The Cancer Genome Atlas (TCGA), have grown to a point at which it is impractical for individual researchers to download, store, and analyze common genomic data sets. For example, the full size of the TCGA data set is expected to exceed 2.5 petabytes. Few researchers have this level of storage capacity, and for those who do, simply downloading this amount of data could take months over a conventional network connection.
To address this issue, the research community has focused on the creation of public cloud computing resources that allow researchers from anywhere in the world to access, contribute to, and analyze a shared data set. Analysis tools are co-located with the data via the cloud computing resource, and researchers may upload or design their own tools. In this way, the analysis tools are brought to the data, rather than the data being downloaded for analysis with local tools. This paradigm is expected to improve data accessibility and analysis for a diverse community of users. However, implementing such cloud computing resources requires skill and careful planning, and raises new issues regarding data access, computing capacity, interoperability, training, usability, and governance.
One particular issue affecting meaningful analysis of large genomic data sets is the diversity of the data. Each file from a study may represent a variety of raw and processed data, such as data from whole genome, whole exome, RNA, microRNA, bisulfite, array, and other forms of analysis. Data can be unprocessed (such as sequence reads or intensity data) or processed (such as alignments or a list of variants), and each kind may be stored in a variety of formats. Further, various metadata and attributes may be associated with each data point, including clinical information about the patient, processing information related to samples, and the type of genomic analyses performed. For example, the TCGA data set includes metadata for each patient that includes rich information regarding treatment protocols, follow-ups, and detailed processing information for each sample at the aliquot level. Organizing and presenting this information in an appropriate way to allow for meaningful query and selection is a challenging task.
Diverse meta-elements can be better managed using a semantic model. The Semantic Web has already been used for the purpose of organizing biological datasets. For example, Open PHACTS aims to improve drug discovery efforts by integrating and linking major pharmacological datasets. Previous work on structuring and exposing TCGA data includes an adaptation to a Simple Sloppy Semantic Database (S3DB) management model, including a scalable approach to continuous data integration and a visual environment for biomedical data understanding and discovery. However, querying such solutions is still not an intuitive process for users.
One way to improve accessibility to such data is to use a visual query browser. For example, the visual query browser can be used to visually create SPARQL Protocol and RDF Query Language (SPARQL) queries, allowing researchers and other users to query a Resource Description Framework (RDF) store in an interactive manner. Previous attempts at creating visual query browsers for RDF stores typically require a level of expertise that poses challenges for users. For example, U.S. Pat. No. 8,984,002 describes a query builder system that allows a user to visually build an RDF query using a drag-and-drop interface by selecting classes and predicates to form the query. (See, e.g., U.S. Pat. No. 8,984,002, FIG. 3.) The SPARQL query is built, line by line, by incrementally generating the query statements that make up the query. (Id., col. 8, lines 49-56.) However, this approach is not immediately intuitive for the user, who must still understand the intricacies of the RDF format and of SPARQL queries, knowledge that typical users do not necessarily have.
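By way of illustration only, the following minimal sketch shows what building a SPARQL query incrementally from selected classes and predicates can look like. It is not taken from U.S. Pat. No. 8,984,002 or from any particular system. The `QueryBuilder` class, its methods, the `ex:` prefix, and the `ex:Case`, `ex:File`, and `ex:hasFile` terms are all hypothetical names assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueryBuilder:
    """Hypothetical helper that accumulates SPARQL statements one at a time."""

    prefixes: Dict[str, str] = field(default_factory=dict)
    variables: List[str] = field(default_factory=list)
    patterns: List[str] = field(default_factory=list)

    def add_prefix(self, name: str, iri: str) -> None:
        # Register a namespace prefix, e.g. ex -> http://example.org/genomics#
        self.prefixes[name] = iri

    def _select(self, var: str) -> None:
        # Project each variable once, in the order it was first used.
        if var not in self.variables:
            self.variables.append(var)

    def add_class(self, var: str, cls: str) -> None:
        # Selecting a class adds one "?var a Class ." statement.
        self._select(var)
        self.patterns.append(f"?{var} a {cls} .")

    def add_predicate(self, subj: str, pred: str, obj: str) -> None:
        # Selecting a predicate links two variables with one more statement.
        self._select(subj)
        self._select(obj)
        self.patterns.append(f"?{subj} {pred} ?{obj} .")

    def build(self) -> str:
        # Assemble the accumulated statements into the final query text.
        lines = [f"PREFIX {n}: <{iri}>" for n, iri in self.prefixes.items()]
        lines.append("SELECT " + " ".join(f"?{v}" for v in self.variables))
        lines.append("WHERE {")
        lines.extend(f"  {p}" for p in self.patterns)
        lines.append("}")
        return "\n".join(lines)


# Hypothetical vocabulary; a real RDF store would define its own terms.
qb = QueryBuilder()
qb.add_prefix("ex", "http://example.org/genomics#")
qb.add_class("case", "ex:Case")
qb.add_class("file", "ex:File")
qb.add_predicate("case", "ex:hasFile", "file")
query = qb.build()
print(query)
```

Even in this simplified form, the user must know which classes and predicates exist and how they connect as subject, predicate, and object, which is the kind of RDF and SPARQL knowledge discussed above.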