The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A wide variety of genomic data exists, including, without limitation, data structures such as DNA sequences and protein sequences, annotations to those structures, and publications. Genomic data may be found in many different sources. For example, sequence data is one type of genomic data. Common sources of sequence data include web-based databases such as GenBank, provided by the United States National Institutes of Health, the European Nucleotide Archive (“ENA”), and the Protein Data Bank, operated by the Research Collaboratory for Structural Bioinformatics. These sources allow users to access sequence data in a number of formats, such as flat-text files or FASTA-formatted files. Generally, the sequence data comprises a header with a sequence identifier and other metadata, and a body comprising a sequence. The sequence data may be accessed in a variety of ways, including in pages on a website, in files downloadable via HTTP and/or FTP, or using a REST-based application programming interface (“API”).
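By way of illustration only, the header-and-body layout described above may be read with a short routine such as the following Python sketch. The names SequenceRecord and parse_fasta, and the sample records, are hypothetical and are not drawn from any of the sources named above. The sketch assumes the common FASTA convention in which a header line begins with ">" and the first whitespace-delimited token of the header is the sequence identifier.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class SequenceRecord:
    identifier: str   # first whitespace-delimited token of the header
    description: str  # remainder of the header (other metadata), may be empty
    sequence: str     # body with line breaks removed


def parse_fasta(lines: Iterable[str]) -> Iterator[SequenceRecord]:
    """Yield one record per '>' header found in FASTA-formatted lines."""
    identifier: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield SequenceRecord(identifier, description, "".join(chunks))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # body line belonging to the current record
    if identifier is not None:
        yield SequenceRecord(identifier, description, "".join(chunks))


# Hypothetical sample data, not taken from any real databank.
EXAMPLE = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
MKV
"""

records = list(parse_fasta(EXAMPLE.splitlines()))
```

In this sketch a body that spans several lines is joined into a single string, so each record carries exactly one identifier, one metadata string, and one sequence.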
Another type of genomic data is annotations. Annotations may include, for example, research findings that are related to specific sites of a sequence, such as an observation that a site is a binding site for a certain protein or a variation associated with a certain disease. The UC Santa Cruz (UCSC) Genome Browser is a popular web-based interface with which to access various sources of annotation data. Each sequence identifier may be associated with one or more annotation records, and each record may be associated with one or more specific sites in a sequence.
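The relationship described above, in which a sequence identifier is associated with one or more annotation records and each record with one or more sites, may be modeled in many ways. The following Python sketch shows one possibility. The names Site, AnnotationRecord, and AnnotationIndex, the use of 1-based inclusive coordinates, and the sample annotations are all assumptions made for illustration and do not reflect the data model of any particular annotation source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Site:
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


@dataclass
class AnnotationRecord:
    note: str          # free-text finding, e.g. a binding-site observation
    sites: List[Site]  # one or more sites the finding relates to


class AnnotationIndex:
    """Maps a sequence identifier to its annotation records."""

    def __init__(self) -> None:
        self._by_sequence: Dict[str, List[AnnotationRecord]] = defaultdict(list)

    def add(self, sequence_id: str, record: AnnotationRecord) -> None:
        self._by_sequence[sequence_id].append(record)

    def at_position(self, sequence_id: str, position: int) -> List[AnnotationRecord]:
        """Return records having at least one site that covers the position."""
        return [
            record
            for record in self._by_sequence.get(sequence_id, [])
            if any(site.start <= position <= site.end for site in record.sites)
        ]


# Hypothetical identifiers and findings, for illustration only.
index = AnnotationIndex()
index.add("seq1", AnnotationRecord("binding site for protein X", [Site(10, 20), Site(40, 45)]))
index.add("seq1", AnnotationRecord("variation observed in disease Y", [Site(15, 15)]))

hits = index.at_position("seq1", 15)
```

Here a lookup at position 15 of the hypothetical sequence "seq1" returns both records, because each has a site covering that position.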
There is also a wide variety of tools for processing genomic data. For example, one common category of tools aligns sequences together and compares those sequences. Some such tools are described in “Computer Graphical User Interface Supporting Aligning Genomic Sequences”, U.S. patent application Ser. No. 13/835,688, filed Mar. 15, 2013, the contents of which are hereby incorporated by reference for all purposes, as if set forth in their entirety. Another example tool is BLAST, a web-based tool for identifying similarities between an unknown protein and known proteins. A number of example algorithms for processing genomic data are described in “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids” by Richard Durbin et al., Cambridge University Press 1998, the entire contents of which are hereby incorporated by reference for all purposes as if set forth in their entirety herein. These and other tools generally identify genomic data to process based on input, such as input specifying sequences or input from which sequences may be mined or derived. The tools then perform one or more processing algorithms with respect to the genomic data, such as statistical analyses, comparisons, search operations, filtering operations, manipulations, and so forth. The tools then generate a report of the results of the processing.
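The general pattern described above, in which a tool identifies data based on input, processes the data, and reports the results, may be sketched as follows in Python. The sketch is not the algorithm of BLAST or of any other tool named above. It assumes, for simplicity, that the sequences are of equal length and already aligned, and it uses a plain position-by-position comparison. All function names and sample sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def select_sequences(database: Dict[str, str], requested_ids: List[str]) -> Dict[str, str]:
    """Step 1: identify the data to process based on input identifiers."""
    missing = [seq_id for seq_id in requested_ids if seq_id not in database]
    if missing:
        raise KeyError(f"unknown sequence identifiers: {missing}")
    return {seq_id: database[seq_id] for seq_id in requested_ids}


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def compare_all(selected: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Step 2: process the data, here by comparing every pair once."""
    ids = list(selected)
    results = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = percent_identity(selected[first], selected[second])
            results.append((first, second, score))
    return results


def format_report(results: List[Tuple[str, str, float]]) -> str:
    """Step 3: generate a tab-separated report of the results."""
    lines = ["seq_a\tseq_b\tidentity_pct"]
    for first, second, score in results:
        lines.append(f"{first}\t{second}\t{score:.1f}")
    return "\n".join(lines)


# Hypothetical in-memory "databank" used only for illustration.
DATABASE = {"s1": "ACGTACGT", "s2": "ACGTTCGT", "s3": "ACGAACGA"}

results = compare_all(select_sequences(DATABASE, ["s1", "s2", "s3"]))
report = format_report(results)
```

Keeping the three stages as separate functions is a design choice of the sketch: it allows the comparison step to be replaced by a different algorithm without changing how data is selected or reported.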
The analysis of genomic data has become an increasingly important task. Unfortunately, such analyses are often complex, relying on large quantities of disparate data sources and disconnected tools. For example, a researcher may be interested in determining how variations in a certain genomic sequence affect a certain disease. The researcher may begin the analysis by retrieving a sequence from a databank. The researcher may then translate the sequence into a protein using a first tool, compute variations of the protein using a second tool, and run a large-scale similarity search across yet another databank to find species that have similar proteins. The researcher may then access still other tools and databanks to search for sequences in these species that code for the protein, and finally execute a motif-finding algorithm to identify other proteins that bind to the protein. As a consequence of the complexity of this task, the researcher's work may be disorganized and difficult to reproduce or extend to other sequences.
While this application will often refer to genomic data, many of the techniques described herein are in fact applicable to any type of data. Other uses of the techniques described herein may include, without limitation, data analyses in the fields of natural language processing, social sciences, financial data, historical and comparative linguistics, and marketing research.