In the field of bioinformatics, biological data repositories (databases) are used to store genome information in the form of DNA and protein sequences. Each sequence is a series of capital letters and numerals uniquely identifying the genetic code of DNA nucleotides or amino acids. Internally, each sequence is formed as a structured string organized into primary, secondary, tertiary, and further sets of cloning vectors that can be lengthy and complex.
Worldwide, all known genome sequences are identified and cataloged in three principal public databases: GenBank, maintained by the National Center for Biotechnology Information (NCBI); the database of the European Molecular Biology Laboratory (EMBL); and the DNA Data Bank of Japan (DDBJ). Each day, the genome sequences maintained in these databases are downloaded and synchronized across the three databases to provide an up-to-date and consistent repository of collective biological data.
Biological data repositories, such as GenBank, EMBL and DDBJ, are searched on a regular basis as an aid to biotechnical research. As publicly accessible biological data repositories, each of these databases processes a high volume of queries each day. For example, GenBank contains over 12 million entries totaling nearly 13 billion base pairs of sequence sets, and receives over 800,000 queries per day from over 120,000 individuals worldwide. Demand for search access often exceeds database capacity.
Nevertheless, searching remains a crucial part of ongoing research for several reasons. First, individual sequences must be matched, where feasible, against existing DNA and protein sequences to identify them and to determine their potential characteristics and composition. Second, identifying a given sequence allows the generation of a probability function predicting behavior and interaction characteristics. Third, searching biological data repositories allows a determination of whether a given sequence is novel and, if so, whether the sequence has been the subject of patent or similar protection.
To accommodate the large demand for these public databases, access by each individual user is limited to a fixed maximum number of queries per day. Accordingly, the tools available for accessing these databases have evolved to maximize the limited availability afforded to each user. In particular, with the growth and widespread availability of local and wide area networks, including the Internet, browser-based tools via the World Wide Web (Web) have become available and have significantly displaced older command line-based query tools.
One limitation imposed, in part, by the limited access afforded to public biological data repositories is that searching multiple sequence sets against one or more of the databases as a single transaction is discouraged. Rather, each sequence in a set of multiple sequences must be submitted to each database separately as an individual query, serially, one at a time. Furthermore, combined genome sequences must be categorized based on the type of sequence presented, that is, DNA or protein. These single-query limitations and type categorizations increase the difficulty of using the public databases.
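By way of illustration only, the type categorization described above can be sketched in Python as follows. The alphabets, the function names `classify_sequence` and `group_by_type`, and the subset-based heuristic are assumptions introduced for this example and do not describe any particular repository's method.

```python
from collections import defaultdict

# Assumed alphabets for illustration: the four DNA bases plus N (unknown base),
# and the one-letter codes of the 20 standard amino acids.
DNA_ALPHABET = set("ACGTN")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def classify_sequence(seq: str) -> str:
    """Return 'dna', 'protein', or 'unknown' based on the letters used.

    Heuristic only: a protein made up solely of A, C, G, T and N
    would be reported as DNA.
    """
    letters = set(seq.upper())
    if not letters:
        raise ValueError("empty sequence")
    if letters <= DNA_ALPHABET:
        return "dna"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


def group_by_type(sequences: dict) -> dict:
    """Group sequence names by detected type so each group can be queried separately."""
    groups = defaultdict(list)
    for name, seq in sequences.items():
        groups[classify_sequence(seq)].append(name)
    return dict(groups)
```

With such a grouping, a mixed set of sequences could be split into a DNA group and a protein group before each group is submitted to a repository.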
To alleviate these access constraints, individual users often download and mirror public databases onto a local host for increased search efficiency without the restrictions mandated by the public repositories. However, the tools used to search local database copies are the same tools used on the public repositories and thus provide limited relief from the access restrictions. For instance, these tools lack the necessary mechanisms to process queries for multiple sequences, including mixed sets containing both DNA and protein sequences. These tools also lack the capabilities to process search results on a sequence-by-sequence basis or to align and display multiple sets of sequences received in the search results from a multi-repository search. Other shortcomings exist.
In the prior art, two principal tools for accessing public biological data repositories exist. First, the Ensembl query tool, licensed by EMBL, operates as a browser-based solution for searching one database, one query at a time. The tool interfaces directly with the database engine and operates in a strict request-response manner without intermediate flow control. Sequence results cannot be exported, nor can a new database be created from search results. Control is limited to a serial searching of a single data repository and the results received therefrom are presented for only one sequence request.
Second, the BLAST software suite, licensed by NCBI, offers a similar browser-based query tool, but also includes a conventional command line interface. Queries can be executed against multiple databases for a single sequence by using the command line interface. However, the user interface is awkward, complex and non-intuitive and requires a high level of expertise to interpret and apply the appropriate flags and parameters as a single command line. In addition, both the browser-based and command line interfaces fail to offer any type of meaningful flow control other than a simple serialization of individual queries.
Therefore, there is a need for an approach to searching multiple biological data repositories, including public databases, for a set of one or more sequences of biological data. Preferably, such an approach would provide both preprocessing of queries and post-processing of search results.
There is a further need for an intuitive and user-friendly interface for searching data repositories of biological data. Preferably, such an approach would provide a graphical user interface that includes the capability to display substantially unlimited search result sets as generated by a multi-sequence query against multiple databases.
There is a further need for an approach to providing control over the intermediate layer transaction processing of a search query executed against multiple data repositories. Preferably, such an approach would offer load balancing, processing of partial results, and detection of expired searches.
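By way of illustration only, one possible form of such intermediate control can be sketched in Python as follows. The function `run_searches`, its parameters, and the caller-supplied `search_fn` are hypothetical names introduced for this example; the bounded worker pool is only a crude stand-in for load balancing, and an actual repository interface would differ.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_searches(sequences, repositories, search_fn, timeout_s=5.0, max_workers=4):
    """Submit every (sequence, repository) pair and collect what finishes in time.

    search_fn(sequence, repository) is a hypothetical caller-supplied function
    that performs one query. Returns (results, expired), where results maps
    each completed pair to its result (or to the exception it raised) and
    expired lists the pairs that did not finish within timeout_s seconds.
    """
    results, expired = {}, []
    # A bounded pool limits how many queries are in flight at once.
    pool = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = {
            pool.submit(search_fn, seq, repo): (seq, repo)
            for seq in sequences
            for repo in repositories
        }
        done, not_done = wait(futures, timeout=timeout_s)
        # Partial results: keep everything that completed before the deadline.
        for fut in done:
            try:
                results[futures[fut]] = fut.result()
            except Exception as exc:
                results[futures[fut]] = exc
        # Expired searches: cancel those not yet started and report all of them.
        for fut in not_done:
            fut.cancel()
            expired.append(futures[fut])
    finally:
        # Do not block on searches that are still running.
        pool.shutdown(wait=False)
    return results, expired
```

In this sketch, a caller receives the completed results together with a list of the sequence and repository pairs whose searches expired, and may choose to resubmit the latter.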
There is a further need for an approach to aligning a plurality of sequence sets received as search results generated by a search query against multiple databases. Preferably, such an approach would provide flexible multiple sequence alignment displayable via graphical and textual user interfaces.