1. Field of the Invention
The present invention relates to a method, system, and software program for generating a search within a relational database, or across homogenous relational databases. More specifically, the invention relates to a search tool that automatically creates joins between tables without the user or client software having previous knowledge of the database schema.
2. Description of the Prior Art
A structured database is a computerized record keeping system created for the purpose of maintaining information and making that information available on demand. Many databases in use today are based on a relational model, often called a relational database, in which the data is stored as a collection of individual tables all related to one another.
The tables, or nodes in a graphical representation, in a relational database include a row of column names specifying one or more column fields, and one or more data rows containing a single value for each of the column fields. For example, a table containing product information in a company's product database may contain a column of product names, and each field in that column may contain a name of a current product. Similarly, a second column may contain product ID numbers each relating to the product name found within the same row.
One feature that distinguishes relational from nonrelational databases is the ability to “join” two or more tables. A join is described as a query in which data is retrieved from the fields of more than one table (although data may also be retrieved by joining a table with itself). In order to join two or more tables, the tables to be joined need to share a common key field that is stored in at least one column. Typically, the purpose of the join is to connect data having the same primary key across the tables being joined. A primary key is usually represented as a column of data entries in a parent table, while secondary keys, also called foreign keys, describe an identical column of data in a child table to be joined. In such cases, for the results of the join to be meaningful, the columns that represent the same data in the different tables must use the same field values to represent the same instances of those objects. In other words, the primary and foreign keys must use the same text strings in order to effect a correct join.
One common query for retrieving data from various tables in a database is the SELECT statement from the structured query language (SQL). A standard SQL statement usually takes the form: “SELECT <string> FROM <string> WHERE <string> GROUP BY <string> HAVING <string> ORDER BY <string>.” More information on SQL can be found at http://www.microsoft.com. While only the SELECT and FROM strings are required, it is desirous to include other strings to further narrow and optimize the search, and to allow the user to express the desired format of results. The result of a SELECT statement is a subset of data that matches the search criteria, and is stored as a temporary table, often termed the result table. In the SELECT statement illustrated above, the FROM string indicates the table name from which the information is being retrieved. The remaining strings included in the SELECT statement direct which columns are to be returned and stored in the results table. The WHERE string includes filtering criteria and hence dictates which rows satisfy the search criteria and are to be returned and stored in the results table.
A join operation is usually performed by specifying more than one table in the FROM string of the SELECT statement. A join operation allows rows from multiple tables all satisfying the search criteria to be sequentially combined into a single row and stored in the results table. This becomes important for grouping relevant information into a single table, rather than requiring the search engine to repeatedly parse the entire database to assemble the requested data, thus saving time.
As an example of a join operation, assume that a first table has columns containing a primary key of product names and product purchase prices. A second table may have a foreign key that contains product names that can be joined to the first table, and additionally has a second column that contains product sales prices. A third table might contain a foreign key column of product names and a second column containing product profit margin numbers. In order to effect a correct join between the three tables, a query would recognize that the column containing product names is identical across the three tables. Furthermore, the text string containing the product names must also be identical across the three tables in order to effect a correct join. Now, when a query is run that requests the financial properties of a particular product, the search can be conducted across the tables in the database by first finding the product name as a text string within each table, and then returning the value of the corresponding financial information. In this example, the product name field is used to join the three tables containing purchase, sales, and profit margin dollar amounts, all of which can be returned to the user initiating the query.
Additionally, not all three tables need share an identical common column as previously demonstrated. For example, suppose a user wants to find if a particular product is currently in stock in a particular color. The first table might contain a list of product names and corresponding product ID numbers. A second table may contain the list of product ID numbers and corresponding product colors. A third table may contain a list of product colors and corresponding quantities in stock. To search for a particular color of the product, the database management software would transverse the tables by starting with the product name, join the product ID number from the first table to the product colors of the second table, and then use the product colors to finally retrieve the quantity in stock from the third table. The path to find if a particular color is in stock then follows the path: product name, product ID number, product color, and finally quantity in stock. It is conceivable that another table might include columns containing product names and colors. Therefore, an alternative path might be: product name, product color, and finally quantity in stock, thus cutting out one node from the path and reducing the time required to perform the search. If this simple example were extended substantially, there could potentially be hundreds or thousands of different paths to follow to get from the starting node to the desired result. As the complexity and amount of the data entries increase, the number of potential paths increases exponentially. This becomes readily apparent where a database system contains a large number of data entries relating to the same primary key entry, such as in a biological database, for example.
Traditional database search methods are able to perform a search across various tables, however, they are grossly inefficient because a chosen path might connect hundreds of nodes when a more efficient path would connect only a few nodes. Hence, in an effort to streamline the query process, it is desirous for a database search tool to automatically create the correct joins and then optimize the joins by calculating the shortest path between two nodes.
One current problem with current database systems is that they usually require an experienced database manager to structure queries based on personal knowledge of the database's schema and contents. Furthermore, a query usually searches for an exact string match, or maybe a partial string with wildcard characters. Even though some database techniques allow the database to store a list of synonyms for a frequently used search term, the list is cumbersome to maintain and is often non-exhaustive.
Typically, the tables within a single database are joined. However, with current database management systems, the tables to be joined need not be contained within the same database. Products such as SQLNet from Microsoft Corporation and SQL Connect from Oracle Corporation enable the use of a heterogeneous database where a collection of database tables on different hardware platforms operating under different database management systems all appear to a user to be on one machine operating under one database management system.
In any case, current database systems require the user to have some knowledge of what fields are searchable within a database to be able to formulate a query that will allow the requested information to be retrieved. If a user is not intimately familiar with the database schema, it can be very inefficient, or even impossible, for a user to find the requested information.
It would be a substantial advancement in the art if the correct joins could be created automatically without the user having any prior knowledge of the database schema. It would be an additional advancement in the art if the correct joins and corresponding paths could be optimized to conduct an efficient search thus reducing the time required for a search. The present invention, among other things, provides a solution to this problem.
More narrowly, one application of relational databases is in the storing of biological information such as gene sequences. Modern biological sequencing efforts, such as those underway for the complete sequencing of the entire human genome, as well as newly developed experimental techniques for biological analysis, have resulted in an unprecedented amount of information contained in publicly accessible databases. For example, GenBank (an annotated database of all published DNA and RNA sequences), SwissProt (an annotated protein sequence database), PubMed (a biomedical literature repository), Molecular Modeling Database (contains 3-dimensional macromolecular structures including nucleotides and proteins), Entrez Genomes (a whole sequence listing for the complete genomes of individual organisms), PopSet (a set of DNA sequences that have been collected to analyze the evolutionary relatedness of a population), and NCBI taxonomy database (containing the names of all organisms that are represented in genetic databases) are some of the publicly accessible databases.
The information and sequence records contained in the public databases are typically accessible using programs or applications collectively referred to as bioinformatic applications. One method of information searching and data acquisition involves the use of specifically designed or proprietary analysis applications which query, retrieve, and interpret information contained in the database directly. These analysis programs may access the informational databases and may contain additional functionality to process, analyze and categorize the information and databases to facilitate their use.
Each publicly accessible database usually provides a non-uniform output, thus requiring a separate bioinformatic application to query each database. Because of the large number of publicly accessible databases, the number of specifically designed or proprietary analysis applications is continually increasing to allow users of such databases to access and analyze the information contained in the databases. This requires a user to purchase and maintain a multitude of applications specifically designed to access each database of interest. This becomes expensive in terms of disk storage, purchase price of such applications, and the time required to learn to use each application. It would be a substantial improvement in the art if a single program product were capable of searching any of the various publicly accessible biological databases as described herein.
Thus, there is a need for an improved design and method of searching and retrieving information from a multitude of relational databases. The system should have the characteristic feature of database independence in order to allow different databases and schemas to be used without having to devote large amounts of time to rewrite the code for existing databases or components and to minimize the changes to existing databases needed to update the system with new functionalities. Furthermore, the system should preserve the property of component reuse, wherein database designers and application programmers may reuse components which are likely to be used by the many different applications accessing the information in the database. Additionally, the database system and applications should be code independent. Thus, whenever a new utility or application is added to the system, or if a component is changed, little or no lines of code will have to be changed in other applications or services improving upgrade performance and reducing maintenance time.