The invention relates to the field of databases. More particularly, the invention relates to a method of enabling keyword based searches on relational databases.
SQL databases are used extensively to store information on intranets. For example, a corporation typically has multiple databases on the intranet to store a variety of information, such as an xe2x80x9caddress bookxe2x80x9d or directory of the employees, mailing list information, product, and sales details. The ability to seamlessly access and browse the information stored in these databases is important. Customized applications with front-ends that are closely tied to the schema of the database are used to enable database searching over the intranet. These applications often direct the search in a structured manner.
Structured searches are most effective when the user knows how to get to the information. However, few users are familiar with or willing to learn about the details of the schema of the various databases stored on their intranet. Furthermore, building customized web applications over each database is time consuming. These issues have limited the exploitation of the richness of intranet database information. In the field of document searches, internet search engines have popularized keyword match searches. A user submits keywords to the search engine and a ranked list of documents including a summary is returned. Internet search engines exploit an index over the document collection to enable such a search. The results of a search may retrieve documents that belong to completely different parts of the hierarchy.
While keyword searches are popular in the field of document searches, there is little support for keyword searching over a database or a collection of databases. One approach to enable keyword searching over a database is to duplicate the contents of the database as documents in a middle tier. Traditional text search techniques are then used to perform keyword searches. However, these methods typically have large space and time overheads as well as manageability concerns. In addition, the methods fail to exploit the native query processing functionality of the databases.
Keyword searches are more difficult in databases than in document collections. This is because the database has an inherent structure that may cause the required information to reside in one or more of many tables/columns. Because relational databases can search non-text data and have stored procedures that can support the notion of semantic nearness, in some ways keyword searching can be enhanced through the use of a relational database.
Some common approaches to keyword based searching of text documents are inverted lists, signature files, and prefix trees. One approach to keyword searching of XML documents parses the documents to generate inverted file information that in turn is loaded into a relational database. This approach effectively builds an inverted list that maps values to individual data objects (rows). Because this approach maps at an instance level, it suffers from scalability issues caused by duplication of data that leads to manageability problems.
The work on universal relations treats a database as a universal relation for querying purposes, thus hiding the inherent complexity of schema normalization for the purposes of querying. The challenge in this approach is to map a selection query over the universal relation to a SQL over the normalized schema. However, keyword queries are analogous to specifying universal relation queries without naming the columns involved in the selection conditions. Thus universal relation concepts are not directly applicable to keyword searches over relational databases. xe2x80x9cDataSpotxe2x80x9d is a commercial system that supports keyword-based searches by extracting the content of the database into a xe2x80x9chyperbasexe2x80x9d. Subsequently, all keyword searches are made over the hyperbase. Thus, this approach duplicates the content of the database and suffers from the scalability issues already discussed. Most major commercial database vendors allow a full text search engine to be invoked while processing SQL. The full text search capability is invoked by virtue of specialized predicates in SQL that are processed by the full text engine. Since a xe2x80x9chitxe2x80x9d using keyword may span multiple tables, full text searching is not sufficient to support keyword search over databases. Furthermore, as already discussed, such engines may encounter serious scalability problems.
Searching over relational databases can be made more efficient by preprocessing the database to form an index of the data records at an interim level of granularity. The index can be searched to generate a focused query on the database tables that searches only the sets of data records identified in the initial search of the index.
A search of a relational database for data records relating to given search criteria is enabled by creating an index of the data records that maps a record to a region of the database in which it is found. The index is accessed to identify regions of the database that contain data records relating to the given search criteria. A query is constructed that corresponds to the given search criteria and the query is executed on the identified regions of the database to retrieve records matching the search criteria.
To prepare a database to enable searches based on keywords a symbol table is constructed that stores location information for each keyword. The location information may be at a table/column level or at a cell level where a cell corresponds to a particular column in a given row. The symbol table may be constructed by hashing the keywords into a hash table and then compressing the table.
Upon receiving search keywords, the symbol table is searched to identify tables and columns that contain the search keywords. Database subgraphs of the overall schema graph of the relational database that span the set of search keywords are enumerated having nodes that represent tables. The enumerated subgraphs are evaluated to generate queries on the database that select records containing the search keywords. The enumeration of database subgraphs may be based on a schema graph describing a structure of the database tables. The enumerated database subgraphs may be pruned by eliminating leaves that do not contain and of the search keywords. A database query may be generated that joins the tables represented in the database.