The invention relates to the field of databases. More particularly, the invention relates to a method of enabling keyword based searches featuring generalized keyword matching on relational databases.
SQL databases are used extensively to store information on intranets. For example, a corporation typically has multiple databases on the intranet to store a variety of information, such as an xe2x80x9caddress bookxe2x80x9d or directory of the employees, mailing list information, product, and sales details. The ability to seamlessly access and browse the information stored in these databases is important. Customized applications with front-ends that are closely tied to the schema of the database are used to enable database searching over the intranet. These applications often direct the search in a structured manner.
Structured searches are most effective when the user knows how to get to the information. However, few users are familiar with or willing to learn about the details of the schema of the various databases stored on their intranet. Furthermore, building customized web applications over each database is time consuming. These issues have limited the exploitation of the richness of intranet database information. In the field of document searches, internet search engines have popularized keyword match searches. A user submits keywords to the search engine and a ranked list of documents including a summary is returned. Internet search engines exploit an index over the document collection to enable such a search. The results of a search may retrieve documents that belong to completely different parts of the hierarchy.
While keyword searches are popular in the field of document searches, there is little support for keyword searching over a database or a collection of databases. One approach to enable keyword searching over a database is to duplicate the contents of the database as documents in a middle tier. Traditional text search techniques are then used to perform keyword searches. However, these methods typically have large space and time overheads as well as manageability concerns. In addition, the methods fail to exploit the native query processing functionality of the databases.
Keyword searches are more difficult in databases than in document collections. This is because the database has an inherent structure that may cause the required information to reside in one or more of many tables/columns. Because relational databases can search non-text data and have stored procedures that can support the notion of semantic nearness, in some ways keyword searching can be enhanced through the use of a relational database.
Some common approaches to keyword based searching of text documents are inverted lists, signature files, and prefix trees. One approach to keyword searching of XML documents parses the documents to generate inverted file information that in turn is loaded into a relational database. This approach effectively builds an inverted list that maps values to individual data objects (rows). Because this approach maps at an instance level, it suffers from scalability issues caused by duplication of data that leads to manageability problems.
The work on universal relations treats a database as a universal relation for querying purposes, thus hiding the inherent complexity of schema normalization for the purposes of querying. The challenge in this approach is to map a selection query over the universal relation to a SQL over the normalized schema. However, keyword queries are analogous to specifying universal relation queries without naming the columns involved in the selection conditions. Thus universal relation concepts are not directly applicable to keyword searches over relational databases. xe2x80x9cDataSpotxe2x80x9d is a commercial system that supports keyword-based searches by extracting the content of the database into a xe2x80x9chyperbasexe2x80x9d. Subsequently, all keyword searches are made over the hyperbase. Thus, this approach duplicates the content of the database and suffers from the scalability issues already discussed. Most major commercial database vendors allow a full text search engine to be invoked while processing SQL. The full text search capability is invoked by virtue of specialized predicates in SQL that are processed by the full text engine. Since a xe2x80x9chitxe2x80x9d using keyword may span multiple tables, full text searching is not sufficient to support keyword search over databases. Furthermore, as already discussed, such engines may encounter serious scalability problems.
Generalized matching capabilities have been used in the document retrieval field to enhance the functionality of keyword searches. Most traditional indexes used in relational databases such as a B+tree are incapable of returning cells (a cell refers to a particular column in a given row) that contain partial matches to a keyword. However, some commercial database systems have recently been enhanced to support full-text indexes that enable generalized search in text columns.
Searching over relational databases can be made more efficient by preprocessing the database to form an inverted list or lookup table of the data record components at an interim level of granularity. The list can be searched to generate a focused query on the database tables that searches only the sets of data records identified in the initial search of the index.
Special inverted lists can be constructed to support generalized keyword matching, such as sub-string or stemmed matches. These lists contain components of data records mapped to a region of the database in which they can be found. Regions of the database having components found in the search keyword are identified and a query is constructed that corresponds to the given search criteria. The query is executed on the identified regions to efficiently retrieve records matching the search criteria.
The inverted list includes a plurality of columns in which a keyword is placed in one column and a prefix component of the data record having a predetermined number of characters is placed in another column. Traditional B+ indexes are constructed for the columns. The inverted list is accessed to determine what tables/columns and prefixes have been associated with a given keyword and then a search query is constructed that searches for each prefix/keyword combination in the returned tables/columns.