The present invention relates to computerized database systems and, in particular, to a database system that provides integrated text retrieval capability.
Conventional databases, including relational and object relational databases, usually consist of a number of tables. Each table consists of a number of tuples (rows) that share some common attribute (column). The value of an attribute is usually a simple data type like an integer, floating point number, date or string.
A query over such a database consists of finding all the tuples in one or more tables that exactly satisfy a given set of constraints represented by a Boolean combination of query elements. For example, a simple query might find all tuples that have attribute values that match (equal) a value of a query element. The search results can either be returned in random order or according to ascending or descending values of one or more attributes of the resulting tuples. An index using a B-tree or hash-type structure may be used to rapidly process queries without a need to review every tuple for each query.
Queries in such database systems can be considered “exact” in the sense that a given tuple either matches the constraints of the query or does not. If a tuple matches the query, then the tuple is included in the search result. If the tuple does not match the query, then the tuple is not included in the search result.
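The exact-match behavior described above can be illustrated with a minimal sketch. The table contents, the attribute names (`id`, `author`, `year`), and the helper `build_hash_index` are hypothetical and chosen only for illustration; a conventional database would typically use a B-tree or hash index maintained by the database engine rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical table: each tuple is a dict mapping attribute name to value.
rows = [
    {"id": 1, "author": "Smith", "year": 1998},
    {"id": 2, "author": "Jones", "year": 1999},
    {"id": 3, "author": "Smith", "year": 1999},
]


def build_hash_index(rows, attribute):
    """Map each value of `attribute` to the ids of the tuples holding it."""
    index = defaultdict(set)
    for row in rows:
        index[row[attribute]].add(row["id"])
    return index


author_index = build_hash_index(rows, "author")
year_index = build_hash_index(rows, "year")

# Exact query: author = 'Smith' AND year = 1999.
# Each lookup avoids scanning every tuple; AND becomes set intersection.
matches = author_index.get("Smith", set()) & year_index.get(1999, set())

# A tuple is either in the result or not; there is no notion of partial match.
print(sorted(matches))  # [3]
```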
In contrast to the above-described database system, a text retrieval system consists of a collection of text documents. Each document is treated as a collection of keywords. A query over such a collection consists of finding all the documents that “contain” one or more of a given set of keywords. The results are usually returned in order of the relevance of each document to the particular query. For example, the documents may be ranked according to how closely they match the given set of keywords or how many times the keywords are found in the document. Again, so that each document need not be reviewed for each query, a reverse index may be constructed that links each keyword to all the documents that contain it.
Queries in such text retrieval systems can be considered to be “approximate” in the sense that a document that does not contain some of the keywords in a query is not automatically discarded. Rather, it is given a low relevance. Documents with relevance above a certain threshold are returned by the system and those with lower relevance are dropped. Complex queries made up of Boolean combinations of different query elements having different keywords may also be implemented.
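The approximate behavior of a text retrieval system can be sketched as follows. This is only an illustration under stated assumptions: the documents, the function names `build_reverse_index` and `search`, and the scoring rule (relevance equals the total number of occurrences of the query keywords) are all hypothetical; the occurrence count is just one of the ranking criteria mentioned above.

```python
from collections import Counter, defaultdict

# Hypothetical document collection.
docs = {
    "d1": "database index database query",
    "d2": "text retrieval index",
    "d3": "cooking recipes",
}


def build_reverse_index(docs):
    """Map each keyword to {document id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def search(index, keywords, threshold=1):
    """Score documents by keyword occurrences (an assumed relevance rule),
    drop those below `threshold`, and return the rest in relevance order."""
    scores = Counter()
    for keyword in keywords:
        for doc_id, count in index.get(keyword.lower(), {}).items():
            scores[doc_id] += count
    ranked = [(d, s) for d, s in scores.items() if s >= threshold]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


reverse_index = build_reverse_index(docs)

# "d2" lacks the keyword "database" but is still returned, with low relevance.
results = search(reverse_index, ["database", "index"])
print(results)  # [('d1', 3), ('d2', 1)]
```

Raising `threshold` to 2 in this sketch drops `d2`, which mirrors the relevance cutoff described above.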
The different forms of query in database systems and text retrieval systems, exact and approximate respectively, have resulted in only limited attempts at combining these two types of systems. Some text retrieval systems, for example, allow the use of non-text attributes to limit the search to particular libraries or to particular documents with which attributes have been associated. Also, some databases allow keyword searches on text field attributes. Nevertheless, these systems are very rudimentary: they keep the exact query elements and the approximate query elements separate with respect to optimization and with respect to relevance, which applies only to the text retrieval query elements.
A unified approach to querying a combined database and text retrieval system is needed, one that extends the concept of relevance to all search results and that provides superior optimization opportunities.
The present invention provides a unified database/text retrieval system with an evaluation system that handles “mixture queries” composed of both exact and approximate query elements under a uniform framework. The invention allows mixture queries to be processed by a single index and preserves the properties of associativity and commutativity, allowing optimization of the query. The invention further allows relevance values to be attached to component search results from all query elements (exact or approximate) so that the search results may be ordered by relevance.
Specifically then, the present invention provides a unified database/text retrieval system having a logical data table of tuples having attributes where at least one attribute is a text document. A means is provided for receiving a query that is a Boolean combination of value-matching (exact) query elements for the non-text-document attributes and keyword-inclusion (approximate) query elements for the text document attribute. A preprocessor converts each value-matching condition to a keyword-inclusion condition using a pseudo-keyword, and an index communicating with the preprocessor provides a reverse index of keywords and pseudo-keywords to tuples.
Thus it is one object of the invention to allow text retrieval and database queries to be processed with a single logical index. It is another object of the invention to provide for a simple conversion means by which value-matching query elements may be converted to keyword-inclusion query elements.
The preprocessor preserves the Boolean combination of value-matching query elements and keyword-inclusion query elements in the corresponding keyword-inclusion query elements after the conversion of the value-matching query elements.
It is thus another object of the invention to allow a combination of database and text retrieval query elements in a query to be manipulated under the rules of associativity and commutativity to allow optimization of the query.
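The conversion of value-matching query elements to keyword-inclusion query elements, and the preservation of the Boolean structure, might be sketched as follows. Everything specific here is an assumption made for illustration and is not taken from the description: the `@attribute=value` pseudo-keyword encoding, the nested-tuple query representation, and the function names `pseudo_keyword`, `build_index`, `preprocess`, and `evaluate`. A practical encoding would need to guarantee that pseudo-keywords cannot collide with real keywords.

```python
from collections import defaultdict

# Hypothetical tuples: two non-text attributes and one text document attribute.
tuples = {
    1: {"author": "Smith", "year": 1998, "body": "query optimization in databases"},
    2: {"author": "Jones", "year": 1999, "body": "keyword retrieval and relevance"},
    3: {"author": "Smith", "year": 1999, "body": "relevance ranking for keyword queries"},
}
TEXT_ATTRIBUTE = "body"


def pseudo_keyword(attribute, value):
    # Assumed encoding of an attribute value as a pseudo-keyword.
    return f"@{attribute}={value}"


def build_index(tuples):
    """Single reverse index: keywords and pseudo-keywords -> tuple ids."""
    index = defaultdict(set)
    for tid, row in tuples.items():
        for attribute, value in row.items():
            if attribute == TEXT_ATTRIBUTE:
                for word in value.lower().split():
                    index[word].add(tid)
            else:
                index[pseudo_keyword(attribute, value)].add(tid)
    return index


def preprocess(query):
    """Rewrite ("match", attr, value) leaves as ("contains", pseudo-keyword),
    leaving the Boolean structure of the query unchanged."""
    op = query[0]
    if op == "match":
        return ("contains", pseudo_keyword(query[1], query[2]))
    if op == "contains":
        return ("contains", query[1].lower())
    if op in ("and", "or"):
        return (op,) + tuple(preprocess(q) for q in query[1:])
    raise ValueError(f"unknown operator: {op}")


def evaluate(query, index):
    """Evaluate a preprocessed query against the single index."""
    op = query[0]
    if op == "contains":
        return set(index.get(query[1], set()))
    parts = [evaluate(q, index) for q in query[1:]]
    return set.intersection(*parts) if op == "and" else set.union(*parts)


index = build_index(tuples)
query = ("and", ("match", "author", "Smith"),
         ("or", ("contains", "relevance"), ("match", "year", 1998)))
result = evaluate(preprocess(query), index)
print(sorted(result))  # [1, 3]

# Reordering the operands gives the same result (commutativity), which is
# what lets an optimizer pick the cheapest evaluation order.
reordered = ("and", ("or", ("match", "year", 1998), ("contains", "relevance")),
             ("match", "author", "Smith"))
```

Because every leaf becomes a set lookup in one index, AND and OR reduce to set intersection and union, which are associative and commutative in this sketch.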
The preprocessor may assign relevance values to tuples identified through the index from the converted, value-matching query elements.
Thus it is another object of the invention to expand the concept of relevance to exact query elements.
The relevance values assigned to tuples may be derived from: attribute values associated with value-matching query elements of the query; previous searches by a user generating the query; or the degree of consanguinity between the query entered by the user and related query elements automatically augmenting the query.
It is another objective of the invention to allow automatic relevance assignment based on a variety of different inputs.
The invention may include a means for assigning a relevance value to the component search results for both the query elements that are value-matching query elements and the query elements that are text-inclusion query elements. A combiner then combines the relevance values of all component search results to provide a relevance value for each search result meeting the query.
It is thus another object of the invention to provide the ability to combine relevance values of component search results resulting from both value-matching query elements and text-inclusion query elements.
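One way a combiner of this kind could work is sketched below. The combination rule is an assumption, not taken from the description: AND takes the minimum relevance of its operands and OR takes the maximum, a common fuzzy-logic convention chosen here because both operations are associative and commutative. The leaf names, the relevance values, and the function `combine` are likewise hypothetical.

```python
# Hypothetical component search results: leaf name -> {tuple id: relevance}.
# The value-matching leaf ("@year=1999") carries an assigned relevance of 1.0;
# the keyword leaves carry assumed text-retrieval relevance values.
leaf_scores = {
    "@year=1999": {2: 1.0, 3: 1.0},
    "relevance": {2: 0.4, 3: 0.9},
    "ranking": {3: 0.7},
}


def combine(query, leaf_scores):
    """Combine component relevances over a Boolean query tree.

    Assumed rule: AND -> minimum, OR -> maximum.
    """
    op = query[0]
    if op == "leaf":
        return dict(leaf_scores[query[1]])
    parts = [combine(q, leaf_scores) for q in query[1:]]
    if op == "and":
        # Only tuples present in every operand satisfy the conjunction.
        common = set.intersection(*(set(p) for p in parts))
        return {t: min(p[t] for p in parts) for t in common}
    if op == "or":
        # A tuple present in any operand satisfies the disjunction.
        present = set.union(*(set(p) for p in parts))
        return {t: max(p.get(t, 0.0) for p in parts) for t in present}
    raise ValueError(f"unknown operator: {op}")


query = ("and", ("leaf", "@year=1999"),
         ("or", ("leaf", "relevance"), ("leaf", "ranking")))
combined = combine(query, leaf_scores)

# Order the final search results by combined relevance, highest first.
ranked = sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
print(ranked)  # [(3, 0.9), (2, 0.4)]
```

Other rules, such as weighted sums or products, would fit the same structure; the minimum/maximum pair is used here only because it keeps the result independent of operand order.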
The foregoing objects and advantages may not apply to all embodiments of the invention and are not intended to define the scope of the invention, for which purpose claims are provided. In the following description, reference is made to the accompanying drawings, which form a part hereof, and in which there is shown by way of illustration a preferred embodiment of the invention. Such an embodiment also does not define the scope of the invention, and reference must therefore be made to the claims for this purpose.