This invention relates to querying computer-stored databases, and more particularly to improving the likelihood of obtaining an acceptable query result without requiring additional query modification by the user.
In this specification, a database is defined as a collection of data items organized according to a data model and accessed via queries. The present invention applies to any data model; however, it is illustrated in terms of the relational data model. The relational model was proposed by E. F. Codd in "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387. Codd argued that a collection of tables or relations could be used to model real world items and to hold data about them.
In a relational database, data values are organized into columns or fields wherein each column comprises one attribute of the relation. Each column or attribute of the relation has a domain which consists of data values for that attribute. Each row of the relation, which includes one value from each attribute, is known as a record or tuple. The relational model differs from network and hierarchical models in that it does not use pointers or links. Instead, the relational model relates tuples by the values that they contain. This allows a formal mathematical foundation to be defined. Thus, a relational database can be said to be formed from a collection of relations, each of which is assigned a unique name, and which can be expressed in the form of tables. Each row in a table represents a relationship among the attributes. In this specification the terms "row", "record", and "relation" as applied to relational tables are used synonymously.
Two different languages describe a database system. Namely, one language specifies a database scheme, and the other language is used to recite database queries and updates. As to the first, a database scheme is specified by a set of definitions expressed by a data definition language (DDL). The results of compilation of DDL statements are a set of tables that are stored in a special file called either a "data dictionary" or a "data directory". Significantly, the data dictionary contains metadata. That is, the data dictionary defines each attribute in a table in terms of its type, range, etc. The dictionary is consulted before actual data is read or modified in the database.
As to the second language involved in databases, a data manipulation language (DML) enables users to access or manipulate data as organized by the appropriate data model. A procedural DML requires a user to specify what data is needed and how to access the data. One example of a procedural query language associated with relational databases is "relational algebra". It consists of a set of operations that take one or two row relations as input and produce a new relation as their result. Fundamental operations in the relational algebra include select, project, union, set difference, Cartesian product, and rename. Other operations include set intersection, natural join, division, and assignment.
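By way of illustration only, three of the relational algebra operations named above (select, project, and natural join) may be sketched in Python as follows. The representation of a relation as a list of dictionaries, the helper names, and the sample data are assumptions made for this example and form no part of the relational algebra itself.

```python
# Illustrative sketch: a relation is modeled as a list of dicts, one dict
# per tuple. Helper names and sample data are hypothetical.

def select(relation, predicate):
    """Select: keep only the tuples for which the predicate holds."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Project: keep only the named attributes, dropping duplicate tuples."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def natural_join(left, right):
    """Natural join: pair tuples that agree on every attribute they share."""
    result = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[a] == r[a] for a in shared):
                result.append({**l, **r})
    return result

employees = [
    {"name": "Ann", "dept": "R&D"},
    {"name": "Bob", "dept": "Sales"},
    {"name": "Cy", "dept": "R&D"},
]
departments = [
    {"dept": "R&D", "floor": 3},
    {"dept": "Sales", "floor": 1},
]

# Each operation takes one or two relations and yields a new relation.
rd_names = project(select(employees, lambda t: t["dept"] == "R&D"), ["name"])
joined = natural_join(employees, departments)
```

Because every operation returns a relation, the operations compose freely, as in the nested call that computes `rd_names`.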
A nonprocedural DML requires only that a user specify what data is needed without specifying how to access the data. In this regard, it should be appreciated that a "query" is a statement requesting the retrieval of information. Also, the portion of the DML that involves information retrieval is called a "query language". Unfortunately, it is common practice to use the terms "query language" and "data manipulation language" synonymously.
One form of user-friendly nonprocedural-like DML is known as "structured query language" (SQL). It uses an artful combination of relational algebra and calculus constructs. It includes features for defining the structure of the data, for modifying data in the database, and for specifying security constraints. The basic structure of an SQL expression includes the three clauses "SELECT", "FROM", and "WHERE". The clauses and their contents define predetermined query patterns. In this regard, a query is a search statement which defines the criteria that data in the form of tuples must meet in order to be part of the answer or response of the database to the query. In SQL, a query is formatted as follows:
SELECT y1, y2, . . . , ym 
FROM table X
WHERE conditions on (y1′, y2′, . . . , ym′)
The FROM clause defines the particular table(s) or set of relations in the database, denominated table X, within which the search in satisfaction of the query is to be conducted.
In the SELECT clause, the attributes y1-ym are the columns (variables) in that table X defined by the query to appear in the resulting display or printout.
In the WHERE clause, a predicate is set out where y1′-ym′ are the columns (variables) in the table expressing conditions or constraints that must be satisfied in order for a relation or record to be part of the result or answer.
It should be noted that the subset of attributes (columns y1-ym) in the SELECT clause may be different from the subset of attributes (columns y1′-ym′) in the WHERE clause. This means in practice that the result may recite only certain columns of the records found, and these are not necessarily the same as the columns on which the search for the records was based. The two sets of columns may thus totally or partially overlap or they may be completely distinct.
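A minimal sketch of such a query, using Python's standard sqlite3 module with an in-memory database, is given below. The table name `parts`, its columns, and its contents are hypothetical and serve only to show a SELECT list that is completely distinct from the columns tested in the WHERE clause.

```python
import sqlite3

# Hypothetical table and data, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (name TEXT, color TEXT, weight REAL, city TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?, ?)",
    [
        ("nut", "red", 12.0, "London"),
        ("bolt", "green", 17.0, "Paris"),
        ("screw", "blue", 17.0, "Rome"),
        ("cam", "blue", 12.0, "Paris"),
    ],
)

# The SELECT clause lists name and city; the WHERE clause constrains
# weight and color. The two sets of columns do not overlap.
rows = conn.execute(
    "SELECT name, city FROM parts "
    "WHERE weight > 15 AND color = 'blue' ORDER BY name"
).fetchall()
conn.close()
```

Here the search is based on `weight` and `color`, yet neither column appears in the returned rows.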
At the present time, the results returned by a database responsive to a query require that the user analyze the retrieved data quantitatively and qualitatively. Frequently, the query is modified, applied to the database, and the results again evaluated. The overall process is iterative, labor-intensive, distracting, and consumes significant computational and storage resources.
In the prior art, several processes are known which interactively aid the user in query modification during one or more iterations. Reference should be made to Fleischman et al., U.S. Pat. No. 5,388,259, "System for Accessing a Database With an Iterated Fuzzy Query Notified by Retrieval Response", issued Feb. 7, 1995; and Li et al., U.S. Pat. No. 5,608,899, "Method and Apparatus for Searching a Database by Interactively Modifying a Database Query", issued Mar. 4, 1997.
Fleischman discloses that a statistical membership function between retrieved values and particular attributes (column variables) can be used to electronically identify selected ones of the retrieved values in order to satisfy imprecise queries. The results are then ordered according to the strength of their membership function. More particularly, a retrieved value either exactly satisfies a precise predicate or fails to. The satisfaction may be represented by a Boolean logical 1, while the failure to satisfy may be represented by a Boolean logical 0. In contrast, an imprecise predicate usually cannot establish with certainty whether retrieved data, which by its nature is ambiguous or difficult to quantify exactly, satisfies it. The resolution of such ambiguity is treated by fuzzy set theory. Although the Fleischman system flexibly defines search criteria and assists in interpreting retrieved values, it does not expand the exploration beyond the bounds defined in the original query.
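The distinction between a precise predicate and an imprecise one may be illustrated by the following sketch. The linear membership function, the function names, and the sample values are assumptions chosen for simplicity; they are not taken from Fleischman, whose statistical membership function is not reproduced here.

```python
def crisp(value, low, high):
    """Precise predicate: Boolean 1 if the value lies in [low, high], else 0."""
    return 1 if low <= value <= high else 0

def fuzzy_near(value, target, spread):
    """Imprecise predicate (hypothetical): degree of membership in [0, 1],
    falling off linearly with distance from the target."""
    return max(0.0, 1.0 - abs(value - target) / spread)

ages = [25, 38, 40, 47, 70]

# A precise predicate yields only 0 or 1 for each retrieved value.
crisp_flags = [crisp(a, 35, 45) for a in ages]

# An imprecise predicate ("about 40") yields graded memberships, so the
# retrieved values can be ordered by membership strength.
ranked = sorted(ages, key=lambda a: fuzzy_near(a, 40, 20), reverse=True)
```

In both cases the values considered remain those admitted by the original query; only their interpretation and ordering change.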
Li discloses an arrangement for graphically displaying returned values in at least two dimensions responsive to a separately displayed database query. A user can interactively modify the query by graphically adjusting the bounds of the displayed search predicate, i.e., the SQL WHERE clause. In this regard, Li, like Fleischman, aids in flexibly redefining search bounds within the scope of the original query.
It is accordingly an object of this invention to devise a machine-implementable method and apparatus for automatically extending the scope of a query search utilizing both retrieved values and association with variables not specified in the original query.
It is another object of this invention that such machine-implementable method and apparatus display retrieved values which either satisfy the query predicate or exhibit a substantial similarity to those retrieved values that satisfy the query predicate.
It was unexpectedly observed that a machine-implementable method and apparatus could be used to reiteratively extend the scope and the results of the query (a) if machine selected, strongly associated, not-previously-selected variables were added to a modified query; and (b) if the extended tuples of values resulting were filtered through a machine-based similarity evaluation among rows, records, or relations.
More particularly, the foregoing objects are believed satisfied by a machine-implementable method for automatic extension of results obtained by querying a database of relationally organized data expressed in tabular row and column format. Each database table includes a plurality of rows (tuples) and a plurality of columns (variables) defined over counterpart domains of values. In the method of the invention, a formatted query designating at least one table, at least one column variable, and at least one predicate constraint is applied to the designated table, and tuples of values satisfying the predicate constraints are retrieved. Next, indexes of association among the previously selected and nonselected column variables are computed. After this, the formatted query is modified to include those nonselected column variables having respective indexes of association exceeding a predetermined threshold, and the table is reaccessed with the modified query. These steps are repeated until a stop condition occurs.
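The variable-selection loop just described may be sketched as follows. This summary does not define the index of association, so the absolute Pearson correlation between numeric columns is used purely as a stand-in; the function names, the column-oriented table layout, the threshold, and the iteration limit serving as the stop condition are likewise assumptions. Reaccessing the table with the modified query is omitted from the sketch.

```python
from math import sqrt

def abs_correlation(xs, ys):
    """Stand-in index of association: |Pearson r| of two numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0          # a constant column carries no association
    return abs(cov) / sqrt(vx * vy)

def extend_variables(table, selected, threshold, max_iterations):
    """Repeatedly add nonselected columns whose index of association with
    any already-selected column exceeds the threshold."""
    selected = list(selected)
    for _ in range(max_iterations):      # externally supplied stop condition
        added = [
            col for col in table
            if col not in selected
            and any(abs_correlation(table[col], table[s]) > threshold
                    for s in selected)
        ]
        if not added:                    # no new variables: stop early
            break
        selected.extend(added)
    return selected

# Hypothetical table stored column-wise: column name -> list of values.
table = {
    "price":  [10, 20, 30, 40, 50],
    "weight": [1.0, 2.1, 2.9, 4.2, 5.0],
    "rating": [3, 1, 4, 1, 5],
    "volume": [2.0, 4.1, 6.2, 8.0, 10.1],
}
result = extend_variables(table, ["price"], threshold=0.9, max_iterations=3)
```

With this sample data, `weight` and `volume` track `price` closely and are added, while `rating` is only weakly associated with any selected column and is left out.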
During each iteration, values of similarity are computed among the tuples returned by the modified query from the designated table. The tuples are then filtered in that only those tuples substantially similar to tuples originally elicited are added to the query return. This ensures that any tuples elicited by way of the added variables have a substantial likelihood of being of interest in satisfaction of the query. Since the stop conditions are extrinsically supplied, the duration of the method is always user controllable.
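The filtering step may be sketched as follows, under stated assumptions. The similarity measure is not specified in this summary; the sketch uses a hypothetical measure, 1 / (1 + Euclidean distance), over numeric tuples, and treats a candidate as "substantially similar" when that measure meets a caller-chosen minimum for at least one originally retrieved tuple.

```python
from math import sqrt

def similarity(a, b):
    """Hypothetical similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

def filter_similar(original, candidates, min_similarity):
    """Add to the query return only those candidate tuples that are
    substantially similar to at least one originally elicited tuple."""
    kept = list(original)
    for cand in candidates:
        if cand in kept:
            continue                      # already part of the return
        if any(similarity(cand, o) >= min_similarity for o in original):
            kept.append(cand)
    return kept

# Tuples originally elicited by the query, and tuples returned by the
# modified query (sample data for illustration only).
original = [(1.0, 2.0), (2.0, 2.0)]
candidates = [(1.5, 2.0), (9.0, 9.0), (1.0, 2.0)]
result = filter_similar(original, candidates, min_similarity=0.6)
```

In this example the tuple (1.5, 2.0) lies close to an originally retrieved tuple and is added, whereas (9.0, 9.0) is discarded as dissimilar.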