Key word searching is well known: a user enters a search query in the form of key words or search terms and Boolean operators, such as “And” or “Or”. In response, a search program or search “engine” searches for documents which include the search terms (in the case of unstructured data) or for information in tables that corresponds to the search terms (in the case of structured data). For example, Yahoo Corporation and Google Corporation provide search engines to search unstructured web pages and web files available through the Internet. As another example, the Concept Hierarchy Model (CHM) program by Clement Yu et al. and the TSIMMIS program by Hector Garcia-Molina et al. can search structured tables for data corresponding to search terms. Google Corporation also allows key word searches of images. For example, if a user defines a search query as “house and door”, the Google Image Search engine will return as search results images of houses with doors.
Some search terms, known as “homonyms”, have different meanings or contexts. Some of these search terms have different meanings globally, i.e. in unstructured documents. For example, the term “bridge” can mean a dental device or a roadway structure spanning a river. Other search terms have different meanings within heterogeneous, structured databases. For example, the search term “affiliation” as applied to an employee may mean, in one structured database, the type of work the employee performs and, in another structured database, the employee's employer. Such differences in meaning of search terms in unstructured or structured databases are called “semantic conflicts”. There are other types of semantic conflicts, such as differences in structural representations of data, differences in data models, mismatched domains, and different naming and formatting schemes used by the different databases. The database schemas described below illustrate some types of semantic conflicts that can exist in heterogeneous databases. Table 1 is an Oracle database of Engineering Faculty members of Chicago-based Universities. Table 2 is a Microsoft SQL Server database of employees of engineering-related firms.
TABLE 1
Data Model: Non-Normalized Relational
Schema (partial): Faculty (SS#, Name, Dept, Sal_Amt, Sal_Type, Affiliation, Sponsor, University . . .)
Faculty: Any tuple of the relation Faculty, identified by the key SS#
SS#: An identifier, the social security number of a faculty member
Name: An identifier, Name of a faculty member
Dept: The academic or nonacademic department to which a faculty member is affiliated
Sal_Amt: The amount of annual Salary paid to a Faculty member
Sal_Type: The type of salary such as Base Salary, Grant, and Honorarium
Affiliation: The affiliation of a faculty member, such as teaching, non-teaching, research
University: The University where a Faculty member is employed

TABLE 2
Data Model: Non-Normalized Relational
Schema (partial): Employee (ID, Name, Type, Employer, Dept, CompType, Comp, Affiliation . . .)
Employee: Any tuple of the relation Employee, identified by the key ID
ID: An identifier, the social security number of an Employee
Name: An identifier, Name of an employee
Type: An attribute describing the job category of an Employee, such as Executive, Middle Manager, Consultant from another firm, etc.
Employer: Name of the employer firm such as AT&T, Motorola, General Motors, etc.
Dept: Name of the department where an Employee works
CompType: The type of compensation given to an employee, such as Base Salary, Contract Amount
Comp: The amount of annual compensation for an employee
Affiliation: Name of the Consultant firm, such as a University Name, Andersen Consulting, . . .
There are several semantic correspondences between Table 1 and Table 2, even though some of the class names for the same type of information differ. First, the ‘Faculty’ class in Table 1 and the ‘Employee’ class in Table 2 intersect. Instances of attribute ‘SS#’ in Table 1 correspond to instances of attribute ‘ID’ in Table 2 where the employees are consultants from Chicago-based Universities. The ‘Dept’ attributes in Table 1 and Table 2 share some common domain values, as do ‘Sal_Type’ in Table 1 and ‘CompType’ in Table 2, and ‘Sal_Amt’ in Table 1 and ‘Comp’ in Table 2. These three pairs may be considered either as synonyms or homonyms depending on the nature of the query posed against these two databases. The ‘Affiliation’ attributes in Table 1 and Table 2 are homonyms, as are the ‘University’ attribute in Table 1 and the ‘Employer’ attribute in Table 2, because their domains do not overlap. The ‘University’ attribute in Table 1 and the ‘Affiliation’ attribute in Table 2 may be considered as synonyms for the subset of class ‘Employee’ where ‘Employee.Type=Consultant’, and where the values in the domain of the attribute ‘Affiliation’ in Table 2 correspond to the names of Chicago-based Universities. Semantic reconciliation approaches identify and reconcile semantic incompatibilities and distinctions such as those illustrated by the example above. The number of semantic conflicts increases as more heterogeneous data sources need to be searched.
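The conditional synonym relationship between ‘University’ in Table 1 and ‘Affiliation’ in Table 2 can be illustrated with a minimal sketch. The rows below are hypothetical and invented purely for illustration, the function name is an assumption, and representing a missing affiliation as an empty string is likewise an assumption; none of this is taken from the databases described above.

```python
# Hypothetical sample rows; all values are invented for illustration only.
faculty = [
    {"SS#": "000-00-0001", "Name": "A. Smith", "University": "University of Chicago"},
    {"SS#": "000-00-0002", "Name": "B. Jones", "University": "Illinois Institute of Technology"},
]
employees = [
    {"ID": "000-00-0001", "Name": "A. Smith", "Type": "Consultant",
     "Employer": "Motorola", "Affiliation": "University of Chicago"},
    {"ID": "000-00-0003", "Name": "C. Lee", "Type": "Consultant",
     "Employer": "AT&T", "Affiliation": "Andersen Consulting"},
    {"ID": "000-00-0004", "Name": "D. Kim", "Type": "Executive",
     "Employer": "General Motors", "Affiliation": ""},
]


def consultants_from_universities(faculty_rows, employee_rows):
    """Return Employee rows for which 'Affiliation' acts as a synonym of 'University'.

    The synonym relationship only holds for consultants whose Affiliation
    value falls inside the domain of the Faculty.University attribute.
    """
    universities = {row["University"] for row in faculty_rows}
    return [
        e for e in employee_rows
        if e["Type"] == "Consultant" and e["Affiliation"] in universities
    ]


matches = consultants_from_universities(faculty, employees)
```

Only the first employee row qualifies: the second is a consultant whose affiliation is not a university, and the third is not a consultant at all, so for those rows the two attributes remain homonyms.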
The following techniques are known to map the meaning or context of each query to heterogeneous databases, such that the query yields the desired information from each database despite semantic conflicts between the databases. For example, these techniques can be used to map the search term “class” to the foregoing Oracle and Microsoft SQL Server databases even though the search term “class” has different meanings within these heterogeneous databases. These techniques attempt to find Inter-Schema Correspondence Assertions (“ISCAs”) which correlate the original search term to the search terms or “classes” with the intended context in the heterogeneous databases.
For each term in an original or “local” query, which is being searched in or mapped against a remote database, an integrator program (such as Semantic Coordinator Over Parallel Exploration Spaces, “SCOPES”) first tries to establish anchors (or correspondences) in the remote database. Each local search query term may have several anchors. For example, there can be q terms, denoted by set Tlocal={t1, t2, t3, . . . tq}, in a query, and r matching terms, denoted by set Tremote={t′1, t′2, t′3, . . . t′r}, in the remote database. Assume that each term in Tlocal maps to each of the r terms in Tremote with some probability (or a similarity value). This forms r anchors for each of the search query terms.
An initial attempt toward reconciling Tlocal against the remote database may include arbitrarily (or randomly) selecting one anchor for each of the terms in Tlocal. For example, let Tlocal={t1, t2, t3} and Tremote={t′1, t′2, t′3, t′4}. Assume that the set of anchors denoted Au={(t1,t′4), (t2,t′3), (t3,t′2)} is considered initially while interpreting the local query against a remote database. In case the reconciliation fails with this set of anchors, the user may arbitrarily select another set of anchors to continue attempts at reconciliation.
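The anchor step described above can be sketched as follows. This is only an illustration under stated assumptions: the similarity value is a stand-in computed with Python's `difflib.SequenceMatcher` (the source does not say how SCOPES scores anchors), and the function names `build_anchors` and `pick_anchor_set` are hypothetical.

```python
import random
from difflib import SequenceMatcher


def build_anchors(local_terms, remote_terms):
    """Pair every local term with every remote term and attach a similarity value.

    Assumption: string similarity is used as a placeholder score; a real
    integrator could use any probability or similarity measure.
    """
    return {
        t: [(r, SequenceMatcher(None, t.lower(), r.lower()).ratio())
            for r in remote_terms]
        for t in local_terms
    }


def pick_anchor_set(anchors, rng):
    """Arbitrarily select one anchor (remote term) for each local term."""
    return {t: rng.choice(candidates)[0] for t, candidates in anchors.items()}


# Attribute names borrowed from Table 1 (local) and Table 2 (remote).
local_terms = ["Sal_Amt", "Sal_Type", "University"]
remote_terms = ["Comp", "CompType", "Employer", "Affiliation"]

anchors = build_anchors(local_terms, remote_terms)
anchor_set = pick_anchor_set(anchors, random.Random(0))
```

Each local term ends up with r = 4 candidate anchors, and `pick_anchor_set` returns one arbitrary choice per term. If reconciliation fails with that choice, calling it again with a different random state gives another set to try.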
According to the classification proposed by Naiman & Ouksel (in a document entitled “A Classification of Semantic Conflicts in Heterogeneous Database Systems”, published in Journal of Organizational Computing, 5(2), 167-193), there exist twelve possible semantic relationships between any two terms or concepts from different databases. The classification allows each of these twelve cases to be represented as an Inter-Schema Correspondence Assertion (ISCA). For example, let the sets of ISCAs corresponding to anchors (t1, t′4), (t2, t′3), and (t3, t′2) be denoted by sets ISCA (t1, t′4)={a1, a2, . . . a12}, ISCA (t2, t′3)={b1, b2, . . . b12} and ISCA (t3, t′2)={c1, c2, . . . c12} respectively, where all ai, bi and ci (1 ≤ i ≤ 12) denote different inter-schema correspondence assertions from the classification. Each member of the above three sets, ISCA (t1, t′4), ISCA (t2, t′3) and ISCA (t3, t′2), is of the form [Assert (x, y), naming, abstraction, heterogeneity], where x corresponds to an element in the local database schema, y corresponds to an element in the remote database schema, naming corresponds to a naming relationship between x and y, abstraction corresponds to an abstraction relationship between x and y, and heterogeneity denotes the relative positioning of x and y in their respective schemas. Without complete semantic knowledge of the remote database, any of the twelve inter-schema correspondence assertions for each anchor may be considered plausible unless refuted by contradictory evidence.
The end user can choose one ISCA each from the sets ISCA (t1, t′4), ISCA (t2, t′3) and ISCA (t3, t′2) such that the resulting set of ISCAs forms a consistent (or non-contradictory) and contextually proper interpretation for the query. In the absence of complete knowledge, each combination set resulting from the Cartesian product of sets ISCA (t1, t′4), ISCA (t2, t′3) and ISCA (t3, t′2) represents one plausible set of assertions. For example, the combination set {a1, b2, c9} represents a plausible set of assertions. However, not all of these combination sets may be consistent (or non-contradictory) with respect to the assertions contained within the sets. Theoretically, in the worst case scenario, the total number of sets of plausible inter-schema correspondence assertions which result from the Cartesian product can be determined as follows. Let Tlocal={t1, t2, . . . , tq} and Tremote={t′1, t′2, . . . t′r}.
In the worst case scenario, assume that there exist ‘r’ anchors for each of the terms in set Tlocal. According to the Naiman & Ouksel classification there are twelve possible semantic relationships between any two terms. Therefore the total number of combination sets which may be examined during reconciliation is: |CombinationSet| = (12r)^q, where q is the number of terms in a query and r is the total number of matching terms in a remote database where each one of the q terms can be mapped to each of the r terms in a remote database with some probability (or a similarity value). There are known techniques to reduce the number of possible semantic relationships and interpretations; however, many possibilities still remain. While the foregoing techniques are viable, they are difficult and time consuming because of the many possible semantic relationships and interpretations between any two search terms.
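To give a sense of the scale involved, the following sketch counts combination sets, reading the worst-case count above as 12·r raised to the power q. The function names are hypothetical, and each assertion is represented only by its index 1 to 12, since the twelve relationships themselves are not enumerated here.

```python
from itertools import product

ISCA_KINDS = 12  # relationships in the Naiman & Ouksel classification


def combination_count(q, r, kinds=ISCA_KINDS):
    """Worst-case number of combination sets: (kinds * r) ** q."""
    return (kinds * r) ** q


def enumerate_combinations(anchor_set, kinds=ISCA_KINDS):
    """Enumerate the Cartesian product of ISCA choices for one fixed anchor set.

    Each element is a tuple holding one (local, remote, assertion_index)
    triple per anchor; consistency checking is deliberately left out.
    """
    per_anchor = [
        [(local, remote, i) for i in range(1, kinds + 1)]
        for local, remote in anchor_set
    ]
    return product(*per_anchor)


# The anchor set from the example: (t1, t'4), (t2, t'3), (t3, t'2).
anchor_set = [("t1", "t'4"), ("t2", "t'3"), ("t3", "t'2")]

fixed = sum(1 for _ in enumerate_combinations(anchor_set))  # 12 ** 3 = 1728
worst = combination_count(q=3, r=4)                         # 48 ** 3 = 110592
```

Even for a three-term query against four remote terms, a single fixed anchor set yields 1,728 candidate interpretations, and the worst case across all anchors yields 110,592, which is why exhaustive examination becomes impractical as q and r grow.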
Accordingly, an object of the present invention is to facilitate semantic reconciliation between unstructured documents which are searched by key words or terms.
Another object of the present invention is to facilitate semantic reconciliation between heterogeneous structured databases which are searched by key words or terms.