A database system may be described as a computerized record keeping system whose overall purpose is to maintain information and to make that information available on demand. Many databases in use today are based on a "relational model" in which the database data is perceived by its users as a collection of tables. The tables in a relational database include a row of column names specifying one or more column fields, and zero or more data rows containing one value for each of the column fields. Therefore, the value contained in each field in a column represents a specific instance of that concept. For example, a table pertaining to parts in a manufacturing database may contain a column of part numbers, and each field in that column may contain a part number value that identifies a specific part or object.
One feature that distinguishes relational from nonrelational databases is the ability to "join" two or more tables. In general, a join is described as a query in which data is retrieved from the fields of more than one table (although data may also be retrieved by joining a table with itself). In order to join two or more tables, the tables need to share at least one column. Typically, the purpose of the join is to connect data about the same object across the tables being joined. In such cases, for the results of the join to be meaningful, the columns that represent the same real-world concepts in the different tables must use the same field values to represent the same instances of those object. For example, assume two columns contain values pertaining to the concept of states. If a field value that represents the state of Ohio in one table is the string "OH", then the field value for the state of Ohio in the other column must also be the string "OH" in order to effect a successful join between the tables.
Typically, the tables within the same database are joined. However, with current database technology, the tables to be joined need not be physically present in the same database. Products such as SQL Connect from Oracle and SQLNet from Microsoft, enable the use of a heterogeneous database where a collection of database tables on different hardware platforms operating under different database management systems all appear to a user to be on one machine operating under one database management system.
To illustrate the use of a heterogeneous database, consider the following situation. A user, who could be a reporter, researcher or marketer, would like to quantitatively judge the effect of personal computers on college education. She would like to find out the correlation between the average GRE scores and average GPAs vs. the per capita consumption of computers by the students at a number of universities.
Assume that all the information that is required to answer this question is available on-line from the different databases that are maintained by different entities. An Examination Testing Service maintains a database including GRE scores, a marketing research firm maintains a database including the number of computers sold to students, and registrar offices at the universities maintain databases that include GPAs.
Despite this, getting the answer could take weeks if not more due to the following process. Generally speaking, an end user does not have the expertise to perform the query. Therefore, the user typically communicates the query to a management information specialist (MIS) who is responsible for such tasks. The MIS person then contacts the administrators of the relevant databases to solicit the database schemes (information regarding the tables, such as the name of the column fields, the data types and formats, etc.). After obtaining the schema's of the databases, the MIS person writes the user's query in a standard query language (SQL), which joins the relevant tables from the different databases. In response to submitting the query to the databases, the requested information is retrieved and the MIS person communicates the answer to the user.
This process is not only very slow, it is also very expensive because of the large amount of inter-human communication involved. Most of the effort is exerted by the MIS person when determining which databases might be relevant and then eliciting the schema information from the administrators of each of the databases. To summarize the process, the end user must inform the MIS person what information is requested, and then the MIS person must determine where it is, and how to obtain it by soliciting the database administrators. In a sense, the set of database administrators act as a "librarian" who knows what resides on which database and in which tables. Unfortunately, no one librarian can keep track of more than a very small fraction of all the tables available, and queries capable of joining the tables must be handwritten by the MIS person.
Part of what makes this process slow and expensive is that often, the tables that must be joined in order to answer the query were not designed to be joined in the first place. The purpose of the join is to co-identify objects across tables.
Accordingly, what is needed is an automated method and system for co-identifying objects across heterogenous information sources without user intervention. The present invention addresses such a need.