Databases are the principal way in which information is stored. The most commonly used type of database is a relational database, in which information is stored in tables called relations. Relational databases are described in A First Course in Database Systems by Ullman and Widom, Prentice Hall, 1997, and in An Introduction to Database Systems, by C. J. Date, Addison Wesley, 1995.
Each entry in a relation is typically a character string or a number. Generally relations are thought of as sets of tuples, a tuple corresponding to a single row in the table. The columns of a relation are called fields.
Commonly supported operations on relations include selection and join. Selection is the extraction of tuples that meet certain conditions. Two relations are joined on fields F1 and F2 by first taking their Cartesian product (the Cartesian product of two relations A and B is the set of all tuples (a1, . . . , am, b1, . . . , bn), where (a1, . . . , am) is a tuple from A and (b1, . . . , bn) is a tuple from B) and then selecting all tuples such that F1=F2. This leads to a relation with two equivalent fields, so usually one of these is discarded.
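The join just described can be sketched directly from its definition. The following is an illustrative sketch only, not part of any system referenced above; the relations and their contents are hypothetical, modeled loosely on FIG. 1, and fields are addressed by position.

```python
def join(a, b, f1, f2):
    """Join relation `a` (on field index f1) with relation `b` (on field
    index f2): take the Cartesian product, select tuples where F1 = F2,
    and discard the now-duplicate field from the second relation."""
    result = []
    for ta in a:                      # Cartesian product of the two relations
        for tb in b:
            if ta[f1] == tb[f2]:      # selection: keep tuples with F1 = F2
                # drop the equivalent field from the second tuple
                result.append(ta + tb[:f2] + tb[f2 + 1:])
    return result

# Hypothetical relations, loosely modeled on Q and R of FIG. 1,
# joined on a MovieID-like field.
q = [("12:30", 11, "Queen of Outer Space", 137),
     ("14:00", 5, "Space Balls", 251)]
r = [(137, "embarrassingly bad..."), (251, "a parody of...")]
print(join(q, r, 3, 0))
```

A real database management system would of course not materialize the full Cartesian product; this sketch only makes the definition concrete.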
Joining relations is the principal means of aggregating information that is spread across several relations. For example, FIG. 1 shows two sample relations Q 101 and R 102, and the result of joining Q and R (the "Join" of Q and R) 103 on the fields named MovieID (the columns indicated by 104.) For reasons of efficiency, relations are usually joined on special fields that have been designated as keys, and database management systems are implemented so as to efficiently perform joins on fields that are keys.
In most databases, each tuple corresponds to an assertion about the world. For instance, the tuple <12:30, 11, "Queen of Outer Space (Zsa Zsa Gabor)", 137> (the row indicated by 105) in the relation Q 101 of FIG. 1 corresponds to the assertion "the movie named 'Queen of Outer Space', starring Zsa Zsa Gabor, will be shown at 12:30 on channel 11."
Known systems can represent information that is uncertain in a database. One known method associates every tuple in the database with a real number indicating the probability that the corresponding assertion about the world is true. For instance, the tuple described above might be associated with the probability 0.9 if the preceding program was a major sporting event, such as the World Series. The uncertainty represented in this probability includes the possibility, for example, that the World Series program may extend beyond its designated time slot. Extensions to the database operations of join and selection useful for relations with uncertain information are also known. One method for representing uncertain information in a database is described in "Probabilistic Datalog--a Logic for Powerful Retrieval Methods" by Norbert Fuhr, in Proceedings of the 1995 ACM SIGIR Conference on Research in Information Retrieval, pages 282-290, New York, 1995. Other methods are surveyed in Uncertainty Management in Information Systems, edited by Motro and Smets, Kluwer Academic Publishers, 1997. Database systems that have been extended in this way are called probabilistic databases.
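One common semantics for joins over such probability-annotated tuples can be sketched as follows. This is an illustrative sketch, not the method of any particular reference above: it assumes tuples are probabilistically independent, so a joined tuple's probability is the product of the probabilities of the tuples it was formed from. All data below are hypothetical.

```python
def prob_join(a, b, f1, f2):
    """Join two relations whose elements are (tuple, probability) pairs.
    Under an independence assumption, the probability of a joined tuple
    is the product of the probabilities of its two parent tuples."""
    out = []
    for (ta, pa) in a:
        for (tb, pb) in b:
            if ta[f1] == tb[f2]:
                out.append((ta + tb[:f2] + tb[f2 + 1:], pa * pb))
    return out

# The program listing is uncertain (0.9, e.g. a possible sports overrun);
# the review relation is taken as certain (1.0).
q = [(("12:30", 11, "Queen of Outer Space", 137), 0.9)]
r = [((137, "embarrassingly bad..."), 1.0)]
print(prob_join(q, r, 3, 0))
```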
Another way of storing information is with a text database. Here information is stored as a collection of documents, also known as a corpus. Each document is simply a textual document, typically in English or some other human language. One standard method for representing text in such a database so that it can be easily accessed by a computer is to represent each document as a so-called document vector. A document vector representation of a document is a vector with one component for each term appearing in the corpus. A term is typically a single word, a prefix of a word, or a phrase containing a small number of words or prefixes. The value of the component corresponding to a term is zero if that term does not appear in the document, and non-zero otherwise.
Generally the non-zero values are chosen so that words that are likely to be important have larger weights. For instance, words that occur many times in a document, or words that are rare in the corpus, have large weights. A similarity function can then be defined for document vectors, such that documents with similar term weights have high similarities, and documents with different term weights have low similarities. Such a similarity function is called a term-based similarity metric.
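The weighting scheme described above can be sketched with a simple TF-IDF-style weight and cosine similarity as the term-based metric. This is a minimal illustration under assumptions of our own (whitespace tokenization, one particular idf formula), not the scheme of any specific system referenced in this section.

```python
import math
from collections import Counter

def doc_vector(doc, corpus):
    """Sparse document vector: one weight per term in the document.
    Terms frequent in the document, or rare in the corpus, get large weights."""
    n = len(corpus)
    tf = Counter(doc.lower().split())            # term frequency in this document
    vec = {}
    for term, f in tf.items():
        # document frequency: number of corpus documents containing the term
        df = sum(1 for d in corpus if term in d.lower().split())
        vec[term] = f * math.log((1 + n) / (1 + df))
    return vec

def cosine(u, v):
    """Term-based similarity: high when term weights are similar."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = ["embarrassingly bad science fiction",
          "a charming romantic comedy",
          "truly bad science fiction"]
v0 = doc_vector(corpus[0], corpus)
v2 = doc_vector(corpus[2], corpus)
print(cosine(v0, v2))
```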
An operation commonly supported by such text databases is called ranked retrieval. The user enters a query, which is a textual description of the documents he or she desires to be retrieved. This query is then converted into a document vector. The database system then presents to the user a list of documents in the database, ordered (for example) by decreasing similarity to the document vector that corresponds to the query.
As an example, the Review column (the column indicated by 107) of relation R 102 in FIG. 1 might be instead stored in a text database. The answer to the user query "embarrassingly bad science fiction" might be a list containing the review of "Queen of Outer Space" as its first element, and the review of "Space Balls" as its second element.
In general, the user will only be interested in seeing a small number of the documents that are highly similar. Techniques are known for efficiently generating a reduced list of documents, say of size K, that contains all or most of the K documents that are most similar to the query vector, without generating as an intermediate result a list of all documents that have non-zero similarity to the query. Such techniques are described in Chapters 8 and 9 of Automatic Text Processing, edited by Gerard Salton, Addison Wesley, Reading, Mass., 1989, and in Query Evaluation: Strategies and Optimizations by Howard Turtle and James Flood in Information Processing and Management, 31(6):831-850, November 1995.
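The idea of retrieving the best K documents without scoring the whole corpus can be sketched with an inverted index: only documents sharing at least one query term ever receive a score, and a heap keeps the K best. This is an illustrative sketch, not the particular optimizations of the references above, and it uses plain term counts as weights for brevity where a real system would use TF-IDF-style weights.

```python
import heapq
from collections import defaultdict

def build_index(corpus):
    """Inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(corpus):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, w in counts.items():
            index[term].append((doc_id, w))
    return index

def top_k(query, index, k):
    """Score only documents appearing in some query term's posting list,
    then take the K largest scores without sorting all candidates."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += w
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

corpus = ["embarrassingly bad science fiction",
          "a fine courtroom drama",
          "bad acting but good fiction"]
index = build_index(corpus)
print(top_k("bad science fiction", index, 2))
```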
In some relational database management systems (RDBMS) relations are stored in a distributed fashion, i.e., different relations are stored on different computers. One issue which arises in distributed databases pertains to joining relations stored at different sites. In order for this join to be performed, it is necessary for the two relations to use comparable keys. For instance, consider two relations M and E, where each tuple in M encodes a single person's medical history, and each tuple in E encodes data pertaining to a single employee of some large company. Joining these relations is feasible if M and E both use social security numbers as keys. However, if E uses some entirely different identifier (say an employee number), then the join cannot be carried out, and there is no known way of aligning the tuples in E with those in M. To take another example, the relations Q 101 and R 102 of FIG. 1 could not be joined unless they both contained a similar field, such as the MovieID field (column 104.)
In practice, the presence of incomparable key fields is often a problem in merging relations that are maintained by different organizations. A collection of relations that are maintained separately are called heterogeneous. The problem of providing access to a collection of heterogeneous relations is called data integration. The process of finding pairs of keys that are likely to be equivalent is called key matching.
Techniques are known for coping with some sorts of key mismatches that arise in accessing heterogeneous databases. One technique is to normalize the keys. For instance, in the relations Q 101 and R 102 in FIG. 1, suppose that numeric MovieID's are not available, and it is desirable to join Q 101 and R 102 on strings that contain the name of the movie, specifically, the MovieName field (the column indicated by 106) of Q 101, and the underlined section of the Review field (the column indicated by 107) of R 102. One might normalize these strings by removing all parenthesized text (which contains actors' names in Q 101, and a rating in R 102). A data integration system based on normalization of keys is described in Querying Heterogeneous Information Sources Using Source Descriptions, by Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille, in Proceedings of the 22nd International Conference on Very Large Databases (VLDB-96), Bombay, India, September 1996.
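The normalization step described above, removing parenthesized text, is simple enough to show concretely. The sketch below is illustrative only; the case-folding and whitespace handling are assumptions of ours, not part of any referenced system.

```python
import re

def normalize_key(s):
    """Normalize a movie-name key: strip parenthesized text (actors'
    names or ratings), collapse whitespace, and fold case."""
    s = re.sub(r"\([^)]*\)", "", s)     # remove all parenthesized text
    return " ".join(s.split()).lower()  # collapse whitespace, fold case

print(normalize_key("Queen of Outer Space (Zsa Zsa Gabor)"))
# -> "queen of outer space"
```

After normalization, keys from both relations can be compared with ordinary string equality, so a standard join applies.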
Another known technique for handling key mismatches is to use an equality predicate, a function which, when called with arguments Key1 and Key2, indicates if Key1 and Key2 should be considered equivalent for the purpose of a join. Generally such a function is of limited applicability because it is appropriate only for a small number of pairs of columns in a specific database. The use of equality tests is described in Identification and Resolution of Semantic Heterogeneity in Multidatabase Systems, by Douglas Fang, Joachim Hammer, and Dennis McLeod, in Multidatabase Systems: An Advanced Solution for Global Information Sharing, pages 52-60, IEEE Computer Society Press, Los Alamitos, Calif., 1994. Both normalization and equality predicates are potentially expensive in terms of human effort: for every new type of key field, a new equality predicate or normalization procedure must be written by a human programmer.
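An equality predicate of the kind described above might look like the following. This is a hypothetical predicate for movie-name keys, written by us for illustration; its narrowness (it only makes sense for this one pair of columns) is exactly why the approach is labor-intensive.

```python
import re

def movie_keys_equal(key1, key2):
    """Domain-specific equality predicate for movie-name keys: ignores
    case, punctuation, and parenthesized annotations. Appropriate only
    for this one pair of columns; each new key type needs a new test."""
    def canon(s):
        s = re.sub(r"\([^)]*\)", "", s)        # drop parenthesized text
        s = re.sub(r"[^a-z0-9 ]", "", s.lower())  # drop punctuation
        return " ".join(s.split())
    return canon(key1) == canon(key2)
```

During a join, the predicate replaces the test F1=F2: a pair of tuples is kept whenever the predicate accepts their two keys.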
It is often the case that the keys to be matched are strings that name certain real-world entities. (In our example, for instance, they are the names of movies.) Techniques are known for examining pairs of names and assessing the probability that they refer to the same entity. Once this has been done, then a human can make a decision about what pairs of names should be considered equal for all subsequent queries that require key matching. Such techniques are described in Record Linkage Techniques--1985, edited by B. Kilss and W. Alvey, Statistics of Income Division, Internal Revenue Service Publication 1299-2-96, available from {http://www.bts.gov/fcsm/methodology/}, 1985, as well as in the Merge/purge Problem for Large Databases, by M. Hernandez and S. Stolfo, in Proceedings of the 1995 ACM SIGMOD, May 1995, and Heuristic Joins to Integrate Structured Heterogeneous Data, by Scott Huffman and David Steier, in Working Notes of the AAAI Spring Symposium on Information Gathering In Heterogeneous Distributed Environments, Palo Alto, Calif., March 1995, AAAI Press.
Many of these techniques require information about the types of objects that are being named. For instance, Soundex is often used to match surnames. An exception to this is the use of the Smith-Waterman edit distance, which provides a general similarity metric for any pair of strings. The use of the Smith-Waterman edit distance metric for key matching is described in An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records by A. Monge and C. Elkan, in the Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, May 1997.
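The Smith-Waterman score is a standard dynamic program, sketched compactly below. The match, mismatch, and gap costs are arbitrary illustrative choices of ours, not those used in the cited work; what makes the metric domain-independent is that it applies to any pair of strings without knowledge of what they name.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Score of the best-scoring local alignment between strings s and t.
    Higher scores indicate more similar strings."""
    rows, cols = len(s) + 1, len(t) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            h[i][j] = max(0,                       # local alignment may restart
                          h[i - 1][j - 1] + sub,   # substitute / match
                          h[i - 1][j] + gap,       # gap in t
                          h[i][j - 1] + gap)       # gap in s
            best = max(best, h[i][j])
    return best
```

Because only a score threshold (rather than exact equality) decides whether two keys match, the same procedure works for surnames, movie titles, or any other string-valued key.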
It is also known how to use term-based similarity functions, closely related to IR similarity metrics, for key matching. Use of term-based similarity metrics for key matching, as an alternative to Smith-Waterman, is described in The Field-Matching Problem: Algorithms and Applications by A. Monge and C. Elkan in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
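One simple instance of a term-based metric for key matching is to treat each key as a tiny document of tokens and compare token sets. The Jaccard coefficient below is our own illustrative choice, not the specific metric of the cited work; weighted variants along the lines of the document-vector similarity described earlier are also used.

```python
def jaccard(key1, key2):
    """Term-based key similarity: size of the token-set intersection
    divided by the size of the union, in [0, 1]."""
    a, b = set(key1.lower().split()), set(key2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

A pair of keys is then proposed as a match when its similarity exceeds some threshold, rather than requiring exact string equality.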
In summary, known methods require that data from heterogeneous sources be preprocessed in some manner. In particular, the data fields that will be used as keys must be normalized, using a domain-specific procedure, or a domain-specific equality test must be written, or a determination as to which keys are in fact matches must be made by a user, perhaps guided by some previously computed assessment of the probability that each pair of keys matches.
All of these known procedures require human intervention, potentially for each pair of data sources. Furthermore, all of these procedures are prone to error. Errors in the process of determining which keys match will lead to incorrect answers to queries to the resulting database.
What is needed is a way of accessing data from many heterogeneous sources without any preprocessing steps that must be guided by a human. Furthermore, when pairs of keys from different sources are assumed to match, the end user should be alerted to these assumptions, and provided with some estimate of the likelihood that the assumptions are correct, or other information with which the end user can assess the quality of the result.