Interoperability between database systems and other forms of information repositories is becoming an increasingly important area of development. The goal is to allow a single query to access distributed, often heterogeneous data sources and search engines. One example is web meta-search engines (such as meta-seek), which can access a variety of search engines (such as AltaVista or Lycos) and return a single set of integrated results. Another example is directories for earth-science data, such as the Global Change Master Directory, which allows access to a variety of science data distributed at various sites through a single query interface.
Resource discovery is the process of determining the nature of the entities contained within an information repository. When a query is processed by multiple, heterogeneous repositories, an important step is determining what information is available from each repository. For example, a query that seeks areas of deforestation in the Amazon basin between 1995 and the present would need to determine whether a given repository contains appropriate data. This might involve, in increasing order of specificity, determining whether the repository is oriented towards earth sciences, whether it contains deforestation information, whether that information covers the Amazon basin, and whether the requested dates are available.
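The sequence of increasingly specific checks described above can be sketched as follows. This is only an illustrative sketch: the RepositoryDescription record, its fields, and the discovery_level function are hypothetical names introduced here, and the sketch assumes that each repository publishes a simple description of its domain, topics, regions, and date coverage.

```python
from dataclasses import dataclass


@dataclass
class RepositoryDescription:
    """Hypothetical summary that a repository might publish about itself."""
    domain: str
    topics: set
    regions: set
    first_year: int
    last_year: int


def discovery_level(repo, domain, topic, region, start_year, end_year):
    """Return how many successive checks pass, stopping at the first failure.

    The checks run in increasing order of specificity:
    domain, then topic, then region, then date coverage.
    """
    checks = [
        repo.domain == domain,
        topic in repo.topics,
        region in repo.regions,
        repo.first_year <= start_year and end_year <= repo.last_year,
    ]
    level = 0
    for passed in checks:
        if not passed:
            break
        level += 1
    return level


# Hypothetical repository description, used only for illustration.
repo = RepositoryDescription(
    domain="earth science",
    topics={"deforestation"},
    regions={"Amazon basin"},
    first_year=1990,
    last_year=2000,
)
```

A result equal to the number of checks (four here) would indicate that the repository appears able to answer the query; a smaller value indicates the level of specificity at which the repository stops being a candidate.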
Search interoperability is generally implemented in one of two ways. The first is to define a common set of terms (ontology), and require that repositories that are to interoperate all employ the common ontology. This is feasible in well-established domains such as medicine or particle physics. The other possibility is to build translators to create mappings between a local set of terms within a repository, and a common set of terms used in formulating queries. This allows for local “dialects,” as long as the underlying semantic entities in the repository correspond to those expressed in the query.
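The translator approach can be illustrated with a minimal sketch in which each repository supplies a table mapping its local terms to the common terms used in queries. The term tables, repository names, and function names below are hypothetical and hand-built for illustration; the sketch assumes that the correspondence between local and common terms is already known.

```python
# Hypothetical local-to-common term tables for two repositories.
REPO_A_TO_COMMON = {"hardwood": "deciduous_forest", "lake": "water"}
REPO_B_TO_COMMON = {"broadleaf": "deciduous_forest", "city": "urban"}


def translate_query(common_term, local_to_common):
    """Return the local labels of one repository that map to a common term."""
    return sorted(
        local for local, common in local_to_common.items()
        if common == common_term
    )


def repositories_for(common_term, repositories):
    """Return {repository name: local labels} for repositories that can
    answer a query expressed in the common term."""
    matches = {}
    for name, mapping in repositories.items():
        local_terms = translate_query(common_term, mapping)
        if local_terms:
            matches[name] = local_terms
    return matches


repositories = {"repo_a": REPO_A_TO_COMMON, "repo_b": REPO_B_TO_COMMON}
```

Such a table lets each repository keep its local dialect, but it only works when someone has already established that the local and common terms denote the same underlying entity.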
A much harder resource discovery problem, and one that has not been adequately addressed to date, is how to determine whether the entities in a repository have the same semantics as those being requested by a query. The entities in the repository may have different labels than those used in the query or they may have the same labels, but not the same meaning. For example, a query with the term “deciduous forest” may be adequately addressed by a repository that has entities labeled “hardwood,” yet it may be quite difficult to determine this correspondence. On the other hand, two different repositories may have the term “temperature,” but one may be daily maximum temperature, the other may be hourly mean temperature, and thus not correspond. It is important to be able to determine whether entities with different or identical labels actually refer to the same underlying semantic concept.
The present invention addresses a particular class of such problems—where the entities in the query and the entities in the repository are both defined in terms of a set of classes produced by supervised classifiers. In many application areas, application data is categorized using classifiers. Examples of categorization include: assigning labels of “fraudulent” and “non-fraudulent” to medical claims records, determining land cover categories such as “forest” or “water” for each region in a satellite image, or assigning a category to a news item for access by a web search engine. As can be seen in the last example, the categorization need not be a simple “flat” scheme—it may be hierarchical, or even overlapping.
Classifiers are automated procedures that take input data, and produce the appropriate categorization for each item. Medical records may be input to a classifier which will output the appropriate designation of “fraudulent” or “non-fraudulent” based on the values of individual fields in the record. Similarly, the spectral reflectance values of each individual pixel in a satellite image may be used by a land cover classifier to determine the most likely class for that pixel. The frequency and arrangement of words in a news item may be input to a news article classifier, which will produce a single category label, or a set of appropriate labels.
Classifiers may be broadly divided into two main types: unsupervised and supervised. Unsupervised classifiers assign the input data to categories or classes using techniques such as clustering; the result is an arbitrary label (e.g., a cluster number) assigned to each category. In other words, the label assigned by an unsupervised classifier does not contain semantic information. Examples of unsupervised classifiers (as described, for example, in C. H. Chen et al., “Finding Groups in Data,” World Scientific, New York, 1993) include the modified Lloyd algorithm (as described, for example, in Y. Linde et al., “An Algorithm for Vector Quantizer Design,” IEEE Trans. Communications, 28(1), pp. 84–95, January 1980), tree-structured vector quantizers (as described, for example, in K. Rose et al., “Entropy-Constrained Tree-Structured Vector Quantizer Design,” IEEE Trans. Image Processing, 5(2):393–398, February 1996), and k-means (as described, for example, in C. Chinrungrueng et al., “Optimal Adaptive K-means Algorithm with Dynamic Adjustment of Learning Rate,” IEEE Transactions on Neural Networks, 6(1), pp. 157–169, January 1995). Supervised classifiers, on the other hand, use a set of examples that are considered typical of each class, known as a training set, to “train” the algorithm that does the categorization. Different training sets produce different categorizations. A supervised classifier should therefore be considered to comprise both a classification algorithm and a training set. Types of supervised classification algorithms include the Bayes classifier, the perceptron, k-nearest-neighbor, linear discriminant functions (all described, for example, in R. O. Duda et al., “Pattern Classification and Scene Analysis,” John Wiley & Sons, 1973), CART (as described, for example, in L. Breiman et al., “Classification and Regression Trees,” Wadsworth & Brooks/Cole, 1984), and neural networks (as described, for example, in P. K. Simpson, “Artificial Neural Systems, Foundations, Paradigms, Applications and Implementations,” Pergamon Press, 1990).
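The observation that a supervised classifier comprises both an algorithm and a training set can be illustrated with a minimal sketch. The example below uses a one-dimensional nearest-neighbor rule (the k-nearest-neighbor algorithm with k = 1) chosen only for brevity; the training values, labels, and function names are hypothetical and do not correspond to any real data set.

```python
def nearest_neighbor_classifier(training_set):
    """Bind a 1-nearest-neighbor rule to a training set.

    training_set is a list of (feature value, label) pairs. The returned
    function labels a new value with the label of the closest example.
    """
    def classify(value):
        _, label = min(training_set, key=lambda pair: abs(pair[0] - value))
        return label
    return classify


# Two hypothetical training sets. Both use the label "forest", but the
# examples chosen as typical of each class differ.
training_a = [(0.2, "water"), (0.8, "forest")]
training_b = [(0.2, "water"), (0.5, "forest"), (0.8, "urban")]

classify_a = nearest_neighbor_classifier(training_a)
classify_b = nearest_neighbor_classifier(training_b)

# The same algorithm, trained on different examples, may disagree
# on the same input.
label_from_a = classify_a(0.75)
label_from_b = classify_b(0.75)
```

Here the two classifiers share an algorithm and a class label, yet they assign the same input to different classes because their training sets differ, which is why identical labels produced by different classifiers need not carry the same meaning.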