1. Field of the Invention
The invention concerns information retrieval generally. More specifically, the invention concerns optimization of query plans for retrieving information from a number of information sources.
2. Description of the Prior Art
Networks now connect computers with information sources located anywhere in the world. The Internet, for example, provides access to a large and diverse body of information, such as technical papers, public domain software, directory services and various databases (e.g., airline schedules, stock market listings). It is thus now possible to speak of global information systems.
Being aware that interesting and useful information exists is insufficient if one cannot find the relevant information sources. The large variety of information sources and the disparity of interfaces among them render the task of locating and accessing information over the network even more difficult. In order to address some of these problems, it is important to understand the characteristics of the available information sources.
Autonomy: The first characteristic is the autonomy of the information sources. This means that the information sources (i.e., sites) maintain and update their own data, and they are not willing to change their operations to suit the needs of the global information system. At best, an information source is willing to provide a description of its contents.
Dynamic nature: The second characteristic of information sources is their dynamic nature. Specifically, new information sources are added, while existing information sources disappear or are no longer maintained.
Number of sources: The third characteristic is the very large number of information sources.
Cost of access: The fourth characteristic is that accessing an information source over the network is expensive (both in time and possibly in money).
The first characteristic distinguishes global information systems from distributed databases, where the information sources are not autonomous, but under the control of co-operating database administrators. The second characteristic sets global information systems apart from enterprise-wide databases, where the set of information sources is relatively stable (though the contents may change, of course). The third characteristic differentiates global information systems from current-day multidatabases, that is, systems in which the information is contained in a number of different kinds of database systems.
These characteristics of the information sources necessitate the following features in an architecture for global information systems.
World-view: A consequence of the very large number of information sources is that it is unreasonable to expect users to interact separately with each source. The users need a conceptually uniform view of the information space, against which they can formulate queries. There need not be a single such view of the information, however; there can be many user-specific and domain-specific world-views. In order to relate the contents of the information sources to the world-view, we need site descriptions.
Expressive site descriptions: A consequence of the large number of information sources and the high cost of accessing these sources is that in answering queries, a global information system must minimize the number of information sources (i.e., sites) that are accessed. Therefore, a key requirement of the site descriptions is that they be rich enough to express various constraints that enable the system to prune the sources accessed.
Extensibility: A consequence of the dynamic nature of the information sources is that it should be possible to easily extend the world-view to manage new kinds of information provided by the sources.
Query only: A consequence of the autonomy of information sources is that while a global information system might be able to support global querying, it is unreasonable to expect that it will support global updating.
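The role of expressive site descriptions in pruning can be illustrated with a minimal sketch. It is not the disclosed implementation. It assumes, purely for illustration, that each site description records the world-view relation a source contributes to and a range constraint on a year attribute. The names SiteDescription, relevant_sites, and the example sources are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteDescription:
    """Hypothetical description of what one information source contains."""
    name: str
    relation: str   # world-view relation the source contributes to (assumed)
    min_year: int   # the source holds only records with year >= min_year
    max_year: int   # ... and year <= max_year


def relevant_sites(sites, relation, year_from, year_to):
    """Return the sites whose description is consistent with the query.

    A site is pruned when it describes a different relation or when its
    year range cannot overlap the range requested by the query.
    """
    return [
        s for s in sites
        if s.relation == relation
        and s.min_year <= year_to
        and s.max_year >= year_from
    ]


# Illustrative site descriptions (invented for this sketch).
sites = [
    SiteDescription("archive_a", "paper", 1970, 1989),
    SiteDescription("archive_b", "paper", 1985, 1995),
    SiteDescription("fares_c", "flight", 1990, 1995),
]

# Query: papers from 1990 through 1995. Only archive_b can hold answers.
selected = relevant_sites(sites, "paper", 1990, 1995)
print([s.name for s in selected])  # ['archive_b']
```

In this sketch, two of the three sources are ruled out from their descriptions alone, so no network access to them is required.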
The parent of the present patent application disclosed an information retrieval system having some of the above features.
That information retrieval system, shown as system 101 in FIG. 1 of the present patent application, has a knowledge base system 109 which includes a domain model 111. Domain model 111 is a model of information from a specific domain. Domain model 111 has three components: world view 115, information source descriptions 113, and system-network view 117. All of these components include concepts belonging to domain model 111. World view 115 is the part of domain model 111 which is visible to a user of system 101. World view 115 is a conceptually unified view of the information space. In a preferred embodiment, world view 115 is implemented as an expressive object/relational data model. The user can pose queries in terms of the objects and the relations in the world-view, unburdened by details of data location and access. World-views are purely conceptual; all the data required to answer queries is present only in site relations at external information sources. World view 115 is related to the information in information sources 123 by means of information source descriptions 113 which provide descriptions of the contents of information sources 123 and by means of system-network view 117, which describes how the information in a particular information source 123 is accessed.
World view 115, information source descriptions 113, and system-network view 117 all include concepts in domain model 111. Knowledge base system 109 is able to classify new concepts and add them to the hierarchy of concepts already in domain model 111. Thus, if knowledge base system 109 receives information about a new information source 123, it automatically classifies new concepts relating to the information source in domain model 111.
Access plan generation and execution 119 in FIG. 1 poses sub-queries to the external sources that contain information relevant to answering the query and combines the answers to these sub-queries to answer the user query. Accessing information sources over the network is expensive, and so an important problem in generating access plans is minimizing the number of external information sources 123 that must be accessed. It is an object of the present patent application to provide improved techniques for such minimization.
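The general shape of access plan execution described above can be sketched as follows. The sketch is an illustration under stated assumptions, not the technique of the present application. External sources are replaced by in-memory stand-ins, a sub-query is modeled as a predicate over (title, year) records, and answers are combined by set union. The names make_source and execute_plan are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Row = Tuple[str, int]  # (title, year): an assumed record shape
Predicate = Callable[[Row], bool]
Source = Callable[[Predicate], Iterable[Row]]


def make_source(rows: List[Row]) -> Source:
    """Stand-in for a remote source: answers a sub-query from local rows."""
    def answer(predicate: Predicate) -> List[Row]:
        return [r for r in rows if predicate(r)]
    return answer


def execute_plan(
    sources: Dict[str, Source],
    relevant: Set[str],
    predicate: Predicate,
) -> Tuple[Set[Row], List[str]]:
    """Pose the sub-query to each relevant source and union the answers.

    Sources not named in `relevant` are never contacted, which is where
    the saving in access cost comes from.
    """
    answers: Set[Row] = set()
    accessed: List[str] = []
    for name in sorted(sources):
        if name not in relevant:
            continue
        accessed.append(name)
        answers.update(sources[name](predicate))
    return answers, accessed


# Illustrative sources (invented for this sketch).
sources = {
    "archive_a": make_source([("Paper X", 1988)]),
    "archive_b": make_source([("Paper Y", 1991), ("Paper Z", 1984)]),
    "archive_c": make_source([("Paper Y", 1991), ("Paper W", 1993)]),
}

answers, accessed = execute_plan(
    sources,
    relevant={"archive_b", "archive_c"},
    predicate=lambda row: row[1] >= 1990,
)
print(sorted(answers))  # [('Paper W', 1993), ('Paper Y', 1991)]
print(accessed)         # ['archive_b', 'archive_c']
```

Using a set for the combined answers removes the duplicate record returned by both accessed sources. A real system would also have to translate each sub-query into the access method of the particular source, which this sketch omits.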