1. Field of Invention
The present invention relates generally to information retrieval from multiple information sources. More particularly, the present invention relates to a method and system for routing a request for information to different information sources such that a response to the request is obtained quickly and efficiently.
2. Discussion of the Related Art
In the field of information management, it is often desirable to store data in a network of multiple databases, each database containing a subset of the data contained in the network. To make effective use of the information stored in such a network, it is important to be able to access the information quickly and efficiently. However, when a network contains multiple databases, locating a desired piece of data can be difficult since it requires detailed knowledge of the characteristics of each of the databases in the network in order to determine which databases contain the desired data.
One approach to lessening this difficulty has been to provide the user with a summarized description of the contents of each of the distributed databases, thus enabling the user to search those databases which, based on the description, seem most likely to contain responsive information. However, such an approach suffers from the problem that the abstracted descriptions of the databases will, by necessity, be somewhat imprecise, thereby creating the possibility that the user will not be able to locate the desired information. In addition, when there are a large number of distributed databases, even a set of descriptions of the contents of each database can be too much information for a user to process quickly and effectively. Finally, use of database descriptors presumes some level of intelligence on the part of the user, who is asked to select the descriptor or descriptors of the databases that are most likely to contain the desired data. As a result, when the “user” is a computer, such a system necessitates the use of knowledge-based algorithms, which can be complicated, costly, and prone to errors.
One way to reduce these problems would be to simply decrease the number of databases in the network, thereby decreasing the number of database descriptions and enabling each description, in turn, to be more complete. However, this approach can increase the cost of maintaining the database network, since it reduces the database administrator's flexibility to house data at the most logical location from an information-storage perspective, and can result in an inefficient use of system resources. For example, such an approach incurs the costs of transporting data to the designated storage sites, and also results in the simultaneous underutilization and overutilization of system resources as certain remote storage capabilities are not used while other storage facilities are called upon to store excessive quantities of data, necessitating the purchase of additional, or more costly, storage equipment at these sites. As a result, such an approach requires a complicated trade-off to be made between the ease of using, and the cost of administering, the database network.
Ideally, multiple databases at different locations could be utilized without increasing the complexity of using the system to the end-user, or significantly increasing the cost of operating the system to the system administrator. The physical separation of databases within the network would be transparent to the end user, enabling the user to view the entire network of distributed databases as a single database.
One approach to making the internal network architecture transparent to the user is to simply send each request for data to each of the databases in the network, thus ensuring that the user's search request will be performed on each of the databases in which responsive information, if any, is contained. There are two general ways to access each of the databases in the network: serially or in parallel. The advantage of accessing the databases serially is that only one database in the network needs to respond to the query at a time, thereby minimizing the amount of network resources being used at any given moment. However, serial access of each database in the network has serious disadvantages, foremost of which is that it can be a relatively time-consuming process, since each of the numerous databases will have to be accessed, one at a time, to ensure that all information responsive to the user's query is located.
Some of the disadvantages associated with serial access of separate databases can be avoided by accessing the databases in parallel. Under this approach, the same query is sent simultaneously to all of the databases in the network, thus avoiding the need to successively poll each different database, and, as a result, decreasing the time required to obtain a response to an information request. But parallel access has disadvantages of its own. For example, each query still requires each of the databases in the network to be accessed, thereby consuming resources at all of the databases, and incurring costs in time and money depending on how distant, or how busy, the databases are. Moreover, truly parallel access of a large number of databases can require a prohibitive amount of processing power, thereby further increasing the cost and complexity of the system.
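The difference between serial and parallel access can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the in-memory `DATABASES` dictionary, the database identifiers, the field names, and the functions `search`, `search_serial`, and `search_parallel` are all hypothetical stand-ins, and a thread pool is used here merely as one possible way to issue the queries concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the databases in the network.
DATABASES = {
    "db_a": [{"name": "Smith", "city": "Boston"}],
    "db_b": [{"name": "Jones", "city": "Denver"}],
    "db_c": [{"name": "Smith", "city": "Austin"}],
}


def search(db_id, query):
    """Return the records of one database that match every field of the query."""
    return [
        record
        for record in DATABASES[db_id]
        if all(record.get(field) == value for field, value in query.items())
    ]


def search_serial(query):
    """Poll the databases one at a time; only one database is busy at any moment."""
    results = []
    for db_id in DATABASES:
        results.extend(search(db_id, query))
    return results


def search_parallel(query):
    """Send the same query to every database at once; all databases are busy together."""
    with ThreadPoolExecutor(max_workers=len(DATABASES)) as pool:
        per_database = pool.map(lambda db_id: search(db_id, query), DATABASES)
        return [record for hits in per_database for record in hits]
```

In both variants every database is queried for every request, even though in this sample data only `db_a` and `db_c` hold records matching `{"name": "Smith"}`. The parallel form shortens the elapsed time but does not reduce the total work performed across the network.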
Accordingly, it is desirable to provide a method and system for accessing data in a network of databases quickly and efficiently, and in a manner that renders the internal architecture of the network of databases transparent to the user. The data is preferably accessed without relying on abstractions of the contents of the databases, instead relying on literal content. This method and system for accessing data in a network of databases desirably provides broad flexibility in data management and distribution across the network.
These and other advantages are achieved by the present invention, which in one exemplary embodiment provides a data retrieval system that includes a plurality of databases, each database including one or more records comprised of a plurality of fields. A search-routing database is also provided that includes one or more records comprised of a plurality of fields, one of which contains a database identifier. In addition, the system includes a proxy server for receiving a first search request and forming a modified search request, wherein the modified search request includes a subset of the fields of data contained in the first search request. The system further includes a search engine for searching the search-routing database using the modified search request and returning one or more database identifiers; a router for sending the first search request to the identified database(s); another search engine for searching the identified database(s) for data responsive to the first search request; and an output device for returning responsive data to a user.
In another exemplary embodiment of the invention, a method of retrieving data from a plurality of databases is provided. In this embodiment of the invention, a proxy server first receives an input search request having a plurality of fields from a user. Next, the proxy server creates a modified search request by extracting certain fields from the original search request. A search-routing database is then searched for data responsive to the modified search request. If responsive data is found in the search-routing database, then one or more database identifiers associated with the responsive data are returned to the proxy server. Next, the original search request is routed to the database(s) identified by the one or more database identifiers. The database(s) are searched for data responsive to the original search request. If responsive data is located, it is returned to the proxy server and ultimately to the user.
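The sequence of steps in this method can be sketched in Python as follows. This is a simplified illustration and not the claimed implementation: the sample records, the field names, the choice of `state` as the extracted routing field, and the function names `make_modified_request`, `lookup_database_ids`, and `retrieve` are all assumptions introduced for the example, and exact matching on field values is assumed as the search criterion.

```python
# Hypothetical content databases, keyed by database identifier.
DATABASES = {
    "db_east": [
        {"last_name": "Smith", "state": "MA", "phone": "555-0101"},
    ],
    "db_west": [
        {"last_name": "Jones", "state": "CA", "phone": "555-0102"},
        {"last_name": "Smith", "state": "CA", "phone": "555-0103"},
    ],
}

# Hypothetical search-routing database: each record carries the routing
# field(s) plus one field that contains a database identifier.
ROUTING_DB = [
    {"state": "MA", "db_id": "db_east"},
    {"state": "CA", "db_id": "db_west"},
]

# Assumed subset of request fields used to form the modified search request.
ROUTING_FIELDS = ("state",)


def matches(record, request):
    """True if the record agrees with every field present in the request."""
    return all(record.get(field) == value for field, value in request.items())


def make_modified_request(request):
    """Extract the routing fields from the original search request."""
    return {f: request[f] for f in ROUTING_FIELDS if f in request}


def lookup_database_ids(modified_request):
    """Search the routing database and return the matching database identifiers."""
    return sorted(
        {rec["db_id"] for rec in ROUTING_DB if matches(rec, modified_request)}
    )


def retrieve(request):
    """Route the original request only to the identified databases."""
    modified_request = make_modified_request(request)
    db_ids = lookup_database_ids(modified_request)
    results = []
    for db_id in db_ids:
        # The original, unmodified request is what each identified database searches.
        results.extend(r for r in DATABASES[db_id] if matches(r, request))
    return db_ids, results
```

For example, `retrieve({"last_name": "Smith", "state": "CA"})` consults the routing database first and then searches only `db_west`, leaving `db_east` untouched. In this sketch a request containing none of the routing fields matches every routing record and is therefore sent to all databases; that fallback is a design choice of the example rather than something stated in the text above.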