The present invention relates to retrieval of electronic data in a computer network and, in particular, to performing integrated data retrieval searches over a plurality of databases.
A computer network is a set of information-sharing devices, typically computers, connected together in a way that lets them share data and other devices (hard drives, printers, CD-ROMs, etc.) with each other. Computer networks are typically classified by the physical area they span, which may be a small office, a complete town, or even the entire world. On this basis, networks can be classified into a Home Area Network (HAN), a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), and the Internet. The amount of information shared within a computer network depends upon its span and on the amount of data that needs to be shared between the computers (for solving one or more problems).
In a computer network, a server has applications and data that are usually shared by multiple computer users. Various information-sharing devices, often referred to as “clients”, request information from the servers. The server determines and provides the data required by the clients. This data may include a large number of files, documents, audio files, video files, static image files (and pictures), etc. Hence, the servers usually have a large database of multimedia documents and files, and once a client sends a request, the server (or servers) identifies the documents requested by the client and sends the appropriate information. The identification of relevant documents may require simple or complex computation to be performed by the server before it sends the relevant information to the client.
As the sharing of data increases over computer networks, finding the right data (which may reside within any given computer network or outside it) becomes an important problem. To solve this problem, various kinds of search engines have been introduced. These search engines take keywords from a client and return multiple search results that are relevant to those keywords. These keyword searches are often based on certain rules. These rules define algorithms that govern the search that is performed over different websites and/or web pages (hereinafter referred to as sites). For example, these algorithms can define a lower limit on the frequency of occurrence of a keyword in the searched site. Thus, sites in which the frequency of occurrence of the keyword is above the lower limit are treated as a set of “search results”.
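The frequency rule described above can be illustrated with a minimal sketch. It assumes that each site is available as plain text and that occurrences are counted as case-insensitive whole-word matches; the function names, the sample URLs, and the sample text are hypothetical and are not taken from any particular search engine.

```python
import re


def keyword_frequency(text: str, keyword: str) -> int:
    """Count case-insensitive, whole-word occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def search(sites: dict[str, str], keyword: str, lower_limit: int) -> list[str]:
    """Return the sites whose keyword frequency is above lower_limit."""
    return [
        url
        for url, text in sites.items()
        if keyword_frequency(text, keyword) > lower_limit
    ]


# Hypothetical sites, given as a mapping from URL to page text.
SAMPLE_SITES = {
    "http://example.com/a": "A network connects computers. A network shares data. Network!",
    "http://example.com/b": "One network only.",
    "http://example.com/c": "No match here.",
}

if __name__ == "__main__":
    # With a lower limit of 1, only sites with two or more occurrences qualify.
    print(search(SAMPLE_SITES, "network", lower_limit=1))
```

Raising or lowering `lower_limit` changes which sites count as search results, which is one reason two engines with different rules may return different results for the same keyword.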
In addition to the abovementioned example, a complex algorithm has been discussed in U.S. Pat. No. 6,289,342, titled “Autonomous Citation Indexing And Literature Browsing Using Citation Context”. This patent is assigned to NEC Research Institute, Inc. (Princeton, N.J.) and it relates to context based document search in hyperlinked environments.
Since every search engine is based upon a particular set of rules, it may or may not yield the best results for every search requested by the client. Hence, the client may have to use more than one search engine, and may have to go from one search site to the next. (For example, if the search engine provided by Google, of Mountain View, Calif., does not provide the results desired by the user for a given search, the user may have to use the search engine provided by Altavista, of Palo Alto, Calif.) In fact, most of the time, the client and its human user do not even know whether a given search engine has provided good results. Hence, the user may end up performing the search on more than one search engine in order to obtain accurate information, and then collating the data and separating the “good search results” from the “not so good search results”.
Websites like www.webcrawler.com host search engines that provide a user with the option of using multiple search engines simultaneously. These sites take a keyword from the user and perform a search using multiple search engines. The search results from these search engines are then gathered and displayed to the user. Since these sites make use of multiple search engines, the results provided to the user are usually more exhaustive. For each search result, the server passes an “identification tag” called the Uniform Resource Locator (URL) to the client. A URL can be defined as the syntax and semantics of formalized information for the location and access of resources on the Internet. If the user clicks on a URL provided by the search engine, the user is connected to the corresponding website or web page. Thus, the server transfers a URL for each search result, and these URLs are used by the client to access the corresponding sites. Transferring many URLs from multiple search engines makes the amount of data sent to the client large. Transfer of this large amount of data between the server and the client of www.webcrawler.com consumes a lot of bandwidth. This is particularly true when the client is a portable device whose bandwidth is limited.
The abovementioned limitation was addressed by the search engine supported by the website www.metacrawler.com. This search engine collates the data extracted from different search engines before passing the data to the client. For example, www.metacrawler.com makes use of a number of search engines to obtain results matching the user's keywords. Each search engine comes up with a set of search results. Usually, a number of search results are common to two or more sets of search results. The search engine supported within www.metacrawler.com identifies these common search results and passes information regarding each common search result only once. This avoids undue duplication in the data sent to the client. Thus, the amount of information passed to the client is reduced. However, sites like www.metacrawler.com detect duplicates by doing a string match on the URLs of the results. This makes these sites computationally intensive and expensive.
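Duplicate detection by URL string matching can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation of any site: the source does not say how the match is performed, so the sketch assumes a naive exact comparison of each new URL against every URL already kept. The function name, engine names, and URLs are hypothetical.

```python
def merge_results(result_sets: dict[str, list[str]]) -> list[str]:
    """Merge per-engine result lists, keeping each URL only once.

    Each candidate URL is compared, string by string, against every URL
    already kept, so the number of comparisons grows roughly with the
    square of the number of results.
    """
    merged: list[str] = []
    for urls in result_sets.values():
        for url in urls:
            # Exact string match against the URLs kept so far.
            if not any(url == kept for kept in merged):
                merged.append(url)
    return merged


if __name__ == "__main__":
    # Hypothetical result sets from two engines with one common result.
    results = {
        "engine_a": ["http://example.com/1", "http://example.com/2"],
        "engine_b": ["http://example.com/2", "http://example.com/3"],
    }
    print(merge_results(results))
```

Because the comparison is an exact string match, two URLs that point to the same page but differ in spelling (for example, by a trailing slash) would be treated as distinct results.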
Moreover, these sites make use of search engines provided by third parties like Google, AltaVista, etc., and have no control over the operation of these search engines. The search engines perform their searches independently of each other and send their search results in an unregulated manner. Hence, these sites (that support multiple search engines) often end up overconsuming the allocated bandwidth. This may often lead to a delay in the display of information at the user end.
Along with the aforementioned limitations, sites that host multiple searches display only a limited set of search results. For accessing more information related to that search (or for accessing more information from a given search engine), a new request is sent to the server. Thus, for obtaining results for a query, multiple requests for the same query are sent to the server. Therefore, whenever a user makes multiple requests, the server and the communication link established between the server and the client may be substantially burdened (both in terms of communication bandwidth and in terms of computation).
As mentioned above, sites like www.dogpile.com and www.metacrawler.com pass the URLs of the search results to the client. This consumes a lot of bandwidth. An approach described in U.S. Pat. No. 6,263,330, titled “Method And Apparatus For The Management Of Data Files”, reduces the abovementioned overload. The approach assigns pointers to the URLs that are retrieved from appropriate medical information servers. The data that is transferred to the client is an index file that stores pointers to the retrieved URLs and a corresponding map. This map links the pointers to their corresponding URLs. Hence, for each search engine, the results are displayed using the pointers and the map. This approach reduces the data to be transferred when multiple search engines are used. In the case of a single search engine, however, the approach ends up sending more data. Moreover, there is further scope for reducing the amount of data transferred in the case of multiple search engines.
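The pointer-and-map idea can be pictured with a loose sketch. It is not the method claimed in U.S. Pat. No. 6,263,330, whose file format is not described here; it only assumes that pointers are small integers, that the map is a dictionary from pointer to URL, and that the index holds one pointer list per search engine. All names and URLs are hypothetical.

```python
def build_index(result_sets: dict[str, list[str]]) -> tuple[dict[str, list[int]], dict[int, str]]:
    """Build a per-engine pointer index and a pointer-to-URL map."""
    url_to_pointer: dict[str, int] = {}
    pointer_map: dict[int, str] = {}   # pointer -> URL, sent once
    index: dict[str, list[int]] = {}   # engine -> pointers to its results
    for engine, urls in result_sets.items():
        pointers = []
        for url in urls:
            if url not in url_to_pointer:
                # First time this URL is seen: assign the next pointer.
                pointer = len(url_to_pointer)
                url_to_pointer[url] = pointer
                pointer_map[pointer] = url
            pointers.append(url_to_pointer[url])
        index[engine] = pointers
    return index, pointer_map


def resolve(index: dict[str, list[int]], pointer_map: dict[int, str], engine: str) -> list[str]:
    """Recover the URLs for one engine from the index and the map."""
    return [pointer_map[p] for p in index[engine]]


if __name__ == "__main__":
    results = {
        "engine_a": ["http://example.com/1", "http://example.com/2"],
        "engine_b": ["http://example.com/2", "http://example.com/3"],
    }
    index, pointer_map = build_index(results)
    print(index)
    print(resolve(index, pointer_map, "engine_b"))
```

With several engines, a URL shared by two result sets appears once in the map and is referenced by a short pointer in each list. With a single engine, every URL is still sent once in the map and the pointers are sent in addition, which is consistent with the observation that the approach sends more data in that case.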
All search engines present in the prior art are limited by one or more of the limitations mentioned above. Hence, there is a need for a system that minimizes the amount of information transferred between the server and the client for providing multiple sets of search results from different search engines. Also, there is a need for a system that reduces the burden of requests on the server, i.e., a system that limits the communication established between a client and the server. Also, a need exists for optimizing the bandwidth used during the search by controlling different search engines that may be used.