1. Field of the Invention
The present invention generally relates to data processing, and more specifically, to searching for data or information in order to answer a query. Even more specifically, embodiments of the invention relate to methods, apparatus and computer program products that are well suited for retrieving information across heterogeneous search indices.
2. Background Art
The Internet and the World Wide Web have become critical, integral parts of commercial operations, personal lives, and the education process. At the heart of the Internet is web browser technology and Internet server technology. An Internet server contains “content” such as documents, image or graphics files, forms, audio clips, etc., all of which are available to systems and browsers which have Internet connectivity. Web browser or “client” computers may request documents from web addresses, to which appropriate web servers respond by transmitting one or more web documents, image or graphics files, forms, audio clips, etc. The most common protocol for transmission of web documents and contents from servers to browsers is Hypertext Transfer Protocol (“HTTP”).
The most common type of Internet content or document is Hyper Text Markup Language (“HTML”) documents, but other formats are also well known in the art, such as Adobe Portable Document Format (“PDF”). HTML, PDF and other web documents provide “hyperlinks” within the document, which allow a user to select another document or web site to view. Hyperlinks are specially marked text or areas in the document which when selected by the user, command the browser software to retrieve or fetch the indicated document or to access a new web site. Ordinarily, when the user selects a plain hyperlink, the current page being displayed in the web browser's graphical user interface (“GUI”) window disappears and the newly received page is displayed. If the parent page is an index, for example the IBM web site www.patents.ibm.com, and the user wishes to visit each descending link (e.g. read the document with tips on how to use the site), then the parent or index page disappears and the new page is displayed (such as the help page).
As the computing capacity of web browser computers increases and the communications bandwidth to the web browser computer increases dramatically, one challenge for organizations that provide Internet web sites and content is to deliver and filter such content in anticipation of these greater processing and throughput speeds. This is particularly true in the realm of web-based applications, and in the development of better and more efficient ways to move user-pertinent information to the desktop or client. However, today's web browsers are in general unintelligent software packages. As these browsers currently exist, they require the user to manually search for any articles or documents of interest to him or her, and these browsers are often cumbersome in that they frequently require a download of many documents before one of germane interest is found.
Search engines introduce some level of “intelligence” to the browsing experience, wherein a user may point his unintelligent web browser to a search engine address, enter some keywords for a search, and then review each of the returned documents one at a time by selecting hyperlinks in the search results, or by re-pointing the web browser manually to the web addresses returned. However, search engines do not really search the entire Internet; rather, they search their own indices of Internet content, which have been built by the search engine indexing software, usually through a process of analyzing information contained in various repositories, one example of which is web content on the Internet.
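The distinction between searching the Internet and searching a pre-built index can be illustrated with a minimal sketch. The code below is an assumption made for illustration only and does not describe any particular search engine: the function names `build_index` and `search`, the whitespace tokenization, and the requirement that all query terms match are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build a toy inverted index: each term maps to the IDs of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Answer a query from the index alone; the original documents are never consulted."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the documents matching the first term, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A document that was never passed to `build_index` can never be returned by `search`, which is the sense in which a search engine searches only its own index.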
As presented in the Dogpile report [Different Engines, Different Results. A Research Study by Dogpile.com. April 2007], no single web search engine can retrieve all of the good search results on its own. For example, by searching only Google, a searcher can miss 72.7% of the Web's best first page search results.
To address this problem, another technology has been developed and is known in the art as a “MetaSearch engine”. A MetaSearch engine does not keep its own index, but rather submits a query to multiple component search engines simultaneously, and returns to the user the highest ranked results from each of these search engines. The MetaSearch engine may, for example, return the top 5 listings from each of 4 search engines. As a result, the information more likely to be of interest may be filtered out. Today a number of MetaSearch engines have been constructed and are available on the Internet, such as MetaCrawler and Dogpile.
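The fan-out behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of MetaCrawler, Dogpile, or the present invention: the `Engine` type, the `metasearch` function, and the rank-by-rank interleaving with duplicate removal are hypothetical choices, and each component engine is modeled as a plain function that returns a ranked list of result identifiers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical component engine: maps a query string to a ranked list of result IDs.
Engine = Callable[[str], List[str]]


def metasearch(query: str, engines: Dict[str, Engine], top_k: int = 5) -> List[str]:
    """Send the query to every component engine concurrently and merge each engine's top_k."""
    names = list(engines)
    if not names:
        return []
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map preserves the order of `names`, so the merge below is deterministic.
        ranked_lists = list(pool.map(lambda name: engines[name](query)[:top_k], names))

    merged: List[str] = []
    seen = set()
    # Assumed merge policy: interleave by rank position and drop duplicates.
    for rank in range(top_k):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged
```

With four engines and `top_k=5`, the merged list holds at most 20 entries. The sketch treats every result as the same kind of object, which is exactly the assumption that breaks down when component engines return results with different semantics.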
This invention is also related to distributed information retrieval (IR) technology. Without loss of generality, the context of metasearch is used to illustrate the idea, but the idea is equally applicable to a distributed IR environment.
In a metasearch system, each component search engine makes independent decisions regarding which documents to index, how many documents to retrieve given a query, how to rank search results, and so on [Weiyi Meng, Clement Yu and King-Lup Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 48-89]. Due to such heterogeneity, it is difficult to combine results from component search engines efficiently and effectively. U.S. Pat. No. 6,795,820, for “Metasearch Technique That Ranks Documents Obtained From Multiple Collections,” discloses a framework to combine documents from component search engines, taking both local and global statistics into account when sorting the documents. Wiguna, et al., in “Using Fuzzy Model for Combining and Reranking Search Result from Different Information Sources to Build Metasearch Engine” (Wiratna S. Wiguna, Juan J. Fernández-Tébar and Ana Garcia-Serrano, Computational Intelligence, Theory and Applications, International Conference 9th Fuzzy Days in Dortmund, Germany, Sep. 18-20, 2006), present a way of using fuzzy logic to combine results from distributed search engines. However, their approach is applicable only to combining documents.
None of the existing approaches is appropriate for combining search results with different semantics, such as people versus departments or pages versus books. Having data sources which have different semantics, yet which are connected in certain ways, is very common in enterprises today, e.g., as mentioned in U.S. Patent Application Publication No. 2009/0112841, for “Document Searching Using Contextual Information Leverage and Insight.” What is needed is a methodology to properly combine these search results and sort them.