Field of the Invention
The present invention generally relates to data processing, and more specifically, to searching for data or information in order to answer a query. Even more specifically, embodiments of the invention relate to methods, apparatus and computer program products that are well suited for retrieving information across heterogeneous indices.
Description of the Related Art
The Internet and the World Wide Web have become critical, integral parts of commercial operations, personal lives, and the education process. At the heart of the Internet are web browser technology and Internet server technology. An Internet server contains “content” such as documents, image or graphics files, forms, audio clips, etc., all of which are available to systems and browsers which have Internet connectivity. Web browser or “client” computers may request documents from web addresses, to which appropriate web servers respond by transmitting one or more web documents, image or graphics files, forms, audio clips, etc. The most common protocol for transmission of web documents and contents from servers to browsers is the Hypertext Transfer Protocol (“HTTP”).
The most common type of Internet content or document is the Hypertext Markup Language (“HTML”) document, but other formats are also well known in the art, such as Adobe Portable Document Format (“PDF”). HTML, PDF and other web documents provide “hyperlinks” within the document, which allow a user to select another document or web site to view. Hyperlinks are specially marked text or areas in the document which, when selected by the user, command the browser software to retrieve or fetch the indicated document or to access a new web site. Ordinarily, when the user selects a plain hyperlink, the current page being displayed in the web browser's graphical user interface (“GUI”) window disappears and the newly received page is displayed. If the parent page is an index, for example the IBM web site www.patents.ibm.com, and the user wishes to visit each descending link (e.g. read the document with tips on how to use the site), then the parent or index page disappears and the new page (such as the help page) is displayed.
As the computing capacity of web browser computers increases and the communications bandwidth to the web browser computer increases dramatically, one challenge for organizations that provide Internet web sites and content is to deliver and filter such content in anticipation of these greater processing and throughput speeds. This is particularly true in the realm of web-based applications, and in the development of better and more efficient ways to move user-pertinent information to the desktop or client. However, today's web browsers are in general unintelligent software packages. As these browsers currently exist, they require the user to manually search for any articles or documents of interest to him or her, and these browsers are often cumbersome in that they frequently require a download of many documents before one of germane interest is found.
Search engines provide some level of “intelligence” to the browsing experience, wherein a user may point his unintelligent web browser to a search engine address, enter some keywords for a search, and then review each of the returned documents one at a time by selecting hyperlinks in the search results, or by re-pointing the web browser manually to provided web addresses. However, search engines do not really search the entire Internet; rather, they search their own indices of Internet content, which have been built by the search engine operator, usually through a process of reviewing manual submissions from other web site operators. Thus, it is common for a user to use several search engines while looking for information on a particular subject, because each search engine will return different results based on its own index content.
To address this problem, another technology has been developed and is known in the art as a “MetaSearch engine”. A MetaSearch engine does not keep its own index, but rather submits a query to multiple search engines simultaneously, and returns to the user the highest ranked returns from each of these search engines. A MetaSearch engine may, however, return only the top 5 listings from each of 4 search engines, and this truncation may filter out information that is more likely to be of interest to the user.
MetaSearch engines are constructed to support unified access to multiple search engines. With reference to FIG. 1, when merging results from multiple search indices 20, 22, a MetaSearch engine 24 can adopt either local similarity adjustment or global similarity estimation to provide documents 26. In the local adjustment approach, each component search engine ranks documents locally. The MetaSearch engine then normalizes the ranks into the same range using additional information such as the quality of the component search engines. For global similarity estimation, the MetaSearch engine computes a global similarity score for each returned document using certain information from the component engines, such as the local document frequency of a term. Today a number of MetaSearch engines, such as MetaCrawler and Dogpile, have been constructed and are available on the Internet. The component search engines in these systems deal with the same type of data, namely document level indices. Documents in these systems, as shown in FIG. 1, are first class entities. The term “first class entities” refers to entities that can be used in programs without restrictions. Here, it refers to the abstract objects (such as books and departments) used in the designed system.
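The local similarity adjustment approach described above can be illustrated with a minimal sketch. The sketch assumes min-max normalization of each engine's local scores into the range [0, 1] and a per-engine quality weight; the function names, the choice of normalization, and the rule of keeping the highest adjusted score for a document returned by several engines are illustrative assumptions and are not taken from any particular MetaSearch engine.

```python
from typing import Dict, List, Tuple


def normalize_local_scores(results: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize one engine's local scores into [0, 1] (assumed scheme)."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        # All scores are equal, so every document gets the top of the range.
        return {doc: 1.0 for doc in results}
    return {doc: (score - lo) / (hi - lo) for doc, score in results.items()}


def merge_local_adjustment(
    engine_results: Dict[str, Dict[str, float]],
    engine_quality: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Merge per-engine rankings after normalizing and weighting by engine quality.

    engine_results maps an engine name to {document id: local score}.
    engine_quality maps an engine name to a hypothetical quality weight.
    """
    merged: Dict[str, float] = {}
    for engine, results in engine_results.items():
        weight = engine_quality.get(engine, 1.0)
        for doc, score in normalize_local_scores(results).items():
            adjusted = weight * score
            # Assumption: keep the best adjusted score when engines overlap.
            merged[doc] = max(merged.get(doc, 0.0), adjusted)
    # Highest adjusted score first; ties broken by document id for stability.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, with one engine scoring documents on a 0 to 10 scale and another on a 0 to 1 scale, both sets of scores are brought into the same range before the quality weights are applied, so that neither engine dominates the merged list merely because of its scoring scale.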
IBM's Enterprise Information Leverage (EIL) system can be regarded as a MetaSearch engine which provides unified access to services engagement data. A service engagement represents the interaction as well as the documents exchanged between sellers and clients. With reference to FIG. 2, an EIL system 30 combines information extraction and semantic search to support information needs of a user. An EIL system leverages structured and unstructured data using a novel architecture and special purpose algorithms. Information is organized around an entity (such as engagements, books and departments), and the system supports a semantic concept index based information retrieval 32 by utilizing both information of first class entities in database queries 34 and of document index search 36 where the relevant entities act as a contextual constraint 38. In EIL systems, there is a need to deal with heterogeneous search indices; these indices are associated with documents as well as semantic concepts extracted from these documents. These concepts represent important properties of a service engagement. Analogously, a system can include data about books and each page in a book, or about departments in a company and each person in the department, etc. Furthermore, the indices can be stored in different places. For example, data about books can be stored in a relational database such as DB2, and information about pages in the books can be stored in a search engine such as OmniFind. The semantic differences between heterogeneous search indices may be a problem when merging and ranking the results in a MetaSearch engine.
Furthermore, in systems similar to the EIL system, documents are not first class entities. Instead, the first class entities can be engagements, books, departments, and so on. For instance, a user may want to search for a book about Java programming. If a page of content in a book mentions Java programming, the book should be returned. The ideal result is that a number of books relating to Java programming are returned where, under each book, the top ranked pages containing the keywords are listed with hit highlights. Based on the hit highlights and the properties of the books, a user can decide if a book is of interest. Therefore, it is important to cover as many books as possible given a certain number of book pages.
For example, two search indices for 5000 books have been established. One index is a keyword search index that is stored in a keyword search engine. The other index has specific properties of each book, such as the book titles, authors' names, dates published, abstracts, readers' comments, and so on. Normally, only a limited number of documents can be retrieved from a keyword search engine. For example, by default, OmniFind returns 500 document links for each search call. However, for a search of the term “Java programming”, a return of 500 pages from the same book is not the best result. An ideal result would be to have about 10 to 20 pages returned for a single book to allow the system to rank the books based on both the pages that are returned and the properties (semantic concepts) indexed in a relational database. In this way, there are a sufficient number of books presented for the user without retrieving too many pages. In a regular web search engine, documents are stored as first class entities and there is no need to group documents into a higher level of entities. What is needed is a system and search engine processing methodology that presents a sufficient number of books to a user for review without retrieving an excessively large number of pages.
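The grouping problem described above can be illustrated with a minimal sketch. The sketch assumes that page hits arrive as (book identifier, page number, score) tuples, that at most a fixed number of pages is kept per book, and that each book is ranked by a weighted combination of its mean retained page score and a property score obtained from the relational index. The function names, the cap, and the combining formula are illustrative assumptions only; they do not represent the OmniFind or DB2 interfaces, nor any particular ranking method.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PageHit = Tuple[str, int, float]  # (book id, page number, keyword score)


def group_pages_by_book(
    page_hits: List[PageHit], max_pages_per_book: int = 10
) -> Dict[str, List[Tuple[int, float]]]:
    """Keep only the highest scoring pages of each book, up to the cap."""
    grouped: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    # Visit hits from best to worst so the cap retains the top pages.
    for book_id, page_no, score in sorted(page_hits, key=lambda h: -h[2]):
        if len(grouped[book_id]) < max_pages_per_book:
            grouped[book_id].append((page_no, score))
    return dict(grouped)


def rank_books(
    grouped: Dict[str, List[Tuple[int, float]]],
    property_scores: Dict[str, float],
    page_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank books by combining page evidence with property evidence.

    property_scores is a hypothetical per-book score derived from indexed
    properties such as title or abstract; the linear mix is an assumption.
    """
    ranked = []
    for book_id, pages in grouped.items():
        mean_page_score = sum(score for _, score in pages) / len(pages)
        prop_score = property_scores.get(book_id, 0.0)
        combined = page_weight * mean_page_score + (1 - page_weight) * prop_score
        ranked.append((book_id, combined))
    # Highest combined score first; ties broken by book id for stability.
    ranked.sort(key=lambda kv: (-kv[1], kv[0]))
    return ranked
```

Capping the pages retained per book means that a fixed budget of page hits is spread over more books, which is the coverage goal stated above. The sketch applies the cap only after the hits have been retrieved; it does not address how to avoid retrieving an excessive number of pages from the keyword search engine in the first place.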