1. Field of the Invention
The present invention relates to searching for or retrieving information on the Internet, and in particular, to an information search device and information search method for acquiring information from a plurality of search engines existing on the WWW (World Wide Web, hereinbelow referred to as the "Web") on the Internet.
2. Description of the Related Art
A variety of information search engines (hereinbelow simply referred to as "search engines") exist on the WWW. In search engines such as Yahoo! (http://www.yahoo.com) and AltaVista (http://www.altavista.com), a database of the URLs (Uniform Resource Locators) of Web pages existing on the Web is constructed to allow the user to search for a Web page. Yahoo! and AltaVista are general-purpose search engines directed to Web pages on various topics and categories, but there are also search engines that focus on specific topics (i.e., topic search engines). For example, Amazon.com (http://www.amazon.com) has a database directed exclusively to books for searching for books.
When searching the WWW by means of a search engine, users themselves typically select the search engine according to their purpose and search for information by submitting search keywords to the search engine. In other words, normally, a single search engine is used with each search. In this case, "search keywords" are keywords that are submitted when using a search engine to search for information.
In contrast, there is also a method known as "meta-search" that employs multiple search engines present on the Web (for example, Selberg, E. and Etzioni, O., "Multi-Service Search and Comparison Using the MetaCrawler," in Proceedings of the 4th International World Wide Web Conference, 1994). In a meta-search, the search keywords submitted by the user are sent to a plurality of search engines, and all search results obtained from the search engines are presented to the user organized in a single report. When using a single search engine, a user must search with another search engine if the necessary information is not found by a particular search engine. In other words, the user must switch from search engine to search engine, submitting the search keywords any number of times until the necessary information is found. A meta-search eliminates this need for repetitive operations.
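The meta-search flow described above can be illustrated with a minimal sketch. It is not MetaCrawler's implementation: the function name `meta_search`, the stub engines, and the example URLs are hypothetical, and each engine is modeled as a plain callable that returns a list of URLs rather than as a real network service.

```python
def meta_search(keywords, engines):
    """Send the same keywords to every engine and merge the results.

    `engines` maps an engine name to a callable that takes the keywords
    and returns a list of URLs (a stand-in for a real search service).
    Returns one combined report of (url, engine_name) pairs, keeping
    only the first occurrence of each URL.
    """
    report = []
    seen = set()
    for name, search in engines.items():
        for url in search(keywords):
            if url not in seen:
                seen.add(url)
                report.append((url, name))
    return report


# Hypothetical stub engines used only for illustration.
stub_engines = {
    "EngineA": lambda kw: ["http://example.com/1", "http://example.com/2"],
    "EngineB": lambda kw: ["http://example.com/2", "http://example.com/3"],
}
report = meta_search("python", stub_engines)
```

Because every engine is queried on every search, the cost of this approach grows with the number of engines.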
Distributed information retrieval methods have been proposed for selecting information appropriate to a query from a plurality of information sources (for example, Xu, J. and Callan, J., "Effective Retrieval with Distributed Collections," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 112-120, 1998). According to such a method, a query is routed only to the databases of the selected information sources, and an improvement in the speed of the search process can therefore be expected. To select an appropriate database, a database (DB) selection index is first produced using the keywords contained in each individual database among the distributed databases and the frequencies of those keywords.
When using a single search engine, users must select the search engine according to the desired information. If users wish to get information on a recently published book, for example, they must select a book search engine, and if they wish to find a place to stay, they must select a hotel search engine. However, it is a burdensome task for users themselves to select the appropriate search service for each piece of required information.
A method can be considered by which search keywords are sent to all known search engines by the meta-search method. If the number of search engines is large, however, sending the search keywords to all of them is impossible from a practical standpoint because of the processing time required and the burden placed on the network. A current meta-search normally uses on the order of ten search engines, but if the number of search engines reaches, for example, a few thousand, the conventional meta-search method becomes unrealistic.
Appropriate search engines must therefore be selected according to the user's search keywords. However, the database selection method in distributed information retrieval of the prior art presupposes that all data contained in the databases of each information source can be accessed to produce a database selection index. If the information sources are search engines on the Web, however, the entire contents of the databases of the search engines are generally inaccessible, and this prevents the use of the database selection method in the distributed information retrieval of the prior art.
It is an object of the present invention to realize an information search device and information search method that produce an index (hereinbelow referred to as a "DB selection index") for selecting search engines from search engines existing on the Web and that select a search engine that is appropriate to a user's search keyword.
In more concrete terms, if, for example, the user's search keyword is "python," the object of the present invention is to present the user with results such as those shown in FIG. 1. "Python" is of course the name of one variety of snake, but it is also the name of a script-type object-oriented programming language. If the search keywords are related to multiple topics in this way, the search engine selection results are shown for each topic, and moreover, phrases explaining the topics are added. In the case shown in FIG. 1, the phrase "Object oriented programming with python" is added for "python" as an object-oriented programming language, and "Object-oriented Information Source" and "Scripting Database" are listed as the search engines. For "python" as the snake, on the other hand, the phrase "snake python" is added, and "Reptile Search" and "Snake Information" are listed as search engines. The user selects the choice that matches his or her intent to enable actual submission of the search keyword to the selected search engine. In the example shown in the figure, the user can send the search keyword to the selected search engine by checking the check box displayed next to the search engine that is to be selected and clicking on the button "Send Query."
The search keywords may also be sent directly to each of the topic search engines (i.e., search engines that focus on specific topics) and the search result may be obtained without presenting the user with a list of topic search engines that may be relevant as shown in FIG. 1.
An information search device of the present invention that achieves the above-described objects is preferably provided with: (i) a relevant term collector for collecting terms describing the topics or content handled by a search engine; (ii) an index generator for producing a DB selection index from the collected relevant terms; (iii) a DB selection index that is stored inside a storage device; (iv) a query expansion unit for obtaining a term relevant to a search keyword submitted by the user from a general-purpose search engine; (v) an expanded term storage unit for storing a term obtained by the query expansion unit; and (vi) an engine selector for selecting a search engine based on the information that is stored in the expanded term storage unit and the DB selection index.
Here, the query expansion unit preferably obtains a term relevant to the search keyword from the search result obtained by sending the search keyword submitted by the user to a general-purpose Web search engine.
Preferably, the information search device of the present invention is further provided with: (vii) a reference character string storage unit for storing a character string in a document obtained from a general-purpose search engine by the query expansion unit; and (viii) a phrase generator for generating a phrase that explains a topic that is relevant to the search keyword based on information stored in the reference character string storage unit and the expanded term storage unit.
The process of performing an information search of the Web using the information search device of the present invention can be divided between an index generation phase for generating a DB selection index and a search engine selection phase for selecting a search engine appropriate for the search keyword submitted from a user using the DB selection index.
In the index generation phase, the relevant term collector first collects topics handled by search engines and terms relevant to the content of search engines from the Web pages of the search engines or from other Web pages having hyperlinks pointing to the search engine pages. Next, the index generator generates a DB selection index from the terms collected by the relevant term collector and their frequencies, and stores this index in a DB selection index storage unit (typically, a storage device).
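The index generation phase can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the text gathered by the relevant term collector is assumed to be already available as strings, the DB selection index is modeled as a mapping from term to per-engine frequency, and the names `tokenize`, `build_db_selection_index`, and the sample text are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_db_selection_index(collected_text):
    """Build a term -> {engine: frequency} index.

    `collected_text` maps a search engine name to the list of text
    snippets collected for it, e.g. from the engine's own page or
    from pages that link to it.
    """
    index = defaultdict(dict)
    for engine, snippets in collected_text.items():
        counts = Counter()
        for snippet in snippets:
            counts.update(tokenize(snippet))
        for term, freq in counts.items():
            index[term][engine] = freq
    return dict(index)


# Hypothetical collected text for two topic search engines.
collected = {
    "Reptile Search": [
        "Search our reptile database: snake, lizard, python",
        "snake care",
    ],
    "Scripting Database": [
        "Python and Perl scripting resources",
        "object oriented programming",
    ],
}
db_selection_index = build_db_selection_index(collected)
```

Storing the frequency per engine under each term allows a later selection step to look up, for any term, which engines mention it and how often.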
In the search engine selection phase, in the query expansion unit, a term relevant to a search keyword submitted from the user is first acquired from, for example, a general-purpose search engine. This process is performed because only a limited number of terms are collected in the relevant term collector, and the use of only the search keyword submitted by the user usually results in no matches at all with terms registered in the DB selection index. Terms acquired by the query expansion unit are stored in the expanded term storage unit. Character strings contained in the search results obtained for the query expansion process from the general-purpose search engine are stored in the reference character string storage unit as necessary.
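The query expansion step can be sketched as follows. The sketch assumes that result snippets from a general-purpose search engine have already been retrieved and are passed in as strings; it does not call any real search service. Choosing the most frequent co-occurring terms is an assumed heuristic, and the names `expand_query` and `STOPWORDS` and the sample snippets are hypothetical.

```python
import re
from collections import Counter

# A small, illustrative stopword list (an assumption, not from the source).
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "with", "in", "is", "for", "to", "on"}
)


def expand_query(keyword, result_snippets, top_n=5):
    """Return (expanded_terms, reference_strings) for a search keyword.

    expanded_terms: the terms that most often co-occur with the keyword
    in the snippets (contents of the expanded term storage unit).
    reference_strings: the snippets that contain the keyword (contents
    of the reference character string storage unit).
    """
    keyword = keyword.lower()
    counts = Counter()
    reference_strings = []
    for snippet in result_snippets:
        tokens = re.findall(r"[a-z0-9]+", snippet.lower())
        if keyword not in tokens:
            continue  # ignore results that do not mention the keyword
        reference_strings.append(snippet)
        counts.update(t for t in tokens if t != keyword and t not in STOPWORDS)
    expanded_terms = [term for term, _ in counts.most_common(top_n)]
    return expanded_terms, reference_strings


# Hypothetical snippets standing in for general-purpose search results.
snippets = [
    "Object oriented programming with Python",
    "Python scripting and programming tutorials",
    "The python is a large snake",
    "Hotel reservations in Tokyo",
]
expanded_terms, reference_strings = expand_query("python", snippets, top_n=10)
```

The expanded terms, rather than the bare keyword, are what get matched against the DB selection index, which is why a keyword absent from the index can still lead to a relevant search engine.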
After the query expansion process, one or more search engines are selected in the engine selector based on the information that is stored in the DB selection index and the expanded term storage unit. In addition, the phrase generator may generate phrases that explain the topics relevant to the search keywords that were submitted by the user, and present these phrases to the user together with the search engines that were selected in the engine selector.
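The engine selection step can be sketched as follows. The scoring rule, which sums the indexed frequencies of each expanded term per engine, is an assumption made for illustration; the text above does not specify how the engine selector weighs terms. The name `select_engines` and the sample index values are hypothetical.

```python
def select_engines(expanded_terms, db_selection_index, max_engines=3):
    """Rank engines by the summed index frequency of the expanded terms.

    `db_selection_index` maps term -> {engine: frequency}. Engines that
    match none of the expanded terms are never returned. Ties are broken
    alphabetically so the output is deterministic.
    """
    scores = {}
    for term in expanded_terms:
        for engine, freq in db_selection_index.get(term, {}).items():
            scores[engine] = scores.get(engine, 0) + freq
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [engine for engine, _ in ranked[:max_engines]]


# Hypothetical index contents for illustration.
sample_index = {
    "snake": {"Reptile Search": 2, "Snake Information": 3},
    "reptile": {"Reptile Search": 4},
    "programming": {"Scripting Database": 5},
}
selected = select_engines(["snake", "reptile"], sample_index)
```

In this example, "Reptile Search" scores 6 and "Snake Information" scores 3, so both are selected in that order, while "Scripting Database" is not selected because none of the expanded terms match it.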