(1) Field of the Invention
This invention relates to information retrieval systems. More particularly, the invention relates to information retrieval in a distributed information system, e.g., the Internet, using query learning and meta search.
(2) Description of the Prior Art
The World Wide Web (WWW) is currently filled with documents that collect together links to all known documents on a topic; henceforth, we will refer to documents of this sort as resource directories. While resource directories are often valuable, they can be difficult to create and maintain. Maintenance is especially problematic because the rapid growth in on-line documents makes it difficult to keep a resource directory up-to-date.
This invention describes machine learning methods that address the resource directory maintenance problem. In particular, we propose to treat a resource directory as an extensional definition of an unknown concept, i.e., documents pointed to by the resource directory will be considered positive examples of the unknown concept, and all other documents will be considered negative examples of the concept. Machine learning methods can then be used to construct from these examples an intensional definition of the concept. If an appropriate learning method is used, this definition can be translated into a query for a WWW search engine, such as Altavista, Infoseek or Lycos. If the query is accurate, then re-submitting the query at a later date will detect any new instances of the concept that have been added. We will present experimental results on this problem with two implemented systems. One is an interactive system, an augmented WWW browser that allows the user to label any document and to learn a search query from previously labeled examples. This system is useful in locating documents similar to those in a resource directory, thus making the directory more comprehensive. The other is a batch system which repeatedly learns queries from examples, and then collects and labels pages using these queries. In labeling examples, this system assumes that the original resource directory is complete, and hence can only be used with a nearly exhaustive initial resource directory; however, it can operate without human intervention.
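By way of illustration only, the step of constructing an intensional definition from positive and negative examples and translating it into a search query might be sketched as follows. The greedy covering learner shown here is an assumption chosen for brevity and is not asserted to be the learning method of the invention; the function names (tokenize, learn_rule, learn_query, to_query, matches) and the AND/OR query syntax are likewise hypothetical, and an actual search engine may require a different syntax.

```python
def tokenize(text):
    """Lowercase a document and return its set of words."""
    return set(text.lower().split())


def learn_rule(positives, negatives):
    """Greedily grow one conjunction of words until it matches no negative.

    positives and negatives are lists of word sets. Returns the rule (a list
    of words) and the positive examples the rule still covers.
    """
    rule, pos, neg = [], list(positives), list(negatives)
    while neg:
        candidates = set().union(*pos) - set(rule)
        if not candidates:
            break  # no word left to add; the rule may still match some negatives
        # Prefer words that keep many positives and few negatives;
        # break ties in favor of higher positive coverage.
        best = max(
            sorted(candidates),
            key=lambda w: (
                sum(w in d for d in pos) - sum(w in d for d in neg),
                sum(w in d for d in pos),
            ),
        )
        rule.append(best)
        pos = [d for d in pos if best in d]
        neg = [d for d in neg if best in d]
    return rule, pos


def learn_query(pos_docs, neg_docs):
    """Learn a disjunction of conjunctions that covers the positive examples."""
    negatives = [tokenize(d) for d in neg_docs]
    remaining = [tokenize(d) for d in pos_docs]
    rules = []
    while remaining:
        rule, covered = learn_rule(remaining, negatives)
        if not rule:
            break  # nothing to learn, e.g. no negative examples were given
        rules.append(sorted(rule))
        remaining = [d for d in remaining if d not in covered]
    return rules


def to_query(rules):
    """Render the learned rules in a hypothetical boolean query syntax."""
    return " OR ".join("(" + " AND ".join(rule) + ")" for rule in rules)


def matches(rules, text):
    """Return True if the document satisfies at least one learned rule."""
    doc = tokenize(text)
    return any(all(w in doc for w in rule) for rule in rules)


rules = learn_query(
    ["machine learning tutorial pages", "machine learning course notes"],
    ["washing machine repair", "learning to cook"],
)
print(to_query(rules))  # (learning AND machine)
```

Because the learned definition is a boolean combination of words, it can be re-submitted to a search engine at a later date, and the same rules can be applied locally (here via matches) to check whether a returned document is an instance of the concept.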
Prior art related to machine learning methods includes the following:
U.S. Pat. No. 5,278,980 issued Jan. 11, 1994 discloses an information retrieval system and method in which an operator inputs one or more query words which are used to determine a search key for searching through a corpus of documents, and which returns any matches between the search key and the corpus of documents as a phrase containing the word data matching the query word(s), a non-stop (content) word next adjacent to the matching word data, and all intervening stop-words between the matching word data and the next adjacent non-stop word. The operator, after reviewing one or more of the returned phrases, can then use one or more of the next adjacent non-stop words as new query words to reformulate the search key and perform a subsequent search through the document corpus. This process can be conducted iteratively, until the appropriate documents of interest are located. The additional non-stop words for each phrase are preferably aligned with each other (e.g., by columnation) to ease viewing of the "new" content words.
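For purposes of illustration only, the phrase returned by such a system (the matching word, any intervening stop-words, and the next adjacent non-stop word) might be computed as in the following sketch. The stop-word list, the whitespace tokenization, and the function name phrases are assumptions made for this example and are not taken from the cited patent.

```python
# Illustrative stop-word list (an assumption; a real system would use a larger one).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def phrases(text, query_word, stop_words=STOP_WORDS):
    """Return (phrase, next_content_word) pairs for each match of query_word.

    Each phrase runs from the matching word through the next non-stop word,
    including every stop-word in between.
    """
    tokens = text.lower().split()
    target = query_word.lower()
    results = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        j = i + 1
        # Skip over the intervening stop-words.
        while j < len(tokens) and tokens[j] in stop_words:
            j += 1
        if j < len(tokens):  # a content word follows the match
            results.append((" ".join(tokens[i:j + 1]), tokens[j]))
    return results


for phrase, content_word in phrases(
    "Retrieval of the documents and retrieval in databases", "retrieval"
):
    print(f"{phrase:<30}{content_word}")
```

The second element of each pair is the candidate new query word that the operator may select to reformulate the search key; printing it in its own column corresponds to the alignment of the content words described above.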
Other prior art related to machine learning methods is disclosed in the references attached to the specification as Appendix 1.
None of the prior art discloses a system and method of adding documents to a resource directory in a distributed information system by using a learning means to generate from training data a plurality of items as positive and/or negative examples of a particular class and using the learning means to generate at least one query that can be submitted to any of a plurality of methods for searching the system for a new item, after which the new item is evaluated by the learning means with the aim of verifying that the new item is a new subset of the class.
An information retrieval system finds information in a Distributed Information System (DIS), e.g., the Internet, using query learning and meta search for adding documents to resource directories contained in the DIS. A selection means generates training data characterized as positive and negative examples of a particular class of data residing in the DIS. A learning means generates from the training data at least one query that can be submitted to any one of a plurality of search engines for searching the DIS to find "new" items of the particular class. An evaluation means determines and verifies that the new item(s) is a new subset of the particular class and adds or updates the particular class in the resource directory.
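By way of example only, the cooperation of the selection means, the learning means, the plurality of search engines, and the evaluation means might be sketched as follows. Every name in this sketch is hypothetical: the search engines are modeled as plain callables that accept a list of query words and return URLs, fetch is an assumed function returning the text of a URL, and the toy learner (the words common to all positive examples, accepted only if that conjunction rejects every negative example) merely stands in for whatever learning means is actually employed.

```python
def words(text):
    """Lowercase a document and return its set of words."""
    return set(text.lower().split())


def select_training_data(directory, documents):
    """Selection means: directory entries are positive, all other documents negative."""
    positives = [text for url, text in documents.items() if url in directory]
    negatives = [text for url, text in documents.items() if url not in directory]
    return positives, negatives


def learn_query(positives, negatives):
    """Learning means (toy): the conjunction of words shared by all positives.

    Returns an empty query if there are no positives or if some negative
    example satisfies the conjunction.
    """
    if not positives:
        return []
    shared = set.intersection(*(words(t) for t in positives))
    if not shared or any(shared <= words(t) for t in negatives):
        return []
    return sorted(shared)


def meta_search(query, engines):
    """Submit the same query to every engine and merge the hits without duplicates."""
    hits = []
    for engine in engines:
        for url in engine(query):
            if url not in hits:
                hits.append(url)
    return hits


def evaluate(hits, query, directory, fetch):
    """Evaluation means: keep hits that are new and whose text satisfies the query."""
    return [
        url for url in hits
        if url not in directory and set(query) <= words(fetch(url))
    ]


def update_directory(directory, documents, engines, fetch):
    """Run one selection, learning, search, and evaluation cycle."""
    positives, negatives = select_training_data(directory, documents)
    query = learn_query(positives, negatives)
    if not query:
        return list(directory)  # nothing learned; leave the directory unchanged
    new_items = evaluate(meta_search(query, engines), query, directory, fetch)
    return list(directory) + new_items
```

In this sketch the evaluation step re-checks each hit against the learned query rather than trusting the search engines, since different engines may interpret the same query differently.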