1. Field of the Invention
This invention relates to the field of data processing and communications, and in particular to the field of document storage, organization, and retrieval.
2. Description of Related Art
The volume of information that is available for access continues to grow, and the rate of that growth is itself increasing. This accelerating growth has led to a corresponding expansion of the resources devoted to information storage, organization, and retrieval.
Conventional search engines, such as those used for finding documents on the World Wide Web, use a variety of techniques to quickly locate documents in response to a user query. One such technique is the creation of a database of indexes corresponding to the documents on the web. A user's request is processed by finding a correlation between the user's request and the information contained in the index database, rather than by actually searching the web in response to each user request. Conventional search engines use "crawlers" that locate new or updated documents. When a new or updated document is located, the search engine creates an index corresponding to that document that contains, for example, a list of the most commonly occurring words or phrases in the document. Alternatively, techniques are available that allow the creator of the document to augment the document with a set of keywords or phrases directly, and these keywords or phrases are used to index the document. For ease of reference, the term keyword is used hereinafter to mean a word that is contained in an index to a document, regardless of the methods used to place that word in the index. When a user enters a query, the search results are based upon a matching between the words contained in the user query and the keywords contained in the indexes to the documents. As would be evident to one of ordinary skill in the art, the size of an index to a document can be large, and a database of indexes to virtually all of the documents on the web will be extremely large and will continue to grow at an increasing rate. As of 1998, an estimated 1.5 million pages are added to the World Wide Web per day, and this daily rate is expected to continue to increase. In addition to the cost of increased storage resources, the performance of database search techniques degrades as the size of the database increases.
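The keyword indexing and matching described above can be illustrated with a minimal sketch. It is not taken from any particular search engine: the function names `build_index` and `search`, the `top_n` cutoff, and the sample documents are all hypothetical, and the ranking (number of query words matched) is an assumption chosen for simplicity.

```python
import re
from collections import Counter, defaultdict

def build_index(documents, top_n=2):
    """Index each document by its top_n most frequent words (its keywords)."""
    index = defaultdict(set)  # keyword -> ids of documents indexed under it
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        for word, _count in Counter(words).most_common(top_n):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return document ids ordered by how many query words match their keywords."""
    hits = Counter()
    for word in re.findall(r"[a-z]+", query.lower()):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _count in hits.most_common()]

# Hypothetical sample documents, used only for illustration.
documents = {
    "a": "jazz guitar chords jazz guitar jazz",
    "b": "guitar repair guide guitar repair",
    "c": "garden soil garden soil plants",
}
index = build_index(documents)
```

With this sample data, a query for "jazz guitar" finds documents "a" and "b", while a query for "plants" finds nothing, because "plants" appears in document "c" but was never placed in its index. The query is answered entirely from the index database; the documents themselves are not searched.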
Document retrieval based upon a keyword search is becoming increasingly less efficient and less effective as the number of documents that may contain the keyword continues to increase. It is not uncommon for a keyword search on the World Wide Web to return thousands of documents that are related to the keyword, many of which are irrelevant to the user's quest. To reduce the number of identified documents corresponding to the keyword search, a user must augment the search parameters with additional keywords or phrases. In so doing, however, documents that are relevant to the user's quest may be excluded from the search results if the user does not choose the same words that are used in the document. A search engine could be enhanced to automatically augment a user's query with synonymous keywords to avoid this problem, but such an augmentation will aggravate the problem of identifying documents that contain the words but are irrelevant to the user's quest.
Topical categorizations provide a more selective means of locating documents that are relevant to a user's quest, because documents that have the same topic as the user's quest are more likely to have relevant information than documents that merely contain a collection of matching words. Identifying a document's topic, or topics, however, is a more complex task than identifying the words that are contained in the document. Traditionally, topic identification is a manually intensive task, requiring a large staff of people to read and categorize each document. Advances are continually being made in the information sciences in the development of statistically based algorithms, neural net and genetic based algorithms, and the like for automatic categorization of documents. Topical categorization also provides a highly effective means for general browsing, by allowing a user to select both topics of interest and topics of disinterest to steer the browsing process.
The techniques used to organize, store, and retrieve documents based on keyword searches, however, are not necessarily optimal or desirable for documents that can be categorized by topic. A mere replacement of topic phrases for keywords in a keyword search engine may not provide the improvements in search and storage efficiencies required as the quantity of available information continues to increase. The traditional approach of creating larger and larger search engines and databases that index every available document on the web based upon a frequency of occurrences of words or phrases within each document may be wholly inefficient and ineffective for organizing and retrieving documents based on topic. An indiscriminate use of topic determining techniques, for example, may merely create an even larger vocabulary that a user must use to filter relevant documents, with the inherent risk of choosing a different set of words or phrases than those used to index the documents. Because most documents contain multiple topics, the addition of topic information to existing indexes of documents will also substantially increase the size of the database required to contain this additional information.
It is an object of this invention to provide an information organization and retrieval system that efficiently organizes documents for rapid and efficient search and retrieval based upon topical content. It is a further object of this invention to provide an information organization and retrieval system that can be enhanced incrementally. It is a further object of this invention to provide an information organization and retrieval system that supports context-sensitive search and retrieval techniques. It is a further object of this invention to provide an information organization and retrieval system that allows a user to employ a vocabulary that may differ from the vocabulary used to organize the information in the information organization and retrieval system.
These objects and others are achieved by providing an information organization and retrieval system that is optimized for the retrieval of only those documents that are relevant to a given set of topics. The invention provides a method and apparatus for automatic document prefiltering and routing via a network of cooperating topical information servers. The information servers are provided to support document organization and retrieval based upon a select set of topics. The select set of topics is organized in multiple overlapping hierarchies, and a distributed software architecture is used to support the topic-based information organization, routing, and retrieval services. Documents are automatically prefiltered to determine whether they are relevant to the select set of topics, and only relevant documents are identified for subsequent retrieval. Documents may be relevant to one or more topics, and will be associated with each topic via the topical hierarchies that are maintained by the information servers.
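One way to picture the prefiltering and routing described above is the following sketch. It is illustrative only and makes several assumptions that are not stated in this summary: the topic names, the cue terms used by the toy `classify` function, the representation of each information server as a set of document identifiers, and the choice to associate a document with every ancestor of a matched topic are all hypothetical.

```python
# Hypothetical overlapping hierarchies: each topic lists its parent topics.
# "guitar" has two parents, so it belongs to two hierarchies at once.
PARENTS = {
    "jazz guitar": ["jazz", "guitar"],
    "jazz": ["music"],
    "guitar": ["music", "instruments"],
    "music": [],
    "instruments": [],
}

# Hypothetical cue terms standing in for a real topic classifier.
TOPIC_TERMS = {
    "jazz guitar": {"archtop", "comping"},
    "jazz": {"swing", "bebop"},
    "guitar": {"fretboard", "strings"},
}

def classify(text):
    """Return the select topics whose cue terms appear in the text."""
    words = set(text.lower().split())
    return {topic for topic, terms in TOPIC_TERMS.items() if words & terms}

def with_ancestors(topics):
    """Return the given topics plus every ancestor in any hierarchy."""
    seen, stack = set(), list(topics)
    while stack:
        topic = stack.pop()
        if topic not in seen:
            seen.add(topic)
            stack.extend(PARENTS.get(topic, []))
    return seen

def route(doc_id, text, servers):
    """Prefilter a document; if relevant, record it under each related topic."""
    topics = classify(text)
    if not topics:
        return False  # not relevant to the select set of topics: discarded
    for topic in with_ancestors(topics):
        servers.setdefault(topic, set()).add(doc_id)
    return True
```

In this sketch, a document mentioning "bebop" is recorded under both "jazz" and "music", while a document that matches no cue term is rejected and never stored, so storage grows only with the relevant documents.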
In a preferred embodiment, the retrieval process is enhanced by providing a method and apparatus that supports the use of predefined or user-defined views for augmenting the search criteria based upon the context within which the user is searching.
The organization and retrieval process in this invention is also enhanced through the use of an internally consistent topic vocabulary. Terms and phrases used by the authors of documents or by the user who is searching for documents are translated into this common internal vocabulary, thereby providing for an enhanced organization and search capability while still allowing for alternative choices of words and phrases.
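The translation into a common internal vocabulary can be sketched as a simple lookup, shown below. The mapping table, the internal terms, and the function names `to_internal` and `translate_terms` are hypothetical; a practical system would presumably rely on a much larger thesaurus or a learned mapping.

```python
# Hypothetical table mapping author or user terms to internal topic terms.
CANONICAL = {
    "automobile": "car",
    "auto": "car",
    "car": "car",
    "motor vehicle": "car",
    "physician": "doctor",
    "doctor": "doctor",
}

def to_internal(phrase):
    """Translate one word or phrase to its internal term, or None if unknown."""
    key = " ".join(phrase.lower().split())  # normalize case and spacing
    return CANONICAL.get(key)

def translate_terms(phrases):
    """Translate several phrases, dropping unknown ones and duplicates."""
    internal = (to_internal(phrase) for phrase in phrases)
    return {term for term in internal if term is not None}
```

Because documents and queries are both translated through the same table, a query for "automobile" and a document indexed from the phrase "motor vehicle" meet at the same internal term, "car", even though the author and the user chose different words.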