A large document database is a collection of many documents (e.g., reports, articles, memos, books) stored electronically as files on one or more computers. Users access the database to locate documents of interest and retrieve those documents for further processing. Finding documents of interest by inspecting every document in the database is impractical. Instead, a search system is used to locate relevant documents. A search system allows a user to express an information need in the form of a query. The system's search engine processes the query and returns to the user a hit-list of relevant documents. The user then selects interesting documents from the hit-list and retrieves those documents.
Users typically want to search the document database based on the content of the documents. This is accomplished using an information retrieval (IR) system. See Salton and McGill, "Introduction to Modern Information Retrieval", McGraw-Hill, N.Y., 1983; and Frakes and Baeza-Yates, "Information Retrieval: Data Structures & Algorithms", Prentice Hall, Englewood Cliffs, N.J., 1992, which are herein incorporated by reference in their entirety. An IR system identifies relevant documents by matching the information need described by the query with the information content of the documents in the database. A query can be constructed in a variety of ways. Free-text queries contain natural language sentences or phrases. Structured queries consist of terms combined with operators (e.g., Boolean, proximity). Example queries are entire documents that serve as examples of the desired information.
The information content of the documents is identified at indexing time when the search system processes the documents to build an index. One index commonly used by IR systems is an inverted file. An inverted file contains an inverted list for every term used in the document database. A term is any word or vocabulary item identified in a document during indexing. An inverted list identifies the documents that contain the corresponding term. A document entry in an inverted list may additionally contain a term weight (e.g., the number of times the term occurs in the document) and/or the location of each occurrence of the term in the document (e.g., paragraph, sentence, word offset).
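For illustration, an inverted file of the kind described above can be sketched in a few lines of Python. This is a minimal sketch and not the indexer of any particular IR system: the function name build_inverted_file, the sample documents, and the rule that a term is any run of lowercase letters or digits are all assumptions made for the example. Each inverted list maps a document identifier to the word offsets at which the term occurs, and the number of offsets serves as a simple term weight.

```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    """Return {term: {doc_id: [word offsets]}} for a {doc_id: text} mapping."""
    inverted_file = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: a term is any run of letters or digits, lowercased.
        terms = re.findall(r"[a-z0-9]+", text.lower())
        for offset, term in enumerate(terms):
            # Record the location of each occurrence of the term.
            inverted_file[term].setdefault(doc_id, []).append(offset)
    return dict(inverted_file)


# Hypothetical two-document database.
documents = {
    "doc1": "The index maps each term to documents.",
    "doc2": "An inverted list identifies documents; an index holds many lists.",
}
inverted_file = build_inverted_file(documents)

# The term weight used here is simply the occurrence count in the document.
for doc_id, offsets in inverted_file["index"].items():
    print(doc_id, "weight:", len(offsets), "offsets:", offsets)
```

Because no stemming is applied in this sketch, "list" and "lists" receive separate inverted lists; a production indexer would typically normalize terms further.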
The actual content of the index depends on the similarity algorithm used by the search engine. During query processing, the search engine obtains information from the index based on the query, processes the information according to its similarity algorithm, and generates a hit-list. The hit-list identifies the documents deemed relevant to the query. Each entry on the hit-list uniquely identifies the corresponding document and may be supplemented with one or more of the document's attributes. Document attributes include items such as title, author, creation date, length, and location. These are identified at indexing time and stored in a document catalog.
In addition to identifying which documents should appear on the hit-list, many systems calculate a relevance score for each document and rank the hit-list in decreasing order of relevance. The relevance score may be viewed as another document attribute, although it is calculated at query processing time and applicable only to the current query.
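The query processing and ranking steps described above might be sketched as follows. The similarity algorithm shown, which sums the term frequencies of the query terms, is only an assumed stand-in chosen for brevity; real search engines use more elaborate weighting. The names INDEX, CATALOG, and search, and all of the sample data, are hypothetical.

```python
# Hypothetical inverted file: term -> {doc_id: term frequency}.
INDEX = {
    "retrieval": {"d1": 3, "d2": 1},
    "index": {"d2": 2, "d3": 4},
    "query": {"d1": 1},
}

# Hypothetical document catalog holding attributes found at indexing time.
CATALOG = {
    "d1": {"title": "IR basics"},
    "d2": {"title": "Building indexes"},
    "d3": {"title": "Index compression"},
}


def search(query, index=INDEX, catalog=CATALOG):
    """Return a hit-list ranked in decreasing order of relevance score."""
    scores = {}
    for term in query.lower().split():
        # Obtain the inverted list for each query term from the index.
        for doc_id, frequency in index.get(term, {}).items():
            # Assumed similarity algorithm: sum of term frequencies.
            scores[doc_id] = scores.get(doc_id, 0) + frequency
    # Sort by descending score; ties are broken by document identifier.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # The score is attached alongside the catalog attributes, but it is
    # valid only for the current query.
    return [{"id": d, "score": s, **catalog.get(d, {})} for d, s in ranked]


hit_list = search("retrieval index")
for hit in hit_list:
    print(hit["id"], hit["score"], hit["title"])
```

Note that the relevance score is computed at query time and attached to each hit-list entry, whereas the title comes from the catalog built at indexing time.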
In a networking environment, the components of a document database system may be spread across multiple computers. A computer comprises a Central Processing Unit (CPU), main memory, disk storage, and software (e.g., a personal computer (PC) like the IBM ThinkPad). A networking environment consists of two or more computers connected by a local or wide area network (e.g., Ethernet, Token Ring, and the Internet). (See, for example, U.S. Pat. No. 5,371,852 to Attanasio et al., issued on Dec. 6, 1994, which is herein incorporated by reference in its entirety.) A user accesses the document database using a client application on the user's computer. The client application communicates with a search server (the document database search system) on either the user's computer (e.g., a client) or another computer (e.g., a server) on the network. To process queries, the search server needs to access only the database index, which may be located on the same computer as the search server or on yet another computer on the network. The actual documents in the database may be located on any computer on the network.
A Web environment, such as the World Wide Web, is a networking environment where Web servers and browsers (e.g., Netscape and WebExplorer) are used. Users can make documents publicly available in a Web environment by registering the documents with a Web server. Other users in the Web environment can then retrieve these documents using a Web browser. The collection of documents retrievable in a Web networking environment can be viewed as a large document database.
To create an index for such a document database so that it may be searched, the prior art often uses Web wanderers, also called robots, spiders, crawlers, or worms (e.g., WebCrawler, WWWWorm), to gather the available documents and submit them to the search system indexer. Web wanderers make use of hypertext links stored in documents. A hypertext link is a reference to another document stored in the Web. All of the documents are gathered by identifying a few key starting points, retrieving those documents for indexing, retrieving and indexing all documents referenced by the documents just indexed (via hypertext links), and continuing recursively until all documents reachable from the starting points have been retrieved and indexed. The graph of documents in a Web environment is typically well connected, such that nearly all of the available documents can be found when appropriate starting points are chosen.
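The gathering procedure described above can be sketched as a traversal of the hypertext link graph. The following minimal sketch uses an in-memory dictionary in place of real network retrieval, and it replaces the recursion with an equivalent iterative breadth-first traversal; the WEB data, the URLs, and the crawl function are hypothetical and are not taken from any particular Web wanderer.

```python
from collections import deque

# Hypothetical Web: URL -> (document text, hypertext links in the document).
WEB = {
    "http://a.example/": ("start page", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://c.example/", "http://a.example/"]),
    "http://c.example/": ("page c", ["http://d.example/"]),
    "http://d.example/": ("page d", []),
    "http://e.example/": ("unlinked page", ["http://a.example/"]),
}


def crawl(starting_points, fetch):
    """Gather every document reachable from the starting points."""
    frontier = deque(dict.fromkeys(starting_points))  # drop duplicate starts
    seen = set(frontier)
    gathered = []
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue  # the link points to a document that is not available
        text, links = page
        gathered.append((url, text))  # would be submitted to the indexer
        for link in links:
            if link not in seen:  # retrieve each document only once
                seen.add(link)
                frontier.append(link)
    return gathered


gathered = crawl(["http://a.example/"], WEB.get)
for url, text in gathered:
    print(url, "->", text)
```

In this sample graph, the document at http://e.example/ is never gathered because no document reachable from the starting point links to it, which illustrates why the choice of starting points matters.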
Once all of the documents available in the Web environment have been gathered and indexed, the index can be used, as described above, to search for documents in the Web. Again, the index may be located independently of the documents, the client, and even the search server. A hit-list, generated as the result of searching the index, will typically identify the locations of the relevant documents on the Web (e.g., with hypertext links, which can be document attributes), and the user will retrieve those documents directly with a Web browser.