1. Field of the Invention
The invention relates to document labeling and, more particularly, to a system and method for assigning labels to an unknown document based on keywords used in labeling related documents.
2. Description of the Related Art
The World Wide Web (“WWW”) is a distributed database including literally billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is a constant challenge. A device typically used to search the WWW is a search engine. A typical prior art search engine 50 is shown in FIG. 1. Pages from the Internet or other source 100 are accessed through the use of a crawler 102. Crawler 102 aggregates documents from source 100 to ensure that these documents are searchable. Many algorithms exist for crawlers and, in most cases, these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 102 are stored in a database 108. Thereafter, these documents are indexed by an indexer 104. Indexer 104 builds a searchable index of the documents in database 108. Typical prior art methods for indexing include inverted files, vector spaces, suffix structures, and hybrids thereof. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations. A primary index of the whole database 108 is then broken down into a plurality of sub-indices and each sub-index is sent to a search node in a search node cluster 106.
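By way of illustration only, the word-and-location indexing described above may be sketched as follows. This is a minimal sketch, not the indexing method of any particular search engine. The function names `build_positional_index` and `split_by_document`, the sample pages, and the choice to divide the primary index by document in round-robin order are all assumptions made for this example; the description does not state how the primary index is divided.

```python
from collections import defaultdict


def build_positional_index(pages):
    """Map each word to {doc_id: [positions of the word in that page]}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        # Naive tokenization on whitespace; real indexers are more elaborate.
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


def split_by_document(index, num_nodes):
    """Divide a primary index into one sub-index per search node.

    Assumption: documents are assigned to nodes round-robin in sorted order.
    """
    doc_ids = sorted({d for postings in index.values() for d in postings})
    node_of = {d: i % num_nodes for i, d in enumerate(doc_ids)}
    sub_indices = [defaultdict(dict) for _ in range(num_nodes)]
    for word, postings in index.items():
        for doc_id, positions in postings.items():
            sub_indices[node_of[doc_id]][word][doc_id] = positions
    return [dict(sub) for sub in sub_indices]


# Hypothetical sample pages, used only to exercise the sketch.
pages = {
    "doc1": "new car prices",
    "doc2": "used car reviews and car ratings",
}
primary = build_positional_index(pages)
subs = split_by_document(primary, 2)
```

In this sketch, each sub-index holds the complete postings for its own subset of documents, so a search node can answer a query for its subset without consulting the other nodes.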
In use, a user 112 sends a search query to a dispatcher 110. Dispatcher 110 compiles a list of search nodes in cluster 106 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 106 search respective parts of the primary index produced by indexer 104 and return sorted search results along with a document identifier and a score to dispatcher 110. Dispatcher 110 merges the received results to produce a final result set displayed to the user 112 sorted by relevance scores. The relevance score is a function of the query itself and the type of document produced. Factors that are used for relevance include: a static relevance score for the document, such as link cardinality and page quality; superior parts of the document, such as titles, metadata and document headers; authority of the document, such as external references and the “level” of the references; and document statistics, such as query term frequency in the document, global term frequency, and term distances within the document.
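By way of illustration only, the merging step performed by the dispatcher may be sketched as follows. This is a minimal sketch under stated assumptions: each search node is assumed to return a list of (score, document identifier) pairs already sorted by descending score, and the function name `merge_results`, the `limit` parameter, and the sample scores are hypothetical. How the scores themselves are computed is outside the scope of this example.

```python
import heapq
from itertools import islice


def merge_results(node_results, limit=10):
    """Merge per-node result lists into one list sorted by descending score.

    Each element of node_results is a list of (score, doc_id) tuples that is
    already sorted from highest to lowest score.
    """
    merged = heapq.merge(*node_results, key=lambda r: r[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical results returned by two search nodes.
node_a = [(0.9, "doc3"), (0.4, "doc1")]
node_b = [(0.7, "doc2"), (0.2, "doc5")]
final = merge_results([node_a, node_b], limit=3)
```

Because every node returns its results already sorted, the dispatcher in this sketch only needs to interleave the lists rather than sort the combined set again.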
Referring to FIG. 2, there is shown an example of a result set 120. As shown in the figure, in response to a query 126 for the term “car” shown on the top of the page, the search engine YAHOO! searched its index and produced a plurality of results in the form of result set 120 displayed to a user. For brevity, only a first page of result set 120 is shown. Result set 120 includes four results 122a, 122b, 122c, and 122d, each with a respective hyperlink 124a, 124b, 124c, and 124d and a respective address or URL 128a, 128b, 128c, and 128d for documents that satisfy the user's query. Focusing on result 122a, result 122a includes hyperlink 124a, including anchor text (“cars.com”) describing the hyperlink, and address 128a, where the user can find the respective document. Hyperlink 124a, when selected or clicked on by the user, instructs the user's browser to request the document from the web site associated with address 128a. For example, if a user selects hyperlink 124b, the user's browser will request information from the web site at the address on the WWW “edmunds.com”.
It is desirable to summarize the content of a document by, for example, labeling the document. These labels may be used to provide a user with alternative search query terms or may be used for mapping other types of data such as mapping a specific source document into a more general category. Prior art methods have so far been unable to effectively label a document in a timely manner.
Thus, there is a need in the art for a system and method which can timely determine labels for a web page.