1. Field of the Invention
The present invention relates generally to Internet search technology and, more specifically, to a system and method for determining the suitability of a search result.
2. Description of the Related Art
FIG. 1 is a system level overview (100) of a prior art distributed computer network within which the invention may be practiced. The World Wide Web (WWW) is comprised of an expansive network (112) of interconnected computers (102a to 102n) upon which businesses, governments, groups, and individuals throughout the world maintain interlinked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers (GUI 104a to 104n). Due to the vast number of WWW sites, many web pages have a redundancy of information or share a strong likeness in either function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines (106a to 106x) located in association with or independent of server hub processing units (110a to 110y) to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
The authors of web pages provide information, known as metadata, within the body of the hypertext markup language (HTML) document that defines each web page. A computer software product known as a web crawler systematically accesses these web pages by sequentially following hypertext links from page to page. The web crawler indexes the pages for use by the search engines using information about a web page as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines then search this repository to identify matches for the user-defined search rather than attempting to find matches in real time.
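By way of illustration only, the crawling and indexing behavior described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular crawler: the names `PageParser`, `crawl`, and `PAGES` are hypothetical, and an in-memory dictionary stands in for the network so that the example is self-contained.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects hyperlinks and meta tag values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]


def crawl(start_url, fetch):
    """Follow links breadth-first and store each page's metadata.

    `fetch(url)` is an assumed callable returning HTML text, or None
    when the page is unavailable.
    """
    repository = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in repository:
            continue  # already indexed during this run
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        repository[url] = parser.meta
        queue.extend(parser.links)
    return repository


# Hypothetical two-page "web" used in place of real HTTP requests.
PAGES = {
    "http://example.com/a": (
        '<html><head><meta name="keywords" content="java, applets"></head>'
        '<body><a href="http://example.com/b">B</a></body></html>'
    ),
    "http://example.com/b": (
        '<html><head><meta name="description" content="Servlet tutorial"></head>'
        '<body><a href="http://example.com/a">A</a></body></html>'
    ),
}

repository = crawl("http://example.com/a", PAGES.get)
```

Here the resulting `repository` dictionary plays the role of the metadata repository: a search engine would query it rather than fetch pages at search time.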
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the user's search terms and returns the search results in the form of HTML pages. Each search result includes a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or “hit” includes a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
In addition to the hyperlink, certain search result pages include a short summary or abstract that describes the content of the URL location. Typically, search engines generate this abstract from the file at the URL and only provide acceptable results for URLs that point to HTML format documents. For URLs that point to HTML documents or web pages, a typical abstract includes a combination of values selected from HTML tags. These values may include text from the web page's “title” tag, from what are referred to as “annotations” or “meta tag values” such as “description”, “keywords”, or their equivalent, from “heading” tag values (e.g., H1, H2 tags), or from some combination of the content of these tags.
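As an illustration of how such an abstract might be assembled from tag values, consider the following minimal sketch. The names `AbstractParser` and `make_abstract`, the choice of tags, and the " | " separator are assumptions made for this example; actual search engines may select and combine tag values differently.

```python
from html.parser import HTMLParser


class AbstractParser(HTMLParser):
    """Captures the title, the meta description, and H1/H2 heading text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.headings = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1", "h2"):
            self._current = tag
            if tag != "title":
                self.headings.append("")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in ("h1", "h2"):
            self.headings[-1] += data


def make_abstract(html):
    """Join the non-empty tag values into a one-line abstract."""
    parser = AbstractParser()
    parser.feed(html)
    parts = [parser.title.strip(), parser.description.strip()]
    parts += [heading.strip() for heading in parser.headings]
    return " | ".join(part for part in parts if part)


# Hypothetical page used only to exercise the sketch.
SAMPLE = (
    "<html><head><title>Java Tips</title>"
    '<meta name="description" content="Short Java tutorials"></head>'
    "<body><h1>Threads</h1><p>text</p><h2>Locks</h2></body></html>"
)
abstract = make_abstract(SAMPLE)
```

Because the sketch reads only HTML tags, it yields an empty abstract for a document that lacks them, which is consistent with the limitation noted above for non-HTML documents.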
However, for one HTML parent page with links to multiple different relevant non-HTML documents that satisfy the user's search criteria, the search result may include multiple identical URLs, one for each relevant non-HTML document. Each of these identical URLs points to the same HTML parent page, and each may include an identical abstract that is descriptive of the parent HTML page. As a result, search results with redundant abstracts can be practically useless, distracting, and time consuming to review.
To alleviate this problem, domain-specific portal sites that act as gateways to very specialized information sources have grown in popularity as the WWW itself has grown in both complexity and volume of data. The term “portal” is generally synonymous with gateway; it is typically used to refer to a WWW site that is intended to be a major starting site or an anchor site for web users. Current leading general purpose portal sites include Yahoo!®, Excite®, Netscape®, Lycos®, Cnet®, and MSN The Microsoft Network®. However, while such portal sites attempt to serve as gateways to a wide variety of general purpose information, specialized portals have also been gaining popularity in recent years.
Specialized portal sites, such as jCentral®, xCentral, or their equivalents, attempt to focus on a particular domain that appeals to a target audience. By limiting the scope of their operation, the belief is that specialized portal sites will be able to present information of greater relevance to their target audience. For example, in a portal site such as jCentral® that caters to users interested in learning more about the Java programming language and related topics, users are allowed to conduct a search by querying the portal database. The portal database is a vast repository of pre-collected, indexed, and summarized information, typically gathered from the WWW using automated crawling tools as described previously. When a user enters a query, the portal's search engine attempts to match the keywords specified by the user with summarized metadata that have been previously extracted from the documents stored in the repository, and then returns an ordered list of potential candidate matches relevant to the user's query.
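The keyword matching step described above can be sketched as follows. This is an illustrative sketch only: the function name `search`, the sample repository contents, and the scoring rule (a simple count of keyword occurrences in the stored metadata) are assumptions for the example and are not the ranking method of jCentral® or any other portal.

```python
def search(repository, query):
    """Return URLs ordered by how often the query keywords occur
    in each document's stored metadata summary."""
    terms = query.lower().split()
    scored = []
    for url, metadata in repository.items():
        words = metadata.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


# Hypothetical repository mapping each URL to its summarized metadata.
REPOSITORY = {
    "http://example.com/threads": "java threads tutorial java concurrency",
    "http://example.com/applets": "java applets guide",
    "http://example.com/cooking": "pasta recipes",
}

results = search(REPOSITORY, "java threads")
```

In this sketch the threads page ranks first because it matches both keywords, the applets page follows with a single match, and the cooking page is omitted because it matches none.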
Typically, the search engine will return a result set for a search query including a URL and a text based abstract of the original resource. Also, users are sometimes able to control the length of the abstract. For instance, the HotBot® site at URL: http://www.hotbot.com, provides the choice of having only a list of URLs displayed as the search result, the URL with a brief abstract, or a comprehensive abstract.
Although the return of search results in a list is useful, it is not intuitive. In particular, there is no means or mechanism that allows a user to compare different search result items and that provides an intuitive GUI for displaying their similarity. Such a comparison would assist a user in deciding whether or not a particular document might be of interest. For example, suppose a user knows the content of a document A and is generally satisfied with its overall content in relation to the issued search query. Another document B, displayed on the same search result page, has a promising title and abstract. However, no additional information is available from the search result page. The user must therefore load document B into a document viewer and scan through its content to determine whether it has properties similar to those of document A, which is a time-consuming process. Accordingly, a need exists for a mechanism to perform this task automatically and conveniently.
A search result set represents just one case in which a similarity comparison is useful. More generally, the problem arises with any list that contains document identifiers but no information on whether or not the identified documents are similar. Accordingly, a need exists for a method and system for comparing the similarity between two or more documents.
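To make the notion of a similarity comparison concrete, the following minimal sketch scores two documents by the cosine similarity of their term-frequency vectors. This is a generic, well-known measure chosen here purely for illustration; it is an assumption of this example and is not presented as the method of the present invention. The names `term_counts` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphanumeric terms in a document's text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return a score in [0, 1]; 0.0 means no terms are shared."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical document texts standing in for documents A and B.
doc_a = "Java threads and locks tutorial"
doc_b = "Tutorial on Java threads"
doc_c = "Pasta recipes for beginners"

score_ab = cosine_similarity(doc_a, doc_b)
score_ac = cosine_similarity(doc_a, doc_c)
```

A score near 1 suggests that two documents use largely the same vocabulary, while a score of 0 indicates no shared terms, so a user could be shown such a score instead of having to open and scan each document.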
Other prior art solutions for comparing the similarity between two or more documents include those offered by Google (http://www.google.com), which provides a search for similar pages that uses a search result item as the search argument input. However, this approach does not have the flexibility to permit arbitrary user-selected documents to be compared for similarity. Accordingly, a need exists for a method and system for comparing the similarity between two or more documents.