The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages contain redundant information or closely resemble one another in function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
The authors of web pages provide information known as metadata within the body of the hypertext markup language (HTML) document that defines the web pages. A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links from page to page. The crawler indexes the pages for use by the search engines using information about a web page as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempt to find matches in real time.
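The crawl-and-index cycle described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the names `PageParser`, `crawl`, `search`, and the in-memory `PAGES` dictionary (which stands in for real network fetches) are all assumptions introduced for the example.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects hyperlinks, the title, and meta tag values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(fetch, start_url):
    """Follow links page to page from start_url and build a metadata repository."""
    repository = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not reachable; nothing to index
            continue
        parser = PageParser()
        parser.feed(html)
        repository[url] = {"title": parser.title.strip(), "meta": parser.meta}
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository


def search(repository, keyword):
    """Match the keyword against stored URL, title, and metadata, not live pages."""
    keyword = keyword.lower()
    hits = []
    for url, record in repository.items():
        text = " ".join([url, record["title"], *record["meta"].values()]).lower()
        if keyword in text:
            hits.append(url)
    return hits


# Hypothetical two-page "web" used in place of HTTP requests.
PAGES = {
    "http://example.com/": (
        '<html><head><title>Home</title>'
        '<meta name="description" content="Start page"></head>'
        '<body><a href="http://example.com/about">About</a></body></html>'
    ),
    "http://example.com/about": (
        '<html><head><title>About</title></head>'
        '<body><a href="http://example.com/">Home</a></body></html>'
    ),
}

repository = crawl(PAGES.get, "http://example.com/")
```

Because `search` consults only the stored repository, its answers reflect the pages as they were when `crawl` last ran, which is the property the later paragraphs on outdated abstracts turn on.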
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the user's search terms, and returns the search results in the form of HTML pages. Each search result includes a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or “hit” includes a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
In addition to the hyperlink, certain search result pages include a short summary or abstract that describes the content of the URL location. Typically, search engines generate this abstract from the file at the URL, and only provide acceptable results for URLs that point to HTML format documents. For URLs that point to HTML documents or web pages, a typical abstract includes a combination of values selected from HTML tags. These values may include text from the web page's “title” tag, from what are referred to as “annotations” or “meta tag values” such as “description,” “keywords,” etc., from “heading” tag values (e.g., H1, H2 tags), or from some combination of the content of these tags.
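As an illustration of how such an abstract might be assembled from tag values, the sketch below combines the title, the “description” meta tag, and H1/H2 headings into one short string. The names `AbstractParser`, `build_abstract`, and `SAMPLE_HTML`, the `" | "` separator, and the 200-character limit are assumptions made for the example; actual search engines select and weight these values in their own ways.

```python
from html.parser import HTMLParser


class AbstractParser(HTMLParser):
    """Collects the title, selected meta tag values, and H1/H2 heading text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.headings = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords") and attrs.get("content"):
                self.meta[name] = attrs["content"]
        elif tag in ("title", "h1", "h2"):
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in ("h1", "h2"):
            self.headings[-1] += data


def build_abstract(html, max_len=200):
    """Join title, meta description, and headings into a short abstract."""
    parser = AbstractParser()
    parser.feed(html)
    parts = [parser.title.strip(), parser.meta.get("description", "")]
    parts += [heading.strip() for heading in parser.headings]
    return " | ".join(part for part in parts if part)[:max_len]


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = (
    '<html><head><title>Widget Catalog</title>'
    '<meta name="description" content="Browse widgets by size.">'
    '<meta name="keywords" content="widgets, catalog"></head>'
    '<body><h1>Overview</h1><p>Body text.</p><h2>Sizes</h2></body></html>'
)

abstract = build_abstract(SAMPLE_HTML)
```

A file with none of these tags, such as a PostScript or word processing document, gives this approach nothing to work with, so the abstract comes back empty.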
However, for a non-HTML document, such as a PostScript file or a word processing document that otherwise satisfies the search criteria, the search engines typically do not return a URL for the document itself, but instead point to a directory or to an HTML page, which, in turn, refers to the non-HTML document. As a result, for a non-HTML document, the search results include links to, and descriptions of, pages that point to this non-HTML document rather than a description of the non-HTML document itself.
Moreover, for one HTML parent page with links to multiple different relevant non-HTML documents that satisfy the user's search criteria, the search result may include multiple identical URLs, one for each relevant non-HTML document. Each of these identical URLs points to the same HTML parent page, and each may include an identical abstract that is descriptive of the parent HTML page. As a result, the search yields redundant abstracts that can be practically useless, distracting, and time-consuming to review.
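The redundancy can be seen in a small sketch. Here three non-HTML documents match a query, but each hit is reported through the same parent page. The data in `PARENT` and `DOCUMENTS` and the function `search_documents` are hypothetical, chosen only to reproduce the effect described above.

```python
# Hypothetical parent HTML page and the non-HTML documents it links to.
PARENT = {
    "url": "http://example.com/reports.html",
    "abstract": "Reports index page",
}
DOCUMENTS = [
    {"url": "http://example.com/q1.ps", "text": "quarterly sales report"},
    {"url": "http://example.com/q2.ps", "text": "quarterly sales forecast"},
    {"url": "http://example.com/memo.doc", "text": "sales memo"},
]


def search_documents(keyword):
    """Return one (url, abstract) hit per matching document.

    Each hit carries the parent page's URL and abstract, because no
    abstract is available for the non-HTML document itself.
    """
    hits = []
    for doc in DOCUMENTS:
        if keyword in doc["text"]:
            hits.append((PARENT["url"], PARENT["abstract"]))
    return hits


results = search_documents("sales")
```

All three entries in `results` are identical, so the user cannot tell from the result list which document each entry stands for.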
An additional challenge that dilutes the efficacy of searches is the dynamic, i.e., continuously changing, nature of the web pages and the pages they point to, and the inability of the crawlers to efficiently update the data and metadata contained in the web pages and in the pages pointing to them. The time lag between the generation of the metadata by the web crawlers and the update of the actual data in the web pages has heretofore presented an insurmountable problem for the rendering of accurate abstracts. In a conventional search engine search, the results have been based on metadata in the search engine's repository rather than on up-to-date data recently published on the web.
Oftentimes users are presented with outdated search abstracts even though up-to-date information is already available on the web. As an example, an actual search conducted on Jan. 19, 2000 using the keyword “lawyer” and the Alta Vista search engine revealed an abstract pointing to the Martindale-Hubbell Lawyer Locator URL. The copyright notice in the abstract read “1996–1999”. However, a visit to the Martindale-Hubbell Lawyer Locator URL showed a copyright notice that read “1996–2000”, clearly indicating a disparity between the metadata in the search abstract and the data in the actual web site.
There is currently no adequate mechanism by which search engines automatically generate accurate and dynamic abstracts, and the need for such a mechanism has heretofore remained unsatisfied.