The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these web pages by means of computer software programs commonly known as Internet browsers. Because of the vast number of WWW web sites, many web pages contain redundant information or closely resemble one another in function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of the retrieved information to a user-defined search.
The authors of web pages provide information known as metadata within the body of the document that defines each web page. This document is typically written in, for example, hypertext markup language (HTML). A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links (hyperlinks) from web page to web page. The crawler indexes the web pages for use by the search engines, using information about a web page as provided by its address or Uniform Resource Locator (URL), its metadata, and other criteria found within the web page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempting to find matches in real time.
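By way of illustration only, the crawling and indexing behavior described above can be sketched in a few lines of Python. The sketch is a minimal example under stated assumptions, not a description of any particular crawler: the names PageParser, crawl, and TOY_WEB are hypothetical, and the "web" is an in-memory dictionary that stands in for network access so that the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects <meta name=... content=...> pairs and <a href=...> targets."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if tag == "meta" and name and content is not None:
            self.metadata[name.lower()] = content
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def crawl(start_url, fetch, max_pages=100):
    """Follow hyperlinks breadth-first from start_url.

    `fetch` is an assumed callable that returns the markup for a URL,
    or None when the page cannot be retrieved.
    Returns a repository mapping each URL to its metadata and out-links.
    """
    repository = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        markup = fetch(url)
        if markup is None:
            continue  # unreachable page: nothing to index
        parser = PageParser()
        parser.feed(markup)
        links = [urljoin(url, href) for href in parser.links]
        repository[url] = {"metadata": parser.metadata, "links": links}
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository


# Hypothetical two-page web site used in place of real network access.
TOY_WEB = {
    "http://example.com/": (
        '<html><head><meta name="keywords" content="home, demo"></head>'
        '<body><a href="/about.html">About</a></body></html>'
    ),
    "http://example.com/about.html": (
        '<html><head><meta name="description" content="About this demo"></head>'
        '<body><a href="/">Home</a> <a href="/missing.html">Gone</a></body></html>'
    ),
}

repository = crawl("http://example.com/", TOY_WEB.get)
```

Running the crawl periodically and merging the returned dictionary into a persistent store would correspond to the update step described above; a production crawler would additionally need politeness delays, robots.txt handling, and error recovery, none of which are shown here.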
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through the indexed web sites for the search terms and returns the search results in the form of web pages in, for example, HTML. Each set of search results comprises a list of individual entries that have been identified by the search engine as satisfying the search expression. Each entry or “hit” comprises a hyperlink that points to the Uniform Resource Locator (URL) of a web page.
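As a hedged sketch of the lookup just described, the following Python example builds an inverted index over a small repository, matches a search expression against it, and renders the hits as an HTML list of hyperlinks. The names REPOSITORY, build_index, search, and render_results are hypothetical, the repository contents are invented for illustration, and requiring every term to match (AND semantics) is an assumption; real search engines apply far more elaborate relevance ranking.

```python
import html
import re
from collections import defaultdict

# Hypothetical repository: URL -> text previously compiled by a crawler.
REPOSITORY = {
    "http://example.com/": "home page for a demo web site",
    "http://example.com/about.html": "about this demo site and its authors",
    "http://example.org/news.html": "news about web crawlers",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(repository):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in repository.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, expression):
    """Return the sorted URLs that contain every term in the expression."""
    terms = tokenize(expression)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


def render_results(hits):
    """Render each hit as a hyperlink inside an HTML list."""
    items = "".join(
        f'<li><a href="{html.escape(url, quote=True)}">{html.escape(url)}</a></li>'
        for url in hits
    )
    return f"<ul>{items}</ul>"


INDEX = build_index(REPOSITORY)
hits = search(INDEX, "demo site")
page = render_results(hits)
```

Because the query is answered from the prebuilt index, no web page is fetched at search time, which mirrors the repository-based matching described earlier.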
As new information access technologies have become available, the nature of information work has changed. Search engines have transformed most Web tasks into reference tasks: the current expectation is that a task begins with a search query and ends with a suitable result web page. This mode of operation is simple, but it is being overwhelmed by the volume of web pages competing for attention. Ranked result lists have difficulty scaling to the size of the current Web.
Previous systems that provide summaries of web sites fall into two classes: broad-coverage and narrow-coverage. Alexa®, a subsidiary of Amazon®, exemplifies the broad-coverage approach. Alexa® monitors browsing traffic and computes a “Traffic Rank” for all web sites. Alexa® also displays ownership information from the “whois” database, together with user reviews. Alexa® further displays, in an unstructured list, the web sites that link to a summarized web site. Although this technology has proven to be useful, it would be desirable to present additional improvements. To avoid computational expense, Alexa® makes no attempt to summarize the contents or structures within a web site.
A variety of Web monitoring tools exist that address the narrow-coverage approach to web site summarization. These Web monitoring tools are primarily concerned with detecting changes in a web site. As an example, ChangeDetector® applies machine-learning techniques to detect subtle changes in web sites. Clustering is used to visualize comparisons between web sites [reference is made to L. Y. Bing Liu, et al., “Visualizing web site comparisons”, In Proceedings of the 11th International World Wide Web Conference (WWW 2002), pages 693-703, 2002]. Although this technology has proven to be useful, it would be desirable to present additional improvements. Typically, Web monitoring tools monitor only a specific and targeted set of web sites (or URLs) rather than all web sites crawled on the WWW.
What is needed is a method for providing as much useful information as possible about a web site while keeping the data volumes and processing requirements feasible. WebTOC illustrates the richness that can be added to a directory structure display by showing not only web page counts but also breakdowns by media file type and file size [reference is made to D. Nation, et al., “Visualizing web sites using a hierarchical table of contents browser: WebTOC”, In Designing for the Web: Practices and Reflections, 1997]. Mappucino is another web site mapping tool that can display topic-customized maps using graph layout algorithms [reference is made to Y. Maarek, et al., “A system for dynamic and tailorable web site mapping”, In Proceedings of the 6th International World Wide Web Conference, 1997].
The Relation Browser provides an interactive table of contents for a web site [reference is made to G. Marchionini, et al., “Toward a general relation browser: A GUI for information architects”, In Journal of Digital Information, volume 4, 2003]. The Relation Browser allows a user to “slice” the available content on a web site by various topic categories and displays the number of matching documents of each type for the specified topic. Although this technology has proven to be useful, it would be desirable to present additional improvements. These systems crawl web sites only on an as-needed basis and do not allow users to browse interactively from web site to web site.
Web site summaries have previously been provided on a web site-by-web site basis to track changes on web sites or to view specialized information. Large-scale summarization has been provided only for web traffic summaries. No method currently exists for providing a rapid overview or summary of a web site or collection of web pages that allows a user to browse interactively from web site to web site. What is therefore needed is a system, a service, a computer program product, and an associated method for interactively presenting a summary of a web site. The need for such a solution has heretofore remained unsatisfied.