This invention relates to client-server systems and methods for obtaining Web related content from one or more servers and presenting that content to a user. More particularly, this invention further relates to client-side software and devices that facilitate delivery and presentation of the Web content.
Public networks, and most notably the Internet, are emerging as a primary conduit for communications, entertainment, and business services. The Internet is a network formed by the cooperative interconnection of computing networks, including local and wide area networks. It interconnects computers from around the world with existing and even incompatible technologies by employing common protocols that smoothly integrate the individual and diverse components.
The Internet has recently been popularized by the overwhelming and rapid success of the World Wide Web (WWW or Web). The Web links together various topics in a complex, non-sequential web of associations which permit a user to browse from one topic to another, regardless of the presented order of topics. The Web is rapidly evolving as a standard for distributing, finding, and accessing information of any type. A "Web browser" is an application that executes on the user's computer to navigate the Web. The Web browser allows a user to retrieve and render hypermedia content from the WWW, including text, sound, images, video, and other data.
The amazing growth rate in the demand for data over the Internet is partly due to an increasing audience. The World Wide Web has crossed the threshold that makes it affordable and interesting to a much larger audience. There is information available on a very wide variety of topics, and tools exist to help people find and view the information cost effectively.
Another factor fueling the Internet growth is the exploding amount of information that is now available on the Web. The Web has grown from thousands of Web sites to several million Web sites in a very short period of time. The growth continues at an exponential rate. Many corporations and libraries are translating paper and microfilm information archives to electronic media that is published via the Web or similar network. While this has resulted in a wealth of information that is now available to virtually anyone, the information is poorly organized and the sheer volume of the information makes it hard for a typical person to sort through, find, and retrieve specific information.
The shift from paper published media to online media also created a new problem. People wishing to access Web information are limited to accessing it only when connected to the Internet or other network. Network connectivity is largely restricted to a physical wire connection to the computer, or a virtual connection to wireless transmission networks. This makes it hard, if not impossible, to disconnect the computer from the network and still access information.
As more information is brought online, the computational and network resources needed to categorize, search, personalize, and retrieve the information are placing new demands on the existing client-server infrastructure that makes up networks like the Web. Additionally, the data demands are affected by a trend for Web sites to evolve from serving pure text to serving richer media content, including graphics, sound, and video. Adding richer media content is popular because it presents information more clearly and attractively, thereby enhancing a site's impact and popularity.
Due to these emerging factors, a significant problem facing the continued growth and acceptance of the Internet is that conventional methods for accessing the Web do not scale well to meet the rapid growth in supply and demand, or to satisfy the need for better organization. The quality of service for the Web is intuitively measured by the user as the amount of time it takes to search, find, request, and receive data from the Web. Internet users have been conditioned through their experiences with television and standalone multimedia applications to expect instantaneous results on demand. Users are accustomed to changing the TV channel and instantaneously viewing the video content for that channel on the screen. Unfortunately, the Internet is unable to deliver data instantaneously. For the most part, the Internet has significant latency problems that reduce fairly routine Web browsing exercises to protracted lessons in patience.
The basic dilemma is that the quality of service degrades as more people try to use the Web. More unsettling is the corollary that service for popular Web sites is typically much worse than service for unpopular sites. There are several causes of the service problem, including overburdened servers and slow distribution networks.
Networks often have too little bandwidth to adequately distribute the data. "Bandwidth" is the amount of data that can be moved through a particular network segment at any one time. The Internet is a conglomerate of different technologies with different associated bandwidths. Distribution over the Internet is usually constrained by the segment with the lowest available bandwidth.
In the consumer market, for example, most clients typically connect to the Internet via a local modem connection to an Internet Service Provider (ISP). This connection generally enables a maximum data rate of 14.4 Kbps (kilobits per second) to 28.8 Kbps. Some clients might employ an ISDN connection, which facilitates data flow in the range of 128-132 Kbps.
The ISP connects to the primary distribution network using a higher bandwidth pipeline, such as a T1 connection that can facilitate a maximum data flow of approximately 1.5 Mbps. This bandwidth is available to serve all of the clients of the ISP so that each client can consume a 14.4 Kbps, 28.8 Kbps, or 128 Kbps slice of the 1.5 Mbps bandwidth. As more clients utilize the ISP services, however, there is less available bandwidth to satisfy the subscriber requests. If too many requests are received, the ISP becomes overburdened and is not able to adequately service the requests in a timely manner, causing frustration to the users.
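The sharing arithmetic in the preceding paragraph can be worked through as a rough sketch. The rates are the ones quoted in the text; the function name is hypothetical, "approximately 1.5 Mbps" is taken as 1500 Kbps, and protocol overhead is ignored.

```python
# Illustrative arithmetic only. The function name is hypothetical and
# protocol overhead is ignored.

def max_full_rate_clients(uplink_kbps: float, client_kbps: float) -> int:
    """Return how many clients can each draw client_kbps from the uplink at once."""
    return int(uplink_kbps // client_kbps)


T1_KBPS = 1500.0  # "approximately 1.5 Mbps" in the text, assumed to be 1500 Kbps

if __name__ == "__main__":
    for rate_kbps in (14.4, 28.8, 128.0):
        count = max_full_rate_clients(T1_KBPS, rate_kbps)
        print(f"{rate_kbps} Kbps clients served at full rate: {count}")
```

Under these assumptions a single T1 link serves about 104 clients at 14.4 Kbps, 52 at 28.8 Kbps, or 11 at 128 Kbps before any subscriber receives less than the full modem or ISDN rate.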
Couple this problem with the fact that clients typically go underutilized. While servers are pushed to their maximum output limits, clients often sit idle for many hours per day.
Because the bandwidth issue is constrained by technology development in the physical network architecture, early attempts to solve these problems focused on organizing the Web content in some manner to better facilitate search and retrieval. This in turn enabled users to more quickly access information on the Internet, even though the underlying physical architecture remained the same.
The earliest solutions involve organizing the information by hand. Humans review information by browsing the Internet and assemble large lists of documents containing similar information. The lists are further organized into hierarchies of categorized content. People can view the categorized lists online in an attempt to more quickly obtain a specific piece of information. The advantage of this scheme is that human reviewers are very good at categorizing the information and discarding low-value documents, so the lists of categorized information contain fairly high value information. Some hand-categorized data schemes are organized into popular Web sites. The best known example of this is the "Yahoo!" Web site.
The disadvantage of this human-driven technique is that it becomes more difficult to keep up when the amount of information grows exponentially. The categorized lists are frequently out of date or inadequate. Additionally, the method requires a user to be connected to the network to view the information.
Another approach is to use massive search engines that automatically retrieve documents on the Web and attempt to index all of the information. The technique of fetching this information is known as "web-crawling" or "web-scraping". Heuristic document categorization algorithms index the information and store the indices (but not the information) in large centralized databases. Users run queries against the massive databases to find specific information, and then retrieve the information from individual web-sites. Popular examples of these types of Web based services include Lycos, InfoSeek, Alta-Vista, and others. They are generally referred to as "Search Sites" or "Internet Search Engines".
The advantage of web-crawling and indexing is that computers can automate the process of retrieving and reviewing documents. The speed of computers means that a larger number of documents can be compiled as compared to human efforts. The disadvantage is that the computers have a hard time distinguishing between valuable information and worthless information, and are not very good at categorizing the information. Also, these types of databases are centralized and require an end user to be online to make queries against the database.
A third approach to solving the information glut problem is to employ information services that collect and editorialize information that they deem as important. The information is indexed and placed into a centralized database. The services utilize a combination of humans to collect and categorize information, and computers to perform automated information collection. Because these systems effectively filter down the amount of potential information by many orders of magnitude, it is possible to locally store portions of the centralized database on the client server and for the user to view the information when disconnected.
The most popular example of this type of system is PointCast. PointCast collects news articles from many sources, edits them down to a predefined maximum length, categorizes them, and stores them in a centralized database at their data center. Client software then queries the centralized database to obtain the portions of the data in which the user is interested.
The disadvantage of these systems is that a centralized database scales poorly as more and more users attempt to retrieve information. By centralizing all information, the data source becomes a choke point for information flow. Another disadvantage is that while some of these centralized information services provide a good selection of information for users, the information is dramatically more restricted in comparison to the vast wealth of information available on the Web. Users are restricted to these service-selected information categories.
Accordingly, there remains a need to develop improved techniques for facilitating distribution of Web content over the Internet.
This invention concerns a client-based system that improves gathering and organizing of Web content in a manner that mitigates impact on overburdened servers and slow networks. The client-based system enables personalized filtering to collect only that content which the individual user prefers, while rejecting unwanted content. Moreover, the system enables the user to work offline from the server with similar functionality to online operation.
According to one aspect of this invention, the client-based system has a scheduling subsystem to schedule a time to obtain the Web content from the server. When the client reaches the scheduled time, the scheduling subsystem generates an event notification that contains sufficient information explaining how to retrieve the Web content. As an example, the event notification might contain a URL (uniform resource locator) that the client uses to go out and fetch the Web content. The event notification might alternatively contain a reference to a multicast address or a broadcast transmission frequency to which the client listens or tunes to retrieve the desired Web content.
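One way the scheduling behavior just described could be modeled is sketched below. This is a minimal illustration rather than the claimed implementation: the class names, the `transport` labels, and the use of a priority queue are all assumptions introduced for the example.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple


@dataclass(frozen=True)
class EventNotification:
    """Hypothetical record telling the client how to retrieve content."""
    transport: str  # assumed labels: "url", "multicast", or "broadcast"
    locator: str    # a URL, a multicast address, or a broadcast frequency


class Scheduler:
    """Keeps scheduled retrievals ordered by time and releases those that are due."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, EventNotification]] = []
        self._seq = count()  # tie-breaker so notifications are never compared

    def schedule(self, when: float, notification: EventNotification) -> None:
        heapq.heappush(self._heap, (when, next(self._seq), notification))

    def due(self, now: float) -> List[EventNotification]:
        """Pop and return every notification whose scheduled time has arrived."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, _, notification = heapq.heappop(self._heap)
            fired.append(notification)
        return fired
```

A client loop would call `due()` with the current time and pass each returned notification to whatever component performs the retrieval.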
The client-based system has a delivery subsystem that is responsive to the event notification to facilitate retrieval of the Web content at the time set by the scheduling subsystem. The delivery subsystem preferably has multiple delivery modules that enable delivery of the content over different types of distribution systems. For instance, the delivery subsystem might comprise a multicast listener to listen to a multicast address for the Web content, or a fetching program that goes out to the server and retrieves the Web content over the Internet, or a broadcast packet rebuilder that reconstructs Web content that is broadcast over a wireless network.
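The arrangement of multiple delivery modules might be sketched as a simple registry that routes each notification to the module for its distribution system. Everything named here is hypothetical, and the two modules are placeholders that return canned bytes instead of touching a network.

```python
from typing import Callable, Dict

# A delivery module takes a locator and returns the retrieved content.
DeliveryModule = Callable[[str], bytes]


class DeliverySubsystem:
    """Hypothetical dispatcher that picks a delivery module per transport type."""

    def __init__(self) -> None:
        self._modules: Dict[str, DeliveryModule] = {}

    def register(self, transport: str, module: DeliveryModule) -> None:
        self._modules[transport] = module

    def handle(self, transport: str, locator: str) -> bytes:
        try:
            module = self._modules[transport]
        except KeyError:
            raise ValueError(f"no delivery module for transport {transport!r}") from None
        return module(locator)


# Placeholder modules: stand-ins for a real fetching program and multicast listener.
def placeholder_fetcher(url: str) -> bytes:
    return f"fetched {url}".encode()


def placeholder_multicast_listener(address: str) -> bytes:
    return f"heard {address}".encode()
```

Registering a new module, for example a broadcast packet rebuilder, would not require changes to the code that raises event notifications, which is the main reason for the registry shape chosen in this sketch.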
In addition to the Web content or data itself, the delivery subsystem obtains an index to the Web content. The index summarizes the Web content to facilitate local search and find tasks. The index and Web content are stored in a cache at the client, preferably according to some unique identifier such as URLs.
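A client cache that stores the index alongside the content, keyed by URL, could look roughly like the following. The class and field names are assumptions, and the "index" is reduced to a one-line summary per URL for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CacheEntry:
    summary: str    # the index record summarizing the content
    content: bytes  # the Web content itself


class ClientCache:
    """Hypothetical in-memory cache keyed by URL."""

    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def store(self, url: str, summary: str, content: bytes) -> None:
        self._entries[url] = CacheEntry(summary, content)

    def index(self) -> List[Tuple[str, str]]:
        """Return (url, summary) pairs, sorted by URL, for presentation to the user."""
        return sorted((url, e.summary) for url, e in self._entries.items())

    def search(self, term: str) -> List[str]:
        """Find URLs whose summary mentions the term; works without a network."""
        needle = term.lower()
        return sorted(u for u, e in self._entries.items() if needle in e.summary.lower())

    def content(self, url: str) -> Optional[bytes]:
        entry = self._entries.get(url)
        return entry.content if entry else None
```

Because the search runs only against the locally stored summaries, it remains available when the client is disconnected from the server.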
The client-based system also has an indexing subsystem to retrieve the index from the cache and present the index to a user. The indexing subsystem supports a user interface, such as a graphical windowing UI, which enables the user to select from the index portions of the Web content stored in the cache.
According to an aspect of this invention, the user can create personal filters that filter the index to remove items not of interest. The filters can condense the index when it is received, prior to being cached, or when the user attempts to view the index.
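As a hedged illustration of personal filtering, the sketch below treats each filter as a predicate over index items. The item fields (`title`, `topic`) and the helper names are invented for the example; the same function could be called either when an index arrives or when the user views it.

```python
from typing import Callable, Dict, Iterable, List

IndexItem = Dict[str, str]            # hypothetical shape: {"title": ..., "topic": ...}
Filter = Callable[[IndexItem], bool]  # returns True to keep the item


def exclude_topic(topic: str) -> Filter:
    """Build a filter that drops every item filed under the given topic."""
    return lambda item: item.get("topic") != topic


def apply_filters(index: Iterable[IndexItem], filters: Iterable[Filter]) -> List[IndexItem]:
    """Keep only the items accepted by every filter."""
    active = list(filters)
    return [item for item in index if all(f(item) for f in active)]
```

Filtering on receipt keeps the cache small, while filtering at view time lets the user change preferences without refetching; the text allows either.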
According to another aspect of this invention, the user can continue to search and find the Web content using the index even though the client is offline from the server. The user is given essentially the same functionality as a live online session, except that requests to remote servers are temporarily accumulated for later submission. For example, the user may fill out an HTML (hypertext markup language) form and click a "submit" button to send the completed form back to the originating Web site. To the user, the clicking action appears to send the form back to the server. However, since the client is offline, the HTML form is kept in the cache until a later online session. When the client subsequently reconnects to the server, all accumulated data (i.e., requests, forms, etc.) that is destined for one or more remote servers is sent in batch to the appropriate servers.
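The accumulate-then-batch behavior described above can be sketched as an outbox. This is an assumed design, not the patented mechanism: the class name, the `send` callback signature, and the use of dictionaries for form data are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical transport callback: delivers a batch of payloads to one server.
Sender = Callable[[str, List[dict]], None]


class OfflineOutbox:
    """Holds requests made while offline and sends them in batches on reconnect."""

    def __init__(self, send: Sender) -> None:
        self._send = send
        self._pending: List[Tuple[str, dict]] = []
        self.online = False

    def submit(self, server: str, payload: dict) -> None:
        """Looks the same to the caller whether or not a connection exists."""
        if self.online:
            self._send(server, [payload])
        else:
            self._pending.append((server, payload))

    def reconnect(self) -> int:
        """Group pending payloads by destination and send one batch per server."""
        self.online = True
        batches: Dict[str, List[dict]] = defaultdict(list)
        for server, payload in self._pending:
            batches[server].append(payload)
        self._pending.clear()
        for server, payloads in batches.items():
            self._send(server, payloads)
        return len(batches)
```

From the user's point of view, `submit` behaves the same in both states; only the timing of the actual transmission differs.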
According to another aspect, the user can create his/her own channel. The client-based system enables the user to select preferred Web content that is delivered using different channels. For instance, the user might like to see all basketball-related content. Based on the user's selections, the system constructs a set of filtration rules and filters the different channels according to the filtration rules to aggregate the preferred Web content. In this manner, the system might extract basketball scores from one Web site, player statistics from another, and upcoming schedules from a third. The client-based system then presents the aggregated Web content as a new channel to a user, such as the "Basketball" channel.
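A simplified sketch of channel aggregation follows. It assumes, purely for illustration, that the filtration rules reduce to a list of keywords matched against item titles; the text does not specify the rule format, and the function and field names are hypothetical.

```python
from typing import Dict, List

Item = Dict[str, str]  # hypothetical shape: {"title": ..., "url": ...}


def build_channel(name: str, keywords: List[str], channels: Dict[str, List[Item]]) -> dict:
    """Collect matching items from every source channel into one new channel.

    The keyword match on titles is an assumed stand-in for the filtration rules.
    """
    wanted = [k.lower() for k in keywords]
    items = []
    for source, source_items in channels.items():
        for item in source_items:
            title = item["title"].lower()
            if any(k in title for k in wanted):
                # Record where each item came from so the user can trace it.
                items.append({**item, "source": source})
    return {"name": name, "items": items}
```

Given source channels for scores, statistics, and schedules, a call such as `build_channel("Basketball", ["basketball"], channels)` would return a single channel holding only the matching items from all three.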
In one implementation, the client-based system is built into a Web browser. The browser may be integrated into the operating system, or run as a separate application.