The present invention relates to retrieval and viewing of computer-stored documents, and in particular to automated assistance in browsing stored resources, such as those available on the Internet.
The Internet is a worldwide "network of networks" that links millions of computers through tens of thousands of separate (but intercommunicating) networks. Via the Internet, users can access tremendous amounts of stored information and establish communication linkages to other Internet-based computers. Much of the Internet is based on the "client-server" model of information exchange. This computer architecture, developed specifically to accommodate the "distributed computing" environment that characterizes the Internet and its component networks, contemplates a server (sometimes called the host), typically a powerful computer or cluster of computers that behaves as a single computer, which services the requests of a large number of smaller computers, or clients, which connect to it. The client computers usually communicate with a single server at any one time, although they can communicate with one another via the server or can use a server to reach other servers. A server is typically a large mainframe or minicomputer cluster, while the clients may be simple personal computers. Servers providing Internet access to multiple subscriber clients are referred to as "gateways"; more generally, a gateway is a computer system that connects two computer networks.
In order to ensure proper routing of messages between the server and the intended client, the messages are first broken up into data packets, each of which receives a destination address according to a consistent protocol, and which are reassembled upon receipt by the target computer. A commonly accepted set of protocols for this purpose comprises the Internet Protocol, or IP, which dictates routing information; and the Transmission Control Protocol, or TCP, according to which messages are actually broken up into IP packets for transmission and subsequent collection and reassembly. TCP/IP connections are quite commonly employed to move data across telephone lines.
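The packetization and reassembly just described can be illustrated with a minimal sketch. It is not an implementation of TCP or IP; the dictionary fields ("dest", "seq", "payload"), the function names, and the four-byte packet size are assumptions chosen only for illustration.

```python
def packetize(message: bytes, destination: str, size: int = 4):
    """Break a message into fixed-size packets, each carrying a destination
    address and a sequence number (illustrative fields, not real headers)."""
    return [
        {"dest": destination, "seq": index, "payload": message[offset:offset + size]}
        for index, offset in enumerate(range(0, len(message), size))
    ]


def reassemble(packets):
    """Restore the original message, even if packets arrived out of order."""
    ordered = sorted(packets, key=lambda packet: packet["seq"])
    return b"".join(packet["payload"] for packet in ordered)


packets = packetize(b"hello world", "10.0.0.2")
# Simulate out-of-order arrival by reversing the packet list.
print(reassemble(list(reversed(packets))))
```

The sequence number is what allows the receiving computer to reassemble packets that have travelled by different routes and arrived in a different order.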
The Internet supports a large variety of information-transfer protocols. One of these, the World Wide Web (hereafter, simply, the "web"), has recently skyrocketed in importance and popularity; indeed, to many, the Internet is synonymous with the web. Web-accessible information is identified by a uniform resource locator or "URL," which specifies the location of the file in terms of a specific computer and a location on that computer. Any Internet "node," that is, a computer with an IP address (e.g., a server permanently and continuously connected to the Internet, or a client that has connected to a server and received a temporary IP address), can access the file by invoking the proper communication protocol and specifying the URL. Typically, a URL has the format http://<host>/<path>, where "http" refers to the HyperText Transfer Protocol, "host" is the server's Internet identifier, and the "path" specifies the location of the file within the server. Each "web site" can make available one or more web "pages" or documents, which are formatted, tree-structured repositories of information, such as text, images, sounds and animations.
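By way of illustration, the three parts of a URL named above can be separated with the Python standard library. The helper name split_url and the example address are assumptions made for this sketch.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return the (protocol, host, path) parts of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# Hypothetical address used only for illustration.
protocol, host, path = split_url("http://www.example.com/docs/index.html")
print(protocol, host, path)
```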
An important feature of the web is the ability to connect one file to many other files using "hypertext" links. A link appears unobtrusively as an underlined portion of text in a document; when the viewer of this document moves the cursor over the underlined text and clicks, the link (which is otherwise invisible to the user) is executed and the linked file retrieved. That file need not be located on the same server as the original file.
Hypertext and searching functionality on the web is typically implemented on the client machine, using a computer program called a "web browser." With the client connected as an Internet node, the browser utilizes URLs (provided either by the user or a link) to locate, fetch and display the specified files. "Display" in this sense can range from simple pictorial and textual rendering to real-time playing of audio and/or video segments. The browser passes the URL to a protocol handler on the associated server, which then retrieves the information and sends it to the browser for display; the browser causes the information to be cached (usually on a hard disk) on the client machine and displayed. The web page itself contains information specifying the specific Internet transfer routine necessary for its retrieval. Thus, clients at various locations can view web pages by downloading replicas of the web pages, via browsers, from servers on which these web pages are stored. Browsers also allow users to download and store the displayed data locally on the client machine.
Most web pages are written in HyperText Markup Language, or HTML, which breaks the document into syntactic portions (such as headings, paragraphs, lists, etc.) that specify layout and contents. An HTML file can contain elements such as text, graphics, tables and buttons, each identified by a "tag." Web browsers utilize HTML interpreters that execute these instructions to display the page.
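As a minimal sketch of how tagged elements may be interpreted, the following uses the Python standard-library HTML parser to collect the link targets carried by anchor ("a") tags. The class and function names and the sample markup are assumptions made for illustration; an actual browser's interpreter does considerably more.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href target of every anchor tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<h1>Title</h1><p>See <a href="http://a.example/x.html">x</a> and <a href="/y.html">y</a>.</p>'
print(extract_links(sample))
```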
The number of files accessible just on the web is enormous and constantly growing. As a result, attempting to locate and navigate among documents of interest within the huge space of available files is generally a haphazard process. Certainly the presence of hyperlinks assists the user by identifying files related to the one currently under scrutiny. Any given file probably features several hyperlinks, however, and execution of any of these links typically draws a new web file with hyperlinks of its own. And of course hyperlinks are included at the discretion of a web-document author; they do not, nor are they intended to, provide an exhaustive catalog of web pages containing related information.
In a typical session a user, operating a web browser on a client machine, locates his or her first web page either through prior knowledge of its URL, or using a "search engine" or "web crawler" that locates pages of possible interest based on user-specified key words. Publicly accessible search engines such as ALTA VISTA, YAHOO! and LYCOS process the user's search query and return a list of candidate web pages containing the query, any of which can be readily retrieved and viewed by the user through execution of its associated hyperlink. The user scans through the list of candidate web pages, clicking on entries of possible interest, examining these, and possibly executing hyperlinks associated with some of the retrieved documents. The totality of web pages the user may examine in this fashion forms a tree structure, with the candidate pages returned by the search engine constituting the roots. The user's examination can proceed "depthwise" from a root along an arbitrary path of pages linked by a sequence of hyperlinks, or can proceed "breadthwise" at a given hierarchical level through examination of all hyperlinks associated with a given page; generally, a user's session involves both depthwise and breadthwise searching without any advance strategy. Search engines may assist the user by providing a questionnaire, responses to which help focus the search based on explicitly stated user preferences. Such "conversational" tools, however, intrude on the user's browsing activities.
The process of searching, even with automated assistance, is by no means assured to locate the most relevant web pages, due both to the combinatorial expansion of the search space (i.e., the number of hyperlink-accessible pages) with increasing depth, and the difficulty of assessing, merely from its hyperlink designation, the potential usefulness of another web page or the likelihood that another page will contain further useful hyperlinks. The user's time constraints and interest level generally operate to limit the search to a few sites chosen with little information.
The problem is not confined to the Internet. For example, the concept of dividing functionality between a client-based browser and server-based web pages (where the browser locates, fetches and displays resources, executes hyperlinks, and generally interprets web-page information, while the web page contains data, hyperlink addresses, transfer protocols and computer instructions defining "potential functionality" that may be executed by the browser) can be replicated on internal networks as well. These networks, sometimes called "intranets," support the TCP/IP communication protocol and typically serve the needs of a single business (or business department), which may be located at a single site (with individual clients connected by a simple local-area network) or multiple physically dispersed sites requiring a wide-area network but not access to the Internet. Various of the computers forming the intranet network can be utilized as servers for web pages, each with its own URL and offering access to network client computers via TCP/IP. Even more generally, the user may have access to a large database of HTML documents resident on a single machine, using a browser to search through them. In any of these circumstances, the user can face similar difficulties searching among documents.
The present invention operates in tandem with a conventional document-retrieval facility, such as a web browser, by tracking the choices made by the user in retrieving and viewing items (such as web pages), i.e., which links are followed, when searches are initiated, requests for help, etc., and, based thereon, identifying additional items likely to be of interest to the user. In other words, the invention browses the same search space as the user, but faster and guided by the user's past behavior. Preferably, the invention operates autonomously, without interruption of the user's activities or explicit requests for stated preferences, providing an "observational," rather than conversational, mode of assistance. The user receives (in real-time or upon request) a set of current recommendations that take the form of (or include) links to the recommended items, and the user is free to execute any of these links to examine the contents of a recommendation. As used herein, the term "web page" connotes not only items available specifically on the World Wide Web, but instead broadly refers to any items viewable on a browser (or other document-viewing facility) and which may contain links specifying other items and executable by the browser to access such items. Thus, the web page may exist on an intranet or even on a single computer, and need not explicitly utilize the Internet protocol.
To facilitate operation without disruption of the user's viewing activities, the invention preferably functions in a "background" sense, observing the user's browsing activity and generating preference criteria in accordance therewith, then autonomously examining documents, in parallel with the user's browsing, to identify ones consistent with the preference criteria. These criteria are desirably developed at more than one level. At the item level, the importance of a particular item viewed by the user can be assessed by noting the length of time (relative to the length of the document) the user spends reading the item, the number of hyperlinks in the item that the user executes, whether the user has returned to the item following other browsing, or whether the user has accorded some special status to the item (e.g., by storing it in a "hot list" of preferred documents for ready access). At the content level, the importance of particular aspects of an item is assessed by textual analysis to identify key preference terms, that is, words or phrases that have particular relevance to subject matter of interest to the user, as demonstrated, for example, by recurrence in different items accessed by the user.
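The two levels of preference criteria may be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring actually used: the ViewRecord fields, the weights (0.5, 1.0, 2.0), the small stop-word list, and the rule that a term must recur in at least two viewed items are all hypothetical choices.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ViewRecord:
    text: str                  # content of the viewed item
    dwell_seconds: float       # time the user spent on the item
    links_followed: int = 0    # hyperlinks in the item the user executed
    revisited: bool = False    # user returned after other browsing
    on_hot_list: bool = False  # user stored the item for ready access


def item_importance(view):
    """Item-level score; the weights are illustrative assumptions."""
    word_count = max(len(view.text.split()), 1)
    score = view.dwell_seconds / word_count  # dwell time relative to length
    score += 0.5 * view.links_followed
    score += 1.0 if view.revisited else 0.0
    score += 2.0 if view.on_hot_list else 0.0
    return score


STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def key_preference_terms(views, min_items=2):
    """Content-level criteria: words recurring in several viewed items."""
    item_frequency = Counter()
    for view in views:
        terms = set(re.findall(r"[a-z]+", view.text.lower())) - STOP_WORDS
        item_frequency.update(terms)
    return {term for term, count in item_frequency.items() if count >= min_items}


views = [
    ViewRecord("Sailing boats and rigging", dwell_seconds=30),
    ViewRecord("Rigging a sailing dinghy", dwell_seconds=10, on_hot_list=True),
    ViewRecord("Cooking pasta", dwell_seconds=5),
]
print(sorted(key_preference_terms(views)))
print(item_importance(views[1]))
```

Counting each term once per item, rather than once per occurrence, reflects the recurrence "in different items" mentioned above; a long document that repeats a word many times does not by itself make that word a preference term.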
As the user peruses a retrieved item, the invention utilizes this "idle time" to perform a search for other items of interest based on the preference criteria. The invention sequentially retrieves items the way a browser would, but for purposes of analysis rather than display. Most generally, the invention performs a "breadth-first" search from the item currently being viewed by the user or from a previously viewed (or otherwise located) item deemed of greater importance, following each of the links specified in that item and searching their contents for matches to key preference terms. In other words, the invention first examines all items at the same hierarchical tree level before proceeding to the next level. A "best-first" refinement of the breadth-first search proceeds by ordering the possibilities by likelihood, that is, using the preference criteria to rank the items at a given hierarchical level, and examining the items in the ranked order.
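A level-by-level search with the best-first refinement may be sketched as below. The sketch works on an in-memory table of pages instead of retrieving anything over a network, and it ranks the links of a level by how well their link text matches the key preference terms before the linked items are examined; that ranking signal, like the names Page, term_score and search, is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    text: str
    links: list = field(default_factory=list)  # (link_text, target_url) pairs


def term_score(text, terms):
    """Count occurrences of the key preference terms in a piece of text."""
    words = text.lower().split()
    return sum(words.count(term) for term in terms)


def search(pages, start, terms, max_depth=2, best_first=True):
    """Examine every item on one level before descending to the next.

    Returns (url, content_score) pairs in the order the items were examined.
    """
    visited = {start}
    level = [start]
    examined = []
    for _ in range(max_depth):
        # Gather all not-yet-seen links found on the current level.
        candidates = []
        for url in level:
            for link_text, target in pages[url].links:
                if target in pages and target not in visited:
                    visited.add(target)
                    candidates.append((link_text, target))
        if best_first:
            # Rank the level by likelihood before examining its items.
            candidates.sort(key=lambda c: term_score(c[0], terms), reverse=True)
        level = [target for _, target in candidates]
        for target in level:
            examined.append((target, term_score(pages[target].text, terms)))
    return examined


pages = {
    "home": Page("welcome", [("cooking tips", "cook"), ("sailing gear", "gear")]),
    "cook": Page("pasta recipes", [("sailing holidays", "trip")]),
    "gear": Page("sailing rigging and sailing knots"),
    "trip": Page("sailing trip"),
}
print(search(pages, "home", {"sailing", "rigging"}))
```

With best_first set to False the same function performs a plain breadth-first search, examining the items of each level in the order their links appear.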
This approach is especially useful in the preferred search mode, which is time-constrained, changing focus when the user jumps to a new page (whether or not the previous or new page is relevant to the search). In this way, the invention utilizes the occasion of the user's jump to update preference criteria (based, e.g., on the jump itself and elapsed dwell time on the previous page) and begin the search anew. The breadth-first search, it has been found, is more efficient in locating items matching the preference criteria than, for example, a "depth-first" search that proceeds down an arbitrary path of pages linked by a linear sequence of hyperlinks.
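The time-constrained mode may be sketched as a breadth-first search that stops at a deadline and is begun anew on each jump. In this sketch the time limit is a fixed budget in seconds, the link structure is an in-memory table, and the names timed_breadth_first and Assistant are hypothetical; the update of preference criteria is reduced to recording dwell time on the page just left.

```python
import time
from collections import deque


def timed_breadth_first(links, start, budget_seconds, clock=time.monotonic):
    """Breadth-first search that stops expanding pages once the budget is spent.

    links maps each url to the list of urls it points to.
    Returns the urls examined, in order.
    """
    deadline = clock() + budget_seconds
    visited = {start}
    queue = deque([start])
    examined = []
    while queue and clock() < deadline:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in visited:
                visited.add(target)
                examined.append(target)
                queue.append(target)
    return examined


class Assistant:
    """Restarts the search from the new page each time the user jumps."""

    def __init__(self, links):
        self.links = links
        self.dwell = {}      # url -> accumulated dwell time in seconds
        self.current = None

    def on_jump(self, new_url, dwell_on_previous, budget_seconds=0.05):
        # Update the criteria from the jump itself and the elapsed dwell time.
        if self.current is not None:
            self.dwell[self.current] = self.dwell.get(self.current, 0.0) + dwell_on_previous
        self.current = new_url
        # Begin the search anew from the page the user has jumped to.
        return timed_breadth_first(self.links, new_url, budget_seconds)


links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
assistant = Assistant(links)
print(assistant.on_jump("a", 0.0, budget_seconds=60.0))
print(assistant.on_jump("b", 12.5, budget_seconds=60.0))
```

Because a breadth-first search covers the nearest pages first, cutting it off at the deadline still leaves every examined item within a few links of the user's current page, whereas a depth-first search interrupted at the same moment may have spent its whole budget on one distant path.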
Reporting of recommendations based on items found in accordance with user preferences can occur in various ways. Most typically, the list appears at all times in a screen window; alternatively, it can remain hidden until the user expressly requests its display.