The invention relates to the field of information searching and browsing, and more particularly to a system and method for enhancing searches and recommending documents in a collection through the use of bookmarks shared among a community of users.
Computer users increasingly find document collections difficult to navigate because of the growing size of such collections. For example, the World Wide Web on the Internet includes millions of individual pages. Moreover, large companies' internal Intranets often include repositories filled with many thousands of documents.
It is frequently true that the documents on the Web and in Intranet repositories are not very well indexed. Consequently, finding desired information in such a large collection, unless the identity, location, or characteristics of a specific document are well known, can be much like looking for a needle in a haystack.
The World Wide Web is a loosely interlinked collection of documents (mostly text and images) located on servers distributed over the Internet. Generally speaking, each document has an address, or Uniform Resource Locator (URL), in the exemplary form "http://www.server.net/directory/file.html". In that notation, the "http:" specifies the protocol by which the document is to be delivered, in this case the "HyperText Transfer Protocol." The "www.server.net" specifies the name of a computer, or server, on which the document resides; "directory" refers to a directory or folder on the server in which the document resides; and "file.html" specifies the name of the file.
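By way of illustration only, the decomposition of a URL into the parts named above can be sketched with Python's standard `urllib.parse` module. The function name `describe_url` and the keys of the returned dictionary are assumptions chosen for this example; they are not part of the described system.

```python
from urllib.parse import urlsplit
import posixpath


def describe_url(url):
    """Split a URL into the protocol, server, directory, and file parts."""
    parts = urlsplit(url)
    # The path "/directory/file.html" divides into a directory and a file name.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "server": parts.hostname,
        "directory": directory.strip("/"),
        "file": filename,
    }


info = describe_url("http://www.server.net/directory/file.html")
print(info)
```

For the exemplary address, the sketch reports the protocol as `http`, the server as `www.server.net`, the directory as `directory`, and the file as `file.html`.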
Most documents on the Web are in HTML (HyperText Markup Language) format, which allows for formatting to be applied to the document, external content (such as images and other multimedia data types) to be introduced within the document, and "hotlinks" or "links" to other documents to be placed within the document, among other things. "Hotlinking" allows a user to navigate between documents on the Web simply by selecting an item of interest within a page. For example, a Web page about reprographic technology might have a hotlink to the Xerox corporate web site. By selecting the hotlink (often by clicking a marked word, image, or area with a pointing device, such as a mouse), the user's Web browser is instructed to follow the hotlink (usually via a URL, frequently invisible to the user, associated with the hotlink) and read a different document.
Obviously, a user cannot be expected to remember a URL for each and every document on the Internet, or even those documents in a smaller collection of preferred documents. Accordingly, navigation assistance is not only helpful but necessary.
Modern Web browsers (software applications used to view and navigate documents on the Web) have introduced the concept of "bookmarks" or "favorites" (collectively referred to as "bookmarks" in this document). Bookmarks allow a user to identify which documents he would like to keep track of. The user's local machine then keeps track of the URLs for those sites, allowing the user to reload and view the sites' contents at any desired time. Bookmarks can be thought of as "pointers" to content on the Web, each specifying an address that identifies the location of the desired document, but not including the document's content (except, perhaps, in a descriptive title of the document).
In current versions of Netscape Navigator (specifically, at least versions 3.x and 4.x), a user's bookmarks are stored and maintained in a special HTML file stored on the user's local machine. This file includes a list of sites represented as title and URL pairs (in a user-defined hierarchy, if desired). The user's entire set of bookmarks is contained within a single HTML file.
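A single HTML bookmark file of this kind can be read back into title and URL pairs with a short parser. The following is a minimal sketch using Python's standard `html.parser` module. It assumes a layout in which each bookmark is an `<A HREF=...>` element and each folder is an `<H3>` heading followed by a nested `<DL>` list; that layout, the `SAMPLE` text, and the names `BookmarkParser` and `parse_bookmarks` are assumptions made for illustration and are not taken from the description above.

```python
from html.parser import HTMLParser

# Hypothetical bookmark file content, assumed for this sketch.
SAMPLE = """<DL><p>
<DT><A HREF="http://www.server.net/">Example server</A>
<DT><H3>Research</H3>
<DL><p>
<DT><A HREF="http://www.server.net/directory/file.html">Example file</A>
</DL><p>
</DL><p>
"""


class BookmarkParser(HTMLParser):
    """Collect (folder path, title, URL) triples from a bookmark file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._stack = []             # one entry per open <DL>; None at top level
        self._pending_folder = None  # folder name waiting for its <DL>
        self._href = None
        self._capturing = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._capturing, self._text = True, []
        elif tag == "h3":
            self._capturing, self._text = True, []
        elif tag == "dl":
            self._stack.append(self._pending_folder)
            self._pending_folder = None

    def handle_data(self, data):
        if self._capturing:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            folder = tuple(name for name in self._stack if name is not None)
            self.bookmarks.append((folder, "".join(self._text).strip(), self._href))
            self._capturing = False
        elif tag == "h3" and self._capturing:
            self._pending_folder = "".join(self._text).strip()
            self._capturing = False
        elif tag == "dl" and self._stack:
            self._stack.pop()


def parse_bookmarks(html_text):
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks


for folder, title, url in parse_bookmarks(SAMPLE):
    print("/".join(folder) or "(top level)", "|", title, "|", url)
```

The folder path is kept alongside each pair so that the user-defined hierarchy survives the extraction.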
Recent versions of Microsoft's Internet Explorer (at least versions 3.x-5.x) store user bookmarks (or "favorites," using Microsoft's preferred terminology) as individual files on the local machine's file system. Each favorite is a small file containing the site's URL, while the favorite's title is stored as the filename.
Other browsers' bookmarks are frequently stored as entries in a custom configuration file, in which each site's title is paired with a URL.
None of the foregoing browsers permit much sophisticated use of a user's collection of bookmarks, although some limited manipulations are possible. For example, it is usually possible to create and modify a hierarchy of bookmarks (including sorting and moving existing bookmarks around within the hierarchy); to modify the titles paired with the URLs; to search for words within the titles or URLs; and often to derive some additional information about the bookmarks, such as the date and time of the user's most recent visit to the site, the total number of visits, and possibly other information.
In typical use, the bookmark facilities of Web browsers act as a "filter" for those documents a particular user finds to be important or useful. While a user might view hundreds of Web pages in a day, only a few of those are typically found to provide useful information. If that information is expected to be useful again in the future, the user will often set a bookmark for those pages. This is a useful way for users to be able to access the Internet; however, traditional bookmarks have the distinct limitation that they are only useful to the extent a user has seen the sites before, since adding a bookmark to a collection is a manual act, typically performed when either the desired page is already being viewed or a URL has been manually received from another person.
Most notably, the known traditional bookmark systems are single-user. Of course (particularly with Netscape Navigator, in which bookmarks already exist in an HTML file), bookmarks can be exported to a public web page, allowing others to view and use the bookmarks, but that in itself does not provide any additional functionality.
Accordingly, when a user desires to find information on the Internet (or other large network) that is not already represented in the user's bookmark collection, the user will frequently turn to a "search engine" to locate the information. A search engine serves as an index into the content stored on the Internet.
There are two primary categories of search engines: those that include documents and Web sites that are analyzed and used to populate a hierarchy of subject-matter categories (e.g., Yahoo), and those that "crawl" the Web or document collections to build a searchable database of terms, allowing keyword searches on page content (such as AltaVista, Excite, and Infoseek, among many others).
Also known are recommendation systems, which are capable of providing Web site recommendations based on criteria provided by a user or by comparison to a single preferred document (e.g., Firefly, Excite's "more like this" feature).
"Google" (www.google.com) is an example of a search engine that incorporates several recommendation-system-like features. It operates in a similar manner to traditional keyword-based search engines, in that a search begins by the user's entry of one or more search terms used in a pattern-matching analysis of documents on the Web. It differs from traditional keyword-based search engines (such as AltaVista), in that search results are ranked based on a metric of page "importance," which differs from the number of occurrences of the desired search terms (and simple variations upon that theme).
Google's metric of importance is based upon two primary factors: the number of pages (elsewhere on the Web) that link to a page (i.e., "inlinks," defining the retrieved page as an "authority"), and the number of pages that the retrieved page links to (i.e., "outlinks," defining the retrieved page as a "hub"). A page's inlinks and outlinks are weighted, based on the Google-determined importance of the linked pages, resulting in an importance score for each retrieved page. The search results are presented in order of decreasing score, with the most important pages presented first. It should be noted that Google's page importance metric is based on the pattern of links on the Web as a whole, and is not limited (and at this time cannot be limited) to the preferences of a single user or group of users.
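The general idea of weighting inlinks and outlinks by the importance of the linked pages can be illustrated with a simple mutual-reinforcement iteration over a link graph. The following is a minimal sketch only: it is not represented to be the algorithm actually used by Google or any other service, and the function name `hub_authority_scores`, the fixed iteration count, and the four-page graph are assumptions made for the example.

```python
import math


def hub_authority_scores(links, iterations=20):
    """Score pages as hubs (good outlinks) and authorities (good inlinks).

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {page: sum(auth[t] for t in links.get(page, ())) for page in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm

    return hub, auth


# Hypothetical link graph: pages "a" and "b" both link to "c"; "c" links to "d".
hub, auth = hub_authority_scores({"a": ["c"], "b": ["c"], "c": ["d"]})
ranked = sorted(auth, key=auth.get, reverse=True)
print(ranked[0])
```

In this small graph, page "c" receives the highest authority score because two pages link to it, and pages "a" and "b" receive equal hub scores because each links to that same authority.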
Another recent non-traditional search engine is IBM's CLEVER (CLient-side EigenVector Enhanced Retrieval) system. CLEVER, like Google, operates like a traditional search engine, and uses inlinks/authorities and outlinks/hubs as metrics of page importance. Again, importance (based on links throughout the Web) is used to rank search results. Unlike Google, CLEVER uses page content (e.g., the words surrounding inlinks and outlinks) to attempt to classify a page's subject matter. Also, CLEVER does not use its own database of Web content; rather, it uses an external hub, such as an index built by another search engine, to define initial communities of documents on the Web. From hubs on the Web that frequently represent people's interests, CLEVER is able to identify communities, and from those communities, identify related or important pages.
Direct Hit is a service that cooperates with traditional search engines (such as HotBot), attempting to determine which pages returned in a batch of results are interesting or important, as perceived by users who have previously performed similar searches. Direct Hit tracks which pages in a list of search results are accessed most frequently; it is also able to track the amount of time users spend at the linked sites before returning to the search results. The most popular sites are promoted (i.e., given higher scores) for future searches.
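Popularity-based promotion of this kind can be illustrated by counting, for each query, how often each result is selected and then reordering later results accordingly. The sketch below is an assumption-laden illustration rather than a description of how Direct Hit is implemented: the class name `ClickTracker`, its methods, and the sample data are hypothetical, and time spent at a linked site is not modeled.

```python
from collections import Counter


class ClickTracker:
    """Promote results that earlier users selected for the same query."""

    def __init__(self):
        self.clicks = Counter()  # (query, url) -> number of selections

    def record_click(self, query, url):
        self.clicks[(query, url)] += 1

    def rerank(self, query, results):
        # sorted() is stable, so results with equal counts keep their
        # original search-engine order.
        return sorted(results, key=lambda url: -self.clicks[(query, url)])


tracker = ClickTracker()
tracker.record_click("copiers", "http://www.server.net/y.html")
tracker.record_click("copiers", "http://www.server.net/y.html")
tracker.record_click("copiers", "http://www.server.net/z.html")

original = [
    "http://www.server.net/x.html",
    "http://www.server.net/y.html",
    "http://www.server.net/z.html",
]
promoted = tracker.rerank("copiers", original)
print(promoted)
```

Keying the counts on the query as well as the URL keeps the popularity of a page for one search from influencing its position for an unrelated search.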
Alexa is a system that is capable of tracking a user's actions while browsing. By doing so, Alexa maintains a database of users' browsing histories. Page importance is derived from other users' browsing histories. Accordingly, at any point (not just in the context of a search), Alexa can provide a user with information on related pages, derived from overall traffic patterns, link structures, page content, and editorial suggestions.
Knowledge Pump, a Xerox system, provides community-based recommendations by initially allowing users to identify their interests and "experts" in the areas of those interests. Knowledge Pump is then able to "push" relevant information to the users based on those preferences; this is accomplished by monitoring network traffic to create profiles of users, including their interests and "communities of practice," thereby refining the community specifications. However, Knowledge Pump does not presently perform any enhanced search and retrieval actions like the search-engine-based systems described above.
While the foregoing systems and services blend traditional search engine and recommendation system capabilities to some degree, it should be recognized that none of them are presently adaptable to provide search-engine-like capabilities while taking into account the preferences of a smaller group than the Internet as a whole. In particular, it would be beneficial to be able to incorporate community-based recommendations into a system that is capable of retrieving previously unknown documents from the Internet.
The present system and method facilitate searching and recommending resources, or documents, based upon a collection of user document preferences shared by a large group of users. The invention leverages several of the key properties of document collections: only valuable documents are bookmarked; documents are usually categorized into a hierarchy; and documents can be shared. In a preferred embodiment, the present system combines some attributes of bookmark systems, as discussed above, with some attributes of search engines and recommendation systems, also discussed above.
The present system and method maintain a centralized database of bookmarks or user document preferences. This centralized database is maintained as a hierarchy, with individual users' bookmarks maintained separately from other users' bookmarks. However, the maintenance of the centralized database facilitates harnessing the power and flexibility of being able to use, in various ways, all users' public bookmarks and the information contained in and referenced by those bookmarks.
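One possible in-memory shape for such a centralized database is sketched below, purely as an illustration. Each user's bookmarks are held separately, each bookmark carries its position in that user's hierarchy, and a query across all users considers only public bookmarks. The names `Bookmark`, `BookmarkStore`, `users_sharing`, and the `public` flag are assumptions introduced for the example and do not describe the actual storage used by the system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Bookmark:
    url: str
    title: str
    folder: tuple = ()    # position in the owner's hierarchy
    public: bool = True   # assumed flag: visible to other users


class BookmarkStore:
    """Central store that keeps each user's bookmarks separate."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def add(self, user, bookmark):
        self._by_user[user].append(bookmark)

    def bookmarks_of(self, user):
        return list(self._by_user.get(user, []))

    def users_sharing(self, url):
        """Return the users who hold a public bookmark for the given URL."""
        return sorted(
            user
            for user, marks in self._by_user.items()
            if any(mark.public and mark.url == url for mark in marks)
        )


store = BookmarkStore()
page = "http://www.server.net/directory/file.html"
store.add("alice", Bookmark(page, "Example file", ("Research",)))
store.add("bob", Bookmark(page, "Useful page"))
store.add("carol", Bookmark(page, "Private note", public=False))
print(store.users_sharing(page))
```

Because the bookmarks remain grouped by owner, a single user's collection can still be retrieved on its own, while cross-user queries such as `users_sharing` draw on every user's public entries.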
The system and method of the present invention allow for several operations to be performed, including enhanced search and retrieval, enhanced subject-matter-based recommendation generation (for both documents and groups), and automatic document categorization and summarization.