The Internet, of which the World Wide Web is a part, consists of a series of interlinked computer networks and servers around the world. Users of one server or network which is connected to the Internet may send information to, or access information on, any other network or server connected to the Internet by the use of various computer programs which allow such access, such as Web browsers. The information is sent to or received from a network or server in the form of packets of data.
The World Wide Web portion of the Internet consists of a subset of interconnected Internet sites which are characterized by containing information in a format suitable for graphical display on a computer screen. Each site may consist of one or more separate pages. Pages in turn frequently contain links to other pages within the site, or to pages in other Web sites, facilitating the user's rapid movement from one page or site to another.
The Web is so large that users frequently call upon specialized programs such as Web browsers or search engines to help them locate information of interest on the Web. These specialized programs may analyze information about Web sites in a variety of ways, select a set of Web addresses that are expected to meet the user's criteria, and present this list, often in rank order, to the user. Or the specialized program may directly connect the user to the address selected as meeting the user's criteria.
Sites on the Web are becoming larger as companies increasingly use the Web to facilitate large-scale electronic commerce and for other purposes.
Some sites which make themselves available to users, and to which users are directed by Web browsers or search engines, may be very large, and may have available several types of information. For example, an electronic commerce or shopping site may have material such as consumer guides, electronic yellow pages, and the like. It may be useful, when a user is accessing one portion of such a site, or when the user requests information of one type, also to be able to provide related information of another type.
Some prior methods of identifying and retrieving related information maintained in the Web site have not proven fully satisfactory. For example, it may be possible to manually associate items of data in a Web site with key words, such that when a user accesses or requests information of one type, such as consumer guide material, information of another type with the same key word or words is also returned or otherwise made available to the user. However, methods requiring the manual assignment of key words may be difficult to implement and update, particularly if the underlying information is frequently changing, as may be the case in a dynamic field such as electronic commerce. It therefore may be useful to have available a method of automatically identifying related information of another type when a user accesses one portion of a Web site, or when a user inquiry is received that seeks information of one type.
A limitation of current methodology that may restrict the ability to provide such related information is the difficulty of maintaining information about the content of a site on the Web in a form that is convenient and quick to use. An efficient specialized program for generating lists of useful material in response to user inquiries may utilize information about a Web site that is stored in databases accessible to the specialized program.
Inverted term lists are frequently utilized to store information about Web pages or sites in a database. An inverted term list may be prepared for each term present in the collection of material being analyzed. (Hereinafter, for simplicity, “document” may be used to refer to the items, such as pages or other portions of Web sites, in the Web site being analyzed. A “term” may be any word, number, acronym, abbreviation or other collection of letters, numbers and symbols which may be found in a fixed order in a document.) Alternatively, lists may be prepared for all terms except certain common words, referred to as stop words, such as “the” or “and”. Alternatively, lists may be prepared only for a specialized subset of terms of special interest, such as technical terms in a particular field, or names.
An inverted term list for a term may contain information about the overall occurrence of that term in the Web site being analyzed. The information which may be maintained in an inverted term list for a given term may include information such as the total number of documents in the Web site in which that term occurs, the total number of occurrences of that term in the documents, and the maximum number of occurrences of that term in any single document, among other things. (Alternatively, some or all of this information may be stored in a lookup table which also contains the address of the inverted term list for the term in question.)
An inverted term list also will include information about the occurrence of that term in particular documents. For each document in which that term occurs, the inverted term list may include information about the location of the document in the Web site, or a reference to a lookup table where such information is stored. The inverted term list may also include the number of occurrences of that term in that document. In addition, the inverted term list may include other information about the occurrences of that term in that document, such as the locations in that document of its occurrences.
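The structure described above can be illustrated with a minimal sketch, assuming whitespace-separated lowercase terms, a two-word stop list, and document paths used directly as identifiers. The names `InvertedTermList`, `build_index`, and the sample paths are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Stop words named in the text; a practical list would be longer (assumption).
STOP_WORDS = {"the", "and"}


@dataclass
class InvertedTermList:
    # Site-wide statistics for one term.
    doc_count: int = 0            # documents in which the term occurs
    total_occurrences: int = 0    # occurrences across all documents
    max_in_one_doc: int = 0       # most occurrences in any single document
    # Per-document information: document location -> positions of the term.
    postings: dict = field(default_factory=dict)


def build_index(documents):
    """Build one inverted term list per non-stop-word term."""
    index = defaultdict(InvertedTermList)
    for doc_id, text in documents.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            if term not in STOP_WORDS:
                positions[term].append(pos)
        for term, where in positions.items():
            entry = index[term]
            entry.doc_count += 1
            entry.total_occurrences += len(where)
            entry.max_in_one_doc = max(entry.max_in_one_doc, len(where))
            entry.postings[doc_id] = where
    return dict(index)


# Hypothetical two-page site.
docs = {
    "/guides/cameras": "digital cameras and camera lenses",
    "/yellow/photo": "camera repair and camera sales",
}
index = build_index(docs)
print(index["camera"])
```

In this sketch the site-wide statistics and the per-document entries live in the same object; as the text notes, the statistics could equally be kept in a separate lookup table.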
An inverted term list may be stored in the form of a linked list or as an array. In a linked list, there may be a header containing the general information that is not specific to a particular document, such as but not limited to the number of occurrences of the term in the Web site, if that information is not maintained in the lookup table. In the linked list there may also be one link for each document in which the term appears. In this arrangement, each link in an inverted term list will contain the location of a document in which that term appears, together with such information about the occurrence of that term in that document as is being maintained, and the address of the next link in the inverted term list. (To save storage space, rather than containing the URL of a document, the inverted term list may contain the address in a lookup table at which the URL is stored. To further save storage space, the inverted term list may store that lookup table address relative to the lookup table address of the prior document in the inverted term list, rather than as an absolute address.)
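The linked-list arrangement with relative lookup table addresses may be sketched as follows. This is an illustrative sketch only: the class names `Header` and `Link`, the functions `build_list` and `walk`, and the sample URL table are assumptions, and the first link's address is taken to be relative to address zero.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Link:
    address_delta: int            # lookup table address relative to the prior link
    occurrences: int              # occurrences of the term in this document
    next: Optional["Link"] = None


@dataclass
class Header:
    total_occurrences: int        # information not specific to any one document
    first: Optional[Link] = None


def build_list(postings) -> Header:
    """postings: iterable of (lookup_table_address, occurrences) pairs."""
    header = Header(total_occurrences=0)
    tail = None
    previous_address = 0          # assumption: first delta is measured from 0
    for address, occurrences in sorted(postings):
        link = Link(address - previous_address, occurrences)
        if tail is None:
            header.first = link
        else:
            tail.next = link
        tail = link
        previous_address = address
        header.total_occurrences += occurrences
    return header


def walk(header: Header, url_table) -> Iterator[tuple]:
    """Follow the links, rebuilding absolute addresses from the deltas."""
    address = 0
    link = header.first
    while link is not None:
        address += link.address_delta
        yield url_table[address], link.occurrences
        link = link.next


# Hypothetical lookup table mapping addresses (list indices) to URLs.
url_table = ["/home", "/guides/cameras", "/guides/lenses", "/yellow/photo"]
term_list = build_list([(3, 2), (1, 1)])
print(list(walk(term_list, url_table)))
```

Storing deltas requires the links to be kept in address order, which is why the sketch sorts the postings before linking them.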
Inverted term lists are helpful for many techniques for searching large collections of documents for documents of interest. For example, in retrieving portions of a Web site that may be of interest to a user who has accessed another portion of the site, it may be necessary to locate documents from the Web site which contain a particular word. However, if the Web site is large, it is not desirable to conduct a full new search of the Web site for documents containing the specified word. Inverted term lists resolve that problem. To locate documents containing a particular word of interest, it is simply necessary to consult the inverted term list for that word. It is also possible to list the documents in the inverted term list such that those that use the desired word more often are placed at the top of the list; this may help find the most useful document more quickly.
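A single-word lookup of this kind may be sketched as below, assuming each inverted term list is held as a mapping from document location to occurrence count. The function name `documents_for` and the sample data are hypothetical.

```python
def documents_for(index, word):
    """Return documents containing `word`, most occurrences first."""
    postings = index.get(word.lower(), {})
    # Ties are broken by document location so the order is deterministic.
    return sorted(postings, key=lambda doc: (-postings[doc], doc))


# Hypothetical index: term -> {document location: occurrences}.
index = {
    "camera": {"/guides/cameras": 1, "/yellow/photo": 2, "/shop/tripods": 1},
    "lens": {"/guides/cameras": 3},
}
ranked = documents_for(index, "camera")
print(ranked)
```

No document text is scanned at query time; only the one inverted term list for the requested word is consulted.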
More complicated requests also may be handled with inverted term lists. For example, if it is required to locate documents in which two particular words occur, it is simply necessary to consult the inverted term lists for both words, and to choose any documents which are found on both lists. Again, documents that may be more useful may be placed higher on the list of useful documents, according to considerations such as but not limited to how many occurrences they have of the desired words.
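The two-word case may be sketched as an intersection of inverted term lists. The sketch assumes the same mapping representation (document location to occurrence count) and ranks by the summed occurrences of the requested words, which is only one of the possible ranking considerations; `documents_with_all` and the sample data are hypothetical.

```python
def documents_with_all(index, words):
    """Return documents found on every word's inverted term list."""
    lists = [index.get(word.lower(), {}) for word in words]
    if not lists:
        return []
    common = set(lists[0])
    for postings in lists[1:]:
        common &= set(postings)

    def total(doc):
        # Combined occurrences of all requested words in this document.
        return sum(postings[doc] for postings in lists)

    return sorted(common, key=lambda doc: (-total(doc), doc))


# Hypothetical index: term -> {document location: occurrences}.
index = {
    "osteoporosis": {"/a": 2, "/b": 1, "/c": 4},
    "women": {"/a": 1, "/c": 1, "/d": 5},
}
both = documents_with_all(index, ["osteoporosis", "women"])
print(both)
```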
Current techniques for Web searching and retrieval, including techniques for searching and retrieving information in a large Web site, may pose problems when they maintain information about the documents in the collection in an accessible database only by means of inverted term lists. In particular, such techniques organize and maintain information by the terms of interest rather than by the underlying document. This leads to a number of problems in retrieving portions of a Web site that may be of interest to a user who has accessed another portion of the Web site, which will now be discussed.
For example, in the course of choosing documents anticipated to be useful to a user, it may be useful to calculate the score a given document will achieve under a particular search query. Under conventional methods, where no information is stored by document in a database, it is necessary, in order to calculate a document score, to consult an inverted term list for each term in the search query, and to search within each such inverted term list to determine whether that term occurs in the document in question. It would be more efficient if, in calculating the document score, one could avoid consulting inverted term lists for terms which do not occur in the document.
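The cost of the conventional approach may be illustrated with a small sketch. It assumes each inverted term list is a sequence of (document, occurrences) pairs and that the score is simply the sum of occurrences of the query terms; the scoring rule, the function name `score_document`, and the sample data are assumptions made for illustration.

```python
def score_document(index, query_terms, doc_id):
    """Score one document using inverted term lists only.

    Returns (score, number of inverted term lists consulted).
    """
    score = 0
    lists_consulted = 0
    for term in query_terms:
        lists_consulted += 1
        # Search within the term's list for the document in question.
        for posting_doc, occurrences in index.get(term, []):
            if posting_doc == doc_id:
                score += occurrences
                break
    return score, lists_consulted


# Hypothetical index: term -> list of (document location, occurrences).
index = {
    "osteoporosis": [("/a", 2), ("/c", 4)],
    "women": [("/a", 1), ("/b", 3), ("/d", 5)],
    "calcium": [("/b", 1)],
}
result = score_document(index, ["osteoporosis", "women", "calcium"], "/c")
print(result)
```

Here document "/c" contains only one of the three query terms, yet all three inverted term lists are consulted and two of them are searched without contributing to the score.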
A further problem arises because some conventional methods do not store information in a manner organized by document. It is recognized that searches for useful documents can take a relatively long time to process. This is because, as the search criteria become more complicated, more and more inverted term lists need to be referenced. Moreover, as the underlying Web site becomes bigger, each inverted term list becomes longer, including as it does all references to the term in question in the Web site. An inverted term list is likely to be particularly long if the term in question is relatively common.
Prior efforts to address this problem include refusing to permit the use of common words as part of a search inquiry. As noted above, words such as “the” or “and” may be omitted. Other common words, however, can be of use in narrowing down the search to more useful documents. For example, it might be of interest to find all documents referring to the occurrence of “osteoporosis” in “women.” While searching on “osteoporosis” alone will produce these documents, it may also produce many extraneous documents. It would thus be useful to use the word “women” to refine the search. But this word is very common, and hence is likely to occur in many documents. There is thus a need for a method of making complex searches which include many terms more efficient.
The method that is often referred to as “query by example” is one of a number of prior methods of using the contents of a single document to find similar documents. This may be done, for example, through relevance feedback: a user begins with a search query, is presented with documents selected by means of the query, and identifies one (or more) of the resulting documents as relevant. The method then chooses terms from the documents identified as relevant (based upon their statistics), expands the query with those terms, and reruns it. This method, however, has not been applied to automatically link together portions of a Web site without the need for explicit hyperlinking or manual indexing of the pages.
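The query expansion step of relevance feedback may be sketched as follows. The text does not specify which statistics are used to choose terms, so this sketch assumes the simplest option, raw term frequency across the documents marked relevant; the function name `expand_query`, its parameters, and the sample texts are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, relevant_texts, extra_terms=2,
                 stop_words=frozenset({"the", "and", "of", "in"})):
    """Add the most frequent new terms from the relevant documents."""
    counts = Counter()
    for text in relevant_texts:
        counts.update(
            term for term in text.lower().split()
            if term not in stop_words and term not in query_terms
        )
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return list(query_terms) + ranked[:extra_terms]


# Hypothetical documents the user identified as relevant.
expanded = expand_query(
    ["osteoporosis"],
    ["osteoporosis in women and bone density", "bone loss in older women"],
)
print(expanded)
```

The expanded query would then be rerun against the collection in the same way as the original query.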