This invention relates to techniques for organizing material on computer networks for retrieval, and more particularly to methods of indexing material of interest to a user.
Computer networks have become increasingly important for the storage and retrieval of documents and other material.
The Internet, of which the World Wide Web is a part, includes a series of interlinked computer networks and servers around the world. Users of one server or network connected to the Internet may send information to, or access information on, other networks or servers connected to the Internet by the use of various computer programs which allow such access, such as Web browsers. The information is sent to, or received from, a network or server in the form of packets of data.
The World Wide Web portion of the Internet comprises a subset of interconnected Internet sites which may be characterized as including information in a format suitable for graphical display on a computer screen. Each site may include one or more separate pages. Pages, in turn, may include links to other pages within the site, or to pages in other Web sites, facilitating the user's rapid movement from one page or site to another.
In view of the quantity of information and material available on computer networks such as the Web, and for other reasons as well, automated or semi-automated techniques may be employed for retrieving information that is thought to be relevant to a user at a given time. These techniques may be utilized in response to a specific user request, as when a user submits a search query seeking information. These techniques also may be utilized when a user is accessing certain material, in order to make available other material that is thought likely to be of interest to a user who has accessed the original material. These techniques may also be utilized when a user, given access to particular material, requests other similar material. Other situations in which these information retrieval techniques may be employed will also be apparent to one of ordinary skill in the art.
Some information retrieval techniques such as are employed in these circumstances choose documents for retrieval from among documents in a collection based upon the occurrence of specified terms in the documents in the collection. (Hereinafter, for simplicity, "document" shall be used to refer to the items, such as Web pages or Web sites, in the collection being analyzed.) There are a variety of different techniques for specifying the terms to be used. (A "term" may be any word, number, acronym, abbreviation or other collection of letters, numbers and symbols which may be found in a fixed order in a document.) In some methods, a search may be made among the documents in the collection for some or all of the terms in a search query generated by the user. In other methods, a search may be made for some or all of the text of a given document. (In some methods, all terms except certain common words, referred to as stop words, such as "the" or "and", may be included in the search.) In other methods, a search may be made for index terms which have been associated with that document by various means. Still other methods will use a combination of the above techniques, and further approaches to selecting terms for which a search is to be made will be familiar to one of ordinary skill in the art.
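By way of illustration only, one way of compiling a list of terms from a search query or from the text of a document, with stop words excluded, might be sketched as follows. The function name `extract_terms`, the contents of `STOP_WORDS` beyond "the" and "and", and the restriction of terms to runs of letters and digits are assumptions made for this sketch and are not drawn from any particular method described above.

```python
import re

# Hypothetical stop-word list. Only "the" and "and" are named in the text;
# the remaining entries are assumptions added for illustration.
STOP_WORDS = {"the", "and", "a", "of"}


def extract_terms(text, stop_words=STOP_WORDS):
    """Return the distinct terms in `text`, in order of first appearance,
    with stop words removed.

    Simplification: a term is taken to be a run of letters and digits,
    compared without regard to case.
    """
    terms = []
    seen = set()
    for token in re.findall(r"[A-Za-z0-9]+", text.lower()):
        if token in stop_words or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


query_terms = extract_terms("The history of the automobile and the engine")
```

The same function may be applied either to a user's query or to the full text of a given document, which corresponds to two of the approaches to term selection described above.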
After a list of terms for which a search is to be made has been compiled, many information retrieval techniques then proceed by calculating scores for each document in the collection over which the search is being made, based upon the occurrence of the terms on the list in the documents. These scores which are calculated may be referred to as term frequency scores, insofar as the score assigned to a document depends on the frequency of occurrence of terms in the document.
There are a variety of different formulae which may be used to calculate these term frequency scores, including, for example, Robertson's term frequency score (RTF). Term frequency score formulae may assign varying weights to terms found in a document, depending upon such factors as the relative rareness or commonness of the term. Other factors which may be used to vary the weight assigned to a term in calculating a term frequency score will also be apparent to one of ordinary skill in the art.
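By way of illustration only, a term frequency score in which rarer terms receive greater weight might be computed as sketched below. The sketch uses a generic logarithmic inverse-document-frequency weight chosen for simplicity; it is not Robertson's term frequency formula, and the function name `score_documents` and the representation of each document as a list of tokens are assumptions made for this example.

```python
import math
from collections import Counter


def score_documents(terms, documents):
    """Score each document by the weighted occurrences of `terms` in it.

    `documents` maps a document identifier to a list of tokens.
    A term found in fewer documents receives a larger weight
    (an assumed weighting, not a formula taken from the text).
    """
    n_docs = len(documents)
    # Number of documents in the collection containing each term at least once.
    doc_freq = {
        t: sum(1 for tokens in documents.values() if t in tokens)
        for t in terms
    }
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        score = 0.0
        for t in terms:
            if doc_freq[t] == 0:
                continue  # term absent from the whole collection
            weight = math.log(1 + n_docs / doc_freq[t])
            score += counts[t] * weight
        scores[doc_id] = score
    return scores


collection = {
    "a": ["car", "car", "sales"],
    "b": ["car", "service"],
    "c": ["garden"],
}
scores = score_documents(["car", "service"], collection)
```

In this small collection, document "b" outscores document "a" even though "a" contains "car" twice, because "service" occurs in only one document and is therefore weighted more heavily.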
Documents in a collection which is being searched may be divided into different sections or segments, such as an introduction or summary, a main body, footnotes, captions, and the like. Other divisions of documents will be apparent to one of ordinary skill in the art.
A Web site may permit a user to obtain lists of relevant items of interest, such as Web sites, other documents or names of merchants carrying merchandise in particular categories. The site may be organized so that an item of interest may be considered to be in more than one category. The site may be organized so that the categories presented to the user may vary, depending on a term or terms specified by the user. If this approach is utilized, the user may input terms that relate to the merchandise in which he is interested, such as "automobiles", and in return he may be presented with several categories, such as "automobiles, manufacturers" or "automobiles, sales" or "automobiles, service." The categories presented may be chosen by any one of a number of techniques that will be familiar to one of ordinary skill in the art.
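By way of illustration only, a site organized in this manner might select the categories to present by matching the user's term against the parts of each category label, as sketched below. The merchant names, the `CATEGORY_INDEX` table and the function `categories_for_term` are hypothetical and are introduced solely for this example; simple label matching is only one of the many techniques that could be used.

```python
# Hypothetical category table: each category label maps to the items
# (here, merchant names) listed under it. An item may appear in more
# than one category.
CATEGORY_INDEX = {
    "automobiles, manufacturers": ["Acme Motors"],
    "automobiles, sales": ["Acme Motors", "City Auto Mall"],
    "automobiles, service": ["City Auto Mall"],
    "books, sales": ["Corner Books"],
}


def categories_for_term(term, index=CATEGORY_INDEX):
    """Return, in sorted order, the category labels having `term`
    as one of their comma-separated parts."""
    wanted = term.strip().lower()
    return sorted(
        label
        for label in index
        if wanted in [part.strip().lower() for part in label.split(",")]
    )


def categories_for_item(item, index=CATEGORY_INDEX):
    """Return, in sorted order, every category in which `item` is listed."""
    return sorted(label for label, items in index.items() if item in items)


presented = categories_for_term("automobiles")
```

Here a user who inputs "automobiles" is presented with the three automobile categories, and the hypothetical merchant "Acme Motors" is found in both the manufacturers and the sales categories.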
It may be desirable to present additional material to a user who is searching for items of interest. For example, it may be desirable to present the user with banner advertisements which relate to the item of interest for which he is searching.