Search engines use several techniques to rate or sort the pages returned by a search. Among the known techniques for exploring a set of Web pages, some rely on semantics: a page is rated as more relevant the more occurrences it contains of the word or words searched for. These techniques are vulnerable to a practice known as “spamming”, which consists in making the words commonly used by Internet users in their search queries appear a very large number of times in a given page, so that the page is frequently rated as relevant.
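By way of illustration only, occurrence-based rating and its sensitivity to “spamming” can be sketched as follows. The function name `occurrence_score`, the tokenization rule and the two sample pages are assumptions made for this sketch; they are not taken from any particular search engine.

```python
import re
from collections import Counter


def occurrence_score(page_text, query):
    """Return the total number of occurrences of the query words in the page.

    Hypothetical scoring rule: words are lowercased runs of word characters,
    and the score is the plain sum of the counts of each query word.
    """
    word_counts = Counter(re.findall(r"\w+", page_text.lower()))
    query_terms = re.findall(r"\w+", query.lower())
    return sum(word_counts[term] for term in query_terms)


# An ordinary page that genuinely deals with the query subject.
honest_page = "Cheap flights to Paris. Compare flights and book a hotel."

# A "spammed" page: a popular query word is repeated many times,
# although the page is about something else.
spam_page = "flights " * 50 + "Buy watches here."

query = "cheap flights"
honest = occurrence_score(honest_page, query)
spam = occurrence_score(spam_page, query)
print(honest, spam)
```

Under this simple rule, the spammed page obtains a much higher score than the ordinary page, although it is less relevant to the query.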
Other techniques are based on the topological structure of the Web. These techniques take into account both the existing links between the pages considered and the properties of the pages themselves, such as the membership of a page in a network domain or subdomain of the Web. They are generally based on a graph-type representation of the pages to be processed, and are suited to classifying pages that satisfy given topological properties in the graph. These techniques are vulnerable to a variant of “spamming” in which a given page is referenced a large number of times, which locally distorts the topological characteristics of the graph of the Web.
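The effect of this variant of “spamming” on a link-based measure can be illustrated with a deliberately simple sketch. It assumes that the graph is given as a list of (source, target) links and that pages are rated by their number of incoming links. This measure, the function name `in_degree` and the page names are hypothetical and are chosen only to show the distortion.

```python
from collections import Counter


def in_degree(edges):
    """Count, for each page, the number of links pointing to it."""
    return Counter(target for _source, target in edges)


# Links found between ordinary pages.
organic = [("a", "b"), ("c", "b"), ("a", "c")]

# A set of artificial pages created only to reference the page "spam".
farm = [(f"farm{i}", "spam") for i in range(20)]

counts = in_degree(organic + farm)
print(counts.most_common(3))
```

In this sketch the artificially referenced page obtains the highest count, which shows how a local group of links can falsify a purely topological rating.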
Some of the techniques that use the topological structure of the Web classify the Web pages by allocating to each page a rank that depends on the relationships between that page and the others.
An example of such a procedure, known by the term “PageRank”, is used in the implementation of the Google™ search engine and is described in the document: “The PageRank Citation Ranking: Bringing Order to the Web”, by L. Page, S. Brin, R. Motwani and T. Winograd; Technical Report, Computer Science Department, Stanford University, 1998.
The PageRank procedure orders the pages as a function of their visibility on the Web. In this procedure, random page-by-page browsing of the Web by following hypertext links is simulated. This browsing corresponds to that of a user who, when accessing the Web, randomly activates one of the hypertext links located in the page being viewed so as to access another page. The procedure performs a probabilistic analysis of this simulated browsing so as to determine the probability of the user being on a given page during such random browsing. The more often a page is cited by other pages, the higher its rank.
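The probabilistic analysis of random browsing described above can be sketched as an iterative computation. The following is a minimal sketch and not the implementation used by any search engine. The damping value of 0.85, the uniform treatment of pages without outgoing links, the fixed number of iterations and the four-page example graph are assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate the probability of a random surfer being on each page.

    `links` maps each page to the list of pages it links to.
    `damping` is the assumed probability of following a link rather than
    jumping to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the share corresponding to a random jump.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The surfer picks one of the outgoing links at random.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: from a page without links, the surfer moves
                # to any page with equal probability.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: C is cited by A, B and D; D is cited by no page.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 4))
```

In this example the page cited by the most other pages receives the highest rank, and the page that is never cited receives the lowest, independently of the content of the pages or of any search query.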
The rank provided by such a procedure is not necessarily relevant to the search performed by a user: the best rated pages (those of highest rank) are not necessarily the pages that best meet the user's expectations.
Furthermore, this procedure does not make it possible to identify, within the set of documents, thematic communities or communities of interest capable of steering the user more rapidly to an interesting page.
Finally, where a user identifies in the set of documents presented a document of particular interest, a list of documents ordered simply as a function of their rank does not make it possible to readily determine whether other documents, close to the interesting document or linked with it in one way or another, are present in the set of documents.