There is a growing need for new ways of searching for information that increase the relevance of the documents found and that are simple and intuitive to use, whilst being able to cope with the enormous amounts of information that exist on the internet and in other sources.
Web search systems are an example of one type of information retrieval system although the present invention is concerned with information retrieval systems of any type. Web search systems enable us to find web sites that best suit our requirements. Three main components are used to achieve this: web crawlers; index generators; and query servers.
Web crawlers crawl the web one link at a time and send identified web pages to be indexed. This is achieved by making use of links between web sites. This web crawling process can be thought of as a continual process of identifying new web sites and identifying updates to existing web sites.
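By way of illustration only, the link-following behaviour described above might be sketched as a breadth-first traversal. The `crawl` function, the `get_links` callback and the `TOY_WEB` link graph are hypothetical names introduced for this sketch; a practical crawler would fetch pages over a network and would also revisit known pages to detect updates.

```python
from collections import deque


def crawl(seed_urls, get_links):
    """Follow links one at a time, visiting each discovered page once."""
    seen = set()
    frontier = deque()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    discovered = []
    while frontier:
        url = frontier.popleft()
        # In a full system the page would be sent to the index generator here.
        discovered.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return discovered


# Toy link graph standing in for the web (hypothetical data).
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

pages = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
print(pages)
```

The `seen` set prevents a page from being queued twice, so cycles between web sites do not cause the traversal to loop.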
The crawling process enables many billions of web pages to be identified. In order to make use of this information, a systematic way of retrieving pages is required, and an index generator provides part of this means. Similar to an index in the back of a book, the index generator identifies keywords to associate with each web site's content. Then, when a user searches for those keywords, the search system can find the most appropriate pages out of the billions that are available.
The index generator records such information as how often a term is used on a page and which terms are used in the page title or in the index, for subsequent use by the query server in ranking the documents. Other information, such as the language in which the web site is written and how many other web sites link to the web site concerned, can also be used.
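As a minimal sketch of the kind of index described above, the following builds an inverted index that stores, for each term and page, how often the term occurs and whether it appears in the page title. The `tokenize` and `build_index` functions, the posting layout and the `TOY_PAGES` data are assumptions made for illustration, not a description of any particular index generator.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to {url: {"tf": count, "in_title": bool}}.

    `pages` maps a URL to a dict with "title" and "body" strings.
    """
    index = defaultdict(dict)
    for url, page in pages.items():
        title_tokens = tokenize(page["title"])
        title_terms = set(title_tokens)
        counts = Counter(title_tokens + tokenize(page["body"]))
        for term, tf in counts.items():
            index[term][url] = {"tf": tf, "in_title": term in title_terms}
    return index


# Hypothetical pages used only to exercise the sketch.
TOY_PAGES = {
    "u1": {"title": "Cat care", "body": "A cat needs care. Cat food."},
    "u2": {"title": "Dog food", "body": "Food for a dog."},
}

index = build_index(TOY_PAGES)
print(index["cat"])
print(sorted(index["food"]))
```

Further per-page fields, such as the page language or an inbound link count, could be stored alongside these postings in the same way.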
A query server (also referred to as a search engine) is used to rank the indexed documents on the basis of how well they match user input search terms. The query server analyses the user search terms and compares them with the indexed web pages. It generates a rank or score for each indexed web page on the basis of the user input search terms. In this way, web pages relevant to the user search terms are identified, with scores or ranks indicating the likelihood of relevance.
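The ranking step can be illustrated with a simple scoring function. This is a sketch under stated assumptions: the `rank` function, the `title_boost` parameter, the posting layout and the `TOY_INDEX` data are hypothetical, and the scoring rule (term frequency, weighted more heavily for title terms) is one simple choice among many rather than the method of any specific query server.

```python
def rank(index, query, title_boost=2.0):
    """Score each indexed page against the query terms, highest score first.

    `index` maps a term to {url: {"tf": int, "in_title": bool}}.
    """
    scores = {}
    for term in query.lower().split():
        for url, posting in index.get(term, {}).items():
            # Assumed rule: a term found in the title counts for more.
            weight = title_boost if posting["in_title"] else 1.0
            scores[url] = scores.get(url, 0.0) + weight * posting["tf"]
    # Sort by descending score, breaking ties by URL for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index contents used only to exercise the sketch.
TOY_INDEX = {
    "python": {
        "p1": {"tf": 3, "in_title": True},
        "p2": {"tf": 1, "in_title": False},
    },
    "search": {
        "p2": {"tf": 2, "in_title": True},
    },
}

results = rank(TOY_INDEX, "python search")
print(results)
```

Pages that contain none of the query terms receive no score and are omitted from the results.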
There is an ongoing need to improve the relevance of items retrieved by information retrieval systems such as web search systems. In addition, there is a need to achieve this in a fast and computationally inexpensive manner, which reduces the need for storage resources where possible.
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known information retrieval systems.