Nowadays, information retrieval from electronic documents is fundamental to the functioning of our society. Such information retrieval may be performed on a set of documents, e.g. an electronic database. The set may be stored in a centralized manner, e.g. on a personal computer or on a private network, or in a distributed manner, e.g. on a virtual private network having nodes in different geographical locations or on publicly accessible networks such as the Internet.
Often, the extremely large number of available electronic documents makes it difficult to retrieve the desired information in an efficient manner. To address this, attempts have been made to determine the relevance of electronic documents based on their information content, such that automated information retrieval processes return the electronic documents that are most likely to contain information relevant to the search.
An electronic document typically comprises a plurality of pieces (units) of information, which are also referred to as ‘terms’. The classical method of indexing and retrieving electronic documents assigns a weight w_k to a term k to characterize an electronic document. This weight is directly proportional to the frequency of the term in the electronic document (term frequency, TF) and inversely proportional to the number of documents in which the term occurs (document frequency, DF), i.e. w_k ∝ TF/DF. The reciprocal of DF is commonly referred to as the inverse document frequency (IDF), so that equivalently w_k ∝ TF × IDF. This method relies on indexing all the terms, e.g. words, of the electronic document irrespective of whether they are core to the document content or are peripheral in nature. Consequently, information retrieval algorithms utilizing the assigned weights w_k in respective electronic documents do not necessarily return a set of electronic documents that are relevant to the search query defined by a user.
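By way of illustration only, the classical weighting described above may be sketched as follows. The function name `tf_idf_weights`, the whitespace tokenization, the length-normalized term frequency and the logarithmic damping of the inverse document frequency are all assumptions made for this sketch; they are one common variant and are not prescribed by the method described above.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: terms are lower-cased, whitespace-separated words.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight grows with the term's frequency in the document and
        # shrinks as the term appears in more documents of the set.
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf_weights(docs)
```

Every term of every document receives a weight in this sketch, whether or not it is core to the document content, which is the limitation noted above.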
Information retrieval processes may also utilize a user profile that defines the interests of the user in order to retrieve, from a database, a set of electronic documents that are most likely to be of interest to the user. For example, the Rocchio algorithm analyses the electronic documents that have been accessed by the user and assumes the accessed documents to be relevant. It weights high frequency terms in relevant electronic documents positively and high frequency terms in irrelevant electronic documents, i.e. non-accessed documents, negatively.
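By way of illustration only, a profile of the kind described above may be sketched as follows. The function name `rocchio_profile`, the whitespace tokenization, the use of length-normalized term frequencies and the values of the coefficients `beta` and `gamma` are assumptions made for this sketch; the sketch omits the initial query term of the full Rocchio formulation and is not a definitive implementation of that algorithm.

```python
from collections import Counter


def rocchio_profile(accessed, not_accessed, beta=0.75, gamma=0.15):
    """Build a {term: weight} user profile (hypothetical helper).

    accessed:     documents assumed to be relevant to the user
    not_accessed: documents assumed to be irrelevant to the user
    beta, gamma:  illustrative positive and negative coefficients
    """

    def centroid(docs):
        # Average length-normalized term frequency over a set of documents.
        total = Counter()
        for doc in docs:
            tokens = doc.lower().split()
            for term, count in Counter(tokens).items():
                total[term] += count / len(tokens)
        return {t: v / len(docs) for t, v in total.items()} if docs else {}

    positive = centroid(accessed)
    negative = centroid(not_accessed)

    # Frequent terms in accessed documents are pushed up; frequent terms
    # in non-accessed documents are pushed down.
    terms = set(positive) | set(negative)
    return {
        t: beta * positive.get(t, 0.0) - gamma * negative.get(t, 0.0)
        for t in terms
    }


profile = rocchio_profile(
    accessed=["solar panel efficiency", "solar cell cost"],
    not_accessed=["football match results"],
)
```

In this sketch a term that occurs often in the accessed documents ("solar") dominates the profile, while a term that occurs only once receives a smaller weight regardless of how central it is to the document.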
However, the actual interests of a user may be confined to only a small part of an accessed electronic document, namely a part that is core to the document, rather than extending to everything in the document. Hence, even a low-TF term may be critical to the electronic document from an information retrieval perspective. Thus, a personalized search or information retrieval application based on a user profile constructed using only high-TF terms may return a significant number of irrelevant results.