Nowadays, a plethora of electronic knowledge repositories, such as databases and file systems, are available that can provide valuable information. Such repositories may be stored locally on a computer or may be accessible over the Internet. Probably the best-known example of such a repository is the online encyclopedia Wikipedia; other examples will be apparent. The reliability of the information in such databases, and in Wikipedia in particular, has become very good; it has for instance been reported that the accuracy of Wikipedia is comparable to that of the Encyclopedia Britannica. As a result, electronic databases are increasingly used as instruments for processing electronic information.
In particular, electronic documents may be referenced against such an electronic database. To this end, the content of the electronic document is compared against the content of the electronic database, and corresponding content can be labeled accordingly. This labeling can be used to identify key phrases in the electronic document, for instance to provide an accurate summary of the electronic document or to prepare it for insertion into the electronic database. In the latter case, the key phrases of the electronic document are converted into hyperlinks so that, once the document has been added to the electronic database, users accessing it can quickly jump to the related subject.
One of the problems occurring when trying to extract key phrases from an electronic document is how to distinguish between a key phrase and a phrase of lesser relevance. A common approach is to count the number of occurrences of a phrase in the electronic document to identify the more relevant phrases. Alternative approaches include the χ² (chi-squared) independence test, which assesses whether the occurrence frequency of a phrase in a document is higher than would be expected by chance, as well as the keyphraseness approach, which decides whether or not a phrase is a key phrase based on how frequently this phrase has been selected as a key phrase in other database documents.
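The occurrence-count and keyphraseness approaches mentioned above can be illustrated with a minimal Python sketch. It is an illustration only, not the method of any particular system: the function names, the corpus layout (a list of text and key-phrase-set pairs) and the definition of keyphraseness as the ratio of documents in which a phrase was marked as a key phrase to documents in which it occurs at all are assumptions made for this example.

```python
from collections import Counter


def phrase_counts(text, phrases):
    """Count occurrences of each candidate phrase in a document.

    Matching is a case-insensitive substring match (an assumption made
    for brevity); a real system would tokenize the text first.
    """
    lowered = text.lower()
    return Counter({p: lowered.count(p.lower()) for p in phrases})


def keyphraseness(phrase, corpus):
    """Estimate how often a phrase is chosen as a key phrase when it occurs.

    corpus is a hypothetical list of (text, key_phrases) pairs, where
    key_phrases is the set of phrases marked as key phrases (for example
    hyperlinked) in that document. Returns 0.0 if the phrase never occurs.
    """
    p = phrase.lower()
    containing = [(text, keys) for text, keys in corpus if p in text.lower()]
    if not containing:
        return 0.0
    marked = sum(1 for _, keys in containing if p in {k.lower() for k in keys})
    return marked / len(containing)


if __name__ == "__main__":
    document = "Graph theory studies graphs. A graph has nodes."
    print(phrase_counts(document, ["graph", "nodes"]))

    corpus = [
        ("the graph theory page", {"graph theory"}),
        ("graph theory is fun", set()),
        ("nothing here", set()),
    ]
    print(keyphraseness("graph theory", corpus))
```

A pure occurrence count favors phrases that are merely frequent, whereas the ratio above favors phrases that editors of other documents have tended to mark as key phrases, which is why the two measures can rank the same candidates differently.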
After potential key phrases have been identified, the actual key phrases are typically selected by assessing whether a subject in the electronic database corresponding to such a phrase is a subject of particular relevance. Several algorithms exist to assess the relevance of a subject. For instance, a well-known algorithm is PageRank, the algorithm used by Google to find the most relevant pages for a user-defined query. This algorithm treats the database on which it operates as a directed graph and assigns ranking values to the nodes of the graph using a recursive approach, in which the value of each node is calculated from the values of the nodes that link to it.
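The graph-based ranking described above can be sketched as follows. This is a simplified, iterative formulation in the style of PageRank rather than the algorithm as deployed by any search engine; the damping factor of 0.85, the fixed iteration count, the uniform handling of nodes without outgoing links and the name `pagerank` are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute approximate ranking values for the nodes of a directed graph.

    links maps each node to the list of nodes it links to. The damping
    factor and iteration count are illustrative defaults, not values
    taken from the text.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the same small base value.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its current value on to the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without outgoing links spreads its value uniformly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
    for node, value in sorted(pagerank(graph).items()):
        print(node, round(value, 3))
```

In this small graph, node C is linked to by both A and B and therefore ends up with the highest value, while B, which no node links to, ends up with the lowest.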
However, the known approaches still suffer from problems. This is because electronic databases typically contain thousands of subjects, such that many phrases in the document under consideration can be matched with a subject in the electronic database. Consequently, the known approaches have a tendency to select too many phrases as key phrases. Compensating for this problem by adjusting a selection threshold can cause genuine key phrases to be incorrectly de-selected.