In some applications, it is desirable to identify occurrences of a named entity in a set of documents. A named entity often corresponds to a proper noun, e.g., the name of a person, organization, location, product, event, etc. This task may be challenging, however, because a named entity may correspond to a string having two or more meanings (i.e., a homograph). For example, assume the goal is to identify documents which contain references to Apple® computers. Some of the documents may use the word “apple” in the context of fruit, rather than computers.
One known way to address this problem is via a content-matching technique. This technique entails identifying the context in which a document mentions a string corresponding to the named entity in question, e.g., the word “apple.” The technique then compares this context information with a priori reference information associated with the named entity, such as an online encyclopedia entry corresponding to Apple® computers. If the context information matches the reference information, the technique can conclude that the mention of “apple” in the document likely corresponds to Apple® computers.
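By way of illustration only, the content-matching technique described above might be sketched as follows. The sketch assumes a bag-of-words representation and a cosine-similarity comparison, neither of which is prescribed by the description; the function names, stopword list, window width, and threshold are likewise hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "as", "such", "i", "with"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def context_window(tokens, target, width=5):
    """Collect tokens within `width` positions of each mention of `target`."""
    context = []
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            # Exclude other mentions of the target string itself.
            context.extend(t for t in window if t != target)
    return context


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mentions_entity(document, target, reference, threshold=0.2):
    """Return True if the context around `target` matches the reference text."""
    context = Counter(context_window(tokenize(document), target))
    ref = Counter(tokenize(reference))
    return cosine(context, ref) >= threshold


# Stand-in for an encyclopedia entry; the text is invented for the example.
reference = "Apple designs computers, phones, and software such as the Mac and the iPhone."
doc_tech = "I bought a new apple laptop and the software runs fast on the mac."
doc_fruit = "She ate a ripe apple with cheese after lunch in the orchard."

print(mentions_entity(doc_tech, "apple", reference))
print(mentions_entity(doc_fruit, "apple", reference))
```

In this sketch, the first document shares the terms “software” and “mac” with the reference text and is therefore accepted, while the second document shares no terms with it and is rejected. Note that the comparison cannot be made at all unless some reference text is available for the named entity in question.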
This approach, however, is not fully satisfactory. One drawback is that many named entities have no counterpart reference documents that provide authoritative information regarding the named entities.