The exemplary embodiment relates to text processing. It finds particular application in the context of assessing whether text elements have been correctly classified as named entities and identifying previously unclassified text elements as named entities.
A named entity is a group of one or more words (a text element) that identifies an entity by name. For example, named entities may include persons (such as a person's given name or role), organizations (such as the name of a corporation, institution, association, government or private organization), locations (such as a country, state, town, geographic region, or the like), artifacts (such as the names of consumer products, for example cars), specific dates, and monetary expressions. Named entities are typically capitalized in use to distinguish them from ordinary nouns.
Named entities are of great interest for the task of information extraction in general, and for many other text processing applications. Identifying a group of words as a named entity can provide additional information about the sentence in which it is used. Techniques for recognizing named entities in text typically rely on a lexicon which indexes entries that are named entities as such, and may further apply grammar rules, such as requiring capitalization, or use statistical analysis, to confirm that the group of words should be tagged as a named entity. For example, the lexicon WordNet is an online resource which can be used to identify a group of words as forming a named entity. This lexicon also indexes the entries according to one or more of a set of semantic types. Lower-level types are grouped together under supertypes.
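By way of illustration only, the lexicon-based approach described above may be sketched as follows. The sketch assumes a small hand-built lexicon (`TOY_LEXICON`) standing in for a resource such as WordNet, and a single grammar rule requiring each word to be capitalized. The lexicon entries, the semantic type labels, and the function name `tag_named_entities` are hypothetical and are not taken from any particular system.

```python
# Hypothetical toy lexicon mapping lower-cased word sequences to a semantic type.
# A real system might instead draw its entries from a resource such as WordNet.
TOY_LEXICON = {
    ("john", "smith"): "person",
    ("new", "york"): "location",
    ("xerox",): "organization",
}
MAX_ENTRY_LEN = max(len(entry) for entry in TOY_LEXICON)


def tag_named_entities(tokens):
    """Return (start, end, text, semantic_type) tuples for lexicon matches.

    A candidate is accepted only if it is in the lexicon AND every word
    in it is capitalized (the illustrative grammar rule).
    """
    results = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-word entries win.
        for n in range(min(MAX_ENTRY_LEN, len(tokens) - i), 0, -1):
            span = tokens[i:i + n]
            key = tuple(word.lower() for word in span)
            capitalized = all(word[:1].isupper() for word in span)
            if key in TOY_LEXICON and capitalized:
                match = (i, i + n, " ".join(span), TOY_LEXICON[key])
                break
        if match:
            results.append(match)
            i = match[1]  # continue after the matched entity
        else:
            i += 1
    return results


tokens = "John Smith moved to New York to work for Xerox".split()
entities = tag_named_entities(tokens)
for start, end, text, semantic_type in entities:
    print(f"{text!r} at tokens {start}-{end}: {semantic_type}")
```

In this sketch the capitalization rule rejects the lower-case sequence "new york" even though it is in the lexicon. This shows how a grammar rule can confirm or reject a lexicon match, and also how such a rule may miss entities that are written without capitals.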
Automated recognition of named entities in text is often difficult because the words which make up the named entity can appear in more than one context, and thus have usage outside the named entity context. Many systems, symbolic or statistical, automatically spot and categorize named entities with relatively good accuracy (an F-score of 90% or above). However, the accuracy which such systems provide is sometimes not sufficient for applications where an accuracy of 100% is sought. For example, for document anonymization, a single named entity remaining in a redacted text can render the entire anonymization process worthless. While manual correction of the automatically processed data can be used to identify named entities which the automated process has missed, it can be a tedious task and is not always error free.
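The gap between a 90% score and complete coverage can be made concrete with a short calculation. The sketch below assumes that the figure quoted above refers to the balanced F-score (the harmonic mean of precision and recall), and uses hypothetical counts chosen purely for illustration. They are not measurements from any actual system.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical document containing 100 named entities:
# 90 are found, 10 are missed, and 10 ordinary words are wrongly tagged.
found, wrongly_tagged, missed = 90, 10, 10
score = f_score(found, wrongly_tagged, missed)
print(f"F-score: {score:.2f}, named entities left unredacted: {missed}")
```

Under these assumed counts the score is about 0.90, yet ten named entities would remain in the text, which illustrates why such a score may be inadequate for an application such as anonymization.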
There remains a need for a substantially automated method and system capable of providing improvements in named entity recognition and correction.