The invention relates in general to the field of information extraction, and more particularly to methods of training a named entity identification and classification system.
Information extraction systems are a class of tools designed to automatically extract useful information from media, such as text transcripts. Information extraction systems include name taggers, entity identifiers, relationship identifiers, and event identifiers. Name taggers identify named entities, such as people, places, and organizations, in media. Entity identifiers identify linkages between separate words in a media corpus that correspond to the same entity, for example, linkages between a pronoun and a proper name. Relationship identifiers identify relationships, such as employment or location relationships, between two entities in a text. For example, a relationship identifier might determine that John Doe works for Business, Inc., or that Jane Doe is in New York City. Event identifiers identify facts related to entire events. For example, an event identifier might identify a terrorist attack as an event. The event identifier also might link the event to the participants in the attack, the number of casualties, the time at which the attack took place, etc.
Name taggers are trained to identify named entities in text. For example, name taggers are used in the intelligence community to identify intercepted communications related to particular individuals, such as Osama Bin Laden, or places, such as the American Embassy in Cairo. Name taggers can also be used in general search engines and language analysis tools. They also often serve as the foundation for relationship and event identification systems.
Consider the problem of a name tagger extracting named entities from the following text passage:
“George Bush went to New York to speak at the United Nations.”
To an English-speaking person, it would likely be clear that “George Bush” is the name of a person, “New York” is the name of a place, and “United Nations” is the name of a geopolitical entity. However, to a machine, the above sentence includes a number of words that could prove tricky to identify. For example, if a machine were to evaluate each word of the sentence absent any context, the machine might determine that the word “Bush” refers to a plant. Similarly, the words “New”, “United”, and “Nations” might be difficult for a machine to identify and classify appropriately, as these words have uses other than as parts of named entities.
Information extraction systems can be classified into two general categories: those that are based on generative models and those that are based on discriminative models. When processing input, a generative model assigns a joint probability to the input data and to the possible outputs. For example, in a name tagging problem, a generative model assigns probabilities both to the words being tagged and to the possible tags themselves. The generative model ensures that the total probability over all possible input-output sequences sums to 1.0. To do so, a generative model is typically organized as a sequence of choices, with the product of the probabilities of those choices yielding the probability for the whole sequence. The model derives these probabilities by processing annotated training data.
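By way of illustration only, the following Python sketch shows a generative model organized as a sequence of choices. It assumes a toy model in which each step chooses a tag given the previous tag and then chooses a word given that tag; the tag set, vocabulary, and every probability value are hypothetical and are not derived from any training data. The sketch multiplies the probabilities of the individual choices to obtain the probability of a whole word-and-tag sequence, and checks that these probabilities sum to 1.0 over all sequences of a fixed length.

```python
from itertools import product
from math import isclose

# Hypothetical toy tag set and vocabulary (illustrative assumptions only).
TAGS = ["PERSON", "OTHER"]
WORDS = ["bush", "went"]

# Hypothetical probabilities; each distribution sums to 1.0.
START = {"PERSON": 0.3, "OTHER": 0.7}
TRANSITION = {
    "PERSON": {"PERSON": 0.4, "OTHER": 0.6},
    "OTHER": {"PERSON": 0.2, "OTHER": 0.8},
}
EMISSION = {
    "PERSON": {"bush": 0.9, "went": 0.1},
    "OTHER": {"bush": 0.2, "went": 0.8},
}


def joint_probability(words, tags):
    """Probability of the words AND the tags: a product of per-step choices."""
    prob = 1.0
    prev = None
    for word, tag in zip(words, tags):
        # Choice 1: pick the tag (given the previous tag, if any).
        tag_choice = START[tag] if prev is None else TRANSITION[prev][tag]
        # Choice 2: pick the word given the tag.
        prob *= tag_choice * EMISSION[tag][word]
        prev = tag
    return prob


def total_probability(length):
    """Sum the joint probability over every word/tag sequence of one length."""
    return sum(
        joint_probability(words, tags)
        for words in product(WORDS, repeat=length)
        for tags in product(TAGS, repeat=length)
    )


if __name__ == "__main__":
    print(joint_probability(("bush", "went"), ("PERSON", "OTHER")))
    print(isclose(total_probability(2), 1.0))
```

Because every individual choice is drawn from a distribution that sums to 1.0, the product over a sequence of choices also sums to 1.0 across all possible sequences, which is the property described above.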
A discriminative model, on the other hand, assigns probabilities only to possible outputs for a given input. In a name tagging problem, for example, a discriminative model assigns probabilities only to tag sequences. Discriminative models employ a mechanism to assign scores to each possible prediction, and the scores for all the different possible outputs are then normalized to produce probabilities. Discriminative models do not require that the scores sum to anything in particular before normalization. One approach to deriving these scores is to manually define a set of features that characterize tags and their contexts, and then to automatically learn a set of weights for those features. The feature weights then determine the model's predicted scores, and the resulting probabilities aim to match, as closely as possible, the actual occurrences of features in annotated training data.
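By way of illustration only, the following Python sketch shows one way such scoring and normalization could work. The feature definitions, the weight values, and the tag set are hypothetical; in practice the weights would be learned from annotated training data rather than set by hand. The sketch also assumes that normalization is performed by exponentiating each score and dividing by the sum, which is one common choice and is not the only possible one.

```python
from itertools import product
from math import exp

# Hypothetical tag set (illustrative assumption only).
TAGS = ["NAME", "OTHER"]

# Hypothetical hand-set feature weights; a real system would learn these.
WEIGHTS = {
    ("capitalized", "NAME"): 2.0,
    ("lowercase", "OTHER"): 1.5,
    ("prev=NAME", "NAME"): 1.0,
}


def features(words, tags, i):
    """Features characterizing the tag at position i and its context."""
    shape = "capitalized" if words[i][0].isupper() else "lowercase"
    prev = tags[i - 1] if i > 0 else "START"
    return [(shape, tags[i]), (f"prev={prev}", tags[i])]


def score(words, tags):
    """Unnormalized score: sum of the weights of all features that fire."""
    return sum(
        WEIGHTS.get(feat, 0.0)
        for i in range(len(words))
        for feat in features(words, tags, i)
    )


def tag_sequence_probabilities(words):
    """Normalize the scores of all candidate tag sequences for ONE input."""
    candidates = list(product(TAGS, repeat=len(words)))
    exp_scores = [exp(score(words, tags)) for tags in candidates]
    normalizer = sum(exp_scores)
    return {tags: s / normalizer for tags, s in zip(candidates, exp_scores)}


if __name__ == "__main__":
    probs = tag_sequence_probabilities(["George", "Bush", "went"])
    best = max(probs, key=probs.get)
    print(best, round(probs[best], 3))
```

Note that the words themselves are never assigned a probability here; only the candidate tag sequences for the given input are scored and normalized.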
Numerous techniques have been employed in training information extraction systems. Such techniques involve varying levels of human intervention. One typical training technique involves a human linguist annotating a corpus of text to be used as a training set. This technique can be very time consuming, because manually annotating text is a slow process. Another training technique, active learning, involves an information extraction system identifying specific strings of words for a human to annotate. For example, the information extraction system may identify word strings that it cannot confidently classify. In a third training technique, an automated system processes a very large body of text without human intervention and derives contextual information based on the frequency of various relative word positions.
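By way of illustration only, the selection step of active learning might be sketched in Python as follows. The function names, the confidence threshold, and the toy confidence measure (the fraction of words found in a small known-word list) are all hypothetical stand-ins; an actual information extraction system would supply its own confidence estimate.

```python
# Hypothetical list of words the toy system already "knows".
KNOWN_WORDS = {"george", "bush", "went", "to", "new", "york"}


def toy_confidence(sentence):
    """Stand-in confidence: fraction of words found in KNOWN_WORDS."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(w in KNOWN_WORDS for w in words) / len(words)


def select_for_annotation(sentences, confidence, threshold=0.8):
    """Return the sentences scored below threshold, least confident first."""
    scored = [(confidence(s), s) for s in sentences]
    uncertain = [(c, s) for c, s in scored if c < threshold]
    uncertain.sort(key=lambda pair: pair[0])
    return [s for _, s in uncertain]


if __name__ == "__main__":
    corpus = [
        "George Bush went to New York",
        "Zorblat visited Quuxville",
        "Bush went to Quuxville",
    ]
    # Only the low-confidence sentences are routed to the human annotator.
    for sentence in select_for_annotation(corpus, toy_confidence):
        print(sentence)
```

Routing only the low-confidence word strings to the annotator is what distinguishes this technique from annotating an entire corpus by hand.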