The exemplary embodiment relates to named entity recognition and finds particular application in a system and method which make use of document-level entity name and type tags.
Named entity recognition (NER) generally entails identifying names (one or more words) in text and assigning them a type (e.g., person, location, organization). State-of-the-art supervised approaches use statistical models that incorporate a name's form, its linguistic context, and its compatibility with known names. These models are typically trained using supervised machine learning and rely on large collections of text where each name has been manually annotated, specifying the word span and named entity type. Such annotation is useful for training models, but manually providing a label for every occurrence of a name in a document is time consuming and expensive.
Gazetteers are large lists of names of a particular type mined from external resources such as Wikipedia, mapping data, or censuses. A common use is to generate a binary feature for the NER model that indicates whether a word is part of a known name. For example, “Bob” is more likely to be a name than “went,” because it appears in a large list of person names. The names in the gazetteer do not need to be categorized with the same type scheme as is applied in the NER task (e.g., the type may simply be large_list_of_people). The goal of gazetteers is to improve recall by including known names that are not necessarily seen in the annotated training data used to train the NER model.
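By way of illustration only, the binary gazetteer feature described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any of the cited systems: the toy gazetteer contents, the large_list_of_places name, the function name gazetteer_features, and the lowercase matching are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy gazetteers (assumption); real lists would be mined from
# external resources and would be far larger. Entries are lowercased token tuples.
GAZETTEERS: Dict[str, Set[Tuple[str, ...]]] = {
    "large_list_of_people": {("bob",), ("mary", "smith")},
    "large_list_of_places": {("new", "york")},
}


def gazetteer_features(
    tokens: List[str],
    gazetteers: Dict[str, Set[Tuple[str, ...]]],
) -> List[Dict[str, bool]]:
    """Return, for each token, one binary feature per gazetteer.

    A feature is True when the token is part of any (possibly multi-word)
    entry of that gazetteer found in the token sequence.
    """
    lowered = [t.lower() for t in tokens]
    features = [{name: False for name in gazetteers} for _ in tokens]
    for name, entries in gazetteers.items():
        # Longest entry bounds the window size that needs to be checked.
        max_len = max((len(entry) for entry in entries), default=0)
        for start in range(len(lowered)):
            for length in range(1, max_len + 1):
                end = start + length
                if end > len(lowered):
                    break
                if tuple(lowered[start:end]) in entries:
                    # Mark every token covered by the matched name.
                    for i in range(start, end):
                        features[i][name] = True
    return features


if __name__ == "__main__":
    sentence = ["Bob", "went", "to", "New", "York"]
    for token, feats in zip(sentence, gazetteer_features(sentence, GAZETTEERS)):
        print(token, feats)
```

In this sketch, the feature names are the gazetteer names themselves, which reflects the point that the gazetteer types need not match the type scheme of the NER task; the resulting per-token features would be supplied to the statistical model alongside its other features.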
Although statistical NER systems developed for English newswire services perform well on standard datasets, performance declines once the data varies in language and domain.
There has been considerable work on incorporating external knowledge into NER models. For an overview, see David Nadeau, et al., “A survey of named entity recognition and classification,” Linguisticae Investigationes, 30(1):3-26, 2007. For example, one method is to use a structured encoding for each gazetteer entry. See Jun'ichi Kazama et al., “Exploiting Wikipedia as external knowledge for named entity recognition,” Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 698-707, 2007 (hereinafter, Kazama 2007). A set of features is used for the encoding. The features may be used for modeling labels in a CRF model, as described, for example, in Lev Ratinov, et al., “Design challenges and misconceptions in named entity recognition,” Proc. 13th Conf. on Computational Natural Language Learning (CoNLL-2009), pp. 147-155, 2009 (hereinafter, Ratinov 2009). Linking data to a knowledge base (KB) has also been used to assist in NER, as described in Angus Roberts, et al., “Combining terminology resources and statistical methods for entity recognition: an evaluation,” Proc. 6th Intl Conf. on Language Resources and Evaluation (LREC'08), pp. 2974-2980, 2008.
Linked data has also been used as a data acquisition strategy for NER, specifically creating training data from Wikipedia (Kazama 2007; Alexander E. Richman, et al., “Mining wiki resources for multilingual named entity recognition,” Proc. ACL-08: HLT, pp. 1-9, 2008; Joel Nothman, et al., “Learning multilingual named entity recognition from Wikipedia,” Artificial Intelligence, 194(0):151-175, 2013) or gene name articles (Andreas Vlachos, et al., “Bootstrapping and evaluating named entity recognition in the biomedical domain,” Proc. HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology, pp. 138-145, 2006; Alex Morgan, et al., “Gene name extraction using flybase resources,” Proc. ACL 2003 Workshop on Natural Language Processing in Biomedicine, pp. 1-8, 2003). The goal of these methods is to generate large quantities of training data for standard NER models.
Representing external knowledge in vector space embeddings (e.g., Brown clusters, neural language models, or skip-gram models) has also been shown to be effective for NER (Ratinov 2009; Joseph Turian, et al., “Word representations: A simple and general method for semi-supervised learning,” Proc. 48th Annual Meeting of the ACL, pp. 384-394, 2010; Alexandre Passos, et al., “Lexicon infused phrase embeddings for named entity resolution,” Proc. 18th Conf. on Computational Natural Language Learning, pp. 78-86, 2014).
However, such methods generally entail building a sizable NER model and do not take into account the document being processed.
There remains a need for a system and method for improving the performance of an NER model without requiring the collection and use of large amounts of additional training data for training the model.