The present disclosure relates generally to information extraction, and more specifically, to a semi-supervised data integration model for named entity classification.
Named entity recognition and classification are important aspects of information extraction, used to identify information units such as the names of people, organizations, and locations, as well as numeric expressions for time, money, and numbers, in unstructured text. Typically, information units or numeric expressions are first extracted as named entities from the unstructured text (i.e., named entity recognition). A function is then learned that maps each entity to its type, which is selected from predefined categories such as People, Organizations, Locations, Products, Genes, Compounds, and Technologies (i.e., named entity classification).
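By way of illustration only, the two stages described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a described embodiment: the names `recognize`, `classify`, and `TYPE_LEXICON` are hypothetical, the capitalization heuristic stands in for a real recognizer, and the fixed lexicon stands in for a learned entity-to-type function.

```python
import re

# Hypothetical lexicon standing in for a learned entity-to-type function.
TYPE_LEXICON = {
    "Marie Curie": "People",
    "Sorbonne": "Organizations",
    "Paris": "Locations",
}


def recognize(text):
    """Stage 1 (recognition): extract candidate entities.

    Assumption: an entity is a run of one or more capitalized words.
    This naive heuristic also picks up ordinary sentence-initial words.
    """
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def classify(entity):
    """Stage 2 (classification): map an entity string to a predefined type."""
    return TYPE_LEXICON.get(entity, "Unknown")


text = "Marie Curie taught at the Sorbonne in Paris."
entities = recognize(text)
typed_entities = [(entity, classify(entity)) for entity in entities]
```

In this sketch, classification is a lookup, so any entity outside the lexicon is typed as "Unknown"; a learned function would instead generalize from training examples.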
Learning recognition and classification rules is important to named entity recognition, because performing classification using handcrafted rules does not scale as the corpus to be classified grows. There are several kinds of learning methods, which differ in the availability of training examples they assume. Supervised learning methods infer rules from positive and negative examples of named entities over a large collection of annotated documents for each entity type. Supervised learning requires a large annotated corpus and is thus impractical where manually generated labels are unavailable or difficult to generate. Unsupervised learning methods apply clustering techniques to automatically gather entities from clusters. Unsupervised learning is affected by the randomness inherent in clustering and is sensitive to outliers in the data. More commonly, only a small set of training seeds exists for starting the learning process. A semi-supervised learning system accumulates new rules from newly classified positive and negative examples at a rapidly accelerating rate and applies these rules to unlabeled data iteratively. Semi-supervised learning typically deteriorates rapidly when noise is introduced into the data, causing problems related to the selection of unlabeled data for each round of re-training.
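The iterative, seed-based semi-supervised process described above may be illustrated by a simple self-training loop. The following is a minimal sketch under stated assumptions and is not the model of the present disclosure: the function `self_train`, the parameters `min_conf` and `max_rounds`, and the toy data are all hypothetical, and each "rule" is assumed to be a single context feature that votes for the entity types it has co-occurred with.

```python
from collections import Counter, defaultdict


def self_train(mentions, seeds, min_conf=0.6, max_rounds=5):
    """Grow a small seed labeling by alternating rule learning and labeling.

    mentions: list of (entity, context_feature) pairs from unlabeled text.
    seeds:    dict mapping a few entities to their known types.
    """
    labels = dict(seeds)
    for _ in range(max_rounds):
        # 1. Learn rules: count how often each context occurs with each type.
        rule_votes = defaultdict(Counter)
        for entity, context in mentions:
            if entity in labels:
                rule_votes[context][labels[entity]] += 1

        # 2. Apply the rules to entities that are still unlabeled.
        entity_votes = defaultdict(Counter)
        for entity, context in mentions:
            if entity not in labels and context in rule_votes:
                entity_votes[entity].update(rule_votes[context])

        # 3. Keep only confident predictions for the next round.
        new_labels = {}
        for entity, votes in entity_votes.items():
            best_type, best_count = votes.most_common(1)[0]
            if best_count / sum(votes.values()) >= min_conf:
                new_labels[entity] = best_type

        if not new_labels:  # nothing new was learned, so stop early
            break
        labels.update(new_labels)
    return labels


# Hypothetical toy data: two seeds and eight entity mentions.
mentions = [
    ("Alice", "said"), ("Bob", "said"),
    ("Bob", "met_with"), ("Carol", "met_with"),
    ("Acme", "acquired"), ("Globex", "acquired"),
    ("Globex", "shares_of"), ("Initech", "shares_of"),
]
seeds = {"Alice": "Person", "Acme": "Organization"}
labels = self_train(mentions, seeds)
```

In this toy run, "Carol" and "Initech" are labeled only in the second round, through rules learned from entities labeled in the first round. The same mechanism explains the sensitivity to noise noted above: a wrong label accepted in one round produces wrong rules in the next, which is why the confidence threshold used to select unlabeled data matters.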