Text content is found on Web pages, blogs, forums and other areas of the World Wide Web, at social networking sites, and through news feed or message distribution services. A large company may have tens or hundreds of thousands of documents including text content, as well as email archives, archives of invoices and other archives. Much of the available text information is unstructured text, and the amount of unstructured text content is continually growing.
Being able to understand unstructured text content for the purpose of market analysis, analysis of trends or product monitoring can give a competitive advantage to a company. An automatic text processing service helps extract meaningful information from unstructured text content. A named entity recognition (“NER”) service is a type of automatic text processing service that converts unstructured text content into structured content, which can be analyzed more easily. Various NER services have been used in the past, and many are integrated into currently available services, including those offered by Extractiv, DBPediaSpotlight, OpenCalais and AlchemyAPI. For text processing, a NER service (1) detects an entity (e.g., person, organization, product, service or other “thing”) within text content of a document (e.g., Web page, article, invoice, email, white paper, blog post, news feed or other object containing information), (2) identifies the location of the entity in the document, and (3) classifies the entity as having an entity type. NER services have particular significance for automatic text processing because named entities and the relations between them typically contain relevant information.
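By way of illustration only, the three operations described above (detecting an entity, identifying its location and classifying its type) can be sketched with a toy dictionary lookup. The names `EntityAnnotation`, `GAZETTEER` and `toy_ner` are hypothetical and do not correspond to any of the services named above, which use far more sophisticated techniques. The sketch only shows the shape of the structured output that a NER service might produce from unstructured text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntityAnnotation:
    """One structured result: what was found, where, and of which type."""
    text: str         # surface form of the detected entity
    start: int        # character offset where the entity begins
    end: int          # character offset just past the entity
    entity_type: str  # type label assigned by the service


# Hypothetical lookup table standing in for a real recognition model.
GAZETTEER = {
    "Acme Corp": "Organization",
    "Jane Doe": "Person",
}


def toy_ner(document: str) -> List[EntityAnnotation]:
    """Detect, locate and classify every gazetteer entry in the document."""
    annotations = []
    for name, entity_type in GAZETTEER.items():
        start = document.find(name)
        while start != -1:
            end = start + len(name)
            annotations.append(EntityAnnotation(name, start, end, entity_type))
            start = document.find(name, end)
    # Return results in document order.
    return sorted(annotations, key=lambda a: a.start)


if __name__ == "__main__":
    for annotation in toy_ner("Jane Doe joined Acme Corp."):
        print(annotation)
```

Each returned record carries the entity text, its character offsets and its type label, which is the structured form that downstream analysis can consume more easily than raw text.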
The information extracted by a NER service may be used to support analysis, decision making and strategy development. Important business decisions may be made based on the extracted information. Thus, the accuracy and reliability of the information extracted by a NER service are highly important. In many cases, however, a given NER service, taken by itself, has trouble consistently identifying named entities correctly across different types of documents. In this respect, different NER services have different strengths and weaknesses.
Combining extraction results from several NER services can improve the overall quality of the extracted information. Prior approaches to combining extraction results from diverse NER services have mostly focused on the stage of detecting entities in documents and/or the stage of identifying locations of entities within the documents. These prior approaches have not considered differences in type classification used by different NER services (e.g., the entity types recognized by the NER services, and the relationships among those supported entity types). This can be a problem if the NER services vary in their ability to detect particular types of entities. For example, NER services that perform poorly when detecting and identifying certain entity types may be given too much consideration when aggregating extraction results. It can also be a problem if NER services use different names for the same entity type, or if the NER services apply type classifications with different levels of specificity (e.g., fine-grained versus general). For this reason, prior approaches to combining extraction results from diverse NER services have limited applicability in real-world scenarios.
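The type-naming problem described above can be made concrete with a small sketch. The service names, offsets, type labels and the `COMMON_TYPE` mapping below are all hypothetical, chosen only to illustrate the issue; they are not the outputs or taxonomies of any actual NER service, and the mapping is not presented as a method from any prior approach. Three made-up services agree on where an entity is located, yet a tally that treats type labels as opaque strings finds no agreement on type.

```python
from collections import Counter

# Hypothetical outputs for one document: (start, end, type label) per service.
extractions = {
    "service_a": (16, 25, "Company"),       # fine-grained label
    "service_b": (16, 25, "Organization"),  # general label
    "service_c": (16, 25, "ORG"),           # different name, same general type
}


def location_votes(results):
    """Count services agreeing on an entity's location, ignoring type."""
    return Counter((start, end) for start, end, _ in results.values())


def typed_votes(results):
    """Count services agreeing on location AND the literal type label."""
    return Counter(results.values())


# Hypothetical hand-written mapping onto one shared, general label.
# Note that mapping "Company" to "Organization" discards specificity.
COMMON_TYPE = {
    "Company": "Organization",
    "Organization": "Organization",
    "ORG": "Organization",
}


def normalized_votes(results):
    """Count agreement after translating labels to the shared label."""
    return Counter(
        (start, end, COMMON_TYPE.get(label, label))
        for start, end, label in results.values()
    )


if __name__ == "__main__":
    print(location_votes(extractions))
    print(typed_votes(extractions))
    print(normalized_votes(extractions))
```

In this sketch, all three services agree on the location, no two agree on the literal type label, and agreement on type appears only once the labels are related to one another, at the cost of losing the finer-grained "Company" classification.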