Organizations are generating and collecting an ever-increasing amount of data, much of it in the form of text. Text is often generated in the course of business, such as a doctor's notes, an engineer's equipment inspection report, or a product manual, only to be siloed away. Increasingly, organizations are interested in understanding the content of this text and extracting useful information from it. However, searching through large volumes of text is cumbersome, expensive, and error-prone.
One technique, named entity recognition (NER), seeks to locate and classify entities into pre-defined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc. NER is typically domain specific—implementations directed to one domain, e.g., news, do not generalize to others, e.g., Twitter®. For example, different NER models may be used to identify part numbers in product manuals, baseball players in a collection of sports articles, airplane models in a collection of aircraft worthiness documents, etc.
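The locate-and-classify behavior described above can be illustrated with a minimal, rule-based sketch. The entity categories and regular expressions below (a part-number pattern and a monetary-value pattern) are hypothetical examples, not part of any particular implementation; a production system would use far richer rules or a trained model.

```python
import re

# Hypothetical domain-specific patterns standing in for hand-crafted NER rules.
PATTERNS = {
    "PART_NUMBER": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract_entities(text):
    """Locate entity mentions and classify them into pre-defined categories."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Record the mention, its category, and its character span.
            entities.append((match.group(), label, match.start(), match.end()))
    return sorted(entities, key=lambda e: e[2])

print(extract_entities("Replace part AB-1234; the repair cost $19.99."))
```

The domain specificity noted above falls out of this structure directly: the `PART_NUMBER` rule is useful for product manuals but says nothing about, say, baseball players, so each domain needs its own patterns or model.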
Existing techniques for NER include training a machine learning model and hand-crafting rules. Hand-crafting rules can be a laborious process, while training a machine learning model often requires large amounts of hand-labeled data. Neither technique generalizes well. Active learning decreases the amount of hand labeling by iteratively asking a user to label examples, classifying data based on those labels, and identifying more examples based on the classification. However, selecting examples so as to gain the greatest insight with the fewest labeling actions is an ongoing technological problem. It is with respect to these considerations and others that the invention has been made.
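The iterative label-classify-select loop described above can be sketched as pool-based active learning with uncertainty sampling. Everything here is an illustrative assumption: the toy count-based model, the `oracle` dictionary standing in for a human annotator, and the choice of prediction closest to 0.5 as "most uncertain" are one common selection strategy, not the only one.

```python
def train(labeled):
    """Toy model: per-token probability of being an entity, from label counts."""
    counts = {}
    for token, is_entity in labeled:
        pos, tot = counts.get(token, (0, 0))
        counts[token] = (pos + is_entity, tot + 1)
    return counts

def predict(model, token):
    pos, tot = model.get(token, (0, 0))
    return pos / tot if tot else 0.5  # unseen tokens are maximally uncertain

def most_uncertain(model, pool, k=1):
    """Select the pool examples whose predictions are closest to 0.5."""
    return sorted(pool, key=lambda t: abs(predict(model, t) - 0.5))[:k]

# Active-learning loop: repeatedly ask the "user" (here a lookup table
# standing in for the human annotator) to label the most uncertain example,
# retrain on the enlarged labeled set, and select again.
oracle = {"AB-1234": 1, "the": 0, "cost": 0, "XY-9999": 1}
labeled = [("the", 0)]
pool = ["AB-1234", "cost", "XY-9999"]
for _ in range(3):
    model = train(labeled)
    query = most_uncertain(model, pool)[0]
    labeled.append((query, oracle[query]))
    pool.remove(query)
```

The open problem identified above lives entirely in `most_uncertain`: choosing a selection criterion that yields the greatest insight per labeling action is what distinguishes one active-learning approach from another.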