The present disclosure relates generally to information extraction, and more specifically, to approximate named-entity extraction.
Named-entity extraction seeks to extract entities, such as names of people, addresses, drug names, and the like, from unstructured text. There are three major categories of named-entity extraction methods: statistical-model based methods, linguistic-grammar based methods, and dictionary-based methods.
Statistical-model based methods automatically learn patterns about entities and text from labeled training data, and generalize the learned patterns to new text to extract entities. The drawback of such methods is that a large amount of manually labeled data is required to construct a good model. Generating the labeled data is typically very time-intensive, which hinders the applicability of these types of methods in many real named-entity extraction systems.
Linguistic-grammar based methods are known for high extraction precision, but at the cost of lower recall. The major drawback is that these methods require experienced computational linguists and domain experts to invest months of work to compose the extraction rules.
Dictionary-based methods perform look-up based matching. Given a dictionary of named entities (for example, the names of all employees of a company), a dictionary-based method extracts from the text every string that matches a dictionary entry and treats it as an entity name. In contrast to statistical-model based methods and linguistic-grammar based methods, a dictionary-based method does not require any manually labeled data or any domain expertise. A dictionary-based method does not limit the possible types of entities it handles and can be applied to multiple domains. Dictionary-based methods can perform exact string matching efficiently but suffer from low recall, since the surface form of a name in unstructured text can vary substantially from its dictionary version. Alternatively, approximate string matching can be performed to identify as entity names all of the strings whose similarity score to a dictionary entry is above a threshold. Approximate string matching can achieve both high precision and high recall but typically suffers from a very high computational cost.
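The contrast between exact and approximate dictionary matching can be illustrated with a minimal sketch. It is not the method of the present disclosure; it is a naive baseline written under several assumptions: the text is already split into whitespace-delimited tokens, candidate strings are spans of at most `max_len` tokens, and similarity is measured with the ratio computed by Python's standard `difflib.SequenceMatcher`. The function names, the example dictionary, and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher


def exact_matches(tokens, dictionary, max_len=3):
    """Return (start, end, entry) for token spans equal to a dictionary entry.

    Each candidate span costs one hash look-up, so this is cheap, but a
    misspelled or abbreviated name is never found (low recall).
    """
    entries = {entry.lower(): entry for entry in dictionary}
    found = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            span = " ".join(tokens[i:j]).lower()
            if span in entries:
                found.append((i, j, entries[span]))
    return found


def approximate_matches(tokens, dictionary, threshold=0.8, max_len=3):
    """Return (start, end, entry, score) for spans similar to a dictionary entry.

    Assumption: similarity is difflib's SequenceMatcher ratio in [0, 1].
    Every span is compared against every entry, which is the source of the
    high computational cost of naive approximate matching.
    """
    found = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            span = " ".join(tokens[i:j]).lower()
            for entry in dictionary:
                score = SequenceMatcher(None, span, entry.lower()).ratio()
                if score >= threshold:
                    found.append((i, j, entry, score))
    return found


if __name__ == "__main__":
    # Hypothetical dictionary and text, for illustration only.
    names = ["John Smith", "Mary Jones"]
    text = "Please forward this to Jon Smyth and Mary Jones".split()
    print(exact_matches(text, names))
    print(approximate_matches(text, names))
```

In this sketch the exact matcher misses the misspelled "Jon Smyth", while the approximate matcher can recover it at the price of comparing every candidate span with every dictionary entry. The approximate matcher may also report overlapping spans around the same name, so a practical system would need an additional step to select among them.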