Extracting structured information from unstructured text is an essential component of many important applications, including business intelligence, social media analytics, semantic search, and regulatory compliance. The success of these applications depends closely on the quality of the extracted results. Incorrect or missing results can render the application useless.
Building high-quality rules to extract structured information from unstructured text is a difficult and time-consuming process. Exhaustive dictionaries of words and phrases are integral to any information extraction system. One of the most important parts of this process is refining the dictionaries by selectively removing entries that lead to false positives. Sophisticated extractors use larger numbers of fine-grained dictionaries to improve accuracy, but the size and number of those dictionaries also make them harder to refine for efficient and accurate extraction.
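As an illustration of what removing entries that lead to false positives can look like, the following is a minimal sketch and not a method described in this document. It assumes a small set of labeled documents is available, and the function names `extract` and `refine`, the `min_precision` threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def extract(text, dictionary):
    """Return the lowercased tokens of `text` that are dictionary entries."""
    return [tok for tok in text.lower().split() if tok in dictionary]


def refine(dictionary, labeled_docs, min_precision=0.5):
    """Drop entries whose precision on the labeled documents is too low.

    `labeled_docs` is a list of (text, gold) pairs, where `gold` is the set
    of tokens that are true mentions in that text (hypothetical format).
    Entries that never match are kept, since there is no evidence against them.
    """
    hits = Counter()     # how often each entry matched
    correct = Counter()  # how often the match was a true mention
    for text, gold in labeled_docs:
        for match in extract(text, dictionary):
            hits[match] += 1
            if match in gold:
                correct[match] += 1
    return {
        entry
        for entry in dictionary
        if hits[entry] == 0 or correct[entry] / hits[entry] >= min_precision
    }


# Hypothetical first-name dictionary and labeled sample.
names = {"alice", "mark", "april"}
docs = [
    ("alice met mark in april", {"alice", "mark"}),
    ("please mark the date in april", set()),
    ("mark called alice", {"mark", "alice"}),
]
refined = refine(names, docs)
```

In this sample, "april" matches twice and is never a true name mention, so it is removed, while "mark" is kept because two of its three matches are correct. With many fine-grained dictionaries, this per-entry bookkeeping has to be repeated for each dictionary, which is one way to see why refinement becomes costly at scale.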