As computers and networks have gained popularity, web-based computer documents (“documents”) have become a vast source of factual information. Users may look to these documents for answers to factual questions, such as “what is the capital of Poland” or “what is the birth date of George Washington.” The factual information in these documents may be extracted and stored in a fact database.
It is beneficial to recognize entity names in these documents. These entity names can be used to organize the factual information extracted from the documents. Because factual information usually relates to one or more entities, it can be organized by association with the names of those entities. Entity names are also helpful in analyzing users' factual questions and identifying the factual information necessary to answer them.
One conventional approach to recognizing entity names is to use human editors to review the documents. This approach is insufficient because the sheer volume of documents, and the rapid growth in the number of available documents, make it impractical for human editors to perform the task on any meaningful scale.
Another conventional approach to recognizing entity names is to extract entity names from a reputable source, such as the Internet Movie Database. This approach is both under-inclusive and over-inclusive. It is under-inclusive because the entity names are extracted from a single source, which tends to cover only certain types of entities (e.g., the Internet Movie Database contains information only about movies and people in the entertainment industry), so the extracted entity names cover only a few types of entities. It is over-inclusive because some of the extracted names may not be entity names at all (e.g., phrases extracted from entries for adjectives in the Merriam-Webster Online).
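For these reasons, what is needed is a method and system for recognizing entity names that scales to a large and growing volume of documents and is neither under-inclusive nor over-inclusive.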