With the proliferation of information generated daily and accessible to users over the Web, the need for intelligent electronic assistants to aid in locating and/or discovering useful or desired information amongst the morass of data is paramount. Using natural language processing to search text and correctly recognize people, places, or things is, however, fraught with difficulties.
A named entity, such as a person, place, or object, may be a member of a class or type. For example, a person called “John Wayne” may be an example of the class “person”, and a place called “Mexico City” may be an example of the class “city”. Automated systems for recognizing named entities are able to extract named entity mentions from digital documents and classify those mentions into one or more pre-specified categories such as person, city, automobile, and others. Named entity results may then be used for many downstream purposes, such as improving information retrieval systems, knowledge extraction systems, and many others.
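The extract-and-classify step described above can be illustrated with a minimal sketch. It assumes a tiny hand-built lookup table (the `GAZETTEER` dictionary and the `find_entities` function are hypothetical names introduced here for illustration only) and uses exact, case-insensitive matching; it is not a description of any particular recognition system.

```python
import re

# Hypothetical toy gazetteer: surface form (lowercase) -> class label.
GAZETTEER = {
    "john wayne": "person",
    "mexico city": "city",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (mention, class, start offset) for each gazetteer name found in text."""
    results = []
    for name, label in gazetteer.items():
        # Match the name as a whole phrase, ignoring case.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            results.append((match.group(0), label, match.start()))
    # Report mentions in the order they appear in the document.
    return sorted(results, key=lambda item: item[2])

mentions = find_entities("John Wayne filmed several scenes near Mexico City.")
for mention, label, offset in mentions:
    print(mention, label, offset)
```

A lookup of this kind only finds names that appear verbatim in the table and assigns each name a single class, so it cannot by itself decide which of several same-named entities a mention refers to.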
First, natural language is ambiguous. Almost every English word or phrase can be a place name somewhere in the world or a name of a person (i.e., a “person name”). Furthermore, many entities share the same name. For example, there are more than 20 cities named “Paris” in the United States. The name “Will Smith” could refer to the Hollywood movie actor and musician, the professional football player in the NFL, or many other people. Recognizing non-celebrity names has become more important with the exponential growth of Web content, especially user-created content such as blogs, Wikipedia, and profiles on social network sites like MySpace and Facebook.
Second, an entity could be mentioned or referred to in many different ways (e.g., pronouns, synonyms, aliases, acronyms, spelling variations, nicknames, etc.) in a document. Third, various knowledge sources about entities (e.g., dictionaries, encyclopedias, Wikipedia, gazetteers, etc.) exist, and these knowledge bases are extensive (e.g., millions of person names and place names). The sheer quantity of data is prohibitive for many natural language processing techniques.
Accordingly, there is a need for an automated natural language processing technique that can handle unstructured data.