Named entity (NE) recognition and extraction is an important task in natural language processing (NLP) and is required by many applications. It is employed to recognize and extract entities such as personal names, locations, organizations, etc. from a document. However, one of the primary limitations of existing NE extraction tools is that they identify only standard English names and locations. Typically, an enormous amount of data is required to train any NE tool to identify region-specific named entities (NEs). Thus, for languages other than English, it is difficult to identify and extract NEs based on the characteristics of the language under consideration, because no standard NE extraction tools are available. In particular, for Indian languages, the identification of NEs is difficult because these languages are morphologically rich and have a partially free word order.
Some of the current techniques attempt to address this problem by employing statistical methods, such as the maximum entropy (MaxEnt) model and the hidden Markov model (HMM), for recognizing NEs. Such techniques also employ a variety of features and contextual information for predicting various NE classes. However, such techniques are language dependent and are trained on a limited number of tokens in a limited number of Indian languages. Additionally, the accuracy of such techniques cannot be ascertained, since they merely predict the NE classes.
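By way of illustration only, the HMM-based approach mentioned above can be sketched as a Viterbi decoder over a toy model. This is a minimal sketch under stated assumptions and not the implementation of any particular prior technique: the tag set, the probability tables `START`, `TRANS` and `EMIT`, the floor value `UNSEEN`, and the function name `viterbi` are all hypothetical and invented for this example.

```python
import math

# Hypothetical tag set: O = not an entity, PER = person, LOC = location.
STATES = ["O", "PER", "LOC"]

# Toy probabilities invented for illustration; a real system would
# estimate these tables from a labeled corpus in the target language.
START = {"O": 0.6, "PER": 0.3, "LOC": 0.1}
TRANS = {
    "O":   {"O": 0.7, "PER": 0.2, "LOC": 0.1},
    "PER": {"O": 0.6, "PER": 0.3, "LOC": 0.1},
    "LOC": {"O": 0.7, "PER": 0.1, "LOC": 0.2},
}
EMIT = {
    "O":   {"lives": 0.4, "in": 0.4},
    "PER": {"ravi": 0.8},
    "LOC": {"pune": 0.8},
}
UNSEEN = 1e-6  # assumed floor for words a state was never seen to emit


def viterbi(tokens):
    """Return the most probable tag sequence for tokens under the toy HMM."""
    if not tokens:
        return []

    def emit(state, word):
        return math.log(EMIT[state].get(word.lower(), UNSEEN))

    # scores[s] is the best log-probability of any path ending in state s.
    scores = {s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}
    backpointers = []
    for word in tokens[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = scores[prev] + math.log(TRANS[prev][s]) + emit(s, word)
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


print(viterbi(["Ravi", "lives", "in", "Pune"]))
```

Because every entry in the tables must be learned from labeled text, a model of this kind recognizes only the vocabulary and word order of the language it was trained on, which is consistent with the language dependence noted above.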
Alternatively, some of the current techniques attempt to address this problem by employing NLP, contextual data from the surrounding text, semantic analysis, and so forth. For example, the entity recognition and disambiguation system (ERDS) takes a text segment as input and automatically determines which entities the text refers to, using natural language processing and analysis of information gleaned from contextual data in the surrounding text. However, such techniques are not very efficient in recognizing large volumes of named entities in Indian languages. Thus, there is no technique or tool for efficiently generating the named entities for Indian languages.