Due to an increased knowledge base, the number of documents across different subject matter areas continues to grow. For example, with the advent of the Internet and the World Wide Web (WWW), the number of documents on the different web sites on the Internet continues to grow as the number of networks and servers connected thereto continues to increase on a global scale. Accordingly, the fields of information retrieval, document summarization, information filtering and/or routing, as well as topic tracking and/or detection, continue to grow in order to track and service the vast amount of information.
In the field of information extraction, work has been done to automatically learn patterns from a training corpus in order to extract entity names and their relations from a given document. A training corpus is defined to include writings, documents, or works for a given subject matter. Moreover, an entity name is defined to include, but is not limited to, proper names. Examples of entity names include a person's name, an organization's name and a product's name. Currently, tools for the extraction of entity names rely on man-made rules and keyword sets to identify entity names. Disadvantageously, building such rules is often complex, error-prone and time-consuming, and usually requires a thorough understanding and detailed knowledge of the system internals of a given language.
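By way of illustration only, the rule-and-keyword approach described above may be sketched as follows. The keyword sets, the two rules, and the names `PERSON_TITLES`, `ORG_SUFFIXES` and `extract_entities` are hypothetical and are not taken from any particular existing tool; the sketch merely assumes that a title keyword precedes a person's name and that a suffix keyword terminates an organization's name.

```python
import re

# Hypothetical keyword sets (assumptions for illustration); a practical
# system would require far larger, hand-maintained lists.
PERSON_TITLES = ("Mr.", "Ms.", "Dr.")
ORG_SUFFIXES = ("Inc.", "Corp.", "Ltd.")

# One or more capitalized words, e.g. "Jane Doe".
CAP_WORDS = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"


def _alternation(keywords):
    """Build a regex alternation that matches any keyword literally."""
    return "|".join(re.escape(k) for k in keywords)


# Rule 1: a title keyword followed by capitalized words names a person.
PERSON_RULE = re.compile(
    r"(?:%s)\s(%s)" % (_alternation(PERSON_TITLES), CAP_WORDS)
)

# Rule 2: capitalized words followed by a suffix keyword name an organization.
ORG_RULE = re.compile(
    r"(%s\s(?:%s))" % (CAP_WORDS, _alternation(ORG_SUFFIXES))
)


def extract_entities(text):
    """Return (entity name, entity type) pairs found by the two rules."""
    entities = [(m.group(1), "PERSON") for m in PERSON_RULE.finditer(text)]
    entities += [(m.group(1), "ORGANIZATION") for m in ORG_RULE.finditer(text)]
    return entities
```

Under these assumptions, calling `extract_entities` on the sentence "Dr. Jane Doe joined Acme Widgets Inc. last year." yields the person name "Jane Doe" and the organization name "Acme Widgets Inc.", whereas the same sentence without the keywords "Dr." and "Inc." yields nothing. Each such gap must be closed by hand with a further rule or keyword, which suggests why building rules of this kind tends to be error-prone and time-consuming.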
Another technique currently employed in the extraction of entity names is a statistical method. However, the training of such a system requires vast amounts of human-annotated data in order to provide an accurate statistical analysis. Moreover, this statistical method for the extraction of entity names is limited in that only local context information can be employed during training.