Concept identification, also referred to as mention detection, is a process that identifies concepts contained in plaintext. As used herein, a concept refers to an item or entry that has a definite meaning in a dictionary, such as a web-based encyclopedia. Examples of a concept include, but are not limited to, a person such as “Michael Jordan,” an organization such as “International Business Machines,” an activity such as “Presidential Election 2000,” and the like. The concepts identified from the plaintext can be linked to respective articles or webpages that contain their correct meanings. For instance, if the concept “Michael Jordan” is identified in the plaintext, the phrase can be linked via a hyperlink to a webpage that introduces the former basketball player Michael Jordan.
Disambiguation is an important stage of concept identification. It will be appreciated that a concept may be represented by different surface forms. As used herein, a surface form is a sequence of words that represents a concept. For instance, surface forms for the concept “Michael Jordan” may include “Jordan,” “Michael,” “Air Jordan,” “MJ,” and the like. Conversely, different concepts may share the same surface form. That is, a single surface form might be used to represent different concepts. For example, the surface form “MJ” may represent “Michael Jordan” or “Michael Jackson.” Disambiguation determines the exact concept to which a detected surface form refers in the context of the given plaintext.
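By way of illustration only, the many-to-many relationship between surface forms and concepts, and the role of context in disambiguation, might be sketched as follows. The dictionaries `SURFACE_FORMS` and `CONCEPT_KEYWORDS`, the function `disambiguate`, and the keyword-overlap scoring are all hypothetical and assumed for this example; they are not the method described herein, and a practical system would likely rely on a much larger dictionary and a more robust scoring scheme.

```python
import re

# Hypothetical toy dictionary: each surface form maps to its candidate concepts.
SURFACE_FORMS = {
    "mj": ["Michael Jordan", "Michael Jackson"],
    "air jordan": ["Michael Jordan"],
    "jordan": ["Michael Jordan"],
}

# Hypothetical context keywords associated with each concept.
CONCEPT_KEYWORDS = {
    "Michael Jordan": {"basketball", "bulls", "nba", "dunk"},
    "Michael Jackson": {"singer", "album", "pop", "thriller"},
}


def disambiguate(surface_form, context):
    """Return the candidate concept whose keywords overlap most with the context.

    Returns None when the surface form is not in the dictionary. On a tie,
    the first candidate listed for the surface form is returned.
    """
    candidates = SURFACE_FORMS.get(surface_form.lower(), [])
    if not candidates:
        return None
    # Lowercase the context and split it into alphanumeric tokens.
    tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    # Score each candidate by the number of its keywords found in the context.
    return max(
        candidates,
        key=lambda concept: len(CONCEPT_KEYWORDS.get(concept, set()) & tokens),
    )
```

Under these assumptions, calling `disambiguate("MJ", "MJ released the album Thriller in 1982")` selects “Michael Jackson,” whereas `disambiguate("MJ", "MJ led the Bulls to six NBA titles")` selects “Michael Jordan,” because the surrounding words overlap with a different keyword set in each case.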