Coreference is a linguistic phenomenon in which a set of elements of a sentence (referred to as constituent elements) represents the same entity within the sentence. In the set, the constituent element that is positioned rearmost is referred to as an anaphor, and the other constituent elements are referred to as antecedents. A program (module) that performs the process of finding such sets (referred to as coreference analysis) is called a coreference analyzer. Depending on the type of constituent element, coreference phenomena include coreference of noun phrases, coreference of predicates, coreference of sentences, coreference across different types of constituent elements, and the like. Hereinafter, to simplify the description, only nouns (noun phrases) are assumed to be handled as constituent elements to be found as coreference targets. A coreference analyzer that is based on noun phrases can readily be extended to handle other types of constituent elements as well.
Generally, a coreference analyzer performs a learning process and a determination process. In the learning process, the coreference analyzer acquires criteria used for assigning a tag group, which represents a coreference set, by referring to data (referred to as training data) that represents a sentence to which tags representing coreference sets have been manually assigned in advance. In the determination process, the coreference analyzer takes as input an ordinary sentence (text) to which tags representing coreference sets have not been assigned, together with plural noun phrases within the text for which a user desires to know whether or not there is a coreference relation, and determines whether or not there is a coreference relation by applying the criteria acquired in the learning process.
The training data essentially includes tags representing noun phrases as constituent elements forming a coreference set in the sentence and tags representing whether or not the noun phrases represent the same entity. Accordingly, a correspondence relation (link) between a noun phrase and another noun phrase can be specified. Such training data represents the coreference phenomenon straightforwardly as tags.
One example of the representation method of the training data is shown below. A range enclosed by “< >” is a noun phrase that is designated as an element of a coreference set. Here, “< >” is referred to as a coreference element tag. In addition, “[ ]” is referred to as a link tag, and a number enclosed by “[ ]” is referred to as a link ID. Among the noun phrases marked by coreference element tags, a set of noun phrases having the same link ID is analyzed as being in a coreference relation.
(9900) “<Bob>[1] appears. <He>[1] is a student.”
(9901) “Things such as <seafood type>[2], sensibility for grasping <charming sights>[2] is felt.”
(9902) “I interviewed with a <Monaco's diplomatic agent>[3]. <He>[3] seemed busy.”
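The notation above can be read mechanically. The following is a minimal sketch, not part of the described analyzer, that assumes the tags appear exactly in the form “<phrase>[ID]” and groups the tagged noun phrases by link ID. The function names parse_coreference_sets and split_antecedents_and_anaphor are hypothetical names introduced only for this illustration.

```python
import re
from collections import defaultdict

# Matches a coreference element tag followed by a link tag, e.g. "<Bob>[1]".
# Assumption: tags are not nested and always have this exact shape.
TAG_PATTERN = re.compile(r"<([^<>]+)>\[(\d+)\]")


def parse_coreference_sets(tagged_text):
    """Group tagged noun phrases by link ID, preserving textual order."""
    sets = defaultdict(list)
    for match in TAG_PATTERN.finditer(tagged_text):
        phrase, link_id = match.group(1), int(match.group(2))
        sets[link_id].append(phrase)
    return dict(sets)


def split_antecedents_and_anaphor(phrases):
    """The rearmost phrase is the anaphor; all earlier ones are antecedents."""
    return phrases[:-1], phrases[-1]


example = "<Bob>[1] appears. <He>[1] is a student."
coref_sets = parse_coreference_sets(example)
antecedents, anaphor = split_antecedents_and_anaphor(coref_sets[1])
print(coref_sets, antecedents, anaphor)
```

For example (9900), the sketch yields one coreference set with link ID 1, in which “Bob” is the antecedent and “He” is the anaphor.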
The coreference analyzer performs a learning process by using such training data and acquires criteria under which the same tags as those in the training data can be assigned to as much of the text of the training data as possible. In addition, in the determination process, the coreference analyzer assigns tags by applying the criteria acquired through the learning process to an arbitrary text to which tags have not been assigned. As a practical example of the tags, there is a method using Extensible Markup Language (XML).
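As a hypothetical illustration of XML-based tagging, example (9900) might be encoded as shown in the sketch below. The element name “np” and the attribute name “link” are assumptions made for this illustration and are not prescribed by the description; the sketch merely reads such markup with Python's standard xml.etree.ElementTree module and collects the noun phrases that share a link ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of example (9900); "s", "np", and "link" are
# illustrative names, not a format defined in the description.
xml_text = (
    '<s><np link="1">Bob</np> appears. '
    '<np link="1">He</np> is a student.</s>'
)

root = ET.fromstring(xml_text)

# Collect the text of each noun phrase element under its link ID.
links = {}
for np_element in root.iter("np"):
    links.setdefault(np_element.get("link"), []).append(np_element.text)

print(links)
```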
Incidentally, the coreference element tag represented in the training data designates the range of a noun phrase as a constituent element forming a coreference set, that is, a position at the front side of the range (referred to as a front boundary) and a position at the rear side of the range (referred to as a rear boundary). Such a position is designated, for example, in units of morphemes or characters. In the examples of the training data (9900) to (9902) described above, ranges including one morpheme, two morphemes, and four morphemes, respectively, are designated as noun phrases (as antecedents) by front boundaries and rear boundaries. In other words, the coreference element tag represents a result of determining a functional cluster (referred to as a chunk) of a morpheme sequence, i.e., a result of determining the range of the morpheme sequence. Generally, a task for determining a chunk of a morpheme sequence as mentioned above is called a chunking task. A task for determining a correspondence relation between noun phrases forming a coreference set is referred to as a coreference task in a narrow sense. When a learning process that is appropriate for such training data is performed, essentially, the coreference task and the chunking task are solved simultaneously (called simultaneous learning).
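The relation between the two tasks can be made concrete with a small sketch. It assumes a hypothetical word-level tokenization of example (9900) in place of true morpheme segmentation, represents each tagged noun phrase by a front boundary and an exclusive rear boundary, and then derives the two views separately. The B/I/O chunk labels are one common chunking encoding chosen here as an assumption, and the names Mention, chunk_labels, and coreference_links are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    front: int    # index of the first token in the range (front boundary)
    rear: int     # index one past the last token (rear boundary, exclusive)
    link_id: int  # link ID shared by mentions in the same coreference set


# Assumed word-level tokenization of example (9900).
tokens = ["Bob", "appears", ".", "He", "is", "a", "student", "."]
mentions = [Mention(0, 1, 1), Mention(3, 4, 1)]


def chunk_labels(tokens, mentions):
    """Chunking view: mark each token as Begin, Inside, or Outside a chunk."""
    labels = ["O"] * len(tokens)
    for m in mentions:
        labels[m.front] = "B"
        for i in range(m.front + 1, m.rear):
            labels[i] = "I"
    return labels


def coreference_links(mentions):
    """Narrow-sense coreference view: index pairs of mentions sharing a link ID."""
    return [
        (i, j)
        for i in range(len(mentions))
        for j in range(i + 1, len(mentions))
        if mentions[i].link_id == mentions[j].link_id
    ]


labels = chunk_labels(tokens, mentions)
pairs = coreference_links(mentions)
print(labels, pairs)
```

Because a single tagged text supplies both the boundaries and the link IDs, a learner trained on it has to account for both views at once, which corresponds to the simultaneous learning described above.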