1. Field of the Invention
The present invention relates to a method and an apparatus for recognizing a topic structure of language data in the field of the analysis of natural language.
2. Technical Background
It has been shown experimentally that when a number of subjects are presented with language (or lingual) data in both written and spoken form and are asked to "determine the blocks within the data having the same content, and identify this content", they tend to report an identical structure. The following reference describes an example of such an experiment: Takeshita et al., "Wadai-kouzou-ninshiki no Kanten kara no Hyuuman-komyunikeishon no Kenkyuu (A Study of Human Communication Based on Topic Structure Recognition)", Proceedings of the Institute of Electronics, Information and Communication Engineers (IEICE) Fall 1993 Conference.
Such a structure recognized by humans is called a "topic (or skimming) structure", and recognition of the topic structure by a computer is called "topic structure recognition". Generally, the topic structure is a nested structure; thus, each topic is represented by a "topic portion" (i.e., a corresponding word or phrase) indicating the topic, a "topic level" indicating the depth of nesting, and a "topic scope" indicating the beginning and the end of the topic.
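By way of illustration only, the three elements of a topic described above can be sketched as a small record type. The names Topic, contains, and is_well_nested, the use of integer unit indices (for example, sentence numbers) for the topic scope, and the convention that level 1 is the outermost topic are all assumptions made for this sketch and are not taken from the invention.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    portion: str  # topic portion: word or phrase indicating the topic
    level: int    # topic level: nesting depth (1 = outermost, assumed convention)
    start: int    # topic scope: index of the first unit covered (assumed unit: sentence)
    end: int      # topic scope: index of the last unit covered, inclusive


def contains(outer: Topic, inner: Topic) -> bool:
    """True if the scope of `inner` lies entirely within the scope of `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def is_well_nested(topics: list[Topic]) -> bool:
    """Check that any two topics are either disjoint or properly nested.

    Properly nested means one scope contains the other and the contained
    topic has the strictly greater (deeper) level.
    """
    for i, a in enumerate(topics):
        for b in topics[i + 1:]:
            if a.end < b.start or b.end < a.start:
                continue  # scopes do not overlap
            if contains(a, b) and a.level < b.level:
                continue  # b is nested inside a
            if contains(b, a) and b.level < a.level:
                continue  # a is nested inside b
            return False  # scopes cross, or levels contradict the nesting
    return True


# Hypothetical example data: one outer topic with two inner topics.
nested = [
    Topic("Company A", 1, 0, 9),
    Topic("service A", 2, 2, 5),
    Topic("advertising", 2, 6, 9),
]
# Hypothetical counter-example: the two scopes cross each other.
crossing = [
    Topic("x", 1, 0, 5),
    Topic("y", 2, 3, 8),
]
```

In this sketch, is_well_nested(nested) holds, whereas is_well_nested(crossing) does not, because the scopes 0 to 5 and 3 to 8 overlap without one containing the other.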
In recent years, the circulation of electronic language data has increased; however, it cannot always be said that the best advantage has been taken of such language data. This tendency is especially pronounced when the information includes foreign language texts, or transcripts of spoken data such as minutes of meetings or lectures.
Until now, various models related to topics and their structure have been proposed. The following reference gives an example: B. J. Grosz and C. L. Sidner, "Attention, Intention, and the Structure of Discourse", Computational Linguistics, Vol. 12, No. 3, pp. 175-204, 1986. In that document, the expansion of topics is modeled by using "stacks" because of the nested structure of the topics. In addition, each change of the nested structure, that is, each operation of "pushing" onto or "popping" from the stack, is decided by a change in the intention of the speaker or writer. Moreover, a kind of common knowledge, called "domain knowledge", is used for determining how topics expand in the language data.
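The stack-based view of nested topics can be illustrated with the following minimal sketch. It is not the model of the cited reference: the push and pop decisions, which that model attributes to changes in the speaker's intention, are simply supplied here as a hypothetical list of events, and the function name replay, the event format, and the example data are assumptions made for illustration.

```python
def replay(events):
    """Replay push/pop events and return the closed topics.

    Each event is a tuple (op, position, portion), where op is "push" or
    "pop", position is a unit index (assumed: sentence number), and portion
    is the topic phrase (ignored for "pop").

    Each returned topic is a tuple (portion, level, start, end), where level
    is the stack depth at which the topic was open (1 = outermost).
    """
    stack = []   # currently open topics, innermost last
    closed = []  # topics whose scope has ended
    for op, position, portion in events:
        if op == "push":
            # A new topic opens, nested inside whatever is already open.
            stack.append((portion, position))
        elif op == "pop":
            # The innermost open topic ends at this position.
            name, start = stack.pop()
            closed.append((name, len(stack) + 1, start, position))
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return closed


# Hypothetical event sequence: "service A" is discussed within "Company A".
events = [
    ("push", 0, "Company A"),
    ("push", 2, "service A"),
    ("pop", 5, None),
    ("pop", 9, None),
]
topics = replay(events)
```

With these events, the inner topic is closed first at level 2 with scope 2 to 5, and the outer topic is closed afterwards at level 1 with scope 0 to 9. The sketch makes the gap in such a model concrete: the stack mechanics are straightforward, but nothing in it says when a push or a pop should occur.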
The domain knowledge includes taxonomical relationships, such as the relationship between upper and lower classes, for example, "Company A is a telecommunications company", and the relationship between an action and its objects, for example, "Company A presents service A and advertises it".
However, in the above model for topics and their structure, no method is given for recognizing the intention of the speaker; thus, an accurate topic structure cannot be obtained. In addition, no adequate method is given for determining what kind of domain knowledge is needed for the expansion of topics or how such knowledge should be used. Even if such methods were given, it would be nearly impossible to prepare the necessary domain knowledge because of its incalculable amount.