1. Field of the Invention
The present invention relates to a document information processing apparatus and, more particularly, to a technology for adding, to each word or copula in a document that has a meaning, information indicating the meaning or contents of that word or copula.
2. Description of Related Art
Conventionally, a technology is known for automatically classifying the individual words used in text data by statistically processing them. This technology gives a token to each sequence of word classes whose probability of appearing in the text data is equal to or higher than a predetermined value; divides each set in which words and tokens coexist, among the sets contained in the sequence of words and tokens of the text data, so that the probability of generating the sequences of words and tokens of the text data is maximized; replaces each token with a copula that exists in the text data; and automatically classifies words and copulas together (see Japanese patent application publication (TOKKAIHEI) No. 10-97286, for example).
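The first step described above, giving a token to each sufficiently probable sequence of word classes, can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the method of the cited publication: probability is approximated here by the relative frequency of fixed-length sequences, the later division and replacement steps are not modeled, and the function name `assign_tokens`, the token names, and the sample class labels are hypothetical.

```python
from collections import Counter

def assign_tokens(word_classes, length, threshold):
    """Map each frequent word-class sequence to a token name.

    Assumption: "probability of appearing" is estimated as the relative
    frequency of each sequence of `length` consecutive word classes.
    """
    sequences = [
        tuple(word_classes[i:i + length])
        for i in range(len(word_classes) - length + 1)
    ]
    if not sequences:
        return {}
    counts = Counter(sequences)
    total = len(sequences)
    # Keep only sequences at or above the predetermined value.
    frequent = sorted(
        seq for seq, count in counts.items() if count / total >= threshold
    )
    return {seq: f"TOKEN_{k}" for k, seq in enumerate(frequent)}

# Hypothetical word-class labels for a short text.
sample_classes = ["DET", "NOUN", "VERB", "DET", "NOUN"]
tokens = assign_tokens(sample_classes, length=2, threshold=0.5)
```

In this sample, only the sequence ("DET", "NOUN") reaches the threshold, so it alone receives a token.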
Another known technology is directed to a system that summarizes a huge volume of document information, converts it into expressions that are easy to understand when heard, converts documents written in spoken language into written language that is easy to read, and extracts important components (i.e., characteristic expressions), such as the names of persons, places, and organizations, dates, and so on, from newspaper articles and the like. This technology makes it possible to define, declaratively and simply and without concern for the order of processes, a rewriting rule including restrictions on character strings and a rule governing the extraction of characteristic expressions (see Japanese patent application publication (TOKKAI) No. 2001-67355, for example). In accordance with this technology, a translation device converts a set of rewriting rules described by users into a set of definite clause grammar rules, and a rule integration device then converts that set into an integrated rule that can be processed in parallel and at high speed. A rewriting execution device then accepts the integrated rule and a document to be rewritten (i.e., an original document), and outputs the rewritten result.
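The flow described above, in which independently declared rewriting rules are integrated into a single rule and then applied to an original document, can be pictured with the following minimal Python sketch. It is not the method of the cited publication: regular-expression alternation is substituted here for the translation into definite clause grammar rules, and the rule set and the names `RULES`, `integrate_rules`, and `rewrite` are hypothetical.

```python
import re

# Hypothetical rewriting rules, each declared independently as
# (pattern, replacement). Patterns are assumed not to contain named groups.
RULES = [
    (r"\bcan't\b", "cannot"),
    (r"\bwon't\b", "will not"),
    (r"\bgonna\b", "going to"),
]

def integrate_rules(rules):
    """Combine all rules into one compiled pattern and a replacement table."""
    parts = []
    replacements = {}
    for i, (pattern, replacement) in enumerate(rules):
        name = f"r{i}"
        # Wrap each rule in a named group so the matching rule can be identified.
        parts.append(f"(?P<{name}>{pattern})")
        replacements[name] = replacement
    return re.compile("|".join(parts)), replacements

def rewrite(document, integrated):
    """Apply the integrated rule to the document in a single pass."""
    pattern, replacements = integrated

    def substitute(match):
        # lastgroup is the name of the rule group that matched.
        return replacements[match.lastgroup]

    return pattern.sub(substitute, document)

integrated_rule = integrate_rules(RULES)
result = rewrite("I can't go and I won't stay", integrated_rule)
```

As long as no two patterns can match at the same position, the output of this sketch does not depend on the order in which the rules are listed, which loosely mirrors the order-independent rule definition described above.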
However, the prior art technologies disclosed in Japanese patent application publication (TOKKAIHEI) No. 10-97286 and Japanese patent application publication (TOKKAI) No. 2001-67355 can, at best, automatically classify the words and copulas included in a document; they cannot express the meaning or contents of each word or copula included in the document.
As communication technologies and information control technologies have developed in recent years, various types of information equipment terminals frequently exchange alphabetic information, typified by the exchange of e-mail and the browsing of homepages, at different places and at different times. However, the interpretation of each word or copula contained in the alphabetic information relies on human memory and judgment. Therefore, because of a lapse in judging the context or syntax of the alphabetic information, or a lapse of memory, the provider and the receiver of the alphabetic information may understand its meaning and contents differently, so that the provider cannot smoothly convey his or her intention to the receiver by using the alphabetic information.