1. Field of Invention
The present invention relates to a text structure analysis method and a text structure analysis apparatus applied to analytical processing of text. More specifically, the present invention relates to an apparatus and method for determining the differing parts of a plurality of texts and for extracting a portion of the content of a given text.
2. Description of Related Art
In known methods of text analysis, when two given documents are processed to determine and extract their differing portions, it is common to treat one line or one sentence as a unit and to perform structural analysis based on the connections/relationships between those units. For example, there is known a method of processing text by examining the connections/relationships of sentences and creating a tree or graph of the text based on these connections/relationships. Another known method of performing text analysis creates a paragraph of joined sentences from the connections/relationships of the sentences.
Japanese Laid-Open Patent Applications No. 4-23765 (JP 4-23765), No. 6-35960 (JP 6-35960), No. 7-200589 (JP 7-200589), and No. 8-6945 (JP 8-6945) disclose examples of the tree/graph method of text analysis. Japanese Laid-Open Patent Applications No. 4-306768 (JP 4-306768) and No. 5-324708 (JP 5-324708) disclose examples of the method employing paragraphs of joined sentences.
The method according to JP 4-23765 performs syntactic analysis on each of two texts and attempts to detect the differing parts of these texts using syntax trees.
In the apparatus according to JP 6-35960, a document structure detection unit that uses surface-level vocabulary and a document structure detection unit that uses grammatical subjects are utilized to perform text analysis. The apparatus performs detailed structural analysis of documents, not only by using vocabulary information appearing on the surface level of the sentences, but also by using grammatical subjects detected from each sentence, including subjects not clearly indicated in the sentences.
The method of JP 7-200589 extracts text having qualifying relationships between sentences in the form of a tree structure. The method arranges and displays a portion of text using the extracted tree structure.
The method of JP 8-6945 generates nodes based on rules governing the assembly of attributes of neighboring lines, connects the nodes with links, and applies costs to the nodes and links. The method interprets the logical structure of the sentences by traversing text graphs.
The method of JP 4-306768 joins and performs structural analysis of sentences based on connections/relationships between the sentences in the documents.
The method of JP 5-324708 restores paragraph information according to connections/relationships and segmentation rules for each sentence, and performs structural analysis by considering that paragraph information.
All of these known methods and apparatus examine the connections/relationships of texts while treating a sentence or line as the smallest unit. As a result, the computational volume is large and long computation times are required for processing.
Additionally, all of these known methods and apparatus merely perform processing according to predefined rules (rules regarding connections/relationships). The user cannot change the method of analysis in accordance with the particular sentence being processed. Furthermore, when some sort of processing has been performed using the results of structural analysis and the results are to be output, various problems arise, such as having to perform analysis again using the results of the structural analysis and having to reconstruct the text for output.
Also, when the known methods extract the differing part between two texts in units of sentences or lines, they indicate that a changed portion exists in a line either by outputting only the line having the change or by outputting the entire text and assigning a mark to the start of the line having the change.
For example, consider a portion of text representing a three-day weather forecast, such as shown in FIG. 2, having a change in a part of its content. For this example, the weather forecast is changed from that of FIG. 2 to the forecast shown in FIG. 4. Comparing the contents of FIG. 2 and FIG. 4, the probability of precipitation on the 2nd of the month was changed from 40% to 20% and the lowest temperature on the 3rd of the month was changed from 6° C. to 8° C.
In the known methods, when attempting to output the differing parts of two portions of text in line units, either only the changed part is displayed, as shown in FIG. 8A, or the entire text is displayed with a mark (for example, an asterisk) assigned to the line having the change, as shown in FIG. 8B.
In the example shown in FIG. 8A, "9<" indicates the content of the 9th line before change, that is, the 9th line in FIG. 2 (Probability of Precipitation 40%) and "9>" indicates the content of the 9th line after change, that is, the 9th line in FIG. 4 (Probability of Precipitation 20%). In the same manner, "15<" indicates the content of the 15th line before change, that is, the 15th line in FIG. 2 (Lowest Temperature 6° C.) and "15>" indicates the content of the 15th line after change, that is, the 15th line in FIG. 4 (Lowest Temperature 8° C.).
In FIG. 8A, because only the line having the change is displayed, the surrounding context cannot be grasped. Similarly, in FIG. 8B, all of the text is displayed, but the context cannot be grasped because too much text is displayed.
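The two line-unit output styles discussed above can be illustrated with a minimal sketch. It is not taken from any of the cited applications. The function names, the four-line sample text, and the assumption that both texts have the same number of lines are hypothetical and chosen only for illustration. The sample text is a shortened stand-in and does not reproduce the content of FIG. 2 or FIG. 4.

```python
# Hypothetical sketch of line-unit difference output in the two known styles.
# Assumption: both texts have the same number of lines, so lines can be
# compared position by position (zip stops at the shorter text otherwise).

def changed_lines(before, after):
    """Return (line_number, old_line, new_line) for every differing line."""
    return [
        (number, old, new)
        for number, (old, new) in enumerate(zip(before, after), start=1)
        if old != new
    ]

def format_changes_only(before, after):
    """Style of FIG. 8A: only changed lines, as 'N< old' and 'N> new'."""
    output = []
    for number, old, new in changed_lines(before, after):
        output.append(f"{number}< {old}")
        output.append(f"{number}> {new}")
    return output

def format_marked_full_text(before, after):
    """Style of FIG. 8B: the entire new text, with '*' on changed lines."""
    changed = {number for number, _, _ in changed_lines(before, after)}
    return [
        ("* " if number in changed else "  ") + line
        for number, line in enumerate(after, start=1)
    ]

# Shortened, hypothetical sample (not the actual figure content).
before = [
    "Forecast for the 2nd",
    "Probability of Precipitation 40%",
    "Forecast for the 3rd",
    "Lowest Temperature 6 degrees C",
]
after = [
    "Forecast for the 2nd",
    "Probability of Precipitation 20%",
    "Forecast for the 3rd",
    "Lowest Temperature 8 degrees C",
]

if __name__ == "__main__":
    print("\n".join(format_changes_only(before, after)))
    print("\n".join(format_marked_full_text(before, after)))
```

In the first style the reader sees the changed values but not which day they belong to. In the second style the day headings are present, but they are not distinguished from the other unchanged lines, which becomes a problem as the text grows.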