The present invention relates to a structured document difference string extraction method and apparatus for a document processor such as a word processor capable of extracting a difference character string between structured documents stored as an electronic file.
A structured document is defined as one having embedded therein, i.e., containing, information on the logical structure of a document, that is, information such as “this portion of the document constitutes a chapter” or “this portion makes up a title”.
The difference extraction between documents is defined as detecting a most coincident combination of elements constituting each document including paragraphs, lines and characters and extracting non-coincident elements as a difference. Suppose that two documents for which the difference is to be detected are “ABCDEFG” and “ACDAEFH”. When the two documents are compared in terms of elements thereof including A, B, C, D, E, F, G and H, the most coincident combination is detected as “correspondence of ACDEF”. Also, the difference is detected in the form of “B is deleted”, “A is inserted after D” or “G is changed to H”.
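The "most coincident combination" in the example above corresponds to a longest common subsequence of the two element sequences. The following is a minimal sketch in Python, assuming a standard dynamic-programming formulation; the function names `lcs_table` and `extract_difference` are illustrative and are not taken from the disclosure. The sketch reports a change such as "G is changed to H" as a deletion paired with an insertion.

```python
def lcs_table(a, b):
    """Build the table of longest-common-subsequence lengths for all prefixes."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def extract_difference(a, b):
    """Return (matched, deleted, inserted) element lists for sequences a and b."""
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    matched, deleted, inserted = [], [], []
    # Walk back through the table from the bottom-right corner.
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            matched.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            deleted.append(a[i - 1])
            i -= 1
        else:
            inserted.append(b[j - 1])
            j -= 1
    # Whatever remains at the front of either sequence is unmatched.
    while i > 0:
        deleted.append(a[i - 1])
        i -= 1
    while j > 0:
        inserted.append(b[j - 1])
        j -= 1
    return matched[::-1], deleted[::-1], inserted[::-1]


matched, deleted, inserted = extract_difference("ABCDEFG", "ACDAEFH")
print("".join(matched), deleted, inserted)
```

For the two documents of the example, the matched elements are ACDEF, the elements B and G are unmatched in the first document, and the elements A and H are unmatched in the second.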
A conventional method for difference extraction is disclosed in JP-A-2-255964, in which comparison is made in terms of punctuation marks, lines, words and characters. In application of this method to structured documents, a character string representing a logical structure contained in the documents is compared in the same manner as other character strings are compared in the documents.
Extraction of a difference in a structured document by the same means as in a normal document may be inappropriate to the document editor, however, since the result may be non-coincident with the logical structure of the document.
The following Examples 1-3 were considered by the Applicants during development of the present invention, and have not been known or published publicly.
With reference to the structured documents shown in FIGS. 3A and 3B, the case will be explained in which documents having non-coincident logical structures are erroneously matched with each other in the process of difference extraction, thereby leading to an extraction result inappropriate to the document editor.
The structured documents in FIGS. 3A and 3B are described by SGML (Standard Generalized Markup Language; ISO 8879), indicating that a character string sandwiched by marks called tags, for example <A> and </A>, is associated with a logical structure A. In other words, the character string “TARO HEISEI” sandwiched between “<NAME>” and “</NAME>” of FIG. 3A is associated with the logical structure “NAME”. HTML (Hypertext Markup Language), which is used in WWW (World Wide Web), is an application of SGML and is applicable to the present invention as well.
The mark representing this logical structure is called a tag. “<A>” and “</A>” are called a start tag and an end tag, respectively.
The result of extracting a difference character string between the two structured documents in FIGS. 3A and 3B by the conventional technique is shown in FIGS. 4A and 4B.
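The division of such a document into start tags, end tags and the character strings between them can be sketched as follows. This is a simplified illustration in Python, assuming tags without attributes and without SGML tag minimization; the names `TOKEN` and `tokenize` are hypothetical and do not appear in the disclosure.

```python
import re

# Either a tag (optionally with a leading slash) or a run of ordinary text.
TOKEN = re.compile(r"<(/?)([^<>]+)>|([^<>]+)")


def tokenize(document):
    """Split a tagged document into ("start" | "end" | "text", value) tokens."""
    tokens = []
    for match in TOKEN.finditer(document):
        closing, tag, text = match.groups()
        if tag is not None:
            tokens.append(("end" if closing else "start", tag.strip()))
        elif text.strip():
            # Whitespace-only text between tags is dropped.
            tokens.append(("text", text.strip()))
    return tokens


print(tokenize("<NAME>TARO HEISEI</NAME>"))
```

Keeping the tag tokens distinct from the text tokens is what allows a later stage to treat the logical structure differently from the ordinary character strings.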
FIG. 4B shows the result of extracting difference character strings of the structured document in FIG. 3B relative to the structured document in FIG. 3A. FIG. 4A shows the result of extracting difference character strings of the structured document in FIG. 3A relative to the structured document in FIG. 3B.
As seen from FIGS. 4A and 4B, “HEISEI” associated with “<NAME>” and “HEISEI” associated with “<TRANSMISSION DATE>” are not extracted as the difference. This is due to the fact that the two occurrences of “HEISEI” were coincident as character strings and were erroneously matched with each other. This correspondence between occurrences of “HEISEI” that do not coincide in logical structure is obviously meaningless to the document editor.
With reference to the structured documents shown in FIGS. 5A and 5B, the case will be explained in which character strings are matched erroneously over different document structures in the process of difference extraction due to the insertion of a document structure, thereby leading to an extraction result not proper to the document editor. FIG. 5A shows a structured document having Chapter 1, and FIG. 5B a structured document with one other chapter inserted before Chapter 1.
FIGS. 6A, 6B show an example of extracting a difference character string between the two structured documents of FIGS. 5A, 5B.
FIGS. 6A, 6B show a case similar to FIGS. 4A, 4B, in which FIG. 6B shows the result of extracting a difference character string of FIG. 5B relative to FIG. 5A. FIG. 6A, on the other hand, shows the result of extracting a difference character string of FIG. 5A relative to FIG. 5B.
As seen from FIG. 6A, Chapter 1 of FIG. 6A is matched over Chapter 1 and Chapter 2 of FIG. 6B in spite of the fact that Chapter 1 of FIG. 6A is identical to Chapter 2 of FIG. 6B. This is another case inappropriate to the document editor.
Dual appearance in FIG. 5B of the same character string “STRUCTURED DOCUMENT”, unlike in FIG. 5A, leads to the erroneous decision in FIG. 6B that the first “STRUCTURED DOCUMENT” is coincident while the second “STRUCTURED DOCUMENT” is non-coincident, so that the second “STRUCTURED DOCUMENT” is extracted as a difference. The same is true of each subsequent case of difference extraction.
With reference to the structured documents of FIGS. 7A, 7B, explanation will be made of the case in which the difference in marks representing the logical structure of a document makes it impossible to match the contents of documents with each other in spite of the identical logical meaning of the documents, resulting in the extraction inappropriate to the document editor.
In FIGS. 7A, 7B, a tag <FIRST ITEM> is attached only to the item that appears first, in spite of the fact that its logical meaning in the document remains the same as that of “ITEM”.
FIGS. 8A, 8B show the case in which difference character strings between two structured documents of FIGS. 7A and 7B are extracted by the conventional technique.
FIGS. 8A, 8B represent a case similar to FIGS. 4A, 4B, in which FIG. 8B shows the result of extracting difference character strings of FIG. 7B as compared with FIG. 7A, while FIG. 8A shows the result of extracting difference character strings of FIG. 7A as compared with FIG. 7B.
From FIGS. 8A, 8B, it is seen that “FIRST ITEM”s are matched with each other and the character strings associated with them are compared with each other as the contents thereof. The logical meaning of “FIRST ITEM” and that of “ITEM” are the same for the document editor, and therefore the contents of the tags are required to be matched in priority over the tags.
In extracting the difference between structured documents, comparison between them is required taking into consideration the logical meaning and the structure of the structured documents. This requirement is not met by the conventional method in which character strings indicating a logical structure are compared in similar fashion to other character strings in the document.
An object of the present invention is to provide a method and an apparatus for extracting a difference character string between structured documents in a manner suited to the linguistic sense of the document editor, taking the logical meaning and structure of the structured documents into consideration.
Another object of the present invention is to provide a method and an apparatus for managing the editing of a structured document for a document processing system capable of managing the editing on the basis of comparison and discrimination of the logical structures of structured documents.
In order to achieve the above-mentioned objects, according to one aspect of the invention, there is provided a structured document difference extraction method including memory means for storing structured documents defined as information on the logical structure of documents before and after editing such as deletion, insertion or change, and a processor for extracting a character string non-coincident between the structured documents before and after editing as a difference, comprising the steps of:
editing and storing a structured document in the memory means;
parsing the logical structures of the structured document before and after editing read from the memory means on the basis of a set comparison criterion; and
extracting the difference between the structured documents in such a manner as to satisfy the comparison criterion in accordance with the result of parsing of the structured documents.
The comparison criterion includes tags indicating logical structures and types of comparison criterion corresponding to the tags with the contents thereof being stored in a table.
Each tag is defined as one of the following four types of comparison criterion:
(1) Tags having the contents which are compared only when the particular tags are coincident with each other (identity tags)
(2) Tags having the contents the difference of which is ignored at the time of comparison (ignoring tags)
(3) A set of tags identical to each other in logical meaning (equivalence tags, such as “FIRST ITEM” and “ITEM”)
(4) A set of tags having the contents which are not compared with each other (no-comparison tags).
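A table associating tags with the four criterion types might be represented as in the following Python sketch. The names `Criterion`, `CRITERION_TABLE`, `criteria_for`, `tags_coincide` and `contents_comparable` are hypothetical, and the assignment of “NAME” and “TRANSMISSION DATE” to the identity type is an assumption made for illustration only; only the equivalence of “FIRST ITEM” and “ITEM” is stated in the text. No rows are given for the ignoring and no-comparison types because the text names no example tags for them.

```python
from enum import Enum


class Criterion(Enum):
    IDENTITY = "identity"            # contents compared only when tags coincide
    IGNORING = "ignoring"            # differences in contents are ignored
    EQUIVALENCE = "equivalence"      # tags identical in logical meaning
    NO_COMPARISON = "no-comparison"  # contents not compared with each other


# Hypothetical table rows: (criterion type, set of tags the row applies to).
CRITERION_TABLE = [
    (Criterion.IDENTITY, frozenset({"NAME"})),               # assumed
    (Criterion.IDENTITY, frozenset({"TRANSMISSION DATE"})),  # assumed
    (Criterion.EQUIVALENCE, frozenset({"FIRST ITEM", "ITEM"})),
]


def criteria_for(tag):
    """Return every criterion type registered in the table for the tag."""
    return {kind for kind, tags in CRITERION_TABLE if tag in tags}


def tags_coincide(tag_a, tag_b):
    """Tags coincide if they are equal or share an equivalence row."""
    if tag_a == tag_b:
        return True
    return any(
        kind is Criterion.EQUIVALENCE and tag_a in tags and tag_b in tags
        for kind, tags in CRITERION_TABLE
    )


def contents_comparable(tag_a, tag_b):
    """Under an identity tag, contents are compared only if the tags coincide."""
    involved = criteria_for(tag_a) | criteria_for(tag_b)
    return tags_coincide(tag_a, tag_b) or Criterion.IDENTITY not in involved


print(tags_coincide("FIRST ITEM", "ITEM"))
print(contents_comparable("NAME", "TRANSMISSION DATE"))
```

Under these assumed rows, the contents of “NAME” and “TRANSMISSION DATE” would not be compared with each other, while “FIRST ITEM” and “ITEM” would be treated as the same tag.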
Furthermore, a document tree representing the structure of each structured document is produced by the above-mentioned parsing method, and the difference between the structured documents is extracted by comparison between the nodes of the respective document trees. In the case where given nodes are non-coincident with each other, the difference is extracted between the nodes by comparison between the characters of the nodes.
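The two-level comparison described above, first by node and then by character within non-coincident nodes, can be sketched as follows. This is an illustrative Python sketch, not the disclosed implementation: the `Node` type and the functions `char_diff` and `tree_diff` are hypothetical, the standard-library `difflib.SequenceMatcher` is swapped in as the sequence-matching technique at both levels, and the sample documents (including the name “JIRO” and the chapter title “OVERVIEW”) are invented for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Node:
    """One node of a document tree: a tag, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: tuple = ()


def char_diff(old, new):
    """Character-level difference as (operation, old_part, new_part) tuples."""
    ops = SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
    return [(op, old[i1:i2], new[j1:j2])
            for op, i1, i2, j1, j2 in ops if op != "equal"]


def tree_diff(old, new, path=""):
    """Compare trees node by node; compare characters only inside matched nodes."""
    here = f"{path}/{old.tag}"
    if old == new:
        return []
    if old.tag != new.tag:
        # Different logical structures are not matched character by character.
        return [(here, "replace-node", old.tag, new.tag)]
    diffs = [(here, op, a, b) for op, a, b in char_diff(old.text, new.text)]
    matcher = SequenceMatcher(None, old.children, new.children, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        olds, news = old.children[i1:i2], new.children[j1:j2]
        if op == "replace" and len(olds) == len(news):
            # Pair up the non-coincident nodes and descend into each pair.
            for o, n in zip(olds, news):
                diffs.extend(tree_diff(o, n, here))
        else:
            diffs.extend((here, "delete-node", o.tag, None) for o in olds)
            diffs.extend((here, "insert-node", None, n.tag) for n in news)
    return diffs


before = Node("DOC", children=(Node("CHAPTER", "STRUCTURED DOCUMENT"),))
after = Node("DOC", children=(Node("CHAPTER", "OVERVIEW"),
                              Node("CHAPTER", "STRUCTURED DOCUMENT")))
print(tree_diff(before, after))
print(tree_diff(Node("NAME", "TARO HEISEI"), Node("NAME", "JIRO HEISEI")))
```

In the first call the unchanged chapter is matched as a whole node, so only the inserted chapter is reported; in the second call the nodes carry the same tag but different text, so the difference is narrowed down to the differing characters.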
In addition, in producing a document tree or hierarchy representing each document structure by the aforementioned parsing method, the allocation of the nodes of the document trees is altered in accordance with the comparison criterion described above.
According to another aspect of the invention, there is provided a structured document difference extraction apparatus comprising a memory means for storing structured documents before and after editing including deletion, insertion or change, and a processor for extracting at least a non-coincident character string of each structured document before and after editing as a difference between the structured documents, wherein:
the processor includes means for editing the structured documents and storing the result of the editing in the memory means, means for parsing the logical structure of structured documents before and after editing read from the memory means on the basis of a preset comparison criterion, and means for extracting the difference between the structured documents in such a manner as to meet the comparison criterion in accordance with the result of parsing of the structured documents.
The extraction means includes a table for storing tags representing logical structures and types of criterion for the tags.
The following four criterion types of tags are defined beforehand for comparison:
(1) Tags having the contents which are compared only when the particular tags are coincident with each other
(2) Tags having the contents the difference of which is ignored at the time of comparison
(3) A set of tags identical in logical meaning to each other, and
(4) A set of tags having the contents which are not compared with each other.
Further, the structured document parsing means produces a document tree representing the structure of each document, and the structured document difference extraction means extracts the difference between the structured documents before and after editing by comparing the respective document trees by node. When a given pair of nodes between a pair of structured documents fail to coincide with each other, the difference is extracted by comparing the particular nodes, this time, by character.
In addition, the structured document parsing means, when producing a document tree representing a document structure, alters the allocation of the nodes of the document tree in accordance with the comparison criterion.
With the solutions as described above, structured documents are edited, the logical structure of the edited structured documents is analyzed by the structured document parsing means, a comparison criterion used for extracting the difference corresponding to the logical structure is set in advance, and a difference character string between the structured documents before and after editing is extracted in such a manner as to meet the comparison criterion. The more relevant difference conforming with the linguistic sense of the editor can thus be automatically extracted in accordance with the logical structure.
Also, the difference is extracted by node between document trees, whereas the difference between non-coincident nodes is extracted by character, so that an erroneous extraction of the difference over different structures can be eliminated.