1. Field of the Invention
The present invention relates to an apparatus for reading machine-readable documents on a computer screen, and to a method therefor. In particular, the present invention is intended to support the comparative reading of related documents by presenting the related passages across the documents to be compared in an easily understandable form.
2. Description of the Related Art
The objective of the present invention is to help a person who wants to compare the contents of a plurality of related documents, such as one who reviews a plurality of survey reports from different areas in order to write a summary report on the actual situation in those areas, or one who reviews a reply document with reference to the question document being answered. In such cases, a brief list of the related portions of the documents to be compared helps the user find the similarities and differences among those documents. As representative references on multi-document comparison support, the following seven are cited:
    [1] Christine M. Neuwirth and David S. Kaufer. The role of external representations in the writing process: Implications for the design of hypertext-based writing tools. In Proc. of Hypertext '89, pp. 319-341. The Association for Computing Machinery, Nov. 1989.
    [2] Nobuyuki Omori, Jun Okamura, Tatsunori Mori, and Hiroshi Nakagawa. Hypertextization of a relation manual group using tf.idf method. Information processing academy research report FI-47-8/NL-121-16, Information processing academy, Sep. 1997.
    [3] Gerard Salton, Amit Singhal, Chris Buckley, and Mandar Mitra. Automatic text decomposition using text segments and text themes. In Proc. of Hypertext '96, pp. 53-65. The Association for Computing Machinery, Mar. 1996.
    [4] Inderjeet Mani and Eric Bloedorn. Summarizing similarities and differences among related documents. Chapter 23, pp. 357-379. The MIT Press, London, 1999. (Reprint of Information Processing and Management, Vol. 1, No. 1, pp. 1-23, 1999.)
    [5] Japanese patent laid-open Publication No. 7-325,827
    [6] Japanese patent laid-open Publication No. 2000-57,152 (P2000-57152A)
    [7] Japanese patent laid-open Publication No. 11-39,334
Among these, article [1] proposes an interface called “Synthesis Grid”, which summarizes the similarities and differences across related articles in an author-proposition table.
Also, as a conventional technology for extracting related parts across documents, a technology is known that sets a hyperlink between related parts of different documents, using the appearance of the same vocabulary as a clue. For example, article [2] shows a technology for setting a hyperlink between a pair of document segments that show high lexical similarity. Publications [5] and [6] show a technology for setting hyperlinks between related parts of documents in which the same keyword appears.
In addition, article [3] shows a technology for extracting related parts within a single document by detecting groups of paragraphs having high lexical similarity. Also, article [4] shows a method for discovering topic-related textual regions based on coreference relations using spreading activation through coreference of adjacency word links.
As for technology for presenting the similarities and differences among a plurality of related documents, publication [7] shows a multi-document presentation method that distinguishes the information commonly included in a plurality of documents from the other information. The method displays the whole contents of one selected article, highlighting (hatching) the common information, and supplements it with the information unique to the remaining articles.
However, the above-mentioned conventional technology has the following two problems.
The first problem is that it is difficult to determine related parts appropriately for a topic that different documents describe in different manners. A major topic may be divisible into minor topics, and the way such a topic is described may differ from document to document. For example, the major topic of one document is not necessarily the major topic of another document. The other document may contain only some minor topics related to the first document's major topic. In such a case, the size of the related portions should differ from document to document.
However, the conventional methods described above give little consideration to the size of passages. In the following article [8], Singhal and Mitra reported that a widely used similarity measure, namely the cosine of a pair of weighted term vectors, is likely to yield inappropriately low scores for longer documents and inappropriately high scores for shorter documents.
    [8] Amit Singhal and Mandar Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 21-29. The Association for Computing Machinery, 1996.
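As a rough illustration of the length sensitivity reported in article [8], the following minimal sketch computes the cosine of two term vectors. The raw term-frequency weighting, the whitespace tokenizer, and the function names term_vector and cosine are assumptions made for this illustration only; they are not taken from the cited articles.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (an assumed, simplistic weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)


query = term_vector("flood damage")
short_text = term_vector("flood damage report")
long_text = term_vector(
    "flood damage report covering roads bridges schools "
    "and hospitals in the northern district"
)

# Both texts contain the same matching terms, but the longer one has a
# larger norm, so its cosine score against the query is lower.
print(cosine(query, short_text), cosine(query, long_text))
```

In this sketch the two texts share exactly the same terms with the query, yet the longer text receives the lower score solely because of its length, which is one direction of the effect described above.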
In the following article [9], Callan also reported that passages based on paragraph boundaries were less effective for passage retrieval than passages based on overlapping text windows of a fixed size (e.g., 150-300 words). These observations suggest that related passage extraction should carefully consider the size of the passages to be extracted, especially when the sizes of the related portions of the target documents differ greatly from each other.
    [9] James P. Callan. Passage-level evidence in document retrieval. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 302-310. The Association for Computing Machinery, 1994.
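The overlapping fixed-size text windows mentioned in connection with article [9] can be sketched as follows. This is only an illustration: the function name overlapping_windows, the default window size of 200 words (chosen from within the 150-300 word range cited above), and the step of half a window are assumptions, not details taken from the cited article.

```python
def overlapping_windows(words, size=200, step=100):
    """Split a word list into fixed-size windows that overlap.

    Returns a list of (start_index, window_words) pairs. The last window
    may be shorter than `size` when the text does not divide evenly.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append((start, words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
        start += step
    return windows


# Hypothetical 500-word document represented by placeholder tokens.
document = [f"w{i}" for i in range(500)]
for start, window in overlapping_windows(document):
    print(start, len(window))
```

Because each window starts half a window after the previous one, every word away from the ends of the text falls into two windows, so a related portion that straddles one window boundary is still fully contained in a neighbouring window.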
The second problem is that the relationship between a set of related parts regarding a certain topic and either another set of related parts regarding a different topic or the whole original document cannot be clearly expressed. For example, the configuration of related parts across long documents is often complicated.
Therefore, in order to understand the overall relationship between long documents, it is necessary not only to read the set of related parts across documents for each individual topic, but also to review the related parts in detail, considering the mutual relationships among a plurality of topics and the context in which each related part appears. For this purpose, it is desirable to be able to view a plurality of sets of related parts at a glance and to refer easily to the surrounding text of each related part, but such a function is not realized in the above-mentioned conventional technology.