In recent years, attempts have been made to convert a document printed on paper to an electronic form for various applications. The objects of converting a document to an electronic form include facilitating information retrieval in the document and facilitating layout design in printing.
To extract information from a document converted to an electronic form, it is necessary to describe at least bibliographical information (such as the title and author's name), and, to allow for more flexible retrieval, it is also necessary to describe headers indicating chapters or paragraphs, the correspondence between figures and sentences, list structures, etc. In other words, the logical relationships of a document must be extracted and tags for identifying them must be added (standard examples being the tags described in GML and in ISO 8879, Standard Generalized Markup Language (1986)).
Further, to facilitate a layout change in printing, it is necessary to extract format information and add it to the document. The format of a document is closely related to its logical structure, and, in general, a document is prepared with the tags of a format control language that reflect the logical structure (such as TeX; the above-described GML can also be deemed such a language), and these tags are interpreted for printing.
In recent years, on the one hand, word processors and the like have generally been used to create documents, but the conventional work of converting a document existing only on paper to an electronic form, including the logical structure thereof, has been done manually.
On the other hand, with the progress in the techniques for converting the character information contained in an image to a machine-readable form, in other words, the technology used by an optical character recognition (OCR) system, OCR systems are beginning to be used as an input device for information accumulated in paper form (Trans. IEICE (The Transactions of the Institute of Electronics, Information and Communication Engineers): Report on the Investigation and Research of the Standardization of Office Automation Equipment Related to Information Processing Systems, 1993). Many of them extract character strings from a scanned image, divide them into images on a character-by-character basis, and recognize these to output character codes (strings); in general, other information possessed by a document image (e.g., the position of the character line and font information) is discarded.
Thus, studies have been made to enable an OCR system to extract the various information possessed by a document image (for example, Yamashita and Amano: A Model Based Layout Understanding Method for Document Images, Trans. IEICE (D-II), Vol. J75-DII, No. 10, pp. 1673-1681, 1992, and Patent Application Laid-Open No. 4-278634 official gazette). According to these disclosed techniques, after the various components of a document image (such as figures, tables, and character lines) are separated based on various characteristics (run length, marginal distribution, and linking of black pixels), their positional relationships, that is, the layout structure (such as column arrangement), are interpreted. However, since a document image is inherently ambiguous, it is sometimes difficult to interpret uniquely.
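As an illustration of one of the characteristics named above, the marginal distribution, the following minimal sketch separates character lines in a binary image by counting black pixels per row. It is not taken from the cited references; the function names, the `min_black` parameter, and the tiny sample page are assumptions made for the example.

```python
def horizontal_profile(image):
    """Marginal distribution along the vertical axis: black pixels (1s) per row."""
    return [sum(row) for row in image]


def split_lines(image, min_black=1):
    """Return (top, bottom) row ranges, bottom exclusive, of runs of inked rows.

    A row belongs to a character line when it holds at least `min_black`
    black pixels (a hypothetical threshold chosen for this sketch).
    """
    profile = horizontal_profile(image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_black and start is None:
            start = y                      # a character line begins
        elif count < min_black and start is not None:
            lines.append((start, y))       # the line ended on the previous row
            start = None
    if start is not None:                  # a line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


# Hypothetical 6x6 page: two inked rows, a gap, then one inked row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
print(split_lines(page))
```

A profile of this kind yields only positional components; deciding what each separated line means logically is a further step.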
However, although layout information reflects the logical structure to some degree, it is not the logical structure itself. For instance, the same document can be printed with a different column arrangement, which changes the layout information even though the logical structure is maintained. Here, the logical structure means paragraphs, itemized lists, and reference relationships between figures, tables and text, and these cannot be determined from layout information.
Two approaches to understanding the logical structure of such a document have been proposed:
(1) Labeling is performed based on the keywords unique to each component from a sold document, and a logical structure such as a chapter or paragraph is determined in accordance with the grammar (document structure grammar) which each predefined label should satisfy (Doi et al.: Development of Document Architecture Extraction, Trans. IEICE (D-II), Vol. J76-D-II, No. 9, pp. 2042-2052, 1993).
(2) Blocks whose elements are basic rectangles extracted from a document image, together with their attribute values, are matched with those of a registered document model (which describes the logical elements of a document and the layout characteristics that they have), and a logical structure is determined from the selected model (Yamada: Conversion Method from Document Image to Logically Structured Document Based on ODA, Trans. IEICE (D-II), Vol. J76-D-II, pp. 2274-2284, 1993).
Approach (1) is characterized in that it does not particularly assume the understanding of a document image. It consists of a format analyzing section, which labels character strings based on the symbols and words unique to headings (such as "Chapter 1" and "Introduction"), and an intersentence structure analyzing section, which analyzes the label string using a stack automaton in accordance with the document structure grammar. For instance, the text "1. Introduction" is labeled element by element in the format analyzing section as shown in FIG. 1, and the parsing performed by the intersentence structure analyzing section determines it to be a header representing "Chapter 1." To further deal with the misallocation of numerals and symbols, Doi et al. (Trans. IEICE (D-II), Vol. J76-D-II, No. 9, pp. 2042-2052, 1993) added a process that defines the degree of matching between the label sequence whose structure has not yet been determined and the label pattern currently on the stack: if that value is within a threshold value, the depth is the same as the current depth, and, if that value exceeds the threshold value, a new heading or list is made which is deeper than the current depth by one. However, only one interpretation is allowed at each step of the process; that is, the label sequence is uniquely interpreted when put on the stack. This approach describes the logical structure of a document using a context-free grammar (CFG) and can deal with various documents flexibly. However, it is difficult to apply to the output of a document image understanding system, because, when the layout and the various text-format characteristics obtained from a document image are converted to a logical structure, they cannot always be uniquely interpreted at each step of the process.
On the contrary, there are many ambiguities that cannot be resolved unless many other characteristics are taken into consideration, and, in intermediate stages, more than one interpretation must be retained.
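The two stages of approach (1) can be pictured with the following minimal sketch. It is not the implementation of Doi et al.: the label names, the `mismatch` measure, the threshold of 0, and the search up the stack that closes deeper levels are all assumptions made so that the example runs. It does show the property criticized above, namely that each line is committed to exactly one depth as soon as it is processed.

```python
import re

# Hypothetical label set for the format analysis stage.
LABEL_RULES = [
    ("NUM", re.compile(r"\d+")),
    ("DOT", re.compile(r"\.")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]


def label_line(text):
    """Format analysis: convert a line of text into a sequence of labels."""
    labels, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in LABEL_RULES:
            match = pattern.match(text, pos)
            if match:
                labels.append(name)
                pos = match.end()
                break
        else:
            labels.append("SYM")  # any other single character
            pos += 1
    return labels


def heading_prefix(labels):
    """Labels that precede the first WORD, e.g. (NUM, DOT) for "1. Introduction"."""
    prefix = []
    for label in labels:
        if label == "WORD":
            break
        prefix.append(label)
    return tuple(prefix)


def mismatch(a, b):
    """Assumed distance between two label patterns (0 means identical)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def assign_depths(lines, threshold=0):
    """Structure analysis: give every heading a single depth using a stack."""
    stack, depths = [], []  # stack holds the patterns of the open levels
    for text in lines:
        prefix = heading_prefix(label_line(text))
        for depth in range(len(stack), 0, -1):  # innermost level first
            if mismatch(prefix, stack[depth - 1]) <= threshold:
                del stack[depth:]  # close any deeper levels
                break
        else:
            stack.append(prefix)  # no open level matches: go one level deeper
            depth = len(stack)
        depths.append(depth)  # committed immediately, never revised
    return depths


headings = ["1. Introduction", "1.1 Background", "1.2 Scope", "2. Method"]
print(label_line(headings[0]))
print(assign_depths(headings))
```

Because `assign_depths` keeps no alternative readings, a numeral misrecognized by an OCR system would propagate a wrong depth to every later heading, which is the difficulty described above.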
Approach (2) assumes a document image input and, in this sense, assumes that the input information (in this case, the blocks extracted from the image and their attributes) is ambiguous (FIG. 2). First, basic rectangles, the smallest units of the image components, are extracted, and continuous rectangles having the same attributes are grouped into a block. The attributes proposed in this approach include line spacing, character spacing, and the left offset value (identifying left justification, centering, and the like). The extracted blocks are further grouped into a block called "content," taking account of the logical boundaries found by analysis of the section number, column extraction, and the like. After the document class is determined by matching the content with those of the predetermined class models, a logical structure is generated in accordance with the rule defined in the document class. The ambiguity in image interpretation is absorbed in the matching step, but, after the document class is determined, a logical structure is uniquely generated in accordance with the rule of that class (which can be considered a grammar for generating a logical structure), and any local ambiguity is discarded. In addition, because logical structures must be described beforehand for individual document classes, another class must be defined even if only one characteristic differs, and thus a problem remains in versatility (in other words, descriptive power) compared to a document structure grammar such as that used in approach (1).
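The grouping and matching steps of approach (2) might be sketched as follows. This is an illustration under stated assumptions rather than the method of the cited paper: the `Rect` type, the exact-equality test on attributes, the scoring rule in `choose_class`, and the two class models are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A basic rectangle with the three attributes named in the text."""
    top: int            # vertical position, used only for ordering
    line_spacing: int
    char_spacing: int
    left_offset: int


def attributes(rect):
    return (rect.line_spacing, rect.char_spacing, rect.left_offset)


def group_into_blocks(rects):
    """Group vertically consecutive rectangles with equal attributes into blocks."""
    blocks = []
    for rect in sorted(rects, key=lambda r: r.top):
        if blocks and attributes(blocks[-1][-1]) == attributes(rect):
            blocks[-1].append(rect)
        else:
            blocks.append([rect])
    return blocks


def choose_class(blocks, models):
    """Pick the single class model whose expected block attributes fit best."""
    observed = [attributes(block[0]) for block in blocks]

    def score(expected):
        hits = sum(o == e for o, e in zip(observed, expected))
        return hits - abs(len(observed) - len(expected))

    return max(models, key=lambda name: score(models[name]))


# Hypothetical page: one centered title line followed by three body lines.
rects = [
    Rect(top=0, line_spacing=20, char_spacing=2, left_offset=50),
    Rect(top=30, line_spacing=12, char_spacing=1, left_offset=0),
    Rect(top=42, line_spacing=12, char_spacing=1, left_offset=0),
    Rect(top=54, line_spacing=12, char_spacing=1, left_offset=0),
]
# Hypothetical class models: expected attributes of each block, in order.
models = {
    "report": [(20, 2, 50), (12, 1, 0)],
    "letter": [(12, 1, 0), (12, 1, 0)],
}
blocks = group_into_blocks(rects)
print([len(block) for block in blocks])
print(choose_class(blocks, models))
```

Once `choose_class` returns a single class, all other candidates are dropped, so any rule applied afterwards operates on one interpretation only. A document differing from "report" in a single attribute would need its own entry in `models`, which reflects the versatility problem noted above.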
Further, the Patent Application Laid-Open No. 3-142563 official gazette discloses a method in which natural language input is analyzed by a modification candidate detector to generate a paragraph node list with modification candidates, and the paragraph node list is then analyzed by interaction with a user. In this technique, if there is ambiguity in the node modification, the ambiguous portion and the scores of the ambiguous candidates are displayed, thereby allowing the operator to sequentially select the desired candidates. However, this technique is basically interactive and cannot be extended to perform the whole process automatically under computer control; it also discloses nothing about an extension for generally analyzing the logical structure of a document while allowing ambiguity, apart from the detection of modification candidates.