This invention relates to an information processing apparatus and method for managing entered document image data.
When the content of document image data is reutilized, the conventional practice is to read in the document image data by image input means such as a scanner and employ data obtained by subjecting the read document image data to character recognition processing. Layouts of document image data can be quite diverse. In order to execute character recognition processing efficiently, layout analysis processing is executed before character recognition processing. The accuracy of such processing is considered important.
If document image data is preserved in the format of the results obtained by applying layout analysis processing and character recognition processing, then both the content of the document image data and its layout can be acquired. Using the acquired layout when desired document image data is retrieved from the retained document image data improves retrieval efficiency. For this reason, document image data retention formats in which the results of layout analysis processing and the results of character recognition processing are preserved in correspondence with the document image data have begun to appear.
Such formats for preserving document image data in many cases are the file formats of specific applications. Examples are the RTF (Rich Text Format) of Microsoft Corporation and HTML (HyperText Markup Language) that is often used in connection with the Internet.
However, file formats of specific applications are sometimes not compatible and can reduce conversion accuracy. Accordingly, there is need for an intermediate format wherein information obtained from results of layout analysis processing and character recognition processing is preserved in as great an amount as possible and is so adapted as to maintain accuracy in order to enable conversion to various applications.
There have been proposed intermediate formats that make conversion possible regardless of the application or system. Examples are the SGML format and the PDF (Portable Document Format). Generically, such an intermediate format is referred to as a DAOF (Document Analysis Output Format).
Conventionally, document image data is preserved and converted after the document image data itself has been changed to bitmap data or compressed. While such a preservation format is suitable for image data such as data representing a natural image, it does not allow retrieval when the data is document image data. Though document image data may be retained upon being compressed, there is also a need for a preservation format wherein document image data is subjected to image analysis to obtain character codes, a layout description and a description for putting figures and pictures from the document image data into the form of images. The preservation format should also be applicable to spreadsheet software, with tables being structurally analyzed.
As means for performing such analysis, the conventional DAOF analyzes document image data and preserves document image data in the form of a layout description, character recognition description, table structure description and image description.
The structure of data in the conventional DAOF will be described with reference to FIG. 11.
FIG. 11 illustrates the structure of data in DAOF according to the prior art.
As shown in FIG. 11, the data structure includes a header 1101, in which information relating to document image data to be processed is retained; a layout description data field 1102 for retaining attribute information and rectangular address information of each block recognized for every attribute such as TEXT (characters), TITLE, CAPTION, LINE ART (line images), PICTURE (natural images), FRAME and TABLE contained in document image data; a character recognition description data field 1103 which retains results of character recognition obtained by applying character recognition to TEXT blocks such as TEXT, TITLE and CAPTION; a table description data field 1104 for retaining the details of the structure of a TABLE block; and an image description data field 1105, in which image data of a block such as a PICTURE or LINE ART block is extracted from document image data and retained.
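The five fields of FIG. 11 can be pictured as a simple record type. The following is only an illustrative sketch and not the format's actual encoding: the class names `LayoutBlock` and `Daof`, the `block_id` key, the `(x, y, width, height)` rectangle convention and the sample values are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Block attributes that the text says receive character recognition.
TEXT_ATTRIBUTES = {"TEXT", "TITLE", "CAPTION"}


@dataclass
class LayoutBlock:
    block_id: int                    # hypothetical key linking the fields
    attribute: str                   # e.g. "TEXT", "TABLE", "PICTURE"
    rect: Tuple[int, int, int, int]  # assumed (x, y, width, height)


@dataclass
class Daof:
    header: Dict[str, str]                                            # 1101
    layout: List[LayoutBlock] = field(default_factory=list)           # 1102
    recognized_text: Dict[int, str] = field(default_factory=dict)     # 1103
    tables: Dict[int, List[List[str]]] = field(default_factory=dict)  # 1104
    images: Dict[int, bytes] = field(default_factory=dict)            # 1105

    def text_blocks(self) -> List[LayoutBlock]:
        """Return the blocks whose attribute is a text-type attribute."""
        return [b for b in self.layout if b.attribute in TEXT_ATTRIBUTES]


# Made-up sample page: one title block and one picture block.
doc = Daof(header={"source": "scan-001"})
doc.layout.append(LayoutBlock(0, "TITLE", (10, 10, 400, 40)))
doc.layout.append(LayoutBlock(1, "PICTURE", (10, 60, 200, 150)))
doc.recognized_text[0] = "Quarterly Report"
```

Keying the recognition, table and image fields by a block identifier is one possible way to tie each description back to its rectangle in the layout field; the source does not specify how the fields reference one another.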
There are instances where these DAOFs are themselves preserved as files and not just intermediate data. A large number of intermediate formats are preserved, and the management and retrieval thereof is important. Unlike instances in which only document image data is preserved, in this case a search by character information is possible. There is strong demand for a search capability utilizing such a DAOF.
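As a rough illustration of a search by character information over preserved files, the sketch below matches a query against the recognized text of each stored document. The function name `search_by_text`, the dictionary layout and the sample documents are hypothetical; a plain case-insensitive substring match stands in for whatever retrieval method an actual system would use.

```python
from typing import Dict, List


def search_by_text(documents: Dict[str, List[str]], query: str) -> List[str]:
    """Return names of documents with a recognized text block containing query."""
    needle = query.lower()
    return sorted(
        name
        for name, blocks in documents.items()
        if any(needle in text.lower() for text in blocks)
    )


# Made-up recognized text for two preserved documents.
stored = {
    "scan-001": ["Quarterly Report", "Sales rose in the third quarter."],
    "scan-002": ["Meeting Minutes", "Budget review postponed."],
}
hits = search_by_text(stored, "report")
```

A search of this kind is not available when only the bitmap or compressed image is kept, since there are no character codes to match against.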
However, the conventional DAOF utilizes the results of layout analysis processing and character recognition processing for the purpose of reproducing document image data faithfully. Consequently, desired document image data cannot be searched for, and document image data associated with the desired document image data cannot be classified into groups. Further, in order to implement such processing, it is necessary to provide versatility through logical relationships among the data managed by the DAOF.
Accordingly, an object of the present invention is to provide an information processing apparatus and method, as well as a computer readable memory therefor, in which document image data can be utilized and managed more efficiently.
According to the present invention, the foregoing object is attained by providing an information processing apparatus for managing entered document image data, comprising: structure analyzing means for analyzing structure of the entered document image data; character recognition means for recognizing a character string in a text block that has been analyzed by the structure analyzing means; language analyzing means for applying language analysis to results of character recognition performed by the character recognition means; extraction means for extracting synonyms and equivalents of words obtained as results of language analysis performed by the language analyzing means; conversion means for converting a word obtained as the result of language analysis to a word in another language; translation means for translating a character string in a text block that has been analyzed by the structure analyzing means to another language; and storage means for storing at least results of analysis by the structure analyzing means, results of character recognition by the character recognition means and results of language analysis by the language analyzing means, and for storing at least one of the results of extraction by the extraction means, results of conversion by the conversion means and results of translation by the translation means in association with the results of character recognition.
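The flow from character recognition results to stored language data might be sketched as follows. This is a toy illustration under stated assumptions, not the claimed implementation: structure analysis and character recognition are assumed to have already produced the input string, language analysis is reduced to lowercasing and splitting on whitespace, the `SYNONYMS` and `BILINGUAL` tables are made-up sample data, and translation of whole text blocks is omitted.

```python
from typing import Dict, List

SYNONYMS = {"car": ["automobile", "vehicle"]}   # toy synonym dictionary
BILINGUAL = {"car": "voiture", "red": "rouge"}  # toy English-to-French dictionary


def analyze_language(text: str) -> List[str]:
    """Placeholder language analysis: strip punctuation, lowercase, split."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]


def process_block(recognized: str) -> Dict[str, object]:
    """Build one record that keeps derived data alongside the recognized text."""
    words = analyze_language(recognized)
    return {
        "recognized": recognized,                                     # OCR result
        "words": words,                                               # language analysis
        "synonyms": {w: SYNONYMS[w] for w in words if w in SYNONYMS},     # extraction
        "converted": {w: BILINGUAL[w] for w in words if w in BILINGUAL},  # conversion
    }


record = process_block("Red car.")
```

Keeping the extracted and converted words in the same record as the recognized string is one simple way to hold them "in association with the results of character recognition".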
The storage means preferably stores the results of analysis by the structure analyzing means and the results of character recognition by the character recognition means in a data structure in which these results are described logically.
The storage means preferably stores synonyms and equivalents, which are the results of extraction by the extraction means, as individual words and in a form linked to the results of character recognition by the character recognition means.
The storage means preferably stores the words of the other language, which are the results of conversion by the conversion means, in a form linked to the results of character recognition by the character recognition means.
The storage means preferably stores the results of translation by the translation means in a form linked to the results of character recognition by the character recognition means.
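One way to picture storage "in a form linked to the results of character recognition" is a pair of relational tables, with each synonym or foreign-language word stored as its own row pointing back at a recognition result. The schema, the table and column names, and the helper functions `store` and `find` below are assumptions made for illustration only, using Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recognition (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute(
    "CREATE TABLE linked_word ("
    " word TEXT, kind TEXT,"
    " recognition_id INTEGER REFERENCES recognition(id))"
)


def store(text, synonyms, foreign_words):
    """Store a recognition result and link each derived word to it."""
    cur = conn.execute("INSERT INTO recognition (text) VALUES (?)", (text,))
    rec_id = cur.lastrowid
    rows = [(w, "synonym", rec_id) for w in synonyms]
    rows += [(w, "foreign", rec_id) for w in foreign_words]
    conn.executemany("INSERT INTO linked_word VALUES (?, ?, ?)", rows)
    return rec_id


def find(word):
    """Return recognized texts linked to the given synonym or foreign word."""
    rows = conn.execute(
        "SELECT r.text FROM recognition r"
        " JOIN linked_word w ON w.recognition_id = r.id"
        " WHERE w.word = ? ORDER BY r.id",
        (word,),
    )
    return [r[0] for r in rows]


# Made-up sample data.
store("Red car.", ["automobile", "vehicle"], ["voiture"])
store("Blue sky.", [], ["ciel"])
```

With this arrangement, a query such as `find("automobile")` reaches the block recognized as "Red car." even though the word "automobile" never appears in the document itself.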
The foregoing object is also attained by providing an information processing method for managing entered document image data, comprising: a structure analyzing step of analyzing structure of the entered document image data; a character recognition step of recognizing a character string in a text block that has been analyzed by the structure analyzing step; a language analyzing step of applying language analysis to results of character recognition performed by the character recognition step; an extraction step of extracting synonyms and equivalents of words obtained as results of language analysis performed by the language analyzing step; a conversion step of converting a word obtained as the result of language analysis to a word in another language; a translation step of translating a character string in a text block that has been analyzed by the structure analyzing step to another language; and a storage step of storing, in a storage medium, at least results of analysis by the structure analyzing step, results of character recognition by the character recognition step and results of language analysis by the language analyzing step, and of storing, in the storage medium, at least one of the results of extraction by the extraction step, results of conversion by the conversion step and results of translation by the translation step in association with the results of character recognition.
The foregoing object is also attained by providing a computer readable memory storing program code for information processing for managing entered document image data, comprising: program code of a structure analyzing step of analyzing structure of the entered document image data; program code of a character recognition step of recognizing a character string in a text block that has been analyzed by the structure analyzing step; program code of a language analyzing step of applying language analysis to results of character recognition performed by the character recognition step; program code of an extraction step of extracting synonyms and equivalents of words obtained as results of language analysis performed by the language analyzing step; program code of a conversion step of converting a word obtained as the result of language analysis to a word in another language; program code of a translation step of translating a character string in a text block that has been analyzed by the structure analyzing step to another language; and program code of a storage step of storing, in a storage medium, at least results of analysis by the structure analyzing step, results of character recognition by the character recognition step and results of language analysis by the language analyzing step, and of storing, in the storage medium, at least one of the results of extraction by the extraction step, results of conversion by the conversion step and results of translation by the translation step in association with the results of character recognition.
Thus, in accordance with the present invention as described above, it is possible to provide an information processing apparatus and method through which document image data can be utilized and managed more efficiently.