The present invention relates to an image processing apparatus, an image processing method, and a storage medium, and more particularly, to an image processing apparatus, method, and storage medium for optically reading an original image and producing a color output.
Conventionally, when an original document is edited and reused, the image is read by a scanner, and character recognition is performed on the read image. Further, in recent years, as many documents have a complicated layout, layout analysis processing is performed prior to character recognition, and the precision of layout analysis processing has been increasingly valued.
If layout recognition and character recognition have been performed on an original document and the document is stored in the format of the processed result, the data is convenient to use in a search system.
The storage format in this case is usually the file format of a particular application program. For example, the format is RTF (Rich Text Format) by Microsoft Corporation, the Ichitaro (trademark of Justsystem Corporation) format, or, recently, the HTML (HyperText Markup Language) format often used on the Internet.
However, if documents are stored in the different formats of particular application programs, compatibility between the formats may not be achieved. Further, if document data is read by another application program, the layout of the document may differ from that of the original document, and thus the conversion precision may be lowered.
Accordingly, there has been a need for an intermediate format that holds as many types as possible of the information obtained from layout analysis and character recognition processing, and that maintains precision in conversion to various application programs.
Against this background, formats have been proposed that realize compatibility among various application software or systems by conversion processing. These formats are, for example, the SGML (Standard Generalized Markup Language) format and PDF (Portable Document Format). Such intermediate formats are needed and utilized. Here, for convenience of explanation, such a format will be called DAOF (Document Analysis Output Format) as a temporary format name.
When document images are filed or exchanged, the image data is stored or exchanged as bitmap data, either as-is or compressed. However, if the data remains in image form, a problem occurs when the data is used later, unless the data represents a natural image. For example, if an image including text is stored, a search using a character string in the text cannot be made. Further, the text cannot be re-edited by word-processor software or the like on a computer.
Accordingly, there is a need for a format that holds a document image in a compressed state and also holds the results of image analysis, such as character codes, a layout description, and descriptions of figures, pictures and the like, and that further allows the result of table structure analysis to be sent to spreadsheet software or the like.
As a solution, the DAOF format, devised by the present inventor, is used for analysis of a document image, and provides a data structure comprising, as results of document image analysis, data storage areas for a layout descriptor, a character recognition descriptor, a table structure descriptor and an image descriptor. The layout descriptor contains attribute information of the respective areas in the document (TEXT, TITLE, CAPTION, LINEART, PICTURE, FRAME, TABLE and the like) and rectangular area address information corresponding to the areas. The character recognition descriptor contains the results of character recognition on the character areas (TEXT, TITLE, CAPTION and the like). The table structure descriptor contains the details of the table structure of a portion determined as TABLE. The image descriptor contains image data, determined in the layout descriptor as PICTURE, LINEART and the like, cut out from the original image. FIG. 3A shows the structure.
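By way of a non-limiting illustration, the relationship between the area attributes held in the layout descriptor and the other descriptors may be sketched as follows. The names `AreaAttribute`, `Area` and `descriptor_for` are hypothetical and do not appear in the DAOF definition; treating a FRAME area as recorded in the layout descriptor only is likewise an assumption made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class AreaAttribute(Enum):
    """Area attributes named in the layout descriptor."""
    TEXT = "TEXT"
    TITLE = "TITLE"
    CAPTION = "CAPTION"
    LINEART = "LINEART"
    PICTURE = "PICTURE"
    FRAME = "FRAME"
    TABLE = "TABLE"


# Character areas: their recognition results go to the character recognition descriptor.
CHARACTER_ATTRIBUTES = {AreaAttribute.TEXT, AreaAttribute.TITLE, AreaAttribute.CAPTION}
# Image areas: their cut-out image data goes to the image descriptor.
IMAGE_ATTRIBUTES = {AreaAttribute.PICTURE, AreaAttribute.LINEART}


@dataclass(frozen=True)
class Area:
    """One layout descriptor entry: an attribute plus a rectangular area address."""
    attribute: AreaAttribute
    x: int
    y: int
    width: int
    height: int


def descriptor_for(area: Area) -> str:
    """Return the descriptor that holds the detailed analysis result of an area."""
    if area.attribute in CHARACTER_ATTRIBUTES:
        return "character recognition descriptor"
    if area.attribute is AreaAttribute.TABLE:
        return "table structure descriptor"
    if area.attribute in IMAGE_ATTRIBUTES:
        return "image descriptor"
    # Assumption: other areas (e.g. FRAME) are recorded in the layout descriptor only.
    return "layout descriptor only"
```

For example, `descriptor_for(Area(AreaAttribute.TITLE, 0, 0, 200, 40))` selects the character recognition descriptor, while a PICTURE area selects the image descriptor.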
The structure of these described analysis results is stored not only as intermediate data but also as a single file.
The results of document image analysis are stored in this manner. In addition, there is an increasing need to store color information as well as the character information and layout information in the document image.
The above-described DAOF structure is designed with emphasis on faithfully reproducing the results of layout recognition and character recognition. However, this structure does not enable faithful reproduction of the colors of the original image when the file information is displayed on a monitor or printed. Faithful color reproduction cannot be performed without color matching that matches the characteristics of the input device with those of the output device.
The present invention has been made in consideration of the above situation, and has as its object to provide an image processing apparatus, method, and storage medium which enable a color management system (CMS) to obtain an output result faithful to an original image regardless of the characteristics of the means for optically reading the original image.
According to the present invention, the foregoing object is attained by providing an image processing apparatus which optically reads a color original image by input means, and converts the read original image into color document data with a predetermined structure, comprising storage means for storing unique information indicative of input characteristics of the input means, used when reading the color original image, as a part of definition of the color document data.
Another object of the present invention is, when a paper document that has been read by a scanner or the like and subjected to document analysis is utilized on a computer, to reproduce the colors of a color image, especially the colors of a natural image in the document, close to the colors of the original paper document.
Further, another object of the present invention is to enable color reproduction based on color information of a character area of the above image as much as possible, and to reproduce a base color of the document as much as possible.
In accordance with preferred embodiments of the present invention, the foregoing objects are attained by providing a document image analysis data structure as follows.
(1) DAOF Header
(2) Scanner Profile
(3) Layout descriptor
(4) Character recognition descriptor
(5) Table analysis descriptor
(6) Image descriptor
(7) Color descriptor
In the embodiments, as the item to store the “Scanner Profile” is added as an extension, color reproduction in the image descriptor (6) becomes possible.
Further, the extension is effective in the color reproduction in the character recognition descriptor (4), and in the reproduction of base color of a document image represented in the color descriptor (7).
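As a minimal sketch only, the seven-part structure listed above may be modeled as a single container. The field types, the class names `ScannerProfile`, `ColorDescriptor` and `DaofDocument`, and the helper `section_names` are assumptions introduced for illustration; the source does not specify a concrete encoding for any of the parts.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


@dataclass
class ScannerProfile:
    """(2) Input characteristics of the device that read the original."""
    scanner_name: str             # scanner information (hypothetical field)
    color_characteristics: bytes  # opaque profile data (hypothetical field)


@dataclass
class ColorDescriptor:
    """(7) Base color of the document and colors of individual areas."""
    base_color: RGB
    area_colors: Dict[int, RGB] = field(default_factory=dict)


@dataclass
class DaofDocument:
    """Container whose fields follow the order (1) to (7) of the list above."""
    header: Dict[str, str]                                           # (1)
    scanner_profile: ScannerProfile                                  # (2)
    layout: List[dict] = field(default_factory=list)                 # (3)
    character_recognition: Dict[int, str] = field(default_factory=dict)  # (4)
    table_analysis: Dict[int, list] = field(default_factory=dict)    # (5)
    image: Dict[int, bytes] = field(default_factory=dict)            # (6)
    color: ColorDescriptor = field(
        default_factory=lambda: ColorDescriptor(base_color=(255, 255, 255))
    )                                                                # (7)


def section_names(document: DaofDocument) -> List[str]:
    """List the parts of the structure in their declared order."""
    return [f.name for f in fields(document)]
```

Because the Scanner Profile is carried in the same structure as the image, character recognition and color descriptors, whatever later converts their colors can find the input characteristics without consulting any external source.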
The above-described DAOF is generated by the following procedure.
(1) Color characteristics of a color image input device are obtained in the form of Scanner Profile. As the color characteristics differ in accordance with scanner type, scanner information of the color image input device is also stored.
(2) Next, document analysis is performed on a color document image to extract the above-described TEXT area, a table area, an image area and the like (layout descriptor). In the TEXT area, character recognition is performed (character recognition descriptor). In the table area, table analysis processing is performed (table analysis descriptor). In the image area (including a line image), a bitmap image is cut out and stored as data without conversion to character codes (for a figure portion such as a line image, vectors are obtained as necessary).
(3) In the color descriptor, color information of the areas extracted in the layout processing of step (2) is described. For example, a base color, the color in an area and the like are described.
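Steps (1) to (3) above may be sketched as follows, under several stated assumptions: layout analysis is taken as already done and supplied as a list of (attribute, rectangle) pairs; character recognition and table analysis are passed in as caller-supplied functions; and the base color and area colors are approximated by the most frequent pixel value, which is an illustrative choice and not a method stated in the source. All function and key names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

RGB = Tuple[int, int, int]
Rect = Tuple[int, int, int, int]  # x, y, width, height
Pixels = List[List[RGB]]


def crop(page: Pixels, rect: Rect) -> Pixels:
    """Cut the bitmap of one rectangular area out of the page."""
    x, y, w, h = rect
    return [row[x:x + w] for row in page[y:y + h]]


def dominant_color(pixels: Pixels) -> RGB:
    """Assumption: summarize a region by its most frequent pixel value."""
    counts = Counter(p for row in pixels for p in row)
    return counts.most_common(1)[0][0]


def generate_daof(page: Pixels,
                  areas: List[Tuple[str, Rect]],
                  scanner_name: str,
                  scanner_profile: bytes,
                  recognize_text: Callable[[Pixels], str],
                  analyze_table: Callable[[Pixels], list]) -> dict:
    daof = {
        # Step (1): store the color characteristics and the scanner information.
        "scanner_profile": {"scanner": scanner_name, "profile": scanner_profile},
        "layout": [], "character_recognition": {}, "table_analysis": {}, "image": {},
        # Step (3), page level: base color of the document.
        "color": {"base": dominant_color(page), "areas": {}},
    }
    for index, (attribute, rect) in enumerate(areas):
        region = crop(page, rect)
        # Step (2): record the area, then fill the descriptor matching its attribute.
        daof["layout"].append({"attribute": attribute, "rect": rect})
        if attribute in ("TEXT", "TITLE", "CAPTION"):
            daof["character_recognition"][index] = recognize_text(region)
        elif attribute == "TABLE":
            daof["table_analysis"][index] = analyze_table(region)
        elif attribute in ("PICTURE", "LINEART"):
            daof["image"][index] = region  # bitmap kept as-is, no character coding
        # Step (3), area level: color of each extracted area.
        daof["color"]["areas"][index] = dominant_color(region)
    return daof
```

The colors stored here are still device values of the scanner; they become meaningful for output only in combination with the stored Scanner Profile.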
To display on a computer, or to print in color, an electronic document described in the above format, the code descriptors must be converted to a format appropriate to the output destination. For example, to use the document in MS Word (trademark of Microsoft Corporation), the code descriptors must be converted to RTF (Rich Text Format). To print the document in color, the code descriptors must be converted to, e.g., PostScript (trademark of Adobe Systems Incorporated) format. In the image area, by contrast, only the position information must be converted to an appropriate format; the bitmap data itself is merely transferred. Before this transfer, to realize the CMS for the original image, the image data is converted by using the Scanner Profile together with the Monitor Profile or Printer Profile unique to the output device. The converted image data is then transferred.
Further, the color information of the respective areas described in the color descriptor is similarly converted, and after the CMS has been realized, converted into the respective description formats.
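The profile-based conversion described above may be illustrated by the following simplified sketch. It assumes that each profile can be reduced to a 3x3 matrix: the Scanner Profile maps scanner RGB to a device-independent space, and the Monitor Profile or Printer Profile maps that space to output-device RGB. Actual device profiles are generally more elaborate (for example, tone curves or lookup tables), so this is a sketch of the principle only, and the names `apply_matrix`, `match_color` and `match_image` are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Color = Tuple[float, ...]  # three components, each in the range 0.0 to 1.0


def apply_matrix(matrix: Matrix, color: Color) -> Color:
    """Multiply a 3x3 matrix by a 3-component color."""
    return tuple(sum(m * c for m, c in zip(row, color)) for row in matrix)


def match_color(color: Color, scanner_to_independent: Matrix,
                independent_to_output: Matrix) -> Color:
    """Convert one scanner color to the output device's color."""
    # Scanner Profile: device values of the input device -> device-independent values.
    independent = apply_matrix(scanner_to_independent, color)
    # Monitor or Printer Profile: device-independent values -> output device values.
    device = apply_matrix(independent_to_output, independent)
    # Clamp to the range the output device can accept.
    return tuple(min(1.0, max(0.0, v)) for v in device)


def match_image(pixels: List[List[Color]], scanner_to_independent: Matrix,
                independent_to_output: Matrix) -> List[List[Color]]:
    """Apply the same conversion to every pixel of a cut-out bitmap."""
    return [[match_color(p, scanner_to_independent, independent_to_output)
             for p in row] for row in pixels]
```

The same `match_color` step would apply to the base color and area colors of the color descriptor before they are written out in the target description format.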
As described above, the present invention enables faithful color reproduction, in addition to faithful document reproduction, when a document image is recognized and processed for reuse.