1. Field of the Invention
The present invention concerns a system for analyzing color documents. In particular, the present invention relates to a system in which various features of a color document are identified and in which the identified features are analyzed to create a hierarchical representation of the color document.
2. Incorporation by Reference
Commonly-assigned U.S. applications Ser. No. 07/873,012, now U.S. Pat. No. 5,680,479, entitled “Method and Apparatus For Character Recognition”, Ser. No. 08/171,720, now U.S. Pat. No. 5,588,072, entitled “Method and Apparatus For Selecting Text And/Or Non-Text Blocks In A Stored Document”, Ser. No. 08/338,781, entitled “Page Analysis System”, Ser. No. 08/514,252, entitled “Feature Extraction System”, Ser. No. 08/664,674, entitled “System For Extracting Attached Text”, Ser. No. 08/751,677, entitled “Page Analysis System”, Ser. No. 08/834,856, entitled “Block Selection Review and Editing System”, and Ser. No. 09/002,684, entitled “System For Analyzing Table Images”, are herein incorporated as if set forth in full.
3. Description of the Related Art
Conventional page segmentation systems receive image data representing a document, identify regions of the document based on the data, and process the data in accordance with the identified regions. For example, identified text regions may be subjected to optical character recognition (OCR) processing and identified image regions may be subjected to image compression. Additionally, conventional systems store data representing the physical layout of the document regions, ASCII data corresponding to text regions, and/or compressed data corresponding to image data from the document. Such systems thereby allow substantial reproduction or editing of the document using the stored data. However, these systems operate only on binary (black/white) image data.
In order to process color documents, conventional scanners are often used to convert the color documents into binary image data for subsequent input to a conventional page segmentation system. This conversion discards color information representative of colors within the color documents. Consequently, a conventional page segmentation system which receives thus-converted binary data of a color document is unable to accurately detect features of the color document which are borne out through the use of different colors.
For example, related text areas within a color document are often indicated using a common background color for each of the related areas. Similarly, an area of stand-alone text within a document can be designated by using a background color for the stand-alone area which is different from a background color used for the remaining areas of the document. Accordingly, such relationships between areas within a color document may be incorrectly identified using conventional segmentation systems.
Moreover, because conventional systems discard color information, they provide no means for storing data concerning colored regions of a document. As a result, even if related areas of a color document are successfully identified by a conventional page segmentation system, it is not possible to accurately reconstruct or edit the color document based on output from the conventional system.
Recent systems, such as Xerox Pagis™ and Adobe Acrobat™, are capable of inputting a color document and outputting a representation of the document containing color information. However, the output representation is often inaccurate and is difficult to edit.
In view of the foregoing, what is needed is a system for identifying and representing features of a color document in which colored regions of the document, as well as features borne out by the colored regions, can be accurately identified and also editably represented within a data structure.
The present invention addresses the foregoing problems by providing a system to binarize a color document so as to maintain a representation of colored regions within the document, to submit binarized image data of the document to block selection processing in order to identify features of the document, and to store a hierarchical tree structure identifying features of the document, wherein the hierarchical tree structure includes color information of the document. By virtue of the foregoing, features of an input color document can be identified and also represented within a small data structure from which a substantial representation of the document can be constructed or edited.
Therefore, in one aspect, the present invention is a system to identify features of a color document in which primary color values representing a color document are input, a threshold binarizing range is calculated based on the input values, the input values are binarized into binary values based on the threshold binarizing range, a colored region is identified within the document, and a frame is defined surrounding the identified colored region. In addition, a second threshold binarizing range is calculated based on input primary values corresponding to the colored region, and the input primary values corresponding to the colored region are binarized into binarized values based on the second threshold binarizing range. Preferably, a background color of the image is calculated based on input primary values corresponding to the threshold binarizing area and a background color of the colored region is calculated based on the input primary values corresponding to the colored region.
As a result of the foregoing system, a binary representation of a color document is produced in which a colored region within the color document is represented by a binary frame, and in which background colors of the document and of the colored region are calculated. In addition, the binary image data produced by the system can be input to block selection processing to identify features of the color document. Advantageously, the system thereby allows a conventional binary block selection system to accurately identify features of a color document. Moreover, information concerning the identified features, the frame, and the background colors can be stored and used to construct a substantial and editable representation of the document.
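The two-pass binarization described above can be illustrated with a minimal sketch. It is not the claimed implementation: the helper names (`within`, `dominant_color`, `binarize_with_frame`), the fixed `radius` standing in for a calculated threshold binarizing range, the use of the most frequent color as the background color, and the drawing of the frame on the outermost pixels of the region are all assumptions made for illustration. The colored region is assumed to have been identified already and is passed in as a bounding box.

```python
from collections import Counter

def within(pixel, center, radius):
    """True if every primary value of pixel is within radius of center."""
    return all(abs(a - b) <= radius for a, b in zip(pixel, center))

def dominant_color(pixels):
    """Most frequent color; an assumed stand-in for the calculated background color."""
    return Counter(pixels).most_common(1)[0][0]

def binarize_with_frame(image, region, radius=8):
    """image: rows of (r, g, b) tuples.
    region: (top, left, bottom, right), inclusive, assumed already identified."""
    # First pass: binarize the whole page against the page background.
    page_bg = dominant_color([p for row in image for p in row])
    binary = [[0 if within(p, page_bg, radius) else 1 for p in row]
              for row in image]

    # Second pass: re-binarize the colored region against its own background
    # and mark the region's outermost pixels as a binary frame.
    top, left, bottom, right = region
    region_bg = dominant_color([image[y][x]
                                for y in range(top, bottom + 1)
                                for x in range(left, right + 1)])
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            on_frame = y in (top, bottom) or x in (left, right)
            is_foreground = not within(image[y][x], region_bg, radius)
            binary[y][x] = 1 if on_frame or is_foreground else 0
    return binary, page_bg, region_bg

# Usage: a white page holding a yellow region with one dark pixel inside it.
WHITE, YELLOW, BLACK = (255, 255, 255), (255, 255, 0), (0, 0, 0)
page = [[WHITE] * 6 for _ in range(6)]
for y in range(1, 5):
    for x in range(1, 5):
        page[y][x] = YELLOW
page[2][2] = BLACK   # text pixel inside the colored region
page[0][5] = BLACK   # text pixel on the page background
binary, page_bg, region_bg = binarize_with_frame(page, (1, 1, 4, 4))
```

In this sketch the yellow interior becomes background (zero) after the second pass, so a binary block selection stage would see a framed area containing only its foreground pixels, while the two background colors are returned separately for storage.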
In another aspect, the present invention relates to a system to define a threshold binarizing area for binarizing a color image in which color pixel values corresponding to a color document are input, a core range of pixel values is determined from the input pixel values, a first number of pixel values within the core range of pixel values is calculated, a second number of pixel values within an outer layer of the core range of pixel values is calculated, and a threshold binarizing area is defined. In a case that the number of pixel values within a subject outer layer of the core range of pixel values is a local minimum value, the threshold binarizing area is defined as the subject outer layer and an area circumscribed by the subject outer layer. The resulting threshold binarizing area represents a range of colors approximating a background color of the color document. As a result, background pixels of the color document can be easily discerned by identifying pixels having values within the threshold binarizing area.
Therefore, the above system can be used to transform color pixel values of a color document to binary pixel values in which pixels having a value approximate to the background color are assigned values of “zero”, and in which all other pixels are assigned values of “one”.
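One way to picture the layer-by-layer search for the threshold binarizing area is the following sketch. Several details are assumptions made only for illustration: the core range is taken to be a small cube in color space centered on the most frequent pixel value, each outer layer is the shell of colors at a given Chebyshev distance from that center, and a layer is treated as a local minimum when its count is no greater than the counts of its inner and outer neighbors. The names `find_threshold_area`, `binarize`, `core_radius` and `max_radius` are hypothetical.

```python
from collections import Counter

def chebyshev(a, b):
    """Largest per-channel difference between two colors."""
    return max(abs(x - y) for x, y in zip(a, b))

def find_threshold_area(pixels, core_radius=1, max_radius=255):
    """Return (center, radius) of a cubic threshold binarizing area."""
    # Assumed core: a cube of half-width core_radius around the most common color.
    center = Counter(pixels).most_common(1)[0][0]
    # Number of pixel values in each shell around the center.
    layers = Counter(chebyshev(p, center) for p in pixels)
    # First number: pixel values inside the core range.
    previous = sum(n for d, n in layers.items() if d <= core_radius)
    for k in range(core_radius + 1, max_radius + 1):
        current = layers.get(k, 0)        # second number: count in outer layer k
        following = layers.get(k + 1, 0)  # count in the next layer out
        if current <= previous and current <= following:
            # Local minimum: the area is this layer plus everything it encloses.
            return center, k
        previous = current
    return center, max_radius

def binarize(pixels, center, radius):
    """Zero for pixels inside the threshold area (background), one otherwise."""
    return [0 if chebyshev(p, center) <= radius else 1 for p in pixels]

# Usage: a near-white background with slight noise and some dark text pixels.
pixels = ([(250, 250, 250)] * 90 + [(248, 249, 250)] * 6
          + [(246, 250, 250)] * 3 + [(10, 10, 10)] * 20)
center, radius = find_threshold_area(pixels)
bits = binarize(pixels, center, radius)
```

With this input the counts fall from 90 in the core to 6 in the first outer layer and to 0 in the next, which is the first local minimum, so the search stops there and only colors inside that layer are treated as background.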
In yet another aspect, the present invention relates to a hierarchical tree structure representing a document page which includes a plurality of nodes, each of which corresponds to a block of image data in the document page, and each of which contains feature data defining features of document image data of the corresponding block, wherein at least one node contains information regarding a color within the corresponding block, and wherein at least one node contains information regarding an artificial frame which represents an area of the document page. By virtue of the foregoing, a color document image can be accurately reconstructed and/or edited from a small amount of stored data.
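A hierarchical tree of this kind might be modeled as in the sketch below. The class name `BlockNode`, its fields (`kind`, `bbox`, `background`, `artificial_frame`, `children`) and the helper `walk` are hypothetical and chosen only for illustration; the feature data stored per node in an actual embodiment may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Color = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout

@dataclass
class BlockNode:
    """One node of the page tree; corresponds to one block of image data."""
    kind: str                            # e.g. "page", "frame", "text", "picture"
    bbox: Box                            # location of the block on the page
    background: Optional[Color] = None   # color information, if known
    artificial_frame: bool = False       # True for a frame added around a colored region
    children: List["BlockNode"] = field(default_factory=list)

    def add(self, child: "BlockNode") -> "BlockNode":
        """Attach child beneath this node and return it for chaining."""
        self.children.append(child)
        return child

def walk(node: BlockNode, depth: int = 0) -> Iterator[Tuple[int, BlockNode]]:
    """Yield (depth, node) pairs in pre-order, parents before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)

# Usage: a white page containing a yellow region, represented by an
# artificial frame, which in turn contains a block of text.
page = BlockNode("page", (0, 0, 600, 800), background=(255, 255, 255))
frame = page.add(BlockNode("frame", (50, 100, 300, 200),
                           background=(255, 255, 0), artificial_frame=True))
frame.add(BlockNode("text", (60, 110, 280, 180)))

for depth, node in walk(page):
    print("  " * depth + node.kind, node.bbox, node.background)
```

Storing the background color and the frame flag on the nodes themselves keeps the structure small while preserving enough information to redraw or edit the colored region later.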
This brief summary has been provided so that the nature of the invention may be understood quickly. A more complete understanding of the invention can be obtained by reference to the following detailed description of the preferred embodiments thereof in connection with the attached drawings.