1. Field of the Invention
The present invention relates to document image analysis, and to text and object recognition techniques for the purpose of creating searchable files from document images. More particularly, the present invention relates to providing more efficient tools and techniques for human-based ground-truthing of the searchable files.
2. Description of Related Art
The longtime goal of vendors of text recognition technologies is to create 100% accurate computer searchable files, totally automatically, from a wide range of document images. However, after decades of trying, it has become increasingly apparent that this goal of automation may never be achieved. See, David Doermann, The indexing and retrieval of document images: A survey. Technical Report CS-TR-3876, University of Maryland, Computer Science Department, February, 1998.
So, to compensate for the limited automation of these technologies, human assistance is required. Specifically, text recognition technologies, which include, but are not limited to, Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR), require human assistance, referred to as ground-truthing, that involves human proofreading of the textual output, human comparison of this textual output with the original image text, and human correction of textual recognition errors. See, Doe-Wan Kim, and Tapas Kanungo. A Point Matching Algorithm for Automatic Groundtruth Generation. Technical Report: LAMP-TR-064/CS-TR-4217/CAR-TR-961/MDA-9049-6C-1250, University of Maryland, College Park, 2001; R. A. Wilkinson, M. D. Garris, J. C. Geist. Machine-Assisted Human Classification of Segmented Characters for OCR Testing and Training, Technical Report NISTIR 5105, December, 1992 and In D. P. D'Amato, editor, volume 1906. SPIE, San Jose, Calif., 1993; and Chang Ha Lee, and Tapas Kanungo. The Architecture of TRUEVIZ: A groundTRUth/metadata Editing and VIsualiZing toolkit. Technical Report: LAMP-TR-062/CS-TR-4212/CAR-TR-959/MDA-9049-6C-1250, University of Maryland, College Park, 2001.
For mainstream businesses and government agencies that wish to post mountains of scanned documents to public Web sites and corporate Intranets, this line-by-line checking for, and correction of, recognition errors is impractical. And since these mainstream organizations require 100% accuracy to ensure that their document images can be reliably retrieved, they have rejected these text recognition products entirely.
Nonetheless, with or without text recognition products, mainstream organizations do realize that a significant amount of human interaction is required in order to guarantee 100% retrieval. So, what these organizations are seeking is a way to make this time-consuming manual process far more efficient.
Thus, with this goal in mind, the present invention was created.
The present invention provides a method and a system by which a document image is analyzed for the purposes of establishing a searchable data structure characterizing ground-truthed contents of the document represented by the document image, and in some embodiments including resources for reconstructing an image of the document. According to the present invention, the document image is segmented into a set of image objects, and the image objects are linked with fields that store metadata.
Image objects are specified regions of a document image that may contain a structural element, where examples of structural elements include, but are not limited to, a single word, a title, an author section, a heading, a paragraph, a page, an equation, a signature, a picture, a bar-code, a border, a halftone image, noise, and the entire document image. The image objects into which the document image is segmented may or may not be exclusive, where exclusive image objects do not overlap with other image objects. In embodiments in which the document image consists of a bitmap, image objects may consist of portions of the bitmap that include a shape or shapes including black or colored pixels that are separated from other black or colored pixels by clear regions having specified characteristics.
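By way of illustration only, segmentation of a bitmap into image objects separated by clear regions can be sketched as a connected-component pass. The sketch below is not the segmentation process of the invention; it assumes a bitmap given as rows of characters in which `#` marks a black pixel, treats pixels as connected only when they touch horizontally or vertically, and uses the hypothetical name `segment_bitmap`.

```python
from collections import deque


def segment_bitmap(bitmap):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions.

    Assumption: `bitmap` is a list of equal-length strings, '#' = ink, anything else = clear.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != "#" or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected region of ink pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bitmap[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Illustrative page: three shapes separated by clear pixels.
PAGE = [
    "##..#",
    "##..#",
    ".....",
    "..#..",
]
boxes = segment_bitmap(PAGE)
```

A practical segmenter would likely merge nearby components into words or paragraphs according to adjustable parameters; that merging step is omitted here.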
The image objects are identified and linked with fields for storing metadata. The metadata is used to bind logical structure, and thus meaning, to image objects in the document image. Thus examples of metadata include, but are not limited to, indications, pointers, tags, flags, and plain text represented in computer readable code, such as ASCII, EBCDIC, Unicode, and the like. Image objects linked with metadata fields storing ground-truthed metadata can be organized into searchable records, such as hierarchically organized documents. Thus, the data structure including image objects and linked metadata can be independently searched, retrieved, stored, managed, viewed, highlighted, shared, printed, protected, indexed, edited, extracted, redacted, toggled (between image view and metadata view) and the like.
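As a minimal sketch of how an image object might be linked with a metadata field and then searched, the following uses a small record type. The names `ImageObjectPair` and `search`, the choice of fields, and the case-insensitive exact match are all illustrative assumptions rather than structures defined by the invention.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ImageObjectPair:
    """Hypothetical record: a region of the document image plus its metadata field."""
    page: int
    bbox: Tuple[int, int, int, int]   # (top, left, bottom, right) in pixels
    metadata: Optional[str] = None    # None until metadata is entered
    ground_truthed: bool = False      # True once a human has verified the metadata


def search(pairs: List[ImageObjectPair], query: str) -> List[ImageObjectPair]:
    """Return ground-truthed pairs whose metadata equals the query, ignoring case."""
    q = query.lower()
    return [
        p for p in pairs
        if p.ground_truthed and p.metadata is not None and p.metadata.lower() == q
    ]


# Illustrative data: the second pair is unverified, so it is not retrieved.
pairs = [
    ImageObjectPair(page=1, bbox=(0, 0, 10, 30), metadata="Invoice", ground_truthed=True),
    ImageObjectPair(page=1, bbox=(0, 40, 10, 70), metadata="invoice"),
    ImageObjectPair(page=2, bbox=(5, 5, 15, 35), metadata="INVOICE", ground_truthed=True),
]
hits = search(pairs, "invoice")
```

Because each hit carries its page and bounding box, the matching regions could be highlighted on the original image rather than shown as recognized text.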
In the present invention, an interactive framework is presented for efficiently ground-truthing document images via image objects paired with fields for ground-truthed metadata (called herein "image object pairs"). Here, ground-truthing an image object pair is accomplished by ground-truthing its metadata. More specifically, in one embodiment of the invention, in order to "ground-truth" an image object pair, the following two computer assisted steps are available:
1. Initial metadata is input into an image object pair by either (a) manually creating it, (b) automatically creating it (such as with text recognition, etc.), or (c) importing it.
2. The accuracy of this initial metadata is manually verified, or this initial metadata is manually corrected.
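The two steps above can be sketched as follows. This is an illustration under stated assumptions: a pair is modeled as a plain dictionary, the human reviewer is modeled as a callback that returns `None` to accept the proposed metadata or a string to correct it, and the names `ground_truth` and `demo_reviewer` are hypothetical.

```python
def ground_truth(pair, initial_metadata, review):
    """Apply the two steps to one image object pair (modeled here as a dict)."""
    # Step 1: input initial metadata (typed by hand, recognized, or imported).
    pair["metadata"] = initial_metadata
    # Step 2: the reviewer verifies the metadata (returns None) or corrects it.
    correction = review(pair["image_id"], initial_metadata)
    if correction is not None:
        pair["metadata"] = correction
    pair["ground_truthed"] = True
    return pair


def demo_reviewer(image_id, proposed):
    """Stand-in for a human operator who spots 'rn' misread in place of 'm'."""
    return "modern" if proposed == "rnodern" else None


pair_a = ground_truth({"image_id": "obj-1"}, "rnodern", demo_reviewer)  # corrected
pair_b = ground_truth({"image_id": "obj-2"}, "image", demo_reviewer)    # verified as-is
```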
Embodiments of the present invention increase the efficiency of human ground-truthing by using an index of unique image object pairs. This image object pairs index can eliminate the time and expense of ground-truthing each instance of each unique image object pair one-by-one, as required by text recognition products. Moreover, this index increases the efficiency of human ground-truthing even more as (1) the number of instances associated with any unique image object pair increases, and as (2) the accuracy of the segmentation process increases. Indeed, since the efficiency of human ground-truthing is so strongly influenced by the accuracy of segmentation, the present invention allows for human control over the segmentation process.
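The saving offered by an index of unique image object pairs can be illustrated with a small sketch. It assumes, for simplicity, that instances count as the same unique image object only when their bitmaps are identical; the names `build_index` and `apply_labels` and the sample shapes are hypothetical.

```python
from collections import defaultdict


def build_index(instances):
    """Map each unique shape to the locations of all of its instances.

    `instances` is a list of (location, shape) tuples; a shape is a tuple of
    pixel rows so that it can serve as a dictionary key.
    """
    index = defaultdict(list)
    for location, shape in instances:
        index[shape].append(location)
    return dict(index)


def apply_labels(index, labels):
    """Propagate one human-entered label per unique shape to every instance."""
    return {
        loc: labels[shape]
        for shape, locs in index.items() if shape in labels
        for loc in locs
    }


# Illustrative shapes standing in for two word images.
THE = ("###", ".#.", ".#.")
OF = ("###", "#.#", "###")
# Locations are (page, word number); five instances, two unique shapes.
instances = [((1, 0), THE), ((1, 1), OF), ((1, 2), THE), ((2, 0), THE), ((2, 1), OF)]

index = build_index(instances)
labels = {THE: "the", OF: "of"}      # two human decisions
resolved = apply_labels(index, labels)  # five instances ground-truthed
```

Here two human decisions ground-truth five instances. The ratio improves as instances per unique shape increase, which is consistent with point (1) above.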
The efficiency of human ground-truthing is also strongly influenced by the quality of the document images being processed. Specifically, poor quality document images that have a lot of ambiguous content, such as those created from faxed, aged, photocopied, and faded paper originals, may greatly reduce the effectiveness of an image object pairs index, and thus the efficiency of human ground-truthing. As a result, the present invention also describes a method for ground-truthing image object pairs without using an image object pairs index. This method is also useful for ground-truthing document images that contain a substantial amount of handwritten or hand-printed content.
Moreover, it should be pointed out that an image object pairs index is also extremely useful even when no ground-truthing occurs. For example, in one embodiment of the invention, an image object pairs index can be used to efficiently retrieve some, or all, of the instances of any unique image object pair contained within the index, when the metadata within each image object pair is NULL.
In one aspect of the invention, a method for analyzing a document image is provided which comprises segmenting the document image to identify a set of image objects within the document image, and processing the set to group image objects within the set into a plurality of subsets, where the subsets may include one or more members. In this aspect, reference image objects are linked to corresponding subsets in a plurality of subsets. Machine-readable data structures are created including the reference image objects with linked metadata fields, whereby image objects in the corresponding subsets are linked to common metadata in the linked metadata fields. The method includes presenting the reference image objects to the user, and accepting input from one or more users, to interactively populate the linked metadata fields with ground-truthed metadata, by inserting, deleting, editing and/or validating text, flags or other data about the image object in the linked fields. In some embodiments, the method further includes generating a searchable data structure to represent the document image, where the searchable data structure comprises the metadata linked to the set of image objects, and the set of image objects.
In some embodiments, the process of segmenting the document image includes presenting at least a portion of the document image with graphical constructs showing boundaries of the identified image objects in the set to the user, and accepting input from the user to interactively adjust the boundaries to form a new set of identified image objects. Also, the segmenting includes an automated process that identifies separate objects according to segmentation parameters. The user may adjust the segmentation parameters interactively to optimize the automated segmentation for a given document image.
Image objects identified by segmenting the document image are grouped into subsets in some embodiments, which facilitates ground-truthing. According to one approach, the image objects are grouped according to characteristics suggesting that the image objects may have common ground-truthed metadata. For example, image objects are grouped in some embodiments so that image objects in a particular subset consist of image objects having similar shapes. In some embodiments, the grouping process is executed with an adjustable parameter by which similarity among image objects of a subset is adjustable. For example, a threshold for a number of different pixels in the image objects within a subset may be adjusted in order to change the grouping of image objects.
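A minimal sketch of grouping with an adjustable pixel-difference threshold follows. It assumes a simple greedy scheme in which the first member of each subset serves as its reference and only equally sized bitmaps are compared; the names `pixel_difference` and `group_by_shape` are hypothetical, and other grouping strategies are equally possible.

```python
def pixel_difference(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def group_by_shape(objects, threshold):
    """Greedily group bitmaps whose difference from a subset's reference is <= threshold."""
    subsets = []  # each subset is a list; its first member acts as the reference
    for obj in objects:
        for subset in subsets:
            ref = subset[0]
            same_size = len(ref) == len(obj) and len(ref[0]) == len(obj[0])
            if same_size and pixel_difference(ref, obj) <= threshold:
                subset.append(obj)
                break
        else:
            # No existing subset was close enough, so start a new one.
            subsets.append([obj])
    return subsets


# Illustrative 3x3 bitmaps: a clean shape, a one-pixel-noisy copy, and a different shape.
CLEAN = ("###", "#.#", "###")
NOISY = ("###", "#.#", "##.")
OTHER = ("#..", "#..", "###")
objects = [CLEAN, NOISY, OTHER]

strict = group_by_shape(objects, threshold=0)    # every object in its own subset
tolerant = group_by_shape(objects, threshold=1)  # noisy copy joins the clean shape
```

Raising the threshold merges more objects into each subset, which reduces the number of entries to ground-truth but raises the risk of grouping shapes that do not share metadata.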
By grouping the image objects into subsets, the image objects may be indexed to facilitate the ground-truthing process. In some embodiments, the index of representative image objects is presented to the user in a table form. The table includes a set of entries that correspond to respective subsets of image objects within the set of image objects. Entries include the representative image objects for the respective subsets and fields for ground-truthed metadata. In the presentation of the table, the representative image objects are ordered according to similarity in shape, similarity in metadata, characteristics derived from the document image, such as position in the document image, or the like. Tools are provided to the user for interactively removing an image object, or group of image objects, from a selected subset, or moving image objects from one subset to another, and otherwise managing the grouping and indexing of image objects from the document image.
Representative image objects for the purposes of this indexing structure may be selected from the subset of image objects, or may be composed from a combination of more than one image object from within the subset, from sets of icons, or from other sources.
According to yet another aspect of the invention, the method includes segmenting the document image to identify a set of image objects, and creating machine-readable data structures pairing identified image objects in the set with the linked metadata fields. In this aspect of the invention, representations of the identified image objects are presented to the user, and audio input is accepted and translated using speech recognition tools, to interactively populate the linked metadata fields with ground-truthed metadata. In some embodiments according to this aspect of the invention, the image objects are presented to the user for ground-truthing in a reading order for the document image. Alternatively, representative image objects are presented to the user in an index grouping similar image objects, as discussed above.
The present invention may be applied in combination with other techniques for ground-truthing, and for facilitating the processing of document images. Thus, in one embodiment of the invention a method for analyzing a document image comprises segmenting the document image to identify a set of image objects, applying text recognition tools to produce proposed text for the set of image objects, and processing the set to group image objects within the set into a plurality of subsets as discussed above. Linked metadata fields for the image objects are populated with proposed metadata based on the text recognition process. The identified image objects, using reference image objects in some embodiments, are presented to the user, and input is accepted from the user to interactively populate the linked metadata fields with ground-truthed metadata. Thus, the proposed text provided by text recognition tools, such as optical character recognition, word recognition and the like, is presented to the user along with the representation of the image objects.
In various embodiments, the text recognition processing is applied to the entire document image or portions of the document image to facilitate contextual processing. In other embodiments, the text recognition processing is applied to the segmented image objects, or representative image objects, individually. Some embodiments may provide resources for performing text recognition at any point in the processing of the document image.
The present invention also includes a process by which analysis of documents can be leveraged among similar documents, by creating a library of representative image objects with linked metadata fields that can be applied in the analysis of new documents. Thus, in one aspect of the invention the process includes providing a database of representative image objects with linked metadata fields storing metadata. The document image is segmented to identify a set of image objects within the document image. The set of image objects is processed to match image objects in the set with representative image objects in the database, and to link the matching image objects in the set with particular representative image objects in the database. The image objects from the document image can be ground-truthed by presenting instances of image objects in the set that are linked with a particular representative image object in the database, and by accepting user input to interactively undo the link between selected image objects and that representative image object, to populate metadata fields of image objects that have not been linked to representative image objects in the database, and to move image objects that have been mistakenly associated with a particular representative image object in the database so that they become associated with another representative image object in the database.
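The library-based process can be sketched as below, under stated assumptions: the database is modeled as a dictionary of entries each holding a `shape` and its `metadata`, matching uses a pixel-difference tolerance, and the names `match_to_library` and `relink` are hypothetical. The sketch illustrates linking, undoing a link, and moving a link; it is not the matching method of the invention.

```python
def pixel_difference(a, b):
    """Count differing pixels; bitmaps of different sizes never match."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        return float("inf")
    return sum(p != q for x, y in zip(a, b) for p, q in zip(x, y))


def match_to_library(objects, library, max_diff=0):
    """Link each object id to the closest library entry within max_diff pixels."""
    links, unlinked = {}, []
    for obj_id, shape in objects.items():
        best_id, best_diff = None, None
        for rep_id, entry in library.items():
            diff = pixel_difference(shape, entry["shape"])
            if diff <= max_diff and (best_diff is None or diff < best_diff):
                best_id, best_diff = rep_id, diff
        if best_id is None:
            unlinked.append(obj_id)   # left for manual metadata entry
        else:
            links[obj_id] = best_id
    return links, unlinked


def relink(links, obj_id, new_rep_id=None):
    """Undo an object's link, or move it to another representative if one is given."""
    links.pop(obj_id, None)
    if new_rep_id is not None:
        links[obj_id] = new_rep_id


# Illustrative library built from earlier documents, and objects from a new one.
library = {
    "rep_o": {"shape": ("###", "#.#", "###"), "metadata": "o"},
    "rep_l": {"shape": ("#..", "#..", "###"), "metadata": "L"},
}
objects = {
    "obj1": ("###", "#.#", "###"),   # exact match for rep_o
    "obj2": ("###", "#.#", "##."),   # one pixel away from rep_o
    "obj3": (".#.", "###", ".#."),   # matches nothing in the library
}

links, unlinked = match_to_library(objects, library, max_diff=1)
relink(links, "obj2", "rep_l")   # user moves a mistaken association
relink(links, "obj1")            # user undoes a link entirely
```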
The present invention is also embodied by an apparatus that comprises a data processing system having a user input device, a display, memory or access to a memory storing the document image, and processing resources to perform the functions outlined above. In some embodiments, the data processing system is also linked via a communication medium to workstations at which a plurality of users may interactively work in the ground-truthing process.
Other aspects and advantages of the present invention can be seen on review of the drawings, the detailed description and the claims, which follow.