Certain professional people often need to sift through large quantities of paper documentation in order to trace significant content. For example, a person may have to sift through boxes of paper documents looking for a few key words pertinent to a task at hand. With modern computer systems, it is possible to use search techniques to locate such content. Typically, use is made of an appropriate data structure containing metadata for supporting searching.
By metadata is meant data associated with data. Metadata can include, for example, a ‘postal code’ label for an integer, a ‘title word’ label for a word in a text title, a rectangle identifying an object's location on a page, a bitmap representing a word object in a digital image, features derived from a bitmap object, or the like. Metadata is typically generated to meet interpretation and administration needs, such as programmatic needs.
To obtain metadata from non-electronic documents, such as paper documents, documents on microfiche, or the like, such documents are typically converted, or captured, into a suitable digital format to provide a document image. This can typically be achieved by using an appropriate scanning device. After the documents have been converted into a digital format, a computer-based text recognition process, or system, can be used to provide a digital interpretation of objects found within the document image. An example of such a text recognition process is Optical Character Recognition (OCR). Conventionally, the digital interpretation can be output in the form of computer readable ‘character codes’ which are assigned to individual image objects within the document image. Examples of such codes are ASCII, EBCDIC and Unicode. Typically, such codes are output in electronic files that can be viewed and managed within applications, such as a word processor program, or the like. Examples of processes for capturing document images and applying text recognition to such captured document images are discussed below.
In U.S. Pat. No. 4,985,863 (“Document storage and retrieval system”), images are processed by a text recognition system that outputs character codes in such a way that characters, where recognition may be in error, have alternative code choices. Accordingly, provision is made for a search and retrieval system to locate content that is potentially ambiguous. For example, the recognition process might confuse the letter ‘l’ with the digit ‘1’. In such a case, the word “full” might be interpreted and output as “fu[l1][l1]” where the ambiguous characters are grouped within “[ ]” to signal multiple interpretations. Such an output could be retrieved by the keyword “full”, or the keyword “fu11”, for example. A similar approach is described in U.S. Pat. No. 5,905,811 (“System for indexing document images”). U.S. Pat. No. 6,480,838 (“System and method for searching electronic documents created with optical character recognition”) also discloses an invention dealing with recognition errors when retrieving textual content.
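By way of illustration only, the bracketed ambiguity notation in the “fu[l1][l1]” example can be matched against a keyword as sketched below. The function names `parse_ambiguous` and `matches` are hypothetical and chosen for this sketch; the code assumes nothing about how the cited patents actually implement retrieval.

```python
import re

def parse_ambiguous(encoded):
    """Split an encoded word such as 'fu[l1][l1]' into one candidate set per position."""
    positions = []
    # Each match is either a bracketed group of alternatives or a single character.
    for group, single in re.findall(r"\[([^\]]+)\]|(.)", encoded):
        positions.append(set(group) if group else {single})
    return positions

def matches(encoded, keyword):
    """Return True if the keyword is one of the readings allowed by the encoded word."""
    positions = parse_ambiguous(encoded)
    if len(positions) != len(keyword):
        return False
    return all(ch in candidates for ch, candidates in zip(keyword, positions))

print(matches("fu[l1][l1]", "full"))  # keyword spelled with letters
print(matches("fu[l1][l1]", "fu11"))  # keyword spelled with digits
print(matches("fu[l1][l1]", "fuel"))  # not an allowed reading
```

In this sketch both “full” and “fu11” are accepted, because each ambiguous position admits either character, while “fuel” is rejected.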
In U.S. Pat. No. 5,109,439 (“Mass document storage and retrieval system”), document images are run through a text recognition process during which search words are identified automatically and put into a table to facilitate later search and retrieval within the document images. The identification of such words is done using simple language-specific heuristics and there is no interaction with a user. Furthermore, the indexing process disclosed in this patent does not use metadata based on a bitmap appearance of the words (it uses only text recognition output codes), nor does it group items in the document image into subsets of the same, or similar, words as they occur in the document image. The invention disclosed is designed for applications such as mail processing, and does not provide for an interactive approach to searching like that of the present invention.
U.S. Pat. No. 5,706,365 (“System and method for portable document indexing using n-gram word decomposition”) describes an invention for constructing an index for a batch of documents on which a text recognition process has been performed. Not all of the documents input to the system need to be in image form. For example, some of the documents may already contain character code metadata. The index is based on n-grams and is designed so as to provide for correction of text recognition errors and to be small enough for porting to multiple other computer systems.
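A generic n-gram word index can be sketched as follows. This is a minimal illustration of n-gram decomposition in general, not the method of the cited patent; the padding character `$`, the function names, and the `min_overlap` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def ngrams(word, n=3):
    """Decompose a word into its set of character n-grams, padded at both ends."""
    padded = "$" * (n - 1) + word.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(words, n=3):
    """Map every n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def lookup(index, query, n=3, min_overlap=0.5):
    """Return indexed words sharing at least min_overlap of the query's n-grams."""
    query_grams = ngrams(query, n)
    hits = defaultdict(int)
    for gram in query_grams:
        for word in index.get(gram, ()):
            hits[word] += 1
    return sorted(w for w, c in hits.items() if c / len(query_grams) >= min_overlap)

index = build_index(["full", "fu11", "fuel", "index"])
print(lookup(index, "full", min_overlap=0.3))
```

Because a misrecognised word still shares some n-grams with the intended word, a lower threshold lets a query for “full” also surface “fu11”, at the cost of admitting unrelated neighbours such as “fuel”.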
In U.S. Pat. No. 6,768,816 (“Method and system for interactive ground-truthing of document images”), an index is constructed to speed up ground-truthing of document images. Typically, ground-truthing includes user input to teach a computer and can be performed before, or after, machine recognition of information is performed. For example, ground-truthing can be used to teach a computer to recognize certain patterns. A set of ground-truthed samples is input to a computer pattern recognition program so as to cause the computer to recognize specific patterns. In another example, a computer has been used for recognition and a user then goes through associated results so as to correct errors in interpretation. The invention, in particular, provides for an index which addresses error correction in text recognition processes. The index is constructed by grouping image objects into subsets based on features such as the bitmap shapes of the letters. For example, all instances of the letter ‘t’ in Times Roman font might be grouped together. If such instances were erroneously recognised by the text recognition method used, a user can apply a single correction command, which then takes effect over the entire subset of instances.
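The idea of grouping image objects into subsets, so that one correction command takes effect over a whole subset, can be sketched as below. The `shape_key` field is a hypothetical stand-in for whatever bitmap-shape feature a real system would compute; the class and function names are likewise assumptions made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ImageObject:
    page: int
    shape_key: str  # hypothetical identifier for a cluster of similar bitmap shapes
    code: str       # character code assigned by text recognition

def group_by_shape(objects):
    """Group image objects into subsets that share the same shape key."""
    subsets = defaultdict(list)
    for obj in objects:
        subsets[obj.shape_key].append(obj)
    return subsets

def correct_subset(subsets, shape_key, new_code):
    """Apply one correction to every object in a subset; return how many changed."""
    members = subsets.get(shape_key, [])
    for obj in members:
        obj.code = new_code
    return len(members)

objects = [
    ImageObject(page=1, shape_key="times-t", code="l"),  # misrecognised 't'
    ImageObject(page=2, shape_key="times-t", code="l"),  # misrecognised 't'
    ImageObject(page=2, shape_key="times-e", code="e"),
]
subsets = group_by_shape(objects)
corrected = correct_subset(subsets, "times-t", "t")
print(corrected, [obj.code for obj in objects])
```

A single call corrects both misrecognised instances, on different pages, without touching objects in other subsets.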
In spite of the teachings of the above inventions, the process of error correction can still be relatively time consuming. Many service professionals do not request that third party service bureaus, providing the capability of capturing paper documents into electronic form, perform such corrections.
With regard to the present invention, it appears to the Applicant that the most relevant prior art is that disclosed in U.S. Pat. No. 6,768,816. However, a significant limitation of this disclosure is that an index is not constructed ‘online’. Furthermore, the index is designed for ground-truthing entire documents into a full-text output result. A major disadvantage of such an approach is that users working with large batches of documents would need to wait until the entire index is constructed before they could begin searching for content. Furthermore, the index gets clogged with all image objects in the document image. Typically, most of the image objects are not of interest to the user.
Online construction, such as in the present invention, permits the user to begin working with initial documents already indexed, while simultaneously providing feedback to an index construction process, thereby to cause image objects of non-interest to be ignored in subsequent documents during the indexing process. In this way, the indexing process can be sped up and space can be saved. This can be very advantageous and enables, for example, a legal professional to commence working on initially scanned documents of a large batch of documents, so as to index and tag, or bookmark, information in the documents, while a personal assistant is scanning the rest of the batch of documents into a computer system. In addition, the present invention introduces ‘content filtering’ for greatly improved index performance, both in terms of construction speed and storage space.
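A toy model of online index construction with user feedback might look like the following. It is a sketch under stated assumptions: the class `OnlineIndex` and its methods are hypothetical, documents are reduced to lists of recognised words, and the choice to discard existing postings for an ignored word is a design decision of this example rather than something stated in the text.

```python
from collections import defaultdict

class OnlineIndex:
    """Index that is searchable while documents are still being added."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.ignored = set()              # words the user has marked as of no interest

    def add_document(self, doc_id, words):
        """Index a newly captured document, skipping filtered words."""
        for word in words:
            key = word.lower()
            if key not in self.ignored:
                self.postings[key].add(doc_id)

    def ignore(self, word):
        """User feedback: stop indexing this word (assumption: also drop old postings)."""
        key = word.lower()
        self.ignored.add(key)
        self.postings.pop(key, None)

    def search(self, word):
        """Return the ids of documents indexed so far that contain the word."""
        return sorted(self.postings.get(word.lower(), ()))

index = OnlineIndex()
index.add_document(1, ["The", "contract", "date"])
print(index.search("contract"))      # usable before the batch is complete
index.ignore("the")                  # feedback given while scanning continues
index.add_document(2, ["The", "contract", "party"])
print(index.search("contract"), index.search("the"))
```

The user can query after the first document is indexed, and the filter applied in between keeps the word of non-interest out of every later document, which is where the saving in construction time and storage would come from.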
The Applicant believes that existing art for capturing documents and identifying, or tagging, content therein, possibly incorporating some form of manual error correction, is inefficient for several reasons:
Firstly, professional persons, who typically work with large volumes of paper documents, currently often outsource the scanning and text recognition processes of the paper documents to third parties. Typically, such third parties do not participate in identification and management of textual content. For example, a scanning bureau that digitises paper documents for an attorney is typically unable to determine which objects in an associated document image would be of interest to the attorney. Accordingly, it would be inefficient to have the bureau personnel perform error correction on the full-text output of the text recognition process since, typically, only a small part of the full-text may be of interest to the attorney. Therefore, to perform error correction on the full textual content may be unnecessary. However, if such error correction is not performed, the attorney may not be able to locate content accurately where the interpretation output of a text recognition process includes erroneous interpretation of image objects.
Secondly, outsourcing the capture of paper documents into an electronic format can result in delays since, typically, the end user would not have access to the documents until the third party has returned them. Furthermore, outsourcing document capture and content bookmarking typically forces an ‘offline’ approach to document capture and content bookmarking. Accordingly, the benefits of an ‘online’ approach in document capture and content bookmarking can typically not be employed when document capture is outsourced. Furthermore, in some cases, it may not be prudent to outsource document capture, since such documents may be confidential.
Thirdly, resultant electronic files as returned to the end user are typically not indexed for querying the document image for content by, for example, a keyword search, or the like. A common approach is to return to the user a collection of files in, for example, Adobe™ Portable Document Format (PDF)™, or the like, in which the files comprise the document image and text recognition codes against which keyword queries may be performed. Typically, actual construction of an index across such a file collection would require the end user to employ additional resources, such as a special program designed to construct an index of text within a batch of PDF files. This typically involves additional computational overhead to construct such an index.
Fourthly, text recognition applications used to capture documents are typically not included within a program used to display the documents and to provide a search interface to the user. It would be advantageous if a single application could be provided which performs text recognition when capturing documents, displays associated document images to a user, and provides a search interface to the user.
It is an object of the invention to provide a solution arranged to at least alleviate the problems mentioned above.
It is an object of the invention to provide a process arranged to enable an end user to provide feedback while constructing an index directly from source documents during an indexing process.
Advantageously, the invention provides an ‘online’ approach during conversion, or capturing, of documents. Accordingly, the process is arranged to inhibit delays and expense typically associated with ‘full-text’ capture. Advantageously, the process of the invention is arranged to inhibit having to perform error correction of the full-text of a captured document. Furthermore, the process provides a relatively efficient content identification and tagging process. Text recognition performed on content of interest to the user can be corrected relatively easily and efficiently, by using an index model to propagate a correction over one or multiple error locations. Text recognition errors can also be addressed by using flexible ‘fuzzy’ search methods.
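One common form of ‘fuzzy’ search is matching within a bounded edit distance. The sketch below uses the standard Levenshtein distance as an assumed example of such a method; the text does not specify which fuzzy technique is used, and the function names and the `max_distance` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        previous = current
    return previous[-1]

def fuzzy_search(terms, query, max_distance=1):
    """Return indexed terms within max_distance edits of the query."""
    q = query.lower()
    return sorted(t for t in terms if edit_distance(t.lower(), q) <= max_distance)

terms = ["fu1l", "fu11", "index"]
print(fuzzy_search(terms, "full", max_distance=1))
print(fuzzy_search(terms, "full", max_distance=2))
```

Raising `max_distance` tolerates more recognition errors per word but also admits more false matches, so a small bound is usually a reasonable starting point.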
It is believed that the efficiency of the process of the invention is so much better than that of existing methods that many professional persons could use the invention not only to process a desired batch of documents identified within a larger batch, but also to process the entire larger batch at the outset. For example, large-scale litigation cases often require that attorneys examine an initial collection of documents at relatively great time and expense, in order to identify which documents therein are of interest. The documents of interest are then captured into a computer system (usually by a third party service bureau). The entire larger batch is typically not provided to a service bureau since the existing art, typically, makes it too expensive and time consuming to have the entire batch of documents captured. The present invention provides users with a relatively efficient way of identifying such documents without the time delay and cost overheads characteristic of methods currently employed.