1. Field of the Invention
The present invention relates to systems in the form of computer software and hardware and methods for management of electronic image documents containing textual data.
2. Background of the Related Art
Historically, various physical documents, including paper documents, microfiche, and microfilm, have been used for information storage. Physical documents have been manually archived and indexed. Indexing physical documents enables a user to find a particular document within an archive. In many cases, the physical document is a paper document that has been recorded on microfilm or microfiche and archived in this form. Whatever the form of the physical document, whether paper document, microfilm, or microfiche, physical document storage systems have typically been bulky, labor intensive, prone to loss of physical documents through misfiling, and difficult to use.
More recently, physical documents have been stored as electronic image documents in image formats on digital media such as magnetic media and optical media, so that images of the physical documents may be retrieved by computer. An electronic image document may be created from a physical document, meaning a paper or microfiche or microfilm document, and then stored on digital media. The electronic image document is created by scanning or digitally photographing or otherwise converting the physical document to an image format using a combination of hardware and software. Some physical documents are derived from an electronic image document, so that conversion from a physical document would not be required, as the physical document already exists in an image format. All of the physical documents in an archive could be converted to electronic image documents and stored in image format in a data structure. Unfortunately, electronic image documents in a data structure are not efficiently searchable based on the content of the electronic image documents.
To make electronic image documents in a data structure searchable, prior methods linked text files that contained the textual information of each electronic image document with the corresponding electronic image document in the data structure. The data structure could then be searched using text-based search strategies and the corresponding electronic image documents retrieved from the data structure. Typically, the text file was created from an electronic image document by processing the electronic image document with an optical character recognition (OCR) software engine. The OCR engine analyzes the pixels of each electronic image document and recognizes the alphanumeric characters that may be contained within the electronic image document. When any subset of the pixels of the electronic image document is found to be an alphanumeric character by the OCR engine, the OCR engine then creates corresponding text characters in a corresponding text file. The text file may then be stored in a searchable data structure that links the text file to the electronic image document from which the text file was derived. Text-based search strategies, such as searching for a particular character string within the text file, would then link search results to the corresponding image file, so that the end user may then view the image file that contains the particular character string. However, there are inherent inefficiencies in this process.
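By way of illustration only, the linking of text files to electronic image documents described above may be sketched as follows. This is a minimal sketch and not the implementation of any particular prior method. The names `LinkedArchive`, `add`, and `search` and the file paths are hypothetical, and the OCR output is supplied as a plain string rather than produced by an actual OCR engine.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedArchive:
    """Hypothetical data structure linking OCR text to image documents."""

    # Maps an image document identifier (here, a file path) to its OCR text.
    text_by_image: dict[str, str] = field(default_factory=dict)

    def add(self, image_id: str, ocr_text: str) -> None:
        """Store the text file content under the image it was derived from."""
        self.text_by_image[image_id] = ocr_text

    def search(self, query: str) -> list[str]:
        """Return identifiers of images whose linked text contains the query."""
        needle = query.lower()
        return sorted(
            image_id
            for image_id, text in self.text_by_image.items()
            if needle in text.lower()
        )


# Assumed sample data standing in for real OCR output.
archive = LinkedArchive()
archive.add("scans/invoice_001.tif", "Invoice No. 1001 Total due: 250.00")
archive.add("scans/letter_002.tif", "Dear Sir, enclosed please find the report")

# A character-string search returns the linked image, not the text file.
matches = archive.search("invoice")
```

In this sketch the search result is the identifier of the electronic image document, so that the end user would be shown the image rather than the text file.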
The OCR engine requires a detail optimized electronic image document to most accurately generate a text file from an electronic image document. A detail optimized electronic image document may be created directly from the physical document. Alternatively, an electronic image document may be converted to a detail optimized electronic image document. The OCR engine then processes the detail optimized electronic image document to create the corresponding text file.
A detail optimized electronic image document may be defined as an electronic image document that optimizes the accuracy of the OCR process. Optimizing an electronic image document for detail to produce a detail optimized electronic image document may include producing a high resolution electronic image document in black and white. When a detail optimized electronic image document is processed by the OCR engine, the accuracy of the conversion of pixels in the electronic image document to text is maximized, and the error rate of the conversion of pixels in the electronic image document to text is minimized. The efficiency of conversion of pixels in the electronic image document to text by the OCR engine may also be improved by using a detail optimized electronic image document, so that using detail optimized electronic image documents may result in faster production of text files by the OCR engine. This would result in increased productivity when the OCR engine processes many detail optimized electronic image documents.
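As a hedged illustration of the black and white conversion mentioned above, the following sketch reduces a small grayscale pixel grid to two levels. The description does not state how the black and white image is produced; the fixed threshold of 128, the function name `binarize`, and the sample pixel values are assumptions made only for this example.

```python
def binarize(gray, threshold=128):
    """Map each grayscale value (0-255) to 0 (black) or 255 (white).

    The fixed threshold is an assumption for illustration; it is not
    taken from the description above.
    """
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


# Assumed sample: a light page with a dark stroke running through it.
gray_page = [
    [250, 240, 30, 245],
    [248, 20, 25, 242],
    [251, 235, 28, 249],
]

bw_page = binarize(gray_page)
```

Every pixel in the result is either fully black or fully white, which is the kind of high contrast input the description associates with accurate OCR processing.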
An accurate text file means that the text file accurately mirrors the text contained in the corresponding electronic image document, which provides a number of advantages. Having an accurate text file improves the ability to search the text file content using text-based search strategies, which makes the corresponding electronic image document more accessible. For example, a search in the data structure for a particular character string finds text files that contain the character string, and those text files accurately reflect the character string in the corresponding electronic image documents. Conversely, character strings in the electronic image documents are accurately reflected in the text file. Inaccuracies in the text file would mean that text-based search strategies, such as a search for a particular character string, would fail to uncover an electronic image document that contained that particular character string whenever the particular character string in the electronic image document was inaccurately reproduced in the corresponding text file. Thus, inaccuracies in the text file resulting from errors in the OCR process lead to loss of the information contained in the electronic image documents because of the inability to locate particular electronic image documents using text-based search strategies.
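The effect of an inaccurate text file on a character string search may be illustrated with the following minimal sketch. The function name `find_documents`, the image identifiers, and the specific character substitutions are hypothetical and are chosen only to show how a misrecognized character hides a document from the search.

```python
def find_documents(text_files, query):
    """Return identifiers of images whose text file contains the query string."""
    return [image_id for image_id, text in text_files.items() if query in text]


# Both images are assumed to show the words "quarterly revenue report".
text_files = {
    "image_A": "quarterly revenue report",  # accurate text file
    "image_B": "quarter1y revenue rep0rt",  # assumed OCR errors: l -> 1, o -> 0
}

hits = find_documents(text_files, "quarterly")
```

Here only `image_A` is found. The second image contains the same words, but it cannot be located because its text file reproduces the character string inaccurately.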
Although the detail optimized electronic image document maximizes the accuracy of the OCR process, the detail optimized electronic image document may be a large file, with corresponding increased storage requirements and slower retrieval time. Furthermore, a detail optimized electronic image document lacks visual appeal. This may be particularly true when the detail optimized electronic image document was originally derived from an electronic image document that included colored elements.
Alternatively, a visually optimized electronic image document may be created directly from the physical document, or an existing electronic image document may be converted into a visually optimized electronic image document. The OCR engine may then process the visually optimized electronic image document to create the corresponding text file. A visually optimized electronic image document retains the original colors of the electronic image document, and may eliminate details not necessary for a user to optimally perceive what is contained in the electronic image document. The visually optimized electronic image document is often a more appealing and, in some cases, more legible electronic image document than a corresponding detail optimized electronic image document. Furthermore, the file size of a visually optimized electronic image document can be smaller and, depending on the image content, may be significantly smaller than the file size of a corresponding detail optimized electronic image document. Thus, a visually optimized electronic image document may require less storage and have faster retrieval times than the corresponding detail optimized electronic image document.
However, an OCR engine may have a higher error rate when creating the text file from the visually optimized electronic image document. This increase in the error rate can reduce the accuracy of the text file, can reduce the ability to search the text file, and may affect the overall utility of the data structure.
Accordingly, prior methods have produced electronic image documents from physical documents or from electronic image documents having a balance of detail and content somewhere between the detail optimized electronic image document and the visually optimized electronic image document. The goal generally has been to create electronic image documents that strike a balance between the ability to be accurately processed by an OCR engine, the electronic image document file size, and the visual appeal of the electronic image document. However, the resulting electronic image document may be a compromise that does not include the advantages of either the detail optimized electronic image document or the visually optimized electronic image document. In other words, the resulting electronic image document may lack visual appeal and produce errors when processed by an OCR engine. An end user may be unable to locate some electronic image documents within the archive and the electronic image documents displayed to the end user may not be aesthetically pleasing.
Therefore, a need exists for apparatus and methods that can most efficiently convert electronic image documents to searchable text files while presenting a visually optimized electronic image document to an end user.