The present invention relates generally to document and image retrieval, and more particularly, to an imaged document optical correlation and conversion system that uses optical correlation (OC) technology to access textual and graphic information contained in imaged documents. The result is a powerful document management capability for paper archives and incoming hard copy.
Federal agencies responsible for the review and declassification of information (e.g., the Central Intelligence Agency, the Department of Defense, the Department of Energy) are facing looming deadlines for reviewing approximately 2 to 2.5 billion pages of documents. These agencies have identified a need for large-scale improvements in the productivity of the declassification process. Their most critical need is for an effective, automated process to convert paper archives to electronic form, allowing the additional processing that will make the information in the archives both releasable and useful. To make matters worse, a significant percentage of these documents are duplicates that should be eliminated before the declassification review process. The total process involves the conversion of billions of single- and double-sided hard-copy pages, index cards, and information that already exists in some electronic form into a managed, declassified and distributable form.
Outside of the federal government, there is a multi-billion dollar problem of managing the paper documents that persist as part of an organization's business process. Many companies have a historical backlog of paper documents that they must access efficiently. Other companies receive, create and disseminate paper documents as an essential element of their business processes. There is a need for a system that uses commercially available high resolution scanning to create images from paper documents, a large file management package to store the imaged documents, and an innovative application of optical correlation (OC) technology to access and organize the imaged documents that are created. Although documents containing only text can be searched with prior art techniques, the inventors are not aware of any method for automatically identifying scanned documents from images, or from images and text, without using optical character recognition.
It is, therefore, an object of the present invention to provide a method and apparatus for automatically identifying scanned documents by comparing a pattern against electronic versions of the scanned documents.
It is another object of the present invention to use an optical correlator for comparing the pattern against the electronic versions of the scanned documents.
It is yet a further object of the present invention to locate patterns within electronic versions of stored patterns.
It is another object of the present invention to index the scanned documents as wavelet transforms in a database and to store each pattern as a wavelet transform.
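As an illustration of what storing an image as a wavelet transform can look like, the following is a minimal sketch only. The text does not name a wavelet family; the single-level averaging Haar transform used here, and the function names `haar_1d` and `haar_2d`, are assumptions chosen for brevity. The sketch expects an image with an even number of rows and columns.

```python
def haar_1d(values):
    """One level of an averaging Haar transform (assumed wavelet, not from the text).

    Returns the pairwise averages followed by the pairwise half-differences.
    """
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details


def haar_2d(image):
    """Apply one Haar level to every row, then to every column."""
    rows = [haar_1d(row) for row in image]
    # zip(*rows) walks the columns; transform each, then transpose back.
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    page = [[4, 2],
            [0, 2]]
    # The top-left coefficient is the overall mean of the image.
    print(haar_2d(page))
```

The top-left coefficient holds the coarse average of the image and the remaining coefficients hold detail, which is one reason a wavelet representation can suit a compact index.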
These and other objects of the present invention are achieved using optical correlation (OC) technology, which has previously been used with great success to detect tanks and other weaponry in aerial imagery and is here applied to imaged pages that are stored as image templates. An image template of a search word, a classification, an agency seal or a particular individual's signature becomes the basis of a user query. The target to be detected can be text as image (a search word, a classification) or image as image (an agency seal, a signature). The result is a faster (not one or two times faster, but hundreds of times faster), flexible method of automatically identifying documents that match a target image template.
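The template matching described above can be illustrated in software. The following is a minimal sketch, assuming a digital stand-in for the optical correlator: it slides a small image template over a page and reports the location of the highest zero-mean normalized cross-correlation score. The function name `ncc_peak` and the list-of-lists image representation are hypothetical, and a direct spatial search like this does not reflect the speed of optical hardware.

```python
import math


def ncc_peak(page, template):
    """Return (best_score, (row, col)) of the template within the page.

    Both images are lists of equal-length rows of numbers (assumed format).
    The score is the zero-mean normalized cross-correlation, in [-1, 1].
    """
    th, tw = len(template), len(template[0])
    ph, pw = len(page), len(page[0])
    n = th * tw
    t_mean = sum(map(sum, template)) / n
    t_dev = [[v - t_mean for v in row] for row in template]
    t_norm = math.sqrt(sum(v * v for row in t_dev for v in row))

    best_score, best_pos = -1.0, (0, 0)
    for r in range(ph - th + 1):
        for c in range(pw - tw + 1):
            window = [page[r + i][c:c + tw] for i in range(th)]
            w_mean = sum(map(sum, window)) / n
            num = 0.0
            w_sq = 0.0
            for i in range(th):
                for j in range(tw):
                    d = window[i][j] - w_mean
                    num += d * t_dev[i][j]
                    w_sq += d * d
            denom = t_norm * math.sqrt(w_sq)
            # A flat window or flat template carries no pattern; score it 0.
            score = num / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


if __name__ == "__main__":
    seal = [[1, 0, 1],
            [0, 1, 0],
            [1, 0, 1]]
    page = [[0] * 8 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            page[2 + i][3 + j] = seal[i][j]
    print(ncc_peak(page, seal))
```

Because the score is normalized, the same template can serve as a query for a word image, a classification marking, a seal or a signature without rescaling a threshold for each target type.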
Organizations that keep extensive records, such as the intelligence community (CIA, DIA, NSA), the military (Army, Navy, Air Force, Marines), law enforcement (FBI, the Justice Department, state and local police departments), law firms and health care enterprises (HMOs) are all prime candidates to benefit from technology used in the present invention. For these organizations, the incoming stream of raw data on paper is a vital source of information. To take advantage of electronic distribution methods, the documents must be converted to electronic form. The first electronic version that is created from a paper document is typically a scanned image of the document, followed optionally on selected documents by optical character recognition (OCR), creating a second version of the scanned paper document. The OC brings a 400-fold increase in the speed of image analysis, allowing large amounts of imaged text to be quickly processed.
The foregoing objects are also achieved by a method of automatically identifying documents. An electronic version of a pattern stored in a first database is correlated with electronic versions of scanned documents stored in a second database. A signal is output that an electronic version of a pattern has been correlated with an electronic version of a scanned document.
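The two-database method can be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: the databases are modeled as in-memory dictionaries, the scorer `match_fraction` is a hypothetical placeholder for the optical correlator, and the 0.95 threshold is an arbitrary example value.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Image = List[List[int]]


def match_fraction(page: Image, pattern: Image) -> float:
    """Hypothetical scorer: best fraction of pattern pixels matched at any offset."""
    ph, pw = len(page), len(page[0])
    th, tw = len(pattern), len(pattern[0])
    best = 0.0
    for r in range(ph - th + 1):
        for c in range(pw - tw + 1):
            hits = sum(
                page[r + i][c + j] == pattern[i][j]
                for i in range(th)
                for j in range(tw)
            )
            best = max(best, hits / (th * tw))
    return best


def identify_documents(
    pattern_db: Dict[str, Image],
    document_db: Dict[str, Image],
    scorer: Callable[[Image, Image], float] = match_fraction,
    threshold: float = 0.95,  # assumed example value
) -> Iterator[Tuple[str, str, float]]:
    """Correlate each stored pattern with each stored document.

    Yields a (pattern_id, document_id, score) signal for every pair whose
    score reaches the threshold.
    """
    for pattern_id, pattern in pattern_db.items():
        for document_id, page in document_db.items():
            score = scorer(page, pattern)
            if score >= threshold:
                yield pattern_id, document_id, score


if __name__ == "__main__":
    patterns = {"seal": [[1, 1], [1, 0]]}
    documents = {
        "doc1": [[0, 0, 0], [0, 1, 1], [0, 1, 0]],
        "doc2": [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    }
    for signal in identify_documents(patterns, documents):
        print(signal)
```

Passing the scorer as a parameter keeps the database loop independent of how the correlation itself is performed, so a hardware-backed correlator could be substituted without changing the loop.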
The foregoing objects are also achieved by an article including at least one sequence of machine executable instructions. A medium bears the executable instructions in machine readable form, wherein execution of the instructions by one or more processors causes the one or more processors to correlate an electronic version of a pattern stored in a first database with electronic versions of scanned documents stored in a second database. A signal is output that an electronic version of a pattern has been correlated with an electronic version of a scanned document.
The foregoing objects are also achieved by a computer architecture for automatically identifying documents. The computer architecture includes correlating means for correlating an electronic version of a pattern stored in a first database with electronic versions of scanned documents stored in a second database. Outputting means are provided for outputting a signal that an electronic version of a pattern has been correlated with an electronic version of a scanned document.
The foregoing objects are also achieved by a computer system including a processor and a memory coupled to the processor, the memory having stored therein sequences of instructions which, when executed by the processor, cause the processor to perform the steps of correlating an electronic version of a pattern stored in a first database with electronic versions of scanned documents stored in a second database and outputting a signal that an electronic version of a pattern has been correlated with an electronic version of a scanned document.