1. Field of the Invention
The invention relates generally to word recognition in written documents and, more particularly, to searching imaged documents directly and identifying words in their source language.
2. Background Information
Accurate character recognition is crucial in today's document processing environment. For example, in the intelligence community, it is often of great importance to search a large volume of documents for specific key words. Many of those documents, however, may be in a foreign or even an obscure language. In an ideal environment, a staff of translators would translate all the documents into English prior to executing a machine-based search program for a specific key word or group of words. Human translation of a foreign language document into English, prior to conducting a computerized search, best ensures that the nuances of the foreign language are accurately captured. This approach, however, is limited by the shortage of foreign language translators in key languages, such as Arabic, or in exotic languages, such as Pashtu.
To overcome these limitations, a conventional automated document searching system, such as the system depicted in FIG. 1, has been employed. The document searching system 100 comprises three basic components, namely an automatic Optical Character Recognition (“OCR”) module 104, an automatic translation module 106, and a search engine 108. Together, these components make up a conventional image recognition system. As illustrated in FIG. 1, one or more documents 102 are provided to the character recognition module 104, which extracts characters from them. These documents 102 may already exist in electronic form. Alternatively, the documents may be in physical form, in which case the system 100 includes a scanner to convert each physical document into electronic form.
The automatic character recognition module 104 is used to discern individual characters from a document. Numerous optical character recognition systems are available to provide this function. Furthermore, if a document is written in a foreign language, the character recognition module 104 will employ OCR software packages tailored to specific foreign languages. In other words, specialized OCR software is capable of identifying characters unique to a foreign language.
Once the optical character recognition is complete, the automatic translation module 106 translates the OCR data from its foreign language into a target language such as English. Once the OCR data has been completely translated into searchable data, the search engine 108 is employed to search for a specific term or concept. Search concepts 110 are input into the search engine 108, for example in their English form. The search engine 108 scans the document's searchable data to determine whether the document contains characters, words, or phrases that match the input search concept 110. The documents identified as containing a match 112 are flagged and ultimately reviewed further for pertinence.
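The three stages described above can be sketched, purely for illustration, as the following Python outline. Every name here (`recognize_characters`, `translate`, `search`, `run_pipeline`, and the toy lexicon) is hypothetical and not taken from the system of FIG. 1; the OCR stage is a pass-through stub and the translation stage is a word-by-word lookup, which is far simpler than any real module 104 or 106.

```python
from typing import Dict, List

# Hypothetical toy lexicon standing in for the automatic translation module 106.
TOY_LEXICON: Dict[str, str] = {"la": "the", "maison": "house", "rouge": "red"}


def recognize_characters(document_image: str) -> str:
    """Stub for the OCR module 104: the 'image' is assumed to be text already."""
    return document_image


def translate(ocr_text: str, lexicon: Dict[str, str]) -> str:
    """Stub for the translation module 106: word-by-word lookup.

    Words missing from the lexicon pass through unchanged.
    """
    return " ".join(lexicon.get(word, word) for word in ocr_text.lower().split())


def search(translated_docs: Dict[str, str], concept: str) -> List[str]:
    """Stub for the search engine 108: ids of documents containing the concept word."""
    concept = concept.lower()
    return [doc_id for doc_id, text in translated_docs.items()
            if concept in text.split()]


def run_pipeline(images: Dict[str, str], concept: str,
                 lexicon: Dict[str, str] = TOY_LEXICON) -> List[str]:
    """Chain OCR, translation, and search, in the order described above."""
    translated = {doc_id: translate(recognize_characters(image), lexicon)
                  for doc_id, image in images.items()}
    return search(translated, concept)


matches = run_pipeline({"doc_a": "la maison rouge", "doc_b": "le chat"}, "house")
```

In this sketch the search only ever sees translated text, so its results can be no better than the output of the two stages before it.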
The difficulty with conventional document searching systems is that errors accumulate throughout both the character recognition process and the automatic translation process. Unfortunately, even with the best character recognition techniques, errors occur due to many factors including, but not limited to, the poor quality of the original document and extraneous document markings. Also, the inconsistencies inherent in handwriting overwhelm most computer-based recognition systems. Additionally, conventional character recognition techniques quite commonly mistake one letter for another. For example, the letter “m” is often misread as the letters “rn”. Such a misreading by the character recognition process may ultimately produce an untranslatable word when the output is subsequently subjected to an automatic translation system. Moreover, a slight misreading of characters may cause similar-looking words in the foreign language, with completely different meanings, to be interchanged with one another. As such, any small variation in the original document may result in an erroneous interpretation by the automatic translation system.
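As a rough illustration of how a single character confusion can propagate, the following Python sketch applies the “m” to “rn” misreading to a word and then attempts a dictionary lookup. The function names, the one-entry lexicon, and the error model (every “m” misread, character errors independent of one another) are assumptions made for this example only; real recognition errors are neither this uniform nor independent.

```python
from typing import Dict, Optional


def simulate_ocr_confusion(text: str) -> str:
    """Hypothetical error model: every 'm' is misread as 'rn'."""
    return text.replace("m", "rn")


def translate_word(word: str, lexicon: Dict[str, str]) -> Optional[str]:
    """Return the translation, or None when the word is not in the lexicon."""
    return lexicon.get(word)


def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that a whole word is read correctly, assuming independent
    character errors (a simplifying assumption)."""
    return char_accuracy ** word_length


lexicon = {"maison": "house"}            # illustrative one-word lexicon
clean = translate_word("maison", lexicon)
garbled = translate_word(simulate_ocr_confusion("maison"), lexicon)
# clean is "house"; garbled is None, so an English search for "house"
# would miss the document even though the source word was present.
```

Under the same independence assumption, a per-character accuracy of 98% gives `word_accuracy(0.98, 8)` of about 0.85, so roughly one eight-letter word in seven would contain at least one error before translation even begins.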
Furthermore, the machine translation process introduces errors of its own when it fails to recognize words. Although a word may be properly converted into electronic form from its foreign language, the meaning of the word often will not survive the machine translation process, resulting in data loss. This in turn may ultimately result in a highly important document going undetected.