1. Field of the Invention
This invention relates to improvements in text and image processing methods and techniques, and more particularly to improvements in methods for word or term identification and location in document images, and still more particularly to improvements in methods for computer searching a number of document images for existence of query words or terms with reduced memory requirements.
2. Description of the Relevant Art
There has been increasingly widespread interest in document processing, both in electronic and in paper document forms. Often it is desired to locate particular search terms within a large corpus of documents; for example, in performing research to locate papers or publications that pertain to particular subjects, in finding particular testimony in deposition or discovery documents that contain particular words or phrases, in locating relevant court decisions in a legal database that have certain key words, and in manifold other instances.
Sometimes the documents are presented in electronic form in which the document text and images have been encoded in an electronic memory medium from which the documents can be retrieved for perusal or for "hard copy" or paper reproduction. In the past, when a large number of such documents were to be searched to locate one or more query terms, usually words, an index was built against which the query terms were compared. Such an index generally is formed of two parts. The first part is a document identifier (herein the "document id"). The document id is merely an identification of each document in the collection, and may be a number, key word or phrase, or other unique identifier. The second part is a word and the number of times the word appears in the document with which it is identified (herein the "word frequency").
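By way of illustration only, the two-part index described above might be sketched as follows. The function name `build_index`, the sample documents, and the whitespace-based word splitting are assumptions made for this example and are not taken from any particular prior system.

```python
from collections import Counter

def build_index(documents):
    """Build a two-part index: document id -> {word: word frequency}.

    `documents` is a hypothetical mapping of document id to document text.
    Splitting on whitespace is a simplifying assumption.
    """
    index = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        # Counter records how many times each word appears in this document.
        index[doc_id] = Counter(words)
    return index

# Hypothetical two-document collection.
documents = {
    "doc1": "the court held that the contract was void",
    "doc2": "the witness described the contract",
}
index = build_index(documents)
```

In this sketch every word of every document is held in the index at once, which is the arrangement whose memory cost is discussed in the paragraphs that follow.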
In the past, as shown in FIG. 1, to identify the particular documents in which search or query words exist, usually the index of all of the words is brought into a computer memory 10, and the query words are compared, one at a time, against each of the words in the memory. As each word is compared, a "score" is kept of the documents in which it appears. Thus, a first query word is processed 11, and a partial "score" is computed 13 for the first word. Then a next query word is processed 14, and a cumulative "score" is computed 16. Successive query words are processed in the same way, with the cumulative score updated for each, until all of the query words have been processed 17. After the last query word has been searched, the "scores" can be used to identify or sort the documents 18 in order of the number of "hits" by the query words, and a list of documents found can be displayed 19.
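The prior-art search loop of FIG. 1 can be sketched, under stated assumptions, as follows. The function name `score_documents` and the sample index are hypothetical, and treating the "score" as the sum of the word frequencies of the matching query words is one plausible reading of the description rather than a definitive one.

```python
def score_documents(index, query_words):
    """Accumulate a per-document score, one query word at a time.

    `index` maps document id -> {word: word frequency} and is assumed to be
    held entirely in memory, as in the prior-art approach described above.
    """
    scores = {}
    for word in query_words:                  # process each query word in turn
        for doc_id, freqs in index.items():   # compare against every document's words
            count = freqs.get(word, 0)
            if count:
                # Update the cumulative score for this document.
                scores[doc_id] = scores.get(doc_id, 0) + count
    # Sort by descending score; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical in-memory index.
index = {
    "doc1": {"the": 2, "contract": 1, "void": 1},
    "doc2": {"the": 2, "contract": 1, "witness": 1},
    "doc3": {"weather": 3},
}
ranked = score_documents(index, ["contract", "void"])
```

Here `ranked` lists only the documents containing at least one query word, ordered by cumulative score. The whole index must be resident before the first query word is processed.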
Such techniques, however, require a large amount of computer accessible memory, particularly for large document collections. The memory requirement often makes document searching impractical on personal or portable computers, even if the documents are stored on large capacity memory disks, and generally requires large mainframe computers with associated large memories.
In the field of image processing, direct paper document searching techniques have recently been proposed in which one or more morphological properties of the images on the document are processed and used for comparison against a query word, term or image. In accordance with such techniques, a document is scanned and the morphological properties of its various images are determined directly, without decoding the content of the images. To search a large corpus of documents, one technique that can be used is to generate an index similar to that described above, but with a list of frequencies of morphological properties used in place of the words. Again, especially in large document collections, a large amount of memory is required to perform search queries.