1. Field of the Invention
The present invention generally relates to a document management system, and more particularly relates to a document management system which inputs documents as electronic images, and stores, displays, and searches the electronic images.
2. Description of the Related Art
In the descriptions below, Japanese double-byte characters (Hiragana, Katakana, and Kanji) are expressed in Latin alphabet letters.
An electronic filing system, which digitizes paper documents and stores the digitized documents, normally includes a function for searching the stored documents and a function for indicating relevant parts of retrieved documents to the user. For example, when a user searches for documents containing the search term “patent publication” and opens a retrieved document, occurrences of the search term “patent publication” in the retrieved document are highlighted. Such a function is called search result highlighting. There is also a method of searching documents in which various forms of a word are treated as the same word to increase the number of documents a search will find. For example, treating the Japanese words “memorii” and “memori” (both of which mean “memory” in English) as the same word may make it easier to find relevant documents. Similarly, treating various forms of a word written in upper case, lower case, single-byte characters, or double-byte characters, such as “Memory”, “MEMORY”, and “MEMORY (in double-byte characters)”, as the same word may make it easier to find relevant documents. Such a method of standardizing various forms of a word is called word form normalization. Conversely, generating various forms of a word from one form of the word is called word form denormalization. In word form denormalization, for example, “MEMORY”, “memory”, and “MEMORY (in double-byte characters)” are generated from the word “Memory”.
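By way of a non-limiting illustration, word form normalization and word form denormalization of the kind described above may be sketched as follows. The function names, the use of Unicode NFKC folding, and the particular set of rules are assumptions made for this illustration only and are not taken from any particular filing system. The sketch uses actual Katakana characters, whereas the description expresses them in Latin alphabet letters.

```python
import unicodedata

# Katakana-Hiragana prolonged sound mark (U+30FC).
PROLONGED_SOUND_MARK = "\u30fc"

# Offset between ASCII "!".."~" and their full-width (double-byte) counterparts.
FULLWIDTH_OFFSET = 0xFEE0


def normalize_word(word):
    """Map variant forms of a word to one index key (hypothetical rule set)."""
    # NFKC folds full-width Latin letters to ASCII; casefold() removes case differences.
    folded = unicodedata.normalize("NFKC", word).casefold()
    # Dropping the prolonged sound mark makes "memorii" and "memori" identical.
    return folded.replace(PROLONGED_SOUND_MARK, "")


def to_fullwidth(word):
    """Convert printable ASCII characters to their full-width forms."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "!" <= ch <= "~" else ch for ch in word
    )


def denormalize_case_and_width(word):
    """Generate case and width variants of a word (not an exhaustive list)."""
    cased = {word, word.upper(), word.lower(), word.capitalize()}
    return cased | {to_fullwidth(form) for form in cased}
```

In this sketch, normalize_word maps “Memory”, “MEMORY”, and “MEMORY (in double-byte characters)” to the same index key, and likewise maps “memorii” and “memori” to a single key, while denormalize_case_and_width generates case and width variants from the single form “Memory”.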
An exemplary process of search result highlighting is described below. When the operator enters one or more search keywords, a search subsystem searches documents and returns a list of the documents found. The operator selects a document in the list and displays the document. In the above process, the search subsystem performs word form denormalization on the search keyword and highlights all occurrences of the various forms of the search keyword in the displayed document. One disadvantage of this method is that word form denormalization may not always generate all forms of a search keyword. Consider a method of word form normalization in which all Katakana-Hiragana prolonged sound marks (“—”, a mark indicating a prolonged sound in a Japanese word) are removed from indexed words. In such a method, for example, the Japanese word “konpyuutaa” (“computer” in English) is normalized into “konpyuta” and added to the search index. Likewise, “konpyuuta” is normalized into “konpyuta”. As a result, “konpyuutaa” and “konpyuuta” are treated as the same word in the search index. Such a search index makes it possible to find documents containing different forms of a search keyword. However, a problem arises when the indexed word “konpyuta” is denormalized into its original forms. For example, “konpyuta” may be denormalized into many forms such as “konpyuuta”, “konpyuutaa”, “koonpyuuta”, “koonnpyuuta”, and “konnpyuuta”, as a result of inserting the Katakana-Hiragana prolonged sound mark “—” after each Katakana character. In such a method, the longer a word is, the greater the number of word forms generated by word form denormalization becomes. A huge increase in the number of word forms generated by word form denormalization results in an increase in processing time. Therefore, in practice, the word form denormalization process stops when the number of word forms exceeds a certain limit. In this case, the generated word forms may not always include all original forms of the normalized word.
In other words, words in a retrieved document that correspond to an indexed word but are not included in the generated word forms are not highlighted.
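The growth in the number of word forms and the effect of a limit may be illustrated by the following minimal sketch. It assumes, for simplicity, that at most one prolonged sound mark is inserted after each character, so that a normalized word of n characters yields 2 to the power of n forms. The names denormalize, highlighted, and limit, and the order in which forms are enumerated, are hypothetical and chosen only for this illustration.

```python
from itertools import islice, product

# Katakana-Hiragana prolonged sound mark (U+30FC).
PROLONGED_SOUND_MARK = "\u30fc"


def denormalize(normalized, limit=None):
    """Enumerate forms with the mark optionally inserted after each character.

    A word of n characters yields 2**n forms; `limit` (hypothetical) caps the count.
    """
    def forms():
        for flags in product((False, True), repeat=len(normalized)):
            yield "".join(
                ch + (PROLONGED_SOUND_MARK if flag else "")
                for ch, flag in zip(normalized, flags)
            )
    # islice with limit=None yields every form.
    return list(islice(forms(), limit))


def highlighted(words, forms):
    """Return the words of a document that would be highlighted."""
    known = set(forms)
    return [w for w in words if w in known]


document = ["コンピュータ", "コンピューター"]      # "konpyuuta", "konpyuutaa"
all_forms = denormalize("コンピュタ")               # "konpyuta", 2**5 forms
capped_forms = denormalize("コンピュタ", limit=3)   # generation stops early
```

With all forms generated, both words of the example document are matched. With the cap applied, “konpyuutaa” is no longer among the generated forms and would therefore not be highlighted, even though it corresponds to the indexed word “konpyuta”.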
Japanese Patent Application Publication No. 2005-135041 discloses a highly functional document image search/browse system having an OCR apparatus and a separate document processing apparatus. The OCR apparatus generates OCR data which includes reading hypothesis data containing multiple hypotheses of character line extraction, character segmentation, and character recognition; and document structure data having ruled line information, frame information, character line information, browse attribute information, and the like of a document image. The document processing apparatus provides a function for extracting important keywords from typed and handwritten character strings using the OCR data, a function for searching documents, and a function for displaying documents in a manner a user requests using the document structure data.
However, the purpose of the system disclosed in Japanese Patent Application Publication No. 2005-135041 is mainly to improve OCR accuracy, and the system requires a complex configuration and much time for OCR processing.