1. Field of the Invention
The present invention relates to an image processing apparatus that generates, from a document image, electronic document data in which an object can be searched for, and to an image processing method and a computer-readable medium.
2. Description of the Related Art
Conventionally, paper documents and electronic documents that include “objects” (for example, photos, drawings, line drawings, and tables) and “object explanations” (text in the body, that is, the main text, that explains or comments on an object) are widely used. A term such as “FIG. 1” is often used to associate an object with its explanation. A term that associates the “object” and the “object explanation” with each other, like “FIG. 1”, is called an “anchor term”. In addition, a caption region near the “object” often includes the “anchor term” and an explanation (to be referred to as a “caption term”) that describes the object.
To extract only anchor terms from the body of, for example, an optically read paper document or an electronic document, advanced, heavy-load analysis using natural language processing and the like needs to be performed on all text information in the body. Such analysis processing requires holding knowledge about how an anchor term occurs or is used in the body. For this reason, it is difficult to accurately extract the anchor terms from the enormous quantity of text information in the body, and the processing load tends to become much heavier. The extraction accuracy is very important because it greatly affects the accuracy of the object link function for associating an object and an object explanation with each other.
On the other hand, object captions contain fewer characters than the body. Hence, the anchor terms can be obtained more easily by analyzing the captions than by analyzing the body. When analyzing a document, anchor terms may be extracted first from object captions. Then, the document may be searched for body parts including the relevant anchor terms. Holding the information of all analyzed pages in order to generate an electronic document would require an enormous storage capacity. To prevent this, when processing each page, only contents to be described in the electronic document of the page may be accumulated, and the remaining data may be discarded.
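The caption-first approach described above can be illustrated with a minimal Python sketch. It is not taken from the cited art: the regular expression, the function names `extract_anchor_term` and `find_in_body`, and the sample strings are all assumptions chosen for illustration, and a real document would need a much broader set of anchor-term forms.

```python
import re

# Hypothetical pattern: anchor terms that look like "FIG. 1", "Fig. 2A", or "Table 3".
ANCHOR_PATTERN = re.compile(r"\b(?:FIG\.|Fig\.|Figure|Table)\s*\d+[A-Z]?")


def extract_anchor_term(caption):
    """Return the first anchor term found in a short caption, or None."""
    match = ANCHOR_PATTERN.search(caption)
    return match.group(0) if match else None


def find_in_body(anchor_term, body):
    """Return the start offset of every occurrence of anchor_term in the body.

    The negative lookahead keeps "FIG. 1" from matching inside "FIG. 10" or "FIG. 1A".
    """
    pattern = re.compile(re.escape(anchor_term) + r"(?![0-9A-Za-z])")
    return [m.start() for m in pattern.finditer(body)]


# Illustrative data only.
caption = "FIG. 1 Block diagram of the apparatus"
body = "As shown in FIG. 1, the unit reads a page. FIG. 10 is unrelated. See FIG. 1."
anchor = extract_anchor_term(caption)
hits = find_in_body(anchor, body)
```

Only the short caption is analyzed to discover the anchor term; the body is then scanned with a plain pattern search, which avoids natural language analysis of the full text.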
In the processing form mainly using such page-basis processing, a search is performed on the text of the document after the anchor terms are extracted from the objects. Processing of associating each object with the object explanation is executed after all pages have been processed. When processing each page, information about objects in the page and text information need to be extracted and accumulated as the contents to be described in the electronic document. After all pages have been processed, anchor term extraction and a text search in the document are performed, and the objects are associated with the object explanations based on the accumulated information (Japanese Patent Laid-Open No. 11-25113).
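As a rough sketch of this two-phase flow, the following Python example accumulates a small amount of data per page and performs the association only once, after the last page. The page dictionaries, the `store` layout, the function names, and the anchor pattern are hypothetical and are not described in the cited reference.

```python
import re

# Hypothetical anchor-term pattern; real documents use many more forms.
ANCHOR = re.compile(r"FIG\.\s*\d+[A-Z]?")


def process_page(page_no, page, store):
    """Accumulate only the per-page data needed later; everything else is dropped."""
    for caption in page["captions"]:
        match = ANCHOR.search(caption)
        if match:
            store["anchors"][match.group(0)] = page_no
    store["texts"].append((page_no, page["body"]))


def link_all(store):
    """Run once after the last page: pair each anchor term with the body pages citing it."""
    links = []
    for anchor, object_page in store["anchors"].items():
        cite = re.compile(re.escape(anchor) + r"(?![0-9A-Za-z])")
        for text_page, text in store["texts"]:
            if cite.search(text):
                links.append((anchor, object_page, text_page))
    return links


# Illustrative two-page document: page 1 cites a figure that appears on page 2.
pages = [
    {"captions": [], "body": "The layout is shown in FIG. 2."},
    {"captions": ["FIG. 2 Layout of the sensor"], "body": "Details follow."},
]
store = {"anchors": {}, "texts": []}
for number, page in enumerate(pages, start=1):
    process_page(number, page, store)
links = link_all(store)
```

In this sample the body of page 1 cites a figure whose caption is found on page 2, so the anchor term is still unknown while page 1 is being processed. This is why the association is deferred until every page has been accumulated.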
When the object link function is implemented under the above assumption to enable easy reference between the “object” and the “object explanation”, operation components of the link function are arranged on the portions of the “object” and the “object explanation” to which the link function is to be added, thereby adding a function that allows easy reference. The correspondence relationship between the object and the object explanation can be determined after the analysis of all pages of the document has ended. The link to the object is generated using the result.
However, if the accurate position of an anchor term that is the link function addition target in the text of the body cannot be recognized, the operation component cannot be arranged. In addition, before the anchor term search in the body has ended, the character portion in the text of the body corresponding to the anchor term in the explanation is unknown. Hence, it is also necessary to accumulate information representing the position and size of each character in the text information after the page-basis processing. This problem will be described with reference to FIGS. 1A and 1B. Document data 101 and 102 shown in FIG. 1A include a document body 111 and an anchor term 112 included in the explanation of the document. The document data also include an object 113 that is a drawing in the document and a caption 114 of the object, and the caption includes an anchor term 115. Referring to FIG. 1B, a body 131 indicates part of the beginning of the body 111 in FIG. 1A. After the page-basis processing, information on the position and size of each character in the body is accumulated. Hence, to accumulate the entire information of the positions and sizes of all characters in the document, a considerably large storage capacity (work memory) is required.
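The growth of this work memory can be estimated with a short calculation. The Python sketch below assumes a hypothetical record of four unsigned 16-bit values (x, y, width, height) per character; the record layout, the character count per page, and the page count are illustrative assumptions and do not come from the document.

```python
import struct

# Hypothetical per-character record: x, y, width, height as unsigned 16-bit integers.
CHAR_RECORD = struct.Struct("<4H")


def layout_bytes(chars_per_page, page_count):
    """Bytes needed to keep one position/size record for every character."""
    return CHAR_RECORD.size * chars_per_page * page_count


# Assumed figures for illustration only: 2000 characters per page, 100 pages.
estimate = layout_bytes(chars_per_page=2000, page_count=100)
```

Under these assumptions the requirement grows linearly with both the number of characters per page and the number of pages, because a record is held for every character whether or not it belongs to an anchor term.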