Field of the Invention
The present invention relates to an electronic document processing apparatus and an electronic document processing method, and particularly to an electronic document processing apparatus and an electronic document processing method that extract texts from electronic documents containing electronic document layout information.
Description of the Related Art
Conventionally, a text search of an electronic document is performed by extracting the texts contained in the electronic document and determining whether a search key is included in them. A common search method determines whether at least a part of the search key is included in the extracted characters.
More restrictive search techniques include exact word-matching search, which determines whether a word is included in its entirety, and phrase search, which uses as a search key a phrase composed of a plurality of words separated by blanks. There is also full text search, which exhaustively searches electronic documents for a search word and, if an electronic document including a text that matches the search word is found, retrieves, as the search result, the location where the text is written.
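By way of illustration only, the difference between partial matching, whole-word matching, and phrase search can be sketched as follows. The function names (partial_match, whole_word_match, phrase_match) and the use of case-insensitive regular expressions are assumptions made for this sketch and do not describe any particular search engine.

```python
import re

# Sample extracted text (the same sentence pair used as an example in this description).
text = "He is a good boy. But, she is a bad girl."

def partial_match(text: str, key: str) -> bool:
    """Partial matching: a hit if the key occurs anywhere in the extracted characters."""
    return key.lower() in text.lower()

def whole_word_match(text: str, word: str) -> bool:
    """Whole-word matching: a hit only if the key occurs as a complete word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def phrase_match(text: str, phrase: str) -> bool:
    """Phrase search: the words of the key must appear consecutively, separated by blanks."""
    words = phrase.split()
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(partial_match(text, "goo"))        # True: "goo" is part of "good"
print(whole_word_match(text, "goo"))     # False: "goo" is not a complete word
print(phrase_match(text, "good boy"))    # True: the two words are adjacent
print(phrase_match(text, "bad boy"))     # False: "bad" is followed by "girl"
```

Note that phrase search depends on the words being adjacent in the extracted text, so it is sensitive to the order in which the text was extracted.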
Furthermore, there also exist sophisticated search techniques such as concept search, which allows the content to be searched for to be specified in the form of a sentence and retrieves information whose content is close to that sentence.
For example, let us assume that a text extracted from an electronic document is “He is a good boy. But, she is a bad girl.” With concept search, the search key “nice boy” brings a hit, as does “good boy”. However, “bad boy” brings no hits. The search behaves appropriately here because the extracted text is consistent with the concept that the text is meant to express.
Therefore, when performing concept search, the extracted text must be consistent as a Japanese sentence if it is written in Japanese, and consistent as an English sentence if it is written in English.
In an electronic document containing layout information of characters, on the other hand, there are cases in which the order of the commands expressing text drawing (hereinafter referred to as text drawing commands) and the drawing start positions that those commands specify on a page are independent of each other. For example, the first text drawing command may start from the center of the page, the second text drawing command may then start from the lower part of the page, and the last text drawing command may finally start from the upper part of the page.
In actual electronic documents, such a situation arises in PDF (Portable Document Format; registered trademark) or PDL (Page Description Language) files. For example, there is printer-driver-type software for creating PDF files. When a print instruction is issued from the word processing application or drawing application that created the original document, and a driver for creating a PDF is selected instead of a usual printer driver, this software creates a PDF file from the print command.
On this occasion, the order of text drawing in which the application that created the original document passes the print command to the PDF creating driver depends on the application. For example, the application may be a layout-free electronic document creating application (e.g., Microsoft Office PowerPoint, Microsoft Office Visio; both are registered trademarks). With such an application, when text drawing is performed without considering the sentence layout on a page, the text drawing commands may be entered into the PDF file created by the PDF creating driver in an order that significantly lacks consistency of sentences. In other words, although the coordinate positions on the page expressed by the text drawing commands are correct, the order of the text drawing commands in the PDF file is random.
In the case of a layout-free electronic document creating application, text objects are sequentially numbered and managed according to the order in which the operator created them. However, since the operator creates a document taking advantage of the layout-free operability, the order in which the text objects were created does not necessarily match the order needed to keep the sentences consistent. If a PDF file is created from such an electronic document, a PDF file such as that shown in FIG. 1 is created, for example.
FIG. 1 illustrates an exemplary preview of a PDF file 101 created by a layout-free electronic document creating application and an array of text drawing commands 102 in the PDF file. The reason why the text drawing commands are arranged in a manner such as the array 102 is that the text objects are created by a layout-free electronic document creating application. The order in which the text objects have been created on this occasion is: “Michael”, “Confidential”, “sushi”, “Michael”, “Possibly”, and “appreciates”. However, the text objects are subsequently rearranged in a manner shown by the preview 101 so that the sentence is composed according to the operator's intention. If this electronic document is converted into a PDF file, text drawing commands will be entered in an order shown by the array 102.
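As a rough sketch of the consequence, the following Python fragment joins the texts in the order of the text drawing commands and compares the result with the intended reading order. It assumes that the array 102 follows the creation order listed above, and that the intended order is the one stated for FIG. 2 later in this description; the variable names are hypothetical.

```python
# Texts in the order of the text drawing commands
# (assumed here to equal the creation order listed for FIG. 1).
commands_in_file_order = [
    "Michael", "Confidential", "sushi", "Michael", "Possibly", "appreciates",
]

# Naive extraction: concatenate the texts in command order.
extracted = " ".join(commands_in_file_order)

# Reading order intended by the page design (texts 202 to 207 of FIG. 2).
intended = "Possibly Michael appreciates sushi Michael Confidential"

print(extracted)
# Michael Confidential sushi Michael Possibly appreciates

# A phrase that exists on the page as the reader sees it
# is not found in the naively extracted text.
print("appreciates sushi" in intended)   # True
print("appreciates sushi" in extracted)  # False
```

Every individual word is still present in the extracted text, so a word search succeeds, but any search that relies on word adjacency or sentence structure does not.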
Since consistency of sentences is not preserved when texts are extracted from such a PDF file, a search engine that receives such extraction results can perform a word search at best, and the precision of a sophisticated search such as concept search is degraded.
In order to cope with this basic problem, Japanese Patent Laid-Open No. H08-194697 (1996), “Apparatus and method of identifying words described in a PDL file”, discloses exemplary prior art in which text drawing commands are sorted according to their coordinates when the texts in a page are acquired. Instead of extracting texts in the order in which the text drawing commands are described in the electronic document, the disclosed technique temporarily extracts all the text drawing commands together with the resource information associated with them, such as coordinates. Subsequently, the offset coordinates of the text drawing commands (the starting positions of text drawing) are sorted, and the texts are extracted in the order of the sort result, so that a text extraction result that follows the arrangement of the texts is obtained.
However, with layout-free electronic document creating software (an application), decorated text strings, such as those arranged in an arch-like manner, may be created, and in such cases the texts cannot always be acquired successfully by sorting them in the order of the offset coordinates of the text drawing commands.
In this specification, a “decorated text string” refers to an arrangement of texts in an arch-like, wave-like, circular (loop-like), square, or star-shaped manner, whereby texts are not tidily aligned along a predefined direction.
FIG. 2 illustrates an example where texts cannot be successfully acquired although text drawing commands are sorted in the order of offset coordinates, with the PDF file shown in FIG. 1 taken as an example.
In FIG. 2, texts (text string) are decorated so as to align in an arch-like manner. A text 202 denoting “Possibly” has coordinates (4, 20). A text 203 denoting “Michael” has coordinates (8, 25). A text 204 denoting “appreciates” has coordinates (12, 25). A text 205 denoting “sushi” has coordinates (20, 17). A text 206 denoting “Michael” has coordinates (5, 10). A text 207 denoting “Confidential” has coordinates (10, 10).
When the above texts are sorted in the order of the offset coordinates of the text drawing commands, the texts shown in FIG. 2 are acquired, based on their respective coordinates, in the following order: text 203, text 204, text 202, text 205, text 206, and text 207. However, for a PDF file such as that shown in FIG. 2, the order of texts intended by the design of this page is, naturally, texts 202, 203, 204, 205, 206, and 207.
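The sort result above can be reproduced with a minimal sketch using the coordinates given for FIG. 2. The sort key is an assumption: texts are ordered from the top of the page downward (larger y first, as in a coordinate system whose origin is at the lower left) and, within the same height, from left to right (smaller x first). This key is consistent with the order stated above, but the cited publication may define the sort differently. The TextCommand type and the sort_by_offset function are hypothetical names.

```python
from typing import List, NamedTuple

class TextCommand(NamedTuple):
    ref: int     # reference numeral used in FIG. 2
    text: str    # text drawn by the command
    x: float     # offset x coordinate (drawing start position)
    y: float     # offset y coordinate (drawing start position)

# Coordinates as given for FIG. 2; the listing order here is the intended reading order.
commands = [
    TextCommand(202, "Possibly", 4, 20),
    TextCommand(203, "Michael", 8, 25),
    TextCommand(204, "appreciates", 12, 25),
    TextCommand(205, "sushi", 20, 17),
    TextCommand(206, "Michael", 5, 10),
    TextCommand(207, "Confidential", 10, 10),
]

def sort_by_offset(cmds: List[TextCommand]) -> List[TextCommand]:
    """Assumed sort key: top of page first (descending y), then left to right (ascending x)."""
    return sorted(cmds, key=lambda c: (-c.y, c.x))

sorted_cmds = sort_by_offset(commands)
print([c.ref for c in sorted_cmds])
# [203, 204, 202, 205, 206, 207]
print(" ".join(c.text for c in sorted_cmds))
# Michael appreciates Possibly sushi Michael Confidential
```

Because the arch places “Possibly” (text 202) lower on the page than “Michael” and “appreciates” (texts 203 and 204), the coordinate sort moves it after them, even though it is the first word of the sentence.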
Depending on how the texts are decorated, the acquired texts may be randomly arranged due to the above-mentioned sorting in the order of offset coordinates of the text drawing commands.
Conventionally, when a sophisticated search such as concept search, which requires consistency of sentences over the entire page, is performed on an image including decorated texts, there have been cases in which the texts acquired by text string extraction differ from the original sentence.