Conventionally, the scan function of an MFP (Multi Function Peripheral) has been used to convert paper documents into electronic documents and thereafter to store or reuse the data. Scans by the MFP have used formats such as JPEG (Joint Photographic Experts Group) or TIFF (Tagged Image File Format). Recently, however, formats such as PDF (Portable Document Format) and XPS (XML Paper Specification) have come into use.
When the scan function of the MFP is used to store data after paper documents are converted into electronic documents, the data amount must be reduced, and compression by JPEG is therefore generally used. JPEG achieves a large compression effect on natural images, but when it is applied to characters and line drawings, their edge portions become blurred. In particular, when the compression ratio is increased in order to reduce the data size after encoding, the blurring of the edge portions of characters and line drawings becomes conspicuous.
Meanwhile, for formats capable of describing the structure of a document, such as the PDF or the XPS, a method has been proposed in which a character region, a background region and an image region are extracted by a layout analysis technique so that a high compression ratio can be reconciled with image quality. The compression method most suitable for each extracted region is selected, and high compression efficiency is thereby achieved for the document as a whole. Such a compression method is generally called “high compression PDF” or “high compression XPS.”
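The per-region selection described above can be illustrated with a minimal sketch. It assumes that layout analysis has already produced a list of regions. The names `RegionKind`, `Region`, `CODEC_FOR_KIND` and `plan_compression`, and the codec labels themselves, are hypothetical and chosen only for illustration; the cited methods do not specify which codec is assigned to which region.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class RegionKind(Enum):
    CHARACTER = "character"
    BACKGROUND = "background"
    IMAGE = "image"


@dataclass(frozen=True)
class Region:
    kind: RegionKind
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


# Assumption: illustrative codec labels only. Characters get a lossless
# bi-level coding so edges stay sharp; the other regions tolerate lossy coding.
CODEC_FOR_KIND = {
    RegionKind.CHARACTER: "lossless-bilevel",
    RegionKind.BACKGROUND: "jpeg-low-quality",
    RegionKind.IMAGE: "jpeg-high-quality",
}


def plan_compression(regions: List[Region]) -> List[Tuple[Tuple[int, int, int, int], str]]:
    """Pair each region's bounding box with the codec chosen for its kind."""
    return [(r.bbox, CODEC_FOR_KIND[r.kind]) for r in regions]


if __name__ == "__main__":
    page = [
        Region(RegionKind.CHARACTER, (40, 30, 560, 80)),
        Region(RegionKind.IMAGE, (40, 100, 300, 400)),
        Region(RegionKind.BACKGROUND, (0, 0, 600, 800)),
    ]
    for bbox, codec in plan_compression(page):
        print(bbox, codec)
```

Keeping the choice in a lookup table keeps the selection policy separate from the layout analysis, so the policy can be changed without touching region extraction.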
When the PDF or the XPS is used, a file can store not only image information but also meta-information other than image information. In addition, the following conventional techniques are known. Character regions such as the title, date and reporter of a text are extracted by the layout analysis technique, and search keywords obtained from the extracted character regions by use of an OCR (Optical Character Reader) function are added to the electronic document as a table of contents. Added value is provided by vectorizing a character object that is image data. The extracted keywords are transmitted to a search site as a search query (a search request), and the result acquired from the search site is displayed together with the input image. These techniques are disclosed in JP-A-2006-350551, JP-A-2004-348774, JP-A-2002-183165 and JP-A-11-184924.
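The step of transmitting extracted keywords to a search site as a search query can be sketched as follows. The sketch assumes the keywords have already been obtained by OCR and only builds the request URL. The function name `build_search_url`, the base URL `https://search.example/search` and the parameter name `q` are hypothetical placeholders, not taken from the cited documents.

```python
from typing import Iterable
from urllib.parse import urlencode


def build_search_url(keywords: Iterable[str],
                     base_url: str = "https://search.example/search") -> str:
    """Join non-empty keywords into one query string and URL-encode it.

    Assumption: the search site accepts the query in a parameter named "q".
    """
    cleaned = [k.strip() for k in keywords if k.strip()]
    return base_url + "?" + urlencode({"q": " ".join(cleaned)})


if __name__ == "__main__":
    # Keywords as they might come from OCR of a title region (illustrative).
    print(build_search_url(["scan", " PDF ", ""]))
```

Encoding the query with `urlencode` ensures that spaces and non-ASCII characters in OCR output are escaped before transmission.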
In the prior art, keywords are extracted by executing the OCR, the extracted keywords are transmitted to a search site as a search query (a search request), and a search result is obtained from the search site. However, the user can access the search information only by use of a dedicated device, and the user must explicitly specify keywords when storing the data. In particular, when a user converts a large number of paper documents into electronic documents, the user is required to input keywords for the documents one by one, which is very inconvenient for the user.
In JP-A-2006-350551, not all of the data can be utilized, because even though the OCR is used, its result is used only to generate a table of contents. JP-A-2004-348774 proposes a system that extracts keywords and then displays a search result from a search engine together with the original document in a browser provided in the system; however, a dedicated system is necessary, which is inconvenient for making use of the input document. The technique of JP-A-2002-183165 searches a document for keywords and enables a user to select a search query. However, it cannot be applied to anything other than a search performed after processing.