In recent years, amid growing concern over environmental issues, the move to paperless offices has been promoted. Under these circumstances, a system has been proposed that implements reuse of objects, compressed storage of document images, and search for the original documents of printed documents (e.g., Japanese Patent No. 3,017,851). Specifically, this system uses a scanner to scan paper documents, such as documents accumulated and stored in binders or the like and distributed documents, segments the scanned document images into objects by analyzing their layout, and converts the objects into data by analyzing them.
For the search for original documents, the following search method is suitable. Feature amounts are calculated from original documents and scanned document images for each object attribute, such as text, photo, and line image. Then, a plurality of similarities, such as text similarity, photo similarity, and layout similarity associated with the layout of the respective objects, are calculated, and the calculation results are examined comprehensively (such a search will be referred to as “compound retrieval”).
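The idea of combining several per-attribute similarities into one ranking can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the cited patent: the `DocFeatures` structure, the choice of Jaccard overlap for text, histogram intersection for photos, box overlap (IoU) for layout, and the equal default weights.

```python
from dataclasses import dataclass


@dataclass
class DocFeatures:
    """Hypothetical feature amounts, one per object attribute."""
    words: set        # text feature: words found in text objects
    histogram: list   # photo feature: normalized color histogram
    boxes: list       # layout feature: (x0, y0, x1, y1) object boxes


def text_similarity(a, b):
    """Jaccard overlap of the two word sets (illustrative choice)."""
    if not a.words and not b.words:
        return 1.0
    return len(a.words & b.words) / len(a.words | b.words)


def photo_similarity(a, b):
    """Histogram intersection of two normalized histograms."""
    return sum(min(p, q) for p, q in zip(a.histogram, b.histogram))


def box_iou(p, q):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    iy = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = ix * iy
    union = ((p[2] - p[0]) * (p[3] - p[1])
             + (q[2] - q[0]) * (q[3] - q[1]) - inter)
    return inter / union if union > 0 else 0.0


def layout_similarity(a, b):
    """Mean of the best box overlap for each object box in a."""
    if not a.boxes or not b.boxes:
        return 1.0 if not a.boxes and not b.boxes else 0.0
    best = [max(box_iou(p, q) for q in b.boxes) for p in a.boxes]
    return sum(best) / len(best)


def compound_similarity(a, b, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the text, photo, and layout similarities."""
    scores = (text_similarity(a, b),
              photo_similarity(a, b),
              layout_similarity(a, b))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def compound_retrieval(query, registered, weights=(1.0, 1.0, 1.0)):
    """Return registered document names, best compound match first."""
    return sorted(
        registered,
        key=lambda name: compound_similarity(query, registered[name], weights),
        reverse=True,
    )
```

In this sketch the weights could be shifted toward text similarity for text-heavy documents or toward photo and layout similarity for image-heavy ones, which is one simple way to read "comprehensively examined".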
Compound retrieval can remarkably improve search performance. In particular, since a document management system handles a wide variety of documents, ranging from documents that include many objects with a text attribute to documents that include many objects with photo and line attributes, compound retrieval is highly effective in terms of search precision and search efficiency.
If a document to be registered in the document management system is digitized data generated by scanning a document image, a search index can be generated based on the information obtained by layout analysis and object analysis. Likewise, if a document to be registered is image data in a raster format, a search index can be generated based on the analysis information. However, if a document to be registered is application data generated by an application that runs on a personal computer (PC), such data cannot undergo similar analysis processing, and a search index cannot be generated. In other words, application data of a document generated on the PC and digitized data obtained by scanning a document printed on a paper sheet cannot be handled in the same way.
Of course, for a general-purpose application that is frequently used in the office, a program that analyzes the data of that application can be incorporated in the document management system as a module. The application data is then rasterized and undergoes analysis processing in the same manner as a scanned document image to generate a search index. However, since it is impossible to prepare such an analysis module for every application, a search index cannot be generated by the analysis processing for application data that the document management system cannot rasterize.
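The registration behavior described above, including the case in which no search index can be generated, can be summarized in a short sketch. All names here (`RASTERIZERS`, `build_index`, `register`, and the application labels) are hypothetical, and the analysis step is reduced to a stand-in.

```python
# Hypothetical registry of analysis modules: application name -> rasterizer.
# Only applications with a module in this table can be rasterized.
RASTERIZERS = {
    "word_processor": lambda data: f"raster({data})",
}


def build_index(raster):
    """Stand-in for layout analysis and object analysis of a raster image."""
    return {"source": raster}


def register(kind, data, application=None):
    """Return a search index, or None when the data cannot be analyzed."""
    if kind in ("scanned", "raster"):
        # Scanned documents and raster images are analyzed directly.
        return build_index(data)
    if kind == "application":
        rasterize = RASTERIZERS.get(application)
        if rasterize is None:
            # No analysis module exists for this application, so the
            # data cannot be rasterized and no index is generated.
            return None
        return build_index(rasterize(data))
    raise ValueError(f"unknown document kind: {kind}")
```

For example, `register("application", "drawing", "cad_tool")` returns `None` in this sketch because no module is registered for the hypothetical `cad_tool` application, which corresponds to the unsearchable case described in the text.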