In recent years, there has been a need for a document management system that manages the various documents existing in a company (paper documents, facsimile letters, e-mail messages, Web pages and the like) by digitizing the documents and classifying them in a library on a computer for effective utilization and search. In such a document management system, for example, a paper document generated in a predetermined document form is read with a scanner and stored as image data in a document server. At the same time, index information in the form of text data, indicating a company name, an address and the like, is extracted from the image data, linked with the image data, and stored. For example, information indicating the location of the associated image data (a URL or the like) is held with the index. In this manner, a desired document (image) can easily be searched for from an index. Further, Japanese Patent Application Laid-Open No. 6-223113 discloses a system that extracts a keyword from an image in a document including text and images. According to the system disclosed in this publication, an image is subjected to character recognition, and a keyword is then selected by comparing words obtained by natural language processing with a keyword table.
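The linkage between index information and stored image data described above can be illustrated with a minimal sketch. It is not taken from any particular system; the names `IndexEntry`, `DocumentIndex`, `register` and `search`, as well as the example URL, are hypothetical and are used only to show text index items being held together with the location of the associated image.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexEntry:
    """Text index data held together with the location of its image."""
    fields: Dict[str, str]   # index items, e.g. company name, address
    image_location: str      # URL or the like of the stored image data


class DocumentIndex:
    """Hypothetical in-memory index; a real system would use a database."""

    def __init__(self) -> None:
        self._entries: List[IndexEntry] = []

    def register(self, fields: Dict[str, str], image_location: str) -> None:
        # Copy the dict so later changes by the caller do not alter the index.
        self._entries.append(IndexEntry(dict(fields), image_location))

    def search(self, item: str, value: str) -> List[str]:
        """Return the locations of images whose index item equals value."""
        return [e.image_location for e in self._entries
                if e.fields.get(item) == value]
```

In this sketch, a search on an index item such as a company name returns only the locations of the matching images, so the image data itself never has to be examined at search time.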
However, in the above document management system, how to link information in image data with an index item is a problem. As one method for extracting an index from image data, it has been proposed to perform character recognition on a predetermined area when a paper document is scanned, and to store the obtained text information as index information. In this method, the predetermined area is determined by the user setting, in advance, a character recognition area of the image data and an index item to be linked with that area. Accordingly, in this method, it is necessary to set in advance, in correspondence with the form of the document to be scanned, which area is to be subjected to character recognition as data for which index item (this setting is referred to here as "index extraction information"). This setting work complicates document registration in the document management system.
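As a rough illustration of what such index extraction information might look like, the following sketch maps each document form to a set of areas, each linked with an index item. Everything in it is an assumption made for illustration: the form name `invoice_form_a`, the area coordinates, and in particular `fake_recognize`, which stands in for real character recognition by cropping a grid of characters instead of analyzing an image.

```python
from typing import Callable, Dict, List, Tuple

# (left, top, right, bottom) in character cells; right and bottom are exclusive.
Area = Tuple[int, int, int, int]

# Hypothetical index extraction information, set in advance per document form.
INDEX_EXTRACTION_INFO: Dict[str, Dict[str, Area]] = {
    "invoice_form_a": {
        "company_name": (0, 0, 12, 1),
        "address": (0, 1, 12, 2),
    },
}


def fake_recognize(page: List[str], area: Area) -> str:
    """Stand-in for character recognition: crops text rows, not pixels."""
    left, top, right, bottom = area
    rows = [row[left:right].strip() for row in page[top:bottom]]
    return " ".join(rows).strip()


def extract_index(
    page: List[str],
    form: str,
    recognize: Callable[[List[str], Area], str] = fake_recognize,
) -> Dict[str, str]:
    """Recognize each preset area and store the text under its index item."""
    areas = INDEX_EXTRACTION_INFO[form]  # KeyError if the form was never set
    return {item: recognize(page, area) for item, area in areas.items()}
```

With this arrangement, a document whose form has no entry in the table cannot be indexed until someone adds the areas and index items for it, which is the setting work referred to above.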
Further, it may be arranged such that index extraction information for plural types of document forms is registered in advance, and the user selects the desired setting in correspondence with the document to be read. However, when many documents are to be read and plural types of forms exist, the user must select a setting for each document form, so that document registration is again complicated. Further, every time a new document form appears for which index extraction information has not been set, it is necessary to set and register index extraction information for that form in the document management system.