This invention relates generally to a processing system for document data, and more particularly to a document image processing system suitable as an input unit to an electronic document image file.
Conventional electronic document files merely store each page of a document as an image, and secondary information for information retrieval must be supplied separately from outside using code input means (e.g., a keyboard). In order to automate the file input operation, however, it is preferable that the secondary information be generated by automatically reading the titles, author names and the like described in the documents. In order to further improve information retrieval, it becomes necessary to realize automatic input of table captions and chapter captions, or automatic keyword extraction by recognition of the text itself. Segmentation of the image of the object document into portions such as captions, authors, abstract, text, figures, pictures, and the like has also been required in order to reduce the memory space and to increase the number of facets available for retrieval.
A system which understands the content of a document and processes the document on the basis of that understanding, in order to cope with the problems described above, has been investigated, and an example of such a system is disclosed in "Basic Studies on System for Cuttings of Newspaper Articles" by Yoji Noguchi and Junichi Toyota (Resume 6C-1 of the 23rd National Convention of Information Processing Society of Japan; 1981). However, since this document understanding system is directed to newspaper clippings, it is not clear whether the technique can be applied to documents having arbitrary formats. In addition, the character portions are merely segmented, and a method of combining segmentation with recognition is not disclosed.