There is a growing need for techniques that efficiently handle documents within organizations. For example, with the enactment of the Japanese SOX Act (the Financial Instruments and Exchange Act), corporations increasingly need to manage vouchers in the course of their business operations. In addition, information within corporations, particularly document data that has no fixed format and is therefore not stored in relational databases, is increasing rapidly (a phenomenon referred to as information explosion). Under such circumstances, there is also a growing need to manage and search for documents by metadata such as title, creation date, and author. In the case of operational documents, for example, if searches could be carried out by business IDs such as document title, client name, creation date, and order number, documents required for internal control audits could be found quickly. Similarly, in the case of design documents, searches by document title, department of origin, creation date, product code, etc., would contribute to effective utilization of technical information. Further, in the case of records of complaints and malfunction information, searches by date of occurrence, date of handling, product name, monetary damage, component name, etc., would contribute to faster handling should similar malfunctions occur. Finally, in the case of documents regarding operation rules, notifications, etc., searches by document type, creation date, period of implementation, etc., would contribute to efficient operations that comply with the rules.
Numerous techniques have been proposed for analyzing documents that are not of a fixed format and automatically acquiring metadata from them (e.g., see Patent Documents 1 to 3, and Non-Patent Documents 1 and 2). These references assume that the document type of interest is defined in advance. The features of the metadata written in documents of that type are examined in detail and held as a “model” for that document type. Matching is then performed between the model and the text strings that appear in a document, and it is inferred which text string corresponds to which element of the model (i.e., which text string is which item of metadata). The features used include layout features (e.g., “the title is often centered”), features of text strings that appear in proximity to metadata (e.g., “the order number often appears immediately to the right of the text string ‘order number:’”), and features of partial text strings of the metadata (e.g., “the client name often begins with ‘dokuritsu gyousei houjin’ (Japanese for ‘Independent Administrative Institution’)”).
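By way of illustration only, the model-based matching described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any of the cited references: the page width, the field names, the sample document, and all function names (`centered_text`, `value_after_label`, `starts_with`, `extract_metadata`) are hypothetical, and each field of the model is represented by a single hand-written feature.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumed fixed page width in characters (hypothetical, for the layout feature).
PAGE_WIDTH = 40

Extractor = Callable[[str], Optional[str]]


def centered_text(line: str) -> Optional[str]:
    """Layout feature: return the text if it is roughly centered on the page."""
    stripped = line.strip()
    if not stripped:
        return None
    left = len(line) - len(line.lstrip())
    right = PAGE_WIDTH - left - len(stripped)
    return stripped if left > 0 and abs(left - right) <= 2 else None


def value_after_label(label: str) -> Extractor:
    """Proximity feature: return the token immediately to the right of a label."""
    pattern = re.compile(re.escape(label) + r"\s*(\S+)")

    def extract(line: str) -> Optional[str]:
        match = pattern.search(line)
        return match.group(1) if match else None

    return extract


def starts_with(prefix: str) -> Extractor:
    """Partial-string feature: return the text if it begins with a known prefix."""

    def extract(line: str) -> Optional[str]:
        stripped = line.strip()
        return stripped if stripped.startswith(prefix) else None

    return extract


# Hypothetical "model" for one predefined document type: one feature per field.
ORDER_FORM_MODEL: Dict[str, Extractor] = {
    "title": centered_text,
    "order_number": value_after_label("Order number:"),
    "client_name": starts_with("Independent Administrative Institution"),
}


def extract_metadata(lines: List[str], model: Dict[str, Extractor]) -> Dict[str, str]:
    """Match each line against the model; keep the first match for each field."""
    found: Dict[str, str] = {}
    for line in lines:
        for field, extractor in model.items():
            if field in found:
                continue
            value = extractor(line)
            if value is not None:
                found[field] = value
    return found


# Made-up sample document for the usage example.
document = [
    "Purchase Order".center(PAGE_WIDTH),
    "",
    "Order number: A-1024",
    "Independent Administrative Institution Example Agency",
]
metadata = extract_metadata(document, ORDER_FORM_MODEL)
print(metadata)
```

For the sample document, the sketch assigns “Purchase Order” to the title, “A-1024” to the order number, and the last line to the client name. A practical model would typically combine several weighted features per field rather than rely on a single rule, which is why such models must be examined in detail for each document type.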
In addition, as presented in Patent Documents 4 to 6 and in Non-Patent Documents 3 to 8, efforts are already under way to prepare such models for automatic metadata acquisition automatically.