In recent years, demand for techniques for efficiently handling documents within an organization has increased. For example, with the enforcement of the Japanese version of the Sarbanes-Oxley Act (Financial Instruments and Exchange Act), companies have a growing need to manage vouchers in their business activities. In addition, intra-firm information, especially unstructured (atypical) document data that is not stored in a relational database, is increasing rapidly, and a phenomenon called "information explosion" is occurring. Under these circumstances, the need to manage and search documents by metadata such as the title, creation date and creator is also increasing. For example, in the case of business documents, if a search can be performed by the document name, customer name, creation date, or an operational ID such as an order number, a document required in an audit of internal control can be found quickly. In the case of design documents, if a search can be performed by the document name, creating department, creation date or product code, technical information can be utilized effectively. Further, in the case of documents containing complaint or failure information, if a search can be performed by the occurrence date, countermeasure date, product name, amount of damage or part name, a similar failure can be responded to promptly when it occurs. Likewise, in the case of documents such as operating rules and notifications, if a search can be performed by the document type, creation date or implementation period, jobs can be performed effectively in accordance with the rules.
Many techniques for analyzing atypical documents and automatically acquiring metadata have been proposed (see the following Patent Literatures and Non-Patent Literatures). In these techniques, it is effective to disregard space characters when reading the content described in a document, because metadata can then be extracted without being affected by space characters that were inserted only to adjust character alignment. For example, as illustrated in FIG. 1A, space characters are inserted to center text, or, as illustrated in FIG. 1B, space characters and a tab character are inserted to adjust alignment. In FIGS. 1A and 1B, "□ (square)" illustrated as reference numeral 100 indicates a double-byte space character, "● (dot)" illustrated as reference numeral 101 indicates a one-byte space character, and "→ (arrow)" illustrated as reference numeral 102 indicates a tab character. In order to extract metadata without being affected by such space characters, it is effective to skip the space characters when reading the character data.
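The skipping of space characters described above can be illustrated by the following minimal Python sketch. It is not taken from the cited literature. The names SKIPPED_CHARACTERS and read_without_spaces are hypothetical, and the sketch assumes that the double-byte space character is the ideographic space U+3000, the one-byte space character is U+0020, and the tab character is U+0009.

```python
# Hypothetical sketch: skip alignment-only space characters while reading text.
# Assumption: the three characters below correspond to reference numerals
# 100 (double-byte space), 101 (one-byte space) and 102 (tab) in FIGS. 1A and 1B.
SKIPPED_CHARACTERS = frozenset({
    "\u3000",  # double-byte (full-width) space
    " ",       # one-byte space
    "\t",      # tab
})


def read_without_spaces(text: str) -> str:
    """Return the text with every skipped space character removed."""
    return "".join(ch for ch in text if ch not in SKIPPED_CHARACTERS)


if __name__ == "__main__":
    # A title centered with double-byte spaces, as in the FIG. 1A situation.
    centered_title = "\u3000\u3000見\u3000積\u3000書\u3000\u3000"
    print(read_without_spaces(centered_title))  # prints 見積書
```

Because every listed character is dropped, spaces that separate words are removed together with spaces used only for layout, so a sketch of this kind is suited only to fields in which space characters carry no meaning of their own.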