A technique has been disclosed that compresses input document data into compressed data by referring to a code conversion dictionary, which stores pairs each consisting of a word that forms one unit of meaning and a compression code (for example, refer to Japanese Laid-open Patent Publication No. 5-324730).
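The dictionary-based conversion described above can be illustrated with a minimal sketch. The dictionary contents, the one-byte code width, and the names `CODE_DICT`, `compress`, and `extend` are assumptions made for illustration only and are not taken from the cited publication.

```python
# Hypothetical code conversion dictionary: each word (one unit of meaning)
# is paired with a one-byte compression code. Contents are illustrative.
CODE_DICT = {"text": 0x01, "mining": 0x02, "is": 0x03, "fast": 0x04}
REVERSE_DICT = {code: word for word, code in CODE_DICT.items()}


def compress(document: str) -> bytes:
    """Replace each whitespace-separated word with its compression code."""
    return bytes(CODE_DICT[word] for word in document.split())


def extend(compressed: bytes) -> str:
    """Extension (decompression): map each code back to its word."""
    return " ".join(REVERSE_DICT[code] for code in compressed)


compressed = compress("text mining is fast")
print(compressed.hex())    # 01020304
print(extend(compressed))  # text mining is fast
```

In this sketch a word missing from the dictionary raises `KeyError`; how an actual implementation handles unregistered words is not stated here.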
In some cases, text mining processing is performed based on the compressed data. In such cases, extension processing (that is, decompression) is first performed on the compressed data, and text mining processing such as lexical analysis, syntax analysis, and semantic analysis is then performed on the extension data obtained by the extension processing.
A technique has also been disclosed that divides document data into words, calculates the appearance frequency of each word, and creates a word appearance frequency table in which the words are sorted in order of appearance frequency (for example, refer to Japanese Laid-open Patent Publication No. 6-348757 and the like). The processing for dividing the document data into words is referred to as lexical analysis.
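As a rough illustration of the word appearance frequency table, the following sketch divides a document into words and sorts them by appearance frequency. The regular-expression tokenizer is a simplifying assumption suitable only for space-delimited text, and the function names are hypothetical; the cited publication may perform lexical analysis differently.

```python
import re
from collections import Counter


def lexical_analysis(document: str) -> list[str]:
    """Divide the document into lowercase words (simplified tokenizer)."""
    return re.findall(r"[a-z0-9']+", document.lower())


def word_frequency_table(document: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs sorted by descending appearance frequency.

    Ties are broken alphabetically so that the table is deterministic.
    """
    counts = Counter(lexical_analysis(document))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


table = word_frequency_table("the cat saw the dog and the cat ran")
print(table[:2])  # [('the', 3), ('cat', 2)]
```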
[Patent Literature 1] Japanese Laid-open Patent Publication No. 5-324730
[Patent Literature 2] Japanese Laid-open Patent Publication No. 9-214352
[Patent Literature 3] Japanese Laid-open Patent Publication No. 6-348757
[Patent Literature 4] Japanese National Publication of International Patent Application No. 2005-530224
However, when the text mining processing is performed based on the compressed data, there is a problem in that the processing time needed to obtain the result of the text mining processing becomes longer. That is, the extension processing must first be performed on the compressed data, and the text mining processing is then performed on the extension data obtained by the extension processing. Therefore, the processing time from the instruction to perform the text mining processing until the execution result is obtained becomes longer.
Here, the problem in that the processing time from the instruction to perform the text mining processing until the execution result is obtained becomes longer will be described with reference to FIG. 1. FIG. 1 is a diagram of exemplary data management processing. FIG. 1 illustrates a case where the LZ77 and LZ78 compression algorithms are applied. As illustrated in FIG. 1, the data management processing compresses an uncompressed file by using longest matching strings and manages the resulting compressed file. When the instruction to perform the text mining processing is received, the data management processing extends the compressed file that is the target of the text mining processing and performs the lexical analysis. That is, the data management processing divides the extended string into words. The data management processing then counts the divided words and generates a count result. The generated count result is used for the text mining processing, and the execution result of the text mining processing is output. In this way, in the data management processing, the extension processing is performed on the compressed file before the text mining processing based on the compressed file is performed. Therefore, the processing time from the instruction to perform the text mining processing until the execution result is obtained becomes longer.
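The sequence described with reference to FIG. 1 can be sketched as follows. This is only an illustration under stated assumptions: Python's `zlib` module (DEFLATE, which combines LZ77-style matching with Huffman coding) stands in for the LZ77 and LZ78 compression of FIG. 1, the tokenizer is a simple regular expression, and the function name `mine_compressed_file` is hypothetical.

```python
import re
import zlib
from collections import Counter


def mine_compressed_file(compressed: bytes) -> Counter:
    """Extend the compressed file, divide it into words, and count them."""
    # Step 1: extension processing. The whole file is decompressed
    # before any analysis can begin (zlib is a stand-in for LZ77/LZ78).
    extended = zlib.decompress(compressed).decode("utf-8")
    # Step 2: lexical analysis. The extended string is divided into words.
    words = re.findall(r"\w+", extended.lower())
    # Step 3: the divided words are counted to produce the count result.
    return Counter(words)


original = "to be or not to be"
compressed_file = zlib.compress(original.encode("utf-8"))
count_result = mine_compressed_file(compressed_file)
print(count_result["to"])  # 2
```

Every call repeats step 1 in full, so the extension cost is added to each text mining request; this is the additional processing time that the paragraph above points out.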
Even when the technique for creating the word appearance frequency table is used, if the word appearance frequency table is created based on the compressed data, the extension processing is first performed on the compressed data. After that, the lexical analysis, the calculation of the appearance frequency, and the creation of the word appearance frequency table are performed on the extension data. Therefore, the processing time from the instruction to perform the text mining processing, including the processing for creating the word appearance frequency table, until the execution result is obtained becomes longer.