The amount of data processed by information processing systems has increased year by year as the range of fields to be processed widens. Further, the types of data vary greatly, from conventional business-use data to real-world information typified by sensor data. Furthermore, in recent years, vigorous attempts have been made to acquire new knowledge for business and society by analyzing log data that is generated in the course of processing by the information processing system and that has been considered to have little value. A great amount of data including such log data is referred to as “big data” and, as a foundation for realizing great-amount data processing, high-speed data processing by the information processing system is demanded more than ever before.
Roughly two types of methods are available for speeding up the information processing system. One is referred to as “scale up” and improves the performance of a stand-alone computer. The other is referred to as “scale out” and improves the performance of the information processing system by arranging a plurality of computers. Recently, the performance gains obtained by the scale up have become smaller than before, and improving performance by the scale out has become mainstream. Furthermore, in distributed processing of the scale-out type, a plurality of computers built from commodity hardware are arranged to realize the distributed processing at a moderate cost, and a distributed processing foundation that a user can conveniently use is provided for the information processing system.
As described above, since the use of commodity hardware is assumed, the distributed processing foundation used in recent years realizes the scale-out property by dispersing and arranging the file to be processed across the computers, so that the distributed processing is performed at a high speed.
Further, to run the distributed processing foundation at a high speed, a dedicated file system is prepared. Since the file system also assumes the use of commodity hardware, fault tolerance of the file is realized by redundantly retaining the file to be processed in the plurality of computers.
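The dispersed and redundant arrangement described above can be illustrated by a minimal sketch. It is not the placement policy of any particular file system. The function names `place_blocks` and `readable_after_failure`, the round-robin assignment, and the node names are all assumptions made for illustration.

```python
def place_blocks(data, block_size, nodes, replicas):
    """Split data into blocks and assign each block to `replicas` distinct nodes.

    Round-robin assignment is a hypothetical policy chosen for simplicity.
    """
    if replicas > len(nodes):
        raise ValueError("replicas must not exceed the number of nodes")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for index in range(len(blocks)):
        # Consecutive nodes, wrapping around, hold the copies of this block.
        placement[index] = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
    return blocks, placement


def readable_after_failure(placement, failed):
    """Return True if every block still has at least one copy on a live node."""
    return all(
        any(node not in failed for node in holders)
        for holders in placement.values()
    )


blocks, placement = place_blocks(b"abcdefghij", 4, ["n1", "n2", "n3"], 2)
print(placement)
print(readable_after_failure(placement, {"n2"}))
print(readable_after_failure(placement, {"n1", "n2"}))
```

With two copies per block, the file in this sketch remains readable when any single node fails, but not when both nodes holding the same block fail.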
Furthermore, as great-amount data processing has become more popular, the storage structure of the data has been changing compared with conventional processing. So far, the data to be processed has generally been stored in a relational database. Such data is referred to as “structured data”. The relational database is well suited to processing that searches and extracts the data. However, a great deal of time and work is required to load the data into the database before the search and the extraction can be performed.
On the other hand, in great-amount data processing, since the amount and types of the data have increased, the conventional relational database can no longer handle the data adequately. First, data that cannot be treated as conventional structured data, such as images and audio, is becoming a processing target. Such data is difficult to process with the conventional relational database. Second, even for data that is structured, such as the log data, a great amount and variety of data exists in a file format, so that it is not realistic to load the data into the relational database.
Because of the above-described problems, in the distributed processing foundation in recent years, a method for performing the distributed processing on the data in its original file format, without loading the data to be processed into the relational database, has become mainstream. Since data such as images and audio is difficult to structure, it is generally referred to as non-structured data. Further, data that is structured like the log data but exists in a file format is referred to as semi-structured data. The semi-structured data includes a comma-separated values (CSV) file and an Extensible Markup Language (XML) file.
Since the semi-structured data including the CSV file is not loaded into a database but is stored in the file format as it is, data access to the semi-structured data depends on the data structure of the file. A case where this dependency causes a problem will be described below by way of example.
The CSV file will be taken as an example herein. When the CSV file is sequentially read from a disk, the data in the CSV file is accessed sequentially in the row direction. In the CSV file, one row generally stores, as one record, a time stamp, a name for identifying the record, and attribute values of various types. Therefore, when the CSV file is sequentially read, the data is read one record at a time.
On the other hand, when analysis is performed using the information stored in the CSV file, processing that extracts attribute values of the same type and adds up the extracted values is widely performed. In such processing, access in the column direction of the CSV file occurs. Therefore, when the CSV file is simply stored on the disk as it is, the access in the column direction becomes random access on the disk and, thus, the access speed in the column direction is slowed down.
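The access pattern at issue can be seen in a minimal sketch. It locates the values of one attribute inside a CSV text stored record by record, and adds them up. The sample records, the helper names `column_offsets` and `sum_column`, and the absence of quoted fields are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample: time stamp, record name, and two attribute values.
CSV_TEXT = (
    "2024-01-01T00:00:00,sensor-a,10,0.5\n"
    "2024-01-01T00:00:01,sensor-b,20,0.7\n"
    "2024-01-01T00:00:02,sensor-c,30,0.9\n"
)


def column_offsets(text, column):
    """Return the (start, end) character offsets of one attribute's values.

    Assumes plain comma-separated fields without quoting.
    """
    offsets = []
    position = 0
    for line in text.splitlines(keepends=True):
        fields = line.rstrip("\n").split(",")
        # Skip the preceding fields and their trailing commas.
        start = position + sum(len(field) + 1 for field in fields[:column])
        offsets.append((start, start + len(fields[column])))
        position += len(line)
    return offsets


def sum_column(text, column):
    """Add up one attribute; every record has to be read in full to reach it."""
    return sum(float(row[column]) for row in csv.reader(io.StringIO(text)))


offsets = column_offsets(CSV_TEXT, 2)
total = sum_column(CSV_TEXT, 2)
print(offsets)  # the values of one attribute are separated by the other fields
print(total)
```

The offsets are not contiguous: each value of the attribute is separated from the next by the remaining fields of the record, so reading only that attribute requires either jumping between positions or reading the whole file.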
As a conventional solution to the problem described above, a method is provided that uses a columnstore, a format in which the data stored in the database can be processed in the column direction. Products using this method include Google BigTable (trademark) described in non-patent literature (NPL) 1. Further, patent literature disclosing a similar technique includes patent literature (PTL) 1.
In the method using the columnstore, when input data such as the structured data and the semi-structured data is loaded into the information processing system, the input data is stored in the columnstore and is thereby converted into data having a data structure appropriate for column-direction access. More specifically, the data is stored with its rows and columns interchanged in advance so that the access in the column direction becomes sequential access.
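The interchange of rows and columns at load time can be sketched as follows. This is only an in-memory illustration, not the storage format of any particular columnstore product. The sample records and the function name `to_columns` are assumptions, and all records are assumed to have the same number of fields.

```python
import csv
import io

# Hypothetical sample: time stamp, record name, and two attribute values.
CSV_TEXT = (
    "2024-01-01T00:00:00,sensor-a,10,0.5\n"
    "2024-01-01T00:00:01,sensor-b,20,0.7\n"
    "2024-01-01T00:00:02,sensor-c,30,0.9\n"
)


def to_columns(csv_text):
    """Interchange rows and columns so each attribute's values lie together.

    Assumes every record has the same number of fields; zip() would
    otherwise silently truncate to the shortest record.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [list(column) for column in zip(*rows)]


columns = to_columns(CSV_TEXT)

# After the conversion, adding up one attribute reads one contiguous list.
total = sum(float(value) for value in columns[2])
print(columns[2])
print(total)
```

The cost of the conversion is paid once when the data is loaded, after which each aggregation over a single attribute touches only that attribute's values.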