Every organization manages a business data group, that is, a set of a plurality of business data. A business data group is classified roughly into an unstructured data group, which is a set of a plurality of unstructured data, and a structured data group, which is a set of a plurality of structured data. The proportion of the unstructured data group in the business data group is generally said to be by far greater than that of the structured data group (approximately 80% of a business data group is said to be unstructured data). However, unstructured data is often left unused in a file server or the like immediately after it has been created. In addition to such circumstances, as the amount of unstructured data accumulated in an organization or the like continues to increase, unstructured data groups need to be analyzed for secondary business use.
Unfortunately, it is difficult to conduct meaningful business analysis using unstructured data alone.
Therefore, business analysis is known to be carried out using structured data in addition to unstructured data ("combined analysis," hereinafter). Combined analysis is generally atypical rather than typical. In atypical analysis, for example, meaningful data is searched out from a large business data group and utilized for the analysis. Specifically, for example, various types of data are combined to create new data for analysis, and meaningful data is searched out from the created data.
A distributed file system is generally used for managing unstructured data groups. A distributed file system is a system in which groups of folders on a plurality of computers coupled by a network are handled as subfolders of a single shared folder, allowing any of the plurality of computers to access those folders. In this system, by constructing a virtual structure different from the actual folder structure, files and folders that are physically distributed appear as if they were managed by a single computer. A distributed file system is generally incorporated into a parallel distributed processing infrastructure that is capable of executing distributed processing on the files (data) managed by the distributed file system (e.g., Hadoop). A programming model such as MapReduce, in which analytical processing on data is simplified into group extraction processing (Map processing) and data aggregation processing (Reduce processing), is known as a parallel processing technique (for example, PTL 1).
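To make the MapReduce programming model concrete, the following is a minimal, non-distributed sketch of the two phases described above: a Map phase that emits key-value pairs and groups them by key (group extraction processing), and a Reduce phase that aggregates the values for each key (data aggregation processing). All function and variable names here are illustrative, not part of any actual framework such as Hadoop.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Map processing: apply map_fn to each record to emit
    (key, value) pairs, then group the values by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Reduce processing: aggregate the list of values
    collected for each key into a single result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Illustrative task: count word occurrences across lines of text,
# standing in for records read from files on a distributed file system.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(key, values):
    return sum(values)

lines = ["sales report sales", "report data"]
result = reduce_phase(map_phase(lines, word_mapper), count_reducer)
# result == {"sales": 2, "report": 2, "data": 1}
```

In an actual parallel distributed processing infrastructure, the Map and Reduce phases run in parallel on many computers, and the framework handles shuffling the grouped keys between them; this sketch only shows the logical structure of the model.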
On the other hand, a database is generally used for managing structured data groups. A database is managed by a database management system (“DBMS,” hereinafter).
Therefore, in developing an analysis application that acquires the necessary data from a structured data group and an unstructured data group for the purpose of combined analysis, the analysis application needs to access both the DBMS and the parallel distributed processing infrastructure (distributed file system). This requires the developer of the analysis application to have deep knowledge of both systems (the DBMS and the parallel distributed processing infrastructure (distributed file system)). Moreover, the analysis application needs to include a function of combining unstructured data and structured data to create data for analysis. Especially in an atypical analytical task, it is not always the case that the data for analysis can be created from specific data according to a fixed combining method: the files (data) to be handled are not limited to specific files (data), and the combining method needs to be changed dynamically depending on the formats of the files (data). For these reasons, the cost of developing an analysis application is high (i.e., the burden on the developer is high).
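The burden described above can be illustrated with a minimal sketch of such an analysis application. The structured side is represented by an in-memory SQLite database standing in for a DBMS, and the unstructured side by a dictionary of report texts standing in for files on a distributed file system; the table, file names, and matching rule are all hypothetical examples, and the application itself must know how to query both systems and how to combine their data.

```python
import sqlite3

# Structured data group: customer records managed by a DBMS
# (here, an in-memory SQLite database as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(1, "Acme"), (2, "Globex")])

# Unstructured data group: report files keyed by file name,
# standing in for files on a distributed file system.
reports = {
    "report_1.txt": "customer 1 complained about delivery delay",
    "report_2.txt": "customer 2 praised the new product",
}

def combine(db, reports):
    """Combine each unstructured report with the structured customer
    record it mentions, producing rows of data for analysis.
    The matching rule (substring 'customer <id>') is illustrative."""
    combined = []
    for fname, text in reports.items():
        for cid, name in db.execute("SELECT id, name FROM customers"):
            if f"customer {cid}" in text:
                combined.append({"file": fname, "customer": name, "text": text})
    return combined

rows = combine(db, reports)
```

Note that the combining logic is hard-coded here for one data format; as the text explains, an atypical analytical task would require this logic to change dynamically with the formats of the files being handled, which is precisely what makes such applications costly to develop.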