As Web 2.0 technology has developed, the volume of Internet applications and operational data (e.g., subscriber action data and platform system data) in Internet platforms has grown tremendously. To facilitate the processing of massive quantities of operational data, a parallel data-processing architecture is typically used. That is, the processing of massive quantities of data is accomplished through the joint collaboration of multiple distributed computers.
Typically, in large Internet website platforms, the most widely used parallel data processing architecture is the Hadoop system architecture. The Hadoop system architecture includes a cluster of systems composed of a master server and multiple slave servers. The master server divides the massive data into multiple data blocks and then allocates the divided data blocks to the multiple slave servers that operate in parallel. Each slave server processes its data block and sends the processing results to the master server. After merging the processing results, the master server then outputs the merged processing results. In addition, the merged results output from the master server in one stage of processing may also serve as the input for data processing in the next stage to obtain the next-stage merged results. Such a combination of parallel and serial processing enables the parallel data processing system to process massive quantities of data with a high degree of efficiency.
Such conventional techniques of analyzing data are primarily based on relational databases. However, it is very difficult to use relational databases to analyze massive quantities of data in a parallel data processing architecture. It is particularly difficult to meet demands when it is necessary to perform complex data analysis such as classification and report generation.