With the rapid development of Social Networking Services (SNS) and the growing number of Internet users, the volume of user data generated at the network side is increasing geometrically. Processing and analyzing this user data is becoming more and more important for the business decisions of network operators. Usually, the database at the network side obtains valuable parameters, such as the behavior habits of users and the popularity of applications, by analyzing terabyte-level user data.
During data analysis, the database performs deduplication and accumulation operations on thousands or millions of pieces of user data with respect to the various indicators that the operators expect to obtain from the analysis. For example, suppose there are five pieces of user data (also called five pieces of pipeline data), each of which records the behavior data of a user. The deduplication operation eliminates repeated user data with regard to a specified indicator. For example, when calculating the indicator “number of online active persons of the application”, suppose a user having a User Identity (ID) of 1001 has visited an application having an application ID of 1 twice. When calculating the number of online active persons of application 1, the two pieces of user data generated by the two visits need to be deduplicated, so that only one piece of user data generated by user 1001 is retained. That is, the number of visits to application 1 is converted into the number of persons visiting application 1, which avoids errors in the indicator caused by multiple pieces of user data generated by the same user. The accumulation operation adds together multiple pieces of user data of the same category to obtain the corresponding result of the indicator. For example, when calculating the indicator “number of online active persons of the application”, user 1002 and user 1003 have each visited application 2. When calculating the number of online active persons of application 2, the two pieces of user data generated by the two different users 1002 and 1003 are accumulated, giving a number of online active persons of 2 for application 2. As can be seen, the accumulation operation is used to obtain the result of an indicator, and the deduplication operation is used to eliminate errors in the user data on which the accumulation operation is based.
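The deduplication and accumulation operations described above can be illustrated with a minimal Python sketch. The record layout (a pair of user ID and application ID per visit) and the function name are assumptions made for illustration only. The sample records are limited to the visits mentioned in the example and do not reproduce the full set of five pieces of pipeline data.

```python
from collections import Counter

# Hypothetical pipeline records, one (user_id, app_id) pair per visit.
visit_records = [
    (1001, 1),
    (1001, 1),  # second visit of user 1001 to application 1
    (1002, 2),
    (1003, 2),
]

def count_active_persons(records):
    """Return the number of distinct visiting users per application."""
    # Deduplication: keep a single record per (user, application) pair.
    unique_pairs = set(records)
    # Accumulation: add up the remaining records of each application.
    return Counter(app_id for _user_id, app_id in unique_pairs)

active_persons = count_active_persons(visit_records)
```

Under these assumptions, application 1 is counted once although it was visited twice, and application 2 is counted twice because two different users visited it.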
When performing user data analysis, a commonly used implementation is to perform the deduplication operation on the original user data by a first mapper&reducer process, and then to perform the accumulation operation on the deduplicated user data by a second mapper&reducer process, so as to obtain the corresponding result of an indicator.
The present data analysis process therefore needs to perform the mapper&reducer process twice. However, too many mapper&reducer stages may consume considerable computation resources of the database. Especially when many indicators are to be computed, the computation task may become too large for the database system to handle.
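The two-stage flow described above can be sketched in plain Python as follows. This is a single-process simulation under stated assumptions, not the interface of any particular mapper&reducer framework: the helper run_stage, the mapper and reducer function names, and the (user ID, application ID) record layout are all hypothetical.

```python
from collections import defaultdict

def run_stage(records, mapper, reducer):
    """Simulate one mapper&reducer pass: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

# First stage: deduplication.
def dedup_mapper(record):
    user_id, app_id = record
    yield (user_id, app_id), None

def dedup_reducer(key, _values):
    # Emit one record per (user, application), however many visits occurred.
    yield key

# Second stage: accumulation.
def count_mapper(record):
    _user_id, app_id = record
    yield app_id, 1

def count_reducer(app_id, ones):
    yield app_id, sum(ones)

def active_persons_two_stage(records):
    """Run the two passes in sequence and return {app_id: person count}."""
    deduplicated = run_stage(records, dedup_mapper, dedup_reducer)
    return dict(run_stage(deduplicated, count_mapper, count_reducer))
```

Each call to run_stage stands for one full pass over the data, so a single indicator already costs two passes. That cost is multiplied by the number of indicators to be computed.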
With the advent of the age of big data, the cube data structure has emerged, as used for example in On-Line Analytical Processing (OLAP) systems. This data structure may store multidimensional data, wherein each piece of data can be described from different views, and the user data may be analyzed and searched from any single view or any combination of multiple views. An exemplary cube data structure is shown in FIG. 2, in which the stored data possesses attributes of three views: “product type”, “area” and “time”. Because the shape of this data structure resembles a cube, it is called the cube data structure.
Data analysis based on the cube data structure has a prominent feature: an indicator may be analyzed from different views or from combinations of views. Taking FIG. 2 as an example, the data that meets the conditions of the indicator may be filtered from the two separate views “product type” and “area”, and it may also be filtered from two different view combinations, “product type”+“area” and “product type”+“time”, respectively. Each piece of data in the data structure may be described from different views.
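Analyzing one indicator from a single view or from a view combination can be sketched as below. The three view names follow FIG. 2. The row values, the user_id field and the function name are hypothetical and are not taken from the figure. The indicator computed here is the number of distinct users per group, chosen only to match the earlier example.

```python
from collections import Counter

# Hypothetical rows; each one carries a value for every view of the cube.
cube_rows = [
    {"product type": "A", "area": "north", "time": "Q1", "user_id": 1001},
    {"product type": "A", "area": "north", "time": "Q2", "user_id": 1001},
    {"product type": "A", "area": "south", "time": "Q1", "user_id": 1002},
    {"product type": "B", "area": "north", "time": "Q1", "user_id": 1003},
]

def distinct_users_by_views(rows, views):
    """Count distinct users for each combination of values of the given views."""
    seen = set()
    for row in rows:
        group = tuple(row[view] for view in views)
        # Deduplication within the group: a user is recorded once per group.
        seen.add((group, row["user_id"]))
    # Accumulation: one count per remaining (group, user) pair.
    return Counter(group for group, _user_id in seen)

by_product = distinct_users_by_views(cube_rows, ("product type",))
by_product_area = distinct_users_by_views(cube_rows, ("product type", "area"))
by_product_time = distinct_users_by_views(cube_rows, ("product type", "time"))
```

The same rows yield a different grouping for each view or view combination, so each one requires its own deduplication and accumulation.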
In practice, the number of views of the cube data structure involved in data analysis ranges from dozens to thousands, and the number of view combinations obtained from them may be much larger. Because the data analysis for each view or view combination needs to go through various computation processes, such as data loading, the deduplication operation and the accumulation operation, so many views and view combinations may result in very high computation complexity. If these independent computation processes are executed serially, the time cost will greatly exceed the range acceptable to the operators; if they are executed in parallel, the database will bear a heavy load and a computation bottleneck may occur.
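The growth in the number of view combinations can be made concrete with a short sketch. It assumes that a view combination is an unordered, non-empty subset of the available views, which gives 2^n - 1 combinations for n views. Both this interpretation and the function names are assumptions made for illustration.

```python
from itertools import combinations

def view_combinations(views):
    """Enumerate every non-empty, unordered combination of the given views."""
    return [
        combo
        for size in range(1, len(views) + 1)
        for combo in combinations(views, size)
    ]

def count_view_combinations(num_views):
    """Number of non-empty subsets of num_views views, without enumerating them."""
    return 2 ** num_views - 1

fig2_combinations = view_combinations(["product type", "area", "time"])
```

Under this assumption, the three views of FIG. 2 give 7 combinations, while 20 views already give 1,048,575. Each combination requires its own data loading, deduplication and accumulation.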
In general, the existing approaches to data analysis suffer from high computation complexity and low data processing efficiency, and therefore consume more time and more computation resources.