Data stream processing has become one of the primary means of data mining and data analysis. One example of a data stream is a web log, which carries a large volume of data. Other examples include the product posting information that an e-commerce website continuously adds, text message transmission records that are continuously added, and the like. Such data streams have the following features: (1) a large volume of data, (2) an identifier (ID) attached to each piece of data, where the characteristics of each ID must be analyzed, and (3) a time attribute, for example, a chronological property.
Data stream analysis generally must be performed in real time and at high speed. When this requirement is met, a data stream analysis system is able to provide a real-time response based on the current actions of specific users. For example, by analyzing logs in real time, the current status and recent access activity of a user may be analyzed to increase the accuracy of recommendations or to provide real-time anti-spamming. However, it has always been technically difficult to analyze data streams at speeds high enough to satisfy the real-time requirement. The analysis is even more difficult when the data volumes are very large.
An embodiment of a conventional distributed data stream processing system is illustrated in FIG. 1. A raw data stream S is distributed to a plurality of functional modules F. The functional modules F perform processing simultaneously and transmit the results of the processing to a data integration module I. The data integration module I integrates the processed data and outputs the integrated data. However, the following limitations occur in existing distributed data stream processing systems:
(1) When data streams having large volumes of data are processed, the data processing and data analysis become very time consuming. Also, existing distributed data stream processing systems generally employ a shared memory model. The shared memory model is a method of exchanging data between different modules, in particular between upstream and downstream modules. For example, the results of module A are placed in memory (a database, a file, etc.), and module B then reads the results from the memory. Thus, a data exchange occurs between the modules A and B. With such a model, real-time computing is not easily achievable; only quasi-real-time computing can be achieved. In other words, most existing processing technology is unable to keep pace with the growth rate of large-volume real-time data streams, and the delay in processing the data may be relatively long. Thus, data analysis can only be performed offline, leading to further delays in data analysis and data mining. Due to these delays, responding in a timely manner to current or recent actions of users is difficult.
(2) Distributed parallel computing has already become popular for processing large volumes of data. However, existing parallel computing systems are essentially limited to a framework of functional reproduction. Functional reproduction is a method of implementing parallel computing in which all computing modules have the same function and run the same processes. The computing modules differ only in the data they compute, and a computing system achieves parallelism by running many such modules side by side. Because every computing module is identical, parallel computing cannot be implemented with greater precision, modularization and hot swapping cannot be implemented, and maintenance of the computing modules is difficult.
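For illustration only, the conventional arrangement of FIG. 1 and the two limitations above may be sketched as follows. The sketch is a minimal, single-process approximation and is not taken from any particular existing system. All names in it (split_stream, functional_module, integrate, process, shared_store) are hypothetical. The round-robin distribution, the per-ID counting performed by each module, and the use of a dictionary as the shared memory are likewise assumptions chosen to keep the example short.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_stream(records, n_modules):
    """Distribute the raw data stream S across n_modules partitions.

    Round-robin distribution is an assumption made for this sketch.
    """
    partitions = [[] for _ in range(n_modules)]
    for i, record in enumerate(records):
        partitions[i % n_modules].append(record)
    return partitions


def functional_module(module_id, partition, shared_store):
    """Functional module F.

    Every copy runs this same function (functional reproduction); the
    copies differ only in the partition of data they receive. The result
    is placed in shared_store, standing in for the shared memory
    (a database, a file, etc.) described in limitation (1).
    """
    shared_store[module_id] = Counter(record_id for record_id, _ts in partition)


def integrate(shared_store):
    """Data integration module I: read every result back from the shared
    store and merge the per-module counts into one result."""
    total = Counter()
    for partial in shared_store.values():
        total.update(partial)
    return total


def process(records, n_modules=3):
    shared_store = {}  # hypothetical stand-in for the shared memory
    partitions = split_stream(records, n_modules)
    with ThreadPoolExecutor(max_workers=n_modules) as pool:
        # list() forces completion and surfaces any worker exception.
        list(pool.map(functional_module,
                      range(n_modules),
                      partitions,
                      [shared_store] * n_modules))
    # Integration starts only after every module has written its result,
    # which is why this model yields quasi-real-time output at best.
    return integrate(shared_store)


# Each record is (ID, timestamp), matching the stream features listed above.
raw_stream = [("u1", 1), ("u2", 2), ("u1", 3), ("u3", 4), ("u1", 5)]
counts = process(raw_stream)
```

In this sketch, changing what one functional module does means changing functional_module for every copy at once, and the integration step cannot begin until all results have been written to and read back from the shared store.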