A data stream management system (DSMS) is a system that manages real-time transmission and processing of data streams.
The data stream management system differs from a database management system in the following respects. First, streams are infinite. Second, no assumption can be made about the order in which data arrives. Third, the scale of the data and the time constraints make it difficult to store stream elements and process them after they have arrived. Therefore, processing each data element once, as it arrives, is the general mechanism for handling a stream. In the data stream management system, various communication methods are required in order to process the data efficiently.
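The one-pass handling described above can be illustrated with a minimal Python sketch. It assumes a hypothetical unbounded source, `sensor_readings`, and computes a running mean while keeping only constant state, so no stream element is stored after it has been processed. The function names are illustrative only and do not come from any particular data stream management system.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each element, visiting each element once."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: only the count and the current mean are kept.
        mean += (value - mean) / count
        yield count, mean


def sensor_readings() -> Iterator[float]:
    """Hypothetical infinite stream: 1.0, 2.0, 3.0, ..."""
    value = 0.0
    while True:
        value += 1.0
        yield value


# The stream never ends, so the consumer decides how much of it to observe.
first_five = list(islice(running_mean(sensor_readings()), 5))
print(first_five)
```

Because the state is constant in size, the same loop can run for as long as the stream lasts.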
One communication method of the data stream management system transmits and receives data over a single path, through one channel (or one socket) created by a network protocol (for example, TCP or UDP). A single channel has a lower throughput than a plurality of channels, and it concentrates load on the server on which the channel is created. To address this, there is a split/collect technology that distributes data over a plurality of channels, collects the distributed data, and analyzes the collected data. A representative example of the split/collect technology is the MapReduce technology of Hadoop, which is used in distributed parallel processing. The MapReduce technology of Hadoop transmits data to a plurality of distributed servers (split), processes the data in each server that receives it, and collects the processing results to obtain the same result as would be obtained over a single channel.
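The split/collect pattern can be sketched as follows. This is a minimal, single-machine illustration that uses Python threads to stand in for distributed servers; it is not Hadoop's API, and the names `split`, `process`, `collect`, and `split_collect` are assumptions made for the example. It checks that the collected result equals the result obtained over a single path.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def split(records: List[str], channels: int) -> List[List[str]]:
    """Distribute records round-robin over the given number of channels."""
    return [records[i::channels] for i in range(channels)]


def process(partition: List[str]) -> Counter:
    """Work done by one receiver: count the words in its partition."""
    counts: Counter = Counter()
    for record in partition:
        counts.update(record.split())
    return counts


def collect(partials: Iterable[Counter]) -> Counter:
    """Merge the partial results into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def split_collect(records: List[str], channels: int = 3) -> Counter:
    # Each worker thread stands in for one server behind its own channel.
    with ThreadPoolExecutor(max_workers=channels) as pool:
        partials = list(pool.map(process, split(records, channels)))
    return collect(partials)


records = ["a b a", "b c", "a c c", "d"]
single = process(records)          # single-path result
multi = split_collect(records, 3)  # split/collect result
print(multi == single)
```

Word counting is used only because its partial results merge by simple addition; any computation whose partial results can be combined would fit the same shape.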
In an open source based data stream management system (for example, Yahoo S4, Storm, and the like), a plurality of tasks is placed on a plurality of server nodes, and the tasks communicate with each other over channels. Scalability is therefore ensured by statically relocating all of the tasks through a specific parameter (for example, the parallelism hint of Storm); in other words, the granularity at which scalability is ensured is the whole task. However, even in data stream management systems that perform such channel communication, the traffic volume of each channel is transmitted to a master node, and the tasks are relocated by using that information. As a result, in the open source based data stream management system, overhead is incurred because all channels that belong to a relocated task are relocated together with it.
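The overhead of task-level relocation can be illustrated with a minimal sketch. The `Task` and `Master` classes below are hypothetical and are not taken from S4 or Storm; they only assume that each task reports its per-channel traffic to a master node, which then moves the busiest task as a whole to the least loaded node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    name: str
    node: str
    # Traffic reported for each channel owned by this task (hypothetical units).
    channel_traffic: Dict[str, int] = field(default_factory=dict)

    def total_traffic(self) -> int:
        return sum(self.channel_traffic.values())


class Master:
    """Hypothetical master node that relocates at task granularity."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = nodes
        self.tasks: Dict[str, Task] = {}

    def report(self, task: Task) -> None:
        self.tasks[task.name] = task

    def node_load(self) -> Dict[str, int]:
        load = {node: 0 for node in self.nodes}
        for task in self.tasks.values():
            load[task.node] += task.total_traffic()
        return load

    def relocate_busiest_task(self) -> int:
        """Move the busiest task on the busiest node; return channels moved."""
        load = self.node_load()
        busiest_node = max(load, key=load.get)
        target_node = min(load, key=load.get)
        candidates = [t for t in self.tasks.values() if t.node == busiest_node]
        task = max(candidates, key=Task.total_traffic)
        task.node = target_node
        # Every channel of the task moves, including the nearly idle ones.
        return len(task.channel_traffic)


master = Master(["n1", "n2"])
task_a = Task("A", "n1", {"c1": 900, "c2": 10, "c3": 5})
master.report(task_a)
master.report(Task("B", "n1", {"c4": 20}))
master.report(Task("C", "n2", {"c5": 50}))
moved = master.relocate_busiest_task()
print(moved, master.node_load())
```

In this example only channel `c1` carries heavy traffic, yet all three channels of task A are relocated, which is the kind of overhead described above.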
As described above, in a Hadoop system or the open source based data stream management system, relocation must be performed in units of whole tasks, and as a result, system operating efficiency deteriorates.