1. Technical Field
The present invention relates generally to an apparatus and method for managing a data stream distributed parallel processing service and, more particularly, to an apparatus and method for managing a data stream distributed parallel processing service which, when a plurality of input data streams are continuously processed by executing a plurality of tasks in a distributed and parallel manner, efficiently arrange the plurality of tasks in conformity with their load characteristics and use communication means suitable for the structure of the task arrangement. The apparatus and method can thereby reduce the data Input/Output (I/O) load between the tasks of the service that is attributable to an explosive increase in the input data streams of the tasks constituting the service, and can improve I/O performance and data processing performance.
2. Description of the Related Art
With the advent of the ubiquitous computing environment and the rapid development of the user-oriented Internet service market, the number of data streams to be processed has rapidly increased and the types of data streams have also become more diverse. Accordingly, research into semi-real-time analysis of massive data streams and into data stream distributed parallel processing for providing such a processing service has been actively conducted.
FIG. 1 is a diagram showing an example of a conventional distributed parallel processing service.
Referring to FIG. 1, a distributed parallel processing service 110 is connected to an input source 100 and an output source 130. A method of processing input data in the distributed parallel processing service 110 is represented by the definitions of several operations 116 to 120, which describe queries or processing methods for data streams, and by a Directed Acyclic Graph (DAG), which describes the data flows between the operations. The operations 116 to 120 in the conventional distributed parallel processing service 110 are distributed and arranged among several nodes 111 to 115 within a cluster composed of multiple nodes, and are commonly executed in the form of processes. Once the operations start to be executed, they continue to run without their processes being terminated. Accordingly, massive data streams can be processed continuously, rapidly, and in parallel.
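The structure described with reference to FIG. 1 can be sketched as follows. This is only an illustrative sketch: the edges between the operations and the one-operation-per-node placement are assumptions made for the example, because the description does not state them, and the names `dag`, `placement`, `topological_order`, and `cross_node_edges` are hypothetical.

```python
from collections import deque

# Assumed data flows between operations 116 to 120 (not given in the text).
dag = {
    "op116": ["op117", "op118"],
    "op117": ["op119"],
    "op118": ["op119"],
    "op119": ["op120"],
    "op120": [],
}

# Assumed placement of each operation on one of nodes 111 to 115.
placement = {
    "op116": "node111",
    "op117": "node112",
    "op118": "node113",
    "op119": "node114",
    "op120": "node115",
}


def topological_order(graph):
    """Return the operations in an order that respects the data flows."""
    indegree = {op: 0 for op in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    ready = deque(op for op, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for target in graph[op]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


def cross_node_edges(graph, node_of):
    """List the data flows whose two endpoints run on different nodes."""
    return [
        (source, target)
        for source, targets in graph.items()
        for target in targets
        if node_of[source] != node_of[target]
    ]


order = topological_order(dag)
remote_flows = cross_node_edges(dag, placement)
```

Under this assumed placement every data flow crosses a node boundary, so each tuple passed between operations incurs inter-node communication.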
In a conventional distributed parallel processing system based on the above-described conventional distributed parallel processing service, the registered operations of a service are executed continuously on input data streams. However, such a system is problematic in that, when data streams increase explosively, continuous query processing is delayed because of a shortage of available resources, and the distributed data stream processing system experiences an error or stops because of the exhaustion of node resources. In order to overcome these problems, the conventional distributed parallel processing system adopts a method of allocating more node resources, such as memory and a Central Processing Unit (CPU), to problematic operations, a load shedding method of selectively discarding input data streams, or a load migration method of moving an operation from a current node to another node.
However, the method of allocating more node resources is problematic in that it is difficult to apply online while a service is being performed.
The load shedding method of selectively discarding input data streams is problematic in that it deteriorates the accuracy of the results of continuous query processing.
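The loss of accuracy caused by load shedding can be illustrated with a minimal sketch. The shedding policy shown here (keeping one tuple out of every `keep_every`) and the sample data are assumptions chosen for illustration; the conventional systems referred to above may select the tuples to discard differently.

```python
def shed_load(stream, keep_every):
    """Keep one tuple out of every `keep_every` and discard the rest."""
    for index, item in enumerate(stream):
        if index % keep_every == 0:
            yield item


# Hypothetical input stream: the integers 1 to 100.
readings = list(range(1, 101))

# A simple aggregate query over the full stream and over the shed stream.
exact_sum = sum(readings)
shed_sum = sum(shed_load(readings, keep_every=4))

# Scaling the shed result back up gives only an estimate of the exact answer.
estimated_sum = shed_sum * 4
error = exact_sum - estimated_sum
```

Here three quarters of the input is discarded, and even after scaling, the estimated sum differs from the exact sum, which is the accuracy problem noted above.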
Furthermore, the load migration method is problematic in that, when a specific operation cannot be processed at a single node because of an explosive increase in input data streams, the problem cannot be overcome even if the specific operation is migrated to another node. In particular, if operations having various load characteristics are all simply treated as computation-intensive operations and are migrated by allocating the spare CPU resources of other nodes to them, I/O-intensive operations suffer from bottlenecks in resources, such as disk or network resources, whose performance is more limited than that of a CPU in recent hardware environments.
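The limitation of CPU-only load migration can be sketched as follows. The classes `Node` and `Operation`, the function `migration_targets`, and all capacity figures are hypothetical and use arbitrary units; they are not taken from any conventional system.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float  # spare CPU capacity (arbitrary units)
    free_io: float   # spare disk/network I/O capacity (arbitrary units)


@dataclass
class Operation:
    name: str
    cpu_demand: float
    io_demand: float


def migration_targets(op, nodes, cpu_only=False):
    """Return the names of the nodes that can host `op`.

    With cpu_only=True the I/O demand is ignored, which models a policy
    that treats every operation as computation-intensive.
    """
    return [
        node.name
        for node in nodes
        if node.free_cpu >= op.cpu_demand
        and (cpu_only or node.free_io >= op.io_demand)
    ]


# Assumed cluster state: plenty of spare CPU, little spare I/O.
nodes = [Node("node112", free_cpu=8, free_io=2),
         Node("node113", free_cpu=6, free_io=3)]

# Assumed I/O-intensive operation whose I/O demand exceeds any single node.
op = Operation("op118", cpu_demand=2, io_demand=5)

naive_targets = migration_targets(op, nodes, cpu_only=True)
real_targets = migration_targets(op, nodes)
```

In this example the CPU-only check accepts both nodes, whereas no node can actually satisfy the I/O demand, so migrating the operation only moves the bottleneck to another node.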