1. Technical Field
The present invention relates generally to an apparatus and method for managing stream processing tasks and, more particularly, to an apparatus and method for managing stream processing tasks that are capable of efficiently managing, in an in-memory state and depending on their frequency of execution, a massive number of stream processing tasks that are consecutively executed in a stream processing system for processing explosively increasing data streams in real time.
2. Description of the Related Art
With the advent of the big data era, the quantity of data streams to be processed has rapidly increased, and the types of data streams have become more varied. Accordingly, active research is being carried out into data stream distributed parallel processing capable of providing real-time data analysis and processing services for a large quantity of data streams.
An example of a data stream distributed parallel processing method is a data flow-based distributed parallel processing structure. As illustrated in FIG. 1, in the data flow-based distributed parallel processing method, a service 100 receives an input source 110, performs distributed parallel processing on the input source, and provides an output source 140. In this case, the input data processing method is represented by a Directed Acyclic Graph (DAG) that describes the definitions of multiple operations 121 to 125, in which queries or processing methods for data streams are described, and the data flows generated between the operations. The operations 121 to 125 within the service 100 are split into multiple tasks 131 to 135, which are distributed, arranged, and executed over multiple nodes 111 to 115 within a cluster formed of multiple nodes.
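The structure described above can be sketched as follows. This is a minimal illustrative model, not the claimed apparatus: the class names, fields, and the example service are all hypothetical, and a real scheduler would additionally place the resulting tasks on cluster nodes.

```python
# Hypothetical sketch of a data flow-based service: operations form a
# Directed Acyclic Graph (DAG), and each operation is split into a number
# of tasks to be distributed over the nodes of a cluster.

class Operation:
    def __init__(self, name, query, parallelism):
        self.name = name                # operation identifier
        self.query = query              # query/processing function for stream elements
        self.parallelism = parallelism  # number of tasks the operation is split into
        self.downstream = []            # data flows to successor operations

    def connect(self, other):
        """Add a data flow from this operation to a successor."""
        self.downstream.append(other)
        return other

class Service:
    """A service is a DAG of operations executed over input data streams."""
    def __init__(self, name):
        self.name = name
        self.operations = []

    def add(self, op):
        self.operations.append(op)
        return op

    def tasks(self):
        # Each operation is split into `parallelism` tasks; a scheduler would
        # distribute these over the nodes within the cluster.
        return [(op.name, i) for op in self.operations for i in range(op.parallelism)]

service = Service("word_count")
split = service.add(Operation("split", lambda s: s.split(), 2))
count = service.add(Operation("count", lambda w: (w, 1), 3))
split.connect(count)
print(service.tasks())  # five tasks in total: 2 for "split", 3 for "count"
```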
In general, tasks are executed in the form of task threads that are managed by the task management process of each node. Once the tasks have been started, they are consecutively executed without terminating their threads, thereby enabling a large number of data streams to be rapidly and consecutively processed in parallel.
A conventional data flow-based distributed parallel processing structure consecutively executes the tasks of a registered service based on input data streams. If a large number of tasks are executed at the same time, depending on the application service, a problem arises in that the file handles generated within a node are exhausted or a context-switching load occurs between the threads.
In order to solve such a problem regarding the generation of an excessive number of threads, a recent distributed stream processing system generates only a specific number of task executor threads that are managed by the task management process of each node. Tasks assigned to each node are managed in object form, rather than thread form, using a task object pool. The distributed stream processing system uses a method in which the task executor threads fetch tasks from the task object pool and transfer the input data streams to the fetched tasks, so that the respective nodes process the input data streams.
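The task-pool scheme above can be illustrated with the following minimal sketch (all names are illustrative): a fixed, small number of executor threads repeatedly fetch task objects from a shared pool, feed them pending input data, and return them to the pool, so the thread count stays bounded no matter how many tasks exist.

```python
# Sketch of the task object pool scheme: 100 task *objects* are served by
# only 4 executor *threads*, avoiding one-thread-per-task exhaustion.
import queue
import threading

class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.processed = []        # in-memory state kept across executions

    def process(self, item):
        self.processed.append(item)

task_pool = queue.Queue()
for i in range(100):               # many task objects ...
    task_pool.put(Task(i))

input_stream = queue.Queue()
for n in range(1000):
    input_stream.put(n)

def executor():
    while True:
        try:
            item = input_stream.get_nowait()
        except queue.Empty:
            return                 # no more input: the executor thread ends
        task = task_pool.get()     # fetch a task object, not a thread
        task.process(item)
        task_pool.put(task)        # return it so other executors can reuse it

threads = [threading.Thread(target=executor) for _ in range(4)]  # ... but few threads
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(len(task_pool.get().processed) for _ in range(100))
print(total)  # 1000: every input item was processed by exactly one task
```

The essential point is that the number of operating-system threads is decoupled from the number of tasks: task state lives in the pooled objects, while the executors are merely workers.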
For example, a word number calculation service that continuously counts the numbers of words included in sentences using a distributed stream processing system will be described below. The word number calculation service is an application service that counts the up-to-date numbers of appearances of words extracted from continuously input sentences. As illustrated in FIG. 2, each of tasks 221 to 225 counts the number of appearances of a word among the input words 211 to 219 included in the sentences. Only when the tasks 221 to 225, which count the numbers of appearances of the input words up to the present, are maintained in memory is it possible to count the numbers of all input words without omission.
Although a task is managed as an object (i.e., a task object) rather than a thread, as described above, the state of the task object needs to be continuously maintained in memory. In this case, millions of task objects must be maintained in memory even within a single application service.
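The word number calculation service and its in-memory state can be sketched as follows. This is an assumed, simplified model: the hash-based routing and the class names are illustrative, and the point is only that each task's running counts are correct solely while the task objects remain in memory.

```python
# Illustrative word-count tasks: each task object keeps per-word counts in
# memory; losing a task object would lose the counts it has accumulated.
from collections import Counter

class WordCountTask:
    def __init__(self):
        self.counts = Counter()    # per-word appearance counts, held in memory

    def process(self, word):
        self.counts[word] += 1
        return self.counts[word]   # the up-to-date count for this word

NUM_TASKS = 4
tasks = [WordCountTask() for _ in range(NUM_TASKS)]

def route(word):
    # Hash partitioning (assumed): every appearance of a given word goes to
    # the same task, so that task alone holds the complete count for it.
    return tasks[hash(word) % NUM_TASKS]

sentence = "to be or not to be"
for word in sentence.split():
    route(word).process(word)

total = Counter()
for t in tasks:
    total.update(t.counts)
print(total["to"], total["be"], total["or"])  # 2 2 1
```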
Accordingly, the data stream distributed parallel processing system is problematic in that it is difficult to maintain all task objects in memory when the number of task objects suddenly increases, because an application service must process several hundred task objects at the same time. In particular, a problem arises in that all of the consecutively executed tasks cannot be maintained in the in-memory device of a computing node having a very limited amount of memory.
In order to prevent the exhaustion of memory resources attributable to a large number of task objects, conventional distributed parallel processing systems utilize a method of increasing the resources within a node, that is, assigning more node resources, such as memory and central processing unit (CPU) resources, to a problematic operation, or adopt a load shedding method of selectively discarding input streams or stream processing tasks. An example of the load shedding method is disclosed in the article entitled "S4: Distributed Stream Computing Platform," Proceedings of the ICDM Workshops (published on Dec. 12, 2010, related parts: pp. 170-177).
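Load shedding can be illustrated with the following minimal sketch (an assumption for illustration only, not the cited S4 mechanism): when the input rate exceeds the node's capacity, a shedder discards a fraction of the input elements so that the remainder fits the available resources, at the cost of result accuracy.

```python
# Sketch of random load shedding: only `keep_ratio` of the stream survives;
# the discarded elements are simply never processed, which is why load
# shedding degrades the accuracy of continuous query results.
import random

def shed(stream, keep_ratio, rng=random.Random(42)):
    """Yield roughly a `keep_ratio` fraction of the stream, discarding the rest."""
    for item in stream:
        if rng.random() < keep_ratio:
            yield item

kept = list(shed(range(10_000), keep_ratio=0.25))
# roughly a quarter of the 10,000 elements survive
print(2000 < len(kept) < 3000)  # True
```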
Alternatively, the distributed parallel processing systems perform a load migration method of migrating the task objects of an operation from the current node to another node and executing the task objects there. An example of the load migration method is Korean Patent Application Publication No. 10-2013-0010314 entitled "Method and Apparatus for Processing Exploding Data Streams."
However, the method of increasing resources within a node is problematic in that it is difficult to apply while a service is running, and the load shedding method of discarding input data streams or stream processing tasks is problematic in that it reduces the accuracy of continuous query processing results.
Furthermore, the method of migrating the tasks of an operation to another node is undesirable in terms of cost because additional nodes are required, and is also problematic in that the data transfer mappings between a massive number of tasks must be reconfigured.