1. Field
Embodiments of the invention relate to an asynchronous data structure pull Application Programming Interface (API) for stream systems.
2. Description of the Related Art
A process may be described as a data flow diagram. A process may be constructed from the following components: a data flow diagram, operators, and arcs. A data flow diagram may be described as a directed graph in which the vertices (boxes) of the graph are called operators and the arcs describe the directional flow of data. The data flow diagram describes the data as the data flows from various data sources through the different operators to various data targets. Operators are able to read data from an external resource, write data to an external resource, and/or apply data transformations while doing so. In general, operators are able to consume data from every incoming arc and can produce data on every outgoing arc. Many operators are provided as built-in operators to provide common data access and transformations, while other operators may be created by the user and easily integrated into the system. Arcs represent the flow of data between two connected operators.
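As an illustration only, the directed-graph description above can be sketched in a few lines of Python. The class name `DataFlowDiagram`, its methods, and the operator names `reader`, `filter`, and `writer` are hypothetical and are not taken from any described embodiment.

```python
from collections import defaultdict


class DataFlowDiagram:
    """Directed graph: vertices are operators, arcs carry data in one direction."""

    def __init__(self):
        self.operators = set()
        self.outgoing = defaultdict(list)  # operator -> operators it feeds
        self.incoming = defaultdict(list)  # operator -> operators feeding it

    def add_arc(self, source, target):
        """Record a directional flow of data from `source` to `target`."""
        self.operators.update((source, target))
        self.outgoing[source].append(target)
        self.incoming[target].append(source)

    def sources(self):
        """Operators with no incoming arc (they read from external resources)."""
        return sorted(op for op in self.operators if not self.incoming.get(op))

    def targets(self):
        """Operators with no outgoing arc (they write to external resources)."""
        return sorted(op for op in self.operators if not self.outgoing.get(op))


# Hypothetical three-operator process: read, transform, write.
diagram = DataFlowDiagram()
diagram.add_arc("reader", "filter")
diagram.add_arc("filter", "writer")
```

In this sketch an operator with no incoming arc plays the role of a data source and an operator with no outgoing arc plays the role of a data target.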
A continuous process may be described as a process that reads from continuous data sources (i.e., data sources that provide data continually) and generates result data corresponding to input data as the input data becomes available. A system that runs as a continuous process is a “stream system”. A stream system may be represented by a data flow diagram.
A scheduler may be described as a runtime component that activates the operators of the process. The scheduler's job is to allow the process to produce data while minimizing consumed resources, such as memory and CPU, and while maximizing Quality of Service (QoS) measurements, such as latency and throughput.
FIG. 1 illustrates a fragment of a Process 100 with four operators: Operator A, Operator B, Operator C, and Operator D. In FIG. 1, Operators A and B consume data from their incoming queues and produce data into the queues that are consumed by Operator C. Operator C consumes and processes the data in its incoming queues and produces more data, which is sent via another queue to Operator D.
While the data is streaming into and out of the operators, the scheduler needs to decide at every step which operator (or operators) to activate. In particular, the execution time of a process is composed of a finite number of scheduler steps. At the beginning of each step, the scheduler decides which operators will be activated during that step.
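The notion of scheduler steps may be illustrated with a minimal sketch over the four-operator fragment of FIG. 1. The queue names, the rule that an activation moves one item per non-empty incoming queue, and the policy of activating every operator that has pending input are all assumptions made for the example; a scheduler may apply a different policy.

```python
from collections import deque

# Queues of the FIG. 1 fragment (names are hypothetical).
queues = {name: deque() for name in
          ("in_A", "in_B", "A_to_C", "B_to_C", "C_to_D", "out_D")}

# operator name -> (incoming queues, outgoing queue)
operators = {
    "A": (["in_A"], "A_to_C"),
    "B": (["in_B"], "B_to_C"),
    "C": (["A_to_C", "B_to_C"], "C_to_D"),
    "D": (["C_to_D"], "out_D"),
}


def activate(name):
    """One activation: move one item per non-empty incoming queue, tagging it."""
    inputs, output = operators[name]
    for q in inputs:
        if queues[q]:
            queues[output].append(queues[q].popleft() + ">" + name)


def choose():
    """Assumed policy: activate every operator that has pending input."""
    return [name for name, (inputs, _) in operators.items()
            if any(queues[q] for q in inputs)]


def run():
    """Run scheduler steps until no operator has work; return the step count."""
    steps = 0
    while True:
        chosen = choose()  # decided at the beginning of the step
        if not chosen:
            return steps
        for name in chosen:
            activate(name)
        steps += 1


queues["in_A"].extend(["a1", "a2"])
queues["in_B"].append("b1")
steps = run()
results = list(queues["out_D"])
```

Each item reaching `out_D` carries the path it took, for example `a1>A>C>D`, which makes it easy to check that data flowed through the operators in the direction of the arcs.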
In event-based methods, a routine or method is invoked for each data item received or possibly for each output location made available. However, some such event-based methods do not provide a desired coarse granularity. A drawback of the event-based approach is the overhead due to lack of granularity control. A method invocation is required for each data item delivered to the operator, and the operator code must restore, and then save back, any state needed between the receipt of one data item and the next.
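The per-item overhead described above may be illustrated with a minimal sketch of a running-sum operator. The names `EventBasedSum`, `on_item`, and `batch_sum` are hypothetical, and the batch variant is included only as a contrast showing coarser granularity; it is not presented as any particular system's interface.

```python
class EventBasedSum:
    """Event-style operator: one method invocation per delivered data item."""

    def __init__(self):
        self.saved_state = {"total": 0}  # state kept between invocations
        self.invocations = 0

    def on_item(self, item):
        self.invocations += 1
        state = dict(self.saved_state)   # restore state on entry
        state["total"] += item
        self.saved_state = state         # save state back on exit
        return state["total"]


def batch_sum(items, total=0):
    """Coarser granularity: one invocation handles many items, and the
    running total stays in a local variable for the whole call."""
    results = []
    for item in items:
        total += item
        results.append(total)
    return results


data = [1, 2, 3, 4]
event_op = EventBasedSum()
event_results = [event_op.on_item(x) for x in data]  # four invocations
batch_results = batch_sum(data)                      # one invocation
```

Both variants compute the same running totals; the difference is that the event-based form pays for an invocation and a restore/save of state on every item.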
Multi-threading incurs the overhead of stack allocation and switching, which is more costly than ordinary procedure calling. Moreover, multi-threading is disallowed in some execution frameworks, e.g., Java® 2 Platform, Enterprise Edition (J2EE™) application servers (Java and J2EE are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both).
In general, use of threads may be considered problematic because such use destroys most composition properties of programs. For example, in order to use a library of software, the developer needs to know whether or not the library uses threads and how the threads are used in order to know whether the library can be used from another context that uses threads.
Finally, some third-party code that needs to be included in operators may simply not be thread-safe. Hosting this code requires either an entirely separate operating-system-level process or a single-threaded operator framework implementation.
One of the desirable properties of a dataflow system (i.e., a system that processes data flow diagrams) is the ability to avoid use of thread-based concurrency. For example, J2EE™ application server-based deployments disallow use of threads by applications. In a single-threaded system, however, once the scheduler activates an operator and passes control to the operator, it is up to the operator to decide when to return control to the scheduler. That is, the scheduler cannot interrupt the operator, cancel activation of the operator, or even initiate communication with the operator until the operator decides to end the current activation cycle. Furthermore, the scheduler has no knowledge of the nature of the logic implemented by the operators. Thus, a kind of cooperative multitasking is needed.
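Cooperative multitasking on a single thread may be illustrated, purely as a sketch, with Python generators: each operator is a generator, and each `yield` is the point at which the operator chooses to return control. The function names, the batch sizes, and the round-robin policy are assumptions made for the example and do not represent the interface of any described embodiment.

```python
from collections import deque


def operator(name, items, batch_size, log):
    """Process items in batches; each `yield` ends one activation cycle."""
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            log.append((name, item))
        yield  # the operator, not the scheduler, decides when this happens


def run_round_robin(activations):
    """Single-threaded scheduler: it regains control only when the active
    operator yields or finishes, and it cannot interrupt an activation."""
    ready = deque(activations)
    while ready:
        op = ready.popleft()
        try:
            next(op)          # pass control to the operator
            ready.append(op)  # the operator yielded; schedule it again later
        except StopIteration:
            pass              # the operator has no more work


log = []
run_round_robin([
    operator("X", [1, 2, 3], batch_size=2, log=log),
    operator("Y", [10], batch_size=1, log=log),
])
```

Note that operator X processes two items before operator Y runs at all: the scheduler has no means of preempting X, so the interleaving is determined entirely by where each operator elects to yield.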