1. Technical Field
The present disclosure relates generally to distributed stream processing, and more specifically to an operator of a streaming application that can efficiently adapt to the environment in which it is executed, and to systems for executing such operators.
2. Discussion of Related Art
The availability of large-scale affordable computational infrastructure is making it possible to implement large, real-world continuous streaming data analysis applications. Implementing continuous streaming applications, where data is ingested from physical sensors (e.g., hurricane tracking stations) or consumed from business platform sources (e.g., trading information from stock exchanges), creates interesting challenges in how to develop these large-scale applications in a scalable fashion. For example, it can be difficult for application developers and administrators to ensure that the processing infrastructure will cope with increased needs (e.g., adding new sensors for tracking weather changes) and with the variations in resource availability that occur once these long-running applications are deployed.
Additional intelligence and appropriate abstractions in the programming model supporting these applications can help. Another limitation of current methods is that traditional a priori capacity planning techniques can be of limited use in providing guidance and system support for situations where workloads are unknown or the runtime dynamics are not well understood. For example, in certain environments, spikes in data rates need to be dealt with expeditiously at runtime (e.g., announcements by the Federal Reserve in the US usually affect trading patterns and trading transaction volumes almost immediately). Further, some typical streaming applications are hypothesis-driven (e.g., can I assess whether a competitor hedge fund is attempting to unload a particular asset?), which also implies spikes in processing needs.
Distributed stream processing applications currently do not include ways of adapting to runtime variations in the availability of resources such as processing cycles, memory, or I/O bandwidth. It can be difficult to design a streaming application such that its multiple components are properly placed onto the runtime environment to best utilize the computational resources. Data analysts and programmers may organize a streaming application using structured data analysis tasks. Each data analysis task may include a logical task (e.g., how a data analysis application should be implemented in terms of fundamental building blocks) and a physical task (e.g., how the logical task should be mapped onto the physical resources). Cognitively, the logical task is much closer to analysts and developers, as it is in their domain of expertise. The physical task requires a deeper understanding of processor architectures, networking, and interactions with other system components. As a result, only very seasoned systems developers can handle the physical task, and even they are only effective when dealing with reasonably small applications.
Further, modern chips may include varying numbers of processing cores. Clusters of workstations (COWs) currently deployed in most high-performance computing installations typically have from 2 to 32 cores per node, and the current trend in chip design is to ramp up the number of cores even further. However, since an application may be executed on a system with an unknown number of cores, developers cannot easily optimize streaming applications in advance.
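As an illustration only, the difficulty of targeting an unknown core count can be sketched as a program that defers its degree of parallelism until runtime. The following minimal Python sketch is not part of the disclosure; the names `choose_worker_count` and `process_stream`, and the use of a thread pool, are assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def choose_worker_count(requested=None):
    """Return a worker count bounded by the cores visible at runtime.

    Hypothetical helper: `requested` is an optional developer hint that is
    clamped to the range [1, available cores].
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


def process_stream(tuples, operator, workers=None):
    """Apply a per-tuple operator using a pool sized at runtime.

    Hypothetical stand-in for a streaming operator; a real system would
    consume an unbounded stream rather than a finite iterable.
    """
    n = choose_worker_count(workers)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(operator, tuples))


if __name__ == "__main__":
    print(process_stream(range(5), lambda x: x * x))
```

The pool size here is decided when the program starts rather than when it is written, which is the simplest form of the adaptation discussed above; it does not address changes in resource availability during execution.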
Thus, there is a need for methods of generating applications that can adapt to the resources of the environments in which they are executed, and for systems for executing such applications.