1. Technical Field
The present invention relates to assembling stream processing applications, and more particularly, to a method and system for automatically assembling stream processing graphs in stream processing systems.
2. Discussion of the Related Art
Stream processing applications ingest large volumes of streaming data from one or more sources, process it using a variety of components, and produce results that satisfy user queries.
Stream processing systems are needed in situations where source data is too voluminous to store and analyze. Such data, observed on high capacity streams, must be processed on-the-fly by stream processing applications in response to user queries. These applications are typically expressed as processing graphs (or workflows) of components that can extract meaningful information from mostly unstructured, streaming data. A processing graph is a stream-interconnected collection of data sources and processing elements (PEs). Data sources produce the (possibly unstructured) streaming data to be observed. PEs are deployable software components that can perform various kinds of operations on the data to produce new, derived data streams.
A key challenge for stream processing systems lies in the construction of processing graphs that can satisfy user queries. With many thousands of disparate data sources and PEs to choose from, we cannot expect the end-user to craft these graphs manually. These users are typically not skilled programmers, and they may not have knowledge of the functions performed by different components.
We can also not rely on programmers or experts to construct these graphs. With the large numbers of data sources and PEs to consider, the number of possible graphs is enormous. Different users can have different queries, requiring different graphs to be constructed. Thus, it is not feasible to pre-construct all possible graphs to satisfy the wide variety of end-user queries manually.
Also, for a given query, a number of alternative processing graphs can be assembled, each achieving a similar result, each consuming possibly different amounts of computational resources, and each producing different levels of quality. Depending on deployment-time resource utilization, a particular graph may not be deployable, but some alternate graph, consuming fewer resources at some sacrifice in result quality, might be deployable. Typically, however, users will not know how to construct the right graph to produce the highest quality result with resource limitations at deployment time.