The stream processing computational paradigm includes assimilating data readings from collections of software or hardware sensors in stream form (that is, an infinite collection of tuples carrying the information produced by the external data sources), analyzing the data, and producing actionable results, possibly in stream format as well. Stream processing applications may comprise components, each of which may produce a data stream that may be consumed by another component in the application. For example, in traffic management systems, it is conceivable that every driver carrying a cell phone becomes a data source that feeds information about the driver's location and speed into a congestion control system. In such a situation, the data collection portion of a distributed traffic management platform may register its interest in subscribing to all possible data sources (for example, all drivers carrying a cell phone in a particular region) to increase the accuracy of the congestion predictions it might make. Therefore, there is a need to provide automatic routing of data sources to data consumers. In the above case, this could include routing the instantaneous readings capturing driver speed and location, as well as traffic accident locations and road maintenance schedules, to data consumers (in this example, the traffic and congestion management platform).
Typically, a stream processing application can include dozens to hundreds of analytic operators, deployed on systems hosting many other, potentially interconnected, stream applications, distributed over a large number of processing nodes. In existing approaches, implementing the flow graphs that will interconnect an application or allow multiple applications to be integrated is usually achieved in an ad hoc way, by hard coding the inter- and intra-application connections or by relying on publish-subscribe or enterprise-bus types of mechanisms, which have many scalability shortcomings.
The production of data by a producer may be intermittent relative to the execution of the application. For example, large-scale distributed applications being developed in the realm of infrastructure monitoring (such as traffic management systems, energy distribution systems, large-retailer supply chain management systems, distributed fraud and anomaly detection systems, and surveillance systems) tend to be long-running applications designed to stay up continuously except perhaps during well-planned maintenance outages. Moreover, these applications are often designed to cooperate with one another, for example, by having traffic sensors drive automated traffic controls for reducing congestion.
In many cases, the raw data sources that feed existing applications can become available and unavailable continuously with varying time granularities (that is, from a few seconds to a few days), as some of these sources are transient. In such a situation, the data collection portion of a platform may register its interest in subscribing to all possible raw data sources. Therefore, a need exists to provide automatic routing of data sources to data consumers.
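For illustration only, the following Python sketch shows one way a standing interest could be matched against sources that appear and disappear over time. The `Registry` class, its method names, and the `kind` and `region` properties are hypothetical and are not drawn from any existing system; the sketch assumes that each source is described once by a small set of properties.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

Properties = Dict[str, str]
Predicate = Callable[[Properties], bool]


@dataclass
class Registry:
    """Hypothetical broker that keeps consumer interests and live sources matched."""
    sources: Dict[str, Properties] = field(default_factory=dict)
    interests: Dict[str, Predicate] = field(default_factory=dict)
    routes: Dict[str, Set[str]] = field(default_factory=dict)  # consumer -> source ids

    def subscribe(self, consumer: str, predicate: Predicate) -> None:
        # Record the standing interest and connect every source that is already live.
        self.interests[consumer] = predicate
        self.routes[consumer] = {
            sid for sid, props in self.sources.items() if predicate(props)
        }

    def source_up(self, source_id: str, props: Properties) -> None:
        # A source became available: route it to every consumer whose interest matches.
        self.sources[source_id] = props
        for consumer, predicate in self.interests.items():
            if predicate(props):
                self.routes[consumer].add(source_id)

    def source_down(self, source_id: str) -> None:
        # A source became unavailable: drop it from every route.
        self.sources.pop(source_id, None)
        for connected in self.routes.values():
            connected.discard(source_id)


registry = Registry()
registry.subscribe(
    "congestion",
    lambda p: p.get("kind") == "phone" and p.get("region") == "north",
)
registry.source_up("driver-1", {"kind": "phone", "region": "north"})
registry.source_up("driver-2", {"kind": "phone", "region": "south"})
registry.source_down("driver-1")
registry.source_up("driver-3", {"kind": "phone", "region": "north"})
```

In this sketch the consumer registers its interest once, and the set of connected sources follows the sources as they come and go without any change to the consumer.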
Existing approaches include publish/subscribe systems. However, publish/subscribe systems describe data characteristics at the granularity of individual data items generated by the producers, rather than at the granularity of a producer. Such an approach is impractical because every data item must be annotated and matched individually, which is inefficient compared with annotating a source of data once for all of the items it produces.
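The difference in granularity can be made concrete with a small Python sketch. It is a simplified model rather than a description of any particular publish/subscribe product: the function names, the dictionary-based annotations, and the count of matching checks are assumptions chosen only to show where the per-item overhead arises.

```python
from typing import Dict, Iterable, List, Tuple

Annotation = Dict[str, str]


def matches(annotation: Annotation, wanted: Annotation) -> bool:
    # True when every wanted property is present with the wanted value.
    return all(annotation.get(key) == value for key, value in wanted.items())


def route_per_item(
    items: Iterable[Tuple[Annotation, float]], wanted: Annotation
) -> Tuple[List[float], int]:
    """Item granularity: each data item carries an annotation and is matched on its own."""
    delivered: List[float] = []
    checks = 0
    for annotation, payload in items:
        checks += 1
        if matches(annotation, wanted):
            delivered.append(payload)
    return delivered, checks


def route_per_source(
    source_annotation: Annotation, payloads: Iterable[float], wanted: Annotation
) -> Tuple[List[float], int]:
    """Source granularity: the producer is matched once, then its items flow unannotated."""
    checks = 1
    delivered = list(payloads) if matches(source_annotation, wanted) else []
    return delivered, checks


wanted = {"kind": "speed", "region": "north"}
source = {"kind": "speed", "region": "north"}
readings = [61.0, 58.5, 12.0, 0.0]

per_item, item_checks = route_per_item([(source, r) for r in readings], wanted)
per_source, source_checks = route_per_source(source, readings, wanted)
```

Both functions deliver the same readings, but in this model the per-item variant attaches and checks an annotation for each of the four readings, whereas the per-source variant performs a single check regardless of how many readings follow.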