Present invention embodiments relate to stream computing and, in particular, to streamlining a stream computing environment by distributing tuple attributes to associated operators in the stream environment.
Generally, stream computing processes continuously flowing or streaming data through an operator graph, which may also be referred to as a stream processing application graph. Stream computing enables continuous and fast analysis of massive volumes of moving data to help improve the speed of business insight and decision-making, and is often most appropriate where very large volumes of data need to be processed in very short amounts of time. In order to effectuate stream computing, a set of operators (processes) is organized in a stream operator graph so that the stream operators may work in parallel. Generally, stream operator graphs are made up of: (1) operators that apply some logic to a stream input and generate a stream output; (2) streams that carry data from one operator to another; and (3) tuples, which are segments of data that flow through a stream.
An operator can generate some data and pass it to a second operator, which performs a task before passing the data (possibly modified) to a third operator, and so on. In stream processing, many such operators work together to implement a larger algorithm, with their data exchanges forming the stream operator graph. For example, a first stream operator may operate on a first portion of a tuple passing therethrough, and a second operator may operate on a second portion of the tuple. Since the first operator only operates on a portion of the tuple, the first operator may copy the unused data (e.g., tuple attributes) and forward it downstream to subsequent operators. Put another way, the operators perform incremental processing on data (e.g., the tuples) that arrive on their input ports and then forward the results on their output ports to downstream operators.
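By way of illustration only, the pass-through behavior described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular stream runtime: a tuple is modeled as a dictionary of attributes, each operator is a plain function, and the names `uppercase_name`, `add_tax`, `run_pipeline`, and the attributes `name`, `price_cents`, and `image` are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: a stream tuple is modeled as a mapping of attribute name -> value.
StreamTuple = Dict[str, object]
Operator = Callable[[StreamTuple], StreamTuple]


def uppercase_name(t: StreamTuple) -> StreamTuple:
    """Hypothetical first operator: consumes only the 'name' attribute."""
    # The whole tuple is carried over, including attributes this operator
    # never reads (here a shallow copy; a distributed runtime would
    # typically serialize and transmit them).
    out = dict(t)
    out["name"] = str(t["name"]).upper()
    return out


def add_tax(t: StreamTuple) -> StreamTuple:
    """Hypothetical second operator: consumes only 'price_cents'."""
    out = dict(t)
    out["total_cents"] = int(t["price_cents"]) * 110 // 100  # assumed 10% tax
    return out


def run_pipeline(operators: List[Operator], t: StreamTuple) -> StreamTuple:
    """Pass one tuple through a linear chain of operators, in order."""
    for op in operators:
        t = op(t)
    return t


source_tuple: StreamTuple = {
    "name": "widget",
    "price_cents": 1000,
    "image": b"\x00" * 1024,  # large attribute that neither operator consumes
}
result = run_pipeline([uppercase_name, add_tax], source_tuple)
```

In this sketch the `image` attribute travels through both operators unchanged, even though neither operator reads it.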
This incremental aspect naturally encourages data tuples to be created with many attributes, some of which could be large in size, because the data object (the tuple) represents a whole entity to be processed by the graph (although a single operator may not process all aspects of the object). Consequently, in some instances, large amounts of data (attribute data) must be read, copied, and forwarded several times before reaching an operator that actually consumes the data of that attribute in some way. This unnecessary reading, copying, and/or forwarding wastes considerable computation and communication resources and may easily cause network congestion. Moreover, many stream applications are larger than a single computer can handle, so their processes may be spread across multiple processing nodes in a cluster or cloud. As a stream computing environment becomes more distributed (e.g., in the cloud), the resources executing the operators may not be in the same data center, and the unnecessary read, copy, and forwarding operations may worsen bottlenecks and increase communication costs (e.g., clouds charging per byte entering/exiting a site).
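The scale of this overhead may be estimated with a simple back-of-the-envelope model, sketched below in Python. The model is an assumption made for illustration only: operators form a linear chain, and each attribute is copied and forwarded once by every operator upstream of the first operator that consumes it. The function name `forwarded_unused_bytes`, the attribute names, and the sizes are hypothetical.

```python
from typing import Dict


def forwarded_unused_bytes(
    attr_sizes: Dict[str, int],
    first_consumer: Dict[str, int],
    num_operators: int,
) -> int:
    """Estimate bytes forwarded by operators that never use them.

    attr_sizes:     attribute name -> size in bytes
    first_consumer: attribute name -> 0-based index of the first operator
                    in the chain that reads the attribute
    num_operators:  length of the (assumed linear) operator chain

    An attribute with no consumer is assumed to be forwarded by every
    operator in the chain.
    """
    total = 0
    for attr, size in attr_sizes.items():
        # Every operator before the first consumer forwards the attribute
        # without using it.
        hops_before_use = first_consumer.get(attr, num_operators)
        total += size * hops_before_use
    return total


# Hypothetical tuple: a small identifier used immediately and a large
# image that is first consumed by the fourth operator (index 3).
sizes = {"id": 8, "image": 1_000_000}
consumers = {"id": 0, "image": 3}
wasted = forwarded_unused_bytes(sizes, consumers, num_operators=4)
```

Under these assumptions, the three operators upstream of the consumer each forward the 1,000,000-byte `image` attribute without using it, so `wasted` is 3,000,000 bytes per tuple.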