Embodiments presented herein generally relate to distributed computing, and more specifically, to optimizing an operator graph of a distributed application (e.g., of a streams processing environment).
In a streams processing environment, multiple nodes in a computing cluster execute a distributed application. The distributed application retrieves a stream of input data from a variety of data sources and analyzes the stream. A stream is composed of data units called “tuples,” each of which is a list of attributes. Further, the distributed application includes processing elements that are distributed across the cluster nodes. Each processing element includes one or more operators, each of which may perform a specified task associated with a tuple. Each processing element receives one or more tuples as input and processes the tuples through its operators. Once the task is performed, the processing element may output one or more resulting tuples to another processing element, which in turn performs a specified task on those tuples, and so on.
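As a non-limiting illustration, the tuple flow described above may be sketched in Python as follows. The sketch assumes that a tuple can be modeled as a dictionary of attributes and that an operator is a function that returns a new tuple, or None to drop the tuple. The names ProcessingElement, parse_temp, keep_hot, and tag_alert are hypothetical and are not part of any particular streams processing product.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a tuple is modeled as a mapping of attribute names to values.
StreamTuple = Dict[str, object]
# Assumption: an operator returns a new tuple, or None to drop the tuple.
Operator = Callable[[StreamTuple], Optional[StreamTuple]]


class ProcessingElement:
    """Hypothetical processing element that applies its operators in order."""

    def __init__(self, name: str, operators: List[Operator],
                 downstream: Optional["ProcessingElement"] = None):
        self.name = name
        self.operators = operators
        self.downstream = downstream

    def process(self, tuples: List[StreamTuple]) -> List[StreamTuple]:
        results = []
        for current in tuples:
            for op in self.operators:
                current = op(current)
                if current is None:  # an operator dropped this tuple
                    break
            if current is not None:
                results.append(current)
        # Forward the resulting tuples to the next processing element, if any.
        if self.downstream is not None:
            return self.downstream.process(results)
        return results


def parse_temp(t: StreamTuple) -> StreamTuple:
    # Convert the "temp" attribute from text to a number.
    return dict(t, temp=float(t["temp"]))


def keep_hot(t: StreamTuple) -> Optional[StreamTuple]:
    # Keep only tuples whose temperature exceeds 30.
    return t if t["temp"] > 30 else None


def tag_alert(t: StreamTuple) -> StreamTuple:
    # Add an "alert" attribute to every tuple that reaches this operator.
    return dict(t, alert=True)


alerting = ProcessingElement("alerting", [tag_alert])
filtering = ProcessingElement("filtering", [parse_temp, keep_hot], downstream=alerting)

output = filtering.process([
    {"sensor": "a", "temp": "31.5"},
    {"sensor": "b", "temp": "20.0"},
])
```

In this sketch, the first processing element parses and filters its input tuples, and only the surviving tuples are passed to the second processing element. In an actual cluster, the processing elements would run on separate nodes and exchange tuples over a network rather than through direct method calls.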
Further, a developer of the distributed application may design an operator graph using an integrated development environment (IDE) tool. The operator graph specifies a desired configuration of processing elements and operators in the streams processing environment. The IDE tool may provide pre-defined operators for use in the operator graph. For example, a source processing element may read and extract information from files obtained from a data source. As another example, a functor processing element may manipulate the information extracted from the files. In addition, the developer can create custom processing elements to perform a given task, as well as custom operators, by specifying functions for a given operator to perform. The functions can specify a given task to perform and a destination processing element for tuple output.
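As a non-limiting illustration, an operator graph of the kind described above may be sketched in Python as a directed graph whose nodes are operators and whose edges name the destination of each operator's tuple output. The sketch assumes an acyclic graph, an in-memory string standing in for file contents, and operators written as functions that yield zero or more tuples. The names OperatorGraph, file_source, upper_functor, and word_count are hypothetical and do not correspond to the interface of any particular IDE tool.

```python
from collections import defaultdict, deque


class OperatorGraph:
    """Hypothetical operator graph: named operators joined by directed edges."""

    def __init__(self):
        self.functions = {}             # operator name -> function yielding tuples
        self.edges = defaultdict(list)  # operator name -> destination operator names

    def add_operator(self, name, function):
        self.functions[name] = function

    def connect(self, source, destination):
        self.edges[source].append(destination)

    def run(self, start, seed):
        # Push the seed through the graph breadth-first. Tuples emitted by
        # operators with no outgoing edges are collected as final output.
        # Assumption: the graph is acyclic, so the queue eventually empties.
        outputs = defaultdict(list)
        queue = deque([(start, seed)])
        while queue:
            name, item = queue.popleft()
            for result in self.functions[name](item):
                if not self.edges[name]:
                    outputs[name].append(result)
                for destination in self.edges[name]:
                    queue.append((destination, result))
        return dict(outputs)


def file_source(text):
    # Source operator: extract one tuple per non-empty line of the input.
    for line in text.splitlines():
        if line.strip():
            yield {"line": line.strip()}


def upper_functor(t):
    # Functor operator: manipulate the extracted information.
    yield dict(t, line=t["line"].upper())


def word_count(t):
    # Custom operator: add a word-count attribute to each tuple.
    yield dict(t, words=len(t["line"].split()))


graph = OperatorGraph()
graph.add_operator("source", file_source)
graph.add_operator("functor", upper_functor)
graph.add_operator("count", word_count)
graph.connect("source", "functor")   # destination for the source's tuple output
graph.connect("functor", "count")    # destination for the functor's tuple output

result = graph.run("source", "hello world\n\nstreams\n")
```

Keeping the operator functions separate from the edges means that the same operators can be rewired into a different configuration without changing their code, which is the property an operator graph design tool relies on.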