A pipeline is a software infrastructure that defines and links one or more stages of a process, such as a complex business or processing problem. The stages of the pipeline run in sequence to complete a specific task: the output of a given stage is provided serially as input to the subsequent stage, and while the subsequent stage executes, the given stage fetches the next data item to process. Each stage processes the incoming data according to the data processing functions that the stage is operative to execute, and the arrangement of the stages determines the sequence in which processing occurs on the data entering the pipeline to generate an end result. Although a given stage of a pipeline may be local or remote to other stages in the pipeline, the relationship between the stages is static and data must flow through all stages in the pipeline.
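The static, serially linked stages described above can be illustrated with a minimal sketch. It assumes one thread per stage and in-memory queues between stages. The names `run_stage`, `run_pipeline` and `_DONE` are hypothetical and chosen only for illustration; they are not drawn from any particular pipeline framework.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the input stream


def run_stage(func, inbox, outbox):
    """Apply func to each item from inbox and hand the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        # After handing off a result, the loop immediately fetches the next
        # item, so this stage can work while the subsequent stage executes.
        outbox.put(func(item))


def run_pipeline(stage_funcs, items):
    """Run items through stage_funcs in a fixed order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(stage_funcs) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(stage_funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

In this sketch the order of `stage_funcs` is fixed when the pipeline is built and every item passes through every stage, which mirrors the static relationship between stages noted above.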
One advantage of using a pipeline for data processing is that once all stages in the pipeline are loaded, an end result of the processing or output of the pipeline is produced every cycle. For example, where processing stages A, B and C are connected in a pipeline, and each stage takes one minute to complete, the end result of the pipeline is produced once a minute after all stages are loaded with data, as opposed to once every three minutes where the stages are not connected in a pipeline. A software pipeline may be compared to a manufacturing assembly line in which different parts of a product are assembled at the same time, although some parts may ultimately have to be assembled before others. Even where some sequential dependency exists, the pipeline takes advantage of those operations that can proceed concurrently.
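The arithmetic of the three-stage example can be checked with a short sketch. The function `completion_times` is a hypothetical helper written for this illustration. It assumes each stage handles one item at a time and that input items are always available.

```python
def completion_times(num_items, stage_minutes, pipelined):
    """Return the minute at which each item's end result is produced."""
    if not pipelined:
        # Each item must pass through every stage before the next item starts.
        total = sum(stage_minutes)
        return [total * (i + 1) for i in range(num_items)]

    # finish[s] is the minute at which stage s finished its most recent item.
    finish = [0] * len(stage_minutes)
    results = []
    for _ in range(num_items):
        prev_done = 0  # minute at which the previous stage released this item
        for s, cost in enumerate(stage_minutes):
            # A stage starts once the item has arrived and the stage is free.
            start = max(prev_done, finish[s])
            finish[s] = start + cost
            prev_done = finish[s]
        results.append(prev_done)
    return results


print(completion_times(4, [1, 1, 1], pipelined=True))   # [3, 4, 5, 6]
print(completion_times(4, [1, 1, 1], pipelined=False))  # [3, 6, 9, 12]
```

With stages A, B and C each taking one minute, the first result appears after three minutes in both cases. The pipelined arrangement then produces a result every minute, while the unpipelined arrangement produces one every three minutes.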
Many disparate techniques for processing software pipelines are known to those of skill in the art. For example, the map-reduce programming model is an attempt to reduce the complexity of distributed computation by decomposing a problem into smaller functional components that can be easily developed. The map-reduce model is a way of expressing the demultiplexing and multiplexing of operational pairs (i.e., map and reduce) so as to allow the processing of data to be partitioned automatically among a cluster of computing resources.
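A single map-reduce pair can be sketched as follows, using word counting as the example task. This is a single-process illustration only, under the assumption that each input partition would be handled by a separate computing resource in a real cluster. The function names `map_phase`, `shuffle`, `reduce_phase` and `map_reduce` are hypothetical and do not correspond to the API of any specific map-reduce implementation.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input partition."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def map_reduce(partitions):
    """Run one map-reduce pair over the partitions (sequentially here)."""
    mapped = []
    for partition in partitions:
        # In a cluster, each partition could be mapped on a different machine.
        mapped.extend(map_phase(partition))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(map_reduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

The map step fans the input out into independent key-value pairs, and the reduce step gathers the values for each key back together, which is the demultiplexing and multiplexing referred to above.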
One advantage of map-reduce is that it allows for the easy development of distributed computation tasks. The model, however, suffers from a number of problems. For example, map-reduce handles parallelization at the level of each map-reduce pair, which is sufficient only for simple tasks and becomes problematic for more complex tasks in information retrieval and machine learning, e.g., focused crawling, n-gram generation, etc. Because the mapping and reduction are static, accomplishing complex tasks such as these requires a priori knowledge regarding how to parallelize the task as a whole, including which mapped pairs should be serial and which should be parallel. For the same reason, map-reduce also fails to provide higher order language constructs for achieving complex processing, such as looping and conditional constructs. Furthermore, the map-reduce model provides neither sufficient extensibility for developing a body of reusable components for data processing nor a mechanism for cooperation between map-reduce pairs.
Another technique, messaging system frameworks, provides the ability to perform distributed computation using loosely coupled asynchronous computation units. One disadvantage of using messaging system frameworks for pipeline processing of data, however, is that these systems do not provide a means to declare groups of components that cooperate on a single task; no contractual agreement is established between messaging components to link the components together. Also, messaging system frameworks do not provide interfaces for handling the receipt and transmission of data from and to multiple sources and destinations.
Another alternative for pipeline data processing known to those of skill in the art is the use of workflow engines. Applications such as these use standardized languages, such as Business Process Execution Language (“BPEL”), to describe processes in terms of workflow between interconnected computational units. In addition to other limitations, however, none of these languages or implementations is suited for describing distributed computational processes.
In addition to other drawbacks, the alternatives for pipeline data processing known to those of skill in the art fail to provide memoization, parallelization of execution, optimization of process distribution and asynchronous processing in a service-oriented framework. Thus, there is a need for new systems and methods for allowing the declaration and execution of data processing pipelines that overcome the limitations of existing techniques.