In computerized applications such as warehousing large amounts of data that might be continuously generated as a result of, e.g., large scale manufacturing processes, grocery store sales, and the like, it is essential to provide some way to present summaries of the data. The summaries may have to be presented periodically, e.g., daily. To present such summaries, a large amount of raw data must be processed. So-called data “pipelines” have been provided for this purpose.
Essentially, a data pipeline is a collection of software modules, each one of which executes a particular operation or sequence of operations on data that passes through it. The modules typically are arranged in series, i.e., a first module receives the raw data stream, processes it, and then sends its output to the next module down the line. The last module typically is an output module.
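The serial arrangement described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each module is a Python generator function and that records are plain numbers; the module names `drop_negatives`, `scale_by_ten`, and `running_total` and the helper `run_pipeline` are hypothetical and do not come from any existing pipeline.

```python
from typing import Callable, Iterable, Iterator, List

# Assumption: a module is modeled as a function from one stream to another.
Module = Callable[[Iterable[float]], Iterator[float]]


def drop_negatives(stream: Iterable[float]) -> Iterator[float]:
    """Hypothetical first module: receives the raw stream, discards negatives."""
    for value in stream:
        if value >= 0:
            yield value


def scale_by_ten(stream: Iterable[float]) -> Iterator[float]:
    """Hypothetical middle module: multiplies each record by ten."""
    for value in stream:
        yield value * 10


def running_total(stream: Iterable[float]) -> Iterator[float]:
    """Hypothetical output module: emits the cumulative sum seen so far."""
    total = 0.0
    for value in stream:
        total += value
        yield total


def run_pipeline(raw: Iterable[float], modules: List[Module]) -> List[float]:
    """Chain the modules in series: each one consumes its predecessor's output."""
    stream: Iterable[float] = raw
    for module in modules:
        stream = module(stream)
    return list(stream)


if __name__ == "__main__":
    print(run_pipeline([1, -2, 3], [drop_negatives, scale_by_ten, running_total]))
```

In this sketch every record must pass through every module in order, and there is exactly one input stream and one output stream.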
Because data summarization requirements vary widely depending on the particular application, current pipeline architectures are specifically designed to meet the demands of whatever application happens to be envisioned. Ordinarily, a pipeline cannot be used for an application for which it was not designed. This is because each pipeline has constraints that are unique to its application, e.g., how to filter outlier data points, how to summarize by group, and even what input and output streams are to be involved.
Consequently, it is difficult to provide an open pipeline architecture that is flexible and easily configured for more than a narrow set of applications. While some pipeline architectures might permit using standard libraries, they are time-consuming to develop, require individual debugging, module by module, and tend to be difficult to maintain, since information such as SQL query statements is embedded in the pipeline code, and each programmer tends to have his or her own style. Moreover, pipelines such as UNIX pipes are restricted to one input and one output, further decreasing the flexibility of the architecture. Still further, most if not all pipelines require the modules to work in series, as mentioned above, but as recognized herein it is sometimes desirable that a particular module process only a portion of a stream of data without having to sort through the entire stream.
The present invention further recognizes that existing pipelines suffer additional disadvantages. Among them is that the interfaces that connect modules to the pipe are either undefined or too restrictively defined to be flexible for more than a narrow set of applications. Also, existing pipelines envision data flow in one direction only, from input to output, which renders them incapable of certain summarization tasks, such as incremental mean computation, which requires access to previously computed means from the output of the pipe. Having recognized the above drawbacks, the present invention provides the solutions herein to one or more of them.
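To illustrate why incremental mean computation needs access to earlier output, the following minimal sketch updates a mean from a new batch of values. It assumes, for illustration only, that the count of previously summarized records is stored alongside the previously computed mean; the names `MeanState` and `update_mean` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MeanState:
    """Previously computed summary (assumed to be read back from the output)."""
    count: int = 0
    mean: float = 0.0


def update_mean(previous: MeanState, batch: Iterable[float]) -> MeanState:
    """Combine the prior mean with a new batch without revisiting old raw data."""
    values = list(batch)
    if not values:
        return previous  # nothing new to fold in
    new_count = previous.count + len(values)
    # Weighted combination: old total (mean * count) plus the new batch total.
    new_mean = (previous.mean * previous.count + sum(values)) / new_count
    return MeanState(count=new_count, mean=new_mean)


if __name__ == "__main__":
    state = update_mean(MeanState(), [2.0, 4.0])
    state = update_mean(state, [9.0])
    print(state)
```

The update cannot be computed from the new batch alone, so a pipeline in which data flows strictly from input to output has no way to supply `previous` to the module that needs it.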