With the emergence of the Internet and the interconnection of devices used in almost every aspect of modern life, an almost limitlessly diverse range of data has become available. Internet content may be thought of as data that has intrinsic value to a subset of users of web sites, Internet client devices, and the like. This data can be configured to address the needs of that subset of users more efficiently and therefore to be of greater value to them. In many cases, this greater value is created as a result of some type of data processing, typically in the form of a sequence of stages or steps, which may be implemented through use of a pipeline. A pipeline includes one or more stages, which may manipulate sets of data, combine multiple sets of data into a single set of data by interlinking related data, and the like. Often, the output of a stage of a pipeline will serve as input to multiple subsequent stages, each of which may represent the beginning of a new pipeline and/or a continuation of the same pipeline. Since each pipeline stage relies on the availability of data from a preceding stage in the pipeline, it is very important to have a reliable system for consuming input data and producing output data for subsequent stages.
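By way of illustration only, the stage arrangement described above, in which the output of one stage feeds multiple subsequent stages, may be sketched as follows. The function name `run_stages`, the stage names, and the sample data are hypothetical and are not drawn from any particular system; the sketch assumes each stage is a simple in-memory function rather than a database operation.

```python
from collections import defaultdict, deque


def run_stages(stages, edges, seed):
    """Run each stage once all of its upstream stages have produced output.

    stages: hypothetical mapping of stage name -> function(list of inputs)
    edges:  list of (producer, consumer) pairs linking stages
    seed:   data handed to any stage that has no upstream producer
    """
    consumers = defaultdict(list)
    pending = {name: 0 for name in stages}
    for producer, consumer in edges:
        consumers[producer].append(consumer)
        pending[consumer] += 1

    inputs = defaultdict(list)
    outputs = {}
    ready = deque(name for name, count in pending.items() if count == 0)
    while ready:
        name = ready.popleft()
        stage_inputs = inputs[name] if inputs[name] else [seed]
        outputs[name] = stages[name](stage_inputs)
        # Fan out: the same output is delivered to every downstream stage.
        for consumer in consumers[name]:
            inputs[consumer].append(outputs[name])
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return outputs


# One "clean" stage whose output is consumed by two later stages.
example_stages = {
    "clean": lambda ins: [item.strip().lower() for item in ins[0]],
    "dedupe": lambda ins: sorted(set(ins[0])),
    "count": lambda ins: len(ins[0]),
}
example_edges = [("clean", "dedupe"), ("clean", "count")]
result = run_stages(example_stages, example_edges, [" A ", "b", "a"])
```

In this sketch a downstream stage cannot run until every stage it depends on has produced output, which is why reliable propagation between stages matters.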
Because of the wide range of data available from the Internet, systems may employ a large number of pipelines to manipulate the data through the various stages. In some systems, for example, pipelines are interconnected with other pipelines through interconnected stages, resulting in a large and intricate system of pipelines whose execution demands a significant amount of computer resources. Execution of a pipeline may include performing the services included in its stages, such as interlinking related data, and the like. Because of business demands such as timeliness in a competitive industry, the stages of a pipeline must execute as quickly as possible and with high reliability.
Previous systems employ attach and detach operations to propagate data across stages in a pipeline. However, these previous systems do not scale well to large pipelines. For example, the detach operation detaches the whole database because the detach operation does not support partial data propagation. Because a subsequent stage may not require all of the database objects, maintaining multiple copies of unwanted data wastes disk resources. Further, previous systems consume extra network bandwidth to copy the large files across stages. There is a need for a system that avoids transferring unneeded data by allowing a subset of the objects in a database to be selected and propagated. Further, the detach and attach operations require use of the file system to copy the data to the next stage. Accessing the file system to copy data slows the data propagation and requires additional input/output bandwidth.
In addition, previous systems allow multiple input pipes but only one output pipe per stage because of their propagation mechanisms. Supporting multiple output pipes in the previous systems significantly increases the number of databases needed. Since each input pipe is an attached database and each output pipe is a detached database, one database is needed for each input pipe and one database is needed for each output pipe. Thus, if a particular stage in a pipeline consumes N input pipes and produces M output pipes, as many as N+M+1 databases may be needed to process that single stage. When a large number of stages run on a computing device, the complexity and disk space requirements for the pipeline increase significantly. Further, because all the input pipes are attached as databases, inefficient cross-database queries are needed to process the input data and produce the output.
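The database count described above may be worked through with a short calculation. The following is a minimal sketch: the function names and the sample stage sizes are hypothetical, and treating the additional "+1" as a database belonging to the stage itself is an assumption, since the description above gives only the total.

```python
def databases_per_stage(n_inputs: int, m_outputs: int) -> int:
    """Return N + M + 1 for a stage with N input pipes and M output pipes.

    One attached database per input pipe, one detached database per
    output pipe, plus one more (assumed here to be the stage's own).
    """
    return n_inputs + m_outputs + 1


def databases_for_device(stage_shapes) -> int:
    """Sum the per-stage counts for every (N, M) stage on one device."""
    return sum(databases_per_stage(n, m) for n, m in stage_shapes)


# Hypothetical device running three stages of different shapes.
single_stage = databases_per_stage(3, 2)
device_total = databases_for_device([(3, 2), (1, 1), (4, 3)])
```

Under these assumptions a stage with three input pipes and two output pipes needs six databases, and the three sample stages together need seventeen, which illustrates how the count grows with the number of stages on a device.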
Accordingly, a scalable and efficient system for propagating data from one content pipeline to another is desired to address one or more of these and other disadvantages.