As computing technology evolves, data collection and analysis have proliferated. The amount of data collected has risen exponentially, but processing ability has not kept pace. With vast amounts of data collected in one or more data sources, efficiently querying those data sources with conventional systems can be resource intensive and costly. A query of one or more data sources is a precise request for information retrieval. Filtering is one of the fundamental operations carried out during query processing. A given query can specify any arbitrary filter expression for application on the data sources. Some of the predicates in the filter expression can be more expensive than others in terms of system resource utilization, time, etc. For example, a predicate that performs a regular expression (regex) search (e.g., name like “john %”) over data would be much more resource intensive than a predicate that performs a numeric comparison (e.g., income>=10000).
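The cost difference between predicates can be illustrated with a minimal sketch. The records, the function names, and the call counter are hypothetical and exist only for illustration. The sketch assumes both predicates are combined with a logical AND, so the inexpensive numeric predicate can be evaluated first and the regex predicate runs only on records that pass it.

```python
import re

# Hypothetical records; the field names mirror the examples in the text.
records = [
    {"name": "john smith", "income": 5000},
    {"name": "john doe", "income": 20000},
    {"name": "mary jones", "income": 30000},
]

# Regex form of the predicate: name like "john %"
NAME_PATTERN = re.compile(r"^john .*")

# Counts how often each predicate is actually evaluated.
calls = {"regex": 0, "numeric": 0}

def name_matches(rec):
    """Expensive predicate: regex search over a string field."""
    calls["regex"] += 1
    return NAME_PATTERN.match(rec["name"]) is not None

def income_at_least(rec):
    """Inexpensive predicate: a single numeric comparison."""
    calls["numeric"] += 1
    return rec["income"] >= 10000

# `and` short-circuits, so the regex predicate is skipped for any
# record that already failed the numeric predicate.
matches = [r for r in records if income_at_least(r) and name_matches(r)]
```

In this sketch the numeric predicate is evaluated for every record, while the regex predicate is evaluated only for the records that survive it.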
In some cases, multiple passes over the same data set may be needed to answer different queries. In such cases, performance can be improved by making a single pass over the data set and handing the data to multiple execution engines. This model of a single source and multiple consumers also works well where pulling data out of the source is very expensive. The source could be a disk, a tape, or any other slow medium. It could even be a fast medium that sits behind a slow network pipe.
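The single-source, multiple-consumer model can be sketched as follows. This is only an illustration under simple assumptions: the `scan` function and the two consumers are hypothetical names, the consumers stand in for execution engines, and an in-memory list stands in for a source that is expensive to read.

```python
def scan(source, consumers):
    """Read `source` once and hand each record to every consumer.

    Returns the number of records read from the source.
    """
    reads = 0
    for record in source:
        reads += 1
        for consume in consumers:
            consume(record)
    return reads

# Each consumer stands in for an execution engine answering one query.
high_income = []
johns = []

def collect_high_income(rec):
    if rec["income"] >= 10000:
        high_income.append(rec["name"])

def collect_johns(rec):
    if rec["name"].startswith("john "):
        johns.append(rec["name"])

# Hypothetical data; in practice the source might be a disk, a tape,
# or a medium behind a slow network pipe.
data = [
    {"name": "john smith", "income": 5000},
    {"name": "john doe", "income": 20000},
    {"name": "mary jones", "income": 30000},
]

reads = scan(iter(data), [collect_high_income, collect_johns])
```

Both queries are answered although each record is pulled from the source only once.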
In cases where parallel queries are being processed for the same data set, it can be very challenging to apply filters in the most optimized way. The most naïve method is to let each query processor apply its filters independently. The major disadvantage of such a technique is that each query processor has to apply its filter set to the complete data set and, where the queries share common filters, those filters are evaluated multiple times.
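The redundancy of the naïve method can be made concrete with a small sketch. All names here (`run_query`, the filters, the counter) are hypothetical, and two sequential calls stand in for two independent query processors that share one filter.

```python
# Counts evaluations of the filter that both queries have in common.
counter = {"shared": 0}

def income_at_least_10000(rec):
    """Filter common to both queries."""
    counter["shared"] += 1
    return rec["income"] >= 10000

def is_john(rec):
    return rec["name"].startswith("john ")

def is_mary(rec):
    return rec["name"].startswith("mary ")

def run_query(data, filters):
    """Naive query processor: applies its own filter set to every record."""
    return [rec["name"] for rec in data if all(f(rec) for f in filters)]

data = [
    {"name": "john smith", "income": 5000},
    {"name": "john doe", "income": 20000},
    {"name": "mary jones", "income": 30000},
]

# Each processor scans the complete data set independently.
result_a = run_query(data, [income_at_least_10000, is_john])
result_b = run_query(data, [income_at_least_10000, is_mary])
```

Because the shared filter is listed first in each filter set, it is evaluated once per record per query, so its evaluation count grows with the number of queries rather than staying at one evaluation per record.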