Traditional query processors have been designed primarily for data that does not fit in fast main memory and is instead stored on slower mass storage devices. However, processing large volumes of data from a hard disk is expensive in terms of performance. As hardware capabilities have evolved, operating systems and hardware now support larger main memory capacities, allowing tables to be stored entirely in memory.
To process data efficiently, the location of the data must be taken into account. A typical data warehouse query involves one large table, called a fact table, and a group of smaller tables, called dimension tables. Typically, during processing, the data from each dimension table is stored in a hash table in memory. If the dimension hash tables do not fit in memory, the data in the fact table is repartitioned and the processing is performed partition by partition. If the hash tables fit in memory, there is no need to repartition the fact table, because the hash tables can be easily accessed by other threads in a multiprocessing environment. Avoiding repartitioning is especially beneficial with batched processing, because moving batches between threads is comparatively slow.
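The in-memory case described above can be illustrated with a minimal sketch. It assumes that every dimension hash table fits in memory, so the fact table is scanned once without repartitioning. The function names (build_dimension_hash, star_join), the table contents, and the column names are hypothetical and chosen only for illustration; they do not come from any particular query processor.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def build_dimension_hash(rows: Iterable[Row], key: str) -> Dict[object, Row]:
    """Index the rows of one dimension table by its join key."""
    return {row[key]: row for row in rows}


def star_join(fact_rows: Iterable[Row],
              dims: List[Tuple[str, Dict[object, Row]]]) -> List[Row]:
    """Probe every dimension hash table for each fact row.

    dims is a list of (foreign-key column in the fact table, hash table).
    A fact row is emitted only if it matches in all dimension tables.
    """
    out: List[Row] = []
    for fact in fact_rows:
        joined = dict(fact)
        for fk, table in dims:
            match = table.get(fact[fk])
            if match is None:
                break  # no matching dimension row: drop this fact row
            joined.update(match)
        else:
            out.append(joined)
    return out


# Hypothetical data: one fact table and two small dimension tables.
sales = [
    {"product_id": 1, "store_id": 10, "amount": 5.0},
    {"product_id": 2, "store_id": 11, "amount": 7.0},
    {"product_id": 3, "store_id": 10, "amount": 1.0},  # unknown product
]
products = [{"product_id": 1, "product_name": "pen"},
            {"product_id": 2, "product_name": "ink"}]
stores = [{"store_id": 10, "city": "Oslo"},
          {"store_id": 11, "city": "Rome"}]

result = star_join(sales, [
    ("product_id", build_dimension_hash(products, "product_id")),
    ("store_id", build_dimension_hash(stores, "store_id")),
])
```

Because the hash tables are only read during probing, several threads could each scan a different slice of the fact table against the same tables. When the tables do not fit in memory, the fact table would instead have to be partitioned on the join keys and processed one partition at a time.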
However, in systems with multiple types of data stores, the query processor has no influence on the storage schema of the tables involved in a query. Therefore, the query processor must be able to accommodate disparate types of data stores, in which data may be stored column-wise or row-wise.
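The difference between the two layouts can be sketched as follows. This is an illustration under stated assumptions, not a description of any specific system: the class names RowStore and ColumnStore and the scan method are hypothetical. The sketch only shows that the same table can be held row-wise or column-wise and still be read through one common interface.

```python
from typing import Dict, Iterator, List, Tuple


class RowStore:
    """Hypothetical row-wise store: each record is kept together."""

    def __init__(self, rows: List[Dict[str, object]]):
        self.rows = rows

    def scan(self, columns: List[str]) -> Iterator[Tuple[object, ...]]:
        # Visit every record and pick out the requested columns.
        for row in self.rows:
            yield tuple(row[c] for c in columns)


class ColumnStore:
    """Hypothetical column-wise store: each column is kept together."""

    def __init__(self, columns: Dict[str, List[object]]):
        self.columns = columns

    def scan(self, columns: List[str]) -> Iterator[Tuple[object, ...]]:
        # Touch only the requested columns and stitch them back into tuples.
        yield from zip(*(self.columns[c] for c in columns))


# The same hypothetical table in both layouts.
row_table = RowStore([
    {"id": 1, "city": "Oslo", "amount": 5.0},
    {"id": 2, "city": "Rome", "amount": 7.0},
])
col_table = ColumnStore({
    "id": [1, 2],
    "city": ["Oslo", "Rome"],
    "amount": [5.0, 7.0],
})

row_result = list(row_table.scan(["id", "amount"]))
col_result = list(col_table.scan(["id", "amount"]))
```

A query processor written against the shared scan interface does not need to know which layout a table uses, although the cost of a scan differs: the row-wise store reads whole records, while the column-wise store reads only the requested columns.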