Multistore systems represent a natural evolution for big data analytics, in which query processing may span two stores, a big data store and an RDBMS, transferring data and computation between them. One approach to multistore processing is to transfer and load all of the big data into the RDBMS (i.e., up-front data loading) in order to take advantage of its superior query processing performance relative to the big data store. However, the sheer size of big data and the high cost of an Extract-Transform-Load (ETL) process may make this approach impractical. Another approach is to utilize both stores during query processing by enabling a query to transfer data on the fly (i.e., on-demand data loading). However, this results in redundant work when the queries in a big data workload overlap, since the same data may be transferred between the stores repeatedly. A more effective strategy for multistore processing is to strike a balance between up-front and on-demand data loading. This is challenging because exploratory queries are ad hoc in nature and the relevant data changes over time. A crucial problem for a multistore system is therefore determining what data to materialize in which store at what time. We refer to this problem as tuning the physical design of a multistore system.
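To make the redundancy argument concrete, the following minimal sketch compares the volume of data moved under each loading strategy for a toy workload. The dataset names, sizes, and helper functions are hypothetical and chosen only for illustration. The sketch ignores query processing cost and storage limits, so it is not a cost model of any particular system.

```python
from typing import Dict, List, Set

# Hypothetical dataset sizes (arbitrary units) held in the big data store.
SIZES: Dict[str, int] = {"logs": 1000, "clicks": 400, "users": 50}

# Hypothetical workload: each query is the set of datasets it reads.
WORKLOAD: List[Set[str]] = [{"clicks", "users"}, {"clicks"}, {"clicks", "users"}]


def upfront_transfer(sizes: Dict[str, int]) -> int:
    """Up-front loading: every dataset is moved into the RDBMS once."""
    return sum(sizes.values())


def on_demand_transfer(sizes: Dict[str, int], workload: List[Set[str]]) -> int:
    """On-demand loading: a dataset is moved each time a query reads it."""
    return sum(sizes[name] for query in workload for name in query)


def materialized_transfer(sizes: Dict[str, int], workload: List[Set[str]]) -> int:
    """Middle ground: a dataset is moved on first use and then kept."""
    already_moved: Set[str] = set()
    total = 0
    for query in workload:
        for name in query:
            if name not in already_moved:
                total += sizes[name]
                already_moved.add(name)
    return total


if __name__ == "__main__":
    print("up-front:    ", upfront_transfer(SIZES))                 # 1450
    print("on-demand:   ", on_demand_transfer(SIZES, WORKLOAD))     # 1300
    print("materialized:", materialized_transfer(SIZES, WORKLOAD))  # 450
```

In this toy setting, up-front loading pays for a large dataset that no query reads, while on-demand loading moves the same two datasets repeatedly. Keeping only the data that queries actually touch avoids both costs.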
Multistore systems utilize multiple distinct data stores, such as Hadoop's HDFS and an RDBMS, for query processing by allowing a query to access data and computation in both stores. Current approaches to multistore query processing fail to achieve the full potential benefit of using both systems because of the high cost of moving and loading data between the stores. Tuning the multistore physical design, i.e., deciding what data resides in which store, can reduce the amount of data movement during query processing, which is crucial for good multistore performance. Because the stores have highly asymmetric performance properties, the data placement problem is not straightforward. Roughly speaking, store 1 (the big data store, e.g., HDFS) is large and slow relative to store 2 (the RDBMS), which is smaller and faster. Specifically, there are two issues: store 1 has worse query processing performance but much shorter data loading times, whereas store 2 has better query processing performance but suffers from very high data load times. Deciding which data to place in which store therefore requires understanding these tradeoffs.
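The asymmetry described above can be illustrated with a simple break-even calculation. This is a minimal sketch under strong assumptions: a one-time load cost into the faster store, fixed per-query costs in each store, and no storage limit. The function names and the cost figures are hypothetical and are not taken from any measured system.

```python
from typing import Optional


def min_queries_to_justify_load(
    load_cost: float, slow_query_cost: float, fast_query_cost: float
) -> Optional[int]:
    """Smallest number of queries n for which loading pays off, i.e.
    load_cost + n * fast_query_cost < n * slow_query_cost.
    Returns None if the faster store offers no per-query gain."""
    gain_per_query = slow_query_cost - fast_query_cost
    if gain_per_query <= 0:
        return None
    # n must be strictly greater than load_cost / gain_per_query.
    return int(load_cost // gain_per_query) + 1


def choose_store(
    expected_queries: int,
    load_cost: float,
    slow_query_cost: float,
    fast_query_cost: float,
) -> str:
    """Pick the store with the lower total cost for the expected reuse."""
    threshold = min_queries_to_justify_load(
        load_cost, slow_query_cost, fast_query_cost
    )
    if threshold is not None and expected_queries >= threshold:
        return "fast"
    return "slow"


if __name__ == "__main__":
    # Hypothetical costs: loading takes 120 units; a query costs 50 units
    # in the slow store and 10 units in the fast store.
    print(min_queries_to_justify_load(120, 50, 10))  # 4
    print(choose_store(3, 120, 50, 10))              # slow
    print(choose_store(4, 120, 50, 10))              # fast
```

Under these assumed costs, data read by only a few queries is cheaper to leave in the slow store, while data reused four or more times justifies the high load cost of the fast store. A real design tool would also have to account for storage budgets and a changing workload.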