Historically, business intelligence (BI) systems have been an input/output (I/O) bound workload. Business data is stored on the disks of a data warehouse, and retrieving data from these disks is the main cost in query execution. The state of the art in BI has been defined by this I/O bottleneck: low-end systems spend most of their time waiting for disk I/O, while high-end systems use large numbers of disks to achieve high throughput at great financial cost.
Researchers have developed several techniques to alleviate this bottleneck by reducing the amount of data a query processor needs to touch. These techniques include aggressive compression, column stores, and materialized views. With the advent of large main memories, these techniques often allow the entire working set of a BI system to fit in random access memory (RAM), bypassing the traditional disk I/O bottleneck. For the first time, BI has become CPU-bound.
Unfortunately, recent trends in hardware are bringing this new era quickly to an end. Processor manufacturers are putting ever increasing numbers of cores onto a CPU die, and main memory bandwidth is not keeping pace. Now, in a query using multiple cores, accessing main memory often becomes the bottleneck. With manufacturers soon to put 6 and 8 cores on a single chip, this problem will only become worse.
Today, major processor vendors are shipping processors equipped with 4 separate processing cores, with 6- and 8-core processors in the pipeline. Each core in a multi-core processor is an independent CPU; this CPU sits at the top of a memory hierarchy consisting of 2-3 levels of cache and a relatively slow main memory. Each core has a private level-1 (L1) cache that is very fast, but very small. Larger level-2 (L2) and, often, level-3 (L3) caches provide slower access to larger amounts of memory. Typically, the largest cache is shared across all cores on the processor die, while each core maintains its own private caches at the levels closer to the core. For example, the Advanced Micro Devices® Opteron™ processor has a shared L3 cache and private L1 and L2 caches.
At each level of the hierarchy, performance drops by one to two orders of magnitude. Storage capacity follows a different trajectory, increasing by a factor of 2-4 at each cache layer, with a dramatic jump in capacity at the main memory layer. Even the largest processor caches represent less than half of one percent of a modern computer's memory.
This cache/memory hierarchy is somewhat similar to the memory/disk hierarchy for which mainstream database systems were designed, with cache taking the place of the buffer pool and main memory taking the place of disk. However, there are two important differences.
First, control of this memory hierarchy is implemented mostly in hardware, with the cache and memory controllers making most low-level decisions about which regions of memory reside at which level of the hierarchy. Modern CPUs provide a few instructions to “suggest” policy changes to the hardware (e.g., Advanced Micro Devices® x86-64 prefetch instructions), but these mechanisms do not provide the flexibility and control that a typical database buffer pool enjoys. In addition, many of the low-level synchronization primitives needed to implement a buffer pool within the L2 cache are themselves as expensive as a cache miss.
The second difference is one of scale. Even large L2 and L3 caches are typically less than 10 MB in size, which is smaller than database buffer pools have been for many years. Business intelligence (BI) queries are highly complex, and running them efficiently requires keeping a large “working set” in cache, including indexes, intermediate data structures, and executable code.
Database management systems (DBMSs) have always aimed to share the results of I/O among concurrent tasks through the buffer manager. Many recent systems explicitly synchronize concurrent queries, grouping together queries that run at similar speeds to increase the amount of I/O that can be shared at the buffer pool. In a main-memory DBMS, however, the analogous sharing must occur in the L2 cache rather than in memory, and the buffer pool model does not lend itself well to implementation there. Because the cache is so much smaller than memory, the combined working set of the queries often fails to fit; the resulting thrashing generates significant memory traffic that competes with the table accesses themselves.