Computer systems generally include one or more processors interfaced to a temporary data storage device such as a memory device and one or more persistent data storage devices such as disk drives. Each disk drive generally has an associated disk controller. Data is transferred from the disk drives to the disk controller and is then transferred to the memory device over a communication bus or similar. Once data has been transferred from the disk drives to a memory device accessible by a processor, database software is able to examine the data to determine if it satisfies the conditions of a query.
In data mining and decision support applications, it is often necessary to scan large amounts of data to include or exclude relational data in an answer set. Often the answer set contains data that is stored in more than one table. Where data is stored in more than one table a join plan is typically needed. In a relational database management system a cost based optimizer evaluates possible join plans. Some join plans may use Cartesian products of base tables or intermediate tables in the steps necessary to generate the answer set.
One difficulty is that the use of a Cartesian product table has the potential to result in a large amount of data. An intermediate table arising from a Cartesian product will generally include the product of the selected rows from each of the tables from which the intermediate table is generated. As an example, if one table results in ten selected rows, a second table has 100 selected rows and a third table has 1000 selected rows, the intermediate table generated from a Cartesian product of the three tables will have 1,000,000 rows.
Large intermediate tables require a large amount of random access memory (RAM) and/or secondary storage such as disk space. Large tables also result in more data needing to be transferred over the communications bus and therefore have the potential to reduce performance.