Relational database systems store data in relational structures, such as tables and indexes. However, the actual format in which the data is stored, retrieved, and manipulated, often bears little relationship to the logical structure of a table.
For example, FIG. 1 is a block diagram of a conventional relational table 100. Relational table 100 includes rows r1 to r8, and columns c1 to c5. Each row typically corresponds to a logical entity, and each column corresponds to an attribute of that entity. For example, in an employee table, each row may correspond to an employee, and each column may correspond to a distinct employee attribute (name, age, salary, phone number, etc.).
The intersection of a row and column defines a cell. The value stored in a given cell is the value, for the attribute associated with the column, for the entity associated with the row. For example, assume that row r2 is associated with an employee named “Mary”, and that column c2 is for the attribute “age”. Under these circumstances, the value (illustrated as r2c2) stored in the cell formed by row r2 and column c2 may be 17, to indicate that Mary is 17 years old.
Various database languages have been developed to easily access data that is managed by relational database systems. One common database language is SQL. Such languages allow users to form queries that reference the data as if the data were actually stored in relational structures. However, the actual structures in which the relational data is stored and accessed is often significantly more complicated than simple two-dimensional tables.
For example, FIG. 2 illustrates how the data from table 100 may actually be stored on a disk 200 by a relational database system. As illustrated in FIG. 2, the data for the table is spread among several disk blocks 202, 212 and 222. Within a disk block, the data for any given row is typically stored contiguously. A storage format in which data for each row is stored contiguously is referred to as a “row-major format”. For example, the values for the various columns of row r1 are stored contiguously within disk block 202.
However, values for certain rows, such as rows r3 and r5, may actually span two disk blocks. For example, the values for columns c1 and c2 of row r3 are in disk block 202, while the values for columns c3, c4 and c5 of row r3 are in disk block 212.
The disk blocks that store data for any given table may be spread throughout a disk, in no particular order. Each disk block typically has a section that stores the actual values that logically populate the table (the “relational data”), and a header that stores metadata about the relational data. For example, disk blocks 202, 212 and 222 respectively have headers 204, 214 and 224, and relational data portions 206, 216, and 226. An example of how relational data may be organized within disk blocks is illustrated in U.S. Pat. No. 5,870,758 entitled “Method and Apparatus for Providing Isolation Levels in a Database System”, filed on Mar. 11, 1996.
FIG. 3 is a block diagram of a typical database system 300 that may manage access to table 100. Referring to FIG. 3, a typical database system 300 includes a storage subsystem 320 on which the data managed by the database system 300 is durably stored. In the database system 300 illustrated in FIG. 3, the storage subsystem 320 includes the disk 200 that contains the disk blocks 202, 212 and 222 that store the relational data of table 100.
In addition to storage subsystem 320, database system 300 includes volatile memory 330 and a CPU 340. Typically, the size of the persistent storage used to store relational data is significantly greater than the size of the volatile memory 330 into which that data must be loaded when the data is the target of a database command. Consequently, relational database servers typically employ a buffer cache 332 in which a database server 334 temporarily store copies of the disk blocks that contain the data that is the target of a database operation. As illustrated in FIG. 3, a copy of disk block 202 has been loaded into buffer cache 332 in response to some data within disk block 202 being the target of an operation.
As database server 334 performs operations that target data stored in different disk blocks, the buffer cache 332 fills up with copies of those disk blocks. Once buffer cache 332 is full, loading a copy of a not-yet-loaded disk block into volatile memory involves replacing a copy of a disk block that was previously loaded into volatile memory. If the copy of the disk block that is being replaced has been changed, then the data from the disk block copy typically needs to be flushed back to disk.
Because moving data between volatile memory 330 and disk over I/O path 360 is orders of magnitude slower than moving data between volatile memory 330 and CPU 340 over I/O path 350, relational database systems have focused on minimizing the amount of I/O that occurs between storage subsystem 320 and volatile memory 330. For example, various techniques for intelligently caching relational data are described in U.S. patent application Ser. No. 12/691,146, entitled “Selectively Reading Data From Cache And Primary Storage”. As another example, various techniques for compressing the relational data on disk 200, so that the same amount of I/O traffic path 360 can move more data, are described in U.S. patent application Ser. No. 12/617,669, entitled “Structure Of Hierarchical Compressed Data Structure For Tabular Data”.
Unfortunately, techniques that minimize the traffic on I/O path 360 can lead to less-than-optimal use of I/O path 350. For example, because the copies of disk blocks that are stored in buffer cache 332 have substantially the same size, content, and organization as the disk blocks on disk 200, relational data can be transferred between volatile memory 330 and disk 200 with minimal conversion. However, once in volatile memory 330, that same disk-block format can be an obstacle to efficient processing of database operations.
For example, the query “select c1 from emp where c2>30” requests retrieval of values from column c1 for all rows that have a value in c2 that is greater than 30. The only data that is actually required to execute this query is the data from columns c1 and c2 of the emp table. In other words, to perform the requested operation, the values from columns c1 and c2 are the only values that CPU 340 needs to see. However, because the relational data is stored as disk block copies within buffer cache 332, and within those disk block copies the data is stored in row-major format, a large amount of irrelevant data will flow through I/O path 350 for the CPU 340 to obtain the data required to perform the operation perform the operation. In some situations, the values for potentially hundreds of columns may have to flow through I/O path 350 simply to enable CPU 340 to perform an operation that involves values from only one or two of those columns.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.