The present invention relates to computer systems and more particularly to efficiently retrieving information from databases.
Relational databases store information in indexed tables that are organized into rows and columns. A user retrieves information from the tables by entering a request that is converted to queries by a database application, which then submits the queries to a database server. In response to the queries, the database server accesses the tables specified by the query to determine which information within the tables satisfies the queries. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately to the user.
FIG. 2(a) illustrates a logical layout of an exemplary table T (200) within a relational database. Table 200 comprises three user columns, column A 204, column B 206, and column C 208, and eleven rows 210-230. Table 200 also contains an internal column or pseudocolumn 202, referred to as a rowid. A table""s rowid is retrievable by query and uniquely identifies a row in the table, but is not normally displayed when the structure of the table is listed. For example, a rowid of 221 uniquely identifies row 210, which contains the values of 3 in column A 204, 5 in column B 206, and 2 in column C 208. In this example, the values of the columns A 204, B 206, and C 208 are integers, but it is to be understood that columns of a database table can hold values of any of a variety of types including floating point numbers and variable length strings of characters.
For any given database application, the queries to retrieve information from a table must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify the information to be retrieved. For example, in the Structured Query Language (SQL), the query, select A from T where A less than 5, requests the retrieval of the information contained in column A of specified rows of table T that satisfies a specified condition. The conditions in the where clause specify one or more predicates, in this example A less than 5, which must be satisfied by matching rows. In the example, rows 210, 214, 216, 218, 220, 226, and 230 of table T 200 satisfy this query because the corresponding values of column A 204 are 3, 3, 2, 1, 4, 2, and 4, respectively. On the other hand, rows 212, 222, 224, and 228 of table T 200 do not satisfy this query because the corresponding values of column A 204 are 6, 5, 7, and 8, respectively.
One approach to access the rows of a table in processing a query is called a xe2x80x9cfull table scan,xe2x80x9d in which a database server fetches every row of the table and inspects every column named in the where clause. FIG. 2(b) illustrates one possible physical layout 200xe2x80x2 of table T wherein the corresponding row data 210xe2x80x2-230xe2x80x2 is stored in one or more units (or xe2x80x9cextentsxe2x80x9d) of contiguous blocks. A block is the smallest quantity of data that can be read from a persistent store such as a disk into dynamic memory. If a database system requires any information stored in a particular block, the database system must read the entire block into memory. To retrieve values for a particular column of a table, the database system must read all the blocks that have any data from that column of the table. Since values for the column may be present in all or almost all the blocks of a table, the entire base table or significant portion thereof must be read into memory in order to retrieve the column values. This retrieval can be very costly, as the column data itself may be a small percentage of the data stored in the table.
The sizes of the blocks in the examples and figures herein are simplified for purposes of illustration. Typically, however, the size of the blocks is generally from 512 (29) bytes to 16,384 (214) bytes, but the size of the extents are much larger and of any size, e.g. 30 megabytes. Storing row data in contiguous blocks enables the use of efficient multi-block input/output techniques, for example allowing overhead of reading from disk storage, such as seeking to a block, to be pipelined.
Over time, as rows are added and dropped, the physical order of the row data usually does not correspond to the logical order of the rows in the table, if there is a logical order. Accordingly, the order of the row data in the blocks may appear to be random. In the example of FIG. 2(b), the first unit of contiguous blocks (extent 240xe2x80x2) contains row data 218xe2x80x2, 228xe2x80x2, 224xe2x80x2, and 216xe2x80x2, and 222xe2x80x2; the second unit of contiguous blocks (extent 242xe2x80x2) contains row data 210xe2x80x2, 212xe2x80x2, 214xe2x80x2, 220xe2x80x2, and 230xe2x80x2; and the third unit of contiguous blocks (extent 244xe2x80x2) contain row data 226xe2x80x2.
In order to process the exemplary query select A from T where A less than 5, a full table scan reads all of extent 240xe2x80x2 containing row data 218xe2x80x2, 228xe2x80x2, 224xe2x80x2, and 216xe2x80x2, and 222xe2x80x2, all of extent 242xe2x80x2 containing row data 210xe2x80x2, 212xe2x80x2, 214xe2x80x2, 220xe2x80x2, and 230xe2x80x2, and the used portion of extent 244xe2x80x2 containing row data 226xe2x80x2. Thus, the full table scan reads the data for all the columns in table T 200, even though only the information from column A 204 was necessary to process the query. If the amount of the data for the columns not used in the query is very large, then the full table scan methodology becomes very inefficient because of the unnecessary amount of disk input/output.
Accordingly, many database systems provide indexes to increase the speed of the data retrieval process. A database index is conceptually similar to a normal index found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied with the location of the information. Values in one or more columns of a table are stored in an index, which is maintained separately from the actual database table.
One implementation of a database index is a B-tree, whose logical layout is illustrated in FIG. 3(a). A B-tree index is a hierarchical arrangement of two types of nodes: leaf nodes and branch nodes. Leaf nodes reside at the lowest level of the B-tree hierarchy and contain values from the actual column or columns upon which the index is built and the rowid of the corresponding rows. Leaf nodes may contain data for many rows, e.g. 100 rows, but, for purposes of example, leaf nodes are illustrated herein as containing a single row. For example, B-tree index 300, being built upon column A 204 of table T 200, has leaf nodes 310-330 collectively holding the values of column A 204. Specifically, leaf node 310 holds the value 1 from column A 204 and the rowid 118, which identifies row 218 of table T 200. As another example, leaf node 330 contains an index value of 8 from column A 220 and a rowid of 123, identifying row 228 of table T 200. Each leaf node contains a pointer or other link to the subsequent leaf node. For example, leaf node 328, which contains an index value of 7, points to leaf node 330, which contains an index value of 8.
The non-leaf nodes of a B-tree index are branch nodes. Branch nodes contain information that indicate a range of values. In the illustrated B-tree index 300, nodes 302, 304, 306, and 308 are branch nodes and therefore each corresponds to a range of values. The range of values indicated by each branch node is such that all nodes that reside below a given branch node correspond to the values that full within the range of values for the given branch node. For example, node 306 is a branch node that corresponds to the numerical range from 4 to 6. Consequently, nodes 320, 322, 324, and 326, which all reside below node 306 in the hierarchy, correspond to values that fall within the range from 4 to 6.
A database server can perform an xe2x80x9cindex range scanxe2x80x9d on B-tree index 300 to process the exemplary query select A from T where A less than 5, because B-tree index 300 is built on the a column referenced in a predicate (A less than 5) of the where clause of the exemplary query. In an index range scan, the first leaf node within the range of values indicated by the predicate of the where clause is located, and the subsequent leaf nodes are visited until the range of values is exhausted. In this example, because the predicate is A less than 5, the lowest valued node in the B-tree index 300 is identified by traversing from branch node 302 to branch node 304 to leaf node 310. Beginning with the first leaf node, as long as each leaf node contains an appropriate column value that satisfies the predicate, the corresponding row is identified and the subsequent leaf node is inspected. In the example, since leaf node 310 matches the predicate, its link to the next leaf node 312 is followed. In this manner, leaf nodes 312, 314, 316, 318, 320, 322 are successively visited until leaf node 324 is reached, terminating the scan because leaf node 324 contains data for a row with a column value of 5 that does not match the predicate. Since the B-tree index 300 does not store data for the columns upon which the index is not built, and since a selected subset of the index entries are read, the index range scan, in appropriate queries, can serve to reduce the amount of data read from the disk and, hence, improve the efficiency of reading values.
One drawback to the index range scan procedure is illustrated with reference to FIG. 3(b), showing a physical layout 300xe2x80x2 of B-tree index 300. After many insertions and deletions to the corresponding table, the physical order of the leaf nodes in the physical layout 300xe2x80x2 might not correspond to the logical (sorted) order of the leaf nodes. In order to perform the above-described index range scan, the first block at Block 1 in Extent 1 (leaf node data 302xe2x80x2) is first visited for the root branch node 302. The left-most pointer 1:4, indicating Block 4 of Extent 1, is consulted because the predicate is a less than inequality. For purposes of example, a disk pointer is presented herein as a combination of a block number and an extent number. However, in typical systems, a disk pointer also includes an offset within the block, because a block can contain data for a plurality of rows. In some database systems, a separate extent number is not used because each block within the disk or portion thereof has a unique number.
Accordingly, block 4 of the same extent 1 (block 304xe2x80x2 at 1:4) is read, retrieving branch node 304, indicating that the first leaf node 310 is stored in block 310xe2x80x2 at 3:1. To read the data for the leaf nodes 310-322, the following respective blocks are read in order: 310xe2x80x2 at 3:1, 312xe2x80x2 at 2:2, 314xe2x80x2 at 4:3, 316xe2x80x2 at 4:1, 318xe2x80x2 at 1:2, 320xe2x80x2 at 3:4, and 322xe2x80x2 at 4:2. Therefore, traversing a B-tree index involves many random accesses to the disk, because the entries in the B-tree index are not necessarily stored in any particular order on the disk. Random access of non-contiguous blocks incurs a large amount of overhead, for example, due to the xe2x80x9cseek timexe2x80x9d for repositioning the head of the disk for each noncontiguous block.
There is a need for improving the efficiency of retrieving values from a column, for example while processing a query that references the column in a predicate. There also exists a need for reducing the amount of data read from a secondary storage unit such as a disk when retrieving information from a table. There is also a need for retrieving values from an index without incurring the disk input/output overhead involved in randomly accessing non-contiguous blocks.
These and other needs are addressed by the present invention, which performs a block-order scan of the index by sequentially reading one or more units of contiguous blocks of the B-tree index. Branch nodes in the blocks are ignored, and column values from within leaf nodes are retrieved. Advantageously, the overhead associated with random access of blocks as per a conventional index range scan is avoided because the seek time for accessing the next contiguous block is smaller than for accessing a non-contiguous block. In addition, reading data from an index involves less disk input/output than a conventional full table scan since the index typically stores only part of the information in the corresponding table. The index may be partitioned to reduce the amount of blocks accessed in the block-order index scan.
One aspect of the invention pertains to a computer-implemented method and a computer-readable medium bearing instructions for retrieving values from a column in a table. Accordingly, an index on the table that contains values from at least the column is identified. The index is stored in one or more units of contiguous blocks and contains within its leaf nodes the values from at least the column. Data is sequentially read from at least one of the units of contiguous blocks, and the values are retrieved for the column from the leaf nodes from the sequentially read data. For example, if the index is partitioned, then values from only selected partitions of the index need be retrieved.
Another aspect of the invention involves a computer-implemented method and a computer-readable medium bearing instructions for retrieving data stored in a table, which includes receiving a query having a predicate that references a column within the table. An index built upon at least the column is identified, and a block-order scan of at least some of the index is performed to retrieve values that belong to said column. For example, if the index is partitioned, then values from only selected partitions of the index need be retrieved.
Still another aspect of the invention relates to a computer-implemented method and a computer-readable medium bearing instructions for retrieving data from an index-organized table. The table is stored in a computer-readable medium and organized as a B-tree data structure upon at least a first column. The B-tree data structure has branch nodes and leaf nodes, in which the leaf nodes contain values for the first column and for other columns of the table. In response to receiving a query having a predicate that references the column, a block-order scan of at least some of the B-tree data structure is performed to retrieve the values for the first column. For example, if the index-organized table is partitioned, then values from only selected partitions of the index need be retrieved.
Still other objects and advantages of the present invention will become readily apparent from the following detailed description, simply by way of illustration of the best mode contemplated of carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.