The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Relational databases store information in tables that are organized into rows and columns. In non-relational terminology, a row is a record and a column is a field. A user retrieves information from the tables by entering a request that is converted to queries by a database application, which then submits the queries to a database server. In response to the queries, the database server accesses the tables specified by the query to determine which information within the tables satisfies the queries. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately to the user.
FIG. 1 illustrates a logical layout of an exemplary table 100 within a relational database. Table 100 comprises three user columns, column A 104, column B 106, and column C 108, and eleven rows 110-128. Table 100 also contains an internal column or pseudocolumn 102, referred to as a rowid. A table's rowid is retrievable by query and uniquely identifies a row in the table, but is not normally displayed when the structure of the table is listed. For example, a rowid of 221 uniquely identifies row 110, which contains the values of 3 in column A 104, 5 in column B 106, and 2 in column C 108. In this example, the values of the columns A 104, B 106, and C 108 are integers, but it is to be understood that columns of a database table can hold values of any of a variety of types including floating point numbers and variable length strings of characters, dates, etc.
For any given database application, the queries to retrieve information from a table must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify the information to be retrieved. For example, in the Structured Query Language (SQL), the query, select A from T where A<5, requests the retrieval of the information contained in column A of specified rows of table T that satisfies a specified condition. The conditions in the where clause specify one or more predicates, in this example A<5, which must be satisfied by matching rows. In the example, rows 110, 114, 116, 118, 120, 124, and 128 of table 100 satisfy this query because the corresponding values of column A 104 are 3, 3, 2, 1, 4, 2, and 4, respectively. On the other hand, rows 112, 122, and 126 of table 100 do not satisfy this query because the corresponding values of column A 104 are 6, 7, and 8, respectively.
In order to process the exemplary query, a full table scan may be implemented that reads all the rows from table 100. Thus, the full table scan reads the data for all the columns in table 100, even though only the information from column A 104 was necessary to process the query. If the amount of the rows in a table and/or the data for the columns not used in the query is very large, then the full table scan methodology becomes very inefficient because of the unnecessary amount of disk input/output.
Accordingly, many database systems provide indexes to increase the speed of the data retrieval process. An index is a body of entries ordered by key values of a key. An index key may be, for example, a column of a table. Such a key is referred to herein as a key column. Each index entry identifies a record (e.g. row).
One implementation of a database index is a B-tree, whose logical layout is illustrated in FIG. 2. A B-tree index is a hierarchical arrangement of two types of nodes: leaf nodes and branch nodes. Leaf nodes reside at the lowest level of the B-tree hierarchy and contain values from the actual column or columns upon which the index is built and the rowid of the corresponding rows. Leaf nodes may comprise multiple entries that contain data for many rows, e.g. 100 entries corresponding to 100 rows, but, for purposes of example, leaf nodes are illustrated as containing a single entry. For example, B-tree index 200, being built upon column A 104 of table 100, has leaf nodes 210-228 collectively holding the values of column A 104. Specifically, leaf node 210 holds the index value 1 from column A 104 and the rowid 118, which identifies row 118 of table 100. As another example, leaf node 228 contains an index value of 8 from column A 104 and a rowid of 123, identifying row 126 of table 100. Each leaf node contains a pointer or other link to a previous and subsequent leaf node if such a leaf node exists. For example, leaf node 226, which contains an index value of 7, points to leaf node 228, which contains an index value of 8, and to leaf node 224, which contains an index value of 6.
The non-leaf nodes of a B-tree index are branch nodes. Branch nodes contain information that indicate a range of values. In the illustrated B-tree index 200, nodes 202, 204, 206, and 208 are branch nodes and therefore each corresponds to a range of values. The range of values indicated by each branch node is such that all nodes that reside below a given branch node correspond to the values that fall within the range of values for the given branch node. For example, node 206 is a branch node that corresponds to the numerical range from 4 to 6. Consequently, nodes 220, 222, and 224, which all reside below node 206 in the hierarchy, correspond to values that fall within the range from 4 to 6.
A database server can perform an “index range scan” on B-tree index 200 to process the exemplary query select A from T where A<5, because B-tree index 200 is built on a column referenced in a predicate (A<5) of the where clause of the exemplary query. In an index range scan, the first leaf node within the range of values indicated by the predicate of the where clause is located, and the subsequent leaf nodes are visited until the range of values is exhausted. In this example, because the predicate is A<5, the lowest valued node in the B-tree index 200 is identified by traversing from branch node 202 to branch node 204 to leaf node 210. Beginning with the first leaf node, as long as each leaf node contains an index value that satisfies the predicate, the corresponding row is identified and the subsequent leaf node is inspected. In the example, since leaf node 210 matches the predicate, its link to the next leaf node 212 is followed. In this manner, leaf nodes 212, 214, 216, 218, 220, 222 are successively visited until leaf node 224 is reached, terminating the scan because leaf node 224 contains data for a row with an index value of 6 that does not match the predicate. Since the B-tree index 200 does not store data for the columns upon which the index is not built, and since a selected subset of the index entries are read, the index range scan, in appropriate queries, can serve to reduce the amount of data read from the disk and, hence, improve the efficiency of reading values.
However, there are many instances where a user does not know, a priori, the distribution of data of a particular column in a particular table. Because, the distribution of data determines the range, the user, therefore, does not know what range to include in a query of the database. The user may have an idea of a particular value to search for, but most likely the user will not know if the applicable table contains that particular value.
For example, a user may want to determine which employees have a salary nearest to a target value of $50,000, but the user is unaware of the data distribution (i.e. unaware of whether, or how many, employees have salaries relatively near $50,000). In that case, a range scan is not efficient because the user may only desire the single nearest match (i.e. one row from an employee information table). More importantly, a range scan may not return the closest match since the closest match may be outside of the range provided by the user. For instance, if the user submitted a query indicating rows in the employee information table that have salary values greater than $45,000, the nearest match to $50,000 may be outside that range, such as $44,900.
One way to return the nearest match to the target value of $50,000 is to submit two queries:
select EmpID, Salary into :ID1, :Sal1from Empwhere Salary >= 50000 and rownum < 2;select EmpID, Salary into :ID2, :Sal2from Empwhere Salary < 50000 and rownum < 2;
Each query limits the number of returned rows to one (“rownum<2”). A comparison must then be performed between the absolute differences of the salaries of the returned rows in order to determine which salary is closest to the target value. An example of such a comparison is the following:
if abs(:Sal1 − 50000) <= abs(:Sal2 − 50000) thenreturn :ID1elsereturn :ID2
The above approach suffers from two significant problems—usability and performance. First, the three-step process is cumbersome. Second, because two queries are being executed, the index is accessed twice. Thus, there is a need to provide a more efficient mechanism for retrieving a row in a table with a value nearest a target value.