The present invention relates to database management systems and more particularly to updating indexes in response to parallel execution of data manipulation operations.
To fully utilize the computing power of a multi-processing system, a larger task (a xe2x80x9cparent taskxe2x80x9d) may be divided into smaller tasks (xe2x80x9cwork granulesxe2x80x9d) which are then distributed to processes (xe2x80x9cslavesxe2x80x9d) running on one or more processing nodes. The slaves execute their assigned work granules in parallel with the other slaves, causing the parent task to complete faster than if it were executed by a single process. Each node may contain multiple processors and multiple concurrent processes. The process that divides parent tasks into work granules and distributes the work granules to processes on the various processing nodes is referred to herein as the coordinator process or xe2x80x9cmasterxe2x80x9d.
Multi-processing computer systems typically fall into three categories: shared everything systems, shared-disk systems, and shared-nothing systems. The constraints placed on the coordinator process during the work granule distribution process vary based on the type of multi-processing system involved. In shared-everything systems, processes on all processors have direct access to all dynamic, i.e., volatile, memory devices (hereinafter generally referred to as xe2x80x9cmemoryxe2x80x9d) and to all static, i.e., persistent, memory devices (hereinafter generally referred to as xe2x80x9cdisksxe2x80x9d) in the system.
In shared-disk systems, processors and memories are grouped into nodes. Each node in a shared-disk system may itself constitute a shared-everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within the particular node.
In shared-nothing systems, all processors, memories and disks are grouped into nodes. In shared-nothing systems as in shared-disk systems, each node may itself constitute a shared-everything system or a shared-disk system. Only the processes running on a particular node can directly access the memories and disks within the particular node. Of the three general types of multi-processing systems, shared-nothing systems typically require the least amount of wiring between the various system components.
FIG. 1 illustrates an example shared-nothing multiprocessor system with four nodes 110 including three banks of disks 150. Disk banks 151, 152 and 153 are local to nodes 111, 112 and 113, respectively. A coordinator process, master 120 running on node 113, has spawned four slaves, slave W 131, slave X 132, slave Y 133 and slave Z 134 running on nodes 111, 112, 113 and 114, respectively.
Databases that run on multi-processing systems typically fall into two categories: shared-disk databases and shared-nothing databases. A shared-disk database expects all disks in the computer system to be visible to all processing nodes. Consequently, a coordinator process in a shared-disk database may assign any work granule to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work granule.
Shared-disk databases may be run on both shared-nothing and shared-disk computer systems. To run a shared-disk database on a shared-nothing computer system, as shown in FIG. 1, software support may be added to the operating system or additional hardware may be provided to allow processes to have direct access to remote disks. For example, software support may allow slave W 131 on node 111 to have direct access to disk bank 153 on node 113.
In general, however, a node""s access to its local disks may be more efficient than its access to remote disks. For example, node 111""s access to disk bank 151 is more efficient than node 111""s access to disk bank 153. A node is said to have an xe2x80x9caffinityxe2x80x9d for the data stored on the node""s local disks.
Relational databases store information in indexed tables that are organized into rows and columns. FIG. 2 illustrates a sample table for an example relational database. Each row as 210 in the table 200 represents an individual record. Each column 220 in the table represents a different kind of information or xe2x80x9cattributexe2x80x9d that is of interest for all the records. For example, Table Emp 200 stores employee records which contain columns 220 that correspond to the following attributes: employee name (Empname), employee number (Empno), social security number (SSN), department number (Deptno), Salary, job title (Title), date of employment (Startday), etc., in columns 221, 222, 223, 224, 225, 226 and 227, respectively. Each row 210 in the Table Emp 200 stores the same attributes for each individual employee, one attribute per column 220, but the values of an attribute stored in a column 220 may change. In this example, an employee record is uniquely identified by a social security number, SSN, column 223, or an employee number, Empno, column 222. In tables where none of the attributes are unique, a unique attribute called a ROWID may be generated for the record by the database. A user retrieves information from the tables by entering a request that is converted to queries by a database application program, which then submits the queries to a database server. In response to the queries, the database server accesses the tables specified by the query to determine which information within the tables satisfies the queries. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately to the user.
In a typical database system, data is stored in a table in an unordered form. As records are entered into a table, they are inserted into the next available location in a non-volatile, i.e., persistent, storage device, such as a fixed disk drive or optical device. Such a location can often be, relative to the location of the previous record, at a non-contiguous storage sector of the persistent storage device. Over time, as records are added or dropped, the physical arrangement of the data in the persistent storage device usually does not correspond to the order of values in any of the attributes of the table. The data for consecutive rows may appear to be randomly spread over a number of blocks in the persistent storage device. Consequently, it is not always possible to directly access or retrieve the record or range of records that satisfy a given set of search criteria. For example, in order to locate all rows in a table that have a given value in a column A, every row of the table must be fetched and column A of each row examined. Even when a row with the target value in column A is found, the remainder rows in the table must be fetched and column A examined, unless the values in column A are established to be unique.
Another problem associated with data retrieval is that, typically, data for a particular row is stored in one or more units of contiguous blocks in a persistent storage device. A block is the smallest quantity of data that can be read from a persistent store into dynamic memory. If a database system requires any information stored in a particular block, the database system must read the entire block into memory. To retrieve values for a target column of a table, the database system must read the entire block or all the blocks that have any data from that column of the table, rather than reading only that portion of the block or blocks that contain values from the target column of the table. Since values for the target column may be present in all or almost all the blocks of a table, the entire table or significant portion thereof must be read into memory in order to retrieve the column values. In such a case, the database server, in response to a query, performs a xe2x80x9cfull table scanxe2x80x9d by fetching every row of the table and examining the column or columns referenced in the search criteria specified in the query. This retrieval can be very costly because, if the amount of the data for the columns not used in the query is very large, then the full table scan methodology becomes very inefficient due to the unnecessary amount of disk input/output.
Accordingly, in one approach to improving the efficiency of data retrieval, database systems provide indexes to increase the speed of the data retrieval process. A database index is conceptually similar to a normal index found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied with the location of the information. Values in one or more columns of a table are stored in an index, which is maintained separately from the actual database table (the underlying base table). The ordered list of information in an index allows for quick scanning to find a target value or range of values. Moreover, since a conventional index stores only the values from one or more columns that serve as the key to the index, a pointer and a unique row identifier if necessary, the number of blocks of data being read into memory is significantly reduced as compared to a full table scan.
One structure that may be used for an index is a B-tree structure. The logical layout of a sample B-tree 300 for Table Emp 200 is illustrated in FIG. 3 for a case in which the key to the index, i.e., the index key, is Deptno, column 224. A B-tree index is a hierarchical arrangement of two types of elements: leaves and branches. Leaves reside at the lowest level of the B-tree hierarchy and are associated with a range of index key values. A leaf contains index entries for the rows that have index key values in the key value range associated with the leaf. Each entry in a leaf contains a key value and a unique row identifier that is used to locate, within the table, the row associated with the entry.
For example, FIG. 3 shows four leaves 310, 320, 330 and 340 that collectively hold the index entries associated with index key values (department numbers) ranging from 10 to 40. Specifically, leaf 310 holds the index entries for the rows with index key (Deptno) values 10 and 11. Associated with each index key value 311 is an employee number, Empno, from column 222, which serves as the unique row identifier 312. In other approaches, the rowid itself serves as the unique row identifier.
In some database implementations, one B-tree leaf is stored per block of persistent storage. However, the block used to store a leaf need not be filled by the index entries of the leaf. For example, part of the block may be empty to accommodate later additions to the leaf. When the leaf does not fill the block, the bits in the unused portion of the block are simply ignored. When an attempt is made to insert into a leaf more information than will fit on a block, the value ranges associated with the leaf and one or more other leaves are revised, and additional leaves may be created, to redistribute the index entries across the leaves in a way that prevents any individual leaf from being overloaded. Each leaf also contains a pointer or other link to the subsequent leaf. For example, leaf 310 points to leaf 320.
The other elements in the B-tree hierarchy are the branches. Branches contain information that indicates at least one range of values for the index key, and, for each range, a pointer or other link to the next lower element of the hierarchy, whether a leaf or another branch. The branch at the top of the hierarchy is called the root of the B-tree. By following the links from the root branch using the pointers associated with the range that includes the index key value of interest, the leaf containing that value can be found. The beginning of the ranges may be implied, and only the end of the ranges need be specified in a branch, as in FIG. 3. For example, in FIG. 3, the root 302 indicates two ranges, the first for index key Deptno values from 0 to 20, and the second for values from 21 to 40. If a Deptno value of 11 is of interest, the database server would follow the link associated with the range from 0 to 20, which indicates branch 304. Branch 304 indicates that the first range is from 0 to 11 and points-to a block of storage containing leaf node 310. Since the first range contains the value of interest, block 310 is retrieved. The Deptno value of 11 is found beginning in the second entry from the end of the leaf. Although FIG. 3 depicts leaf nodes that have separate index entries for every row (even though some rows have the same index key value), other arrangements are possible. For example, a single index entry may be used for each key value, e.g. 11, where index entry for any key value is followed by a list of row identifiers for the rows that have that key value.
Data in indexed tables, such as Table Emp 200 in FIG. 2, are manipulated with a set of commands which can be called Data Manipulation Language (DML) commands. Typically, the DML commands supported by database systems include, for example, commands to delete rows, insert rows, and update rows. The update row operation is often implemented as a delete row followed by an insert row. The delete row, insert row, and update row operations are referred to hereinafter as insert, delete, and update.
In a multiple-node system, to make use of the multiple nodes, database servers have been designed to support partitioned tables and parallel processing of DML commands. In a partitioned table, the rows of the table are grouped into distinct partitions that may be spread over multiple nodes. For example, FIG. 1 illustrates a database table that is partitioned into three partitions, Partition A 161, Partition B 162 and Partition C 163 on the disk banks 151, 152 and 153, respectively, local to nodes 111, 112, and 113, respectively.
The division of rows into partitions is usually based on the value of a table partition key defined by one or more columns of the table. For example, the Table Emp 200 can be divided among the three disk banks using employee number, Empno, in column 222 as the partition key. For purposes of illustration, Empno values from 1-100 may be stored in Partition A on disks 151 local to node 111, records of employees 101 to 200 may be stored in Partition B on disks 152 local to node 112, and rows for employees 201 to 300 may be stored in Partition C on disks 153 local to node 113.
Each node can maintain an index for the rows in its own partition of the table. Such indexes, which only index the rows that belong to a partition of a table and not the entire table, are referred to as local indexes. At least one of the nodes, e.g. node 113, may maintain a global index 170 for all the rows of the Emp table. A global index is an index that contains index entries for all rows of the table.
In a parallel processing system, DML operations are often divided into work granules which are spread across the nodes of the system for parallel execution. Such parallel DML (PDML) operations may be used where the database itself is partitioned, or where the operations are performed in bulk, or both. A bulk DML operation is a set of related DML operations that affects a large number of rows, such as when the salary of every employee in a 40,000 employee company is raised a given amount in an employee database.
During PDML operations, the PDML master process, herein called the Coordinator Process, CP, distributes DML operations among several slaves on corresponding nodes based, at least partly, on a key which is herein called a partition key. If the table involved in the PDML operation is statically partitioned, the partition key used to divide the work granules of the PDML operation may be the same as the key that was used to partition the table. If the table is not partitioned, then the partition key refers to the columns, if any, used to divide the PDML work granules for distribution among slaves.
An index must also be updated to reflect changes to the table upon which it is built when (1) data in the one or more columns serving as an index key of the index are changed, or (2) data in the one or more columns serving as the unique row identifier used by the index are changed. For example, index 300 would have to be revised if values in the Deptno column 224 of table 200 are changed.
When the operation being performed on a table is a PDML operation, each slave involved in the PDML operation is responsible for modifying the global index to reflect the changes made by the slave to its assigned partition of the table. However, if the key upon which the global index is built is not the same as the partition key used divide the PDML operation, then maintaining global index becomes a bottleneck. For example, updating index 300 may become a bottleneck in a PDML operation performed on table 200 if Deptno is not the partitioning key used to distribute the work granules of the PDML operation.
Specifically, under these conditions, maintaining a global index during a PDML operation results in a loss of clustering, contention for resource locks, and block pinging. The proportional gains expected from parallel operations, e.g., halving the duration of operations when the number of processors used is doubled, are not observed. The observed gains are less, i.e., scaleable parallelism is not attained.
Clustering refers to the physical nearness to each other of the data items upon which a process operates. In general, the more closely clustered the data items that are accessed during an operation, the fewer the disk accesses required by the operation and, consequently, the more efficient the operation. Unfortunately, when the index key of an index is not the partitioning key used distribute the work of a PDML operation, the changes that any given slave has to make to the global index are not likely to be clustered. For example, a particular slave may be assigned rows that correspond to a particular range of Empno values. The rows that fall within that range of Empno values may have Deptno values that correspond to index entries that are randomly distributed within the leaves of index 200. Without clustering, the benefits of caching are reduced.
With respect to contention, different slaves may have to contend for the same global index block at the same time. For example, different slaves may be updating table rows that correspond to index entries in the same index block, and thus different slaves may have to update the same index block. On a shared-everything system, memory is shared and block contention causes slaves to wait for each other to have access to the block, e.g., for a latch/lock manager to make the block available to the slave for the desired operation. On such a system, the second slave requesting the block must wait for the first node to release the latch/lock on the block.
On a shared-disk or shared-nothing system, block contention can cause block pinging. On such a system, the request for the block by the second node is granted only after the first node flushes the block from its cache to persistent storage and releases the lock. Then the node of the second slave reads the persistent storage device to retrieve the block and places the block in its cache for the second slave. The first node will then request the lock manager to get the block back. After the second slave makes a single or limited number of changes, the second node will flush the block from its cache onto the persistent storage. Then the lock manager will allow the first node to obtain a lock on the block. The same block can thus be shuttled back and forth between the persistent storage device and the nodes that are executing the alternating two (or more) contending slaves. Further, it is possible that none of the contending slaves are on the node for which the persistent storage device is local. If none are local, transmissions and extra overhead to read or write to a non-local disk are also required, consuming even more system resources.
For example, when the persistent storage device is a disk, each block ping implies two disk input/output (I/O) operations, one flush to the disk from the cache of the first node, and one read back from the disk into the cache for the second node. Block pinging is worse than the delays experienced by a shared-everything system because block pings not only delay results but also consume system I/O resources. These block pings make the PDML with a global index typically worse for shared-nothing and shared disk systems than for a shared everything system.
For example, assume that table 200, partitioned by Empno as shown in FIG. 1, requires a bulk update setting Deptno=Deptno+1 for all records. To perform the update, the master process, CP, forms three work granules to distribute to three PDML slaves. Slave W 131 on node 111 updates Partition A that contains rows with Empno values from 1 to 100, slave X 132 on node 112 updates Partition B that contains rows with Empno values from 101 to 200, and slave Y 133 on node 113 updates Partition C that contains rows with Empno values from 201 to 300. The same slaves will update the local indexes associated with their partitions.
If a global index 170 maintained on disk bank 152 of node 112 uses Deptno as an index key, that index also must be updated because the values of Deptno have been changed. Using conventional PDML techniques, slave W 131 updates the global index 170 for employees 1 to 100, slave X 132 updates the global index 170 for employees 101 to 200, and slave Y updates global index 170 for employees 201-300. FIG. 3 shows that to update Deptno for employee 100, slave W 131 must read leaf block 310; and, to update employee 101, a different slave, slave X 132, also must update leaf block 310. Because slave W 131 does not know about the update being made by slave X 132, the two updates are not coordinated and two separate reads of block 310 into cache and two separate flushes from cache occur. The benefits of caching are lost.
On a shared-nothing system, as in FIG. 1, after slave X 132 has completed the update of Deptno for Empno xe2x80x9c101 xe2x80x9d and before slave X 132 has updated Deptno for Empno xe2x80x9c102,xe2x80x9d the block may be requested by slave W on node 111. Then node 112 flushes the block from its cache and writes the block to disk. Then node 111 causes a read of the block from disk and transmission of the block to node 111 where it is placed in the cache of node 111; and the Deptno value for Empno xe2x80x9c1xe2x80x9d is updated by slave W 131. But before slave W 131 can update Deptno for Empno xe2x80x9c2,xe2x80x9d slave X may have requested block 310 for updating Deptno for Empno xe2x80x9c102.xe2x80x9d Consequently, node 111 flushes the block from its cache, transmits it to node 112 which stores it on disk 152, then node 112 reads the block from disk into the cache on node 112 for slave X 132. Slave X updates Deptno for Empno xe2x80x9c102.xe2x80x9d Slave W may then request the block again and node 112 again flushes the block from cache to disk 152 so node 111 can access it. Thus block 310 pings repeatedly from cache to disk to cache, heavily consuming I/O resources in the process.
The need to achieve scaleable parallelism thus translates to a need to update a global index as a result of PDML operations without suffering the deficiencies of lost clustering, or contention for the same block, the latter leading to excessive waits or to block pinging.
Techniques are provided for coordinating an update of a global index of an indexed table. According to one technique, a coordinator process receives index maintenance records from a plurality of data manipulation slaves for the indexed table. Each index maintenance record includes a value for an index key of a global index of the table. The coordinator process computes a plurality of ranges of index key values, and assigns to each of a plurality of index update slaves those records that fall in each range of the plurality of ranges. Each index update slave updates the global index based on the records assigned to it.
In another aspect of the invention, each slave reads an index key value from a current index maintenance record, and updates the global index using the current index maintenance record if the index key value falls within the range of values assigned to the slave.
In another aspect of the invention, techniques are provided in which an index update distribution table is sent from a coordinator process to an index update slave process. The table maps index key value ranges to slave identifications. The index update slave reads an index key value from a current index maintenance record, and locates the range in the index update distribution table that encompasses the value read. If the slave identification associated with the located range corresponds to the identification of the slave, the slave updates the global index using the current index maintenance record.
In another aspect of the invention, techniques for maintaining a global index of a table during parallel data manipulation operations involve a coordinator process, data manipulation slaves and index update slaves. The coordinator process distributes data manipulation operations among a plurality of data manipulation slaves. Each data manipulation slave performs data manipulation operations on rows of the table and sends an index maintenance record containing an index key value to the coordinator process. The coordinator process receives index maintenance records from the plurality of data manipulation slaves, computes a plurality of index key value ranges, and sends each range to a respective index update slave. Each index update slave reads a index key value from a current index maintenance record, and updates the global index using the current index maintenance record if the index key value falls within the range of values assigned to the index update slave.
In another aspect of the invention, techniques for using a global index includes performing a data query using the global index to provide a query result. If an out-of-date flag indicates the global index is not current, then a set of sorted index maintenance records, each record containing a value of an index key for the global index and an index operation, is searched for a match to the index key values used in the data query. The query result is modified according to the index operation.