1. Field of the Invention
This invention relates generally to relational data base systems and, more particularly, to evaluation of assignment statements on large data objects in such systems.
2. Description of the Related Art
Advances in computers and data storage devices have provided users with increasingly faster data access times and with the ability to manipulate increasingly large blocks of information. The storage, retrieval, and manipulation of information are typically accomplished with a data base management system. The information manipulated by users typically comprises data values in the form of numerals and characters. More recently, the manipulated data values have come to include graphic and video forms of data. Graphic and video data are especially prone to including large blocks that must be manipulated.
One type of data base that organizes information for more efficient user manipulation is the relational data base. A relational data base organizes data values into tables with user-defined interrelationships between the tables. A relational data base management system permits users to construct statements that the system will use to extract, insert, and combine data values from the tables. The selected data values can be assigned to new strings, which comprise table entries, or can replace existing strings. For example, users can use a substring operator in an assignment statement to extract a portion of a character string or video image and assign the extracted portion to a new string (or use it to replace the existing string) for further manipulation. Users can use a concatenate operator to join two separate strings into a single string. Further examples of string operators will occur readily to those skilled in the art.
In addition to being organized into tables of related data values, the data values are stored in a relational data base in accordance with storage units defined by a data device hardware configuration. Typically, a relational data base stores a single data value entirely within a single storage unit called a page. A page usually includes between 512 bytes and 32,768 bytes (32 × 1024 bytes, referred to as 32 kilobytes and abbreviated 32 KB) of data values. Storing data values in pages limits the maximum size of a single data item stored within a page to the size of the page itself. To provide less restrictive limits on the data values stored, some relational data base management systems include a specially-defined data type called a long field or large object, generally referred to as a LOB.
In older data base products, LOBs were limited to a maximum of 32 KB, which some systems were able to store in a single page. More recently, data base products permit LOBs to have size limits on the order of many gigabytes (10^6 KB). A data value having a size of several gigabytes potentially could produce significantly slower storage access operations if typical operating techniques are employed. As a result, LOBs are generally managed by a special LOB storage mechanism different from the mechanism used to manage other data types.
Many data base management systems do not support string operations that permit manipulation of LOBs directly, but instead offer only relatively simple store and retrieve access operations, regardless of the actual size of a LOB. That is, once a data base user has defined a data type to be a LOB, potentially having a size of many gigabytes, the operations that can be performed on the LOB will be limited to storing and retrieving the LOB from the relational data base even if the LOB is, in fact, only several kilobytes in size.
Data values of a relational data base typically are stored on one or more data base disk drives. An access operation that retrieves a LOB data value permits the LOB to be read from the disk drives in chunks and placed into either disk files or memory buffers comprising intermediate storage. An intermediate storage disk file is separate from the data base disk drive storage and a memory buffer typically comprises a portion of electronic random access memory (RAM). An access operation that stores a LOB data value permits the LOB to be copied from a disk file or memory buffer and placed into a storage location of the data base. In systems that support only simple store and retrieve operations, any more complicated string manipulation of the LOB data value must be performed on the disk file or memory buffer copy of the data value.
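The chunked transfer between the data base and intermediate storage described above can be sketched as follows. This is only an illustrative sketch: the function name `copy_in_chunks`, the 4096-byte chunk size, and the use of in-memory `io.BytesIO` streams as stand-ins for the data base disk and the memory buffer are all assumptions, not details taken from any particular data base product.

```python
import io

# Hypothetical chunk size; the text does not specify one.
CHUNK_SIZE = 4096


def copy_in_chunks(source, destination, chunk_size=CHUNK_SIZE):
    """Copy all remaining bytes from source to destination, one chunk at a time.

    Returns the number of bytes transferred, i.e. the I/O cost of the copy.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the end of the LOB was reached
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Retrieve: data base -> intermediate storage (stand-in streams).
database_lob = io.BytesIO(b"x" * 10_000)
memory_buffer = io.BytesIO()
bytes_retrieved = copy_in_chunks(database_lob, memory_buffer)

# Store: intermediate storage -> data base location.
memory_buffer.seek(0)
new_database_lob = io.BytesIO()
bytes_stored = copy_in_chunks(memory_buffer, new_database_lob)
```

Note that the cost of each transfer is proportional to the full size of the LOB, regardless of how small the intended change is.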
Placing a LOB in a disk file intermediate storage can require potentially many disk drive storage access operations (I/O accesses) that can significantly impede data manipulations and incur a severe performance penalty. The performance penalty exists even if a relatively minor change is made to a LOB. For example, even if just a single byte is appended to a LOB, every byte of the LOB must be read from the data base disk and written back before the append operation is complete.
Placing a LOB in a memory buffer intermediate storage is somewhat faster than using disk file intermediate storage, due to much faster access times for RAM as compared with disk drive files. Most computer systems, however, do not have sufficient RAM to contain LOBs of any great size. It is unusual for even relatively large mainframe systems to have more than 256 megabytes (MB) of RAM available. As noted above, modern relational data base management systems can permit LOBs to have a size of many gigabytes (thousands of MB).
Some relational data base management systems support more than relatively simple store and retrieve access operations on LOBs. Such systems have the capability of automatically performing LOB handling and manipulation. For example, some relational data base management systems permit a data base user to interactively enter an assignment statement comprising a sequence of string operators and LOB operands specified by names of data values. The system can automatically retrieve the LOB data values needed for the first specified string operation, perform the string operation, and proceed to retrieve the next group of LOB operands and perform the next specified string operation. As noted above, the LOBs can be extremely large and such processing can become intractable for LOBs beyond several hundred megabytes.
It is known to simplify the handling and manipulation of LOBs in assignment statements using a technique called deferred evaluation that links data structures together. In deferred evaluation, the evaluation of predetermined string operators in an assignment statement is deferred until the entire assignment statement is received, rather than the more typical immediate execution of string operators as they are encountered. Typically, a data structure is created for each operand of an assignment statement. Each data structure includes a specification of what string operations are to be performed. The data base management system analyzes the data structures and the string operations and delays actually retrieving any data values from the data base until string operations have been simplified. That is, intermediate results are not written back to the data base disk if they can be used for the next string operation. In this way, disk access operations are reduced. The following example illustrates the advantages of deferred evaluation.
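The linked data structures used by deferred evaluation can be pictured with a minimal sketch such as the one below. All class names (`CountingStore`, `LobRef`, `Substring`, `Concat`) are hypothetical and chosen only for illustration. The sketch assumes that the operand of a substring is a stored LOB, so that the requested byte range can be passed down to the read, and it uses the 1-based start position of the example that follows in the text.

```python
from dataclasses import dataclass


class CountingStore:
    """Hypothetical stand-in for the data base disk; counts bytes read."""

    def __init__(self, values):
        self.values = dict(values)
        self.bytes_read = 0

    def size(self, name):
        return len(self.values[name])

    def read(self, name, start, length):
        data = self.values[name][start:start + length]
        self.bytes_read += len(data)
        return data


@dataclass
class LobRef:
    """Operand node: names a stored LOB without retrieving it."""
    name: str

    def evaluate(self, store, start=0, length=None):
        if length is None:
            length = store.size(self.name) - start
        return store.read(self.name, start, length)


@dataclass
class Substring:
    """Operator node: 1-based start and a length, applied to a stored LOB."""
    source: LobRef
    start: int
    length: int

    def evaluate(self, store):
        # Only the requested range is read from the store.
        return self.source.evaluate(store, self.start - 1, self.length)


@dataclass
class Concat:
    """Operator node: joins the results of two sub-expressions."""
    left: object
    right: object

    def evaluate(self, store):
        return self.left.evaluate(store) + self.right.evaluate(store)


store = CountingStore({"C1": b"a" * 100, "C2": b"b" * 10})
# Building the linked structure performs no data base reads.
expression = Concat(Substring(LobRef("C1"), 1, 50), LobRef("C2"))
reads_before = store.bytes_read
# Retrieval happens only when the complete statement is evaluated.
result = expression.evaluate(store)
```

Because nothing is read while the structure is being built, the system is free to inspect the whole statement first and to hand the substring result directly to the concatenate operation.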
Consider an assignment statement using the "substring" and "concatenate" string operators and having the following form:
C1 = [substring (C1, 1, 50 000 000)] concatenate [C2],
which indicates that a substring will be extracted from a LOB called C1, the substring comprising the first 50 million bytes of C1, and that the extracted substring will be concatenated with a LOB called C2. The final, concatenated result will be stored into the data base disk location that originally contained C1. Without deferred evaluation, the relational data base management system would immediately evaluate the assignment statement by first reading the C1 data value from the data base disk into an intermediate storage file or memory buffer. The C1 intermediate copy then would be truncated, leaving only the first 50 million bytes. The truncated C1 copy would be stored back into the data base disk, completing the immediate evaluation of the first operator (the substring operator). The concatenate operation would then be encountered, so the now-truncated C1 data value would be re-read from the data base disk back into a file or memory buffer and the C2 data value would be read into another file or memory buffer. The two data value copies would then be concatenated and the result would be stored back into the data base disk at the C1 data value location.
In the example above, if C1 has an initial size of 100 million bytes and C2 has an initial size of 1000 bytes, then a total of 150,001,000 bytes would be retrieved from the data base disk (original C1, truncated C1, and C2) and a total of 100,001,000 bytes would be stored (truncated C1, and concatenated C1 and C2). Thus, a total of 250,002,000 bytes of storage access operations would be performed using an immediate evaluation scheme.
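The byte counts for immediate evaluation can be checked with a short calculation. The function name `immediate_evaluation_io` and its parameter names are hypothetical; the steps simply mirror the sequence of reads and writes described above.

```python
def immediate_evaluation_io(c1_size, c2_size, substring_len):
    """Return (bytes_read, bytes_written) for immediate evaluation of
    C1 = substring(C1, 1, substring_len) concatenate C2."""
    bytes_read = 0
    bytes_written = 0

    # Substring operator: read all of C1, write back the truncated copy.
    bytes_read += c1_size
    bytes_written += substring_len

    # Concatenate operator: re-read the truncated C1 and read C2,
    # then write the concatenated result to the C1 location.
    bytes_read += substring_len
    bytes_read += c2_size
    bytes_written += substring_len + c2_size

    return bytes_read, bytes_written


read_total, written_total = immediate_evaluation_io(
    c1_size=100_000_000, c2_size=1_000, substring_len=50_000_000
)
# read_total is 150,001,000 and written_total is 100,001,000,
# for a combined 250,002,000 bytes of storage access.
combined_total = read_total + written_total
```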
A relational data base management system using deferred evaluation would evaluate the assignment statement above by receiving the entire assignment statement before performing any evaluation and recognizing that the result of the substring operation is used by the concatenate operation. The system would still perform the substring operation, but the intermediate storing of the substring result and the subsequent retrieval of that result from the data base disk would be avoided, as follows.
In the initial step, the relational data base management system would retrieve only the first 50 million bytes of C1 from the data base disk and store them in a temporary file or memory buffer, producing a truncated copy of C1 in the file or memory buffer. Next, having recognized that the next operation (concatenate) makes use of the intermediate result, the system would avoid storing the truncated C1 back into the data base disk. Instead, the system would leave the truncated C1 in the file or memory buffer and retrieve C2 from the data base disk, storing the C2 copy in another file or memory buffer. The system then would perform the concatenation of the truncated C1 and the C2 copy, storing the result back into the data base disk at the C1 location. In this deferred evaluation example, a total of 50,001,000 bytes would be retrieved and a total of 50,001,000 bytes would be stored. Thus, a total of 100,002,000 bytes of storage access operations would be performed. It should be apparent that storage access operations have been reduced by about 60 percent, that is, by more than one-half, relative to the 250,002,000 bytes of the immediate evaluation processing scheme.
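The deferred evaluation byte counts can be verified in the same manner. The function name `deferred_evaluation_io` is hypothetical, and the immediate evaluation total of 250,002,000 bytes is taken from the text for comparison.

```python
def deferred_evaluation_io(c2_size, substring_len):
    """Return (bytes_read, bytes_written) for deferred evaluation of
    C1 = substring(C1, 1, substring_len) concatenate C2."""
    # Only the needed prefix of C1 and all of C2 are retrieved.
    bytes_read = substring_len + c2_size
    # The intermediate truncated C1 is never stored; only the final
    # concatenated result is written to the C1 location.
    bytes_written = substring_len + c2_size
    return bytes_read, bytes_written


IMMEDIATE_TOTAL = 250_002_000  # figure stated in the text

deferred_read, deferred_written = deferred_evaluation_io(
    c2_size=1_000, substring_len=50_000_000
)
deferred_total = deferred_read + deferred_written  # 100,002,000 bytes

# Fraction of the immediate evaluation traffic that remains (about 0.4).
remaining_fraction = deferred_total / IMMEDIATE_TOTAL
```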
Those skilled in the art will appreciate that the linked data structures of the deferred evaluation technique are but one method of simplifying the processing of LOB assignment statements. Other simplification techniques using data lists or arrays will readily occur to those skilled in the art.
Although deferred evaluation provides significant processing time savings and reduced disk access operations over immediate evaluation, analysis of the operations in the example above shows that further time savings and reductions in disk access operations are possible. It would be advantageous if the relational data base management system could recognize, for example, that the first 50 million bytes of the result of the substring and concatenate operations are written back into the same data base disk locations from which they were retrieved, and that operating efficiency could be improved if the amount of retrieval and subsequent restorage were reduced. In the substring/concatenation example above, if the C1 substring were not moved back and forth between the data base disk and intermediate file or buffer memory storage at all, then only the 1000 bytes of C2 would need to be retrieved and stored. Disk access operations would then fall from the 250,002,000 bytes of the immediate evaluation scheme to 2000 bytes, a reduction by a factor on the order of 125,000, and by a factor on the order of 50,000 relative to the 100,002,000 bytes of the deferred evaluation scheme.
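The 2000-byte figure and the resulting ratios can be reproduced with a brief calculation. The function name `in_place_evaluation_io` is hypothetical, and the sketch assumes, as the paragraph above implies, that discarding the portion of C1 beyond the first 50 million bytes requires no transfer of data bytes.

```python
def in_place_evaluation_io(c2_size):
    """Return (bytes_read, bytes_written) when the retained prefix of C1
    is left untouched on the data base disk.

    Assumption: truncating C1 moves no data bytes, so only C2 is
    transferred.
    """
    bytes_read = c2_size      # retrieve C2
    bytes_written = c2_size   # write C2 immediately after the C1 prefix
    return bytes_read, bytes_written


IMMEDIATE_TOTAL = 250_002_000  # immediate evaluation figure from the text
DEFERRED_TOTAL = 100_002_000   # deferred evaluation figure from the text

in_place_read, in_place_written = in_place_evaluation_io(c2_size=1_000)
in_place_total = in_place_read + in_place_written  # 2000 bytes

# Reduction factors relative to each of the earlier schemes.
factor_vs_immediate = IMMEDIATE_TOTAL // in_place_total
factor_vs_deferred = DEFERRED_TOTAL // in_place_total
```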
From the discussion above, it should be apparent that there is a need for a relational data base management system that permits reduced disk access operations in evaluating assignment statements by recognizing opportunities for efficiency beyond those afforded by conventional deferred evaluation techniques. The present invention satisfies this need.