The present invention relates to an apparatus and method for processing data in a data base in a computer system.
It should be noted that a data base is arranged according to a relation, which a user views as a two-dimensional table, and a row of this two-dimensional table corresponds to a tuple. Also, each tuple is made up of more than one attribute (each referred to as a column).
In data base processing, for instance, relational data base processing, the data to be processed resides in a secondary storage (an external memory device, such as a magnetic disk device). A large quantity of data must be read and transferred for each of a number of data base operations. When a large amount of data must be transferred in such a data base system, the time required for the data transfer degrades the performance of the data base system.
To this end, one method has been proposed for making efficient use of the time period during which the data is transferred from the secondary storage. That is, the data transfer time period is overlapped with another time period during which the data base is processed. This technique is well known in the art. In general, since the sizes of both programs and data greatly exceed any extension of the main storage capacity, a concept of a current storage architecture is indispensable.
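The overlap of data transfer with data base processing described above can be illustrated with a minimal sketch. It is not taken from any particular system: `read_block`, `process`, and `scan_overlapped` are hypothetical names, the secondary storage is simulated by a short sleep, and a single background thread stands in for the input/output device.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def read_block(block_no):
    """Hypothetical stand-in for a transfer from the secondary storage."""
    time.sleep(0.01)  # simulated seek and read time
    return list(range(block_no * 4, block_no * 4 + 4))


def process(block):
    """Hypothetical stand-in for the data base processing of one block."""
    return sum(block)


def scan_overlapped(num_blocks):
    """Process blocks in order while the next block is already being read."""
    totals = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, 0)
        for block_no in range(num_blocks):
            block = pending.result()  # wait for the current transfer to finish
            if block_no + 1 < num_blocks:
                # Start the next transfer before processing the current block.
                pending = io.submit(read_block, block_no + 1)
            totals.append(process(block))
    return totals


print(scan_overlapped(3))
```

In this sketch the transfer of block n+1 runs while block n is being processed, so the transfer time is hidden only to the extent that processing takes at least as long as a transfer.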
Also, in most data base management systems, in particular a relational data base management system (simply referred to as a "DBMS" hereinafter), a buffer (data base buffer) in a main storage is utilized to hold a copy of a part of the data base stored in a secondary storage, in order to shorten the input/output processing time. If data which has already been fetched into the data base buffer of the main storage still remains there after it has been processed, then when the data is needed again, the input/output processing to the secondary storage, which has the higher access cost, is no longer required, resulting in an improvement in the performance of the entire system. However, since the memory capacity of the data base buffer is limited, its content must be replaced, so that the replacement algorithm influences the system performance. As a consequence, the selection basis for determining which data should be kept in the data base buffer when data is replaced is important. As one selection basis, for instance, the LRU (Least Recently Used) algorithm has been proposed. This algorithm is based upon the assumption that the longer the time elapsed since the last access to data, the smaller the possibility that the data will be accessed again. However, since normal data base processing includes both sequential access processing and random access processing, this assumption is not satisfied in every case. As a result, it is not effective for the data base buffer to rely upon the simple LRU algorithm. Since the input query is analyzed to form an internal processing procedure in the relational DBMS, it is possible to determine the data base reference characteristics.
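The limitation of the simple LRU algorithm for sequential access can be seen in a small simulation. The following is a minimal sketch under stated assumptions, not the buffer manager of any actual DBMS: the class `LRUBuffer`, the helper `run`, and the two access traces are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class LRUBuffer:
    """Fixed-capacity page buffer with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page id -> contents, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return
        self.misses += 1  # stands in for a read from the secondary storage
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[page_id] = f"page-{page_id}"


def run(trace, capacity):
    """Replay a trace of page ids and return (hits, misses)."""
    buf = LRUBuffer(capacity)
    for page_id in trace:
        buf.access(page_id)
    return buf.hits, buf.misses


# Repeated access to a working set that fits in the buffer.
hot_trace = [1, 2, 3, 1, 2, 3, 1, 2, 3]
# Repeated sequential scan of 4 pages through a 3-page buffer.
scan_trace = [1, 2, 3, 4] * 3

print(run(hot_trace, 3))   # (6, 3)
print(run(scan_trace, 3))  # (0, 12)
```

With a buffer of three pages, the first trace misses only on the initial loads, whereas the repeated scan of four pages misses on every access, because LRU always evicts exactly the page that the scan will need next.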
As conventional techniques that address the above-described problems by using this reference characteristics information, DBMIN, the hot set model, and the like have been proposed.
For example, reference is made to Chou, H., and DeWitt, D. J. "An Evaluation of Buffer Management Strategies for Relational Database Systems" Proc. of the 11th Conf. on VLDB 1985, pages 127-141, and Sacco, G. M., and Schkolnick, M. "A Technique for Managing the Buffer Pool in a Relational System Using the Hot Set Model" Proc. of the 8th Conf. on VLDB 1982, pages 257-262.
In addition, another conventional approach is well known in which an input/output device including a cache storage is arranged between a main storage and a secondary storage so as to shorten the input/output processing time. When a record is input or output, this cache storage reads, as a whole and in units of a plurality of records, the data stored close to that record in the secondary storage. Then, for a type of processing in which the record demanded next is highly likely to be already stored in the cache storage, the input/output processing time can be considerably reduced. That is to say, it is very useful for the central processing unit running the DBMS if predictable data can be prefetched and stored in the cache storage before it is accessed. As previously described, however, since the sequential access processing mode is mixed with the random access processing mode, the input/output processing time becomes long due to the read misses that occur under the conventional control method of the cache storage.
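The behavior of such a cache storage under the two access modes can be sketched as follows. This is an illustrative model only: `PrefetchingCache`, `simulate`, the prefetch unit of four records, and the two access sequences are assumptions introduced here, and the cache is left unbounded for brevity.

```python
class PrefetchingCache:
    """Cache that, on a read miss, loads a fixed run of neighbouring records."""

    def __init__(self, storage, prefetch_unit):
        self.storage = storage            # list standing in for the secondary storage
        self.prefetch_unit = prefetch_unit
        self.cache = {}                   # record number -> record (unbounded here)
        self.read_misses = 0
        self.records_transferred = 0

    def read(self, record_no):
        if record_no not in self.cache:
            self.read_misses += 1
            # Fetch the demanded record together with the records that follow it.
            end = min(record_no + self.prefetch_unit, len(self.storage))
            for n in range(record_no, end):
                self.cache[n] = self.storage[n]
            self.records_transferred += end - record_no
        return self.cache[record_no]


def simulate(accesses, prefetch_unit=4, num_records=100):
    """Replay record numbers and return (read misses, records transferred)."""
    storage = [f"rec-{i}" for i in range(num_records)]
    cache = PrefetchingCache(storage, prefetch_unit)
    for record_no in accesses:
        cache.read(record_no)
    return cache.read_misses, cache.records_transferred


sequential = list(range(16))
scattered = [0, 50, 10, 90, 30, 70, 20, 60]

print(simulate(sequential))  # (4, 16)
print(simulate(scattered))   # (8, 32)
```

In this model the 16 sequential reads cause 4 read misses and transfer only records that are later used, while the 8 scattered reads cause 8 read misses and transfer 32 records, 24 of which are never read.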
In accordance with the above-described conventional techniques, there is no clear proposal for actively scheduling the cache storage by utilizing the access characteristics predictable in the DBMS, so as to shorten the input/output time.
Also, in a system in which a plurality of users simultaneously refer to a data base, the data that is predicted to be referred to is read into the cache storage while the data transfer time period (containing the seek time period and read time period of the secondary storage) is overlapped with the time period required for the central processing unit (CPU) to perform data base processing on the data, whereby the input/output processing time can be substantially shortened. However, a high-speed response to a query from a user is also demanded. In addition, the input/output processing times for the prefetching process, which is predicted from the access characteristics of the internal processing procedure formed from the user's query, differ greatly from one procedure to another. The conventional method, in which data is read in a predetermined prefetching record unit, therefore has the problem that unnecessary data is prefetched uselessly while necessary data is not prefetched. Furthermore, the throughput of the entire system cannot be guaranteed unless the number of prefetched records, or the number of pages, used as the unit of an input/output processing demand is determined in accordance with system characteristics such as the size of the cache storage, the multiplicity, the CPU performance, and the traffic of each input/output processing device.
On the other hand, in a DBMS, there has been proposed a buffer management system based on QLSM (Query Locality Set Model), in which buffers are divided and managed for each data base query. In this buffer management system, a buffer group of a proper size is allocated from all of the buffers to each data base query, a replacement algorithm suitable for the data base reference pattern of that query is determined, and the input/output operations of the data are managed for each of the divided buffer groups, i.e., locality sets. If a data page in the locality set of a certain data base query is to be referred to by another data base query, this page is transferred to the locality set of that other query. Accordingly, there is a drawback that, in this case, the transfer of the buffer which has stored the data page in question disrupts the replacement algorithm of the locality set of the original data base query.
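The page transfer between locality sets described above can be modeled with a short sketch. It is a simplified illustration, not the published DBMIN algorithm: `LocalitySet` and `BufferManager` are hypothetical names, every locality set uses LRU replacement purely for brevity (QLSM selects the policy per reference pattern), and a page is assumed to reside in at most one locality set at a time.

```python
from collections import OrderedDict


class LocalitySet:
    """Buffer group allocated to one query (LRU order, for illustration only)."""

    def __init__(self, size):
        self.size = size
        self.pages = OrderedDict()  # page id -> True, least recent first


class BufferManager:
    def __init__(self):
        self.sets = {}       # query id -> LocalitySet
        self.owner = {}      # page id -> query id whose locality set holds it
        self.disk_reads = 0
        self.transfers = 0

    def open_query(self, query_id, size):
        self.sets[query_id] = LocalitySet(size)

    def reference(self, query_id, page_id):
        target = self.sets[query_id]
        holder = self.owner.get(page_id)
        if holder == query_id:
            target.pages.move_to_end(page_id)
            return "hit"
        if holder is not None:
            # The page sits in another query's locality set: move it over.
            self.sets[holder].pages.pop(page_id)
            self.transfers += 1
            outcome = "transferred"
        else:
            self.disk_reads += 1
            outcome = "read"
        if len(target.pages) >= target.size:
            victim, _ = target.pages.popitem(last=False)
            del self.owner[victim]
        target.pages[page_id] = True
        self.owner[page_id] = query_id
        return outcome


mgr = BufferManager()
mgr.open_query("Q1", 2)
mgr.open_query("Q2", 2)
outcomes = [
    mgr.reference("Q1", 10),
    mgr.reference("Q1", 11),
    mgr.reference("Q2", 10),  # page 10 leaves Q1's locality set
    mgr.reference("Q1", 10),  # and has to be moved back again
]
print(outcomes, mgr.transfers, mgr.disk_reads)
```

In this run page 10 is read from the secondary storage once but is moved between the two locality sets twice, and each move changes the contents of a locality set independently of that set's own replacement decisions.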
In the above-described conventional techniques, there are problems that, when the same data page is referred to by a plurality of data base queries, the processing time required for transferring the data between the buffers of the respective locality sets is increased; that the time for searching the buffers of other locality sets is increased when the data is not present in a query's own locality set; and that the system cannot rely upon the replacement algorithm of the respective locality sets.