1. Technical Field
This invention relates to distributed data processing systems, and more particularly to methods for sorting data distributed across multiple processors in such a system.
2. Description of the Prior Art
A database management system is a computer system for recording and maintaining data. A relational database management system stores data as rows in tables, and refers to a collection of tables as a database. Users or application programs manipulate (select, update, insert, or delete) the data by issuing requests or commands (called queries) against the database. In a relational database management system's data query and manipulation language, such as SQL, queries are nonprocedural (also referred to as nonnavigational). That is, users or application programs simply specify what is wanted, rather than specifying a procedure for accomplishing it.
Most relational database systems make heavy use of sorting as a technique for satisfying complex queries. Since the user communicates with the system only via the query language, the system can choose any internal data storage mechanism it wishes. In particular it can partition the data, which is why this invention finds its most important applications in the relational system arena.
In most cases today, records to be sorted are stored in a single file on a single system, but this is changing. To increase the amount of data that can be read per second, a technique that is gaining popularity, and that will become increasingly important, is to distribute data among several subfiles. These subfiles can then be stored on different disks, and different processors can read and process them in parallel. This technique of distributing data plays an increasingly important role in relational database management systems.
As background to the discussion which follows, we define certain key terms and concepts. A "stream" is a sequence of records or values that is the output of a producer task and the input of a consumer task, as follows:
producer --stream--> consumer
Behind the notion of the stream, there is a mechanism of "blocking". The producer produces several records or values and assembles them into blocks of data. For example, 500 records of 80 bytes might be blocked into ten 4000-byte pages. Each block is sent to the consumer task as soon as it is produced. When the consumer needs to read a record or a value from the stream, it takes the next record from the block. If the block has been fully consumed, it is discarded and the next block is used if it has already been received. If the next block has not yet been received, the consumer task waits until it is received from the producer. This blocking mechanism is used both for intra-processor communication such as disk I/O (in which data is transferred between a disk and a processor), and for inter-processor communication, in which data is transferred between processors. We will therefore refer to "reading from a stream", "receiving from a stream", "writing to a stream", and "sending to a stream", it being understood that a blocking mechanism is used to create the stream.
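The blocking mechanism can be illustrated with a minimal Python sketch. The function names (produce_blocks, consume_records) are hypothetical and chosen only for illustration, and Python generators stand in for the actual transfer of blocks between tasks. The record and block sizes are those of the example above.

```python
from typing import Iterable, Iterator, List

RECORD_SIZE = 80                                # bytes per record (from the example)
BLOCK_SIZE = 4000                               # bytes per block, or page (from the example)
RECORDS_PER_BLOCK = BLOCK_SIZE // RECORD_SIZE   # 50 records fit in one block


def produce_blocks(records: Iterable[bytes],
                   records_per_block: int = RECORDS_PER_BLOCK) -> Iterator[List[bytes]]:
    """Producer side: assemble records into blocks and emit each block once it is full."""
    block: List[bytes] = []
    for record in records:
        block.append(record)
        if len(block) == records_per_block:
            yield block          # the block is "sent" as soon as it is produced
            block = []
    if block:                    # last block may be only partially filled
        yield block


def consume_records(blocks: Iterable[List[bytes]]) -> Iterator[bytes]:
    """Consumer side: take records one at a time from the current block."""
    for block in blocks:         # asking for the next block stands in for waiting on it
        for record in block:
            yield record
        # a fully consumed block is dropped here and the next one is used


# 500 records of 80 bytes each, padded with leading zeros
records = [str(i).zfill(RECORD_SIZE).encode() for i in range(500)]
blocks = list(produce_blocks(records))
received = list(consume_records(blocks))
print(len(blocks), len(received))   # 10 500
```

The consumer sees the same sequence of records that the producer wrote; the grouping into ten blocks is visible only to the transport layer.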
Another key concept is that of a "single site sort", in which a single file is sorted by a single, attached processor. In brief, a single site sort is performed as follows. First, records are read from the file, producing a stream of records which accumulate in a preallocated area in memory. When that area is full, the records that it contains are sorted and written, in order, to a working area. This process of reading, sorting and writing is repeated until the whole file has been processed, resulting in several sorted working areas on disk. These steps together constitute the "sort phase".
After the sort phase, the records are read (as streams) from the working areas. The first record from each of the working areas is compared with the other first records to find the smallest one (the one with the smallest, or first, sort key value). That record is then written to the output stream S and consumed from the working area to which it belonged, leaving the next record from that working area as the first in that area's stream. These steps of comparison, writing and reading are repeated until all of the working areas are emptied, at which time the output stream has received the entire sorted file. This process is termed the "merge phase". For the sake of simplicity, it may be assumed that all working areas can be handled in a single merge phase, as frequently occurs in practice.
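The two phases of a single site sort can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not an implementation from the prior art: integers stand in for records and their sort keys, in-memory lists stand in for the working areas on disk, and the standard library function heapq.merge performs the repeated "smallest first record" selection of the merge phase. The function names and the area_capacity parameter are hypothetical.

```python
import heapq
from typing import Iterable, List


def sort_phase(records: Iterable[int], area_capacity: int) -> List[List[int]]:
    """Fill a fixed-size memory area, sort it, and write it out as a sorted working area."""
    working_areas: List[List[int]] = []
    area: List[int] = []
    for record in records:
        area.append(record)
        if len(area) == area_capacity:          # memory area is full
            working_areas.append(sorted(area))  # sort and "write" it in order
            area = []
    if area:                                    # remaining records at end of file
        working_areas.append(sorted(area))
    return working_areas


def merge_phase(working_areas: List[List[int]]) -> List[int]:
    """Repeatedly output the smallest first record among all working areas."""
    return list(heapq.merge(*working_areas))


file_records = [42, 7, 19, 3, 88, 61, 5, 27, 14, 70]
areas = sort_phase(file_records, area_capacity=4)
output_stream = merge_phase(areas)
print(areas)          # [[3, 7, 19, 42], [5, 27, 61, 88], [14, 70]]
print(output_stream)  # [3, 5, 7, 14, 19, 27, 42, 61, 70, 88]
```

With a memory area of four records, the ten-record file yields three working areas, all of which are handled in a single merge phase.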
To provide insight into sorting performance, we use a simplified but meaningful example. Assume a processor with a speed of 5 million instructions per second (5 MIPS). Assume we want to sort 1 million (1M) records of 80 bytes each. Plausible performance figures (in seconds) are given in Table 1 for each step of the sort and merge phases. The column CPU reflects the time needed by the central processing unit (CPU) to execute the instructions. The column I/O reflects the time needed to read or write blocks of data from or to the disk. Current processing techniques allow these operations to overlap, so that the elapsed time for each step is the maximum of the CPU and I/O times.
TABLE 1
Response Time of Single Site Sort (in seconds)
______________________________________
                 CPU     I/O     Elapsed
______________________________________
read file:        20     200     200
sort:            200       -     200
write file:       20     200     200
read file:        20     200     200
merge file:      200       -     200
return file:     100       -     100
total:                           1100 seconds
______________________________________
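The total in Table 1 follows from taking, for each step, the larger of the CPU and I/O times and summing the results. The short Python sketch below reproduces that arithmetic. The names are hypothetical, and a blank I/O entry in the table is represented here as 0, which is an assumption made only so that the maximum can be computed.

```python
# (step, CPU seconds, I/O seconds) as listed in Table 1.
# Assumption: a blank I/O entry is modeled as 0.
TABLE_1 = [
    ("read file",   20, 200),
    ("sort",        200, 0),
    ("write file",  20, 200),
    ("read file",   20, 200),
    ("merge file",  200, 0),
    ("return file", 100, 0),
]


def elapsed(cpu: float, io: float) -> float:
    """CPU and I/O overlap, so a step takes as long as the slower of the two."""
    return max(cpu, io)


def total_response_time(steps) -> float:
    """Steps run one after another, so their elapsed times add up."""
    return sum(elapsed(cpu, io) for _, cpu, io in steps)


print(total_response_time(TABLE_1))   # 1100
```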
The problem of achieving acceptable response time when sorting in distributed systems has been known for some time, and several attempts to provide solutions have been made in the prior art. It is illuminating to compare three such prior art methods. In the following discussion, we will use the base values underlying the example of Table 1 (5 MIPS processors, and 1M records of 80 bytes each). We will assume that the distributed system comprises 100 processors, each storing 10,000 records on attached dedicated disks. Because we are concerned with reducing the response time for the sort operation, we do not consider the fact that idle processors could be working on other operations unrelated to the sort.
The simplest of the prior art methods consists of sending all records to a single one of the processors, and letting it do the sort. This method may be termed the "central sort and merge" method. For the sake of simplicity we will ignore the time needed to send the data over the communication network, by assuming that the communication hardware/software is efficient enough not to be a limiting factor. The response times for the steps of this method are shown in Table 2.
TABLE 2
Response Time of Central Sort and Merge (in seconds)
______________________________________
                 CPU     I/O     Elapsed
______________________________________
at each site:
read file:       0.2       2       2
central sort and merge, at the sort site:
sort:            200       -     200
write file:       20     200     200
read file:        20     200     200
merge file:      200       -     200
return file:     100       -     100
total:                           902 seconds
______________________________________
The total response time for the central sort and merge method is thus 902 sec. Almost nothing is done in parallel.
A second prior art method consists of sorting in each processor those records in its respective subfile, then sending the sorted subfiles to a single processor for merging. This method thus consists of a distributed sort phase, and a central merge phase. The response times for this method are given in Table 3.
TABLE 3
Response Time of Distributed Sort, Central Merge (in seconds)
______________________________________
                 CPU     I/O     Elapsed
______________________________________
distributed sort, at each site:
read file:       0.2       2       2
sort:              2       -       2
write file:      0.2       2       2
read file:       0.2       2       2
merge file:        2       -       2
central merge, at the merge site:
merge file:      200       -     200
return file:     100       -     100
total:                           310 seconds
______________________________________
The total response time for the distributed sort, central merge method is thus 310 sec. Again, a large portion of the work (the merge phase) is not done in parallel.
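The structure of the distributed sort, central merge method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: integers stand in for records, each inner list stands in for the subfile held by one processor, the local sorts are written as a loop although they would run in parallel at the sites, and the function name is hypothetical.

```python
import heapq
from typing import List


def distributed_sort_central_merge(subfiles: List[List[int]]) -> List[int]:
    # Distributed sort phase: each site sorts only its own subfile.
    # (Shown as a loop here; conceptually these sorts run in parallel.)
    sorted_subfiles = [sorted(subfile) for subfile in subfiles]
    # Central merge phase: a single site merges all sorted subfiles.
    # This step is serial, which is why it dominates the response time.
    return list(heapq.merge(*sorted_subfiles))


subfiles = [[9, 2, 7], [4, 8, 1], [6, 3, 5]]
result = distributed_sort_central_merge(subfiles)
print(result)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```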
The third prior art method may be termed a "distributed sort and merge, serial return" sort. In this method, the overall key range is partitioned by some technique into subranges such that a roughly equal number of records falls in each subrange. Any well-known distributed random sampling technique may be used. Each subrange r_i is assigned to one of the distributed processors in a one-to-one fashion. Each processor sends each of its records to the processor assigned to handle the subrange in which the record falls. When all records have been sent to the appropriate processor, each processor has all of the records in its assigned subrange (in fact only copies of the records are sent, since the original records stay where they were stored originally). Each processor then sorts and merges the records that it has, and sends its sorted records to the host. This return of the sorted records is done serially, with the processor assigned the lowest subrange going first, then the next lowest, and so on. The response time of the "distributed sort and merge, serial return" method is shown in Table 4.
TABLE 4
Response Time of Distributed Sort and Merge, Serial Return (in seconds)
______________________________________
                 CPU     I/O     Elapsed
______________________________________
distributed sort and merge, at each site:
read file:       0.2       2       2
sort:              2       -       2
write file:      0.2       2       2
read file:       0.2       2       2
merge file:        2       -       2
write file:      0.2       2       2
return subfiles serially from each site:
return file:     100     200     200
total:                           212 seconds
______________________________________
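The steps of the "distributed sort and merge, serial return" method can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the prior art implementation: integers stand in for records, each inner list stands in for one processor's subfile, and the subrange boundaries are estimated by pooling a small random sample from every site, which is only one possible sampling technique. All function and parameter names are hypothetical.

```python
import bisect
import random
from typing import List


def choose_splitters(subfiles: List[List[int]], sample_per_site: int,
                     rng: random.Random) -> List[int]:
    """Estimate subrange boundaries from samples drawn at every site.
    Assumes at least one record exists overall."""
    n = len(subfiles)
    pooled: List[int] = []
    for subfile in subfiles:
        pooled.extend(rng.sample(subfile, min(sample_per_site, len(subfile))))
    pooled.sort()
    # n - 1 boundaries at evenly spaced positions of the pooled sample
    return [pooled[(i * len(pooled)) // n] for i in range(1, n)]


def redistribute(subfiles: List[List[int]], splitters: List[int]) -> List[List[int]]:
    """Each processor sends a copy of every record to the owner of its subrange."""
    received: List[List[int]] = [[] for _ in subfiles]
    for subfile in subfiles:
        for key in subfile:
            owner = bisect.bisect_right(splitters, key)  # index of the subrange
            received[owner].append(key)
    return received


def sort_and_return_serially(received: List[List[int]]) -> List[int]:
    """Each processor sorts its subrange; results return lowest subrange first."""
    host: List[int] = []
    for part in received:          # serial return, in subrange order
        host.extend(sorted(part))
    return host


rng = random.Random(0)
subfiles = [[rng.randrange(1000) for _ in range(50)] for _ in range(4)]
splitters = choose_splitters(subfiles, sample_per_site=10, rng=rng)
received = redistribute(subfiles, splitters)
result = sort_and_return_serially(received)
print(result == sorted(k for s in subfiles for k in s))   # True
```

Because every key sent to a given processor lies between that processor's boundaries, concatenating the sorted subranges in order yields the fully sorted file without any further merge at the host.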
Even with this method too much work is done in a nonparallel manner, and the 100 processors are far from being used efficiently, each processor being idle for 198 of the last 200 seconds. The problems are as follows. First, the return of the results to the host is serialized, meaning that the second processor cannot begin until the first finishes. This is responsible for most of the elapsed time for the method. Second, the entire records must be sent once before sorting, and once after sorting. Together, these two transmissions place a load of 160M bytes on the communication network (2 × number of records (1M) × record length (80 bytes)). Although this communications load is not considered in the response times above, two such full transmissions of the actual data significantly affect the performance of the "distributed sort and merge, serial return" method.
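The two figures quoted above, the 160M byte network load and the 198 seconds of idle time per processor, can be checked with a few lines of Python. The variable names are hypothetical. The equal division of the 200-second return among the 100 processors is an assumption that follows from the subranges holding roughly equal numbers of records.

```python
NUM_RECORDS = 1_000_000        # 1M records
RECORD_LENGTH = 80             # bytes per record
NUM_PROCESSORS = 100
TRANSMISSIONS = 2              # once before sorting, once after sorting

# Total bytes sent over the communication network
network_load_bytes = TRANSMISSIONS * NUM_RECORDS * RECORD_LENGTH

# Serial return: the 200 seconds are shared in turn by the 100 processors
# (assumption: each processor holds an equal share of the records)
serial_return_seconds = 200
busy_seconds = serial_return_seconds / NUM_PROCESSORS
idle_seconds = serial_return_seconds - busy_seconds

print(network_load_bytes)   # 160000000
print(idle_seconds)         # 198.0
```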
As distributed database management systems become widespread, and maximum acceptable response times drop, the need for a fast method for sorting in a distributed database management system grows increasingly urgent. The present invention addresses that need.