Many database systems involve an input stream of record data, where each incoming record has to be stored on a disk or other non-volatile memory. Future access will normally be to specified disjoint portions of the incoming data stream. If all the data is stored on disk in the order received in the input stream, the records belonging to a particular subject, e.g., a particular customer, will be scattered at random locations on the disk. Retrieving the data corresponding to a particular subject then requires seeks to many different areas of the disk. Disk seeks are very slow (on the order of tens of milliseconds each), which limits the number of such operations that a disk can support per second.
To reduce the number of seeks required to retrieve all records corresponding to a particular subject, the records in the input stream are preferably stored in locations on the disk in such a way that records relating to the same subject are clustered in the same area. To achieve this result, the records in the input stream must be efficiently directed to their proper destination on disk. In the abstract, the problem involves routing each record from the input stream to one of several output streams to disk, based on some identifier stored in the record.
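The abstract routing problem can be pictured with a short sketch. It is only an illustration, not the technique of the invention: the record layout (a `(subject_id, payload)` tuple), the function name `route_records`, and the use of a modulo to pick an output stream are all assumptions made for the example.

```python
from collections import defaultdict


def route_records(records, num_streams):
    """Group records into output streams by subject identifier.

    Assumption: each record is a (subject_id, payload) tuple, and the
    destination stream is chosen as subject_id % num_streams.
    """
    streams = defaultdict(list)
    for subject_id, payload in records:
        # Records with the same identifier always land in the same stream,
        # so they can later be stored in the same area of the disk.
        streams[subject_id % num_streams].append((subject_id, payload))
    return dict(streams)


incoming = [(7, "a"), (2, "b"), (7, "c"), (5, "d"), (2, "e")]
routed = route_records(incoming, num_streams=4)
```

Here the two records for subject 7 end up together in one stream and the two records for subject 2 in another, even though they were interleaved in the input.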
Sending each record directly to its destination location on disk can be very costly, since each such write requires a seek, which greatly reduces the amount of data that the system can handle. On the other hand, writing to a large contiguous area such as a page or, better still, a sequence of pages without intervening seeks is much faster and can run at nearly the full capacity of the disk system. The purpose of the invention is to provide techniques that increase the effective amount of data that can be stored on the disk system by reducing the number of writes, and that increase the data handling capacity within the limits of the system hardware.
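The trade-off between per-record writes and page-sized writes can be illustrated with a minimal sketch. It is one simple way to picture the idea, not the technique of the invention: the class name `BufferedRouter`, a page size counted in records, and the `write_page` callback (a hypothetical stand-in for one contiguous disk write) are all assumptions.

```python
from collections import defaultdict


class BufferedRouter:
    """Accumulate records per destination and write only full pages.

    Assumptions: page_size is counted in records, and write_page is a
    hypothetical callback representing a single contiguous disk write.
    """

    def __init__(self, page_size, write_page):
        self.page_size = page_size
        self.write_page = write_page
        self.buffers = defaultdict(list)

    def add(self, destination, record):
        buf = self.buffers[destination]
        buf.append(record)
        if len(buf) >= self.page_size:
            # One write (and so one seek) covers a whole page of records.
            self.write_page(destination, buf)
            self.buffers[destination] = []

    def flush(self):
        # Write out any partially filled pages at the end of the stream.
        for destination, buf in self.buffers.items():
            if buf:
                self.write_page(destination, buf)
        self.buffers.clear()


writes = []
router = BufferedRouter(
    page_size=3,
    write_page=lambda dest, page: writes.append((dest, list(page))),
)
for i in range(10):
    router.add(i % 2, i)  # alternate between two destinations
router.flush()
```

In this run, ten incoming records result in four page writes instead of ten individual writes. The price is the memory held by the per-destination buffers and the delay before a buffered record reaches the disk.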