The invention relates to computer systems, and more particularly to a method and mechanism for reducing disk input/output operations (IOs) of a computing system by coalescing writes.
Storing and retrieving large amounts of data form some of the most important functions of today's computers. Database systems, such as an online transaction processing (“OLTP”) system, are examples of computing applications that retrieve and store large quantities of data in a computing system. Database systems have performed these functions very successfully, creating the ability to retrieve data at speeds and quantities previously unimagined, and bestowing an unprecedented level of access to information. The success of such systems has unleashed a demand for even faster and more efficient systems that process even greater quantities of data.
Many large-scale computing systems perform immense numbers of input/output operations (IOs), such as reads and writes, on the systems' storage disks. It is well-known that much of the time spent by an application in performing an IO operation is in moving a disk head to an appropriate location (i.e., the location of a data block) on a disk of the system. The time spent in moving the disk head to a particular data block location on the disk is referred to as seek time. Conventionally, IOs are executed as single-block reads or writes from/to the disks of the system. Since many computing systems, e.g., database systems, constantly perform large numbers of read and write IOs on the disks, these systems accordingly spend a significant portion of their execution time in moving the disk heads to appropriate locations on the corresponding disks.
Memory buffers are often used to cache data to improve a system's data access performance. However, if the data within the memory buffer is modified, those modifications must be reconciled at some point with the corresponding data persistently stored in the system storage disks. Several approaches can be taken with respect to the timing of reconciling these changes between the memory buffer and the system disks. One strategy is to flush the “dirty” or modified write buffers to disk after each and every modification within the buffer cache. This type of approach adopts a “no-wait” policy that causes changes in the write buffer caches to be immediately made to the disks. This “no-wait” approach, however, is generally quite inefficient because of the high frequency of small IOs that must be performed.
As a result, some database/computer systems that handle large amounts of data instead adopt a “delay-write” approach which accumulates several data changes in the write buffer caches before flushing to disk. Several approaches can be implemented for the delay-write strategy. In one approach, the timing of the delayed writes is planned to institute checkpointing for the system memory/storage system. In another approach, an aging policy can be established to perform the delayed writes of the buffer cache.
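The difference in IO count between the two policies can be illustrated with a minimal sketch. The function names and the `flush_every` parameter are hypothetical and are not taken from any particular system; `flush_every` merely stands in for a checkpoint interval, and each dirty block is assumed to cost one write IO when flushed.

```python
def no_wait_ios(modified_blocks):
    """No-wait policy: every modification is flushed at once, one IO each."""
    return len(modified_blocks)


def delay_write_ios(modified_blocks, flush_every):
    """Delay-write policy: accumulate dirty blocks, flush them periodically.

    flush_every is a hypothetical checkpoint interval, measured in
    modifications. Repeated changes to one block between two flushes
    cost a single IO.
    """
    dirty = set()
    ios = 0
    for count, block in enumerate(modified_blocks, start=1):
        dirty.add(block)
        if count % flush_every == 0:
            ios += len(dirty)  # one IO per distinct dirty block
            dirty.clear()
    ios += len(dirty)  # final flush of whatever is still dirty
    return ios


modifications = [5, 5, 7, 5, 7, 9]  # block addresses, in order of change
print(no_wait_ios(modifications))
print(delay_write_ios(modifications, flush_every=6))
```

For this sequence the no-wait policy issues six write IOs, while a single delayed flush issues three, one per distinct dirty block. Each of those writes may still require its own disk head movement.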
A system that adopts the “delay-write” approach nonetheless still faces the problem of spending too much of its execution time in moving the disk heads to appropriate locations on the disks of the database system. Several approaches have been used to tackle this problem in order to sustain a high IO throughput for the system. One approach is to include a large number of disks in the system and then store data across these disks. Storing data in different disks allows many IOs to be executed in parallel. As a result, the IO throughput and the peak performance of the system will be increased. Use of such a large number of disks, however, adds to the cost of the system.
Another approach is to create a log-structured file system. This approach requires that all writes be appended at the end of a log file. Accordingly, this approach tends to store the data on the same disk of the system. Executing a write IO in a database system with a log-structured file system therefore stores data on the disk with minimal disk head movement. However, the log-structured file system has many disadvantages. For example, it is very complicated to implement a log-structured file system in a database, and the costs of designing and implementing such a complex database are greatly increased. Moreover, a log-structured file system generally penalizes the performance of the disk reads of the system. Read performance is penalized because, in a database system with a log-structured file system, the locations of data blocks on the disks are not fixed. As a result, additional lookup operations are required to determine the exact locations of the data blocks.
Accordingly, the present invention provides an improved method, mechanism, and system for reducing disk head movements of a computer system, such as a database system, when performing input and/or output operations (IOs). In one embodiment, data blocks in a buffer cache to be written into a disk of a computer system, such as a database system, are coalesced so that they can be written into the disk by a single write IO. When a write command is issued by the system to write an identified data block from the buffer cache to disk, the system will search the buffer cache for additional dirty data blocks that have addresses adjacent to the identified data block. The search space is the entire buffer cache. The identified data block and its adjacent data blocks which have been found are coalesced and written with the same IO operation. As a result, the IO throughput and performance of the computer system can be improved.
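The search-and-coalesce step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the buffer cache is modeled as a dictionary keyed by block address, adjacency is taken to mean consecutive block addresses, and the names `Buffer`, `FakeDisk`, `find_coalesced_run`, and `write_coalesced` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical cached block; the cache is keyed by block address."""
    data: bytes
    dirty: bool


class FakeDisk:
    """Hypothetical stand-in for a disk that records each write IO."""

    def __init__(self):
        self.writes = []  # list of (start_address, payload) tuples

    def write(self, start_address, payload):
        self.writes.append((start_address, payload))


def find_coalesced_run(cache, target):
    """Return the addresses of the contiguous dirty run containing target."""
    if target not in cache or not cache[target].dirty:
        raise ValueError("target block is not a dirty block in the cache")
    low = high = target
    # Extend downward while the preceding block is cached and dirty.
    while low - 1 in cache and cache[low - 1].dirty:
        low -= 1
    # Extend upward while the following block is cached and dirty.
    while high + 1 in cache and cache[high + 1].dirty:
        high += 1
    return list(range(low, high + 1))


def write_coalesced(cache, target, disk):
    """Write target and its adjacent dirty blocks with a single write IO."""
    run = find_coalesced_run(cache, target)
    payload = b"".join(cache[address].data for address in run)
    disk.write(run[0], payload)  # one IO covers the whole run
    for address in run:
        cache[address].dirty = False
    return run


cache = {
    10: Buffer(b"A", True),
    11: Buffer(b"B", True),
    12: Buffer(b"C", True),
    13: Buffer(b"D", False),  # clean block ends the run
    14: Buffer(b"E", True),   # dirty, but not adjacent to the run
}
disk = FakeDisk()
print(write_coalesced(cache, 11, disk), len(disk.writes))
```

In this example a write request for block 11 also picks up blocks 10 and 12, so three dirty blocks reach the disk in one IO. Block 14 stays dirty because the clean block 13 separates it from the run. Keying the cache by block address lets the search cover the entire buffer cache with a constant-time lookup per neighboring block; that data structure is a choice made for this sketch.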
The write coalescing process in one embodiment is performed based on physical locations of the data blocks to be stored in the disk of the computer system. Writing the coalesced data blocks into the disk with a single write IO reduces the disk head movements of the computer system. Fewer disk head movements also mean that fewer disks are required for the computer system while maintaining a high IO throughput and high peak performance. This approach therefore provides a simple and effective method that improves the IO throughput and performance for the computer system without requiring any changes to disk space and storage management of the computer system.
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.