1. Field of the Invention
The present invention relates to microprocessors, and more particularly to a microprocessor apparatus and method for enabling variable width data transfers, which address the problem of slow writes to memory when only sparse portions of a contiguous write-combined memory space have been modified.
2. Description of the Related Art
Write combines and non-temporal stores are not kept in the microprocessor but instead are written out to the memory bus. In a present day quad-pumped bus, such as that employed by most x86-compatible microprocessors, data transfers to memory are performed either on a cache line basis (i.e., eight quadwords for a 64-byte cache line) or on an individual quadword basis. Four quadwords are transferred during each cycle of the bus clock, thus accounting for the descriptor “quad-pumped,” so transferring an entire cache line requires two clock cycles for the eight associated quadwords. During this type of transfer, the entire 64 bytes are written to the bus; there is no mechanism to write only part of a cache line to memory. If only part of a cache line is to be written to memory, then the other type of data transfer must be employed. This type transfers an individual quadword and, as part of the bus protocol, sets byte enable signals to indicate the specific bytes within the transferred quadword that are to be written to memory. Individual quadword transfers take one bus clock cycle. In this manner, the state of the art allows either for 64 contiguous bytes to be written to memory in two clock cycles or for a single quadword to be written in a single clock cycle.
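By way of illustration only, the two conventional transfer types described above can be modeled in a few lines of Python. This is a minimal sketch and not part of any actual bus specification: the function and constant names are hypothetical, the byte enable mask is assumed to carry one bit per byte with bit 0 corresponding to the lowest-addressed byte, and only data-phase bus clocks are counted (address and arbitration phases are ignored).

```python
# Illustrative model of the two conventional write transfer types.
# All names are hypothetical; only data-phase bus clocks are counted.

CACHE_LINE_BYTES = 64
QUADWORD_BYTES = 8
QUADWORDS_PER_LINE = CACHE_LINE_BYTES // QUADWORD_BYTES  # eight quadwords
QUADWORDS_PER_BUS_CLOCK = 4  # "quad-pumped": four quadwords per bus clock


def full_line_write_clocks():
    """Bus clocks needed to write an entire 64-byte cache line."""
    return QUADWORDS_PER_LINE // QUADWORDS_PER_BUS_CLOCK


def quadword_write(byte_offsets):
    """Model a single-quadword write.

    byte_offsets: offsets (0..7) of the bytes within the quadword
    that are to be written to memory.
    Returns (byte_enable_mask, bus_clocks). The mask layout (bit i
    enables byte i) is an assumption made for this sketch.
    """
    mask = 0
    for offset in byte_offsets:
        if not 0 <= offset < QUADWORD_BYTES:
            raise ValueError("byte offset outside the quadword")
        mask |= 1 << offset
    return mask, 1  # an individual quadword transfer takes one bus clock


if __name__ == "__main__":
    print(full_line_write_clocks())      # full cache line write
    print(quadword_write([0, 1, 2, 3]))  # low doubleword of a quadword
```

The sketch makes the limitation explicit: byte-level selection is available only inside a single quadword transfer, whereas the full cache line transfer carries no enables at all.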
In reviewing present day microprocessor bus architectures and associated protocols, in conjunction with observations concerning how contiguous memory spaces are manipulated by application programs, the present inventor has noted that the bus protocols associated with writes of data to the memory bus, as alluded to above, are disadvantageous when sparse data within a contiguous memory space has been modified and is to be written to the bus. For example, it is common to modify checkerboard portions (e.g., every other double quadword, every other quadword, every other doubleword, etc.) within a video buffer to change some aspect of a display. Conventional microprocessors, however, do not provide a mechanism for selecting data that is to be written to memory at any granularity other than byte granularity on a quadword-by-quadword basis. Consequently, a sparse write to contiguous memory must be carried out as a series of individual quadword transfers, one for each quadword that contains modified data.
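The cost of this limitation can be estimated with a short calculation. The following Python sketch is illustrative only and rests on stated assumptions: the names are hypothetical, a fully modified line is assumed to use the two-clock cache line transfer, any partially modified line is assumed to fall back to one single-clock transfer per quadword that contains a modified byte, and bus overhead outside the data phase is ignored.

```python
# Illustrative estimate of data-phase bus clocks for a sparse write
# under the conventional protocol. All names are hypothetical.

CACHE_LINE_BYTES = 64
QUADWORD_BYTES = 8


def checkerboard(width_bytes):
    """Byte offsets modified when every other element of the given
    width (in bytes) within one cache line is changed."""
    return {b for b in range(CACHE_LINE_BYTES)
            if (b // width_bytes) % 2 == 0}


def conventional_write_clocks(modified_bytes):
    """Bus clocks to write the modified bytes of one cache line.

    A fully modified line uses the two-clock cache line transfer;
    otherwise each quadword holding a modified byte needs its own
    one-clock transfer with byte enables.
    """
    if len(modified_bytes) == CACHE_LINE_BYTES:
        return 2
    touched_quadwords = {b // QUADWORD_BYTES for b in modified_bytes}
    return len(touched_quadwords)


if __name__ == "__main__":
    for name, width in (("double quadword", 16),
                        ("quadword", 8),
                        ("doubleword", 4)):
        clocks = conventional_write_clocks(checkerboard(width))
        print(f"every other {name}: {clocks} bus clocks")
    full = conventional_write_clocks(set(range(CACHE_LINE_BYTES)))
    print(f"entire cache line: {full} bus clocks")
```

Under these assumptions, writing half of a cache line in a checkerboard pattern takes four or eight bus clocks, compared with two bus clocks for writing the entire line.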
Because the amount of data associated with combined writes (e.g., write combines, non-temporal stores) is typically large, it is disadvantageous to underutilize the bandwidth of a data bus, whether that bus is quad-pumped or otherwise. Since data buses typically operate at clock speeds many times slower than microprocessor core clocks, it is crucial to execute combined writes to memory with optimum efficiency. It is therefore desirable to be able to write an entire cache line to memory where individual elements within that cache line can be enabled with variable width granularity.