1. Technical Field
The present invention relates in general to data processing and, in particular, to data prefetching in a data processing system. Still more particularly, the present invention relates to a data processing system, cache, and method of operation in which an O state for memory-consistent cache lines of unknown coherency is utilized to support data prefetching.
2. Description of the Related Art
A conventional multiprocessor data processing system may comprise a system bus to which a system memory and a number of processing units that each include a processor and one or more levels of cache memory are coupled. To obtain valid execution results in such a multiprocessor data processing system, a single view of the contents of memory must be provided to all of the processors by maintaining a coherent memory hierarchy.
A coherent memory hierarchy is maintained through the implementation of a selected coherency protocol, such as the conventional MESI protocol. According to the MESI protocol, an indication of a coherency state is stored in association with each coherency granule (e.g., cache line or sector) of at least all upper level (i.e., cache) memories. Each coherency granule can have one of four states, modified (M), exclusive (E), shared (S), or invalid (I), which is typically indicated by two bits in the cache directory.
The modified state indicates that a coherency granule is valid only in the cache storing the modified coherency granule and that the value of the modified coherency granule has not been written to (i.e., is inconsistent with) system memory. When a coherency granule is indicated as exclusive, the coherency granule is resident in, of all caches at that level of the memory hierarchy, only the cache having the coherency granule in the exclusive state. The data in the exclusive state is consistent with system memory, however. If a coherency granule is marked as shared in a cache directory, the coherency granule is resident in the associated cache and in at least one other cache at the same level of the memory hierarchy, all of the copies of the coherency granule being consistent with system memory. Finally, the invalid state generally indicates that the data and address tag associated with a coherency granule are both invalid.
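The four conventional MESI states and the properties just described can be summarized in a minimal sketch. The two-bit encodings and the helper names (`is_valid`, `consistent_with_memory`, `only_cached_copy`) are illustrative assumptions; the description above states only that the state is typically indicated by two bits.

```python
from enum import Enum


class MesiState(Enum):
    """The four conventional MESI states (bit encodings are assumed)."""
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


def is_valid(state: MesiState) -> bool:
    # Every state except I denotes usable data in the cache.
    return state is not MesiState.INVALID


def consistent_with_memory(state: MesiState) -> bool:
    # Only E and S copies match system memory; an M copy is newer than memory.
    # An invalid entry holds no usable data, so it is reported as False here.
    return state in (MesiState.EXCLUSIVE, MesiState.SHARED)


def only_cached_copy(state: MesiState) -> bool:
    # M and E imply that no other cache at the same level holds the granule.
    return state in (MesiState.MODIFIED, MesiState.EXCLUSIVE)
```

For example, under these assumptions `consistent_with_memory(MesiState.MODIFIED)` is false, whereas `only_cached_copy(MesiState.MODIFIED)` is true.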
The state to which each coherency granule is set can depend upon the previous state of the coherency granule, the type of memory access sought by processors to the associated memory address, and the state of the coherency granule in other caches. Accordingly, maintaining cache coherency in the multiprocessor data processing system requires that processors communicate messages across the system bus indicating an intention to read or write memory locations. For example, when a processing unit requires data not resident in its cache(s), the processing unit issues a read request on the system bus specifying a particular memory address. The read request is interpreted by its recipients as a request for only a single coherency granule in the lowest level cache in the processing unit. The requested cache line is then provided to the requestor by a recipient determined by the coherency protocol, and the requestor typically caches the data in one of the valid states (i.e., M, E, or S) because of the probability that the cache line will again be accessed shortly.
The present invention recognizes that the conventional read request/response scenario for a multiprocessor data processing system outlined above is subject to a number of inefficiencies. First, given the large communication latency associated with accesses to lower levels of the memory hierarchy (particularly to system memory) in state-of-the-art systems and the statistical likelihood that data adjacent to a requested cache line in lower level cache or system memory will subsequently be requested, it is inefficient to supply only the requested coherency granule in response to a request.
Second, a significant component of the overall access latency to system memory is the internal memory latency attributable to decoding the request address and activating the appropriate word and bit lines to read out the requested cache line. In addition, it is typically the case that the requested coherency granule is only a subset of a larger data set that must be accessed at a lower level cache or system memory in order to source the requested coherency granule. Thus, when system memory receives multiple sequential requests for adjacent cache lines, the internal memory latency is unnecessarily multiplied, since multiple adjacent cache lines of data could be sourced in response to a single request at approximately the same internal memory latency as a single cache line.
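The latency argument above can be illustrated with a simple model. This is a sketch under stated assumptions, not a measurement: the function names, the split of access time into an internal memory latency plus a per-line transfer latency, and the cycle counts in the usage note are all hypothetical.

```python
def sequential_latency(num_lines: int, internal_latency: int,
                       transfer_latency: int) -> int:
    # Each adjacent line is fetched by its own request, so the internal
    # memory latency (address decode, word/bit line activation) is paid
    # once per line.
    return num_lines * (internal_latency + transfer_latency)


def batched_latency(num_lines: int, internal_latency: int,
                    transfer_latency: int) -> int:
    # Assumption: all adjacent lines are read out in one internal access,
    # so the internal memory latency is paid once and only the transfer
    # cost scales with the number of lines.
    return internal_latency + num_lines * transfer_latency
```

With hypothetical values of 4 adjacent cache lines, 50 cycles of internal memory latency, and 10 cycles of transfer per line, the sequential model gives 240 cycles and the batched model gives 90 cycles.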
In view of the above and other shortcomings in the art recognized by the present invention, the present invention introduces an O cache consistency state that permits unrequested memory-consistent and possibly non-coherent data to be stored in a cache, thereby reducing a processor's access latency to memory-consistent data.
A data processing system in accordance with the present invention includes an interconnect, a system memory and a number of snoopers coupled to the interconnect, and response logic. In response to a requesting snooper issuing a data request on the interconnect specifying a memory address, the snoopers provide snoop responses. The response logic compiles the snoop responses to obtain a combined response including an indication of a demand-source snooper that will source requested data associated with the memory address to the requesting snooper and an indication of whether additional non-requested data will be supplied to the requesting snooper. This combined response is then transmitted to the snoopers on the interconnect in order to direct the provision of the requested data, and possibly unrequested prefetch data, to the requesting snooper.
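The compilation of snoop responses into a combined response may be sketched as follows. This is a minimal illustration only: the `SnoopResponse` and `CombinedResponse` types, the `priority` field, and the rule that the prefetch indication follows the winning demand source are hypothetical simplifications and are not taken from the description above, which specifies only that the combined response indicates a demand-source snooper and whether additional non-requested data will be supplied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnoopResponse:
    snooper_id: str            # hypothetical identifier of the responding snooper
    can_source_demand: bool    # snooper is able to supply the requested data
    can_source_prefetch: bool  # snooper is able to supply adjacent, unrequested data
    priority: int              # hypothetical ranking; the higher value wins


@dataclass(frozen=True)
class CombinedResponse:
    demand_source: Optional[str]  # snooper selected to source the requested data
    prefetch_supplied: bool       # whether non-requested data will also be supplied


def combine(responses: List[SnoopResponse]) -> CombinedResponse:
    # Consider only snoopers that can supply the requested data.
    candidates = [r for r in responses if r.can_source_demand]
    if not candidates:
        return CombinedResponse(None, False)
    # Assumed policy: the highest-priority candidate becomes the demand source,
    # and prefetch data is indicated only if that same snooper offers it.
    winner = max(candidates, key=lambda r: r.priority)
    return CombinedResponse(winner.snooper_id, winner.can_source_prefetch)
```

In this sketch, if system memory (priority 0, offering prefetch data) and a cache (priority 1, offering none) both respond, the combined response names the cache as demand source and indicates no prefetch data.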
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 depicts an illustrative embodiment of a first multiprocessor data processing system with which the present invention may advantageously be utilized;
FIG. 2 is a high level block diagram of a cache in accordance with the present invention;
FIG. 3 is a state transition table summarizing cache state transitions, snoop responses, and combined responses for various transactions on the system interconnect of the data processing system shown in FIG. 1; and
FIG. 4 is a block diagram depicting an illustrative embodiment of a second data processing system in accordance with the present invention, which has a hierarchical interconnect structure.