1. Field of the Invention
This invention relates in general to the field of memory coherency, and more particularly to a method and apparatus for managing the coherency of volatile data in a cache memory.
2. Description of the Related Art
The invention described in the Detailed Description below is directed at solving data coherency problems when managing volatile data in multiple bus master environments. However, before these problems are addressed, a brief overview of computer memory hierarchies is given. After the overview, a general explanation of the problems associated with utilizing multiple memory hierarchies, e.g., maintaining coherency, snooping and prefetching, is provided.
In modern computing systems, a number of distinct levels of memory hierarchy have been established. This is particularly illustrated with reference to FIG. 1, where a register file is shown at the top of a pyramid 100 as the smallest but fastest memory, with secondary memory at the bottom as the slowest and least expensive memory, but having the greatest storage capacity. Each of these levels provides storage for instructions or data, as needed by a processor. However, they typically differ from each other in terms of performance, size and cost. Lower cost memory provides greater storage for a given budget, but such memory often requires the processor to halt, or delay processing, until data is read from or written to the memory. Higher cost memory responds to access requests by a processor faster than lower cost memory, but typically stores much less information. Thus, when designing a computing system, it is typical to utilize a mixture of different memory hierarchies to obtain an optimum solution.
A solution chosen by many computing system designers is implemented by populating main memory with relatively slow access, inexpensive memory, such as DRAM, while also incorporating a relatively small amount of high cost, fast access primary and/or secondary cache memory, such as SRAM.
Referring now to FIG. 2, a computer system 200 is shown illustrating five notable levels of memory hierarchy, labeled Levels 0-4. The computer system 200 contains a microprocessor 202 coupled to a number of different memories 210, 212 and 214. Within the microprocessor 202 is a CPU 204, containing a register file 206, and a primary cache 208.
The register file 206 is considered the memory closest to the CPU 204 and the easiest to access by the CPU 204. It is regarded as level 0 in the overall memory hierarchy and is a part of the CPU 204. The register file 206 is typically the smallest memory within the hierarchy, but also provides the fastest response to CPU operations.
The CPU 204 is connected to the primary cache 208 (Level 1). The primary cache 208 provides the fastest access of any memory level outside of the CPU 204. In many modern microprocessors the primary cache 208 is on the same chip with the CPU 204.
Outside of the microprocessor 202 is a secondary cache 210 (Level 2). The secondary cache 210 is typically much larger than the primary cache 208, but does not provide the same access performance as the primary cache 208. In fact, in many computer systems, access to information in the secondary cache 210 requires the microprocessor 202 to delay processing until the data is written to, or retrieved from, the secondary cache 210.
Level 3 in the memory hierarchy is the main memory 212. The main memory 212 is the memory actually addressed by the CPU 204. It typically contains the code and data for currently running programs. However, it is generally of insufficient size to contain all information that may be required by users. That is why another level of memory is often needed.
Level 4 in the memory hierarchy is the secondary memory 214. The secondary memory 214 is used as the repository storage of information, and is typically associated with magnetic discs, optical discs, networks, or tapes.
When designing a computer system that includes a number of different memory hierarchies, such as those discussed above, two particular problems are of concern to a designer. The first problem deals with how to anticipate what data the processor will require in subsequent instruction cycles, and what steps are necessary to optimize the chance that the data is present in the on-chip primary cache when needed. The second problem concerns establishing a coherency mechanism that assures that required data, whether located in the on-chip primary cache or the secondary cache, is the latest "valid" data.
A number of solutions have been proposed to deal with both of the above-described problems. For anticipation, many cache systems automatically perform a "burst" read any time a piece of data is requested and that data is not already present in the cache. A burst read is a bus operation that transfers multiple bytes of data into a cache in one operation, in less time than multiple individual read requests. More specifically, when retrieving data from a main memory or secondary cache, a number of bus cycles are typically required to specify the address of the data to be retrieved, and to set up the secondary cache or main memory to deliver the data. This overhead is typically required for every read operation. After the overhead, the secondary cache or main memory delivers the data. Since the overhead time is costly for each data retrieval operation, burst reads are used. A burst read can be characterized as a single read transaction that causes multiple back-to-back data transfers to occur.
A burst read is based on the principle of spatial locality, which states that programs and the data they request tend to reside in consecutive memory locations. This means that programs are likely to need code or data that are close or adjacent to locations already in use. So, if a processor attempts to read data, and the data is not present in the primary cache, the primary cache initiates a burst read operation. The burst read retrieves the requested data into the cache, along with adjacent bytes of data, for example. Thus, in subsequent operations, if accesses are made to data locations adjacent to the original request, they may be provided directly by the primary cache, without having to incur the overhead associated with individual read requests.
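The behavior described above can be illustrated with a toy cache model (my own simplification, not the apparatus of the invention; the `ToyCache` name and 16-byte line size are assumptions chosen for illustration). A miss triggers a burst fill of an entire line, so subsequent reads of adjacent addresses hit without further bus overhead:

```python
LINE_SIZE = 16  # bytes fetched per burst read; an illustrative line size

class ToyCache:
    def __init__(self):
        self.lines = {}    # line tag -> cached copy of that line
        self.hits = 0
        self.misses = 0

    def read(self, memory, addr):
        tag = addr // LINE_SIZE
        if tag in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            base = tag * LINE_SIZE
            # a miss triggers a burst read: one transaction fills the whole line
            self.lines[tag] = memory[base:base + LINE_SIZE]
        return self.lines[tag][addr % LINE_SIZE]

memory = bytes(range(256))
cache = ToyCache()
data = [cache.read(memory, a) for a in range(32)]   # 32 sequential reads
# two misses (one per line) fill both lines; the remaining 30 reads hit
```

In this sketch, spatial locality turns 32 individual reads into only two bus transactions, which is precisely the overhead savings the burst read is meant to capture.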
A second improvement in anticipation is the MOVE MULTIPLE processor instruction. This instruction has benefits similar to the burst read, and directs the processor to retrieve desired data and place the data within the processor's register file. Unfortunately, for most processors, use of such an instruction ties up valuable register space that could otherwise be allocated to other data.
Another improvement in anticipation is the PREFETCH, or TOUCH, processor instruction. This instruction allows a programmer to direct the cache to retrieve a particular stream of data before it is needed by the processor. The instruction may be used at the beginning of a program loop, for example, to retrieve data that will be required during execution of the next loop iteration. If the data is not already in the cache, the cache system retrieves the specified data, and places it into the on-chip cache, while the processor is executing the first pass of the loop. Then, when the rest of the data in the stream is needed, it is already available in the cache. This instruction relies on the spatial locality principle, discussed above, and is typically performed using burst read operations.
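The PREFETCH/TOUCH pattern described above can be sketched with a toy model (an illustration under my own assumptions; `ToyCache`, `prefetch`, and the loop structure are hypothetical, not any real instruction set). The key idea is that a software-directed fill overlaps with computation, so the later demand reads never stall:

```python
LINE = 16  # illustrative burst-line size

class ToyCache:
    def __init__(self, memory):
        self.memory = memory
        self.resident = set()    # tags of lines currently in the cache
        self.demand_misses = 0   # misses on which the processor would stall

    def prefetch(self, addr):
        # software-directed fill; it overlaps with ongoing computation,
        # so it is not counted as a processor stall
        self.resident.add(addr // LINE)

    def read(self, addr):
        tag = addr // LINE
        if tag not in self.resident:
            self.demand_misses += 1   # the processor would halt here
            self.resident.add(tag)
        return self.memory[addr]

memory = bytes(range(128))
cache = ToyCache(memory)
total = 0
cache.prefetch(0)                     # touch the first line before the loop
for base in range(0, 128, LINE):
    if base + LINE < 128:
        cache.prefetch(base + LINE)   # fetch next iteration's data early
    for addr in range(base, base + LINE):
        total += cache.read(addr)
# every line was resident before use, so no demand misses occurred
```

Because each iteration touches the next iteration's line before it is needed, the model records zero demand misses, mirroring how a PREFETCH at the top of a loop hides the burst-read latency of the following pass.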
To deal with the problem of coherency, a number of hardware, software and hardware/software methodologies have been used. In systems where multiple devices can modify the contents of main memory, such as a processor and a direct-memory-access (DMA) controller, a methodology is necessary to assure that changes to the main memory will either be noticed by the cache, or to assure that data requests to the cache will obtain the latest data within the main memory.
In some processors, "snoop" hardware is provided within the cache system to monitor the memory bus and to flag or invalidate cache contents any time another device writes to an area of main memory whose data is also held in the cache. However, snoop hardware is costly to implement, dictates particular and often complex system interfaces, and thus is not desirable in all processors.
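What the snoop logic does can be modeled in a few lines (a deliberately simplified sketch; `SnoopingCache` and `snoop_write` are illustrative names, not a description of any particular processor's hardware). The bus monitor invalidates any cached line covering a written address, so the next read misses and refetches fresh data:

```python
LINE = 16  # illustrative line size

class SnoopingCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}    # tag -> cached copy of the line

    def read(self, addr):
        tag = addr // LINE
        if tag not in self.lines:
            base = tag * LINE
            self.lines[tag] = bytes(self.memory[base:base + LINE])
        return self.lines[tag][addr % LINE]

    def snoop_write(self, addr):
        # called whenever the bus monitor sees another master write memory;
        # the affected line is invalidated so the next read refetches it
        self.lines.pop(addr // LINE, None)

memory = bytearray(range(64))
cache = SnoopingCache(memory)
old = cache.read(5)       # line 0 is now cached
memory[5] = 99            # another bus master writes main memory
cache.snoop_write(5)      # snoop logic invalidates the stale line
new = cache.read(5)       # miss -> refetch; the cache now sees 99
```

The cost noted in the text is that, in real hardware, `snoop_write` must be triggered by dedicated bus-monitoring circuitry on every external write, which is what makes the approach expensive and interface-dependent.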
Another method that deals with coherency treats as non-cacheable all data that may be changed in main memory by devices other than the processor. This method prevents the processor from retrieving "old" or "stale" data from the cache when newer data may reside in main memory. However, as should be apparent, it forces the processor to utilize slower main memory for all data accesses within the non-cacheable area. The speed advantages of a cache for such data are thereby rendered moot. In addition, advantages obtained through burst operations are also lost.
A software method that deals with coherency utilizes a specific software instruction to flush particular contents of the cache immediately before requesting data from an area which the programmer believes has been modified by another master device. This causes the processor to invalidate the cache contents and deliver any modified data to the main memory. Then, the more recent data is requested from the main memory to update the cache. Once the cache has been updated, the processor utilizes the cache for data processing. However, to perform the flush operation, data in the cache that has not yet been written back into main memory must first be written to main memory, then the requested data is retrieved. During this operation, the processor is halted, delaying it from dealing with its present task. It should be appreciated that, similar to the burst read operation discussed above, significant overhead is required to first flush the cache, then perform a burst read operation into the cache. Moreover, additional processor time is required to execute the flush instruction and cause the burst read operation.
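The flush-before-read sequence and its overhead can be made concrete with a toy write-back cache (my own illustration; `FlushableCache` and the bus-operation counter are assumptions, not the patent's mechanism). Note how three separate bus transactions occur before the processor can use the data again:

```python
LINE = 16  # illustrative line size

class FlushableCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # tag -> [line data, dirty flag]
        self.bus_ops = 0    # each bus transaction represents stall time

    def read(self, addr):
        tag = addr // LINE
        if tag not in self.lines:
            self.bus_ops += 1        # burst read to fill the line
            base = tag * LINE
            self.lines[tag] = [list(self.memory[base:base + LINE]), False]
        return self.lines[tag][0][addr % LINE]

    def write(self, addr, value):
        self.read(addr)              # ensure the line is present
        tag = addr // LINE
        self.lines[tag][0][addr % LINE] = value
        self.lines[tag][1] = True    # line is now dirty

    def flush(self, addr):
        tag = addr // LINE
        if tag in self.lines:
            data, dirty = self.lines.pop(tag)
            if dirty:
                self.bus_ops += 1    # write-back of the modified line
                base = tag * LINE
                self.memory[base:base + LINE] = bytes(data)

memory = bytearray(range(64))
cache = FlushableCache(memory)
cache.write(3, 42)     # bus op 1: fill the line; line becomes dirty
cache.flush(3)         # bus op 2: write modified data back to main memory
value = cache.read(3)  # bus op 3: burst read refills the line
# three bus transactions before the "fresh" data is usable again
```

This matches the overhead described in the text: dirty data must first be written back, the line invalidated, and only then can the burst read bring the desired data into the cache, all while the processor is delayed.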
In the first method, the programmer foregoes the advantages of using a high speed cache for those areas of memory that may be modified by devices other than the processor. In the second method, any delays associated with flushing the cache are incurred prior to retrieving the desired data. When using the above-mentioned PREFETCH instruction within a multiple bus master system, it is common practice to explicitly flush the desired cache line prior to performing the prefetch. This ensures that the program will take advantage of the latest data, and still provides the benefits of using a cache memory system. However, flushing the cache prior to performing the prefetch is time consuming, complicates the software, and adds additional processing delays.
With the above understanding of memory hierarchies, and the associated problems in using multiple memory hierarchies within a computing system, a brief example is provided that illustrates how these problems affect performance in a computing system. Referring to FIG. 3, a block diagram is shown which illustrates a multiple-bus-master computer system 300. Within the system 300 is a processor 302 connected to a main memory system 306 via a host bus 308. Inside the processor 302 is a primary cache 304. Also connected to the main memory 306 are a plurality of bus master devices 310, 312 and 314. For purposes of illustration, it is presumed that the processor 302 and all of the bus master devices 310-314 can modify data within the main system memory 306.
Now, if it is assumed that the processor 302 requires data at physical addresses B000FFF0h-B000FFFFh, it will attempt to retrieve the data from the primary cache 304. If the data is not stored within the cache 304, then the processor 302 will request access to the bus 308, and will initiate a burst read from the main system memory 306. The main system memory 306 will provide the data to the processor 302, and the processor 302 will store the data within the primary cache 304. Then the data is available to the processor 302 from the primary cache 304.
But, it may also be assumed that at some later time, one of the other bus master devices 310-314, perhaps a DMA controller associated with an I/O device, may overwrite the data within the main system memory 306 at addresses B000FFF0h-B000FFFFh. For example, data packets may be coming in from a network connection to a specified memory location. At this point, there is a coherency problem between the data in the main memory 306 and the data in the primary cache 304, i.e., the two memories hold different data at the same address. Unless software explicitly flushes the data in the primary cache 304, as discussed above, the next time the processor 302 attempts to read data at any of the addresses B000FFF0h-B000FFFFh, the primary cache 304 will provide old data to the processor 302. Unfortunately, the data provided would not be the "latest" data, and erroneous execution would result. To overcome delays associated with having to utilize the main memory 306 to provide coherent data, or with having to flush the cache 304 prior to retrieving data, the present invention is provided, as described below.
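The stale-data hazard described above can be reproduced with a minimal model (my own illustration, not the invention's apparatus; `NaiveCache` and the byte values are hypothetical). With no snoop hardware and no software flush, the cache keeps serving its copy after a DMA-style write changes main memory underneath it:

```python
LINE = 16  # illustrative line size

class NaiveCache:
    """A cache with no snoop hardware and no software flush."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        tag = addr // LINE
        if tag not in self.lines:
            base = tag * LINE
            self.lines[tag] = bytes(self.memory[base:base + LINE])
        return self.lines[tag][addr % LINE]

memory = bytearray(64)
memory[0:4] = b"old!"
cache = NaiveCache(memory)
first = cache.read(0)     # caches the line while memory holds "old!"
memory[0:4] = b"new!"     # a DMA-style master overwrites main memory
stale = cache.read(0)     # the cache still serves its old copy
fresh = memory[0]         # main memory holds the new byte
# stale != fresh: the processor would execute with outdated data
```

The divergence between `stale` and `fresh` is exactly the erroneous-execution scenario the example with addresses B000FFF0h-B000FFFFh describes.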
What is needed is an apparatus and method that solves the above problems by providing a solution to the issues of coherency in a multi-bus master environment, while still supplying the advantages of prefetching and caching volatile data. More specifically, what is needed is a method and apparatus for prefetching specified data into a cache memory, while ensuring that the data that is prefetched is the "latest" valid data.