1. Field of the Invention
The present invention relates in general to microprocessor systems and, more particularly, to a system, method, and mechanism providing cache coherency in microprocessor systems with cache support.
2. Relevant Background
Microprocessors manipulate data according to instructions specified by a computer program. The instructions and data in a conventional system are stored in main memory, which is coupled to the processor by a memory bus. The ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. As used herein the terms “microprocessor” and “processor” include complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids.
Most processors use a cache memory system to speed memory access. Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data and instructions, designed to speed up subsequent access to the same data and instructions. Cache technology is based on a premise that programs frequently reuse the same instructions and data. Also, instructions and data exhibit a trait called “spatial locality” which means that instructions and data to be used in the future tend to be located in the same general region of memory as recently used instructions and data. When data is read from main system memory, a copy is also saved in the cache memory, along with an index to the associated location in main memory. Often the cache entry includes not only the data specifically requested, but also data surrounding the specifically requested data.
The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data had indeed been stored in the cache, the data is delivered immediately to the processor while the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data had not been previously stored in cache then it is fetched directly from main memory and also saved in cache for future access.
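The hit and miss behavior described above can be sketched as a toy direct-mapped cache. This is an illustrative model only, not part of the described system: the class name `SimpleCache`, the constants `LINE_SIZE` and `NUM_LINES`, and the use of a Python list as main memory are all assumptions made for the example.

```python
LINE_SIZE = 4   # words per cache line (hypothetical value)
NUM_LINES = 8   # number of lines in the cache (hypothetical value)


class SimpleCache:
    """Toy direct-mapped, read-only cache in front of a list used as main memory."""

    def __init__(self, memory):
        self.memory = memory
        # Each slot holds (tag, copy of the line's words), or None when empty.
        self.lines = [None] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        line_addr = addr // LINE_SIZE   # which memory line contains addr
        index = line_addr % NUM_LINES   # cache slot that line maps to
        tag = line_addr // NUM_LINES    # identifies which line occupies the slot
        entry = self.lines[index]
        if entry is not None and entry[0] == tag:
            self.hits += 1              # hit: main memory is not accessed
        else:
            self.misses += 1            # miss: fetch the whole surrounding line
            base = line_addr * LINE_SIZE
            entry = (tag, self.memory[base:base + LINE_SIZE])
            self.lines[index] = entry
        return entry[1][addr % LINE_SIZE]


memory = list(range(100, 164))          # 64 words of main memory
cache = SimpleCache(memory)
values = [cache.read(a) for a in (5, 6, 7, 5)]
```

In this sketch the first read of address 5 misses and fills the line covering addresses 4 through 7, so the following reads of 6, 7 and 5 are served from the cache, which illustrates why storing the surrounding data pays off under spatial locality.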
Microprocessor performance is greatly enhanced by the use of cache memory. Cache memory comprises memory devices that have lower latency than the main memory. In particular, one or more levels of on-chip cache memory provide especially low-latency storage. On-chip cache memory can be implemented in memory structures and devices having latency of only one or two clock cycles. Cache memory, especially on-chip cache memory, is well suited to being accessed by the microprocessor at high speed.
A task for the cache subsystem is to maintain cache coherency. Cache coherency refers to the task of ensuring that the contents of cache memory are consistent with the corresponding locations in main memory. When only the microprocessor can access main memory, cache coherency is a relatively simple task. However, this restriction forces all accesses to main memory to be routed through the microprocessor. Many devices such as graphics modules, multimedia modules and network interface modules, for example, can make use of system memory for efficient operation. However, if these modules must tie up the processor in order to use system memory, overall performance is lowered.
To make more efficient use of the processor, many systems allow modules and peripherals other than the microprocessor to access main memory directly. The system bus in a typical computer system architecture couples to the microprocessor and to a direct memory access (DMA) controller. Using the DMA controller, other modules and peripherals coupled to the bus can access main memory without tying up the microprocessor. This may also be referred to as a shared memory system, as all or part of the main memory is shared amongst the variety of devices, including the microprocessor, that can access the memory.
Shared memory systems complicate the cache coherency task significantly. DMA devices access main memory directly, but usually do not access the cache memory directly. To ensure that the DMA device obtains correct data, steps must be taken to verify that the contents of the shared memory location being accessed by a DMA device have not been changed in the cached copy of that location being used by the microprocessor. Moreover, the latency imposed by this coherency check must not outweigh the benefits of either caching or direct memory access.
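The coherency hazard described above can be illustrated with a minimal sketch. It assumes a write-back cache policy, which the text does not specify, and every name in it (`main_memory`, `cpu_cache`, `cpu_write`, `dma_read`, `write_back`) is hypothetical.

```python
main_memory = {0x100: 1}   # shared location, initial value 1
cpu_cache = {}             # addr -> (value, dirty flag)


def cpu_write(addr, value):
    # Assumed write-back policy: update only the cached copy and mark it dirty.
    cpu_cache[addr] = (value, True)


def dma_read(addr):
    # The DMA device bypasses the cache and reads main memory directly.
    return main_memory[addr]


def write_back(addr):
    # Copy a dirty cached value to main memory and mark the entry clean.
    value, dirty = cpu_cache[addr]
    if dirty:
        main_memory[addr] = value
        cpu_cache[addr] = (value, False)


cpu_write(0x100, 2)
stale = dma_read(0x100)    # main memory still holds the old value
write_back(0x100)
fresh = dma_read(0x100)    # after the write-back the DMA device sees the update
```

Here `stale` holds the outdated value because the processor's change exists only in the cache, while `fresh` holds the updated value once the cached copy has been written back.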
One solution is to partition the main memory into cacheable and uncacheable portions. DMA devices are restricted to using only uncacheable portions of memory. In this manner, the DMA device can be unconcerned with the cache contents. However, for the data stored in the uncacheable portions, all of the benefits of cache technology are lost.
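The partitioning approach might be sketched as a simple address-range check. The boundary `UNCACHEABLE_BASE`, the size `MEMORY_SIZE`, and the helper names below are assumptions chosen for illustration; real systems typically express such attributes through page tables or memory-type registers.

```python
MEMORY_SIZE = 0x10000       # hypothetical size of main memory
UNCACHEABLE_BASE = 0xC000   # hypothetical start of the uncacheable portion

main_memory = {}
cpu_cache = {}


def is_cacheable(addr):
    return 0 <= addr < UNCACHEABLE_BASE


def dma_write(addr, value):
    # DMA devices are restricted to the uncacheable portion.
    if not (UNCACHEABLE_BASE <= addr < MEMORY_SIZE):
        raise PermissionError(f"DMA access outside uncacheable region: {addr:#x}")
    main_memory[addr] = value


def cpu_read(addr):
    if is_cacheable(addr):
        if addr not in cpu_cache:
            cpu_cache[addr] = main_memory.get(addr, 0)   # fill on miss
        return cpu_cache[addr]
    # Uncacheable: every read goes to main memory, so no stale copy can exist.
    return main_memory.get(addr, 0)


dma_write(0xC000, 5)
seen_by_cpu = cpu_read(0xC000)
```

The processor always observes the DMA device's data in this sketch, but only because reads of the shared region never benefit from the cache.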
Another solution is to enable the DMA controller or other hardware coupled to the system bus to “snoop” the cache before the access to shared memory is allowed. An example of this is the peripheral component interconnect (PCI) bus, which enables the PCI bridge device to snoop the CPU cache automatically as a part of any DMA device transaction. This allows shared memory to be cached, but it also adds latency to every DMA transaction. Systems having a single system bus on which all DMA transactions are performed can implement snooping protocols efficiently. This is because a single bus system enables any device to broadcast a signal to all other devices quickly and efficiently to indicate that a shared memory access is occurring.
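A snoop performed as part of every DMA transaction can be sketched as follows. This is a simplified model under assumed behavior (a write-back cache, write-back on a snooped read, invalidation on a snooped write); it does not reproduce the PCI protocol, and the class and method names are hypothetical.

```python
class SnoopedSystem:
    """Toy model in which every DMA access snoops the CPU cache first."""

    def __init__(self):
        self.main_memory = {}
        self.cache = {}      # addr -> (value, dirty flag)
        self.snoops = 0      # counts the extra step added to each DMA access

    def cpu_write(self, addr, value):
        self.cache[addr] = (value, True)

    def _snoop(self, addr, invalidate):
        self.snoops += 1
        entry = self.cache.get(addr)
        if entry is None:
            return
        value, dirty = entry
        if dirty:
            self.main_memory[addr] = value       # flush the modified copy
        if invalidate:
            del self.cache[addr]                 # cached copy is about to go stale
        else:
            self.cache[addr] = (value, False)    # keep the copy, now clean

    def dma_read(self, addr):
        self._snoop(addr, invalidate=False)
        return self.main_memory.get(addr, 0)

    def dma_write(self, addr, value):
        self._snoop(addr, invalidate=True)
        self.main_memory[addr] = value


system = SnoopedSystem()
system.cpu_write(0x10, 7)
first = system.dma_read(0x10)    # snoop flushes the dirty line, DMA sees 7
second = system.dma_read(0x20)   # never cached, yet the snoop still happens
system.dma_write(0x10, 9)        # snoop invalidates the CPU's copy
```

The `snoops` counter increments on every DMA access, including the access to an address that was never cached, which mirrors the per-transaction latency cost noted above.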
There is an increasing demand for systems with robust, complex, multi-path communications subsystems for interconnecting system components. Complex communications networks enable greater expansion potential and customization. Moreover, such systems enable existing, proven subsystem and module designs (often referred to as intellectual property or “IP”) to be reused. In systems with more complex bus networks that enable multiple independent paths, a network broadcast can be slow, making conventional snoop protocols impractical.
Another solution used for more complex networks uses a centralized or distributed directory structure to hold cache status information. These may be seen, for example, in multiprocessor architectures. Any device accessing shared memory first accesses the directory to determine whether the target memory address is currently cached. When the address is not cached, a direct access to the shared memory location is made. When the address is cached, the cached data is written back to main memory before the direct access is completed. Directory-based solutions are faster than snoop operations, but also add latency to each DMA access as well as hardware overhead to support the directory structure.
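The directory lookup sequence described above can be sketched as below. It is a minimal, centralized model: the `Directory` class, its `lookups` counter, and the helper functions are assumptions for illustration and do not represent any particular multiprocessor directory protocol.

```python
class Directory:
    """Toy centralized directory recording which addresses are cached."""

    def __init__(self):
        self.cached = set()
        self.lookups = 0     # each DMA access pays for one directory lookup

    def is_cached(self, addr):
        self.lookups += 1
        return addr in self.cached


main_memory = {}
cpu_cache = {}
directory = Directory()


def cpu_write(addr, value):
    cpu_cache[addr] = value
    directory.cached.add(addr)        # directory tracks the cached address


def dma_read(addr):
    # Step 1: consult the directory instead of broadcasting a snoop.
    if directory.is_cached(addr):
        # Step 2: write the cached data back before completing the access.
        main_memory[addr] = cpu_cache[addr]
    # Step 3: perform the direct access to shared memory.
    return main_memory.get(addr, 0)


cpu_write(0x40, 3)
cached_result = dma_read(0x40)     # directory hit: write-back, then read
uncached_result = dma_read(0x80)   # directory miss: direct access only
```

Both accesses consult the directory in this sketch, so the lookup cost is paid even when the address turns out not to be cached.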
Existing solutions often rely on interrupt signals generated by the device and interrupt handler routines to implement cache operations such as a snoop. Interrupt mechanisms interrupt instruction flow and delay processing. Interrupt handlers involve multiple instructions and require state of the currently executing thread to be stored while the interrupt handler executes and restored after the interrupt handler executes. Hence, interrupt mechanisms for executing cache operations are an inefficient means to execute what may amount to a single cache instruction.
A need exists for a mechanism, method and system that enables efficient shared memory access in a cached memory system. A need specifically exists for a mechanism to perform cache coherency in a system having a complex, multipath system bus.