A basic problem occurring in digital system design is that of how to increase throughput and reduce the delays involved in providing processor access to memory data and instructions. The performance of the system depends, of course, on the speed with which the processor can access memory data, and is degraded by any delays the processor incurs in accessing data or instructions.
The presently described computing system with multiple cache functionality provides an architecture and functions for reducing memory access time, shortening the time to complete Read, Write, and other operations, and thus increasing throughput efficiency.
Typically, one technique to reduce memory cycle time is that of using a cache memory which is attached to or adjacent to the processing unit. The adjacent cache memory generally has a high-speed data access cycle and functions to hold the more frequently used data so that it will be readily available to the processing unit.
The cache units or cache memory units are generally much smaller in addressability than the main system external memory, but since processing is most often sequential or repetitive in nature, algorithms for cache designs have been derived for filling cache memory with those data words or instruction words that the processor is most likely to need on its next operation or within the next few operations.
The presently described processing system involves a processor which has, on the same chip, a microcode cache memory for providing frequently used microinstruction words and, additionally, a general cache memory which holds both data words and OPCODE words used to support the operations of the microcode cache. Thus, by using a general cache memory in conjunction with a microcode cache memory, the computer system provides speedier data accessibility to the processing unit.
Each time a processor issues a Read or a Write, the cache memory organization checks whether it holds the requested data within the cache units. If the cache does contain a requested memory location, this is a cache "hit" and the requested data is returned to the processor on the next clock.
If the cache memory system does not have the requested data, this is a cache "miss". In this case, the processing unit must access a system interface in order to get the data from an external memory. This extra step results in a substantial delay, which may take 8 or 9 more clock time periods.
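The hit/miss behavior described above can be illustrated by a minimal sketch, not the disclosed hardware: a lookup table standing in for the cache, a dictionary standing in for external memory, and assumed costs of 1 clock for a hit versus 9 clocks for a miss over the system interface.

```python
# Minimal sketch of cache hit/miss behavior (illustrative only; the
# 1-clock hit and 9-clock miss costs are assumptions, not the design).

HIT_CLOCKS, MISS_CLOCKS = 1, 9

class SimpleCache:
    def __init__(self, main_memory):
        self.lines = {}               # address -> cached data copy
        self.main_memory = main_memory
        self.clocks = 0               # accumulated access time

    def read(self, address):
        if address in self.lines:     # cache "hit": data returned at once
            self.clocks += HIT_CLOCKS
            return self.lines[address]
        # cache "miss": cross the system interface to external memory
        self.clocks += MISS_CLOCKS
        data = self.main_memory[address]
        self.lines[address] = data    # fill the cache for future accesses
        return data

mem = {0x100: 42}
cache = SimpleCache(mem)
cache.read(0x100)   # first access misses and pays the long delay
cache.read(0x100)   # repeated access hits and costs one clock
```

The second access illustrates why repetitive processing makes caching pay off: the expensive external fetch is amortized over subsequent fast hits.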
The purpose of cache memory systems is to provide needed information to the processing unit quickly. Another main task of the cache is to maintain "data coherency", that is to say, the data in the cache must accurately match the corresponding data residing in main memory. If this is not the case, then the cache memory must invalidate any address location in cache memory whose contents have been changed in main memory by a write to main memory.
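The invalidation step can be sketched as follows. This is an assumed, simplified model rather than the disclosed mechanism: when a write changes main memory at some address, any cached copy of that address is dropped so stale data can never be returned.

```python
# Illustrative coherency sketch (assumed model): a write observed on
# the memory side invalidates the matching cache line, if any.

class CoherentCache:
    def __init__(self):
        self.lines = {}          # address -> cached data copy

    def fill(self, address, data):
        self.lines[address] = data

    def read(self, address):
        return self.lines.get(address)   # None means "not cached"

    def snoop_write(self, address):
        # Main memory at `address` was changed by a write: invalidate
        # our copy rather than risk serving stale data.
        self.lines.pop(address, None)

cache = CoherentCache()
cache.fill(0x200, 7)
cache.snoop_write(0x200)   # main-memory write invalidates the line
```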
Cache memories are placed in close proximity to the processor logic to allow for fast data access by the processor unit. Thus, instead of being burdened by the slow data retrieval cycle normally associated with accessing the external main memory, the processor can receive a copy of the data held by a faster cache memory. However, caches are generally much smaller than the main memory, so they can only hold a subset of the data found in the external main memory. Thus all of the possible locations of main memory must be mapped into the smaller cache memory, both to permit maximum utilization of the limited cache size and to minimize the time it takes to determine whether a required data copy is already present in the cache.
One technique used to achieve this is a four-way, set-associative cache. Such a cache may have its memory divided into four equal parts or sets and a word from external main memory can be mapped into any one of these four sets.
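The mapping of a large main-memory address space onto a small set-associative cache can be sketched as below. The parameters (256 sets of four ways each) are illustrative assumptions, not taken from the disclosure; the point is that the address splits into a set index, selecting one group, and a tag, compared against all four candidate locations in that group.

```python
# Sketch of four-way set-associative address mapping
# (illustrative parameters, not the disclosed geometry).

NUM_SETS = 256   # depth of each of the four cache divisions
NUM_WAYS = 4     # a word may reside in any one of four locations

def map_address(address):
    """Split a word address into (tag, set index).

    The set index selects one group of four candidate locations;
    the tag is stored alongside the data and compared on lookup
    against all four locations of that one group.
    """
    set_index = address % NUM_SETS
    tag = address // NUM_SETS
    return tag, set_index
```

On a lookup, only the four tags of the selected group need to be compared, which keeps the hit/miss determination fast despite the large main-memory address space.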
It is also desired to replace the least recently used (LRU) word in cache, since it is less likely to be used again in the near future than the other three words that have been accessed more recently. A cache can keep track of the order in which the data has been used by utilizing an LRU RAM which has the same depth as a cache set, and which stores a code of bits that can be decoded to identify the word in a set holding the "most stale" (or least recently used) data at a given address.
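One way to model such an LRU RAM is sketched below. The encoding used here, a recency-ordered list of the four way numbers per set address, is an assumption for illustration; the disclosed code of bits may differ, but decodes to the same result: the "most stale" word to be replaced.

```python
# Sketch of an LRU RAM (assumed encoding, not the disclosed one):
# one code word per cache-set address, with the same depth as a
# cache set, ordered from most to least recently used way.

NUM_SETS, NUM_WAYS = 256, 4

class LruRam:
    def __init__(self):
        self.order = [list(range(NUM_WAYS)) for _ in range(NUM_SETS)]

    def touch(self, set_index, way):
        # Mark `way` as most recently used at this set address.
        code = self.order[set_index]
        code.remove(way)
        code.insert(0, way)

    def victim(self, set_index):
        # Decode the code word to the least recently used ("most
        # stale") way, which is the replacement candidate.
        return self.order[set_index][-1]

lru = LruRam()
lru.touch(5, 3)   # way 3 accessed at set address 5
lru.touch(5, 0)   # then way 0
```

After these accesses, ways 3 and 0 are fresh, so decoding the code word at set address 5 yields one of the untouched ways as the replacement victim.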
In earlier years, the design of the control portion of computer processors went through a transition, being converted from hard-wired control units to the more recent types of microcode-driven control units. The microcode is generally referred to as "firmware" and resides at a level below the machine instruction level. The microcode is generally fixed and provided by the manufacturer, and is also inaccessible to the user, who may not even be aware of its existence.
Microcode instructions must be stored in some type of memory structure which is available to the control hardware of the processor. In many processors, this is a Read Only Memory (ROM) unit, which is generally inexpensive and fast, but has the limitation of being fixed and unalterable. Thus, when inadequacies are found, or when it is desired to change the definition of the instruction set that is implemented, making the change is very costly.
In other types of processors, the microcode instructions are stored in Random Access Memory (RAM). This makes it relatively easy to change the previously fixed microcode instructions, but on the other hand, RAM is more costly and slower in operation. Additionally, in many VLSI implementations, Random Access Memory also requires more silicon area per bit, thus reducing the amount of microcode available for use in a given silicon area.
Further, both RAM and ROM units are limited in size by practical considerations such as power consumption, cost, area required, and performance.
With these types of problems presented by RAM and ROM memories, computer systems have been developed that use "caching", or cache memory assists, in order to serve a processor's need for instruction codes as rapidly as possible.
The present disclosure functions to obtain the benefits of a writable control store without the size constraints of Random Access Memory (RAM) or the unalterability of Read Only Memory (ROM).
Thus the improved concept is that, instead of attempting to store the entire microcode instruction set in either a RAM unit or a ROM unit, a specialized "microcode cache unit" can be implemented. When a "miss" occurs in an ordinary cache memory unit, the required item is then fetched from the main memory. Most processors are connected to memory systems that are very large compared to the memory space required for microcode storage.
A special problem for microcode cache units is that a cache "miss" is very expensive in terms of average performance. Thus "hit" rates much higher than those of most general cache applications are desirable; in many applications, hits should occur at least 99% of the time. There are several concepts that make this possible.
(i) First, the amount of microcode actually used in the "normal operation" of a processor is relatively small. Many OP codes are seldom used, and many esoteric variants of common OP codes are used even less. One obvious example is the action taken under error conditions.
(ii) Second, a microcode post-processor can be used to rearrange microcode location accessibility to maximize the cache hit rate, if the parameters of the caching algorithm and of microcode use are known.
Microcode cache operations allow a large, complex, evolving instruction set to be implemented in a single-die package, with options as to where the complete microcode resides in the memory subsystem, depending on the cost/performance requirements for the system.
Putting the control store off-chip would tend to require deeper pipelining because of the delay incurred. The requirements for computing the address of the next microcode word to be executed would make deeper pipelining of its prefetch very costly, and performance would suffer considerably. Thus the on-chip cache location eliminates much of the pipelining delay that would otherwise be incurred.
Putting both a general and a microcode cache on-chip allows the processor to run for lengthy periods without having to go off-chip. Because of the performance cost of going off-chip (more costly the faster the processor is with respect to the memory subsystem), it is desirable to do so as infrequently as possible. Thus, it is useful to implement larger caches, as technology allows, to further reduce the off-chip traffic.