Caching data close to a processing core is a common approach to improving the performance of a microprocessor or other processing system. The cache may be a small memory that holds data and/or instructions that were recently or frequently used. Because the data and/or instructions are kept close to the processing unit, access latency is small, and overall processing speed may be increased compared with accessing a larger system memory directly, which may have a higher access latency.
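The latency benefit described above can be illustrated with a simple weighted-average model. The following is a minimal sketch only: the function name `average_access_time`, the simplification that a miss costs exactly one system-memory access, and the example figures (a 2-cycle cache, a 100-cycle system memory, a 75% hit rate) are assumptions made for illustration and do not describe any particular system.

```python
def average_access_time(hit_rate, cache_latency, memory_latency):
    """Average latency per access under a simplified model.

    Assumption: a hit costs cache_latency and a miss costs
    memory_latency; lookup and refill overheads are ignored.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_latency + (1.0 - hit_rate) * memory_latency


# Hypothetical figures, chosen only for illustration.
with_cache = average_access_time(hit_rate=0.75, cache_latency=2, memory_latency=100)
without_cache = average_access_time(hit_rate=0.0, cache_latency=2, memory_latency=100)
print(with_cache, without_cache)
```

Under these assumed figures the average latency with the cache is a fraction of the latency of accessing system memory directly, and the gap widens as the hit rate rises.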
There are multiple known approaches to caching in microprocessor systems. Many systems use separate data and program caches. This approach tends to be inflexible, as each cache is dedicated to either data or programs. Because applications vary in the ratio of data capacity to program capacity required for efficient operation, separate data and program caches cannot be optimized for every application, and in fact tend to be larger than many applications need so that all applications run fast. Separate caches also imply twice the control overhead and duplicate the memory-periphery overhead, such as internal control, wordline decoders, and sense amplifiers.
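The sizing penalty of separate caches can be sketched with a small calculation. This is an illustrative model only: the workload names, the capacity figures, and the assumption that each application needs a fixed amount of program and data capacity are hypothetical and are not taken from any measured system.

```python
def split_capacity_needed(workloads):
    """Total capacity when program and data caches are sized separately.

    Each cache must cover the largest demand any workload places on it.
    """
    program = max(w["program_kb"] for w in workloads)
    data = max(w["data_kb"] for w in workloads)
    return program + data


def unified_capacity_needed(workloads):
    """Capacity when one combined cache covers the largest total demand."""
    return max(w["program_kb"] + w["data_kb"] for w in workloads)


# Hypothetical workloads with opposite program-to-data ratios.
WORKLOADS = [
    {"name": "control_heavy", "program_kb": 24, "data_kb": 8},
    {"name": "data_heavy", "program_kb": 8, "data_kb": 24},
]

print(split_capacity_needed(WORKLOADS), unified_capacity_needed(WORKLOADS))
```

With these assumed workloads, the separate caches must together provide more capacity than either application uses, whereas a combined cache only needs to cover the largest total demand.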
In order to keep the chip area used by the cache as small as possible, many systems use combined data and program caches, which tend to be more flexible because they can be used by both the data and program sides of the microprocessor. Many systems also allow certain ways of a multiple-way cache to be dedicated to caching either instructions for a program fetch unit or data for a load/store unit. This dedication is needed to prevent timing-critical cached code portions from being constantly replaced by accesses to data that are scattered in memory and hence have a low hit rate. In such systems, the user can determine, for example through software, the exact division of the cache between the individual microprocessor subunits (load/store unit versus fetch unit). Performance may nevertheless be poor in such systems, because data and program accesses may be attempted simultaneously, but the cache can serve only one access at a time. Thus, the bandwidth is only about one-half that of separate caches. Dual-ported SRAMs could serve both accesses, but they require much more area and consume more power. Such SRAMs would also create many difficult corner cases, which would have to be resolved by the surrounding control logic.
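The effect of dedicating ways, and the bandwidth limit of a single access port, can be sketched as follows. This is a simplified model under stated assumptions rather than a description of any particular cache: the class `PartitionedSet`, the unit names `"fetch"` and `"load_store"`, the least-recently-used replacement policy, the choice to search all ways on lookup, and the cycle-count helpers are all hypothetical.

```python
class PartitionedSet:
    """One set of a combined multi-way cache with ways dedicated per unit."""

    def __init__(self, way_owners):
        # way_owners[i] is the set of units allowed to allocate into way i.
        self.way_owners = [set(owners) for owners in way_owners]
        self.tags = [None] * len(self.way_owners)
        self.last_used = [0] * len(self.way_owners)
        self.clock = 0

    def access(self, unit, tag):
        """Return True on a hit; on a miss, allocate the line and return False."""
        self.clock += 1
        # Assumption: lookup searches every way, whichever unit filled it.
        for way, stored in enumerate(self.tags):
            if stored == tag:
                self.last_used[way] = self.clock
                return True
        # Replacement is restricted to the ways dedicated to this unit.
        candidates = [w for w, owners in enumerate(self.way_owners) if unit in owners]
        if not candidates:
            raise ValueError(f"no way is dedicated to {unit!r}")
        victim = min(candidates, key=lambda w: self.last_used[w])
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return False


def code_survives_data_stream(way_owners):
    """Cache one code line, stream scattered data, then re-fetch the code."""
    cache_set = PartitionedSet(way_owners)
    cache_set.access("fetch", "loop_code")
    for i in range(8):
        cache_set.access("load_store", f"data_{i}")
    return cache_set.access("fetch", "loop_code")


def cycles_single_port(fetch_accesses, data_accesses):
    """A combined cache with one port serves one access per cycle."""
    return fetch_accesses + data_accesses


def cycles_separate_caches(fetch_accesses, data_accesses):
    """Separate caches serve one fetch and one data access in parallel."""
    return max(fetch_accesses, data_accesses)


DEDICATED = [{"fetch"}, {"fetch"}, {"load_store"}, {"load_store"}]
SHARED = [{"fetch", "load_store"}] * 4

print(code_survives_data_stream(DEDICATED), code_survives_data_stream(SHARED))
print(cycles_single_port(1000, 1000), cycles_separate_caches(1000, 1000))
```

In this sketch the code line remains cached when ways are dedicated but is evicted by the data stream when all ways are shared. For an equal mix of fetch and data accesses, the single-port combined cache needs twice as many cycles as the separate caches, which corresponds to about one-half the bandwidth.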