From a DSP application perspective, a large amount of fast on-chip memory would be ideal. However, over the years processor performance has improved at a much faster pace than memory performance. As a result, there is now a performance gap between CPU speed and memory speed. High-speed memory is available, but it occupies much more area and is more expensive than slower memory.
FIG. 1 illustrates a comparison of a flat memory architecture and a hierarchical memory architecture. In the flat memory architecture illustrated on the left, both the CPU and internal memory 120 are clocked at 300 MHz, so accesses to internal memory 120 cause no stalls. However, accesses to the slower external memory 130 cause CPU stalls. If the CPU clock were increased to 600 MHz, internal memory 120 could service CPU accesses only every other CPU cycle, and the CPU would stall for one cycle on every memory access. The penalty would be particularly large for highly optimized inner loops that may access memory on every cycle. In this case, the effective CPU processing speed would approach the slower memory speed. Unfortunately, today's available memory technology is not able to keep up with increasing processor speeds, and an internal memory of the same size running at the CPU speed would be far too expensive.
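The stall arithmetic above can be sketched in a few lines of Python. This is only an illustrative model, not part of the described system: the function name effective_mhz and the parameter mem_access_fraction (the fraction of CPU cycles that issue a memory access) are assumptions introduced here, and the model assumes one operation per cycle when no stall occurs.

```python
import math


def effective_mhz(cpu_mhz, mem_mhz, mem_access_fraction):
    """Effective processing rate in MHz under a simple stall model.

    Assumption: a memory access occupies ceil(cpu_mhz / mem_mhz) CPU
    cycles, and every cycle beyond the first is a stall cycle.
    """
    cycles_per_access = max(1, math.ceil(cpu_mhz / mem_mhz))
    stall_cycles = cycles_per_access - 1
    # Average CPU cycles consumed per operation, including stalls.
    cycles_per_operation = 1 + mem_access_fraction * stall_cycles
    return cpu_mhz / cycles_per_operation


# Flat architecture, CPU and internal memory both at 300 MHz: no stalls.
same_clock = effective_mhz(300, 300, 1.0)
# CPU raised to 600 MHz, inner loop accessing memory on every cycle.
tight_loop = effective_mhz(600, 300, 1.0)
# Same clocks, but only half of the cycles access memory.
mixed_code = effective_mhz(600, 300, 0.5)
```

Under these assumptions the tight loop runs at an effective 300 MHz, the speed of the memory, while code that accesses memory on only half of its cycles reaches an effective 400 MHz.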
The solution is to use a memory hierarchy, as shown on the right of FIG. 1. A fast but small memory 150, which can be accessed without stalls, is placed close to CPU 140. In this example both CPU 140 and level one (L1) cache 150 operate at 600 MHz. The memory levels farther from the CPU are increasingly larger but also slower. These include level two (L2) cache 160 clocked at 300 MHz and external memory 170 clocked at 100 MHz. Addresses are mapped from a larger memory to a smaller but faster memory higher in the hierarchy. Typically, the higher-level memories are cache memories that are automatically managed by a cache controller. With this type of architecture, the average memory access time is closer to the access time of the fastest memory (level one cache 150) than to the access time of the slowest memory (external memory 170).
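The claim about average access time can be made concrete with a short calculation. The sketch below is an assumption-laden illustration: the hit rates (95% and 90%) and the latencies of 1, 2 and 6 CPU cycles are hypothetical values chosen for the example, and the function average_access_cycles is not taken from the described system. The model assumes that an access which misses at one level pays that level's latency before moving on to the next.

```python
def average_access_cycles(levels):
    """Average access time in CPU cycles for a memory hierarchy.

    levels: list of (hit_rate, access_cycles) ordered from the fastest
    memory to the slowest. The last level should have hit_rate 1.0,
    since every access that reaches it is serviced there.
    """
    total = 0.0
    reach = 1.0  # fraction of all accesses that reach the current level
    for hit_rate, access_cycles in levels:
        total += reach * access_cycles
        reach *= 1.0 - hit_rate
    return total


# Hypothetical three-level hierarchy: L1 cache, L2 cache, external memory.
hierarchy = [(0.95, 1), (0.90, 2), (1.0, 6)]
with_caches = average_access_cycles(hierarchy)

# The same external memory accessed directly, with no caches in front of it.
without_caches = average_access_cycles([(1.0, 6)])
```

With these assumed numbers the hierarchy averages about 1.13 cycles per access, close to the 1-cycle level one cache, compared with 6 cycles when every access goes to the slowest memory.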
Caches reduce the average memory access time by exploiting the locality of memory accesses. The principle of locality assumes that once a memory location has been referenced, it is very likely that the same or a neighboring location will be referenced again soon. Referencing the same memory location again within a short period of time is referred to as temporal locality. Referencing neighboring memory locations is referred to as spatial locality. A program typically reuses data from the same or adjacent memory locations within a small period of time. The benefit is greatest if data fetched from a slow memory into a fast cache memory is accessed as often as possible before it is replaced with other data.
The following example illustrates the concept of spatial and temporal locality. Consider the memory access pattern of a 6-tap FIR filter. The required computations for the first two outputs y[0] and y[1] are:
y[0] = h[0]×x[0] + h[1]×x[1] + . . . + h[5]×x[5]
y[1] = h[0]×x[1] + h[1]×x[2] + . . . + h[5]×x[6]
Consequently, to compute one output, six data samples must be read from an input data buffer x[i]. The upper half of FIG. 2 shows the memory layout of this buffer and how its elements are accessed. When the first access is made to memory location 0, the cache controller fetches the data for the address accessed and also the data for a certain number of the following addresses from memory 200 into cache 210. This range of addresses is called a cache line. FIG. 2 illustrates the logical overlap of addresses of memory 200 and cache 210. The motivation for this behavior is that accesses are assumed to be spatially local. This is true for the FIR filter, since the next five samples are required as well. These subsequent accesses all go to the fast cache 210 instead of the slow lower-level memory 200.
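The two output equations can be written as a short routine that also records which input samples are read. This is a minimal sketch for illustration: the function name fir_output and the sample values in h and x are assumptions, not part of the described system.

```python
def fir_output(h, x, n):
    """Compute y[n] = h[0]*x[n] + h[1]*x[n+1] + ... for a FIR filter.

    Returns the output value together with the list of indices of x
    that were read, so the access pattern can be inspected.
    """
    indices = [n + k for k in range(len(h))]
    y = sum(h[k] * x[n + k] for k in range(len(h)))
    return y, indices


h = [1, 1, 1, 1, 1, 1]   # six filter taps (illustrative values)
x = list(range(10))      # input data buffer (illustrative values)

y0, read0 = fir_output(h, x, 0)   # reads x[0] .. x[5]
y1, read1 = fir_output(h, x, 1)   # reads x[1] .. x[6]

# Samples read for both outputs.
shared = sorted(set(read0) & set(read1))
```

Each output reads six consecutive samples, and the two outputs share the five samples x[1] through x[5].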
Consider now the calculation of the next output y[1]. The access pattern is illustrated in the lower half of FIG. 2. Five of the samples are reused from the previous computation and only one sample is new. All of them are already held in cache 210, so no CPU stalls occur. This access pattern exhibits high spatial and temporal locality: the data used in the previous step is used again, together with one neighboring sample.
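The effect of cache lines on this access pattern can be checked with a toy simulation. The sketch below rests on assumptions not stated in the text: the cache line is taken to hold 8 samples, the cache never evicts a line, and the names LineCache and fir_access_stats are hypothetical.

```python
class LineCache:
    """Toy cache that fetches whole lines and never evicts them."""

    def __init__(self, line_words):
        self.line_words = line_words
        self.lines = set()   # line numbers currently held in the cache
        self.hits = 0
        self.misses = 0

    def read(self, address):
        line = address // self.line_words
        if line in self.lines:
            self.hits += 1
        else:
            # A miss fetches the entire line containing the address.
            self.misses += 1
            self.lines.add(line)


def fir_access_stats(taps, outputs, line_words):
    """Count cache hits and misses for the FIR input-buffer reads."""
    cache = LineCache(line_words)
    for n in range(outputs):
        for k in range(taps):
            cache.read(n + k)   # read sample x[n + k]
    return cache.hits, cache.misses


# Two outputs of a 6-tap filter with an assumed 8-sample cache line.
hits, misses = fir_access_stats(taps=6, outputs=2, line_words=8)

# Same pattern with one-sample lines: only temporal locality helps.
hits_1, misses_1 = fir_access_stats(taps=6, outputs=2, line_words=1)
```

With an 8-sample line, only the very first of the twelve reads misses. With one-sample lines, spatial locality cannot be exploited and seven reads miss, although the five reused samples still hit.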
A cache exploits the fact that data accesses are spatially and temporally local. The number of accesses to a slower, lower-level memory is greatly reduced. The majority of accesses can be serviced at CPU speed from the high-level cache memory.
Digital signal processors are often used in real-time systems. In a real-time system the computation must be performed fast enough to keep up with the real-time operation outside the digital signal processor. Memory accesses in data processing systems with cache cause real-time programming problems. The time required for a memory access varies greatly depending on whether the access can be serviced from the cache or must go to a slower main memory. In non-real-time systems the primary speed metric is the average memory access time. Real-time systems must be designed so that the data processing always keeps pace with the outside operation. This may mean always programming for the worst-case data access time. This programming paradigm may forfeit most of the potential benefit of a cache by acting as though the cache were never used. Thus there is a premium on being able to arrange a real-time process to make consistent, optimal use of cache.
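The gap between average-case and worst-case timing can be illustrated with a small calculation. This is a hedged sketch only: the function access_time_bounds, the parameter guaranteed_hits, and the latencies of 1 cycle for a hit and 6 cycles for a miss are assumptions made for the example.

```python
def access_time_bounds(accesses, hit_cycles, miss_cycles, guaranteed_hits=0):
    """Return (best_case, worst_case) memory access time in cycles.

    guaranteed_hits is the number of accesses known in advance to be
    serviced from the cache; all other accesses are assumed to miss in
    the worst case.
    """
    if not 0 <= guaranteed_hits <= accesses:
        raise ValueError("guaranteed_hits must lie between 0 and accesses")
    best_case = accesses * hit_cycles
    worst_case = (guaranteed_hits * hit_cycles
                  + (accesses - guaranteed_hits) * miss_cycles)
    return best_case, worst_case


# Twelve reads with nothing known about cache contents.
unknown = access_time_bounds(12, hit_cycles=1, miss_cycles=6)

# The same twelve reads when eleven are known to hit in the cache.
arranged = access_time_bounds(12, hit_cycles=1, miss_cycles=6,
                              guaranteed_hits=11)
```

With no knowledge of cache behavior, the worst case that a real-time design must budget for is 72 cycles, as if the cache did not exist. If eleven of the twelve reads are known to hit, the worst case drops to 17 cycles, which is why consistent, predictable cache use is valuable.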