Modern electronic computing systems, such as microprocessor systems, typically include a system bus connecting a memory, one or more processing units, and an input/output (I/O) module. FIG. 1, for example, illustrates an exemplary multiprocessing system 100 that includes multiple processing units. System 100 includes an otherwise conventional system bus 102 coupled to an otherwise conventional memory 104. System 100 includes an otherwise conventional I/O module 106 coupled to bus 102.
System 100 also includes an otherwise conventional main processing unit (MPU) 110. As illustrated, system 100 includes an otherwise conventional memory management unit (MMU) 112 coupled to bus 102 and an otherwise conventional level two (L2) cache 114 coupled to MMU 112. Generally, MPU 110 accesses memory 104 through MMU 112 and L2 cache 114.
System 100 also includes a synergistic processing unit (SPU) 120. SPU 120 can be an auxiliary general purpose processing unit (PU) or a special purpose PU. System 100 also includes an otherwise conventional direct memory access/memory management unit (DMA/MMU) 122 coupled to bus 102 and an otherwise conventional local store 124. Generally, SPU 120 accesses memory 104 through DMA/MMU 122, and stores selected information in local store 124. Local store 124 is frequently much smaller than L2 cache 114. As illustrated, system 100 includes one additional processing unit, SPU 120. One skilled in the art will understand that many multi-processor systems employ more than two processing units.
As shown, a modern complex processor system includes a large, unified memory (memory 104), accessible by the main processing unit (MPU 110) through a hierarchy of coherent caches (e.g., L2 cache 114), which hide and reduce memory access latency. When MPU 110 references data in cache 114, it does so by “effective address.” Typically, the system hardware looks up the effective address in a cache directory, and if found, returns the data referenced by that effective address. If the requested effective address is not found in the cache directory (a “cache miss”), the system incurs a latency penalty in retrieving the requested data into the cache from some higher level in the memory hierarchy. The illustrated L2 cache 114 is an example of a “hardware cache,” where the caching and lookup functionality is implemented in hardware.
SPU 120, however, does not have a hardware L2 cache. As such, some systems employ a “software managed cache,” which emulates a hardware cache in local store 124. In some systems, the operating system provides the cache functionality; in other systems, the applications must provide their own memory management and cache functionality.
For example, a common cache-supported computational task is to sum a sequential array of data elements:
for (i=0; i<number_of_elements; i++)
  sum += a[i];
For a simple processing unit using conventional managed cache techniques, the sequential access above becomes:
for (i=0; i<number_of_elements; i++)
  sum += cache_read(&a[i]);
But conventional managed cache techniques suffer from a number of disadvantages. First, as described above, conventional software caches mimic hardware caches and maintain a directory of “effective addresses.” As such, conventional software managed caches, before determining a cache hit/miss, must compute the effective address to pass to the cache lookup function:
/* STEP 1: compute EA */
ea = a + i * sizeof(data);
Using effective addresses in a software managed cache therefore introduces unnecessary overhead to the cache lookup function, in that the effective address must be calculated for each cache lookup. With the introduction of 64-bit addressing, this problem has become a significant contributor to latency. For example, in a 64-bit address space, calculating the effective address requires at least one 64-bit add. On a simple processing unit lacking a full set of 64-bit integer arithmetic instructions, this single addition may require multiple processing steps.
Having calculated the effective address, a typical “cache read” operation follows the following general structure, illustrated in exemplary C/C++ code:
cache_read(ea)
{
  /* STEP 2: clear low order bits to ensure cacheline alignment. */
  ea_aligned = ea & ~CACHELINE_MASK;
  /* STEP 3: hash 'ea' into cache directory. */
  set = (ea >> NR_SETS_SHIFT) & NR_SETS_MASK;
  /* STEP 4: compare against cache dir. */
  found = (cache_dir[set] == ea_aligned) ? TRUE : FALSE;
  /* STEP 5: get data, on assumption of hit. */
  data = cache_data(set, ea & CACHELINE_MASK);
  /* STEP 6: handle miss, if needed. */
  if (!found)
    data = cache_miss(ea);
  return data;
}
One skilled in the art will understand that in some cases, and for 64-bit effective addresses especially, a simple processing unit may also require additional processing steps to break the effective address into high and low components, and to perform separate hash functions on each component. This introduces further latency and power consumption.
Thus, as systems continue to develop and employ longer bit sequences in addressing, the latencies and extra processing steps required by conventional software caching approaches using effective addresses will become a still greater problem.
Therefore, there is a need for a system and/or method for a software managed cache that addresses at least some of the problems and disadvantages associated with conventional systems and methods.