1. Field of Invention
This invention relates to a microprocessor having a reconfigurable n-way cache to provide increased bandwidth for signal processing as well as general purpose applications.
2. Description of Related Art
There is a fundamental difference in the way microprocessors and digital signal processors (DSPs) are designed and used in system realization. Whereas microprocessors are designed to execute general purpose applications as efficiently as possible, DSPs are designed to execute only specific applications (such as speech processing) as efficiently as possible. Systems based on microprocessors are designed to run any general application. Some of these applications may not be run on the system until years after the system was shipped. On the other hand, systems based on a DSP are designed to run, in general, only a small set of specific applications, e.g., a telephone answering machine runs only a specific application throughout its lifetime. Once a system based on a DSP is shipped, typically, no new applications are run on it.
Due to this difference in the way microprocessors and DSPs are used, the design styles for these two types of processors have evolved quite differently. However, both processors are designed to provide high performance cost effectively.
Many conventional processors have multi-ported register files, and are therefore capable of providing two or more operands contained in registers to the execution unit (EU) every cycle. The register files are contained on the same integrated circuit as the arithmetic logic unit (ALU), and are very fast devices for providing the desired data. For example, referring to FIG. 1, a typical prior-art microprocessor 100 includes an instruction register 101 that supplies a first address (ADDR0) to a first register file 102, and a second address (ADDR1) to a second register file 103. The register files 102 and 103 illustratively have 32 entries of 32 bits each. The first register file 102 supplies a first operand to a first operand register 104. The second register file 103 supplies a second operand to a second operand register 105. The registers 104 and 105 supply the first and second operands to the arithmetic logic unit (ALU) 106, which may perform various arithmetic operations, illustratively including a multiply-accumulate (MAC) operation. The result is stored in the result register 107, and may be written back into the register files via a signal line 108. In an alternate embodiment, a single dual-ported register file (not shown) is used in lieu of the two register files 102 and 103. In that case, two read ports allow simultaneous access to any two entries in the register file.
Although a register file provides efficient temporary storage, memory organization plays a critical role in determining the performance of microprocessors and DSPs. This is because the performance is determined by how efficiently instructions and data are accessed from the memory. Since the speed of discrete memories has not kept pace with processor speeds, on-chip storage is typically provided for both instructions and data. Microprocessors and DSPs differ in the way in which this on-chip memory is organized.
There are many instances where it is necessary to supply two operands, contained in memory, that are not already in the on-chip registers. An example is the multiply-accumulate instruction, which is one of the basic primitives of signal processing. A typical instruction is

MAC x, y, a0

where MAC is the mnemonic for the instruction "multiply accumulate" and the operation specified is:

a0 = a0 + (x*y)
Typically, x and y belong to specific arrays in the memory. For example, x may be located in a coefficient array and y may be located in a data array.
The two memory operands x and y are typically contained in an on-chip data memory, if available, or in a memory external to the microprocessor chip. In either case, supplying two operands to the ALU every cycle implies dual-porting the data memory.
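As a rough illustration of why the MAC operation needs two memory operands per step, the following Python sketch accumulates the products of a coefficient array and a data array. The function names `mac` and `dot_product` and the sample arrays are assumptions chosen for illustration only; they are not part of any described hardware.

```python
def mac(a0, x, y):
    """One multiply-accumulate step: a0 = a0 + (x * y)."""
    return a0 + x * y


def dot_product(coeffs, data):
    """Accumulate coeffs[i] * data[i] using one MAC per element pair."""
    if len(coeffs) != len(data):
        raise ValueError("arrays must have the same length")
    a0 = 0  # accumulator
    for x, y in zip(coeffs, data):
        # Each iteration fetches two memory operands: x from the
        # coefficient array and y from the data array.
        a0 = mac(a0, x, y)
    return a0


coeffs = [1, 2, 3]  # hypothetical coefficient array
data = [4, 5, 6]    # hypothetical data array
print(dot_product(coeffs, data))
```

In hardware, completing one such iteration per cycle requires that both x and y be readable in the same cycle, which is the dual-porting requirement noted above.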
FIG. 2 shows an example of a DSP 200 having two banks of on-chip memory. An instruction register 201 supplies first and second addresses (ADDR0, ADDR1) to a first bank 202 and a second bank 203 of the RAM, where each bank 202 and 203 is illustratively 1 kilobyte in size. The data is written to the RAM via a write line 213. The first operand is read from the bank 202 and output to a multiplexer 204. Similarly, the second operand is read from the second bank 203 and output to a multiplexer 206. Assuming the multiplexers 204 and 206 select the outputs of the RAM banks 202 and 203, the first operand is then latched into a first operand register 205, while the second operand is then latched into a second operand register 207. Alternatively, the operands may be selected by the multiplexers 204 and 206 from an external memory bus 212.
The operands are then provided from the operand registers 205 and 207 to the ALU/MAC unit 208, where they are multiplied together and added to the previous result accessed from an accumulator file 210 via a second line 214. The result is provided to the result register 209 and stored in the accumulator file 210.
Although this technique provides for the multiply/accumulate function within a conventional DSP architecture, the approach has disadvantages. For example, since the on-chip memory is configured as RAM rather than as a cache memory, only selected applications can utilize it. All the data addresses in the memory have to be determined when the application program is developed. Thus, conventional microprocessor applications cannot make flexible use of this memory. Furthermore, it is difficult to run applications from different vendors that are installed in the field.
Since any application may be run on a microprocessor-based system, its characteristics are not known in advance. On-chip caches are conventionally used in microprocessors to improve performance. The cache works based on temporal locality and spatial locality. Temporal locality means that once a given memory location is used, it is likely to be used again in the near future. Spatial locality means that once a memory location is used, locations in its vicinity are likely to be used in the near future.
FIG. 3 shows a schematic diagram of a 2-way set-associative cache and how it is addressed, as described in Computer Architecture: A Quantitative Approach, J. L. Hennessy and D. A. Patterson, Morgan Kaufmann Publishers, Inc. pp. 408-414, 1990 (Computer Architecture). The cache includes data portions 305 and 306 and tag portions 307 and 308. The cache has n blocks or lines. A block typically includes more than one byte of storage. A byte within a block is addressed by the block offset field 304 of the address 301. For example, if the block size is 8 bytes, the block offset field is 3 bits. The index field 303 of the address 301 is used to select the set in the cache. Each set in a 2-way associative cache has two blocks. The block frame address 302 is stored in the tag portion associated with the data portion where the block is stored. When a cache block is first written, a set is specified by the index 303 portion of the address. The block within the set is determined by a selection algorithm, such as random replacement or least recently used (LRU). Once a block is selected, the block frame address 302 is written in the tag portion 307 or 308 and the block from memory is written in the data portion 305 or 306 corresponding to the selected block. A special bit is provided in the tag portions 307 and 308 to indicate that a given entry in the cache contains valid data. In general, there are other control bits in the tag portions 307 and 308 to store other information, such as privilege level, etc.
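The division of an address into block frame address, index, and block offset can be sketched as follows. This is a minimal illustration only: the 8-byte block size follows the example above, while the number of sets (128) and the name `split_address` are assumptions introduced here and do not come from the cited reference.

```python
BLOCK_SIZE = 8   # bytes per block, as in the example above
NUM_SETS = 128   # assumed number of sets (hypothetical, power of two)

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1  # 3 bits for an 8-byte block
INDEX_BITS = NUM_SETS.bit_length() - 1     # 7 bits for 128 sets


def split_address(addr):
    """Return (block frame address, index, block offset) for addr."""
    offset = addr & (BLOCK_SIZE - 1)                # byte within the block
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)  # selects the set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)        # stored in the tag portion
    return tag, index, offset


tag, index, offset = split_address(0x1234)
print(tag, index, offset)
```

Recombining the three fields as `(tag << 10) | (index << 3) | offset` reproduces the original address under these assumed field widths.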
At a later time, the processor may request data at a specified memory address 301. In order to check whether a specific data address "hits" in (i.e., is in) the cache, the index 303 portion of the address is used to select the set. For a 2-way associative cache, there are two sets of tags 307 and 308 and data 305 and 306, which are accessed simultaneously using the index 303. The two output tags 307 and 308 are compared with the block frame address 302 using the comparators 309 and 310. If neither tag 307 nor tag 308 equals the block frame address 302, a cache miss has occurred. On the other hand, if one of the tags 307 or 308 is equal to the block frame address 302 and the valid bit is set, a cache hit has occurred, and the data corresponding to the matching tag is the correct data, which is selected by a multiplexer 311 using the hit signals. The appropriate byte(s) within the data 312 are then accessed using the block offset field 304.
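The hit check and block replacement described above can be modeled in software roughly as follows. This is a simplified sketch under stated assumptions: it tracks only valid bits and tags (no data portion), it uses LRU as the selection algorithm, and the class name `TwoWayCache`, the method `access`, and the default sizes are hypothetical.

```python
class TwoWayCache:
    """Tag-only model of a 2-way set-associative cache with LRU replacement."""

    def __init__(self, num_sets=4, block_size=8):
        self.num_sets = num_sets
        self.block_size = block_size
        # Each set holds two ways; each way is [valid bit, tag].
        self.ways = [[[False, None], [False, None]] for _ in range(num_sets)]
        # lru[s] is the way in set s that will be replaced next.
        self.lru = [0] * num_sets

    def access(self, addr):
        """Return True on a cache hit, False on a miss (the block is then filled)."""
        block = addr // self.block_size
        index = block % self.num_sets  # selects the set
        tag = block // self.num_sets   # block frame address
        # Compare both tags of the selected set (done in parallel in hardware).
        for way, (valid, stored_tag) in enumerate(self.ways[index]):
            if valid and stored_tag == tag:
                self.lru[index] = 1 - way  # the other way is now least recent
                return True
        # Miss: replace the least recently used way in this set.
        victim = self.lru[index]
        self.ways[index][victim] = [True, tag]
        self.lru[index] = 1 - victim
        return False


cache = TwoWayCache()
for addr in (0, 0, 4, 32, 0, 64, 32):
    print(addr, "hit" if cache.access(addr) else "miss")
```

In this example, addresses 0, 32, and 64 all map to the same set, so the third distinct block evicts the least recently used of the other two.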
A cache that has only one block per set is referred to as a direct mapped cache. Furthermore, a cache that has n blocks per set is referred to as an n-way set-associative cache.
Conventionally, virtual memory is used to present the application with much more memory than is physically available. This is achieved through secondary storage, such as a disk drive. Thus, an application generates virtual instruction and data addresses. These addresses are translated using page directory and page table entries and a hardware table walk. For faster translation, virtual address-to-physical address translations are cached in an on-chip memory called a Translation Look-aside Buffer (TLB), as described in Computer Architecture, pp. 432-449.
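The role of the TLB as a cache of translations can be illustrated with the following sketch. It is a simplification under several assumptions: a single flat dictionary stands in for the page directory and page table entries, the page size of 4096 bytes and the TLB capacity of two entries are hypothetical, and the oldest-entry eviction policy and the names `Translator` and `translate` are chosen only for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


class Translator:
    """Virtual-to-physical translation with a small TLB in front of a page table."""

    def __init__(self, page_table, tlb_entries=2):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # cached translations
        self.tlb_entries = tlb_entries
        self.table_walks = 0          # counts the slow-path lookups

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            frame = self.tlb[vpn]     # fast path: translation is cached
        else:
            self.table_walks += 1     # slow path: stands in for the table walk
            if vpn not in self.page_table:
                raise LookupError("page fault: page not resident")
            frame = self.page_table[vpn]
            if len(self.tlb) >= self.tlb_entries:
                # Evict the oldest inserted entry (dicts keep insertion order).
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[vpn] = frame
        return frame * PAGE_SIZE + offset


t = Translator({0: 5, 1: 9})
print(hex(t.translate(0x10)), t.table_walks)
print(hex(t.translate(0x20)), t.table_walks)
```

The second translation falls in the same virtual page as the first, so it is served from the TLB without another table walk.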
A major advantage of caches is that they adapt to the dynamics of the application being run, based on temporal and spatial locality. A major disadvantage of caches is that there is no guarantee that a given location is in the cache at any particular time. Events, such as an interrupt, may change the execution flow and "pollute" the cache. If required memory locations are not guaranteed to be on-chip, the computation may not be completed in the time allocated. This may not be acceptable for DSP applications.
Accordingly, DSPs conventionally do not use on-chip cache for instruction and data storage. Since a small set of applications run on a DSP, the instructions are typically contained in an on-chip ROM. Furthermore, since the data storage requirements for DSP applications are known in advance, the data is allocated in on-chip memory banks. On-chip cache differs from on-chip memory banks in that on-chip cache can store data at any absolute memory location, whereas an on-chip memory bank stores data only at specified memory locations.
Recently, a new class of devices, called Personal Communicators, is becoming available. These devices integrate communications capabilities, such as voice, data, and fax communications using a cellular phone, with personal organizers. These devices currently use a separate DSP for communications tasks and a general purpose microprocessor for the other tasks.