Cache memory has been widely used in digital signal processors (DSPs) as well as general purpose processors to achieve high operating clock frequencies. Cache accesses are typically pipelined, so load instructions can be issued every cycle. Because of this pipelining, load instructions incur a certain amount of latency. Table 1 illustrates the pipeline stages and their functions for a typical digital signal processor.
TABLE 1

| Mnemonic | Name | Action |
|---|---|---|
| PG | Program Address Generate | Determine address of fetch packet |
| PS | Program Address Send | Send address of fetch packet to memory |
| PW | Program Wait | Perform program memory access |
| PR | Program Data Receive | Fetch packet reaches CPU boundary |
| DP | Dispatch | Determine next execute packet in fetch packet and send to functional unit for decode |
| DC | Decode | Decode instructions in functional units |
| E1 | Execute 1 | All instructions: evaluate conditions and read operands; Load and store instructions: perform address generation and write address modifications to register file; Branch instructions: branch fetch packet in PG phase; Single cycle instructions: write results to register file |
| E2 | Execute 2 | Load instructions: send address to memory; Store instructions: send address and data to memory; Saturating single cycle instructions: update SAT bit in control status register; Multiply instructions: write results to register file |
| E3 | Execute 3 | Load and store instructions: perform memory accesses; Saturating multiply instructions: update SAT bit in control status register |
| E4 | Execute 4 | Load instructions: bring data to CPU |
| E5 | Execute 5 | Load instructions: write data to register file |
Program fetch is performed in four pipeline stages: PG, PS, PW and PR. Program decode is made up of the DP and DC pipeline stages. Execution takes place in pipeline stages E1 to E5. Note that different instructions require different numbers of execute pipeline stages. Single cycle instructions, including add, subtract and logical operations, complete in a single execute stage (E1), except for updating the SAT bit in the control status register. Multiply instructions complete in execute stage E2, except for updating the SAT bit in the control status register. Store instructions complete in execute stage E3. Load instructions complete in execute stage E5.
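The completion stages listed above can be restated as a small lookup. The following Python sketch is illustrative only: the names `COMPLETION_STAGE`, `execute_stages` and `extra_load_stages` are hypothetical, and the SAT bit updates mentioned above are deliberately ignored.

```python
# Last execute stage used by each instruction class, as described in the text.
# SAT-bit updates (one stage later for saturating instructions) are ignored.
COMPLETION_STAGE = {
    "single_cycle": 1,  # add, subtract, logical: results written in E1
    "multiply": 2,      # results written in E2
    "store": 3,         # memory access performed in E3
    "load": 5,          # data written to the register file in E5
}


def execute_stages(instr_class):
    """Return the execute stages occupied by an instruction class."""
    last = COMPLETION_STAGE[instr_class]
    return [f"E{i}" for i in range(1, last + 1)]


def extra_load_stages():
    """Execute stages a load needs beyond those of a single cycle instruction."""
    return COMPLETION_STAGE["load"] - COMPLETION_STAGE["single_cycle"]
```

Under this reading, a load occupies E1 through E5, four execute stages more than a single cycle instruction.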
FIG. 1 illustrates the functions of an example VLIW DSP including the pipeline phases of the processor. Fetch phase 100 includes the PG pipeline stage 101, the PS pipeline stage 102, the PW pipeline stage 103 and the PR pipeline stage 104. In each of these pipeline stages the DSP can perform eight simultaneous commands. These commands are summarized in Table 2. The decode phase 110 includes the DP pipeline stage 105 and the DC pipeline stage 106. Decode phase 110 also performs commands from Table 2.
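TABLE 2

| Instruction Mnemonic | Instruction Type | Functional Unit Mapping |
|---|---|---|
| ST | Store | D-Unit |
| SADD | Signed Add | L-Unit |
| SMPYH | Signed Multiply | M-Unit |
| SMPY | Signed Multiply | M-Unit |
| SUB | Subtract | L-Unit; S-Unit; D-Unit |
| B | Branch | S-Unit |
| LD | Load | D-Unit |
| SHR | Shift Right | S-Unit |
| MV | Move | L-Unit |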
FIG. 1 illustrates memory hardware external to the CPU. Program cache memory 111 stores the instructions to be performed and data cache memory 126 stores all operands in immediate use. Memory controller 125 performs program fetch control and memory controller 112 performs data transfer control. Bulk data storage resides in external memory 131. Level-2 cache 127 provides high-speed access to data in current use.
Execute phase 120 performs all other operations including: evaluation of conditions and status; Load-Store instructions; Branch instructions; and single cycle instructions. Execute stages E1 107 and E2 108 prepare for the cache memory access in stage E3 109, which retrieves the required operand from data cache memory 126.
Upon a cache hit, processing proceeds to execute stage E4 117 and then to execute stage E5 124 with results stored in the register file. Upon a cache miss, memory controller 112 inserts a fixed number of stall cycles via path 128 allowing data to be retrieved from level-2 cache 127 or external memory 131. Data is returned to the pipeline via path 130.
FIG. 2 illustrates the manner in which the pipeline is filled in an example pipeline execution of a DSP that has a four cycle load latency. Successive fetch stages can occur every clock cycle. In a given fetch packet, such as fetch packet n 200, the fetch phase is completed in four clock cycles with the pipeline stages PG 201, PS 202, PW 203 and PR 204 as described in Table 1. In fetch packet n, the fifth and sixth clock cycles are devoted to the program decode phase, which consists of dispatch stage 205 and decode stage 206. The seventh clock cycle 207 and succeeding clock cycles of fetch packet n are devoted to the execution of the instructions of the packet. Any additional processing required for a given packet that is not completed in the first eleven clock cycles could result in pipeline stalls or data memory stalls.
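The pipeline fill described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a stall-free pipeline in which one new fetch packet enters PG every clock cycle; the function names `stage_at` and `fill_table` are hypothetical and are not taken from FIG. 2.

```python
# The eleven pipeline stages of Table 1, in order.
STAGES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1", "E2", "E3", "E4", "E5"]


def stage_at(packet_offset, cycle):
    """Stage occupied by fetch packet n+packet_offset in a 1-based clock cycle.

    Assumes packet n enters PG in cycle 1, each later packet enters one cycle
    after its predecessor, and no stalls occur. Returns None when the packet
    is not in the pipeline during that cycle.
    """
    index = cycle - 1 - packet_offset
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None


def fill_table(num_packets, num_cycles):
    """One row per fetch packet, one column per clock cycle ('--' if idle)."""
    return [
        [stage_at(k, c) or "--" for c in range(1, num_cycles + 1)]
        for k in range(num_packets)
    ]


if __name__ == "__main__":
    for k, row in enumerate(fill_table(4, 14)):
        print(f"n+{k}: " + " ".join(f"{s:>2}" for s in row))
```

In this model, fetch packet n reaches E1 in the seventh clock cycle and E5 in the eleventh, matching the cycle counts given above.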
Referring back to FIG. 1, cache memory accesses are initiated during pipeline stages E2 108 and E3 109, and the data is brought to the CPU boundary at pipeline stage E4 117 via path 130. At pipeline stage E5 124, data is written into the register file. For a cache miss 128, the pipeline stalls the CPU at pipeline stage E3. FIG. 3 illustrates such a stall at 301. This example includes two assumptions: a four cycle cache miss penalty; and the second load accesses the same cache line as the first load.
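The effect of such a stall on two back-to-back loads can be approximated with a short simulation. The Python sketch below is not a reproduction of FIG. 3. It assumes that a miss detected in E3 holds the entire pipeline for a fixed `miss_penalty`, that the missed line is cached afterward, and that the first load enters E1 in cycle 1; the function name `load_writeback_cycles` and the cache line labels are hypothetical.

```python
def load_writeback_cycles(lines, miss_penalty=4):
    """Cycle in which each back-to-back load completes E5.

    lines[i] is the cache line accessed by load i. Load i enters E1 one
    pipeline step after load i-1. The cache starts empty. A miss in E3
    holds the whole pipeline for miss_penalty cycles (assumed model).
    """
    cached = set()
    finish = {}
    clock = 0  # elapsed clock cycles, including stall cycles
    step = 0   # pipeline advances (does not count stall cycles)
    while len(finish) < len(lines):
        clock += 1
        step += 1
        for i, line in enumerate(lines):
            stage = step - i  # execute stage of load i in this step
            if stage == 3 and line not in cached:
                clock += miss_penalty  # pipeline held while the line is fetched
                cached.add(line)
            elif stage == 5:
                finish[i] = clock  # data written to the register file
    return [finish[i] for i in range(len(lines))]
```

With the two assumptions stated above (a four cycle miss penalty and both loads in the same cache line), `load_writeback_cycles(["A", "A"])` gives completion in cycles 9 and 10, compared with cycles 5 and 6 when no miss penalty applies. Only the first load pays the penalty; the second hits the line that the first brought in.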
The overhead of stall cycles due to cache misses depends on the cache miss penalty and the cache hit ratio of the code executed. It is highly desirable to reduce or eliminate these stall cycles to better utilize the processing core.
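This dependence can be made concrete with a first-order estimate. The Python sketch below assumes that every miss costs the same fixed penalty and that misses never overlap; the function names and the numbers in the usage note are illustrative and are not measurements from any processor.

```python
def stall_cycles(num_loads, hit_ratio, miss_penalty):
    """Estimated stall cycles: one fixed penalty per missing load."""
    return num_loads * (1.0 - hit_ratio) * miss_penalty


def stall_overhead(base_cycles, num_loads, hit_ratio, miss_penalty):
    """Fraction of total execution time spent in cache-miss stalls.

    base_cycles is the cycle count the code would take with no misses.
    """
    stalls = stall_cycles(num_loads, hit_ratio, miss_penalty)
    return stalls / (base_cycles + stalls)
```

For example, 1000 loads with a 95% hit ratio and a four cycle miss penalty yield roughly 200 stall cycles in this model. If the same code needs 10000 cycles without misses, stalls account for about 2% of total execution time.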