Embodiments of this invention are applicable to the field of programmable digital logic circuitry; more specifically, embodiments of this invention are directed to memory architecture in digital signal processors.
The technology of digital signal processing has become commonplace in modern electronic systems and applications of such systems. Digital signal processing techniques are widely used in communications technologies, including the wireless technologies of cellular telephony and wireless networking, the latter ranging from short-range approaches (e.g., “Bluetooth”) to local area networking (wireless LANs, or “WiFi”) and “metro” area networks implemented via “WiFi” or the like. Wireline communications, such as digital subscriber line (DSL), high-speed Internet access via cable networks, and Ethernet network communications, also apply digital signal processing techniques. Digital signal processing is also widely used in various other applications, such as digital audio systems, digital video systems, hearing aids, and numerous other real-time computing applications.
Special-purpose microprocessors designed to efficiently handle certain arithmetic and logic operations that are repeatedly performed in digital signal processing (e.g., multiply-and-accumulate) are now widely used. Examples of such digital signal processors (“DSPs”) that are popular in the industry include the TMS320C64x family of DSPs available from Texas Instruments Incorporated. Modern DSPs, such as that “C64x” family, are realized by Very Long Instruction Word (VLIW) processor architectures. FIG. 1 illustrates the architecture of data memory and functional units in the C64x family of DSPs, according to which two sets 2 of four functional units each are provided. As shown in this example, each set 2 includes a logical unit (L1; L2), a shifter unit (S1; S2), a multiplier (M1; M2), and a data load/store unit (D1; D2). Set 2₁ (L1, S1, M1, D1) is associated with dedicated register file 4₁, and set 2₂ (L2, S2, M2, D2) is associated with dedicated register file 4₂. Global data memory 6 is available to both of sets 2₁, 2₂, and is accessible via their respective data units D1, D2. In this architecture, a maximum of eight instructions can be simultaneously executed per machine cycle, one instruction by each of the eight functional units. Of course, instruction execution at this maximum rate requires that the particular instructions being simultaneously executed match the functional unit types available (e.g., eight load/store operations cannot be performed simultaneously, because only the two data units D1, D2 can execute them). In addition, the bandwidth of each of the register files 4₁, 4₂ must be shared among its associated functional units, although the latency of accesses to register files 4 will be much shorter than the latency for accesses to data memory 6.
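By way of illustration only, the issue constraint described above can be sketched as a simple check on the mix of instructions presented in one machine cycle. The sketch assumes, as a simplification that is not taken from FIG. 1, that each instruction can execute on exactly one functional unit type; the names UNITS_PER_TYPE and can_issue are hypothetical and chosen only for this example.

```python
from collections import Counter

# Two sets of four functional units: one L, S, M and D unit in each set,
# hence two units of each type in total (assumed model of FIG. 1).
UNITS_PER_TYPE = {"L": 2, "S": 2, "M": 2, "D": 2}


def can_issue(packet):
    """Return True if every instruction in the packet can be given its own unit.

    `packet` is a list of unit-type letters, one per instruction
    (hypothetical encoding: each instruction needs exactly one unit type).
    """
    demand = Counter(packet)
    return all(
        unit in UNITS_PER_TYPE and count <= UNITS_PER_TYPE[unit]
        for unit, count in demand.items()
    )


# Eight instructions that match the available units issue together ...
full_mix = ["L", "L", "S", "S", "M", "M", "D", "D"]
# ... but eight load/store instructions compete for only two D units.
all_loads = ["D"] * 8

print(can_issue(full_mix))   # True
print(can_issue(all_loads))  # False
```

Under this simplified model, a routine dominated by table lookups is limited by the two data units regardless of how many other functional units sit idle.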
Complex digital signal processing routines are now often involved in meeting the real-time demands of modern communications applications. One example of such critical path digital signal processing is the decoding involved in error correction of received signals. Low Density Parity Check (LDPC) decoding, “turbo” decoding, Viterbi decoding, and the like are examples of complicated and iterative processing routines that are now typically applied to relatively large data block sizes, and that can limit the overall data rates of the received communications. The Kasumi cipher, required for “3G” cellular communications, is another example of a complex and repetitive DSP routine. Other complex digital signal processing routines are involved in MIMO communications, and in channel estimation and equalization in several communications approaches. Discrete Fourier Transforms (DFTs) and Fast Fourier Transforms (FFTs) on large data block sizes are now commonplace in many applications.
The memory size and memory bandwidth in the conventional architecture of FIG. 1 have been observed, in connection with this invention, to especially constrain system performance in certain complex yet common DSP functions. For example, a typical 1200-point DFT requires up to 1200 separate “twiddle” factors, each of which must be retrieved from some memory resource and arithmetically applied to a data word. Another such function is a typical Kasumi cipher application, which involves two tables of random numbers, each of 128 to 512 elements. Local register files such as register files 4 of FIG. 1 are typically not sufficiently large to store such a large number of values; as such, data memory 6 must be accessed, repeatedly, in order for the architecture of FIG. 1 to perform its DFT or Kasumi task, in these examples. But the retrieval of these values from global data memory 6 adversely affects algorithm performance, considering the latency (i.e., number of machine cycles) for accessing these values from global data memory 6, and considering the necessity to involve the load/store functional units D1, D2 along with the functional unit executing the instruction. In addition, global data memory 6 is shared by both sets 2 of functional units, and as such the bandwidth into and out of memory 6 is similarly shared, leading to further increases in latency and thus slower performance. Worse yet, some digital signal processing operations involved in LDPC decoding, matrix algebra, turbo decoding, and Kasumi processing require that data be read or written by way of some permuted sequence of addresses. Such permutations substantially reduce the efficiency of memory access, because the ability to access contiguous memory addresses (i.e., in the same physical row of the memory) is not available in such cases.
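The two effects described above, namely the number of table values that must be fetched and the loss of row locality under permuted addressing, can be illustrated with a short sketch. It is not a model of any particular memory: the row size of 16 words, the stride-17 permutation, and the function names twiddle_table, dft and row_switches are assumptions made only for this example, and the DFT is computed directly rather than by any optimized method.

```python
import cmath


def twiddle_table(n):
    """Return the n twiddle factors exp(-2*pi*j*k/n) for k = 0..n-1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]


def dft(x, table):
    """Direct DFT of x; each multiply-accumulate performs one table lookup."""
    n = len(x)
    lookups = 0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * table[(k * t) % n]  # fetch one twiddle factor
            lookups += 1
        out.append(acc)
    return out, lookups


def row_switches(addresses, row_size):
    """Count consecutive accesses that land in different memory rows."""
    rows = [a // row_size for a in addresses]
    return sum(1 for prev, cur in zip(rows, rows[1:]) if prev != cur)


# A 1200-point transform needs a table of 1200 twiddle factors.
print(len(twiddle_table(1200)))

# Sequential versus permuted access to 64 words stored 16 words per row.
sequential = list(range(64))
permuted = [(17 * i) % 64 for i in range(64)]  # hypothetical permutation
print(row_switches(sequential, 16), row_switches(permuted, 16))
```

In this example the sequential sequence crosses a row boundary only when a row is exhausted, whereas the permuted sequence changes rows on every access, which is the loss of contiguity referred to above.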