This patent application claims priority from U.S. Provisional Patent Application No. 60/323,763, filed Sep. 17, 2001.
This patent application describes inventions related to a novel digital signal processor (DSP) architecture for third generation and beyond (3G+) wireless baseband processing. DSPs are programmable microcomputers whose hardware, software and instruction sets are optimized for high-speed numeric processing applications. DSPs are widely used in wireless communication systems for various applications such as speech encoder/decoders (CODECs), channel equalizers, MAC layer operation and system controllers.
Where possible, DSPs are preferred to other devices such as application specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs) due to the DSPs' inherent flexibility and ease of programming. With the advent of software defined radio (SDR) and the convergence of global wireless markets, new impetus has been given to programmable and flexible radio architectures that can support a variety of wireless standards. Therefore, programmable DSPs are increasingly used in wireless systems, with an ever-increasing need to expand their application range to such computation-intensive areas as the baseband processing of the transmitter/receiver chain. However, the baseband units of emerging 3G wireless systems such as WCDMA require processing power that is not provided by any currently known DSP architecture.
Tremendous effort is being put into designing next-generation DSPs to meet the growing processing demand of wireless applications. Many new multiprocessing architectures are used to increase the processing power of DSPs. Examples of such architectures include pipelined single-instruction multiple-data (SIMD), multiple-instruction multiple-data (MIMD), and SIMD with array processing. These architectures are for the most part targeted at applications with inherent data parallelism, high regularity, and high throughput requirements. In a wireless terminal, or handset, these applications include baseband processing, video compression (discrete cosine transforms, motion estimation), data encryption, and DSP transforms.
One problem is that conventional DSPs, once programmed, are not easily reconfigurable to handle a variety of applications, nor are they flexible enough for applications that process irregular or nonparallel data.
FIG. 1 is a simplified block diagram of a reconfigurable DSP (rDSP) chip designed by Morpho Technologies, Inc., of Irvine, Calif., and the assignees hereof, which overcomes some of the shortcomings of conventional DSPs. The rDSP comprises a reconfigurable processing unit, a general-purpose reduced instruction set computer (RISC) processor and a set of I/O interfaces, all implemented as a single chip. At the center of the chip is an array of reconfigurable processing elements, also known as reconfigurable cells (RCs). Since most of the target applications possess word-level granularity, the RCs are coarse-grained, but they also provide extensive support for key bit-level functions. The RISC processor controls the operation of the RC fabric. A set of input/output (I/O) interfaces handles data transfers between external devices and the rDSP chip. Dynamic reconfiguration of the RC fabric is accomplished in one cycle by caching several contexts from off-chip memory on the chip.
FIG. 2 illustrates an rDSP chip 200 in greater detail, showing: the RISC processor with its associated instruction, data cache and memory controller; an RC array comprised of an 8-row by 8-column array of RCs; a context memory (CM); a frame buffer (FB); and a direct memory access (DMA) with its coupled memory controller. Each RC has several functional units (e.g. MAC, ALU, etc.) and a small register file, and is configured through a 32-bit context word.
The FB is analogous to an internal data cache for the RC array, and is implemented as a two-port memory. It makes memory accesses transparent to the RC array by overlapping computation processes with data load and store processes. The FB is organized as 8 banks of N×16 frame buffer cells, where N can be sized by a developer. The FB can thus provide 8 RCs (1 row or 1 column) with data, either as two 8-bit operands or one 16-bit operand, on every clock cycle.
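The frame buffer read described above can be illustrated with a minimal Python sketch. It is not the rDSP implementation: the function names, the one-bank-per-RC mapping, and the placement of the first 8-bit operand in the high byte are all assumptions made for illustration only.

```python
from typing import List, Tuple

NUM_BANKS = 8  # from the text: the FB is organized as 8 banks of N x 16 cells


def unpack_cell(cell: int, mode: str) -> Tuple[int, ...]:
    """Interpret one 16-bit frame buffer cell as operand(s).

    mode "16"  -> one 16-bit operand
    mode "8x2" -> two 8-bit operands (high byte first; byte order is assumed)
    """
    if not 0 <= cell <= 0xFFFF:
        raise ValueError("cell must fit in 16 bits")
    if mode == "16":
        return (cell,)
    if mode == "8x2":
        return (cell >> 8, cell & 0xFF)
    raise ValueError("mode must be '16' or '8x2'")


def read_row(banks: List[List[int]], address: int, mode: str) -> List[Tuple[int, ...]]:
    """Model one clock cycle: each bank supplies one cell to one of 8 RCs.

    The bank-to-RC mapping (bank i feeds RC i) is a hypothetical choice.
    """
    if len(banks) != NUM_BANKS:
        raise ValueError("expected 8 banks")
    return [unpack_cell(bank[address], mode) for bank in banks]
```

For example, with each bank holding the single cell `0x1234`, `read_row(banks, 0, "8x2")` yields the pair `(0x12, 0x34)` for every RC in the row, while mode `"16"` yields the single operand `0x1234`.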
The CM is the local memory that stores the configuration contexts of the RC array, much like an instruction cache. A context word from a context set is broadcast to all eight RCs in a row or column. All RCs in a row (or column) share a context word and perform the same operation, as shown in FIG. 3. Thus the RC array can operate in single-instruction, multiple-data (SIMD) form. For each row and each column there are 256 context words that can be cached on the chip. The context memory has a 2-port interface, which enables the loading of new contexts from off-chip memory (e.g. flash memory) during execution on the RC array.
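The row-wise context broadcast described above can be sketched in Python as follows. This is only an illustration of the SIMD behavior: the string opcodes stand in for the 32-bit context word, whose actual encoding is not given here, and the names `OPS`, `broadcast_row`, and `run_array` are hypothetical.

```python
import operator
from typing import List

# Hypothetical opcodes standing in for 32-bit context words.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def broadcast_row(context: str, row_a: List[int], row_b: List[int]) -> List[int]:
    """Apply one shared context to every RC in a row (SIMD across the row)."""
    op = OPS[context]
    return [op(a, b) for a, b in zip(row_a, row_b)]


def run_array(row_contexts: List[str],
              a: List[List[int]],
              b: List[List[int]]) -> List[List[int]]:
    """One step of an 8x8 array in row-broadcast mode: one context per row."""
    return [broadcast_row(ctx, ra, rb) for ctx, ra, rb in zip(row_contexts, a, b)]
```

Each row may receive a different context, but all eight RCs within a row perform the same operation on their own operands. Column-wise broadcast would follow the same pattern with the roles of rows and columns exchanged.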
RC cells in the array can be connected in two levels of hierarchy. First, RCs within each quadrant of 4×4 RCs are fully connected in a row or column. Furthermore, RCs in adjacent quadrants are connected via fast lanes, which enable an RC in a quadrant to broadcast its results to the RCs in the adjacent quadrant.
The RISC processor handles general-purpose operations and also controls operation of the RC array. It initiates all data transfers to and from the FB, and configuration loads to the CM through the DMA Controller. When not executing normal RISC instructions, the RISC processor controls the execution of operations inside the RC array every cycle by issuing special instructions, which broadcast SIMD contexts to RCs or load data between the frame buffer and the RC array. This makes programming simple since one thread of control flow is running through the system at any given time.
The structure of the 8×8 RC array is optimized for two-dimensional symmetric operations, such as image processing. However, this structure is not optimal for some other operations, such as wireless baseband modem algorithms. These other operations lead to underutilization of some of the array elements and/or data movement bottlenecks. Most CDMA modem algorithms require high initial data throughput, followed by low output data movement (i.e. despreading). In contrast, high-order modulations used in systems such as 802.11a (64 QAM) require higher data bandwidth at the output of the array after demodulation and detection. In both cases, a high data bandwidth is required to/from the RC array.
As discussed above, large data bandwidth is essential for most wireless modem applications. For example, a WCDMA voice channel (30 kbit/s) has a spreading factor of 256. This effectively means that for every data symbol, which is generated after 256 Multiply-Add-Accumulate (MAC) operations (nearly 4 clock cycles when spread across the 64 RCs), 256 data samples need to be loaded into the RC array (32 clock cycles at 8 samples per cycle from the FB). The data movement overhead for despreading is therefore nearly 700%.
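The cycle counts in the example above can be reproduced with a short Python calculation. It assumes that each of the 64 RCs completes one MAC per clock cycle and that the FB delivers 8 samples per clock cycle, and it reads the overhead figure as the extra load cycles relative to the compute cycles. That reading is an interpretation, and the function name is hypothetical.

```python
from typing import Tuple


def despreading_cycles(spreading_factor: int = 256,
                       num_rcs: int = 64,
                       samples_per_cycle: int = 8) -> Tuple[int, int, float]:
    """Return (compute_cycles, load_cycles, overhead_percent) for one symbol."""
    # Ceiling division: cycles needed if all RCs perform one MAC per cycle.
    compute = -(-spreading_factor // num_rcs)
    # Cycles needed to move every sample in at the FB delivery rate.
    load = -(-spreading_factor // samples_per_cycle)
    # Extra cycles spent on data movement, relative to compute cycles.
    overhead_pct = 100.0 * (load - compute) / compute
    return compute, load, overhead_pct


print(despreading_cycles())  # (4, 32, 700.0)
```

Under these assumptions the array spends 4 cycles computing and 32 cycles loading for each symbol, so the computation is limited by data movement, not by arithmetic capacity.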
What is needed is a new reconfigurable processing architecture for wireless baseband processing. Preferably, such an architecture would utilize the same hardware resources, namely the 64 RCs, a given frame buffer size, and other structures, that are found in the current reconfigurable processor design.