Semiconductor companies offering baseband ICs for handsets face the following challenges: die size, power efficiency, performance, time to market, inter-/intra-RAT (Radio Access Technology) multimode support and evolving standards. Dedicated hardware designs have been used in the past because they give the best die size and power efficiency. A recent trend is to employ software-defined radio (SDR) based implementations because they offer a fast time to market. Which methodology is the most advantageous depends on the product requirements. Product requirements do, however, tend to change over time. Flexibility is typically very important for an early solution, whereas die size and power efficiency matter most for a mature solution. What is needed is a data processor that offers a straightforward migration path from a flexible solution to a die size and power efficient solution.
An SDR system is a radio communication system implemented by means of software on an embedded system. While the concept of SDR is not new, the rapidly evolving capabilities of digital electronics render practical many processes which previously were only theoretically possible.
The prior-art technologies that have been used to implement SDR solutions are Coarse Grain Reconfigurable Arrays (CGRA), Digital Signal Processors (DSP), and Reconfigurable Computing Platforms (RCP).
CGRAs offer high processing power, and high flexibility, with re-configurable Data Processing Units (rDPU), and a configurable communication fabric where the configurability is on a word or operand level (“Reconfigurable computing: architectures and design methods”, T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk and P. Y. K. Cheung, 2005).
There are many different CGRA implementations. A CGRA may utilise hundreds or only a handful of rDPUs, and may be homogeneous, where all the rDPUs are the same, or heterogeneous, where different rDPUs are specialised for different tasks. The rDPU capability can range from a minimal Arithmetic Logic Unit (ALU) or buffer to a complete processor. The re-configurability can also differ greatly between implementations.
One problem with CGRAs is that it is very difficult to implement, for example, software tools that map high-level C code onto the platform efficiently. The software designer is instead required to understand the platform in detail and write specific software tailored to the CGRA.
DSPs have a long history in the field and are typically used for the implementation of GSM baseband receivers. DSPs are optimised to process large amounts of data rather than to execute control code. High data throughput is achieved by exploiting parallelism in the algorithm. There are several architectural approaches to doing this. The most die size efficient, and hence the most relevant for embedded systems, are Very Long Instruction Word (VLIW) to exploit instruction parallelism, Single Instruction Multiple Data (SIMD) to exploit data parallelism, multithreading to exploit task parallelism, and command execution pipelining to increase throughput.
In a VLIW architecture, several commands for several functional units are issued in parallel. The most straightforward implementation of the VLIW architecture is a 3-way VLIW that allows Data-Load, Arithmetic and Data-Store to happen concurrently. More advanced implementations allow concurrent independent operation of multiple ALUs. All interdependencies are statically resolved by the compiler, so no dedicated hardware is required to detect and handle them. This makes the VLIW architecture attractive for embedded solutions. One issue with VLIW architectures is that they are quite inefficient at implementing data parallelism from an instruction code size point of view. A separate instruction is always required for each functional unit, even if data parallelism means that the same operation is performed multiple times. Code size is a very important issue for embedded solutions, and the SIMD architecture is a straightforward solution to this problem.
In the SIMD architecture, the same command is executed on the different elements of a data vector. The processor operates on a vector file, rather than a register file, in which the vectors are typically relatively short. A vector register is usually between 64 and 512 bits wide, which corresponds to between 4 and 32 16-bit vector elements. Long data vectors have to be split up into shorter vectors of the natively supported size before they can be processed.
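The splitting step described above can be illustrated with a minimal Python sketch. The function name split_into_native_vectors, the zero padding of the final chunk and the default 128-bit register width are assumptions made for illustration only; they are not taken from any particular SIMD instruction set.

```python
def split_into_native_vectors(samples, register_bits=128, element_bits=16, pad_value=0):
    """Split a long data vector into chunks that each fit one vector register.

    Hypothetical helper: returns the number of lanes per register and the
    list of chunks. The last chunk is padded so that every chunk is full.
    """
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of the element width")
    lanes = register_bits // element_bits  # e.g. 64 // 16 = 4, 512 // 16 = 32
    chunks = []
    for start in range(0, len(samples), lanes):
        chunk = list(samples[start:start + lanes])
        chunk += [pad_value] * (lanes - len(chunk))  # pad the final short chunk
        chunks.append(chunk)
    return lanes, chunks


def simd_add(chunk_a, chunk_b):
    """Model one SIMD instruction: the same operation applied to every lane."""
    return [a + b for a, b in zip(chunk_a, chunk_b)]
```

With a 64-bit register and 16-bit elements, a 10-sample vector is split into three 4-lane chunks, the last of which carries two padding elements. Each chunk then costs one simd_add instruction instead of one instruction per element.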
Amdahl's law states that the overall speedup obtained by accelerating subtasks is bounded by the portion of the work that is not accelerated. One way of combating this limit is to exploit instruction parallelism. A VLIW architecture is quite often employed to control a scalar unit, a SIMD unit and an address generation unit. VLIW allows simultaneous data load, data store, control code execution and vector data processing.
Multithreading means that independent software tasks can be executed in parallel, and it is commonly implemented by time-sharing the processor core. This is not very helpful in the embedded context because time-sharing increases overhead. For an SDR solution, there are typically hard time limits by which a certain number of tasks must have completed; it is normally not important that tasks can run concurrently. Multithreading can, however, be used to utilise functional units that would otherwise be idle. The VLIW architecture lends itself to this optimisation. Another way of implementing multithreading is to allow different tasks to occupy different stages of the command pipeline.
An RCP combines one or more conventional processors with one or more reconfigurable processing units (G. Estrin, “Reconfigurable computer origins: The UCLA fixed-plus-variable (F+V) structure computer,” IEEE Annals of the History of Computing, vol. 24, no. 4, pp. 3-9, 2002).
The most straightforward way of speeding up a system is via duplication. The task is split up into sub-tasks, which are executed in parallel on duplicated hardware. This is, however, quite expensive. The maximal speedup achievable in practice is limited by the interdependencies between the sub-tasks.
A reconfigurable architecture, on the other hand, allows the speeding up of the individual sub-functions. The cost of the extra hardware is limited because the hardware is reused. A big advantage of this approach is that interdependencies between subtasks do not matter, because they are still executed in sequence. Amdahl's law does, however, put an upper limit on the achievable overall speed increase.
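The upper limit can be made concrete with the usual form of Amdahl's law: if a fraction p of the original execution time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The Python sketch below is illustrative only; the function names and the example figures are assumptions, not measurements of any modem.

```python
def amdahl_speedup(accelerated_fraction, local_speedup):
    """Overall speedup when a fraction of the run time is sped up by a factor."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - accelerated_fraction  # part that is not accelerated
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


def amdahl_limit(accelerated_fraction):
    """Speedup bound as the local speedup grows without limit."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if accelerated_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)
```

As an assumed example, if kernels that account for 90% of the run time are made ten times faster, the overall speedup is roughly 5.3, and no amount of further kernel acceleration can push it beyond 10.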
One driver for the RCP architecture is that the overall modem processing consists of a sequence of computational kernels (e.g. FFT, Demodulation, and Decoding). A computational kernel typically implements a tight loop that processes a block of data. Data transfer between these kernels is relatively small. The problem can be separated into two tasks. One task is to implement the kernel. This is ideally done by reconfigurable hardware which executes each computational kernel efficiently. The second task is to reconfigure the hardware and schedule the execution of the kernels. This task is implemented by a conventional processor.
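The division of labour between the two tasks can be sketched in Python as follows. Everything here is hypothetical: the ReconfigurableUnit class merely models hardware that is configured for one kernel at a time, and the two toy kernels (a hard-decision step and a repetition-code majority vote) are simple stand-ins, not the FFT, demodulation or decoding kernels of a real modem.

```python
class ReconfigurableUnit:
    """Toy model of reconfigurable hardware holding one kernel at a time."""

    def __init__(self):
        self.kernel = None
        self.reconfigurations = 0

    def configure(self, kernel):
        # Task 2 (control): load a new kernel configuration.
        self.kernel = kernel
        self.reconfigurations += 1

    def run(self, block):
        # Task 1 (data path): the tight loop over one block of data.
        return self.kernel(block)


def hard_decision(block):
    """Stand-in kernel: map each soft sample to a bit (negative -> 1)."""
    return [1 if sample < 0 else 0 for sample in block]


def repetition_decode(bits, repeat=3):
    """Stand-in kernel: majority vote over groups of `repeat` bits."""
    groups = [bits[i:i + repeat] for i in range(0, len(bits), repeat)]
    return [1 if 2 * sum(group) > len(group) else 0 for group in groups]


def run_schedule(unit, schedule, block):
    """Conventional-processor role: reconfigure, then trigger each kernel."""
    for kernel in schedule:
        unit.configure(kernel)
        block = unit.run(block)
    return block
```

Only the block handed from one kernel to the next crosses the boundary between kernels, which reflects the small inter-kernel data transfer noted above. The control loop in run_schedule contains no signal processing of its own.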
There are four basic options for connecting the reconfigurable hardware to the processor: externally, via a processor bus, as a coprocessor, or via the register file (see Reconfigurable Architectures for Embedded Systems, Henrik Svensson, 2008).
The most suitable connection depends on the granularity of the operation of the reconfigurable hardware. An external connection can be used if the reconfigurable hardware is implemented in a field-programmable gate array (FPGA) and does not need to access the processor memory directly. The processor bus is an option if the reconfigurable hardware implements a whole kernel and data can be transferred via Direct Memory Access (DMA) to and from the processor memory. Alternatively, the reconfigurable hardware can be instantiated as a coprocessor with direct access to the processor cache. Finally, the reconfigurable hardware can also be a functional unit of the processor and operate directly on the register file.
One advantage of dedicated hardware solutions is that the operand bit widths can be optimised for each specific application. This minimises the die size of the logic as well as the size of the associated memory. SDR solutions are commonly limited to using 8-, 16-, 32- and 64-bit operands. It is unlikely that these match exactly the actual needs of the processing being implemented. SDR solutions therefore use larger operands than necessary and are less die size and memory area efficient than dedicated hardware.
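The cost of rounding operands up to a standard width can be estimated with simple arithmetic, sketched here in Python. The helper names and the example of a 10-bit sample are assumptions chosen for illustration; actual word lengths depend on the algorithm being implemented.

```python
STANDARD_WIDTHS = (8, 16, 32, 64)  # operand widths commonly available to SDR processors


def smallest_standard_width(required_bits):
    """Smallest standard operand width that can hold `required_bits`."""
    for width in STANDARD_WIDTHS:
        if required_bits <= width:
            return width
    raise ValueError("no standard width is large enough")


def memory_bits(num_samples, bits_per_sample):
    """Storage needed for a block of samples, in bits."""
    return num_samples * bits_per_sample


def padding_overhead(required_bits):
    """Relative extra storage caused by rounding up to a standard width."""
    return smallest_standard_width(required_bits) / required_bits - 1.0
```

For an assumed block of 4096 samples that each need 10 bits, tailored storage takes 40960 bits, whereas the smallest standard width (16 bits) takes 65536 bits, an overhead of 60%.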
Die size and power efficiency are crucial for mobile devices. For a processor based solution, both are directly related to the code size and the operand width. Prior-art technologies that have been used to implement SDR solutions utilise processors that implement elementary operations, short vectors and standard operand widths. The Stream Data Processor (SDP) solution improves die size and power efficiency by using high level operations, data stream operands and tailored operand widths. A data stream is a data sample sequence of arbitrary length. High level operations and data stream operands reduce the number of instructions. Tailored operand widths reduce the size of the data memory and the processing function.
The SDP operates on data streams and can access data stream elements in random order. This allows the efficient implementation of a range of high level functions such as de-interleaving, sorting, and matrix operations. Interleaving is a standard technique to distribute errors evenly over large data packets and is typically used on multiple levels in telecommunication standards. The digital baseband has to perform de-interleaving operations, but this is difficult to implement efficiently on a processor that operates on short vectors, as multiple iterations of shuffle operations are required to process large data packets. The SDP can however de-interleave a large data packet efficiently in a single pass.
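Why random access helps can be seen from a minimal Python sketch of a rectangular block interleaver, in which data are written row by row and read column by column. This particular interleaver and the function names are assumptions chosen for illustration; the interleavers defined by telecommunication standards are generally more elaborate, and the sketch does not describe the SDP's actual implementation.

```python
def block_interleave(data, rows, cols):
    """Write row-wise into a rows x cols block, read out column-wise."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    return [data[r * cols + c] for c in range(cols) for r in range(rows)]


def block_deinterleave(data, rows, cols):
    """Undo block_interleave in a single pass using computed read addresses.

    Output element i sat at row i // cols and column i % cols of the block,
    so it is found at position (i % cols) * rows + (i // cols) in the
    interleaved stream.
    """
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    return [data[(i % cols) * rows + (i // cols)] for i in range(rows * cols)]
```

Each output element is fetched directly from its computed address, so the whole packet is restored in one pass regardless of its length. A short-vector processor would instead need repeated shuffle operations across many registers to achieve the same reordering.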
The SDP based architecture offers a fast time to market with a straightforward migration path to dedicated hardware, and it can be prepared during algorithm development as soon as the basic processing functions are known. The dedicated hardware implementation of the processing elements and functions can be developed and tested in parallel with the algorithm development. The sequencing of the functions can remain programmable for an early platform. Once the algorithm has been frozen, the sequencing can be implemented in dedicated hardware, which is more die size and power efficient. This enables the development of a processing engine whose flexibility is tailored to the target application.