1. Field of the Invention
The present invention relates to ultra-low power (ULP) embedded systems, and to application specific instruction set processors (ASIPs) therefor. More particularly, the present invention relates to methods and devices for reducing power consumption in such ASIPs.
2. Description of the Related Technology
Low power implementation is a key element in embedded system design. Low-power techniques find use in various domains, ranging from the biomedical domain to consumer electronics. Application Specific Instruction set Processors (ASIPs) are used instead of general-purpose processors to reduce power consumption. These ASIP architectures have to be customized to reduce the power consumption.
In applications that exhibit a high degree of data-level parallelism, such as multimedia or wireless communications, the single instruction multiple data (SIMD) technique is used to greatly reduce application execution time and energy consumption, thus increasing the performance of applications. In SIMD, multiple data items (sub-words) are packed together and operated on as a single word, as illustrated in FIG. 1. SIMD exploits the data parallelism that is contained in applications, for example in multimedia. Thereby, the cost in energy and time associated with a SIMD operation (operand load, execution, result storage) is shared by all the sub-words. Depending on the platform support, hardware SIMD (Hard-SIMD) and software SIMD (Soft-SIMD) are possible. By applying SIMD, Soft or Hard, the increase in parallelism leads to a reduction in the number of operations performed on the data and in the number of accesses to the instruction memory. As fewer operations have to be scheduled, performance can be improved.
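The packing of sub-words into a single word can be sketched as follows in Python (an illustrative model only; the names `pack` and `unpack` are chosen here and do not come from any particular SIMD implementation):

```python
def pack(subwords, width):
    """Pack a list of sub-words (LSB-first) into one machine word."""
    word = 0
    for i, s in enumerate(subwords):
        assert 0 <= s < (1 << width), "sub-word must fit its field"
        word |= s << (i * width)
    return word

def unpack(word, width, count):
    """Recover the individual sub-words from a packed word."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

# Four 8-bit sub-words packed into one 32-bit word; a single 32-bit
# addition then operates on all four sub-words at once, provided no
# sub-word overflows its 8-bit field:
w = pack([10, 20, 30, 40], 8)
v = pack([1, 2, 3, 4], 8)
lanes = unpack(w + v, 8, 4)   # element-wise sums of the sub-words
```

The single addition `w + v` illustrates how the cost of one word-level operation is shared by all packed sub-words.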
Hardware SIMD (Hard-SIMD) is currently used by commercial energy-efficient high-performance processors, which provide special hardware in their data path that supports combinations of sub-words of the same length (e.g. a TIC64 of Texas Instruments supports 1×32-bit, 2×16-bit or 4×8-bit sub-words). However, the limited number of different sub-word sizes (typically powers of two) forces the data up to the next available (power-of-2) hardware sub-word. This leads to unused bits and therefore a loss in efficiency. Moreover, different Hard-SIMD implementations are incompatible with each other, because each implementation provides a number of SIMD instructions and a set of pack and unpack instructions based on the target architecture, leading to low portability of the application. As an example, hardware SIMD with 16-bit sub-words is illustrated in FIG. 2.
Software SIMD (Soft-SIMD) is independent of the target architecture, providing portability of the application. It does not require operators with specific hardware support; as a result, additional instructions must be inserted in order to guarantee functional correctness. Guaranteeing that sub-words remain independent is now a designer's task. This can be done, e.g., by inserting guard bits between sub-words, which prevent bits from rippling across sub-word boundaries. Such an approach enables high flexibility in the sub-word configuration: sub-words of arbitrary length can be operated on together as a single word. Soft-SIMD with different sub-words is illustrated in FIG. 3.
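The guard-bit mechanism can be sketched in Python (a minimal model, assuming one zero guard bit per sub-word; the function names are chosen here for illustration):

```python
GUARD = 1  # zero guard bits inserted above each sub-word field

def pack_guarded(subwords, width):
    """Pack sub-words LSB-first, leaving GUARD zero bits per field."""
    word = 0
    for i, s in enumerate(subwords):
        assert 0 <= s < (1 << width), "sub-word must fit its field"
        word |= s << (i * (width + GUARD))
    return word

def unpack_guarded(word, width, count):
    """Extract the sub-words, masking away the guard bits."""
    field = width + GUARD
    mask = (1 << width) - 1
    return [(word >> (i * field)) & mask for i in range(count)]

# 200 + 100 overflows an 8-bit field; the carry is absorbed by the
# zero guard bit instead of rippling into the neighbouring sub-word,
# so the second lane still correctly holds 7 + 8 = 15:
s = pack_guarded([200, 7], 8) + pack_guarded([100, 8], 8)
lanes = unpack_guarded(s, 8, 2)
```

Without the guard bit, the carry of the first lane would have corrupted the second lane; in a real Soft-SIMD design the designer's word-length analysis must also ensure the guard bits are cleared or re-initialized between operations.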
The Soft-SIMD speed-up in performance is lower than for Hard-SIMD, but thanks to the flexibility of the sub-words it can exploit the bit-width (word-length) information by making more efficient combinations. Soft-SIMD can improve performance, energy efficiency and flexibility. However, care must be taken not to lose the obtained gain through the insertion of the extra operations needed for correctness, such as packing and masking operations. Soft-SIMD is capable of packing different sub-words, leading to more efficient parallelization, and also enables the use of SIMD in applications that cannot benefit from Hard-SIMD, e.g. applications where the minimum sub-word is larger than the sub-words supported in hardware.
Soft-SIMD suffers from two major drawbacks. Firstly, only positive numbers can be operated on. Guard bits need to be equal for all the sub-words, but due to the nature of 2's complement representation, 0's are required as guard bits for positive numbers and 1's for negative ones. This can easily be solved by adding an offset to the computed data. This offset has to be adapted at different parts of the processing in order to maintain the positive sign while optimally utilizing the available bit-widths. The second drawback comes from the fact that multiplications require a prohibitive number of guard bits. This seriously reduces efficiency and thus the application scope of the technique. The need for many guard bits is related to the bit-width growth experienced in a multiplication, as explained below.
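The offset trick for keeping sub-words positive can be sketched in Python (illustrative only; the function names are chosen here, not taken from the text):

```python
def to_offset(x, bits):
    """Map a signed value into the non-negative range [0, 2^bits)."""
    offset = 1 << (bits - 1)
    assert -offset <= x < offset
    return x + offset

def from_offset(u, bits):
    """Remove the offset to recover the signed value."""
    return u - (1 << (bits - 1))

# Adding two offset-coded values accumulates the offset twice, so one
# offset must be subtracted to re-normalize -- an example of adapting
# the offset at different parts of the processing:
a, b = -37, 90
s = to_offset(a, 8) + to_offset(b, 8) - (1 << 7)
signed_sum = from_offset(s, 8)   # recovers a + b
```

Because all intermediate values stay non-negative, zero guard bits suffice for every sub-word regardless of the sign of the original data.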
Multiplications are ‘expensive’ operations with a high execution duty cycle, and thus their optimization has a big impact on the overall implementation cost. It is generally known that a constant multiplication can be converted into a shift-add operation: a multiplication of a variable by a constant can be factorized into a fixed sequence of shifts and add/subtracts. The number of shifts and add/subtracts required depends directly on the number of non-zero bits in the constant. As an example, the multiplication of a variable a by the constant 138 is considered (3 non-zero bits: 10001010). This constant multiplication is equivalent to the following:

b=a·138=(a&lt;&lt;7)+(a&lt;&lt;3)+(a&lt;&lt;1)  (Eq. 1),

where the constant multiplication is replaced with one shift per non-zero bit (shift by 7, 3 and 1) and the addition of the shifted results. Clearly, constants with a minimum of non-zero bits are preferred as they lead to a cheaper implementation. Accordingly, constants are coded in CSD (Canonic Signed Digit) form. The CSD representation contains the minimum possible number of non-zero bits, which is, on average, 33% fewer non-zero bits than the two's complement representation.
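The conversion above can be sketched generically in Python (an illustrative model; the helper names are chosen here):

```python
def nonzero_bit_positions(c):
    """Positions of the non-zero bits of a positive constant c."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def const_mul(a, c):
    """Multiply a by constant c with one shift-add per non-zero bit."""
    return sum(a << i for i in nonzero_bit_positions(c))

# 138 = 0b10001010 has non-zero bits at positions 7, 3 and 1, so
# const_mul(a, 138) computes (a << 7) + (a << 3) + (a << 1), as in
# (Eq. 1). CSD recoding can reduce the count further: a run of ones
# such as 15 = 0b1111 (four adds) becomes (a << 4) - a, i.e. one
# shift and one subtract.
```

This sketch uses the plain binary form of the constant; a CSD-based version would also emit subtract terms for the negative digits.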
Although Constant Multiplication to Shift-Add/subtract (CM2SA) conversion is extensively exploited in VLSI (Very Large Scale Integration) synthesis, its application in compilers is rather limited. The systematic application of strength reduction techniques could even lead to a less efficient implementation, depending on the processor architecture. Thus, very conservative assumptions are normally made, e.g. applying CM2SA only for constants that are a power of 2. However, in an ASIP implementation the processor architecture is not fixed, and the processor's data path can be designed to effectively support shift-add/subtract operations.
The CM2SA example given above is based on left-shift operations, which emulate the data flow of a typical multiplier. However, this suffers from a severe drawback when fixed-point data of multiple bit-widths is operated on: the output data bit-width grows depending only on the inputs (e.g. in an unsigned multiplier: outbit=in1bit+in2bit). In case the precision required in the result is lower, the bit-width can only be reduced at the output of the operation with a right-shift. To illustrate the principle, the same example is considered, but now taking bit-width information into account: the variable a requires 7 bits, 138 requires 8 bits and the result b requires 9 bits. FIG. 4(a) shows how this is operated on a CM2SA based on left-shifts according to equation (Eq. 1) above. A variable a of all ‘1s’ (127) is considered as the worst case for dynamic range growth. The result b is obtained by adding three shifted versions of a. As only the upper 9 bits of the output are relevant, the result b is truncated by right-shifting 6 positions. However, the operator needs to be dimensioned for a 15-bit output in order to avoid overflow. Given that only 9 bits are required, this incurs undesired overhead.
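The worst-case dynamic range growth described above can be checked with a short Python sketch (illustrative only, mirroring the figures given in the text):

```python
a = 127                                   # 7-bit worst case, all '1s'
full = (a << 7) + (a << 3) + (a << 1)     # left-shift CM2SA of (Eq. 1), = a * 138
b = full >> 6                             # keep only the upper 9 bits

# full requires 15 bits, so a left-shift-based operator must be
# dimensioned for 15-bit results even though only the 9 bits of b
# are ultimately kept -- the overhead noted in the text.
width_full = full.bit_length()            # 15
width_b = b.bit_length()                  # 9
```

The gap between the 15-bit operator width and the 9-bit useful result quantifies the overhead that motivates the alternative treated further on.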
There is room for improved methods and devices for reducing power consumption in embedded systems.