In deeply embedded application spaces such as power metering, hardware support for high-dynamic range arithmetic operations is important to maximize system performance and minimize device power dissipation. Conventional general purpose processing cores are optimized for general purpose applications, and often cannot support the required computational performance for many deeply embedded application spaces due to a lack of hardware support for high-dynamic range arithmetic operations such as 64-bit arithmetic operations like divide, square root, multiply and saturated fractional signal processing.
The leading microcontroller unit (MCU) providers have addressed the need for hardware support within embedded applications for such high-dynamic range arithmetic operations by either providing more advanced processing core architectures, for example the ARM™ Cortex™-M4 with an FPU (floating point unit) module, or by integration of a dedicated, memory-mapped arithmetic hardware unit with a more general purpose processing core. Whilst the more advanced processing core architectures may be suitable for high-end applications, their higher unit costs typically make them prohibitively expensive for lower-end applications. Accordingly, integration of a dedicated, memory-mapped arithmetic hardware unit with a general purpose processing core is required for providing hardware support within lower-end embedded applications within the specified size and power constraints.
Conventionally, a dedicated memory-mapped arithmetic unit connects to a microcontroller core through a hardware interface. The arithmetic unit is typically implemented as a hardwired logic circuit designed to calculate basic operations such as multiply, multiply-accumulate and multiply-subtract in a single clock cycle, and more advanced operations such as divide and square root in several clock cycles. While a stand-alone arithmetic unit can perform high-dynamic range calculations several times faster than the most common microcontroller cores, the ability to take advantage of that computational performance is typically limited by how efficiently the hardware interface couples the arithmetic unit to the general purpose microcontroller core, and vice-versa.
FIG. 1 illustrates a conventional basic hardware interface implementation for the memory-mapped arithmetic hardware unit 110. The memory-mapped arithmetic unit 110 is shown as a 32-bit design. Accordingly, the width of the programming model registers is 32 bits. For the 64-bit arithmetic operations being accelerated, two 32-bit registers are concatenated to form the required 64-bit data operand. By convention, the referenced 32-bit registers are named <REG>_H{igh} and <REG>_L{ow}.
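The concatenation of a <REG>_H and <REG>_L pair into one 64-bit operand can be sketched in C as follows. This is only an illustrative sketch; the helper names join64, low32 and high32 are hypothetical and do not appear in the figures.

```c
#include <stdint.h>

/* Combine a <REG>_H / <REG>_L register pair into one 64-bit operand.
 * The function names in this block are hypothetical. */
uint64_t join64(uint32_t reg_h, uint32_t reg_l)
{
    return ((uint64_t)reg_h << 32) | reg_l;
}

/* Least-significant 32 bits, i.e. the value held in <REG>_L. */
uint32_t low32(uint64_t value)
{
    return (uint32_t)(value & 0xFFFFFFFFu);
}

/* Most-significant 32 bits, i.e. the value held in <REG>_H. */
uint32_t high32(uint64_t value)
{
    return (uint32_t)(value >> 32);
}
```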
The basic hardware interface 120 illustrated in FIG. 1 comprises input operand registers (OP1_L, OP1_H, OP2_L and OP2_H) into which input operands are written for 64-bit arithmetic operations to be performed by the arithmetic unit 110, a control register used to select and initiate operations, and result registers (RES_L and RES_H) from which the results of the 64-bit arithmetic operations performed by the arithmetic unit 110 are read.
A 64=64/64 divide operation is used as an example of a 64-bit arithmetic operation to be performed by the arithmetic unit 110. For completeness, a 64=64/64 divide operation comprises a 64-bit numerator being divided by a 64-bit denominator to yield a 64-bit quotient. The 64=64/64 programming model realized by the basic hardware interface 120 illustrated in FIG. 1 comprises the following memory-mapped ‘accesses’ for a 64=64/64 divide operation:
1. ADDR(0x00) ← NUMERATOR_L // write least-significant 32 bits of numerator to OP1_L
2. ADDR(0x04) ← NUMERATOR_H // write most-significant 32 bits of numerator to OP1_H
3. ADDR(0x08) ← DENOMINATOR_L // write least-significant 32 bits of denominator to OP2_L
4. ADDR(0x0C) ← DENOMINATOR_H // write most-significant 32 bits of denominator to OP2_H
5. ADDR(0x10) ← CONTROL(0x01) // write to control register to select & trigger operation
6. QUOTIENT_L ← ADDR(0x14) // read the least-significant 32 bits of result quotient from RES_L
7. QUOTIENT_H ← ADDR(0x18) // read the most-significant 32 bits of result quotient from RES_H
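The seven accesses listed above can be sketched as a small software model in C. This is a hedged illustration rather than a description of the actual hardware: the type basic_if_t and the functions basic_write, basic_read and basic_div64 are hypothetical, the divide is assumed to be unsigned, the result is modelled as available immediately (no wait or polling is shown), and behaviour for a zero denominator is not described in the text and is simply rejected here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the FIG. 1 register file (word index = offset / 4). */
enum { OP1_L, OP1_H, OP2_L, OP2_H, CONTROL, RES_L, RES_H, NUM_REGS };

#define CTRL_DIV 0x01u /* control value used for 64=64/64 divide in the listing */

typedef struct {
    uint32_t reg[NUM_REGS];
    unsigned accesses; /* number of memory-mapped accesses performed so far */
} basic_if_t;

/* Model of one memory-mapped write; a write to CONTROL triggers the divide. */
void basic_write(basic_if_t *u, uint32_t offset, uint32_t value)
{
    assert(offset % 4 == 0 && offset / 4 < NUM_REGS);
    u->reg[offset / 4] = value;
    u->accesses++;
    if (offset / 4 == CONTROL && value == CTRL_DIV) {
        uint64_t num = ((uint64_t)u->reg[OP1_H] << 32) | u->reg[OP1_L];
        uint64_t den = ((uint64_t)u->reg[OP2_H] << 32) | u->reg[OP2_L];
        assert(den != 0); /* assumption: zero denominators are not modelled */
        uint64_t q = num / den; /* assumption: unsigned divide */
        u->reg[RES_L] = (uint32_t)q;
        u->reg[RES_H] = (uint32_t)(q >> 32);
    }
}

/* Model of one memory-mapped read. */
uint32_t basic_read(basic_if_t *u, uint32_t offset)
{
    assert(offset % 4 == 0 && offset / 4 < NUM_REGS);
    u->accesses++;
    return u->reg[offset / 4];
}

/* The seven accesses of the listing, in order. */
uint64_t basic_div64(basic_if_t *u, uint64_t num, uint64_t den)
{
    basic_write(u, 0x00, (uint32_t)num);         /* 1. OP1_L */
    basic_write(u, 0x04, (uint32_t)(num >> 32)); /* 2. OP1_H */
    basic_write(u, 0x08, (uint32_t)den);         /* 3. OP2_L */
    basic_write(u, 0x0C, (uint32_t)(den >> 32)); /* 4. OP2_H */
    basic_write(u, 0x10, CTRL_DIV);              /* 5. select & trigger */
    uint32_t lo = basic_read(u, 0x14);           /* 6. RES_L */
    uint32_t hi = basic_read(u, 0x18);           /* 7. RES_H */
    return ((uint64_t)hi << 32) | lo;
}
```

The accesses counter makes the cost of the separate control-register write visible: one divide costs seven accesses in this model.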
The basic hardware interface 120 is designed to perform one action with each access. The control register is provided to select the required operation and trigger its execution. For example, the control register may comprise a bit map for the supported operations, whereby each supported operation is identified by a single bit within the control register, as shown.
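A one-bit-per-operation control register of this kind might be encoded as in the following C sketch. Only the value 0x01 for the 64=64/64 divide is taken from the listing above; the other bit positions, the CTRL_OP_* names and the helper ctrl_is_valid are assumptions made purely for illustration.

```c
#include <stdint.h>

/* One-hot operation select bits. Only CTRL_OP_DIV = 0x01 comes from the
 * listing; the remaining assignments are hypothetical. */
enum {
    CTRL_OP_DIV  = 1u << 0, /* 64=64/64 divide (from the listing) */
    CTRL_OP_SQRT = 1u << 1, /* assumed bit position */
    CTRL_OP_MUL  = 1u << 2, /* assumed bit position */
    CTRL_OP_MAC  = 1u << 3  /* assumed bit position */
};

/* Returns non-zero when exactly one supported operation bit is set. */
int ctrl_is_valid(uint32_t ctrl)
{
    const uint32_t mask = CTRL_OP_DIV | CTRL_OP_SQRT | CTRL_OP_MUL | CTRL_OP_MAC;
    return ctrl != 0 && (ctrl & ~mask) == 0 && (ctrl & (ctrl - 1)) == 0;
}
```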
All of the operand, control and result registers are mapped contiguously in the address space, which allows use of memory load and store multiple instructions using indirect addressing with automatic post incrementing of the address register. Use of such indirect addressing load and store multiple instructions increases computational throughput of the basic hardware interface 120. However, performing a single 64-bit arithmetic operation using the basic hardware interface 120 requires an additional access to be performed by writing to the control register for selecting and triggering the arithmetic unit to perform the required arithmetic operation.
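The benefit of a contiguous register layout can be illustrated in C with a single walking pointer, which a compiler for a core with post-increment or store-multiple instructions may be able to lower to those instructions. This is a sketch under stated assumptions: the function store_operands is hypothetical, and a plain array stands in for the memory-mapped registers.

```c
#include <stdint.h>

/* Register order of FIG. 1; each entry is one 32-bit word. */
enum { OP1_L, OP1_H, OP2_L, OP2_H, CONTROL, RES_L, RES_H, NUM_REGS };

/* Write both 64-bit operands and the control word through one pointer that
 * is post-incremented after every store. Returns the number of words
 * written. The function name is hypothetical. */
unsigned store_operands(volatile uint32_t *base,
                        uint64_t op1, uint64_t op2, uint32_t ctrl)
{
    volatile uint32_t *p = base;
    *p++ = (uint32_t)op1;         /* OP1_L */
    *p++ = (uint32_t)(op1 >> 32); /* OP1_H */
    *p++ = (uint32_t)op2;         /* OP2_L */
    *p++ = (uint32_t)(op2 >> 32); /* OP2_H */
    *p++ = ctrl;                  /* CONTROL: the extra access */
    return (unsigned)(p - base);
}
```

The fifth store is the additional control-register access that the paragraph above identifies as the cost of this layout.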
It should be noted that programming model examples provided herein correspond to a “little endian” memory convention. However, it will be appreciated that other implementations can follow alternative memory conventions and organisations, for example, big endian.
FIG. 2 illustrates a known advanced hardware interface implementation for the memory-mapped arithmetic hardware unit 110. The advanced hardware interface 220 illustrated in FIG. 2 comprises a series of pairs of first input operand registers (OP1) into which input operands are written, a pair of second input operand registers (OP2) into which input operands are written and a pair of result registers (RES_L and RES_H) from which the results of the 64-bit arithmetic operations performed by the arithmetic unit 110 are read. The advanced hardware interface 220 uses a write to a first OP1 operand register pair to also select the operation to be performed by the arithmetic unit 110 and a write to the second OP2 operand register pair to initiate the operation.
A 64=64/64 divide operation is again used as an example of a 64-bit arithmetic operation to be performed by the arithmetic unit 110. The 64=64/64 programming model realized by the advanced hardware interface 220 illustrated in FIG. 2 comprises the following memory-mapped ‘accesses’:
1. ADDR(0x00) ← NUMERATOR_L // write least-significant 32 bits of numerator to OP1_L & select 64=64/64 operation
2. ADDR(0x04) ← NUMERATOR_H // write most-significant 32 bits of numerator to OP1_H
3. ADDR(0x44) ← DENOMINATOR_L // write least-significant 32 bits of denominator to OP2_L
4. ADDR(0x48) ← DENOMINATOR_H // write most-significant 32 bits of denominator to OP2_H & trigger 64=64/64 operation
5. QUOTIENT_L ← ADDR(0x4C) // read the least-significant 32 bits of result quotient from RES_L
6. QUOTIENT_H ← ADDR(0x50) // read the most-significant 32 bits of result quotient from RES_H
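The six accesses listed above can likewise be sketched as a software model in C. As before, this is an illustration under assumptions and not the actual hardware: the type adv_if_t and the functions adv_write, adv_read and adv_div64 are hypothetical, only the divide operation is modelled, the divide is assumed to be unsigned, and a zero denominator is rejected because its behaviour is not described in the text.

```c
#include <assert.h>
#include <stdint.h>

#define ADV_NUM_WORDS (0x54u / 4u) /* covers offsets 0x00 through 0x50 */

typedef struct {
    uint32_t reg[ADV_NUM_WORDS];
    int div_selected;  /* set by a write to the divide OP1_L at 0x00 */
    unsigned accesses; /* number of memory-mapped accesses so far */
} adv_if_t;

/* Model of one write: the OP1_L address selects the operation and the
 * write to OP2_H triggers it. */
void adv_write(adv_if_t *u, uint32_t offset, uint32_t value)
{
    assert(offset % 4 == 0 && offset / 4 < ADV_NUM_WORDS);
    u->reg[offset / 4] = value;
    u->accesses++;
    if (offset == 0x00) {
        u->div_selected = 1;
    } else if (offset == 0x48 && u->div_selected) {
        uint64_t num = ((uint64_t)u->reg[0x04 / 4] << 32) | u->reg[0x00 / 4];
        uint64_t den = ((uint64_t)u->reg[0x48 / 4] << 32) | u->reg[0x44 / 4];
        assert(den != 0); /* assumption: zero denominators are not modelled */
        uint64_t q = num / den; /* assumption: unsigned divide */
        u->reg[0x4C / 4] = (uint32_t)q;         /* RES_L */
        u->reg[0x50 / 4] = (uint32_t)(q >> 32); /* RES_H */
    }
}

/* Model of one memory-mapped read. */
uint32_t adv_read(adv_if_t *u, uint32_t offset)
{
    assert(offset % 4 == 0 && offset / 4 < ADV_NUM_WORDS);
    u->accesses++;
    return u->reg[offset / 4];
}

/* The six accesses of the listing, in order. */
uint64_t adv_div64(adv_if_t *u, uint64_t num, uint64_t den)
{
    adv_write(u, 0x00, (uint32_t)num);         /* 1. OP1_L & select */
    adv_write(u, 0x04, (uint32_t)(num >> 32)); /* 2. OP1_H */
    adv_write(u, 0x44, (uint32_t)den);         /* 3. OP2_L */
    adv_write(u, 0x48, (uint32_t)(den >> 32)); /* 4. OP2_H & trigger */
    uint32_t lo = adv_read(u, 0x4C);           /* 5. RES_L */
    uint32_t hi = adv_read(u, 0x50);           /* 6. RES_H */
    return ((uint64_t)hi << 32) | lo;
}
```

Note the jump from offset 0x04 to offset 0x44 between the second and third accesses: the operand writes are no longer at consecutive addresses, so a single post-incremented pointer cannot cover them.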
The advanced hardware interface 220 is designed to perform multiple actions along with each particular access; operation type being selected based on the first input operand register address and triggered by the last write to the second operand. In this manner, the separate access to the control register required for the basic hardware interface 120 illustrated in FIG. 1 is not required, reducing the number of accesses required to be made to perform a single 64-bit arithmetic operation and retrieve the result to six.
However, the register mappings of the advanced hardware interface 220 cannot, in principle, be sequentially addressed for all operations, and thus the use of instructions with indirect addressing with automatic post incrementing of the address register is very limited or not applicable at all. Furthermore, the number of registers required to be implemented within the advanced hardware interface 220 is greatly increased as compared with the basic hardware interface 120 due to the need for separate first input operand registers (OP1) to be provided for each supported operation.
These known hardware interfaces to dedicated arithmetic units illustrated in FIGS. 1 and 2 introduce different performance inefficiencies caused by the use of a specific register to select and trigger arithmetic operations or the non-sequential register layout.