The present disclosure relates generally to pipeline processing of digital circuits and, more particularly, to methods and devices for minimizing power consumption in asynchronous dataflow architectures.
Current embedded computing systems have power efficiencies of roughly 1-10 billion floating point operations per second (GFLOPS) per Watt. However, future applications are anticipated to require at least 50 GFLOPS per Watt, and perhaps as much as 75 GFLOPS per Watt in the near future.
In the past, at feature sizes larger than 45 nm, computer architects could rely on increased computing performance with each processor generation. This was in accordance with both Moore's law (which resulted in a doubling of the number of transistors in each new generation) and Dennard's law (which resulted in increasing clock speeds by about 40 percent for each new generation without increasing power density). This scaling allowed for increased performance without the penalty of increased power. In other words, the power per unit area (power density) remained constant.
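As a rough illustration of why this scaling kept power density constant, the classical constant-field scaling argument can be written out as follows. The scaling factor $\kappa$ and the symbols $C$, $V$, $f$, and $A$ (switched capacitance, supply voltage, clock frequency, and area per transistor) are notation assumed here for illustration and do not appear in the disclosure itself.

```latex
\begin{gather*}
P \propto C V^{2} f \\
C \to \frac{C}{\kappa}, \quad V \to \frac{V}{\kappa}, \quad f \to \kappa f
\;\Longrightarrow\;
P \to \frac{C}{\kappa} \cdot \frac{V^{2}}{\kappa^{2}} \cdot \kappa f = \frac{P}{\kappa^{2}} \\
A \to \frac{A}{\kappa^{2}}
\;\Longrightarrow\;
\frac{P}{A} \to \frac{P/\kappa^{2}}{A/\kappa^{2}} = \frac{P}{A}
\end{gather*}
```

With $\kappa \approx \sqrt{2} \approx 1.4$, the transistor count per unit area doubles and the clock speed rises by about 40 percent per generation, consistent with the figures stated above, while the power density $P/A$ is unchanged.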
More recently, however, Dennard's law has broken down and clock speed scaling with respect to constant power density has not held. Consequently, each recent generation of chip technology that has experienced an increasing number of transistors (due to the continuation of Moore's law) now comes with the cost of increased power (due to the breakdown of Dennard's law). This in turn has caused power efficiency to reach a limit of about 10 GFLOPS per Watt. Thus, recent and future applications that need lower size, weight, and power (SWaP) will need efficiencies beyond this limit in order to fulfill their mission needs.
Existing solutions to the performance scaling problem have focused on various areas, including for example: (1) chip multiprocessors, (2) voltage scaling, (3) exploration of other energy-barrier devices, and (4) asynchronous or clockless techniques. Each of these approaches has both advantages and disadvantages. In the case of multicore processors or chip multiprocessors, the addition of more processors certainly increases chip performance. However, unless the power consumed per instruction is reduced, there will still be an increase in power density. In addition, multicore processors have proved to be very difficult to program and have failed to reach their utilization potential.
Dataflow-based approaches are very effective for problems that can be laid out in a parallel manner. This approach localizes data movements and eliminates nearly all memory traffic not required for algorithmic-temporal purposes. Both FPGAs and other alternative architectures have been developed that combine a large number of processing elements cross-connected with high-speed data paths. They offer the ability to perform parallel operations without constantly returning data to storage locations. Alternative reconfigurable architectures based on a word-level self-synchronized dataflow have been shown to have a 10× power efficiency improvement for PO and RE DoD missions, when compared to conventional processors (see, e.g., Prager, et al., “World's First Polymorphic Computer—MONARCH,” in 11th Annual High Performance Embedded Computing (HPEC) Workshop, 2007).
Recent Raytheon research into advanced reconfigurable approaches considers the close binding of the dataflow synchronization with asynchronous logic and voltage scaling logic to get an additional 100× power advantage. In this approach, as data arrives at the cell, a regulator increases supply voltage to accelerate the operation. However, as the output queue fills, the regulator reduces voltage to reduce power when downstream elements cannot use the results. Thus, power is automatically reduced to the lowest possible level for the input data rates and processing algorithms. Leakage power is reduced through the reduced voltages as well. Resilience to semiconductor performance variations due to doping or voltage is an additional benefit achieved by the asynchronous timing and local voltage regulation, allowing chips or portions of a chip to run as fast as possible and also to slow down, reducing power, if other parts of the chip cannot sustain the higher speed (see, e.g., Marr, et al., “An Asynchronously Embedded Datapath for Performance Acceleration and Energy Efficiency,” in Proceedings of the International Symposium on Circuits and Systems, 2012).
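The regulator behavior described above can be sketched in software as a minimal model, not as the circuit used in the cited research. The voltage limits `V_MIN` and `V_MAX`, the linear control law, and the function names are all hypothetical choices made for illustration. The only physical relationship assumed is that dynamic switching energy per operation scales with the square of the supply voltage.

```python
# Hypothetical supply-voltage limits for one dataflow cell (illustrative values).
V_MIN = 0.4  # volts; lowest voltage at which the cell still retains state
V_MAX = 1.0  # volts; voltage giving the fastest operation


def regulator_voltage(input_pending: int, output_fill: float) -> float:
    """Choose a supply voltage for a cell.

    input_pending: number of data words waiting at the cell input.
    output_fill:   fraction of the output queue occupied, from 0.0 to 1.0.

    Assumed control law: stay at V_MIN when idle; otherwise scale linearly
    from V_MAX (output queue empty) down to V_MIN (output queue full).
    """
    if input_pending <= 0:
        return V_MIN  # nothing to process, so stay at the lowest voltage
    fill = min(max(output_fill, 0.0), 1.0)  # clamp to the valid range
    headroom = 1.0 - fill  # how much more output downstream can absorb
    return V_MIN + (V_MAX - V_MIN) * headroom


def relative_energy_per_op(voltage: float) -> float:
    """Dynamic energy per operation relative to V_MAX, assuming E ~ C * V**2."""
    return (voltage / V_MAX) ** 2


if __name__ == "__main__":
    for pending, fill in [(0, 0.0), (4, 0.0), (4, 0.5), (4, 1.0)]:
        v = regulator_voltage(pending, fill)
        e = relative_energy_per_op(v)
        print(f"pending={pending} fill={fill:.1f} -> V={v:.2f} V, rel. energy={e:.2f}")
```

In this sketch, the voltage rises only when input data is waiting and downstream elements can accept results, so a stalled or idle cell settles at the lowest voltage on its own, without a global clock or a central power manager.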