Scientific research has come to rely heavily on computationally intensive numerical calculations. To take full advantage of the parallelism offered by custom hardware-based computing systems, the size of each individual arithmetic engine should be minimized. However, floating-point implementations (as used by most scientific software applications) have a high hardware cost, which severely limits the parallelism and thus the performance of a hardware accelerator. For legacy scientific software, the use of high precision is motivated more by the availability of the hardware in existing microprocessors than by the need for such high precision. For this reason, a reduced precision floating-point or fixed-point calculation will often suffice. At the same time, many scientific calculations rely on iterative methods, and convergence of such methods may suffer if precision is reduced. Specifically, a reduction in precision, which enables increased parallelism, would be worthwhile so long as the time to converge does not increase to the point that the overall computational throughput is reduced. In light of the above, a key factor in the success of a hardware accelerator is reaching the optimal trade-off between error, calculation time and hardware cost.
Approximation of real numbers with infinite range and precision by a finite set of numbers in a digital computer gives rise to approximation error, which is managed in two fundamental ways. This leads to two fundamental representations, floating-point and fixed-point, which limit (over their range) the relative and absolute approximation error, respectively. The suitability of limited relative error in practice, which also enables representation of a much larger dynamic range of values than fixed-point of the same bit-width, makes floating-point a favorable choice for many numerical applications. Because of this, double precision floating-point arithmetic units have for a long time been included in general purpose computers, biasing software implementations toward this format. This feedback loop between floating-point applications and dedicated floating-point hardware units has produced a body of almost exclusively floating-point scientific software.
Although transistor scaling and architectural innovation have enabled increased power for the above-mentioned general purpose computing platforms, recent advances in field programmable gate arrays (FPGAs) have narrowed the performance gap to application specific integrated circuits (ASICs), motivating research into reconfigurable hardware acceleration (see Todman reference for a comprehensive survey of architectures and design methods). These platforms stand to provide greater computational power than general purpose computing, a feat accomplished by tailoring hardware calculation units to the application, and exploiting parallelism by replication in FPGAs.
To leverage the FPGA parallelism, calculation units should use as few resources as possible, in contrast to the (resource demanding) floating-point implementations used in reference software. As mentioned above, however, the use of floating-point in the software results mostly from having access to floating-point hardware rather than from any need for a high degree of precision. Thus, by moving to a reduced yet sufficient precision floating-point or fixed-point scheme, a more resource efficient implementation can be used, leading to increased parallelism and with it higher computational throughput. Therefore, due to its impact on resources and latency, choice of data representation (allocating the bit-width for the intermediate variables) is a key factor in the performance of such accelerators. Hence, developing structured approaches to automatically determine the data representation is becoming a central problem in high-level design automation of hardware accelerators, either during architectural exploration and behavioral synthesis, or as a pre-processing step to register transfer-level (RTL) synthesis. Reducing bit-width reduces all elements of the data-path, most importantly memories, since each bit removed from the word is saved at every address. Furthermore, propagation delay of arithmetic units is often tied to bit-width, as is latency in the case of sequential units (e.g. sequential multiplier or divider).
Given the impact of calculation unit size on performance, a key problem that arises in accelerator design is that of numerical representation. Since instruction processors at the heart of most computing platforms have included hardware support for IEEE 754 single or double precision floating-point operations [see IEEE reference listed at the end of this document] for over a decade, these floating-point data types are often used in applications where they far exceed the real precision requirements simply because the support is there. While no real gains stand to be made by using custom precision formats on an instruction based platform with floating-point hardware support, calculation units within an accelerator can benefit significantly from reducing the precision from the relatively costly double or single IEEE 754. FIG. 1 depicts a system which executes a process of performing a calculation within tolerances, where the input to a calculation 100 known to a certain tolerance 101 produces a result 102 also within a tolerance that depends on the precision used to perform the calculation. Under IEEE 754 double precision, implementation of the calculation will have a certain cost 103 and produce a certain tolerance 104. Choosing a custom representation enables the tolerance of the result 105 to be expanded, while remaining within the tolerance requirements 106, in exchange for a smaller implementation cost 107. Some existing methods for determining custom data representations to leverage these hardware gains are discussed below.
Two main types of numerical representation are in widespread use today: fixed-point and floating-point. Fixed-point data types consist of an integer and a constant (implied) scale factor, which results in representation of a range of numbers separated by a fixed step size and thus bounded absolute error. Floating-point, on the other hand, consists (in simple terms) of a normalized mantissa (e.g. m with range 1≦m<2) and an encoded scale factor (e.g. as powers of two), which represents a range of numbers separated by a step size dependent on the represented number, and thus provides bounded relative error and a larger dynamic range of numbers than fixed-point.
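The distinction between bounded absolute error and bounded relative error can be illustrated with a minimal sketch. The function names quantize_fixed and quantize_float, and the choice of round-to-nearest, are assumptions made for illustration only and are not taken from any of the cited references; exponent range limits are ignored.

```python
import math

def quantize_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (fixed step size)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

def quantize_float(x, mant_bits):
    """Round x to mant_bits fractional mantissa bits, mantissa normalized to 1 <= |m| < 2.

    Illustrative only: the exponent is left unbounded.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    m, e = m * 2.0, e - 1       # renormalize so that 1 <= |m| < 2
    step = 2.0 ** -mant_bits
    return math.ldexp(round(m / step) * step, e)

for x in (0.3, 300.3):
    fx = quantize_fixed(x, 8)
    fl = quantize_float(x, 8)
    # Fixed-point: absolute error is bounded by half a step, whatever the size of x.
    # Floating-point: relative error is bounded by half a mantissa step.
    print(x, abs(fx - x), abs(fl - x) / abs(x))
```

With 8 fraction bits, both bounds equal 2**-9; the fixed-point bound applies to the absolute error and the floating-point bound to the relative error.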
Determining custom word sizes for the above-mentioned fixed- and floating-point data types is often split into two complementary parts: 1) representation search, which decides (based on an implementation cost model and feedback from bounds estimation) how to update a current candidate representation; and 2) bounds estimation, which evaluates the range and precision implications of a given representation choice in the design. An understanding of the bounds on both the range and precision of intermediate variables is necessary to reach a conclusion about the quantization performance of a chosen representation scheme, and both aspects of the problem have received attention in the literature.
Bounds estimation for an intermediate variable from a calculation itself has two aspects. Bounds on both the range and precision required must be determined, from which can be inferred the required number of exponent and mantissa bits in floating-point, or integer and fraction bits in fixed-point. A large volume of work exists targeted at determining range and/or precision bounds in the context of both digital signal processing (DSP) and embedded systems domains [see Todman reference], yielding two classes of methods: 1) formal based primarily on affine arithmetic [see Stolfi reference and the Lee reference] or interval arithmetic [see Moore reference]; and 2) empirical based on simulation [see Shi reference and Mallik reference], which can be either naive or smart (depending on how the input simulation vectors are generated and used). Differences between these two fundamental approaches are discussed below.
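As a minimal sketch of how bit counts follow from bounds, the hypothetical helper below maps a range bound and an absolute error bound to integer and fraction bit counts. It assumes a signed two's-complement fixed-point format with round-to-nearest quantization; the function name and conventions are illustrative assumptions.

```python
import math

def fixed_point_bits(lo, hi, max_abs_error):
    """Return (integer_bits, fraction_bits) for a signed fixed-point format.

    integer_bits includes the sign bit. Assumes round-to-nearest, so the
    quantization error is at most half a step, i.e. 2**-(fraction_bits + 1).
    Rounding of values at the very top of the range is not guarded against.
    """
    # Smallest f with 2**-(f + 1) <= max_abs_error.
    fraction_bits = max(0, math.ceil(math.log2(1.0 / (2.0 * max_abs_error))))
    magnitude = max(abs(lo), abs(hi))
    # frexp gives magnitude = m * 2**e with 0.5 <= m < 1, hence magnitude < 2**e.
    n = max(math.frexp(magnitude)[1], 0)
    return n + 1, fraction_bits

print(fixed_point_bits(-100.0, 100.0, 0.001))
```

A variable bounded by [-100, 100] with a required absolute error of 0.001 yields 8 integer bits (covering magnitudes below 128, plus sign) and 9 fraction bits (half a step is 2**-10).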
Empirical methods require extensive compute times and produce non-robust bit-widths. They rely on a representative input data set and work by comparing the outcome of simulation of the reduced precision system to that of the “infinite” precision system, “infinite” being approximated by “very high” (e.g. double precision floating-point on a general purpose machine). Approaches including those noted in the Belanovic reference, the Gaffar reference, and the Mallik reference seek to determine the range of intermediate variables by direct simulation, while the Shi reference creates a new system related to the difference between the infinite and reduced precision systems, reducing the volume of simulation data which must be applied. Although simulation tends to produce more compact data representations than analytical approaches, the resultant system is often not robust, i.e. situations not covered by the simulation stimuli can lead to overflow conditions resulting in incorrect behavior.
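The lack of robustness can be illustrated with a minimal sketch. The calculation, the stimuli, and the function names below are hypothetical and chosen only to show how a range inferred from simulation can be exceeded by an input that the stimuli did not cover.

```python
import math

def intermediate(x):
    # Hypothetical intermediate value of some datapath (assumed for illustration).
    return 1.0 / (x * x + 0.01)

def integer_bits_from_simulation(stimuli):
    """Size the integer part (sign bit included) from the peak seen in simulation."""
    peak = max(abs(intermediate(x)) for x in stimuli)
    return max(math.frexp(peak)[1], 0) + 1

def overflows(x, int_bits):
    """True if the intermediate value falls outside the signed range for int_bits."""
    return abs(intermediate(x)) >= 2.0 ** (int_bits - 1)

stimuli = [0.5, 1.0, 2.0, 4.0]
bits = integer_bits_from_simulation(stimuli)

print(bits)
print([overflows(x, bits) for x in stimuli])   # no overflow on the simulated inputs
print(overflows(0.05, bits))                   # an input not covered by the stimuli
```

The bit-width is adequate for every simulated input, yet an input closer to zero than any stimulus drives the intermediate value well beyond the allocated range.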
These methods are largely inadequate for scientific computing, due in part to differences between general scientific computing applications and the DSP/embedded systems application domain. Many DSP systems can be characterized very well (in terms of both their input and output) using statistics such as expected input distribution, input correlation, signal to noise ratio, bit error rate, etc. This enables efficient stimuli modelling providing a framework for simulation, especially if error (noise) is already a consideration in the system (as is often the case for DSP). Also, given the real-time nature of many DSP/embedded systems applications, the potential input space may be restricted enough to permit very good coverage during simulation. Contrast these scenarios to general scientific computing where there is often minimal error consideration provided and where stimuli characterization is often not as extensive as for DSP.
FIG. 2 illustrates the differences between simulation based (empirical) 200 and formal 201 methods when applied to analyze a scientific calculation 202. Simulation based methods 200 are characterized by a need for models 203 and stimuli 204, excessive run-times 205 and lack of robustness 206, due to which they cannot be relied upon in scientific computing implementations. In contrast, formal methods 201 depend on the calculation 202 only and provide robust bit-widths 207. An obvious formal approach to the problem is known as range or interval arithmetic (IA) [see Moore reference], which establishes worst-case bounds on each intermediate step of the calculation. Expressions for these bounds can be derived for the elementary operations, and compounded starting from the range of the inputs. However, since dependencies between intermediate variables are not taken into account, the range explosion phenomenon results: the range obtained using IA is much larger than the actual possible range of values, causing severe over-allocation of resources.
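The effect of ignoring dependencies can be seen in a minimal interval arithmetic sketch. The Interval class below is an illustrative assumption that implements only addition, subtraction and multiplication, without directed rounding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

x = Interval(1.0, 2.0)
# x - x is exactly 0 for every real x, but IA treats the two operands as
# independent and returns the full worst-case spread.
print(x - x)
```

For x in [1, 2], the expression x - x is identically zero, yet IA reports [-1, 1]. Compounded over many operations, such pessimism produces the range explosion described above.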
To combat this, affine arithmetic (AA) has arisen, which keeps track (linearly) of interdependencies between variables (e.g. see Fang reference, Cong reference, Osborne reference, and Stolfi reference). In AA, non-affine operations are replaced with an affine approximation, often including introduction of a new variable (consult the Lopez reference for a summary of approximations used for common non-affine operations). While often much better than IA, AA can still result in an overestimate of an intermediate variable's potential range, particularly when strongly non-affine operations occur as a part of the calculation, a compelling example being division. As the Fang reference points out, this scenario is rare in DSP, accounting in part for the success of AA in DSP; however, it occurs frequently in scientific calculations.
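A minimal affine arithmetic sketch shows both the benefit and the remaining overestimate. The Affine class and its multiplication rule (linear terms plus one new noise symbol bounding the product of the deviations) are a simplified illustration, not the specific approximations catalogued in the cited references.

```python
import itertools

_fresh = itertools.count()   # source of new noise symbol identifiers

class Affine:
    """Value of the form center + sum(coeff_i * eps_i), each eps_i in [-1, 1]."""

    def __init__(self, center, terms=None):
        self.center = center
        self.terms = dict(terms or {})   # noise symbol id -> coefficient

    @classmethod
    def from_range(cls, lo, hi):
        return cls((lo + hi) / 2.0, {next(_fresh): (hi - lo) / 2.0})

    def radius(self):
        return sum(abs(c) for c in self.terms.values())

    def range(self):
        r = self.radius()
        return (self.center - r, self.center + r)

    def __sub__(self, other):
        terms = dict(self.terms)
        for k, c in other.terms.items():
            terms[k] = terms.get(k, 0.0) - c   # shared symbols cancel
        return Affine(self.center - other.center, terms)

    def __mul__(self, other):
        # Affine approximation of a non-affine operation: keep the linear part
        # and bound the product of the deviations with a new noise symbol.
        terms = {}
        for k in set(self.terms) | set(other.terms):
            terms[k] = (self.center * other.terms.get(k, 0.0)
                        + other.center * self.terms.get(k, 0.0))
        terms[next(_fresh)] = self.radius() * other.radius()
        return Affine(self.center * other.center, terms)

x = Affine.from_range(1.0, 2.0)
print((x - x).range())   # dependency is tracked, so the range collapses to zero
print((x * x).range())   # true range is [1, 4]; the affine bound is wider
```

Subtraction of a variable from itself is resolved exactly because the shared noise symbol cancels, whereas the product, being non-affine, is only enclosed by a wider range.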
FIG. 3 summarizes the landscape of techniques that address the bit-width allocation problem. It should be noted that FIG. 3 includes prior art methods as well as methods which fall within the purview of the present invention. Empirical 300 methods require extensive compute times and produce non-robust 301 bit-widths; while formal 302 methods guarantee robustness 303, but can over-allocate resources. Despite the success of the above techniques for a variety of DSP and embedded applications, interest has been mounting in custom acceleration of scientific computing. Examples include: computational fluid dynamics [see Sano reference], molecular dynamics [see Scrofano reference] or finite element modeling [see Mafi reference]. Scientific computing brings unique challenges because, in general, robust bit-widths are required in the scientific domain to guarantee correctness, which eliminates empirical methods. Further, ill-conditioned operations, such as division (common in numerical algorithms), can lead to severe over-allocation and even indeterminacy for the existing formal methods based on interval 304 or affine 305 arithmetic.
Exactly solving the data representation problem is tantamount to solving global optimization in general, for which no scalable methods are known. It should be noted that, while relaxation to a convex problem is a common technique for solving some non-convex optimization problems [see Boyd reference], the resulting formulation for some scientific calculations can be extremely ill-conditioned. This leads, once more, to resource over-allocation.
While the above deals primarily with the bounds estimation aspect of the problem, the other key aspect of the bit-width allocation problem is the representation search, which relies on hardware cost models and error models, and has also been addressed in other works (e.g., [see Gaffar reference]). However, existing methods deal with only one of fixed- or floating-point representations for an entire datapath (thus requiring an up-front choice before representation assignment).
It should also be noted that present methods largely do not address iterative calculations. When iterative calculations have been addressed, such measures were primarily for DSP type applications (e.g. see Fang reference), such as a small stable infinite impulse response (IIR) filter.
There is therefore a need for novel methods and devices which mitigate if not overcome the shortcomings of the prior art as detailed above.