The present invention relates to a method of, and apparatus for, optimisation of data paths in parallel pipelined hardware. In one embodiment, the present invention relates to a method of, and apparatus for, optimisation of data paths of a stream processor for implementation on a Field-Programmable Gate Array (FPGA).
Computer systems are often used to perform complex numerical calculations. Often, many iterations of a calculation are required in order to compute the variables in a computation, for example when modelling real-world phenomena such as wave propagation in a medium. Such calculations often comprise multi-dimensional arrays and vast numbers of data points. As such, these calculations require considerable computing resources to complete.
One approach to improve the speed of a computer system for specialist computing applications is to use additional or specialist hardware accelerators. These hardware accelerators increase the computing power available and concomitantly reduce the time required to perform the calculations.
A suitable system is a stream processing accelerator having a dedicated local memory. The accelerator may be located on an add-in card which is connected to the computer via a bus such as Peripheral Component Interconnect Express (PCI-E). The bulk of the numerical calculations can then be handled by the specialised accelerator.
Stream processor accelerators can be implemented using, for example, Field-Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs) and/or structured ASICs. In certain cases, such an arrangement may increase the performance of highly parallel applications by an order of magnitude or more.
A schematic example of an FPGA device is shown in FIG. 1. Different types of FPGA chips may be used; however the larger and more arithmetic function-rich FPGAs are more desirable. The FPGA 10 comprises a programmable semiconductor device which comprises a matrix of configurable logic blocks (CLBs) 12 connected via programmable reconfigurable interconnects 14 (shown here as the shaded area in FIG. 1). In order to get data into and out of the FPGA 10, a plurality of input pads 16 and output pads 18 are provided.
The CLBs 12 are the basic logic unit of the FPGA 10. A schematic diagram of a typical CLB 12 is shown in FIG. 2. The CLB 12 comprises a configurable switch matrix comprising typically a 4 or 6 input look up table (LUT) 20, which in some cases may also be configurable as a small buffer of up to about 32 bits, some specialist circuitry (such as, for example, a multiplexer), one or more flip-flop units 22 which act as temporary memory storage and an output 24. Additionally, an FPGA 10 comprises a plurality of block memory units 26. The block memory units 26 comprise addressable memory units which can be used as storage buffers in the FPGA 10.
The LUTs 20 of each CLB 12 can be configured to perform a variety of functions; for example, logic gates such as NAND and XOR, or more complex functions. A typical FPGA may comprise up to $10^{5}$ LUTs 20. The CLBs 12 are able to operate in parallel, providing a powerful resource for numerically-intense calculations.
FPGA-based stream processors comprise calculation functions mapped into one or more hardware units along the path from input to output. The FPGA then performs the computation by streaming the data items through the hardware units. Once the computation is complete, the data then moves “downstream” to further hardware units or to an output. The streaming architecture makes efficient utilization of the computation device, as every part of the circuit is performing an operation on one corresponding data item in the data stream at any point during the calculation.
FIG. 3 shows an example of such a streaming architecture created using the CLBs 12 of the FPGA to implement a stream processor thereon. FIG. 3 shows a 4 input 16-i, 4 output 18-i stream computing engine which can be implemented on the FPGA stream processor 10.
Between the inputs 16-i and the outputs 18-i is provided a computational data path 30. The computational data path 30 is a graphical representation of an algorithm as it is expressed in hardware. The computational data path 30 is also referred to as a kernel. A typical FPGA 10 may comprise a multiplicity of parallel kernels.
The computational data path 30 is implemented using the CLBs 12 and other logic and comprises arithmetic operations 32 (performed in one or more LUTs 20) and buffer memories 26. In other words, each arithmetic unit 32 is implemented in hardware as a hardware element (which may comprise one or more hardware units) on the FPGA. The buffer memories 26 may comprise either block RAM (as provided by the block memory units 26) or distributed RAM (comprising the memory made available through use of the LUTs 20 or flip flops 22). As shown, the computational data path 30 is arranged to process data in parallel.
In operation, the data is streamed through the CLBs 12 of the FPGA stream processor 10 and the arithmetic operations 32 are carried out on the data as it is streamed.
Often, FPGA circuits are designed using circuit schematics or a hardware description language (HDL) such as, for example, Verilog. HDLs are used to write synthesisable specifications for FPGA hardware. A simulation program is run which enables simulation of the desired spatial and temporal configuration of the FPGA so that the operation of the FPGA can be modelled accurately before being physically created. HDLs include syntax for expressing parallelism (also known as concurrency) and may include an explicit notion of time.
A number of different numerical arithmetic formats are available for use by the accelerator hardware in order to perform numerical calculations. Two commonly used numerical formats are floating point arithmetic and fixed point arithmetic.
The floating-point data format representation is used to represent numbers which cannot be efficiently represented as integers. Floating-point format numbers are in general represented to a fixed number of significant digits and scaled using an exponent; as such, the format is similar in concept to scientific notation. The base for the scaling is normally 2, 10 or 16. The typical number that can be represented exactly is of the form shown in Equation 1:
$$M \times B^{e} \quad \text{(Equation 1)}$$
where M is known as the “mantissa”, B is the base and e is the exponent. The mantissa M comprises a digit string of a given length in a given base B′ (which may be the same as B or different). The radix point is not explicitly included, but is implicitly assumed to lie in a certain position within the mantissa, often just after or just before the most significant digit, or to the right of the rightmost digit. The length of the mantissa determines the precision to which numbers can be represented.
The mantissa is multiplied by the base B raised to the power of the exponent e. If B is the base for the mantissa, then this operation is equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent: to the right if the exponent is positive or to the left if the exponent is negative. Alternatively, B may differ from the base of the mantissa; for example, the mantissa could be binary but B=4.
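By way of illustration only, the following Python sketch decomposes a value into the form of Equation 1 with B=2. It relies on the standard library function `math.frexp`, which returns a mantissa whose magnitude lies in the range 0.5 to 1 together with an integer exponent; the variable names are chosen for this example and do not form part of the described method.

```python
import math

# Decompose 6.5 into mantissa * 2**exponent (Equation 1 with B = 2).
mantissa, exponent = math.frexp(6.5)

# Rebuilding the value from its parts shows the scaling by B**e.
rebuilt = mantissa * 2 ** exponent

print(mantissa, exponent, rebuilt)
```

Multiplying the mantissa by two raised to the exponent shifts the binary radix point by `exponent` places, as described above.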
The term “floating-point” relates to the ability of the radix point (or decimal point) to “float”. By this is meant that the radix point can be placed anywhere relative to the significant digits of the mantissa.
A number of different floating-point representations have been used in computers. However, a widely adopted standard is that defined by the IEEE 754 Standard, the most common formats of which are single precision and double precision. Single precision is a binary format that occupies 32 bits (4 bytes) and its mantissa has a precision of 24 bits (about 7 decimal digits). Any integer with absolute value less than or equal to $2^{24}$ can, therefore, be exactly represented in the single precision format. Double precision is a binary format that occupies 64 bits (8 bytes) and its mantissa has a precision of 53 bits (about 16 decimal digits). Therefore, any integer with absolute value less than or equal to $2^{53}$ can be exactly represented in the double precision format.
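The integer limits stated above can be checked with a short Python sketch. The helper `to_float32` is a hypothetical name introduced here; it rounds a value to single precision by packing and unpacking it with the standard `struct` module. Python's own `float` type is assumed to be IEEE 754 double precision, which is the case on common platforms.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

single_limit = 2 ** 24
double_limit = 2 ** 53

# Integers up to the limit survive the conversion unchanged.
single_ok = to_float32(float(single_limit)) == single_limit
# The next integer cannot be represented and is rounded to a neighbour.
single_next_ok = to_float32(float(single_limit + 1)) == single_limit + 1

double_ok = float(double_limit) == double_limit
double_next_ok = float(double_limit + 1) == double_limit + 1

print(single_ok, single_next_ok, double_ok, double_next_ok)
```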
In contrast, a fixed-point data format representation comprises a number having a fixed number of digits after (and sometimes also before) the radix point (or decimal point). The fixed-point data format comprises an integer scaled by a specific scale factor determined by the type. For example, in order to represent the binary value 1.01, the fixed point binary number 1010 could be used with binary scaling factor of 1/1000. Generally, the scaling factor is the same for all values of the same type, and does not change during a computation. In most computational uses, the scaling factor is generally a power of two for use with binary notation. However, other scaling factors may be used.
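The scaled-integer view of fixed point can be sketched as follows, purely as an illustration. The functions `fixed_to_value` and `value_to_fixed` are hypothetical helpers, and the scaling factor is assumed to be a power of two expressed as a count of fractional bits.

```python
from fractions import Fraction

def fixed_to_value(raw: int, frac_bits: int) -> Fraction:
    """Interpret the stored integer `raw` with scaling factor 2**-frac_bits."""
    return Fraction(raw, 2 ** frac_bits)

def value_to_fixed(value, frac_bits: int) -> int:
    """Return the stored integer nearest to `value` for the given scaling."""
    return round(Fraction(value) * 2 ** frac_bits)

# Binary 1010 with a binary scaling factor of 1/1000 (i.e. 1/8) is binary 1.01.
example = fixed_to_value(0b1010, 3)
print(example, float(example))
```

Here the stored integer 1010 (decimal 10) divided by eight gives 1.25, which is 1.01 in binary, matching the example in the text.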
When implemented on an FPGA, fixed point arithmetic units have an advantage over floating point units in that they require significantly less logic area to implement in hardware. For example, consider a floating point adder consisting of an “add block” that combines two inputs to produce one output.
A floating point adder for IEEE 754 single precision 32-bit floating point requires three computational stages: 1) to align the smaller input to the same exponent as the larger input, requiring the use of a barrel shifter; 2) to add the two mantissas (a fixed-point add); and 3) to normalize the result to the top of the floating point mantissa and calculate the exponent for the result, requiring another barrel shifter. In total, around 500 LUTs may be required to perform this calculation in floating point format.
In contrast, the calculation performed using a fixed point adder requires considerably fewer hardware resources. For example, a fixed point adder combining two 24-bit inputs would require 24 look-up tables (LUTs). Alternatively, a fixed point adder combining two 32-bit inputs typically requires 32 LUTs.
Therefore, it is apparent that the use of fixed point calculations translates into a significant reduction in silicon area for a given number of functions implemented on-chip. Consequently, many types of stream processor, for example those based around field-programmable gate arrays (FPGAs), are generally able to deliver more computational performance for fixed point (fixed range) numbers than for floating point. Since floating point arithmetic units require significantly more logic area on chip to implement than fixed point arithmetic units, more fixed point units can be implemented in parallel in the same silicon area. It is therefore beneficial to use fixed point data representations and fixed point calculations in order to achieve maximum performance.
Further, in many computational applications, the approximate range of the data is understood. Therefore, the extra number range provided by the floating point format (which may, for example, in IEEE 754 single precision format provide a range of approximately $\pm 10^{39}$) is unnecessary, and some of the available range is wasted. Consequently, for situations where the number range used is known, the fixed point format is more resource-efficient.
However, there are disadvantages to the use of the fixed-point data format which have, to date, prevented fixed-point numerical units from finding widespread use in iterative numerical computational hardware. Fixed point data, by its very nature, lacks range flexibility and may lead to errors in calculations. Two situations in which errors may occur are in the cases of underflow and overflow.
Underflow to zero occurs when a data value becomes too small to be represented differently from zero, and is rounded to zero. This leads to a loss of precision since small values that should still contribute to the result of the computation are lost. Underflow-to-zero has a negative effect on the accuracy of a computation that is related to the nature of the computation and the number of and significance of the data values lost—in general, it is advantageous to minimize underflow.
In contrast, overflow occurs when a data value becomes too large to be represented. In floating point, such a value will become infinite. However, in fixed point, unless special overflow detection is implemented, an invalid result will be produced. For example, if two large positive numbers are added, a negative result could be returned. Overflow is often catastrophic, introducing completely incorrect values into the computation that lead to the result of a computation being worthless.
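The wrap-around behaviour described above can be illustrated with a minimal Python sketch. It assumes a twos complement representation with no overflow detection; `wrap_twos_complement` is a hypothetical helper that models what a fixed-width adder would store.

```python
def wrap_twos_complement(value: int, bits: int) -> int:
    """Keep only the low `bits` bits of value and read them as twos complement."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

# Two large positive 8-bit values (the representable maximum is 127).
a, b = 100, 100
stored_sum = wrap_twos_complement(a + b, 8)
print(stored_sum)  # a negative number, although both inputs are positive
```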
Both underflow and overflow are more important issues for fixed point computation having a particular bit-width than for floating point computation. This is because of the inherently smaller dynamic range supported by the number representation. Therefore, in order to make use of fixed point data in scientific computation, overflow must be prevented and underflow minimised. These objectives are generally opposed to each other; for example, although overflow can be prevented by moving bits from the least-significant part of the number to the most significant, providing “space” for the number to get larger, this will increase underflow.
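The opposition between the two objectives can be sketched in Python as follows. The function `quantise` is a hypothetical helper that assumes a signed fixed point format of `total_bits` bits, of which `frac_bits` lie after the radix point, and that truncates towards zero.

```python
def quantise(value: float, total_bits: int, frac_bits: int):
    """Return (stored value, overflow flag) for a signed fixed point format."""
    scale = 1 << frac_bits
    raw = int(value * scale)  # truncation: small values may become zero
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    overflowed = not (lowest <= raw <= highest)
    return raw / scale, overflowed

# 8 bits with 6 fractional bits: fine resolution, but 3.0 does not fit.
print(quantise(3.0, 8, 6), quantise(0.02, 8, 6))
# 8 bits with 4 fractional bits: 3.0 fits, but 0.02 underflows to zero.
print(quantise(3.0, 8, 4), quantise(0.02, 8, 4))
```

Moving two bits from the fractional part to the integer part removes the overflow for 3.0 but causes 0.02 to be stored as zero.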
To summarise, floating-point units require significant silicon area, and may provide unnecessarily large ranges and/or precision which are not required for the computational task. This is wasteful of valuable on-chip area. On the other hand, fixed-point units require less silicon area but lack numerical range flexibility. Thus, fixed-point units may be prone to generate errors in numerical calculations if the range limits are exceeded.
It is possible to provide dynamic scaling of input, output and intermediate values of a calculation. By this is meant that the range of fixed point values for a particular step or stage in the calculation can be selected in dependence upon the values of the input variables and the output variables. However, there is an on-chip cost to providing dynamic scaling in that a sufficient number of registers or hardware elements must be used to provide the required flexibility. Many of these will go unused in practice because the fixed point representation will only use a subset of the available bitwidth.
For many calculations, the likely scaling factor is generally known at any particular stage in a calculation. For example, for hardware optimised for a particular task (e.g. for processing an iterative calculation such as a convolution), the maximum values for a particular stage in a calculation are generally known. Therefore, the need for dynamic scaling of values is reduced and static scaling (or static shifts) can be used instead.
Static shifts require, by their very nature, fewer resources. As a result, they have considerable benefits in terms of hardware resource management. However, static shifts (i.e. the maximum value and number of bits thereof) must be selected carefully due to their fixed nature which cannot be altered once the stream processor is committed to hardware.
Consequently, the use of static shifts requires the range of numbers used in a computation to be known and understood at each stage in a calculation. However, the range of the numbers used in a computation can vary significantly between the input, the output and the intermediate stages of a calculation. Furthermore, it is common for inputs and outputs of a formula to have similar ranges, but for intermediate results to have quite different ranges. Equation 2 illustrates a simple example of this:
$$p' = \frac{p \times 1000 - 5}{1000} \quad \text{(Equation 2)}$$
In this case, the final result p′ will have a similar (though not identical) range to the input p, assuming the value of p is large compared to 5/1000. However, the intermediate value p×1000 will have a range a thousand times greater. If the same fixed point number format is used to represent p, p′ and p×1000 then either the value p×1000 will overflow or the values of p and p′ will have to be represented less precisely since they are approximately 1000 times smaller (equivalent to approximately 10 bits, since $\log_2 1000 \approx 9.97$).
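As an illustrative sketch only, the following Python code evaluates Equation 2 and estimates the additional range needed by the intermediate value as the base-2 logarithm of the scale factor. The function name `intermediate_and_result` is hypothetical.

```python
import math

def intermediate_and_result(p: float):
    """Evaluate Equation 2, returning the intermediate p*1000 and the result p'."""
    intermediate = p * 1000
    result = (intermediate - 5) / 1000
    return intermediate, result

intermediate, result = intermediate_and_result(2.0)

# Extra bits of range needed to hold p*1000 in the same format as p.
extra_bits = math.log2(1000)

print(intermediate, result, extra_bits)
```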
For simple equations like the one above, there is a straightforward relationship between input range and intermediate value ranges. However, for more complex formulae the relationship can be non-obvious. For example, a convolution calculation is set out in Equation 3. For $0 \le i \le N$:
$$y[i] = x[i-1] \times c_1 + x[i] \times c_2 + x[i+1] \times c_3 \quad \text{(Equation 3)}$$
The range (and most importantly, the largest value) of y[i] depends not just on the largest value in x but also on how the different values of x relate to each other (since y is computed from a combination of these values). Commonly, convolutions might be used to compute derivatives, and the maximum value is related to the rate of change of the input values rather than directly to the values of the inputs themselves.
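The dependence of the output range on the relationship between input values can be sketched in Python. The function `convolve3` is a hypothetical helper that evaluates Equation 3 at interior points only (boundary handling is an assumption made for this example), and the coefficients chosen form a simple difference.

```python
def convolve3(x, c1, c2, c3):
    """Evaluate y[i] = x[i-1]*c1 + x[i]*c2 + x[i+1]*c3 at interior points."""
    return [x[i - 1] * c1 + x[i] * c2 + x[i + 1] * c3
            for i in range(1, len(x) - 1)]

# Both inputs have the same largest magnitude (10) but different rates of change.
flat = [10, 10, 10, 10]
alternating = [10, -10, 10, -10]

y_flat = convolve3(flat, -1, 1, 0)
y_alternating = convolve3(alternating, -1, 1, 0)

print(y_flat, y_alternating)
```

Although the largest input magnitude is identical in the two cases, the largest output magnitude differs, so a fixed point type chosen for one dataset may overflow on the other.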
Therefore, whilst it may be possible to optimise fixed point types in a calculation for a particular dataset, this optimised data path may not produce the correct results when utilised with a dataset having different properties. Consequently, whilst automatic optimisations can be used to provide a fixed point data path having certain properties, such optimisation may not necessarily lead to a useable computation structure unless care is taken.
The example illustrated in FIG. 3 is a simple graph comprising a limited number of data paths, with each data path comprising only a small number of nodes and edges. Therefore, in principle, the above optimisation issues may be addressed manually for such a simple structure.
However, the number of data paths on a typical FPGA is around 10, and each data path may comprise a multiplicity of parallel branches which include, in total, typically $10^{2}$ to $10^{5}$ computation elements. This enables massively parallel calculations to be performed. Consequently, it is simply not possible to optimise manually outputs from all nodes forming intermediate steps in a typical parallel processor in order to address optimisation problems as set out above.
It is known to provide automatic conversion of floating point programs to fixed point. In general, known arrangements concentrate on automatically calculating fixed point types based on user-specified error constraints.
“Automatic conversion of floating point MATLAB programs into fixed point FPGA based hardware design” Banerjee, P.; Bagchi, D.; Haldar, M.; Nayak, A.; Kim, V.; Uribe, R. 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2003. IEEE relates to an approach where value ranges are automatically propagated forwards then backwards in a simplified MATLAB program to compute appropriate types.
“FRIDGE: a fixed-point design and simulation environment” Keding, H.; Willems, M.; Coors, M.; Meyr, H. In Proceedings Design, Automation and Test in Europe (DATE), 1998. IEEE relates to a configuration which transforms floating point programs written in ANSI C into a fixed point specification and assigns a tuple of (word-length, integer-bits, sign) to each operand.
“An automatic word length determination method” Cantin, M.-A.; Savaria, Y.; Prodanos, D.; Lavoie, P, 2001 IEEE International Symposium on Circuits and Systems, 2001. ISCAS 2001. IEEE relates to a method which finds fixed point types. This is done by computing the differences between floating point and fixed point versions by simulating the data path and finding a set of word lengths where the differences meet a criterion set by the user.
“Wordlength optimization for linear digital signal processing” Constantinides, G. A.; Cheung, P. Y. K.; Luk, W., IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, volume 22 issue 10, IEEE relates to an approach which concentrates on determining the optimum fixed point word lengths for variables in a DSP data path under error constraints.
Therefore, to date, known methods and associated hardware have suffered from the technical problem that they are unable to determine appropriate scaling factors, maximum values and rounding methods for each stage of a calculation in dependence upon user-determined parameters in order to provide a user-selected balance between consumption of on-chip resources and providing sufficient precision and/or accuracy in a parallel calculation on a stream processor.
According to a first aspect of the present invention, there is provided a method of generating a hardware design for a pipelined parallel stream processor, the method comprising: defining, on a computing device, a processing operation designating processes to be implemented in hardware as part of said pipelined parallel stream processor; specifying, on a computing device, at least one propagation rule for said processing operation; defining, on a computing device, a graph representing said processing operation as a parallel structure in the time domain, said graph comprising at least one data path to be implemented as a hardware design for said pipelined parallel stream processor and comprising a plurality of parallel branches configured to enable data values to be streamed therethrough, the or each data path being represented as comprising: at least one data path input; at least one data path output; and at least one discrete object corresponding directly to a hardware element to be implemented in hardware as part of said pipelined parallel stream processor, the or each discrete object comprising an input for receiving at least one input variable represented in a fixed point format; an operator for executing a function on said input variable or variables; and at least one output for outputting an output variable represented in a fixed point format; optimising, on a computing device, the number of bits, the offset, the number format and the rounding mode for each output from each discrete object in dependence upon the specified propagation rule or rules to produce an optimised graph; and utilising, on a computing device, said optimised graph to define an optimised hardware design for implementation in hardware as said pipelined parallel stream processor.
The present invention relates to a method for optimising the streaming calculation hardware structure such that each node in a calculation (of which there may be many thousands) can provide sufficient precision and have an appropriate range (or scaling) whilst minimising the hardware required.
By providing such a method, fixed point processing logic can be utilised to provide streaming data paths in which calculations are performed with sufficient precision whilst still providing the advantage of a significantly reduced logic area to perform the calculations in fixed point.
For example, the programmer might use a bitwidth limit of 24 bits to ensure optimal multiplier resource usage on Xilinx FPGAs (which contain 18 bit x 25 bit multipliers), and then the compiler will automatically generate the best hardware it can under that constraint. The programmer can further optimize the result quality with manual typing at critical regions should this be desired.
The present invention comprises an arrangement whereby the code that needs to be written to apply variable types to a data path can be minimised. Rules that the programmer might have applied by hand can be automatically propagated through the nodes created to enable calculations to be performed on a stream processor. Further, the present invention allows the programmer to select the most appropriate set of rules to use for each region of the design.
In one embodiment, said processing operation comprises a mathematical function or calculation to be implemented in hardware as said pipelined parallel stream processor.
In one embodiment, the or each hardware element comprises one or more hardware units on said pipelined parallel stream processor.
In one embodiment, the or each hardware element is configured to carry out a predetermined mathematical function.
In one embodiment, the or each propagation rule comprises at least one rule selected from the group of: maximum allowable bit size for the output from the or each discrete element; the desired offset of the output from the or each discrete element; and the rounding mode used by the or each element to produce an output or outputs therefrom.
In one embodiment, the or each propagation rule comprises at least one rule selected from the group of: maximum allowable bit size for the output from the or each discrete element; the desired offset of the output from the or each discrete element; the number format of the output from the or each discrete element; and the rounding mode used by the or each element to produce an output or outputs therefrom.
In one embodiment, the or each maximum allowable bit size rule is selected from the group of: the full number of bits generated in a calculation in said discrete element; from the largest of the number of bits in the or each input to a discrete element; a specified maximum value; and a specified exact value.
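One possible reading of the bit size rules is sketched below in Python; it is an assumption made for illustration and not a definitive implementation. The rule names, the functions `full_width` and `output_bits`, and the widths assumed for full precision results (one extra bit for an addition of aligned operands, the sum of the input widths for a multiplication) are introduced here.

```python
def full_width(op: str, in_bits) -> int:
    """Bits needed to hold every result bit (operands assumed to be aligned)."""
    if op == "add":
        return max(in_bits) + 1
    if op == "mul":
        return sum(in_bits)
    raise ValueError(f"unknown operation: {op}")

def output_bits(op: str, in_bits, rule: str, limit=None) -> int:
    """Choose the output width of a node under a hypothetical bit size rule."""
    full = full_width(op, in_bits)
    if rule == "full":
        return full
    if rule == "largest_input":
        return max(in_bits)
    if rule == "maximum":   # never wider than the specified limit
        return min(full, limit)
    if rule == "exact":     # always exactly the specified width
        return limit
    raise ValueError(f"unknown rule: {rule}")

print(output_bits("mul", [18, 25], "full"),
      output_bits("mul", [18, 25], "maximum", 24),
      output_bits("add", [24, 24], "largest_input"))
```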
In one embodiment, the or each offset rule is selected such that: overflow is prevented; underflow is prevented; the offset is set to the maximum value of the input or inputs to the or each discrete object; or the offset is set to a predetermined value.
In one embodiment, the rounding mode rule is selected from the group of: round to zero; round to positive infinity; round to negative infinity; round to nearest value; and round to nearest value with ties to even.
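The listed rounding modes can be illustrated by a Python sketch that discards a number of least significant bits from a stored integer. The function `round_shift` and the mode names are hypothetical, and the plain round to nearest mode is assumed here to round ties upwards, since the text does not specify tie handling for that mode.

```python
MODES = {"zero", "positive_infinity", "negative_infinity",
         "nearest", "nearest_even"}

def round_shift(raw: int, shift: int, mode: str) -> int:
    """Drop `shift` (>= 1) least significant bits of `raw` using `mode`."""
    if mode not in MODES:
        raise ValueError(f"unknown rounding mode: {mode}")
    floor = raw >> shift                # Python's >> rounds towards -infinity
    remainder = raw - (floor << shift)  # always in the range 0 .. 2**shift - 1
    half = 1 << (shift - 1)
    if remainder == 0 or mode == "negative_infinity":
        return floor
    if mode == "positive_infinity":
        return floor + 1
    if mode == "zero":
        return floor if raw >= 0 else floor + 1
    if mode == "nearest":               # assumption: ties round upwards
        return floor + 1 if remainder >= half else floor
    # nearest_even: ties go to the neighbour whose last bit is zero
    if remainder > half or (remainder == half and floor % 2 == 1):
        return floor + 1
    return floor

# 10 with two bits dropped represents 2.5; -10 represents -2.5.
for mode in sorted(MODES):
    print(mode, round_shift(10, 2, mode), round_shift(-10, 2, mode))
```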
In one embodiment, the rounding mode is specified manually in the or each propagation rule or is selected automatically in dependence upon a specified parameter.
In one embodiment, the rounding mode is selected automatically in dependence upon the number of bits rounded off in an output from the or each discrete element.
In one embodiment, the number format for each output variable is selected from the group of: unsigned; twos complement; and sign-magnitude.
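The three number formats can be compared with a short Python sketch that returns the bit pattern stored for a given integer. The function `encode` and the format names are hypothetical and are introduced only for this example.

```python
def encode(value: int, bits: int, fmt: str) -> int:
    """Return the bit pattern storing `value` in `bits` bits under `fmt`."""
    if fmt == "unsigned":
        if not 0 <= value < (1 << bits):
            raise OverflowError("value out of range")
        return value
    if fmt == "twos_complement":
        if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            raise OverflowError("value out of range")
        return value & ((1 << bits) - 1)
    if fmt == "sign_magnitude":
        sign_bit = 1 << (bits - 1)
        if abs(value) >= sign_bit:
            raise OverflowError("value out of range")
        return (sign_bit if value < 0 else 0) | abs(value)
    raise ValueError(f"unknown format: {fmt}")

print(format(encode(5, 4, "unsigned"), "04b"),
      format(encode(-5, 4, "twos_complement"), "04b"),
      format(encode(-5, 4, "sign_magnitude"), "04b"))
```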
In one embodiment, the number format is selected automatically in dependence upon the or each propagation rule.
In one embodiment, a plurality of propagation rules are specified for said processing operation.
In one embodiment, said processing operation comprises a plurality of data expressions, one or more propagation rules being specified for each data expression.
In one embodiment, the or each propagation rule is stored in a stack such that the most recent propagation rule is active for subsequent data expressions in said processing operation.
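A stack of propagation rules might be modelled as in the following Python sketch, in which the most recently pushed rule is the active one. The class `RuleStack`, its methods and the dictionary-based rule representation are assumptions made for illustration and are not taken from the described method.

```python
class RuleStack:
    """Hypothetical stack of propagation rules; the top entry is active."""

    def __init__(self, default_rule):
        self._rules = [default_rule]

    def push(self, rule):
        self._rules.append(rule)

    def pop(self):
        if len(self._rules) == 1:
            raise IndexError("cannot pop the default rule")
        return self._rules.pop()

    @property
    def active(self):
        return self._rules[-1]

rules = RuleStack({"max_bits": 32})
seen = [rules.active["max_bits"]]      # rule applied to a first expression

rules.push({"max_bits": 24})           # tighter rule for a critical region
seen.append(rules.active["max_bits"])

rules.pop()                            # leave the region; earlier rule returns
seen.append(rules.active["max_bits"])

print(seen)
```

Popping a rule restores the rule that was active before the region was entered, which allows different rule sets to be applied to different regions of a design.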
In one embodiment, said graph comprises multiple inputs and multiple outputs, each input and each output being connected to at least one branch of said at least one data path.
In one embodiment, said graph comprises multiple parallel data paths to be implemented in hardware as said pipelined parallel stream processor, and said step of optimising is carried out for each of said multiple parallel data paths.
In one embodiment, said stream processor is implemented on a Field Programmable Gate Array or an Application Specific Integrated Circuit.
In one embodiment, the method further comprises the step of forming said optimised hardware design on said stream processor such that said stream processor is operable to perform said processing operation.
According to a second aspect of the present invention, there is provided a method of making a programmable logic device, comprising: generating a design using the method of the first aspect; and programming the logic device to embody the generated design.
According to a third aspect of the present invention, there is provided a computer program arranged, when run on a computer, to execute the steps of the first or second aspects.
According to a fourth aspect of the present invention, there is provided a method according to the first or second aspects stored on a computer-readable medium.
According to a fifth aspect of the present invention, there is provided a Field Programmable Gate Array, Application Specific Integrated Circuit or other programmable logic device, having a design generated using the method of the first aspect.
According to a sixth aspect of the present invention, there is provided a system for generating a hardware stream processor design, the system comprising: a processor arranged to execute the method of the first aspect and to generate a list of instructions for the programming of a programmable logic device having the generated design.
According to a seventh aspect of the present invention, there is provided a computer program product executable by a programmable processing apparatus, comprising one or more software portions for performing the steps of the first to sixth aspects.
According to an eighth aspect of the present invention, there is provided a computer usable storage medium having a computer program product according to the seventh aspect stored thereon.