It is known to provide data processing systems with accelerator hardware operating to accelerate execution of certain program subgraphs within a program. As an example, a program may need to perform a complex operation a large number of times during its normal operation, such as decrypting a large quantity of data from a stream of data using a decryption technique which repeatedly executes the same piece of program code. This program code may be written as a sequence of individual program instructions that are separately executed in sequence by a general purpose execution unit. However, it is known to provide special purpose accelerator hardware in such circumstances that can operate to provide hardware support for accelerated execution of such specific processing requirements.
One approach is to add such special purpose accelerator hardware and then add specific instructions to the instruction set of the apparatus to represent the complex operation which is to be performed by the accelerator hardware. As an example, a general purpose instruction set could be augmented by the addition of specific decryption instructions which, when encountered, would be executed by the decryption accelerator hardware. This approach suffers from a number of disadvantages.
A program written to include the new decryption program instructions in place of the previous sequence of standard program instructions is no longer capable of being executed on a system which does not include the accelerator hardware. Thus, several versions of a computer program may need to be written, tested and maintained, each targeted at a different hardware platform which may or may not contain the hardware accelerator. Furthermore, different versions of a hardware accelerator with varying capabilities may be present in different implementations, requiring different programs to be written to reflect those differing capabilities. The special purpose accelerator added to implement the new special purpose instructions also represents a significant design investment and requires testing and validation of each variant produced.
It is also known to provide data processing systems with the capability to examine the stream of program instructions being executed to determine whether they can be modified, re-ordered or otherwise changed to run in a more efficient fashion. An example is a system which can combine two individual program instructions to form a single fused instruction that results in the same overall processing operation but is able to execute more rapidly. Whilst such systems are effective, the hardware and complexity overhead associated with seeking to identify program instructions that can safely be fused in this way is considerable and a disadvantage.
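The fusion concept described above can be illustrated in outline. The following is a minimal sketch, not taken from any particular implementation: the instruction representation, register names and the multiply-plus-add fusion rule are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    # Simplified three-operand instruction: op dst, src1, src2 [, src3]
    op: str
    dst: str
    src1: str
    src2: str
    src3: Optional[str] = None

def try_fuse(i1: Instr, i2: Instr) -> Optional[Instr]:
    """Illustrative fusion rule: a multiply followed by a dependent add
    is combined into a single multiply-accumulate ("madd")."""
    if i1.op == "mul" and i2.op == "add" and i1.dst in (i2.src1, i2.src2):
        # The add's other operand is whichever source is not the
        # multiply's result.
        other = i2.src2 if i2.src1 == i1.dst else i2.src1
        return Instr("madd", i2.dst, i1.src1, i1.src2, other)
    return None  # pair is not a fusion candidate

# t0 = r1 * r2; r3 = t0 + r4  -->  r3 = (r1 * r2) + r4 in one instruction
fused = try_fuse(Instr("mul", "t0", "r1", "r2"),
                 Instr("add", "r3", "t0", "r4"))
```

The complexity alluded to in the text arises because a real implementation must also verify, in hardware and at speed, that the intermediate result has no other consumers and that no intervening event can observe it.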
The article by J. Phillips and S. Vassiliadis, entitled “High-performance 3-1 interlock collapsing ALUs”, IEEE Transactions on Computers, 43(3), March 1994, proposes a technique which employs a 3-input ALU that can collapse up to three dependent instructions, a dependent instruction being an instruction which has an input operand that is dependent on the result of a preceding instruction. However, this approach has a number of drawbacks, in that only a limited number of dependent instructions can be managed, and the specialised ALU device only caters for a specific number of patterns.
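The notion of collapsing dependent instructions can be sketched as follows. This is an illustrative simplification, not the cited design: the instruction encoding and the dependence test are assumptions, and only the simplest case of two dependent adds collapsed into one three-operand evaluation is shown.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    # Simplified instruction: op dst, src1, src2
    op: str
    dst: str
    src1: str
    src2: str

def dependent(i1: Instr, i2: Instr) -> bool:
    """i2 is dependent on i1 if it consumes i1's result register."""
    return i1.dst in (i2.src1, i2.src2)

def collapsed_add(a: int, b: int, c: int) -> int:
    # A 3-input ALU evaluates d = a + b + c in a single step, rather
    # than serialising t = a + b followed by d = t + c.
    return a + b + c

# t0 = r1 + r2; r4 = t0 + r3 is a collapse candidate
i1 = Instr("add", "t0", "r1", "r2")
i2 = Instr("add", "r4", "t0", "r3")
can_collapse = dependent(i1, i2)
result = collapsed_add(5, 6, 7)
```

The drawback noted in the text follows directly: the hardware fixes the number of operands and the set of operation patterns, so chains longer or differently shaped than those the ALU was designed for cannot be collapsed.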
The article by N. Clark, M. Kudlur, H. Park, S. Mahlke, K. Flautner entitled “Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization”, International Symposium on Microarchitecture (Micro-37)—2004, describes a transparent instruction set customisation technique, in which a subgraph accelerator is used alongside a general purpose processor. Thus, a fixed processor design is maintained and the instruction set is unaltered. In accordance with this technique, subgraphs are identified and control is generated on-the-fly to map and execute data flow subgraphs on the accelerator. This avoids the need to explicitly change the instruction set. Subgraphs in the program can be discovered offline using a compiler, through binary translation, or online using a more heavy-weight dynamic optimiser. The technique proposed in this article allows the mapping of larger data flow subgraphs onto the accelerator, referred to in that article as a Custom Compute Accelerator (CCA). The CCA consists of an array of functional units arranged in predefined patterns. However, the CCA is limited to a maximum number of dependent instructions according to the predefined shape of the CCA and the interconnection network. The CCA must either be sufficiently large to cover all desired subgraphs of the application, or can alternatively be made smaller, but is then suited only to specialised applications.
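The subgraph discovery step described above can be sketched in simplified form. This is a minimal illustration, not the algorithm of the cited article: the instruction representation, the greedy chain-growing heuristic and the depth limit standing in for the accelerator's predefined shape are all assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instr:
    # Simplified instruction: op dst, src1, src2
    op: str
    dst: str
    src1: str
    src2: str

def grow_subgraph(block: List[Instr], max_depth: int) -> List[Instr]:
    """Greedily extend a chain of dependent instructions, each consuming
    the previous result, up to the depth the accelerator supports."""
    chain = [block[0]]
    for instr in block[1:]:
        if len(chain) >= max_depth:
            break  # accelerator shape caps the subgraph depth
        if chain[-1].dst in (instr.src1, instr.src2):
            chain.append(instr)
    return chain

block = [
    Instr("add", "t0", "r1", "r2"),
    Instr("xor", "t1", "t0", "r3"),
    Instr("and", "t2", "t1", "r4"),
    Instr("or",  "r5", "t2", "r6"),
]
# With a depth limit of 3, the fourth dependent instruction is excluded,
# mirroring the shape limitation described in the text.
chain = grow_subgraph(block, max_depth=3)
```

In practice the discovered subgraph would then be replaced in the instruction stream by a single invocation of the accelerator, leaving the instruction set itself unchanged.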
In the article by S. Yehia and O. Temam, entitled “From sequences of Dependent Instructions to Functions: An approach for Improving Performance without ILP or Speculation”, 31st International Symposium on Computer Architecture, 2004, a technique is proposed for collapsing sequences of dependent instructions on a bit-level configurable device associated with a ripple carry generation network. The technique described is based on configurable look-up tables (LUTs), but the approach proposed in the above article suffers from two major drawbacks. First, the proposed device is similar to a Field Programmable Gate Array (FPGA) because of its fine grain configurability at the bit level. Due to the complex hardware and interconnection network, this bit level approach makes the device too slow. Furthermore, because each bit of the output requires a configuration, the device requires a very large configuration, which is inefficient. Secondly, the technique proposed for generating the configuration of every output bit requires substantial hardware or software resources and must be performed offline with respect to execution.
Given the above, it would be desirable to provide an improved technique for transparent instruction set customisation, which allows the accelerator to execute larger chains of dependent instructions, and which allows a more efficient and straightforward technique for generating configurations for the accelerator.