This invention broadly relates to parallel processing in the field of computer technology, and more particularly concerns systems, devices and methods for generating instructions for a parallel computer such as a Single Instruction Multiple Data (SIMD) data processor.
Parallel processing is increasingly used to meet the computing demands of the most challenging scientific and engineering problems, since the computing performance required by such problems is usually several orders of magnitude higher than that delivered by general-purpose serial computers. Growth in parallel processing has opened up a broad spectrum of application areas including image processing, artificial neural networks, weather forecasting, and nuclear reactor calculations.
Whilst different parallel computer architectures support differing modes of operation, in very general terms, the core elements of a parallel processor include a network of processing elements (PEs) each having one or more data memories and operand registers, with each of the PEs being interconnected through an interconnection network (IN).
One of the most extensively researched approaches to parallel processing concerns Array Processors, which are commonly embodied in single instruction stream operating on multiple data stream processors (known as Single Instruction Multiple Data or SIMD processors). The basic processing units of an SIMD processor are an array of processing elements (PEs), memory elements (M), a control unit (CU), and an interconnection network (IN). In operation, the CU fetches and decodes a sequence of instructions from a program, then synchronises all the PEs by broadcasting control signals to them. In turn, the PEs, operating under the control of a common instruction stream, simultaneously execute the same instructions but on the different data that each fetches from its own memory. The interconnection network facilitates data communication among processing units and memory. Thus the key to parallelism in SIMD processors is that one instruction operates on several operands simultaneously rather than on a single one.
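The lock-step behaviour described above can be illustrated with a minimal sketch. It is not taken from any real SIMD device: the names ProcessingElement, broadcast and add_first_two are hypothetical, and the parallel hardware is simulated by a sequential loop.

```python
from typing import Callable, List


class ProcessingElement:
    """Hypothetical PE: holds only its own local data memory."""

    def __init__(self, local_memory: List[int]) -> None:
        self.memory = local_memory


def broadcast(pes: List[ProcessingElement],
              instruction: Callable[[ProcessingElement], None]) -> None:
    """Stand-in for the CU: every PE receives the same instruction.

    Real hardware would execute these simultaneously; the loop only
    simulates that behaviour.
    """
    for pe in pes:
        instruction(pe)


def add_first_two(pe: ProcessingElement) -> None:
    """One 'instruction': add memory cells 0 and 1, store the sum in cell 2."""
    pe.memory[2] = pe.memory[0] + pe.memory[1]


# Four PEs, each with different data in its own memory.
pes = [ProcessingElement([i, 10 * i, 0]) for i in range(4)]
broadcast(pes, add_first_two)
results = [pe.memory[2] for pe in pes]
```

A single broadcast call therefore produces one result per PE, which is the sense in which one instruction operates on several operands.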
In a standard set-up, an SIMD processor is attached to a host computer, which, from the user's point of view, is a front-end system. The role of the host computer is to perform compilation, load programs, perform input/output (I/O) operations, and execute other operating system functions.
An example of an SIMD processor which is made and sold by the Applicant, the Aspex™ ASP™ (Associative String Processor) data processor, can in typical configurations operate on 1000 to 100,000 data items in parallel. The major features of current implementations of the ASP are:
256 processing elements on a single device 8.1 mm × 9.3 mm in size to 1152 processing elements on a single device 14.5 mm × 13.5 mm in size.
DPC interface 80-82 bits wide operating at 20M-50M instructions per second (20-50 MIPS).
40-100 MHz clock speed.
An ASP that has been implemented is controlled by a 76-bit wide instruction consisting of 32-bit Control, 32-bit Data and 12-bit Activity fields. The ASP performs two (sequentially executed) operations for every instruction received. To support this, the control field is further subdivided into the sub-instruction fields A and B. Data I/O to the ASP uses high-speed channels, but it also can return a 32-bit wide value to a control unit, and has four status lines that can be monitored by a control unit.
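As an illustration of the 76-bit format, the following sketch packs and unpacks the three fields. The field widths come from the text; the ordering of the fields within the word (Control in the most significant bits, Activity in the least) and the function names are assumptions made only for this example.

```python
CONTROL_BITS = 32
DATA_BITS = 32
ACTIVITY_BITS = 12
INSTRUCTION_BITS = CONTROL_BITS + DATA_BITS + ACTIVITY_BITS  # 76


def _check(value: int, width: int, name: str) -> None:
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name} does not fit in {width} bits")


def pack_instruction(control: int, data: int, activity: int) -> int:
    """Pack the three fields into one integer (assumed field order)."""
    _check(control, CONTROL_BITS, "control")
    _check(data, DATA_BITS, "data")
    _check(activity, ACTIVITY_BITS, "activity")
    return ((control << (DATA_BITS + ACTIVITY_BITS))
            | (data << ACTIVITY_BITS)
            | activity)


def unpack_instruction(word: int) -> tuple:
    """Recover (control, data, activity) from a packed word."""
    activity = word & ((1 << ACTIVITY_BITS) - 1)
    data = (word >> ACTIVITY_BITS) & ((1 << DATA_BITS) - 1)
    control = word >> (DATA_BITS + ACTIVITY_BITS)
    return control, data, activity


word = pack_instruction(0xDEADBEEF, 0x12345678, 0xABC)
fields = unpack_instruction(word)
```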
The ASP can perform operations on data in the APE in bit serial (one bit at a time) or bit parallel (many bits at a time). Operations are classed as Scalar-Vector when one operand is the same value on all APEs or Vector-Vector for all other cases. Vector-Vector operations require the control unit to supply the operand addresses in the instruction and are normally performed bit serial. Scalar-Vector operations require the control unit to supply the common, i.e. scalar, operand's value and the address of the second operand in the instruction and are performed bit serial or parallel. Both cases require that the address of the result is also included in the instruction.
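To make the bit-serial Scalar-Vector case concrete, the sketch below adds one scalar to every element of a vector, one bit position at a time. It is a generic illustration of bit-serial arithmetic and not a description of the ASP's internal mechanism; the function name and the use of a Python list to stand for the per-APE operands are assumptions.

```python
from typing import List


def bit_serial_scalar_add(vector: List[int], scalar: int, width: int) -> List[int]:
    """Add `scalar` to each element of `vector`, processing one bit per step.

    Each list element stands for the operand held by one APE. The result
    wraps around at `width` bits, as a fixed-width register would.
    """
    carries = [0] * len(vector)
    results = [0] * len(vector)
    for bit in range(width):
        scalar_bit = (scalar >> bit) & 1  # same bit supplied to every APE
        for i, operand in enumerate(vector):
            operand_bit = (operand >> bit) & 1
            total = operand_bit + scalar_bit + carries[i]
            results[i] |= (total & 1) << bit  # sum bit for this position
            carries[i] = total >> 1           # carry into the next position
    return results


sums = bit_serial_scalar_add([1, 2, 255], 3, 8)
```

In this sketch the outer loop corresponds to the sequence of steps issued by the control unit, while the inner loop stands for work that all APEs would perform at the same time.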
For the purposes of controlling an SIMD processor, the range of architectures can be considered to be bounded by two basic cases: standalone and co-processor. Other architectures are either variations, a blend or multiple instances of the two basic cases. A control unit common to standalone, co-processor and intermediary architectures is a Data Processor Controller (DPC). As will become apparent, a DPC executes the control statements of a program and issues instructions to the SIMD processor.
The standalone arrangement which is shown in FIG. 1 of the accompanying drawings consists of two blocks: the SIMD processor which manipulates data, and the DPC which issues instructions to the SIMD processor and thereby controls the operation of the SIMD processor. A characteristic of the standalone case is that data I/O is direct to the SIMD processor. Optional external commands and status checks go to and from the DPC.
The co-processor arrangement which is shown in FIG. 2 consists of an SIMD processor coupled via a DPC to a more conventional processor embodied in a single instruction stream operating on single data stream processor (also known as a Single Instruction Single Data or SISD processor). The combination of the DPC and the SIMD processor can be regarded as a co-processor to the SISD processor.
SISD processors can range in complexity from a processor core like the ARM, through microprocessors like the Intel Pentium or the Sun SPARC, up to complete machines like an IBM/Apple PC or a Sun/DEC workstation (all trade marks acknowledged).
During the execution by an SISD processor of a given program, the organisation of the system is such that the SISD processor delegates certain tasks along with their parameters to the co-processor. The division of each such task between the DPC and the SIMD processor is the same as for the standalone case. While the co-processor is performing its assigned task, the SISD processor continues executing the program; the overall result being that the program steps are completed faster than if the SISD processor alone had been relied upon to execute the program. For example, if a program in an image processing application contains a statement which divides all the pixels in an image by the value X, the SISD processor will assign this statement and the value X to the co-processor for execution. Similarly if, say, another part of the program performs a two-dimensional convolution on the image, this task would also be assigned to the co-processor for execution.
Notably, the major attributes of a DPC are:
Supply instructions to the SIMD processor at a very high rate, typically 20-100M instructions per second.
Generate wide instructions, typically a couple of hundred bits.
Process status information from the data processor.
At present, known DPCs fall into one of two general categories: (i) direct microprocessor drive and (ii) custom micro-code sequencer.
Direct microprocessor drive provides a versatile and simple DPC solution employing software running on a stored-program microprocessor or digital signal processor (DSP) device to generate and assemble the data processor instructions. FIG. 3 shows such a solution. The SIMD processor's M-bit wide instruction and N-bit wide status/result interfaces are connected via registers to a P-bit wide interface to the address/data bus or I/O channel of the microprocessor/DSP, and in general, M and/or N will be larger than P. In use, the software program builds each data processor instruction by writing it P bits at a time and once all M bits have been written, the instruction is issued. Similarly, the N-bit status/result data is read in segments.
The versatility of the direct microprocessor drive approach comes from the direct generation of the data processor instructions by the microprocessor/DSP. However, its main disadvantage is the poor instruction generation speed caused by the need to write a number of P-bit words to generate each M-bit instruction, and the relatively poor write speed of even the latest microprocessor/DSP. Consequently, the rate at which the DPC operates lags behind the rate at which the SIMD processor can operate; processing capacity of the SIMD processor therefore remains untapped.
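The speed penalty can be estimated with a short calculation. The sketch below assumes that each M-bit instruction needs a whole number of P-bit writes and ignores every other overhead; the figure of 30 million writes per second is purely hypothetical, as are the function names.

```python
import math


def writes_per_instruction(m_bits: int, p_bits: int) -> int:
    """Number of P-bit writes needed to assemble one M-bit instruction."""
    return math.ceil(m_bits / p_bits)


def max_instruction_rate(m_bits: int, p_bits: int,
                         writes_per_second: float) -> float:
    """Upper bound on instructions per second for a given write rate."""
    return writes_per_second / writes_per_instruction(m_bits, p_bits)


# An 80-bit instruction over a 32-bit interface needs three writes, so a
# hypothetical 30 million writes per second yields at most 10 million
# instructions per second.
writes = writes_per_instruction(80, 32)
rate = max_instruction_rate(80, 32, 30_000_000)
```

Under these assumptions the instruction rate can never exceed the write rate divided by the number of writes per instruction, which is why the DPC lags behind the SIMD processor.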
Turning to custom micro-code sequencer DPCs, these use a custom micro-code sequencer or bit-slice sequencer architecture micro-coded with a complete application or with a library of micro-routines that perform simple tasks which can be assembled to build an application. The micro-code is normally hardwired or down-loaded before an application is run, but some DPCs have schemes for changing the micro-code while the application is running.
FIG. 4 shows a simplified micro-code sequencer DPC solution. It consists of four blocks: 1) micro-code sequencer, 2) arithmetic processor unit (APU), 3) data processor instruction multiplexer (DPMX) and 4) command buffer. Taking each unit in turn:
1) The micro-code sequencer controls the DPC in the sense that it generates the base data processor instructions. The sequencer contains a very wide high-speed memory that holds the micro-code and is addressed by the address generation unit. The memory output is registered, then divided into micro-order fields that either control the sequencer's address generation unit and the other DPC blocks or contain the data processor instruction. The address generation unit has dedicated logic for performing calls, branches and deterministic and non-deterministic loops. It has test inputs that allow it to make decisions based on the state of the DPC or the SIMD processor and has a data input for loading a branch address or a loop count.
2) The APU performs general arithmetic. It can be loaded with parameters from the command buffer, results from the SIMD processor or literals from the micro-code sequencer. The result is used to control the sequencer or parameterise the data processor instructions. Often the APU supports generation of the operand/result address fields of a data processor instruction and manipulation of the scalar value when the data processor is performing a scalar-vector operation. In practice a DPC will have a number of APUs with private data paths each dedicated to a particular function or groups of functions. For instance, a typical micro-code sequencer DPC may have four to six APUs dedicated to specific functions and a 200-bit wide micro-code store built from very fast random access static memory.
3) The DPMX parameterises the base data processor instruction produced by the micro-code sequencer, replacing parts of the instruction with values taken from APU registers.
4) The command buffer provides a means for external control of the DPC: task requests along with their parameters are taken from the buffer and results and status information is placed into the buffer. The buffer may be implemented as a simple register, small memory or first in first out (FIFO) memory. In the standalone case, a command buffer is optional.
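The parameterisation performed by the DPMX in block 3) can be sketched as a field substitution on an instruction word. The representation of a substitution as an (offset, width, value) triple and the function name parameterise are assumptions for illustration; the text does not specify how fields are selected.

```python
from typing import Iterable, Tuple

# Hypothetical description of one substitution: (bit offset, width, value).
Substitution = Tuple[int, int, int]


def parameterise(base: int, substitutions: Iterable[Substitution]) -> int:
    """Replace selected bit fields of a base instruction with new values.

    `base` plays the role of the instruction from the micro-code sequencer
    and each value plays the role of an APU register.
    """
    word = base
    for offset, width, value in substitutions:
        mask = ((1 << width) - 1) << offset
        # Clear the field, then insert the (truncated) replacement value.
        word = (word & ~mask) | ((value << offset) & mask)
    return word


# Overwrite two 8-bit fields in the low half of a 32-bit base word.
instruction = parameterise(0xFFFF0000, [(0, 8, 0x2A), (8, 8, 0x01)])
```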
An example of a DPC with custom micro-code sequencer architecture is the Aspex™ Microsystems LAC-1001™ card. This card generates an 80-bit data processor instruction every 50 ns. It is 340 mm by 367 mm in size and draws 12 A at 5 V.
The principal advantage of the custom micro-code sequencer is its speed of operation. However, this is offset by its lack of flexibility and its circuit complexity. Such a DPC solution can only perform tasks it has been micro-coded to do. The flexibility of the micro-code is also restricted by the functions and data paths provided in the hardware, which renders the DPC application-specific and limits its usefulness. Additionally, the complexity of the circuit results in a DPC that is large, expensive and power-hungry.
The disadvantages arising in particular from the complexity of the circuit have numerous undesirable knock-on effects. Initially, the circuit design has to be elaborate, then the large number of components has to be manufactured, assembled, coded and tested. Inevitably, the circuit is relatively large (typically around 1250 cm²) and therefore requires 'big box' custom equipment which is not suitable for PC based or OEM implementation. Reliability is a further important issue; with complex multi-component circuitry this is constantly a problem.
All of these factors clearly add to cost, which drastically limits the accessibility of the technology. For example, the current cost of a 3D medical imaging system ranges between £1M and £10M, and yet such a system still fails to meet ideal real-time performance standards.
Against this background, the present invention seeks to provide a DPC which achieves the performance of the custom micro-code sequencer approach and the flexibility of the direct microprocessor drive approach at a size and cost close to those of the direct drive solution.
To this end, the invention feeds data processor instructions generated at a low rate to a circuit that generates the data processor instructions at a high rate, i.e. increases the instruction generation bandwidth. The term “rate” means the number of instructions generated in a given period of time, and the term “instruction generation bandwidth” means the number of instruction bits generated in a given period of time.
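The two terms can be related by a short worked calculation. The sketch below uses the figures quoted earlier for the micro-code sequencer card (one 80-bit instruction every 50 ns); treating bandwidth as rate multiplied by instruction width is an assumption that holds only when every instruction has the same width.

```python
NS_PER_SECOND = 1_000_000_000


def instruction_rate(period_ns: int) -> int:
    """Instructions per second when one instruction is issued every period_ns."""
    return NS_PER_SECOND // period_ns


def generation_bandwidth(rate: int, instruction_bits: int) -> int:
    """Instruction bits per second, assuming fixed-width instructions."""
    return rate * instruction_bits


rate = instruction_rate(50)                  # 20 million instructions per second
bandwidth = generation_bandwidth(rate, 80)   # 1.6 thousand million bits per second
```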
The invention may be applied to both standalone cases and co-processor cases. In standalone cases, a DPC executes control statements of a program containing data processor instructions, multiplies the data processor instructions and issues the multiplied data processor instructions to the data processor which manipulates the data. The rate of the generation of the multiplied data processor instructions is greater than the rate of the execution of the statements of the program.
In co-processor cases, an SISD processor executes a program and farms out some tasks to a co-processor comprising a DPC and a data processor. The DPC multiplies data processor instructions and issues them to the data processor at a rate greater than the rate the DPC receives the data processor instructions from the SISD processor.
The invention may exploit two properties of a data processor instruction stream or a block of instructions produced by a typical application. Firstly, individual instructions and blocks of instructions can be repeated. Secondly, the instruction stream can be compressed. The performance can be further increased by recognising that most loops in the data processor instruction stream will change either the operand/result address or (for scalar-vector operations) the scalar value, or both, during each iteration.
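The repetition property can be illustrated with a minimal sketch in which a single compact loop descriptor is expanded into many instructions, with the operand address advanced on each iteration. The Repeat descriptor, its field names and the (opcode, address) instruction form are hypothetical and are not the encoding used by the invention.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple

Instruction = Tuple[str, int]  # hypothetical (opcode, operand address) pair


@dataclass(frozen=True)
class Repeat:
    """Compact description of a loop over a block of instructions."""
    body: Tuple[Instruction, ...]
    count: int
    address_step: int = 0  # amount added to every address per iteration


def expand(loop: Repeat) -> Iterator[Instruction]:
    """Multiply one loop descriptor out into the full instruction stream."""
    for iteration in range(loop.count):
        for opcode, address in loop.body:
            yield (opcode, address + iteration * loop.address_step)


loop = Repeat(body=(("LOAD", 0), ("ADD", 8)), count=3, address_step=1)
stream = list(expand(loop))
```

Here one descriptor received at the low rate yields six instructions at the output, so the output rate can exceed the input rate by the same factor.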
The data processor instructions may include specific instructions for controlling the operation of the multiplication circuit. In this way it is not necessary to pre-load the circuit with a set of specific tasks that the circuit can be required to carry out. Rather, the provision of the specific instructions within the data processor instructions enables the instruction stream generated by the DPC to be multiplied out, at run time, into a format that is suitable for the data processor, without any predetermined knowledge of the specific multiplication processes that are required. Accordingly, the data processor controller of the present invention is very flexible and does not require complicated additional circuitry which can also be expensive.
Another way of considering the multiplication aspect of the present invention is that the data processor instructions received by the multiplication circuit are increased in number by being expanded out. This feature enables the bandwidths of physical data paths between a data processor instruction generator, the multiplying circuit and the data processor itself to be fully used with maximum efficiency; each data path operating at its optimum capacity. In addition, the data processor instructions can be generated in a compounded format and can be separated by the multiplication circuit before reaching the data processor in a non-compounded format. Therefore advantageously, the data processor instruction generator can output instructions to the multiplying circuit along a relatively small bandwidth pathway and from the multiplication circuit to the data processor along a relatively large bandwidth pathway, without the overall performance being limited by the slowest of the pathways.
Expressed another way, the invention resides in a data processor controller for controlling a data processor, comprising: a first processor for issuing data processor instructions at a first rate; multiplying means for receiving the data processor instructions issued by the first processor, multiplying the data processor instructions, and generating the multiplied data processor instructions to the data processor at a second rate, the second rate being greater than the first rate.
From one aspect, the invention resides in a data processor controller comprising instruction generating means for generating data processor instructions at a first rate and instruction accelerating means for receiving the data processor instructions at the first rate and being arranged to multiply the instructions and forward the multiplied instructions to the data processor at a second rate substantially greater than the first rate.
Within the same inventive concept, the invention also encompasses a bandwidth multiplier for multiplying data processor instructions for controlling a data processor, the bandwidth multiplier comprising: input means for receiving instructions; and bandwidth multiplying means for multiplying data processor instructions contained in the instructions received by the input means.
The invention extends to a method for controlling a data processor, comprising the steps of: issuing data processor instructions at a first rate; reading the data processor instructions; multiplying the data processor instructions; and writing the multiplied data processor instructions to the data processor at a second rate, the second rate being greater than the first rate.