With the ever-increasing need for denser computing power, there is a current trend to implement multi-core arrays. These silicon devices usually have the same microprocessor core instantiated several times on the same device, interconnected by a shared bus. Due to the sequential architecture of microprocessors, they tend only to be able to perform a limited number of operations per clock cycle, though peripheral functions offer some parallelism in that they are used to calculate the next potential instruction address and implement various interfaces. Different parallel or concurrent threads within a complex application will be assigned to each processor. A thread is a sequence of instructions used to implement a task. A task implements an algorithm and forms part of a computer program. A thread of execution results from a fork of a computer program into two or more concurrently running tasks. When a thread has completed its task, the thread is suspended, destroyed or initiates another thread. Multi-threading describes a program that is designed to have parts of its code or multiple threads execute concurrently. These threads share the processor's resources but are able to execute independently. As a result, many of the microprocessor resources may be underutilized, as there is not a one-to-one match between the application algorithms and hardware resources. In addition, many calculations require the transfer and temporary storage of intermediate results, which further consumes processing time and power. Due to their sequential processing, microprocessors and hence related software approaches to parallelism tend to be much slower and less efficient, especially when implementing Digital Signal Processing (DSP) intensive applications.
One solution to this problem is to implement an array processor, in which an array of homogeneous processing elements is provided. The term array processor used herein is not limited to vector processors and includes processors that contain an array of homogeneous or heterogeneous processing elements and can process two or more program threads concurrently. The processing elements in an array processor are usually interconnected in a simple way, for example nearest neighbour, in order to reduce the routing overhead. Several prior art array processors employ a common bus means to transfer data between one or a plurality of elements in an array for processing and reconfiguration. For example, Vorbach et al. in U.S. Pat. No. 7,237,087 teach such an architecture. Nonetheless, such common bus schemes are inefficient and create data/processing bottlenecks. In addition, such arrays have the disadvantage that each homogeneous processing element needs to be quite complex (implementing many types of arithmetic and logic functions), as it may be required to implement one of many functions depending on the algorithm to be implemented. If, for example, the output of one processing element needed to be shifted up or down, and the next processing element did not implement a shifting function, then an algorithm would be difficult to implement. A shifter may be provided at a certain location in the array, but for data to reach the array it will need to be passed through several pipeline stages. Consequently, all the other stages will need to be halted or stalled, or extra register delays inserted to compensate. In such cases, the sole purpose of a complex array element is to perform a simple pipeline register function. Consequently, the hardware resources are underutilised. It also means that the processing array is synchronous and any delay in one thread will interfere with the processing of other non-related threads.
Due to the global synchronous switching of data and array elements the processing of independent threads is limited. This type of processing architecture tends to be very unwieldy to implement and program for.
Another parallel processing solution is a Very Long Instruction Word (VLIW) processor, where sub-fields of an instruction are partitioned to control separate execution units. However, if a VLIW compiler cannot find enough parallel operations to fill all of the slots in an instruction, it must place explicit NOP (no-operation) operations into the corresponding operation slots. This means the hardware is then under utilized. This causes VLIW programs to use more memory than equivalent programs for superscalar processors. Though a VLIW processor provides some parallelism there is no provision for executing independent parallel threads asynchronously.
Many array processors have processing elements that implement multiplies and arithmetic logic functions, as these operations are commonly found in DSP algorithms. Such arrays lend themselves to implementing digital filters and the like, as their data flow graphs map neatly onto the processing array. However, they have limited applications.
Another disadvantage of array processors is that they are based on coarse-grained processing elements and as a consequence it is difficult to implement fine-grained logic functions. Again, this limits the use of these devices.
In some cases, integrated circuits have a mixture of processing cores and hardware resources. This further complicates the issue, especially at design time as many different design tools e.g. separate compilers and simulators for the embedded cores and hardware resources are required to design and test any application.
An alternative to implementing both coarse and fine-grained random logic is to employ Field Programmable Logic Arrays, also referred to as Field Programmable Gate Arrays (FPGAs). FPGA devices use a memory based Look Up Table (LUT) to implement a simple logic function and the more complex versions can include preconfigured DSP slices consisting of many fixed interconnected processing elements. The disadvantage to this approach is that the DSP slices tend to target particular applications and hence FPGA manufacturers need to provide different versions of FPGAs to target these different applications. Though these more complex FPGAs provide a high degree of user programmability they are not fully flexible.
Unfortunately, there are several disadvantages to using FPGAs when compared to alternatives, such as Application Specific Integrated Circuits (ASICs). Firstly, FPGAs tend to be much larger than their hardwired ASIC counterparts, consume more power and are more expensive. Secondly, though they can be re-programmed, a large amount of memory is required to implement specific functions. Another disadvantage of FPGAs is that there is a significant routing overhead required to interconnect all the fine-grained LUTs. The aforementioned devices are usually fabricated using a Complementary Metal Oxide Semiconductor (CMOS) process.
Once an integrated circuit has been defined and initially tested, subsequent actions in the design flow include automatic test generation and/or the insertion of test circuitry, such as Built In Self Test (BIST) and scan chains. However, there is a major design conflict with test circuitry. It is desirable to keep this extra test circuitry to a minimum to reduce silicon overheads and path delays, but it must be flexible enough to provide the desired test/fault coverage. It would be advantageous to be able to reconfigure the available circuit resources so they can be employed as test circuits.
Programmable logic devices allow a circuit designer to use the same device to implement many different logic functions at different times, for example, to include circuit upgrades, try out prototype circuits or correct design errors. This design methodology allows the designer to use off the shelf components rather than designing an Application Specific Integrated Circuit (ASIC), which would be more expensive, take longer to design and to get to market. Another advantage, from a programmable logic manufacturer's perspective, is that one device can be used to address the needs of many different customers and their particular applications. This also allows end product differentiation.
Another way to cater for product differentiation and allow for future upgrades to silicon devices is to provide an area of silicon real estate on a device that is dedicated to implementing programmable or reconfigurable logic, with the remainder of the silicon real estate being used to implement dedicated functions. Consequently, such a device would provide the benefits of both an ASIC device and a programmable logic device.
One reason for using an array processor is to provide a high degree of hardware parallelism and allow both dependent and independent threads to be executed concurrently. However, dependent threads (where the execution of one or more threads relies on the results of another thread) need to be synchronised in order to maintain error free processing. Prior art schemes to address this problem, for example US2009013323A1 (May et al.), require elaborate control or Finite State Machines (FSMs), thread control and status registers, inter-thread FSM communication links and associated protocols and instruction sets. Other thread synchronisation methods include using semaphores, mailboxes and mutexes. These approaches tend to be unwieldy (especially for large multi-dimensional arrays, as they do not scale well), consume valuable silicon real estate and can hinder thread processing due to the delays required to implement thread synchronisation. It is therefore a goal of the present invention to provide a simpler and more efficient thread synchronisation method.
In view of the foregoing, it is a goal of example embodiments of the present invention to provide a programmable shared resource multi-thread processing array in which individual heterogeneous function blocks (both coarse and fine grained) can be interconnected in any combination to implement the desired algorithm. The architecture of example embodiments of the present invention enables the processor array to be reconfigured to implement different processing architectures, such as Single Instruction Multiple Data (SIMD), Multiple Instruction Multiple Data (MIMD), symmetric multiprocessing and asymmetric multiprocessing. This level of versatility allows the example embodiments of the present invention to target many spheres of use.
Another goal of example embodiments of the present invention is to optimally utilise the available processing array resources by allowing operations from separate and independent threads to share or utilise the processing resources of the same heterogeneous function block as required without reprogramming on the fly.
Yet another goal of example embodiments of the present invention is to allow independent threads to run asynchronously even though the same heterogeneous function blocks are used by different threads, including when interrupts occur in a particular thread and the suspension of one thread using a shared resource does not affect other threads employing the same resource.
Yet another goal of example embodiments is to reduce the number of program memory accesses.
FIG. 1 shows a logical block diagram of the shared resource multi-thread hardware array processor comprising a processor array 100 according to an example embodiment of the present invention. Each block will be introduced initially before being described in more detail later.
One way to overcome the limitations outlined above would be to have an array of heterogeneous function blocks that are interconnected via a plurality of self-routing switch fabrics (SRSF) 700. The heterogeneous function blocks 500 shown further in FIG. 5 are selected from a plurality of specific function blocks, the plurality of function blocks including function blocks and interfaces for fixed point arithmetic operations, floating point arithmetic operations, logical operations, shift operations, memory, input operations, output operations, bit-level manipulations, combinatorial, programmable logic arrays, synchronous and asynchronous logic. A function block 500 therefore implements a discrete program instruction or a plurality of related functions, for example addition and subtraction. However, one or more macro function blocks 500 can be instantiated in the processor array 100 that implement more complex functions, such as CORDIC, Reduced Instruction Set Computer (RISC) cores, and data block transforms, such as Fast Fourier Transforms (FFTs), Inverse Fast Fourier Transforms, Discrete Cosine Transforms (DCTs), Discrete Hilbert Transforms, linear algebra methods, and correlation and convolution functions. In addition, a macro function block 500 can implement control functions, such as for loops, do-while loops, if-else functions and case statements. This approach allows C-type language constructs to be easily mapped to the resources provided in a processor array 100. As described later, a function block 500 can contain a plurality of arithmetic logic elements 560 that can be interconnected via a local switch fabric 550, enabling many operations to be performed in parallel and in a single clock cycle.
A function block requiring N operands, where N is an integer, would connect to N outputs of a particular self-routing switch fabric 700. For example, a multiplier having two operand inputs would have each input connected to an output port of a self-routing switch fabric. The output of a function block is connected to an input of a self-routing switch fabric. Each output port of a preferred self-routing switch fabric is buffered (buffered output port) in order to allow a plurality of inputs to transfer input data tokens (tokens are described in more detail later) to a single output port without causing any delays in the processing of subsequent input data tokens on any of the plurality of input ports. Each self-routing switch fabric is therefore non-blocking. In another embodiment, the self-routing switch fabrics can be blocking. Each output port has a specific address enabling data tokens from different sources to be routed to any chosen output port and hence function block.
The processor array 100 also contains a plurality of thread coordinator units 600 that are used to load program data as well as initiate, maintain and terminate thread execution. In order to implement the various operations or instructions in a given algorithm, resultant data output from one function block is formatted into a token and is then passed to the input of the next function block in the algorithm sequence. All token transfers are performed automatically via the self-routing switch fabric, which enables out-of-order or out-of-sequence processing to be implemented. As such, the route through concatenated function blocks represents the algorithm to be implemented. As the operation of each function block is implicit by definition (for example, an adder function block performs additions and a barrel shifter function block performs shifts on its input data), there is no need to have a centralised instruction control unit issuing commands to the various function block resources. This reduces the number of program memory and/or cache accesses, which can be significant when large program loops are being executed.
Data tokens are passed between each function block based on a unique address attached/appended to the output data of each function block that routes the resultant data token to the next function block. The attached address is also referred to as a routing tag and each function block is an addressable function block. The newly formatted data is referred to as a token and can take different forms as described later. A self-routing switch fabric 700 provides the routing of the data tokens between the function blocks. This allows different threads to operate asynchronously and independently of each other. The term self-routing switch fabric used herein refers to any switch fabric having a plurality of ingress ports and egress ports, wherein input data received at an ingress port can be routed automatically to one or a plurality of selected buffered queues based on an address or routing tag appended to the received ingress data. The self-routing switch fabric is preferably non-blocking. In another embodiment, blocking self-routing switch fabrics may be used.
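The routing-tag mechanism described above can be illustrated with a minimal software sketch. It is a behavioural model only, not the hardware of the embodiments: the names Token, SelfRoutingSwitchFabric and shifter_block, the port addresses 0x10 and 0x20, and the representation of a routing tag as a plain integer are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    routing_tag: int  # address of the destination buffered output port (assumed encoding)
    data: int


class SelfRoutingSwitchFabric:
    """Illustrative model: one FIFO per addressed, buffered output port."""

    def __init__(self, port_addresses):
        self.ports = {addr: deque() for addr in port_addresses}

    def route(self, token):
        # Only the routing tag is inspected; no central controller is consulted.
        self.ports[token.routing_tag].append(token)


def shifter_block(token, next_tag):
    # The operation (shift left by one) is implicit in the block itself;
    # the block only attaches the routing tag of the next function block.
    return Token(next_tag, token.data << 1)


fabric = SelfRoutingSwitchFabric(port_addresses=[0x10, 0x20])
fabric.route(Token(0x10, 7))                    # token addressed to the shifter input
arrived = fabric.ports[0x10].popleft()          # shifter reads its buffered port
fabric.route(shifter_block(arrived, next_tag=0x20))
result = fabric.ports[0x20].popleft()
```

In this sketch the sequence of routing tags (0x10, then 0x20) is the only place where the order of operations is expressed, which mirrors the statement that the route through the function blocks represents the algorithm.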
In another embodiment, data transfers between function blocks and switch fabrics and vice versa take the form of data block transfers or Direct Memory Access (DMA) style transfers. A block of data consists of K concatenated data words, where K is an integer. Such a block then has a single routing tag attached. These block transfers are more efficient than appending a separate routing tag to each data word. In order to facilitate block transfers, a switch fabric will route each data word of a block from an ingress port to an egress port on a clock cycle by clock cycle basis and maintain the path between the ingress and egress port until all data from a block has been transferred. The path between the ingress and egress port will be established based on the address fields in the attached routing tag. There are several methods to establish when the last data word of a block has been transferred so the switch fabric can then close the path and establish new ingress to egress paths through the switch fabric. One method is to set the token type field 3A to type block data transfer 3O (which includes the block length) so a switch fabric can count the number of data words transferred. A more efficient method would be to append a condition data field 3C set to end of block 3Q to the end of the block to indicate that the last data word has been processed. Examples of applications where block transfers would be used are DCTs, FFTs, image processing and audio processing, where data is processed in blocks. In another embodiment, the routing tag and data word can be transferred in parallel on separate buses. In order to prevent congestion, the length of a block can be limited. However, the chosen block length will depend on the application, the number of switching resources and simulation results.
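The two ways of detecting the end of a block described above (counting words against a block length, or recognising an end of block marker) can be sketched as follows. This is a simplified behavioural model under stated assumptions: the class BlockSwitch, the sentinel END_OF_BLOCK and the method names are hypothetical, and one call to transfer_word stands in for one clock cycle.

```python
from collections import deque

END_OF_BLOCK = object()  # stands in for a condition field set to "end of block"


class BlockSwitch:
    """Illustrative model of a fabric that holds a path for a whole block."""

    def __init__(self, egress_addresses):
        self.egress = {addr: deque() for addr in egress_addresses}
        self.open_paths = {}  # ingress port -> [egress address, words remaining or None]

    def open_block(self, ingress, routing_tag, length=None):
        # A single routing tag establishes the ingress-to-egress path.
        self.open_paths[ingress] = [routing_tag, length]

    def transfer_word(self, ingress, word):
        path = self.open_paths[ingress]
        if word is END_OF_BLOCK:
            del self.open_paths[ingress]      # marker method: close the path
            return
        self.egress[path[0]].append(word)
        if path[1] is not None:
            path[1] -= 1
            if path[1] == 0:
                del self.open_paths[ingress]  # counting method: close the path


switch = BlockSwitch(egress_addresses=[5, 6])

# Block terminated by an end-of-block marker
switch.open_block(ingress=0, routing_tag=5)
for word in [1, 2, 3, END_OF_BLOCK]:
    switch.transfer_word(0, word)

# Block whose length (K = 2) is known up front
switch.open_block(ingress=1, routing_tag=6, length=2)
for word in [10, 20]:
    switch.transfer_word(1, word)

marker_block = list(switch.egress[5])
counted_block = list(switch.egress[6])
```

Both blocks carry a single routing tag for all of their words; the model differs only in how it decides that the path can be released.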
Different operands required to perform an operation that arrive at the inputs of a function block from different routes are automatically synchronised before each operand is presented to their respective function block inputs, for example operand A plus operand B when using a two input adder. Thread synchronisation will be explained in more detail later. When the last operation/instruction in a particular thread has been performed, then the associated function block issues a thread complete token, which is routed back to the initiating thread coordinator block. These thread coordinator tokens can be routed back to a thread coordinator unit either via the same self-routing switch fabric used to route the data tokens or a separate self-routing switch fabric dedicated to the purpose.
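As a rough illustration of operand synchronisation, the sketch below models a two input adder whose inputs are fed from two queues and which only fires when neither queue is empty. The class name TwoInputAdder and its methods are assumptions for illustration; in hardware the combination of empty flags would be performed by logic rather than by software.

```python
from collections import deque


class TwoInputAdder:
    """Illustrative model: operands are consumed only when both have arrived."""

    def __init__(self):
        self.queue_a = deque()  # buffered output port feeding operand A
        self.queue_b = deque()  # buffered output port feeding operand B

    def ready(self):
        # Combine the two empty flags: fire only when neither queue is empty.
        a_empty = len(self.queue_a) == 0
        b_empty = len(self.queue_b) == 0
        return not (a_empty or b_empty)

    def step(self):
        if not self.ready():
            return None  # wait; other threads and blocks are unaffected
        return self.queue_a.popleft() + self.queue_b.popleft()


adder = TwoInputAdder()
adder.queue_a.append(4)   # operand A arrives first, via one route
first = adder.step()      # operand B has not arrived yet
adder.queue_b.append(6)   # operand B arrives later, via a different route
second = adder.step()
```

Here the first call to step yields no result because operand B is missing, and the second yields the sum once both operands are present.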
The output buffer of each self-routing switch port can be configured to implement a plurality of output queues, referred to as thread queues. These queues also have a specific address and are operated on a first-in first-out (FIFO) basis. A queue is associated with a particular thread (referred to as a thread queue or queue for short) and by providing different queues at each output port the same function block can be used by different threads. The scheduling of the output queues is programmable and based on algorithm needs. This can be determined at design time through simulation using Electronic Design Automation (EDA) tool chain 1000, explained below with reference to FIG. 17. The scheduling strategies include, but are not limited to, first come first served, round robin, weighted round robin and priority queues. For example, thread coordinator tokens could be given a higher priority than data tokens, as there will be fewer of them and they are more important in terms of thread control and execution.
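One of the scheduling strategies named above, weighted round robin, can be sketched as follows for the thread queues of a single output port. The function name, the queue names and the weights are hypothetical; the sketch simply drains the queues and records the order in which tokens would be presented to the shared function block.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Serve up to weights[name] tokens from each thread queue per round.

    `queues` maps a thread queue name to a FIFO (deque) of tokens.
    Returns the order in which tokens are handed to the function block.
    """
    order = []
    while any(queues.values()):          # continue while any queue is non-empty
        for name, fifo in queues.items():
            for _ in range(weights[name]):
                if not fifo:
                    break                # this queue is empty; move to the next one
                order.append(fifo.popleft())
    return order


queues = {
    "thread0": deque(["a0", "a1", "a2"]),
    "thread1": deque(["b0", "b1", "b2"]),
}
served = weighted_round_robin(queues, {"thread0": 2, "thread1": 1})
```

With weights of 2 and 1, thread0 receives two slots for every slot given to thread1 while both queues hold tokens, and each queue remains FIFO within itself.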
Several function block resources can be considered local if they are interconnected using the same basic self-routing switch fabric 700. Such a structure is referred to as a level-1 function block and the self-routing switch fabric interconnecting them a level-1 switch fabric. In another embodiment described later (see FIG. 13), a function block 500 can contain a plurality of arithmetic logic elements 560 interconnected via a local switch fabric 550. A group of level-1 function blocks can be interconnected using another self-routing switch fabric. This switch fabric is referred to as a level-2 switch fabric and the grouped function block a level-2 function block. A plurality of level-2 function blocks can then be tiled and themselves interconnected by separate self-routing switch fabrics. Those familiar with the art will recognise that various switching architectures can be constructed, such as fractal, hypercube, butterfly fat tree or hierarchical switch structures enabling different shared resource multi-thread processor arrays 100 to be implemented.
When implementing different algorithms it becomes apparent that certain operations/instructions occur more frequently than others. For example, most DSP based algorithms rely heavily on multiplies and accumulates or MACs. Function blocks 500 that implement frequently used operations are collectively referred to as frequent functions blocks 107. However, other functions, such as barrel shifting, truncation, look-up tables or normalisation, may be required but occur relatively infrequently. Function blocks 500 that implement infrequently used operations are collectively referred to as infrequent functions blocks 108. It would be a very inefficient use of silicon real estate to provide these infrequent functions locally or in every processing element. An alternative would be to implement several of these less used or infrequent operations as function blocks and allow them to be accessed universally from any other function block or thread coordinator unit 101 on a device. This would then lead to a better and more efficient use of available resources by reducing the overall gate count.
Interface blocks 104 are used to transfer data to and from external circuits. Data and control signals 106 are provided to Interface blocks 104, which are closely coupled to memory based function blocks 500 and thread coordinators 600. Various types of Interface blocks 104 are provided on the processor array 100 to cater for different interface protocols. Likewise, an Interface block 104 can be constructed from a group of programmable interconnected function blocks, enabling the Interface block 104 to be configured to implement one of a plurality of interface protocols.
In an example embodiment, flow control is provided within the self-routing switch fabrics 700 to prevent queue overflow and loss of data. Programmable queue management means are employed so flow control tokens are issued if a particular queue reaches a programmable predefined level. The flow control tokens are routed back to the thread initiator instructing it to “slow down” i.e. reduce the rate at which it issues thread initiator tokens for a determined number of clock cycles. Likewise, the scheduling of tokens from an output queue can be based on the queue level and queue output slots can be stolen from lower priority queues if the need arises. This situation could occur due to uneven or bursty data flows, for example when interrupts occur or data output varies when implementing a compression algorithm.
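The threshold-based flow control described above might be modelled as in the following sketch. The class ThreadQueue, the tuple used to represent a flow control token and the address 0x01 for the thread initiator are illustrative assumptions; how the initiator reduces its issue rate is not modelled.

```python
from collections import deque


class ThreadQueue:
    """Illustrative queue that raises flow control at a programmable level."""

    def __init__(self, threshold, initiator_address):
        self.fifo = deque()
        self.threshold = threshold                  # programmable predefined level
        self.initiator_address = initiator_address  # where flow control tokens are routed

    def push(self, token):
        self.fifo.append(token)
        if len(self.fifo) >= self.threshold:
            # Flow control token routed back to the thread initiator,
            # asking it to reduce its issue rate.
            return ("FLOW_CONTROL", self.initiator_address)
        return None


queue = ThreadQueue(threshold=3, initiator_address=0x01)
responses = [queue.push(n) for n in range(4)]
```

In this model no flow control token is produced for the first two tokens, and one is produced for each token that arrives once the queue level has reached the threshold.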
According to the present invention there is provided a processor array, wherein individual instructions or groups of instructions for one or a plurality of threads are mapped to function blocks of corresponding functionality from an array of addressable heterogeneous function blocks, the same instructions from different threads are optimally mapped to the same function blocks so they share a function block's processing resources, each input port of a N input function block, where N is an integer greater than or equal to 1, is connected directly to a buffered output port of a self-routing switch fabric, each buffered output port being configured to implement one or a plurality of independent thread queues, each thread queue having at least an empty flag output, where one or more groups of Q empty flag outputs, where Q is an integer greater than or equal to 1 and can be a different value for each group, are logically combined by programmable circuit means to form one or more groups of synchronised thread queues, tokens read simultaneously by thread queue scheduler means from the selected group or groups of synchronised thread queues is input directly on selected inputs of an N input function block, resultant data from a function block is formatted into a token by at least having a routing tag appended, the said token being automatically routed via the self-routing switch fabric to a thread coordinator or the next function block in the thread sequence, each thread being initiated, maintained and terminated by a thread coordinator issuing and decoding tokens.
Further features of the invention, its nature and various advantages will become readily apparent from the following detailed description of the invention and the embodiments thereof, from the claims and from the accompanying drawings.