Processor arrays that contain a number of separate but interconnected processor elements are known. One such processor array is the picoArray™ architecture produced by the applicant of the present application and described in International publication WO 02/50624. In the picoArray™ architecture, the processor elements are connected together by a proprietary bus that includes switch matrices.
The software description of a digital signal processing (DSP) system comprises a number of processes that communicate with point-to-point or point-to-multipoint signals. Each signal has a fixed bandwidth, known as its slot rate, which has a value that is a power of two in the range 2 to 1024, in units of the picoArray™ cycle. Thus, a slot rate of four means that slots must be allocated on the bus between a sending processor element and the receiving processor element(s) every four system clock cycles.
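By way of illustration only, the slot rate constraint described above can be sketched in a few lines of Python. The function names and the 1024-cycle frame length used as a default are assumptions made for this sketch and are not taken from any picoArray™ tool.

```python
def is_valid_slot_rate(rate: int) -> bool:
    """Return True if rate is a power of two in the range 2 to 1024."""
    return 2 <= rate <= 1024 and (rate & (rate - 1)) == 0


def slots_per_frame(rate: int, frame_cycles: int = 1024) -> int:
    """Number of bus slots a signal needs within frame_cycles clock cycles.

    frame_cycles is a hypothetical parameter chosen for this example.
    """
    if not is_valid_slot_rate(rate):
        raise ValueError(f"slot rate {rate} is not a power of two in 2..1024")
    # One slot is needed every `rate` cycles.
    return frame_cycles // rate
```

For example, a signal with a slot rate of four needs one slot every four cycles, which is 256 slots in a 1024-cycle frame, whereas a signal with a slot rate of 1024 needs only one.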
A partitioning procedure can be used to allocate groups of processes to each of the processor arrays in the system. A placement procedure can be used to allocate each process to a specific processor element within its allocated processor array. A switching or routing procedure determines the multiplexing of the signals on to the physical connections of the bus in the processor array.
The placement and switching procedure takes a user's abstract design, which consists of processes and signals, places each process onto a processor element on a picoArray™, and routes all of the signals using the switching matrix of the picoArray™. This procedure must be carried out in a way that maximizes the number of processor elements that can be used within a given picoArray™ and that minimizes the length of the routing needed for the signals.
The placement and the routing steps are generally performed separately. For example, a candidate placement is created and then the signals are routed using that placement.
The output of the placement and switching procedure is a “load file” which contains configuration data for a single picoArray™.
The proprietary bus used in picoArrays™ is a time division multiplexed (TDM) structure in which communication timing is determined at “compile time”. In other words, there is no dynamic arbitration.
The bus comprises a set of “switches” placed throughout the processor array, and these switches are either in-line with the processor elements (see FIG. 1(a)), or offset (see FIG. 1(b)).
In-line switches are easier to use for placement and routing algorithms since the regularity makes it easier to compute distances between processor elements. With offset switches, each row of processor elements is connected to two rows of switches, and therefore it is possible to communicate between adjacent rows by only traversing one switch, whereas in-line switches require the traversal of two switches.
However, for offset switches, each processor element is connected to two bus connections and only one of these can be used to provide this single switch transfer. If that direction becomes blocked (perhaps by another signal) then the other direction must be used, and this requires the traversal of three switches. For in-line switches, the two possible directions both require the traversal of two switches.
Thus, it is easier to predict “bus costs” before routing is actually performed if in-line switches are used.
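The difference in predictability can be made concrete with a small sketch that encodes only the switch counts stated above for communication between adjacent rows. The function names and the string labels for the two arrangements are hypothetical.

```python
def adjacent_row_switches(arrangement: str, preferred_direction_free: bool = True) -> int:
    """Switches traversed between adjacent rows, per the counts given in the text."""
    if arrangement == "in-line":
        # Both possible directions traverse two switches.
        return 2
    if arrangement == "offset":
        # One switch in the favourable direction, three if it is blocked.
        return 1 if preferred_direction_free else 3
    raise ValueError(f"unknown arrangement: {arrangement!r}")


def switch_count_range(arrangement: str) -> tuple:
    """Best and worst case switch counts before routing is known."""
    counts = [adjacent_row_switches(arrangement, free) for free in (True, False)]
    return min(counts), max(counts)
```

With in-line switches the range collapses to a single value, so the cost is known before routing; with offset switches the cost depends on whether the favourable direction turns out to be blocked.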
The routing procedure (which takes place after the placement procedure) requires a tool that can determine the contents of routing tables within each of the switches that make up the picoBus structure from the signals that need to be routed. Each routing table consists of a set of entries that indicate the routing for each clock cycle. The set of entries is repeated every N clock cycles. In addition, some of the entries can be repeated at a lower frequency to provide communications at lower rates while reducing the size of the routing tables that are required.
In currently available picoArrays™, N is 1024. This is implemented as a table of 124+(4×8) entries. The main part of the table, which comprises the 124 entries, is repeated once every 128 clock cycles. The 8 blocks of 4 entries are repeated every 1024 clock cycles and are known as the “hierarchical” entries.
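One way to picture this table layout is as a mapping from a clock cycle to one of the 124+(4×8)=156 stored entries. The sketch below assumes, purely for illustration, that the four hierarchical slots occupy the last four cycles of each 128-cycle period and that the eight blocks are used in order; the text does not specify this arrangement, and the names used are hypothetical.

```python
MAIN_ENTRIES = 124      # entries repeated every 128 cycles
HIER_BLOCKS = 8         # number of hierarchical blocks
HIER_BLOCK_SIZE = 4     # entries per hierarchical block
MAIN_PERIOD = 128       # cycles between repeats of the main entries
FULL_PERIOD = 1024      # N: cycles between repeats of the whole table


def table_index(cycle: int) -> int:
    """Map a clock cycle to a stored routing-table entry (assumed layout)."""
    phase = cycle % MAIN_PERIOD
    if phase < MAIN_ENTRIES:
        # Main part of the table: same entry every 128 cycles.
        return phase
    # Assumption: the remaining 4 cycles of each period select one of 8 blocks.
    block = (cycle % FULL_PERIOD) // MAIN_PERIOD
    return MAIN_ENTRIES + block * HIER_BLOCK_SIZE + (phase - MAIN_ENTRIES)
```

Under this assumed layout, 156 stored entries cover all 1024 cycles: a main entry recurs every 128 cycles, while each hierarchical entry recurs only once every 1024 cycles.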
The present application is concerned with a procedure for placing or allocating the processes to the processor elements. Therefore, it is assumed that any partitioning procedure has been carried out before the placement algorithm is started.
According to an aspect of the invention, there is provided a method for placing a plurality of processes onto respective processor elements in a processor array, the method comprising (i) assigning each of the plurality of processes to a respective processor element to generate a first placement; (ii) evaluating a cost function for the first placement to determine an initial value for the cost function, the result of the evaluation of the cost function indicating the suitability of a placement, wherein the cost function comprises a bandwidth utilization of a bus interconnecting the processor elements in the processor array; (iii) reassigning one or more of the processes to respective different ones of the processor elements to generate a second placement; (iv) evaluating the cost function for the second placement to determine a modified value for the cost function; and (v) accepting or rejecting the reassignments of the one or more processes based on a comparison between the modified value and the initial value.
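Steps (i) to (v) can be sketched as a simple iterative improvement loop. The sketch below is illustrative only and rests on several assumptions that are not taken from the text: processor elements are identified by (row, column) coordinates, the bandwidth utilization of the bus is approximated by summing, for each signal, the Manhattan distance between its endpoints divided by its slot rate, a reassignment swaps the contents of two randomly chosen processor elements, and a reassignment is accepted only if it lowers the cost. All names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]
Signal = Tuple[str, str, int]  # (source process, destination process, slot rate)


def bus_cost(placement: Dict[str, Coord], signals: List[Signal]) -> float:
    """Hypothetical proxy for bus bandwidth utilization (an assumption)."""
    total = 0.0
    for src, dst, rate in signals:
        (r1, c1), (r2, c2) = placement[src], placement[dst]
        # A signal with slot rate `rate` uses 1/rate of each link it crosses.
        total += (abs(r1 - r2) + abs(c1 - c2)) / rate
    return total


def improve_placement(processes: List[str], elements: List[Coord],
                      signals: List[Signal], iterations: int = 1000,
                      seed: int = 0) -> Tuple[Dict[str, Coord], float]:
    if len(elements) < 2 or len(processes) > len(elements):
        raise ValueError("need at least two elements and one per process")
    rng = random.Random(seed)
    # (i) First placement: processes fill the elements in order; rest are free.
    slots: List[Optional[str]] = list(processes) + [None] * (len(elements) - len(processes))

    def as_placement() -> Dict[str, Coord]:
        return {p: elements[i] for i, p in enumerate(slots) if p is not None}

    # (ii) Initial value of the cost function.
    cost = bus_cost(as_placement(), signals)
    for _ in range(iterations):
        # (iii) Reassign by swapping the contents of two processor elements.
        i, j = rng.sample(range(len(slots)), 2)
        slots[i], slots[j] = slots[j], slots[i]
        # (iv) Modified value of the cost function.
        new_cost = bus_cost(as_placement(), signals)
        # (v) Accept if the cost fell (assumed rule), otherwise undo the swap.
        if new_cost < cost:
            cost = new_cost
        else:
            slots[i], slots[j] = slots[j], slots[i]
    return as_placement(), cost
```

The strictly greedy acceptance rule is only the simplest comparison consistent with step (v); other comparisons between the modified value and the initial value could be substituted without changing the structure of the loop.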