Certain types of computing devices, such as network interface cards supporting MPI (Message Passing Interface), make extensive use of item searches in lists.
Such a search involves, for example, listing the messages expected by a communication network node, with an indication of their respective storage spaces, and comparing each of the node's incoming messages with those on the list. Thus, when a message arrives, it can be sent to its storage space to be processed.
Traditionally, each incoming message has a label that must be compared to the labels of the messages in the list. The labels of the messages in the list may be masked so that the comparison of labels is performed on a reduced number of bits.
When a message arrives, its label is compared to that of the first item in the list, then the second, then the third, and so on, until a matching label is found.
When this happens, the incoming message is sent to the storage space, and the matching item in the list is deleted. The list is then updated.
The list of expected messages is therefore a list that is dynamically modified with items that can be removed (when a corresponding message arrives) or added (when a new message is expected).
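The search described above can be sketched as follows. This is only an illustrative model, not the implementation of any particular interface: the `ExpectedMessage` type, the `match_incoming` function, the storage names, and the convention that a mask bit set to 1 means "compare this bit" are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExpectedMessage:
    label: int    # label the expected message must carry
    mask: int     # assumed convention: bits set to 1 are compared, bits set to 0 are ignored
    storage: str  # hypothetical name of the destination storage space


def match_incoming(expected: List[ExpectedMessage], label: int) -> Optional[ExpectedMessage]:
    """Walk the list in order; remove and return the first item whose masked label matches."""
    for index, item in enumerate(expected):
        # XOR exposes the differing bits; the mask keeps only the bits that matter.
        if ((item.label ^ label) & item.mask) == 0:
            return expected.pop(index)  # the matching item is deleted from the list
    return None  # no expected message corresponds to this label


# Hypothetical list of expected messages.
expected = [
    ExpectedMessage(label=0x12, mask=0xFF, storage="buffer_a"),
    ExpectedMessage(label=0x30, mask=0xF0, storage="buffer_b"),
]

# The incoming label 0x37 matches the second item because only its high four bits are compared.
hit = match_incoming(expected, 0x37)
```

In this sketch the cost of a search grows with the position of the matching item in the list, which is why the time spent on each comparison matters.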
The implementation of this type of search requires the execution of complex list traversal and list management algorithms. In addition, these algorithms are usually implemented with a large number of options to manage.
As a result, in computing devices, particularly MPI-type interfaces, a processor dedicated to this type of operation is required. With a dedicated processor, searching for items in a list (or matching, as it is also called) can be managed in software rather than in hardware. This offers greater flexibility because the computer code directing the processor (also known as microcode or firmware) can evolve to reflect modifications to the interface specification, for example.
To obtain the best performance from such a processor, its execution time, and therefore the number of operating cycles it uses, should be reduced. The execution time of a process in the processor directly affects the throughput of messages managed by the interface.
The writing of firmware by developers should also be made easier. Firmware is written in assembly language and therefore does not benefit from the high-level control structures offered by other types of language. An error in assembly code can have serious and direct consequences on the processor, with no means of containing the error.
It may also be desirable to keep the machine instructions executed by the processor to a reasonable size.
The document by Hemmert et al., “An architecture to perform NIC Based MPI Matching”, discloses a processor that uses predicates to control the flow of the machine instructions executed. The machine instructions are executed according to the values held in predicate registers, which store logical combinations (of the AND and OR type) of bitwise comparison results. The predicate registers represent the conditions to be fulfilled for the instructions to be executed.
In this document, flow is controlled by branch instructions according to the value of one predicate register bit. As is known, a branch consists in skipping part of a sequence of instructions: instead of executing the next instruction in the code, execution passes directly to an earlier or later instruction. The branch can therefore be made forward or backward in the computer code.
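The principle of a branch driven by a predicate bit can be illustrated with a toy interpreter. This is a minimal sketch under stated assumptions and does not reproduce the instruction set of the cited processor: the tuple encoding of instructions, the `run` function, and the comparison names `tag_equal` and `source_equal` are hypothetical.

```python
def run(program, predicates):
    """Execute a toy program whose control flow relies on predicate-driven branches.

    Each instruction is a tuple (hypothetical encoding):
      ("op", name)                   -> record that operation `name` was executed
      ("branch", pred_index, target) -> jump to `target` if predicates[pred_index] is True
    """
    executed = []
    pc = 0  # program counter
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "branch":
            _, pred_index, target = instr
            if predicates[pred_index]:
                pc = target  # the target may lie before or after the branch
                continue
        else:
            executed.append(instr[1])
        pc += 1  # default case: fall through to the next instruction
    return executed


# Predicate registers hold AND / OR combinations of comparison results.
cmp_results = {"tag_equal": True, "source_equal": False}
predicates = [
    cmp_results["tag_equal"] and cmp_results["source_equal"],  # predicate 0 (AND)
    cmp_results["tag_equal"] or cmp_results["source_equal"],   # predicate 1 (OR)
]

program = [
    ("op", "load"),
    ("branch", 1, 3),   # forward branch taken when predicate 1 is True
    ("op", "skipped"),  # not executed when the branch is taken
    ("op", "store"),
]
trace = run(program, predicates)
```

In this model every taken branch interrupts the sequential flow, which is the operation whose cost in cycles is at issue for this kind of processor.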
To extract the execution options from the instructions, the comparisons are made by a ternary comparison unit (TALU), which compares two values with a compare mask.
However, this type of processor has a number of drawbacks.
For example, the number of cycles necessary to execute a code is high. This is mainly due to the widespread use of branching as a means of control. The document reports two cycles to perform a branch. However, the processor concerned is a study processor with single-cycle memory access and without an error correction code (of the ECC type, for example). Such a processor cannot realistically be used in industrial applications, where five cycles are generally necessary to execute a branch.
Furthermore, the processor disclosed uses a conventional arithmetic logic unit (ALU) and a ternary arithmetic unit (TALU), and calculations cannot be performed in parallel. The size of the instructions is therefore not put to good use, even though at 164 bits it would normally allow instructions to be executed in parallel.