This is a national stage application of PCT/EP98/06948 filed Nov. 3, 1998, which claims the priority of German Application 197 56 591.3 filed Dec. 18, 1997.
The present invention relates to processor architectures and in particular to an architecture for connecting a plurality of functional units in a processor.
In a processor a calculation task is achieved by the cooperative effort of several so-called functional units (F.U.), each functional unit performing a specific task. Some functional units perform special calculations, e.g. the addition or multiplication of numbers. Other functional units are able to store values, e.g. memories and register files. Yet other functional units carry out logic functions, e.g. an inversion, an AND operation and so on. Finally there are functional units which perform a communication with the “outside world”, an example being bus interfaces. The functional units operate in parallel and deliver their results via outputs to a connection network, to which the inputs of the functional units are also connected, to obtain input values. The values which are output by functional units can thus be used again by other functional units or even by the same functional unit, e.g. in an iterative calculation, where the result is also used again as input value.
Most of the processors known today have very few separate functional units, chiefly due to the limited space available on the chips on which they are implemented. A known connection network is e.g. a so-called crossbar switch, which permits each functional unit to forward a result which it has calculated to every other functional unit.
FIG. 6 shows such a known processor implementation with n functional units 100, each having two inputs 102 and one output 104. The outputs 104 are connected to a crossbar switch 106, which is shown schematically in FIG. 6. It is thus possible to feed every value at an output of a functional unit 100 into each and every input of the same or another functional unit 100.
A parallel processor with a processor array and a crossbar switch is known, e.g., from U.S. Pat. No. 5,123,109. There the crossbar switch or router is connected to a data generation circuit and to a data receiving circuit of the processors in the processor array so as to make it possible to transmit information between the processors in the array in accordance with the respective addresses in response to router control signals from a control circuit. The connection circuit connects the data generator and the data receiver of each processor to a data receiver and a data generator of neighbouring processors in the array so as to make it possible to transfer data between each processor and at least one of the neighbouring processors simultaneously in response to control signals from the control circuit.
A disadvantage of this circuit is that the two-dimensional connection network in the crossbar switch becomes too complicated for a large number of processors or memories if numerous connections must be established randomly. Furthermore, connecting n functional units requires a crossbar switch with n² connection points. As the number of functional units increases, the crossbar connection network increasingly dominates the size, complexity and cost of the entire processor.
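The quadratic growth noted above can be made concrete with a short calculation. The following Python sketch simply evaluates the n-squared relationship for a few values of n; the function name and the chosen unit counts are illustrative assumptions and are not taken from the cited prior art.

```python
def crossbar_connection_points(n: int) -> int:
    """Return the number of connection points of a crossbar for n functional units.

    Uses the n-squared relationship stated in the text; the function name is
    hypothetical and chosen only for this illustration.
    """
    if n < 0:
        raise ValueError("number of functional units must be non-negative")
    return n * n


if __name__ == "__main__":
    # Illustrative unit counts (assumed, not taken from the text).
    for n in (4, 16, 64):
        print(f"{n} functional units -> {crossbar_connection_points(n)} connection points")
```

Under this relationship, doubling the number of functional units quadruples the number of connection points.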
As a solution to the problem of the no longer justifiable number of connection points in a crossbar switch with a large number of functional units, U.S. Pat. No. 4,968,977 proposes that, instead of a single crossbar switch, a plurality of expandable crossbar modules be used, each of which provides a set of connections or defined mappings between the sets of input and output nodes and where each output is defined in terms of just one single input. In addition, each crossbar module is connected to a separate input and output port via which the module is connected to other identically configured modules if additional nodes are to be integrated into the system. This modular construction enables an existing design to be expanded modularly without redesigning the whole connection network.
U.S. Pat. No. 5,655,133 discloses a massively multiplexed central processing unit with a plurality of independent calculating circuits in Harvard architecture with a separate internal result bus for transferring a resultant output from each calculating circuit and with a plurality of all-purpose registers which are coupled to each calculating circuit. Each all-purpose register has multiplexed input ports, which are connected to every result bus. Each all-purpose register also has an output port, which is connected to a multiplexed input port of at least one calculating circuit. The calculating units are here coupled in parallel to a main data bus in such a way that data which flow on the main data bus are simultaneously available for each calculating unit.
U.S. Pat. No. 4,952,930 relates to complete computer networks and discloses a hierarchical multipath network. Here a multipath network, i.e. a network in which a number of paths exist between a data source and a data destination, e.g. the Internet, is described as a hierarchical connection structure between a plurality of sources and a plurality of destinations. The hierarchy comprises a first multipath network without storage, which consists of two or more stages and which provides a quick path for connecting a source to a destination. At least one second multipath network with storage and a plurality of stages constitutes an alternative slower path for connecting a source to a destination if a connection between the source and the destination over the first, fast path is blocked. The address field of a message from a source is investigated at each stage so as to select a suitable connection to the next stage, the message being passed on to the second stage without the address field if the connection is available and being stopped if the next stage is blocked, a negative acknowledgment being returned to the source. Retransmission of the message over the second network in the hierarchy is initiated on receipt of a negative acknowledgment at the source.
DE 3048414 discloses a circuit arrangement for a data processing system in which a plurality of central subsystems are in data communication via a single input/output multiplexer, a transmission register being assigned to each central subsystem such that an output from a central subsystem is first fed into the transmission register which is assigned to this subsystem before being subsequently transmitted to one of the other central subsystems via the single input/output multiplexer.
It is the object of the present invention to provide a processor architecture in which every functional unit is connected to every other functional unit and in which at the same time attention is paid to an efficient layout of the connection network.
In accordance with a first aspect of the invention, this object is achieved by a device for the hierarchical connection of a plurality of functional units in a processor, comprising a first-order connector with at least two inputs and an output, which is adapted to be operated so as to connect one of the at least two inputs to the output, where the output of the first-order connector is connected to an input of a first functional unit and where an output of a second functional unit is connected to a first input of the at least two inputs of the first-order connector; and a second-order connector with at least one input and an output, which is adapted to be operated so as to connect the at least one input to the output, where the at least one input of the second-order connector is connected to a third functional unit and where a signal which is appliable to the at least one input of the second-order connector is bufferable before it is forwarded to a further input of the first-order connector; where connections which are established by the first-order connector exhibit shorter signal transit times than connections which are established by the second-order connector; where a connection between functional units which is established by the first-order connector is more frequently used in a task performed by the processor than is a connection between functional units which is established by the second-order connector.
In accordance with a second aspect of the invention, this object is achieved by a processor comprising a plurality of functional units; a device for the hierarchical connection of the plurality of functional units including a first-order connector with at least two inputs and an output, which is adapted to be operated so as to connect one of the at least two inputs to the output, where the output of the first-order connector is connected to an input of a first functional unit and where an output of a second functional unit is connected to a first input of the at least two inputs of the first-order connector; and a second-order connector with at least one input and an output, which is adapted to be operated so as to connect the at least one input to the output, where the at least one input of the second-order connector is connected to a third functional unit and where a signal which is appliable to the at least one input of the second-order connector is bufferable before it is forwarded to a further input of the first-order connector; where connections which are established by the first-order connector exhibit shorter signal transit times than connections which are established by the second-order connector; where a connection between functional units which is established by the first-order connector is more frequently used in a task performed by the processor than is a connection between functional units which is established by the second-order connector; and a controller for controlling the functional units and the device for hierarchical connection.
The present invention is based on the finding that in a processor with many functional units the preponderant part of the communication, i.e. of the usage of the connection network, normally occurs between particular functional units and not arbitrarily between all the functional units. Popular connections and less frequently used connections thus exist. According to the present invention this situation is taken into account in the interests of an efficient connection network. This means that the fastest possible communication connections are established for the preferred communication paths whereas the slow communication paths are subject to delay in the interests of the fast, i.e. important or frequently used, communication paths.
To exploit the advantages of the present invention fully, only a few neighbouring functional units are connected essentially directly via a first connector, while functional units located farther away on the chip communicate with one of these neighbouring functional units via a second connector and with buffering. The fast connection of neighbouring functional units also exploits the fact that they are linked by very short wires or conductors, so that the signal transit times are short and high clock frequencies can be used.
A relatively small multiplexer, which thus exhibits a minimal transit time, is preferably employed for the fast connection of functional units which communicate very frequently in comparison with the total activity in the processor. A multiplexer, though one with considerably more inputs than the first-order multiplexer, can also be used as the second connector.
To implement the storage facility between the small first-order connector and the large second-order connector, a register is preferably used, which stores a value received via the second-order connector until it is required by a functional unit connected to the first-order connector. A value is thus buffered for at least one clock cycle. This buffering constitutes the essentially “intentional” retardation of the connections via the second-order connector.
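The interplay of the two connectors and the register can be sketched as a small behavioural model. The Python code below is a simplified illustration under stated assumptions, not the circuit of the invention: the class name `HierarchicalInput`, its methods, and the convention that the register occupies the last input of the first-order connector are all hypothetical.

```python
class HierarchicalInput:
    """Behavioural sketch of one functional-unit input (hypothetical model)."""

    def __init__(self):
        # Register between the second-order and the first-order connector.
        self.register = None

    def clock(self, remote_outputs, remote_select):
        """Second-order connector: select one remote output and latch it.

        Called once per clock cycle; the latched value becomes visible to
        the first-order connector only after this call.
        """
        self.register = remote_outputs[remote_select]

    def read(self, neighbour_outputs, select):
        """First-order connector: pick a neighbour output or the register.

        Indices 0..len(neighbour_outputs)-1 select a neighbour directly;
        index len(neighbour_outputs) selects the buffered remote value.
        """
        inputs = list(neighbour_outputs) + [self.register]
        return inputs[select]


if __name__ == "__main__":
    unit_input = HierarchicalInput()
    neighbours = [7, 11]        # outputs of two neighbouring units (assumed values)
    remote = [100, 200, 300]    # outputs of remote units (assumed values)

    # A neighbour value is available in the same cycle.
    print(unit_input.read(neighbours, 0))
    # The remote value has not been latched yet, so the register is still empty.
    print(unit_input.read(neighbours, 2))
    # After one clock cycle the remote value can be read via the register.
    unit_input.clock(remote, 1)
    print(unit_input.read(neighbours, 2))
```

In this model a neighbouring value is readable immediately, whereas a remote value becomes readable only after the register has been clocked, which corresponds to the buffering of at least one clock cycle described above.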
If it is assumed, however, that e.g. 90% of the connection activity occurs between neighbouring functional units, which are connected quickly via the first-order multiplexer, and only 10% of the connection activity in a processor occurs via the second-order connector and thus also via the memory, the additionally introduced wait cycle has only a minor effect on the total processor working time since values normally stored in the memory must be gathered from more remote functional units, which may involve a time of the order of the wait cycle. The present invention leads instead to a reduction in the processor working time in comparison with a processor with a large crossbar switch, which, due to its size, entails increased signal transit times.
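The effect of the wait cycle on the average transfer time can be estimated with a weighted average. The sketch below assumes, purely for illustration, that a first-order connection takes one clock cycle and a second-order connection takes two (one cycle plus the wait cycle). These cycle counts and the function name are assumptions; only the 90%/10% split is taken from the example above.

```python
def average_transfer_cycles(fast_fraction, fast_cycles=1, slow_cycles=2):
    """Return the mean number of clock cycles per transfer.

    fast_fraction: share of transfers using the first-order connector.
    fast_cycles, slow_cycles: assumed cycle counts for the two paths
    (hypothetical defaults, not taken from the text).
    """
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must lie between 0 and 1")
    slow_fraction = 1.0 - fast_fraction
    return fast_fraction * fast_cycles + slow_fraction * slow_cycles


if __name__ == "__main__":
    # 90% of the connection activity via the first-order connector.
    print(average_transfer_cycles(0.9))
```

Under these assumptions the average is roughly 1.1 cycles per transfer, i.e. about 10% above the fast path alone. The estimate does not account for any difference in achievable clock frequency compared with a large crossbar switch.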
To exploit the advantages of the hierarchical connection network for connecting a plurality of functional units in a processor optimally, the processor must be programmed, and the functional units designed, in such a way that as many calculations as possible can be performed by neighbouring functional units. The present invention will therefore exhibit the greatest benefits as regards the complexity of the connection network and the speed of the processor in the case of application-oriented processors, such as those used in graphics processing equipment and the like, in which certain calculation structures occur very frequently.