The invention relates to an interconnect, and to an interconnect architecture, for communicating between processing elements and memory modules in a computer system comprising on-chip parallel computation, in order to reduce the tight synchrony that is required by important components of most present computers.
The processing architecture employed by today's personal computers is based on the von Neumann architecture developed in the late 1940s (Hennessy, J. L. et al., "Computer Architecture: A Quantitative Approach," 2nd Edition 1996 (Morgan Kaufmann Publ., San Francisco, Calif.)). Originally, the architecture presumed that processing follows a set of sequentially executed instructions, without any concurrent operations. While architecture implementations based on such a presumption have been effective in the past, the number of transistors on an integrated circuit (chip) continues to double every 1-2 years, and will eventually outstrip the resources provided by the von Neumann architecture. For this reason, instruction-level parallelism (ILP) architecture implementations are being developed that permit "pipelining" (the execution of instructions in stages such that different instructions may be at different stages of processing at the same time) or "multiple-issue" (the issuance of multiple instructions at the same time).
A further advancement has involved the use of multiple "execution" threads. Such threads are sets of instructions, controlled by several program counters, which operate concurrently (R. Alverson et al., "The Tera Computer System," Int. Conf. on Supercomp., 1-6 (1990); D. M. Tullsen et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism," In Proc. 22nd ISCA (1995)).
Pipeline designs store and process data as it passes along pipeline stages. They can be clock driven or event driven, depending upon whether their parts act in response to an external clock or act independently in response to local events. Pipeline throughput can be either inelastic (i.e., input rate fixed) or dynamic (i.e., input rate may vary). The separate pipeline stages are capable of concurrent operation. Thus, pipelining provides high speed and is a common paradigm for very high speed computing machinery (Sutherland, I. E., "Micropipelines," Communications of the ACM 32:720-738 (1989), herein incorporated by reference). With very few exceptions (a handful of research and small commercial systems), most present-day computers are synchronous machines or systems. In such machines or systems, instructions and data move from one pipeline stage to the next under the control of a single (global) clock.
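The difference between a clock-driven pipeline and an event-driven one can be illustrated with a small software model. The following Python sketch is an illustration only, not a circuit described in this document: the function name `advance`, the one-item-per-stage buffering, and the `sink_ready` flag are all assumptions chosen for the example. Each stage hands its item to its successor only when that successor is empty, so every decision depends on local state rather than on a global clock.

```python
from typing import List, Optional, Tuple


def advance(slots: List[Optional[str]],
            new_item: Optional[str] = None,
            sink_ready: bool = True) -> Tuple[Optional[str], bool]:
    """Run one round of local handshakes over a pipeline (hypothetical model).

    slots[i] holds the item currently in stage i, or None if the stage is empty.
    Returns (item that left the last stage or None, whether new_item was accepted).
    """
    emitted = None
    # The last stage releases its item only if the consumer can take it.
    if slots[-1] is not None and sink_ready:
        emitted, slots[-1] = slots[-1], None
    # Each remaining stage forwards its item only if its successor is empty.
    for i in range(len(slots) - 2, -1, -1):
        if slots[i] is not None and slots[i + 1] is None:
            slots[i + 1], slots[i] = slots[i], None
    # The first stage accepts new input only if it is empty (back-pressure).
    accepted = False
    if new_item is not None and slots[0] is None:
        slots[0] = new_item
        accepted = True
    return emitted, accepted


if __name__ == "__main__":
    pipe = [None, None, None]
    for item in ["a", "b", "c", "d"]:
        # The consumer is stalled, so the pipeline fills and then refuses input.
        print(item, advance(pipe, item, sink_ready=False), pipe)
```

In this model the input rate varies with downstream conditions: once the three stages are full and the consumer is stalled, the fourth item is refused, and it is accepted again as soon as the consumer drains one item.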
A standard on-chip interconnect is the self-routing synchronous crossbar (see U.S. Pat. No. 4,949,084 (Schwartz, R.); U.S. Pat. No. 5,400,262 (Mohsen, A. M. et al.)). Several factors make the synchronous crossbar a less than ideal solution. One problem is that orchestrating unpredictable access among the various ports (processing elements and memory modules) adds unnecessary overhead to the system. Traditional on-chip solutions require some sort of global coordination to set a particular configuration. Ideally, messages would freely access their destinations, with contention dealt with locally only where it occurs.
It is well documented that wire delay will become an increasingly important factor in future chip design. As process dimensions shrink, clock speed improves while relative wire resistance increases. The foundations of present systems were conceived 10-15 years ago when 1.6-micron technologies allowed interconnect delays to be ignored. These systems thus were focused on optimizing for gate delay only. In today's deep submicron silicon, however, wire interconnect delay can represent as much as 80% of the path delay (http://www.magma-da.com/c/@gcLWtfuirkG0Q/Pages/fixedtiming.html#Managing). The reason for the delay shift is technological.
To reach high operating frequencies, a design that involves short, low-capacitance wires is required. A crossbar design with high connectivity is, however, a large structure. Modulating signals on a long wire across such a structure at speeds near that at which the active devices are capable of operation becomes infeasible.
Furthermore, clock distribution across a large chip also becomes more difficult. At high frequency, stable clock distribution is an increasingly difficult design problem. The International Technology Roadmap for Semiconductors projects a growing disparity between clock speeds that can be achieved locally and those that can be achieved across a chip. Additionally, high performance clocking trees with large drivers consume a large amount of power. In contemporary Alpha processor designs, the clock tree consumes 50% of the processor power (J. Montanaro et al., "A 160-MHz 32-b 0.5W CMOS RISC Microprocessor," IEEE Journal of Solid-State Circuits, volume 31, number 11, November 1996, p. 1704).
In sum, with very few exceptions (a handful of research and small commercial systems), most present-day computers are synchronous machines or systems. In such machines or systems, instructions and data move from one pipeline stage to the next under the control of a single (global) clock. The present invention seeks to reduce the tight synchrony required in some important computer components. The present invention relates to a form of pipeline architecture implementation whose interconnects permit high-speed communication between processing elements and memory modules, thereby reducing the need for tight synchrony. The invention provides an approach which advocates greatly increasing the number of threads and program counters.
An object of the invention is to provide a method and interconnect device, or a part of such device, for interconnecting processing elements and memory modules in a computer system comprising on-chip parallel computation such that the elements and modules can communicate with one another with increased efficiency.
In detail, the invention provides a semiconductor integrated circuit device comprising an interconnect structure for electrically processing data from at least one of a plurality of input ports to at least one of a plurality of output ports on or within a semiconductor substrate having contact areas disposed in a predetermined spaced relation, the interconnect structure comprising input port leads, multiplexors, and output port leads, wherein the multiplexors perform switching decisions and the input leads from each of the input ports are co-located.
The invention additionally provides a digital device on or within a semiconductor chip, which comprises:
(A) a plurality of storage cells for storing digital signals therein;
(B) a plurality of functional cells for performing functional operations on digital signals; and
(C) an interconnect matrix having a plurality of multibit input leads from at least one of a plurality of input ports connected to receive multibit inputs from one or more of the storage or functional cells, and having a plurality of multibit output leads to at least one of a plurality of output ports connected to send signals to one or more other of the storage or functional cells,
wherein the interconnect matrix further includes at least one pair of multiplexors that functions to coordinately process the multibit inputs from an input port.
The invention additionally provides a computer circuit which comprises a semiconductor integrated circuit device including an interconnect structure for electrically processing data from at least one of a plurality of input ports to at least one of a plurality of output ports in a semiconductor substrate having contact areas disposed in a predetermined spaced relation, the interconnect structure comprising input port leads, multiplexors, and output port leads, wherein the multiplexors perform switching decisions and the input leads from each of the input ports are co-located.
The invention additionally provides a computer circuit which comprises a digital device comprised of:
(A) a plurality of storage cells for storing digital signals therein;
(B) a plurality of functional cells for performing functional operations on digital signals; and
(C) an interconnect matrix having a plurality of multibit input leads from at least one of a plurality of input ports connected to receive multibit inputs from one or more of the storage or functional cells, and having a plurality of multibit output leads to at least one of a plurality of output ports connected to send signals to one or more other of the storage or functional cells,
wherein the interconnect matrix further includes at least one pair of multiplexors that functions to coordinately process the multibit inputs from an input port.
The invention additionally provides a computer system comprising two or more computers in communication with one another which comprises a semiconductor integrated circuit device comprising an interconnect structure for electrically processing data from at least one of a plurality of input ports to at least one of a plurality of output ports on or within a semiconductor substrate having contact areas disposed in a predetermined spaced relation, the interconnect structure comprising input port leads, multiplexors, and output port leads, wherein the multiplexors perform switching decisions and the input leads from each of the input ports are co-located.
The invention additionally provides a computer system comprising two or more computers in communication with one another which comprises a digital device comprised of:
(A) a plurality of storage cells for storing digital signals therein;
(B) a plurality of functional cells for performing functional operations on digital signals; and
(C) an interconnect matrix having a plurality of multibit input leads from at least one of a plurality of input ports connected to receive multibit inputs from one or more of the storage or functional cells, and having a plurality of multibit output leads to at least one of a plurality of output ports connected to send signals to one or more other of the storage or functional cells,
wherein the interconnect matrix further includes at least one pair of multiplexors, each of which functions to coordinately process the multibit inputs of two input ports.
The invention additionally provides embodiments of such semiconductor integrated circuit device, digital device, computer circuit, and computer system wherein the interconnect structure is configured such that the switching decisions can be made locally, and/or wherein the data progress dynamically through the multiplexors as soon as they are able to do so, and/or wherein the interconnect structure is highly pipelined, and/or wherein the device operates asynchronously and/or without global coordination.
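As an illustration of what locally made switching decisions might look like, the following Python sketch models four input ports competing for a single output port through a small tree of 2-to-1 multiplexors. It is a behavioral sketch under stated assumptions and does not reproduce the architecture of FIG. 2: the binary-tree arrangement, the one-word buffers, the alternating-priority arbitration rule, and the names `Mux2` and `drain` are all hypothetical. Each multiplexor consults only its own two inputs and the buffer directly downstream of it, and a word moves forward as soon as that buffer is free.

```python
from typing import List, Optional


class Mux2:
    """A 2-to-1 multiplexor with one-word input buffers (hypothetical model)."""

    def __init__(self) -> None:
        self.inputs: List[Optional[str]] = [None, None]
        self.favor = 0  # side that wins the next tie; purely local state

    def offer(self, side: int, word: str) -> bool:
        """Store word on the given side if that buffer is free."""
        if self.inputs[side] is None:
            self.inputs[side] = word
            return True
        return False

    def select(self) -> Optional[str]:
        """Remove and return one waiting word, alternating priority between sides."""
        for side in (self.favor, 1 - self.favor):
            word = self.inputs[side]
            if word is not None:
                self.inputs[side] = None
                self.favor = 1 - side  # the other side is favored next time
                return word
        return None

    def busy(self) -> bool:
        return any(w is not None for w in self.inputs)


def drain(queues: List[List[str]]) -> List[str]:
    """Deliver all words from four input ports to one output port.

    queues[p] lists the words waiting at input port p, oldest first.
    Returns the words in the order they reach the output port.
    """
    assert len(queues) == 4, "this sketch is fixed at four input ports"
    queues = [list(q) for q in queues]  # do not modify the caller's lists
    leaves, root = [Mux2(), Mux2()], Mux2()
    delivered: List[str] = []
    while any(queues) or root.busy() or any(m.busy() for m in leaves):
        # The root forwards one word to the output port (assumed always ready).
        word = root.select()
        if word is not None:
            delivered.append(word)
        # Each leaf forwards a word only if its buffer at the root is free.
        for k, leaf in enumerate(leaves):
            if root.inputs[k] is None:
                word = leaf.select()
                if word is not None:
                    root.offer(k, word)
        # Each input port injects its oldest word if its leaf buffer is free.
        for p, q in enumerate(queues):
            if q and leaves[p // 2].offer(p % 2, q[0]):
                q.pop(0)
    return delivered


if __name__ == "__main__":
    print(drain([["a0", "a1"], ["b0"], [], ["d0"]]))
```

No component in this model holds a view of the whole structure. Contention is resolved only at the multiplexor where two words meet, and words from the same input port stay in order because each buffer holds a single word.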
The invention additionally provides embodiments of such semiconductor integrated circuit device, digital device, computer circuit, and computer system which comprise the architecture of input ports, output ports and multiplexors shown in FIG. 2.