1. Field of the Invention
The present invention relates to crossbar circuitry and a method of operation of such crossbar circuitry.
2. Description of the Prior Art
A crossbar circuit is a switch infrastructure for connecting multiple inputs to multiple outputs in a matrix manner. Accordingly, crossbar circuitry can be used to interconnect a plurality of source circuits and a plurality of destination circuits such that data input to the crossbar circuitry from any of the plurality of source circuits can be output to any of the plurality of destination circuits. Crossbar circuits can be used in a variety of implementations. For example, in a data processing system implementation, such crossbar circuitry can be used to interconnect a plurality of processors used to perform data processing operations on data values with a plurality of memory devices used to store those data values, thereby allowing the data values from any memory device to be routed to any processor.
Known techniques for producing crossbar circuits require a large area for the crossbar circuitry due to the components required to form the crossbar circuitry and the significant number of control lines required for routing control signals to those components, and also consume significant power. Further, their complexity tends to grow rapidly with size, making many of the known techniques impractical for use with crossbar circuits required to interconnect a large number of source circuits with a large number of destination circuits. Some known techniques are discussed below.
The article “A 50 Gb/s 32×32 CMOS Crossbar Chip using Asymmetric Serial Links” by K Chang et al, 1999 Symposium on VLSI Circuits, Digest of Technical Papers, Pages 19 to 22, and the article “A 2 Gb/s 256×256 CMOS Crossbar Switch Fabric Core Design using Pipelined MUX” by T Wu et al, IEEE International Symposium on Circuits and System, 2002, Pages 568 to 571, describe crossbar circuits using a hierarchical arrangement of multiplexers to enable the routing of data from any input source to any output destination. However, such MUX-based crossbar circuits are relatively large in size, and have a large power consumption. Further, they typically require a considerable number of control lines in order to control the various multiplexers. Such MUX-based designs are not typically scalable with an increase in the number of inputs and outputs to be supported, at least in part because it becomes increasingly difficult as the size increases to route the necessary control signals to the various multiplexers. Further, if the input data is multi-bit data routed over an input bus, the routing of the data paths themselves becomes very complex.
The article “A Novel Reduced Swing CMOS Bus Interface Circuit for High Speed Low Power VLSI Systems” by R Golshan et al, IEEE International Symposium on Circuits and System, Pages 351 to 354, 1994, describes an X-Y style crossbar circuit where the input paths pass in a horizontal direction and the output paths pass in a vertical direction. At the intersection between each input path and output path, a storage element in the form of a flip-flop circuit is provided whose output controls a transistor used to couple the input path to the output path. However, such a design requires a large number of control lines to program the various flip-flops in order to configure the crossbar circuit to perform the required routing. Further, the input data provided on an input data path is used to drive the output data on the relevant output data path. As the crossbar circuit becomes larger to accommodate larger numbers of inputs and outputs, the capacitance of the output data paths increases, and accordingly it is necessary to provide larger drive transistors on the inputs in order to overcome the increased capacitance. In addition, the coupling transistor driven by the flip-flop at the interconnection between a data input path and a data output path also needs to increase in size as the crossbar circuit increases in size. Further, it will typically be necessary to include one or more buffers in the data output paths as the crossbar circuitry increases in size. All of these factors give rise to significant problems in layout of the elements of the crossbar circuit and the associated control lines, particularly as the crossbar circuit increases in size to accommodate more inputs and outputs. Accordingly, this approach becomes complex and is not scalable.
The article “High-Performance Crossbar Design for System-On-Chip” by P Wijetunga, Proceedings of the Third IEEE International Workshop on System-On-Chip for Real-Time Applications, 2003, describes a crossbar design employing a pass-transistor chain as transmission circuitry to be located at each intersection between the data input paths and the data output paths. When the pass-transistor chain is required to couple the input data to the output data path, a current sensing circuit located on the output data path is used to detect the input data value. A significant disadvantage of this design is that it consumes large power due to the requirement for current sensing devices to be located on each output path. Further, a large number of control signals need routing within the crossbar circuitry, and indeed those control signals will dominate the routing requirements as the crossbar circuit increases in size to accommodate more inputs and outputs. Hence, again, this design is not readily scalable to larger designs of crossbar circuitry.
The article “A Multi-Context 6.4 Gb/s/Channel On-Chip Communication Network using 0.18 μm Flash-EEPROM Switches and Elastic Interconnects” by M Borgatti et al, ISSCC 2003, Session 26, Embedded and Digital Systems, Paper 26.5, describes a programmable crossbar implemented using a matrix of modified flash-EEPROM devices. However, as is clear from Figure 26.5.6, the number of memory cells involved in routing a data input from a source device to a destination device depends on where those devices are connected into the crossbar, and accordingly the timing for signals passing through the crossbar is not deterministic. Further, a large number of control lines will be required to enable programming of the various flash-EEPROM cells, and a significant amount of time will be required to program those various flash-EEPROM cells. Any reconfiguration of the crossbar device will hence also take a significant time.
Accordingly, such a design of crossbar is complex, and will increase in complexity as more input devices and output devices need to be supported by the crossbar, due to the proliferation in control lines required. Further, since the timing of the crossbar is not deterministic, this will make the crossbar design inappropriate for certain implementations.
In summary, it will be appreciated from the above discussions that existing crossbar designs typically involve complex routing of control signals, with that complexity rapidly increasing as the size of the crossbar increases. Often the designs consume significant power consumption and lack scalability, due partly to the number of control lines required, and partly due to the need to increase the size of certain components provided within the crossbar as the size of the crossbar increases.
Commonly owned US published patent application 2010/0211719, the entire contents of which are hereby incorporated by reference, describes a crossbar circuit design where routing patterns are cached locally at crosspoints in the crossbar and are then used to route data. This significantly reduces routing congestion when producing the wiring layout for the crossbar circuit. The design is readily scalable, and hence the crossbar circuitry can be readily utilised even where the number of source circuits and destination circuits to be connected to the crossbar circuitry is large. Further, the design produces a regular layout providing a fixed latency for the transfer of data through the crossbar circuitry.
Another issue in crossbar design is how to provide the crossbar with collision detection and resolution abilities. Multiple requests for the same destination in a switching fabric is termed a collision. With increasing number of sources and destinations, collisions get more frequent. Under such circumstances, arbitration becomes a bottle-neck in the overall efficiency of the crossbar circuitry.
Most contemporary switching fabrics consist primarily of two modules: a crossbar to transmit data, and an arbiter to configure the crossbar. In such an implementation, the source circuits send requests to the arbiter for channels. The arbiter samples all requests and using some priority assignment scheme, grants some or all of the requests and configures the crossbar accordingly. This scheme poses two major problems for scalability:                1) Routing all request signals from the source circuits to the arbiter, and all the grant signals back, becomes increasingly difficult for larger systems; and        2) The arbiter needs to have knowledge of all incoming requests as well the current state of the crossbar, before it can make a decision. Monitoring the state of the crossbar every cycle requires additional logic as well as interconnects. This contributes to additional delay.        
A number of attempts have been made to address the scalability and arbitration delay of packet switching networks. In the article by Chi H. et. al. entitled “Decomposed Arbiters for Large Crossbars with Multi-Queue Input Buffers,” IEEE International Conference on Computer Design, 14-16 Oct. 1991, pages: 233-238, the authors discuss decomposing arbiters, so that some requests can be granted before the arbitration process completes. However, the worst case arbitration delay still remains the same. In general, the arbitration delay grows linearly with size.
In the article by Delgado-Frias et. al. entitled “A VLSI Crossbar Switch with Wrapped Wave Front Arbitration,” IEEE Transactions on Circuits and Systems, Volume 50, Issue 1, Jan. 2003 Pages: 135-141, and the article by Kavaldjiev N. et. al. entitled “A Virtual Channel Router for On-chip Networks,” IEEE International SoC Conference, 12-15 Sep. 2004, pages: 289-293, the authors discuss the handling of arbitration within a crossbar. However, the disclosed implementations are not scalable and are restricted to a crossbar of size 4*4.
In the article by Shin E. et. al. entitled “Round-robin Arbiter Design and Generation,” International Symposium on System Synthesis, 2002, pages: 243-248, the authors propose a tool to generate a round robin arbiter. The approach is hierarchical, attempting to make a 32*32 switch out of 4*4 switches.
In the article by William W. Plummer entitled “Asynchronous Arbiters”, IEEE Transactions on Computers Archive Volume 21, Issue 1 (January 1972) Pages 37-42, the article by Charles E. Molnar et. al. entitled “Simple Circuits that Work for Complicated Reasons,” International Symposium on Advanced Research in Asynchronous Circuits and Systems, 2000. (ASYNC 2000) Proceedings, and the article by Mark B. Josephs et. al. entitled “CMOS Design of the Tree Arbiter Element,” IEEE Transactions on VLSI systems, Volume 4, Issue 4, Dec 1996 Page(s):472-476, some asynchronous techniques for arbitration are described. These designs exploit meta-stability to achieve randomness. However, delay in meta-stable systems can become high at times, thereby restricting their use in real time systems that require guaranteed throughput.
In summary, many solutions available in the literature take a hierarchical approach towards making a large crossbar switch using 2*2 or 4*4 switches. Hence, the delay grows linearly with the size of the crossbar switch. This leaves no room for voltage scaling in wide and parallel systems (SIMD/MIMD type applications).
Further, in previous attempts to integrate the arbiter and the crossbar together, such as for example discussed in the earlier mentioned article “A Virtual Channel Router for On-chip Networks,” IEEE International SoC Conference, 12-15 Sep. 2004, pages: 289-293, it has only been possible to bring them spatially close. However, the logic and interconnect for both functionalities have remained exclusive.
In conventional implementations, collision detection and resolution is done hierarchically. This is achieved in multiple stages of arbitration logic, wherein after each stage the number of requests is reduced by a certain fraction.
Commonly owned U.S. published patent application US 2010/0211720, the entire contents of which are hereby incorporated by reference, describes a self-arbitrating design of crossbar circuitry that enables a very efficient resolution of conflicts to be performed by applying a predetermined priority scheme, whilst providing a very regular design, with uniform delay across all paths, and which requires significantly less control lines that typical prior art crossbar designs. Such crossbar circuitry is readily scalable to form large crossbars. Further, US 2010/0211720 proposed the use of a priority configuration module which, in a priority assignment mode of operation, is able to reprogram the values in selected priority storage circuits of the crossbar. It is also identified that such a priority configuration module could be made adaptive by monitoring activity of the crossbar and determining how to update the priority storage circuits dependent on that analysis.
Certain types of priority scheme could still be difficult to implement efficiently in the above-described techniques, for example adaptive priority schemes where the relative priorities of the source circuits change between each application of the adaptive priority scheme. This is due to the fact that according to such techniques, it is typically necessary to enter a dedicated priority assignment mode of operation in order to reprogram the priority storage circuits, thereby causing clock cycles to be spent purely performing the priority update process. Furthermore it has also generally only been possible at one time to reprogram the storage circuits of the crossbar cells connected to one data output path. Moreover, for the adaptive priority configuration module, it is necessary first to extract information from the crossbar indicative of the activities of the crossbar, then to analyse that extracted information, and then to input the revised priority data into the crossbar for storage within the relevant priority storage circuits, thereby leading to an inefficient update process.
Commonly owned U.S. published patent application US 2011/0138098, the entire contents of which are hereby incorporated by reference, describes a self-arbitrating crossbar design in which each crossbar cell has priority storage circuitry configured to store priority data which allows the implementation of an adaptive priority scheme, for example based on the least recently granted (LRG) priority scheme. In particular, the priority storage circuitry is configured, between each application of the adaptive priority scheme by arbitration circuitry, to self update the priority data stored therein. This provides a fast and scalable technique for implementing adaptive priority schemes within a crossbar circuit.
However, in addition to the requirement to provide the above described techniques for arbitrating and prioritising between multiple requests for the same destination, it is desirable for interconnect fabrics to support Quality-of-Service (QoS) requests from the source circuits seeking to use the interconnect to transmit data to the destination circuits. Such QoS requests provide a mechanism for a requesting source circuit to associate a relative level of importance with a data transmission request, such that more important requests can be prioritized over less important requests.
Contemporary interconnect fabrics which support such QoS based relative prioritization of requests (typically on a packet-by-packet basis), employ a QoS arbiter in the control path for configuring the network. The design and implementation of these QoS arbiters is a complex task, and in particular the typically employed synthesis and place-and-route design techniques for implementing such a QoS arbiter result in an undesirably high latency, because of significant memory requirements, often in the form of look-up tables. Some approaches have been proposed for enabling asynchronous and self-timed circuit techniques for designing these components with lower latency. The disclosures: “An Asynchronous Low Latency Arbiter for Quality of Service (QoS) Applications”, Tomaz Felicijan at al., Proceedings of the 15th International Conference on Microelectronics, ICM 2003; and “New generic GALS NoC architectures with multiple QoS”, Zid, M. et al., International Conference on Design and Test of Integrated Systems in Nanoscale Technology, DTIS 2006 are examples of these proposed techniques.
However the currently known techniques for QoS based arbitration which use a synthesis based approach are best suited to interconnects which link together a limited number of source and destination circuits and these techniques do not scale well for large multiplicity systems. Instead, in such more complex systems an approach taken is to provide a network built with a hierarchy of interconnect fabrics. As a result, data packets have to travel through multiple fabrics, which increases latency as well as dissipates additional energy for intermediate data storage. In consequence the area, delay and energy overhead of the fabric becomes a significant factor in such systems.
Accordingly, it would be desirable to provide an improved crossbar design that alleviates the above described problems related to QoS based arbitration.