1. Field of the Invention
The present invention relates to crossbar circuitry for applying an adaptive priority scheme, and a method of operation of such crossbar circuitry.
2. Description of the Prior Art
A crossbar circuit is a switch infrastructure for connecting multiple inputs to multiple outputs in a matrix manner. Accordingly, crossbar circuitry can be used to interconnect a plurality of source circuits and a plurality of destination circuits such that data input to the crossbar circuitry from any of the plurality of source circuits can be output to any of the plurality of destination circuits. Crossbar circuits can be used in a variety of implementations. For example, in a data processing system implementation, such crossbar circuitry can be used to interconnect a plurality of processors used to perform data processing operations on data values with a plurality of memory devices used to store those data values, thereby allowing the data values from any memory device to be routed to any processor.
Known techniques for producing crossbar circuits require a large area for the crossbar circuitry due to the components required to form the crossbar circuitry and the significant number of control lines required for routing control signals to those components, and also consume significant power. Further, their complexity tends to grow rapidly with size, making many of the known techniques impractical for use with crossbar circuits required to interconnect a large number of source circuits with a large number of destination circuits. Some known techniques are discussed below.
The article “A 50 Gb/s 32×32 CMOS Crossbar Chip using Asymmetric Serial Links” by K Chang et al, 1999 Symposium on VLSI Circuits, Digest of Technical Papers, Pages 19 to 22, and the article “A 2 Gb/s 256×256 CMOS Crossbar Switch Fabric Core Design using Pipelined MUX” by T Wu et al, IEEE International Symposium on Circuits and Systems, 2002, Pages 568 to 571, describe crossbar circuits using a hierarchical arrangement of multiplexers to enable the routing of data from any input source to any output destination. However, such MUX-based crossbar circuits are relatively large in size, and have a large power consumption. Further, they typically require a considerable number of control lines in order to control the various multiplexers. Such MUX-based designs are not typically scalable with an increase in the number of inputs and outputs to be supported, at least in part because it becomes increasingly difficult as the size increases to route the necessary control signals to the various multiplexers. Further, if the input data is multi-bit data routed over an input bus, the routing of the data paths themselves becomes very complex.
The article “A Novel Reduced Swing CMOS Bus Interface Circuit for High Speed Low Power VLSI Systems” by R Golshan et al, IEEE International Symposium on Circuits and Systems, Pages 351 to 354, 1994, describes an X-Y style crossbar circuit where the input paths pass in a horizontal direction and the output paths pass in a vertical direction. At the intersection between each input path and output path, a storage element in the form of a flip-flop circuit is provided whose output controls a transistor used to couple the input path to the output path. However, such a design requires a large number of control lines to program the various flip-flops in order to configure the crossbar circuit to perform the required routing. Further, the input data provided on an input data path is used to drive the output data on the relevant output data path. As the crossbar circuit becomes larger to accommodate larger numbers of inputs and outputs, the capacitance of the output data paths increases, and accordingly it is necessary to provide larger drive transistors on the inputs in order to overcome the increased capacitance. In addition, the coupling transistor driven by the flip-flop at the interconnection between a data input path and a data output path also needs to increase in size as the crossbar circuit increases in size. Further, it will typically be necessary to include one or more buffers in the data output paths as the crossbar circuitry increases in size. All of these factors give rise to significant problems in layout of the elements of the crossbar circuit and the associated control lines, particularly as the crossbar circuit increases in size to accommodate more inputs and outputs. Accordingly, this approach becomes complex and is not scalable.
The article “High-Performance Crossbar Design for System-On-Chip” by P Wijetunga, Proceedings of the Third IEEE International Workshop on System-On-Chip for Real-Time Applications, 2003, describes a crossbar design employing a pass-transistor chain as transmission circuitry to be located at each intersection between the data input paths and the data output paths. When the pass-transistor chain is required to couple the input data to the output data path, a current sensing circuit located on the output data path is used to detect the input data value. A significant disadvantage of this design is that it consumes significant power due to the requirement for current sensing devices to be located on each output path. Further, a large number of control signals need routing within the crossbar circuitry, and indeed those control signals will dominate the routing requirements as the crossbar circuit increases in size to accommodate more inputs and outputs. Hence, again, this design is not readily scalable to larger designs of crossbar circuitry.
The article “A Multi-Context 6.4 Gb/s/Channel On-Chip Communication Network using 0.18 μm Flash-EEPROM Switches and Elastic Interconnects” by M Borgatti et al, ISSCC 2003, Session 26, Embedded and Digital Systems, Paper 26.5, describes a programmable crossbar implemented using a matrix of modified flash-EEPROM devices. However, as is clear from Figure 26.5.6, the number of memory cells involved in routing a data input from a source device to a destination device depends on where those devices are connected into the crossbar, and accordingly the timing for signals passing through the crossbar is not deterministic. Further, a large number of control lines will be required to enable programming of the various flash-EEPROM cells, and a significant amount of time will be required to program those various flash-EEPROM cells. Any reconfiguration of the crossbar device will hence also take a significant time.
Accordingly, such a design of crossbar is complex, and will increase in complexity as more input devices and output devices need to be supported by the crossbar, due to the proliferation in control lines required. Further, since the timing of the crossbar is not deterministic, this will make the crossbar design inappropriate for certain implementations.
In summary, it will be appreciated from the above discussions that existing crossbar designs typically involve complex routing of control signals, with that complexity rapidly increasing as the size of the crossbar increases. Often the designs consume significant power and lack scalability, partly due to the number of control lines required, and partly due to the need to increase the size of certain components provided within the crossbar as the size of the crossbar increases.
Commonly owned US published patent application 2010/0211719, the entire contents of which are hereby incorporated by reference, describes a crossbar circuit design where routing patterns are cached locally at crosspoints in the crossbar and are then used to route data. This significantly reduces routing congestion when producing the wiring layout for the crossbar circuit. The design is readily scalable, and hence the crossbar circuitry can be readily utilised even where the number of source circuits and destination circuits to be connected to the crossbar circuitry is large. Further, the design produces a regular layout providing a fixed latency for the transfer of data through the crossbar circuitry.
Another issue in crossbar design is how to provide the crossbar with collision detection and resolution abilities. Multiple requests for the same destination in a switching fabric are termed a collision. As the number of sources and destinations increases, collisions become more frequent. Under such circumstances, arbitration becomes a bottleneck in the overall efficiency of the crossbar circuitry.
Most contemporary switching fabrics consist primarily of two modules: a crossbar to transmit data, and an arbiter to configure the crossbar. In such an implementation, the source circuits send requests to the arbiter for channels. The arbiter samples all requests and, using some priority assignment scheme, grants some or all of the requests and configures the crossbar accordingly. This scheme poses two major problems for scalability:
1) Routing all request signals from the source circuits to the arbiter, and all the grant signals back, becomes increasingly difficult for larger systems; and
2) The arbiter needs to have knowledge of all incoming requests, as well as the current state of the crossbar, before it can make a decision. Monitoring the state of the crossbar every cycle requires additional logic as well as interconnects. This contributes to additional delay.
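By way of illustration only, the conventional two-module arrangement described above can be modelled in software as follows. This is a minimal sketch and not a description of any cited design: the function name `arbitrate`, the dictionary representation of requests, and the fixed priority ordering are all assumptions introduced for the example. It shows why the arbiter must see every request and the current state of the crossbar before it can issue any grant.

```python
def arbitrate(requests, priority_order, busy_outputs=frozenset()):
    """Model of a centralised arbiter (hypothetical, for illustration).

    requests:       dict mapping source id -> requested destination id
    priority_order: list of source ids, highest priority first
    busy_outputs:   destinations already in use (current crossbar state)

    Returns a dict mapping destination id -> granted source id.
    """
    grants = {}
    # The arbiter walks all sources in priority order, so it needs the
    # complete set of requests before any grant can be decided.
    for src in priority_order:
        dst = requests.get(src)
        if dst is None:
            continue  # this source made no request
        if dst in busy_outputs or dst in grants:
            continue  # collision: destination busy or already granted
        grants[dst] = src
    return grants


# Sources 0 and 1 collide on destination 2; source 0 has higher priority.
example = arbitrate({0: 2, 1: 2, 3: 1}, priority_order=[0, 1, 2, 3])
```

In this model the request dictionary and the returned grants stand in for the request and grant wires that must be routed between every source circuit and the arbiter.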
A number of attempts have been made to address the scalability and arbitration delay of packet switching networks. In the article by Chi H. et al. entitled “Decomposed Arbiters for Large Crossbars with Multi-Queue Input Buffers,” IEEE International Conference on Computer Design, 14-16 Oct. 1991, pages: 233-238, the authors discuss decomposing arbiters, so that some requests can be granted before the arbitration process completes. However, the worst case arbitration delay still remains the same. In general, the arbitration delay grows linearly with size.
In the article by Delgado-Frias et al. entitled “A VLSI Crossbar Switch with Wrapped Wave Front Arbitration,” IEEE Transactions on Circuits and Systems, Volume 50, Issue 1, January 2003 Pages: 135-141, and the article by Kavaldjiev N. et al. entitled “A Virtual Channel Router for On-chip Networks,” IEEE International SoC Conference, 12-15 Sep. 2004, pages: 289-293, the authors discuss the handling of arbitration within a crossbar. However, the disclosed implementations are not scalable and are restricted to a crossbar of size 4*4.
In the article by Shin E. et al. entitled “Round-robin Arbiter Design and Generation,” International Symposium on System Synthesis, 2002, pages: 243-248, the authors propose a tool to generate a round robin arbiter. The approach is hierarchical, attempting to make a 32*32 switch out of 4*4 switches.
In the article by William W. Plummer entitled “Asynchronous Arbiters”, IEEE Transactions on Computers Archive Volume 21, Issue 1 (January 1972) Pages 37-42, the article by Charles E. Molnar et al. entitled “Simple Circuits that Work for Complicated Reasons,” International Symposium on Advanced Research in Asynchronous Circuits and Systems, 2000. (ASYNC 2000) Proceedings, and the article by Mark B. Josephs et al. entitled “CMOS Design of the Tree Arbiter Element,” IEEE Transactions on VLSI systems, Volume 4, Issue 4, December 1996 Pages: 472-476, some asynchronous techniques for arbitration are described. These designs exploit meta-stability to achieve randomness. However, delay in meta-stable systems can become high at times, thereby restricting their use in real time systems that require guaranteed throughput.
In summary, many solutions available in the literature take a hierarchical approach towards making a large crossbar switch using 2*2 or 4*4 switches. Hence, the delay grows linearly with the size of the crossbar switch. This leaves no room for voltage scaling in wide and parallel systems (SIMD/MIMD type applications).
Further, in previous attempts to integrate the arbiter and the crossbar together, such as for example discussed in the earlier mentioned article “A Virtual Channel Router for On-chip Networks,” IEEE International SoC Conference, 12-15 Sep. 2004, pages: 289-293, it has only been possible to bring them spatially close. However, the logic and interconnect for both functionalities have remained exclusive.
In conventional implementations, collision detection and resolution is done hierarchically. This is achieved in multiple stages of arbitration logic, wherein after each stage the number of requests is reduced by a certain fraction.
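The staged reduction described above can be sketched in software as follows. This is an illustrative model only: the function name `staged_arbitrate`, the `rank` mapping (lower value meaning higher priority) and the `radix` parameter (the number of requests each local arbiter resolves per stage) are assumptions made for the example and are not taken from any of the cited designs.

```python
def staged_arbitrate(requesters, rank, radix=2):
    """Resolve a collision in stages (hypothetical model).

    requesters: list of source ids requesting the same destination
    rank:       dict mapping source id -> priority rank (lower wins)
    radix:      number of requests each local arbiter compares per stage

    Returns (winner, stages), where stages is the number of arbitration
    stages that were needed. winner is None if there were no requesters.
    """
    survivors = list(requesters)
    stages = 0
    while len(survivors) > 1:
        next_round = []
        # Each group of up to `radix` requests is resolved by one local
        # arbiter, so only one request per group reaches the next stage.
        for i in range(0, len(survivors), radix):
            group = survivors[i:i + radix]
            next_round.append(min(group, key=lambda s: rank[s]))
        survivors = next_round
        stages += 1
    return (survivors[0] if survivors else None), stages


# Five colliding sources, where the source id doubles as its rank.
ranks = {s: s for s in (5, 2, 7, 1, 4)}
winner, stages = staged_arbitrate([5, 2, 7, 1, 4], ranks, radix=2)
```

In this sketch the `stages` count stands in for the depth of arbitration logic that a request must traverse before a grant is produced.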
Commonly owned US published patent application US 2010/0211720, the entire contents of which are hereby incorporated by reference, describes a self-arbitrating design of crossbar circuitry that enables a very efficient resolution of conflicts to be performed by applying a predetermined priority scheme, whilst providing a very regular design, with uniform delay across all paths, and which requires significantly fewer control lines than typical prior art crossbar designs. Such crossbar circuitry is readily scalable to form large crossbars.
However, certain types of priority scheme are still difficult to implement efficiently, for example adaptive priority schemes where the relative priorities of the source circuits change between each application of the adaptive priority scheme. US 2010/0211720 proposed the use of a priority configuration module which, in a priority assignment mode of operation, was able to reprogram the values in selected priority storage circuits of the crossbar. It was also identified that such a priority configuration module could be made adaptive by monitoring activity of the crossbar and determining how to update the priority storage circuits dependent on that analysis. However, such an approach impacts efficiency of the crossbar for a number of reasons.
Firstly, it is necessary to enter a dedicated priority assignment mode of operation in order to reprogram the priority storage circuits, thereby causing clock cycles to be spent purely performing the priority update process. Typically, it is also only possible to reprogram at any one time the storage circuits of the crossbar cells connected to one data output path. Furthermore, for the adaptive priority configuration module, it is necessary first to extract information from the crossbar indicative of the activities of the crossbar, then to analyse that extracted information, and then to input the revised priority data into the crossbar for storage within the relevant priority storage circuits, thereby leading to an inefficient update process.
One type of adaptive priority scheme which has traditionally been difficult to implement efficiently is the least recently granted (LRG) priority scheme, where the relative priorities of the various source circuits are updated dependent on which source circuit wins a current arbitration process. In particular, if source circuit X wins the arbitration process, then in accordance with the LRG scheme all sources having a lower priority than source X have their priorities increased, all sources having a higher priority than source X have their priorities maintained at their current level, and source X has its priority demoted so as to have the lowest relative priority.
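The LRG update rule set out above can be illustrated with the following minimal software sketch. It models only the priority ordering, not any hardware implementation; the function names `lrg_update` and `lrg_arbitrate`, and the representation of priorities as a list ordered from highest to lowest, are assumptions made for the example.

```python
def lrg_update(priority_order, winner):
    """Apply the LRG rule to a priority ordering (hypothetical model).

    priority_order: list of source ids, highest priority first
    winner:         the source that won the current arbitration

    Sources above the winner keep their positions, sources below the
    winner each move up one place, and the winner becomes lowest.
    """
    if winner not in priority_order:
        raise ValueError("winner must be one of the known sources")
    return [s for s in priority_order if s != winner] + [winner]


def lrg_arbitrate(priority_order, requesters):
    """Grant the highest-priority requester, then update the ordering.

    Returns (winner, new_priority_order); winner is None and the order
    is unchanged if no source is requesting.
    """
    for src in priority_order:
        if src in requesters:
            return src, lrg_update(priority_order, src)
    return None, list(priority_order)


# Sources 1 and 3 repeatedly collide; the grant alternates between them.
order = [0, 1, 2, 3]
first, order = lrg_arbitrate(order, {1, 3})
second, order = lrg_arbitrate(order, {1, 3})
```

In the usage example, source 1 wins the first arbitration and is demoted to lowest priority, so source 3 wins the second, which is the fairness property attributed to the LRG scheme.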
Although the LRG priority scheme guarantees fairness, and a better quality of service (QoS) than other priority schemes (for example round-robin, pseudo round-robin, random priority assignment, etc), its hardware implementation typically results in a significant overhead in the area, performance and energy consumption of the crossbar circuit. In addition, the hardware complexity grows quadratically with the size of the crossbar circuit.
In contemporary switching fabrics that support LRG, the arbiter updates the priorities assigned to different sources by keeping track of all incoming and granted requests in the network. Implementing this in hardware requires additional memory, processing elements and wires.
Accordingly, it would be desirable to provide an improved crossbar design that alleviates the above described problems associated with supporting adaptive priority schemes.