Recent advancements in processor architectures have made multi-core processors increasingly prevalent in mainstream computing across a wide range of market segments. The debut of Intel® Corporation's 8-core Xeon “Nehalem EX” processor and AMD's 6-core Opteron “Istanbul” processor in 2009 was followed by Intel's 10-core “Westmere-EX” and AMD's 12-core “Magny-Cours” processors. Intel's recent Single-chip Cloud Computer (SCC) platform integrates 48 Intel Architecture cores on a single chip. Several non-x86 multi-core processors have also been showcased, including STI's 8-core Cell processor, Sun's 8-core Niagara and 16-core Victoria Falls, Tilera's processors (36-, 64-, and 100-core versions), and Intel's TeraFLOPS Processor prototype. This overall trend toward higher core counts is expected to continue and to yield mainstream terascale processors with 50 to 100+ cores in the next 3-5 years.
A scalable on-chip interconnection network fabric is a key ingredient in the architecture of terascale processors. For a commercial design, the interconnect architecture should offer the flexibility to scale the number of processor cores up or down, be amenable to high-volume manufacturing, and provide reliability. Additionally, the interconnect needs to address the principal problem of providing high performance while optimizing power consumption.
As illustrated in FIG. 1, one popular topology employs a plurality of processing elements (PEs) or “tiles” configured in a two-dimensional (2D) array and interconnected via a 2D mesh interconnect 100 comprising multiple interconnect links 102. Each PE node 104 includes a network interface 106 that is connected to the interconnect mesh at a respective router 108, which may be configured as a 5-port crossbar, a 4-port crossbar, or a 3-port crossbar depending on its location, as illustrated. The crossbar is one of the two major architectural contributors to router power (the other significant contributor being the packet buffers). For example, crossbar power consumption is 15% of the total router power in Intel's TeraFLOPS processor. In the MIT RAW processor, the crossbar consumes 30% of the power, while in the TRIPS processor data network it consumes 33% of network power. Crossbars also collectively occupy significant layout area. Thus, it would be advantageous to reduce the power consumed and/or the area occupied by the crossbars.
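By way of illustration only, the dependence of crossbar size on router location may be sketched as follows. The sketch assumes that each router has one local port toward its PE's network interface plus one port per adjacent router in the mesh, which is consistent with the 5-port, 4-port, and 3-port configurations described above but is not stated explicitly. The function names `router_ports` and `port_histogram` and the 4x4 mesh size are hypothetical and chosen only for this example.

```python
from collections import Counter


def router_ports(row: int, col: int, rows: int, cols: int) -> int:
    """Return the assumed crossbar port count for the router at (row, col).

    Assumption: one local port toward the PE's network interface, plus one
    port per neighboring router in the mesh (north, south, west, east).
    """
    neighbors = sum((
        row > 0,          # neighbor above
        row < rows - 1,   # neighbor below
        col > 0,          # neighbor to the left
        col < cols - 1,   # neighbor to the right
    ))
    return neighbors + 1  # +1 for the local PE port


def port_histogram(rows: int, cols: int) -> Counter:
    """Count how many routers of each port count a rows x cols mesh contains."""
    return Counter(
        router_ports(r, c, rows, cols)
        for r in range(rows)
        for c in range(cols)
    )


if __name__ == "__main__":
    # Hypothetical 4x4 mesh; the size is chosen only for illustration.
    for ports, count in sorted(port_histogram(4, 4).items()):
        print(f"{ports}-port crossbars: {count}")
```

Under these assumptions, a 4x4 mesh contains four 3-port crossbars (corners), eight 4-port crossbars (edges), and four 5-port crossbars (interior). As the mesh grows, the interior 5-port routers come to dominate the count, so the largest crossbar configuration accounts for most of the aggregate crossbar power and area.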
Earlier work on crossbar power and area reduction has used two basic approaches: decomposition and segmentation. Under the decomposition approach, a functionally larger crossbar is built from smaller sub-crossbars, resulting in smaller area and lower power but restricting connectivity between some input-output pairs and/or concurrency among multiple input-output pairs. The connectivity restriction, if any, may in turn restrict the routing algorithms available for the topology. Also, the concurrency restriction may impact overall latency and throughput. The segmentation approach focuses on power reduction: tri-state buffers are used to energize only the wire segments of the crossbar needed to establish a given input-output connection. However, segmentation by itself does not provide any area reduction. Moreover, such designs typically consider the crossbar in isolation, ignoring its interconnection with other logic within the router, the placement of ports, flit buffers, and drivers, and inter-router connectivity, all of which present real constraints. This often leads to unrealistic optimizations that assume a crossbar layout derived from a logical view without considering physical design constraints.