This relates generally to a graphics processing pipeline with a switching crossbar that enables multiple processors, such as Single Instruction Multiple Data (SIMD) processors, to access banks of shared local memory.
Shared local memory crossbars are implemented as a single, monolithic crossbar switch. However, this requires a large gate count when the number of SIMD register slots and banks is high. For example, for a shared local memory serving a SIMD processor with 32 slots and 4 bytes per slot per cycle, in a 64-bank configuration, the monolithic data crossbar requires 63,488 2-to-1 multiplexer cells.
Current implementations of N×M crossbars have an N:1 multiplexer for every output port (M in total). This uses a lot of gates and area when N and M are large (e.g., 32 or more). For example, an N×M switch will have M*(N−1) 2-to-1 multiplexers. It also causes timing issues due to large fanout, because each input port is routed to all M output multiplexers.
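As a rough check on the gate-count figures above, the following Python sketch evaluates the M*(N−1) expression for the 32-slot, 64-bank example. It assumes, as an interpretation not stated explicitly in the text, that the slots are the N inputs, the banks are the M outputs, and one multiplexer tree is replicated for each of the 32 data bits (4 bytes) per slot. The function and parameter names are illustrative only.

```python
def mux2_count(n_inputs: int, m_outputs: int, bits_per_port: int = 1) -> int:
    """Count 2-to-1 multiplexer cells in a monolithic N x M crossbar.

    Assumption: each output port is driven by an N:1 multiplexer built as a
    tree of (N - 1) two-input multiplexers, replicated once per data bit.
    """
    per_output_per_bit = n_inputs - 1  # an N:1 mux tree needs N - 1 cells
    return m_outputs * per_output_per_bit * bits_per_port


# Example from the text: 32 SIMD slots, 64 banks, 4 bytes (32 bits) per slot.
slots, banks, bits = 32, 64, 4 * 8
total_cells = mux2_count(slots, banks, bits)
print(total_cells)  # 63488
```

Under these assumptions the count is 64 × 31 × 32 = 63,488, which matches the figure quoted for the monolithic data crossbar.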