Some processing systems may support vector processing or parallel processing of operations on two or more data elements of a data vector. Some such operations may involve movement of data elements of a data vector. For example, a permutation operation may involve rearranging positions of one or more data elements within the data vector. A broadcasting operation may involve copying a selected data element and replacing each of the remaining data elements of the data vector with a copy of the selected data element. Numerous other such data movement operations may be used in processing applications such as multimedia processing, digital signal processing, etc.
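By way of illustration only, the two data movement operations described above may be sketched in software as follows. The function names `permute` and `broadcast`, and the convention that `order[i]` names the source position feeding output position `i`, are assumptions introduced for this sketch and do not appear in the description.

```python
def permute(vector, order):
    """Rearrange data elements: output position i receives vector[order[i]].

    `order` is a hypothetical control vector; it must name every source
    position exactly once for the result to be a true permutation.
    """
    if sorted(order) != list(range(len(vector))):
        raise ValueError("order must use each source index exactly once")
    return [vector[src] for src in order]


def broadcast(vector, selected):
    """Copy the data element at position `selected` into every position."""
    if not 0 <= selected < len(vector):
        raise IndexError("selected position is outside the data vector")
    return [vector[selected]] * len(vector)


if __name__ == "__main__":
    data = ["a", "b", "c", "d"]
    print(permute(data, [2, 0, 3, 1]))  # ['c', 'a', 'd', 'b']
    print(broadcast(data, 1))           # ['b', 'b', 'b', 'b']
```

Both operations may be viewed as special cases of one general form in which each output position independently selects a source position.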
Conventional processing systems handle data movement operations by implementing interconnection networks such as a crossbar. A crossbar may be implemented using multiplexors. For example, in order to achieve all permutations and data movement operations for a vector comprising N data elements, an N×N crossbar may be implemented using N N-input multiplexors. Each N-input multiplexor may select any one of the N data elements as its output. While a crossbar implementation achieves the desired functionality, it incurs significant hardware costs for implementing the N N-input multiplexors. Moreover, crossbar implementations are not easily scalable as the size of (e.g., the number of data elements in) the data vectors to be operated on increases.
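The crossbar arrangement described above may be modeled behaviorally as a minimal sketch, in which each of the N outputs is driven by its own N-input multiplexor. The names `mux`, `crossbar`, and `two_input_mux_count` are hypothetical. The cost estimate further assumes that each N-input multiplexor is decomposed into a tree of N-1 two-input multiplexors per data bit, which is one common decomposition and not necessarily the one used in any particular implementation.

```python
def mux(inputs, select):
    """Model an N-input multiplexor: pass through the selected input."""
    return inputs[select]


def crossbar(vector, selects):
    """Model an N x N crossbar built from N N-input multiplexors.

    selects[i] is the control value of the multiplexor driving output i.
    Repeated select values are allowed, so broadcasts are possible as
    well as permutations.
    """
    if len(selects) != len(vector):
        raise ValueError("one select value is required per output")
    return [mux(vector, s) for s in selects]


def two_input_mux_count(n):
    """Estimate two-input multiplexors per data bit for an n x n crossbar.

    Assumes each n-input multiplexor is a tree of n - 1 two-input
    multiplexors, giving n * (n - 1) in total.
    """
    return n * (n - 1)


if __name__ == "__main__":
    data = [10, 20, 30, 40]
    print(crossbar(data, [3, 2, 1, 0]))  # reversal (a permutation)
    print(crossbar(data, [1, 1, 1, 1]))  # broadcast of element 1
    for n in (4, 8, 16, 32):
        print(n, two_input_mux_count(n))
```

Under the stated assumption, the estimate grows roughly with the square of N, which is consistent with the scalability concern noted above.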
Accordingly, there is a need in the art for hardware-efficient and scalable solutions for implementing data movement operations for data elements of data vectors.