Matrix transposition is required in many algorithms in image and video processing. Texas Instruments has designed several iMX hardware accelerators for parallel multiply-and-add operations. The most basic iMX can perform matrix transposition on only one data item at a time. Thus matrix transposition operations performed using such hardware accelerators have low efficiency.
FIG. 1 illustrates a simplified diagram of a prior art iMX architecture including four multiply-and-accumulate (MAC) ALUs. Each logical memory block supporting iMX has the same number of banks as the number of processors. In this example there are four banks in input matrix memory 101 and four banks in output matrix memory 109. Each bank is 16 bits wide. The iMX accelerator addresses each bank individually, so iMX can read or write any four consecutive 16-bit words starting from any word. For example, reading four consecutive words starting at a word address of ‘2’ routes data from bank 2 to MAC_0 ALU, data from bank 3 to MAC_1 ALU, data from bank 0 to MAC_2 ALU and data from bank 1 to MAC_3 ALU. Input logic blocks 103 include input datapath 113 and input rotator 114. Output logic blocks 107 include output rotator 115 and output datapath 116. The iMX accelerator also uses multiple banks for parallel table lookup (not shown in FIG. 1). In that usage input rotator block 114 and output rotator block 115 are pass-through elements.
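The bank routing in the example above can be modeled in a few lines. The following Python sketch is illustrative only. It assumes that words are interleaved across the banks so that word address w resides in bank (w mod 4); this matches the example of a read starting at word address 2, but the text does not state it as a general rule. The row computation and the function names banks_for_read and rows_for_read are likewise hypothetical.

```python
NUM_BANKS = 4  # four banks and four MAC ALUs, as in the FIG. 1 example


def banks_for_read(start_word, num_banks=NUM_BANKS):
    """Return the bank feeding MAC_0 .. MAC_(n-1) when reading
    num_banks consecutive words starting at start_word.

    Assumption: word address w is stored in bank (w mod num_banks).
    """
    return [(start_word + i) % num_banks for i in range(num_banks)]


def rows_for_read(start_word, num_banks=NUM_BANKS):
    """Return the row (address within its bank) of each word read.

    Assumption: word address w is stored at row (w // num_banks).
    """
    return [(start_word + i) // num_banks for i in range(num_banks)]


if __name__ == "__main__":
    # Read of four consecutive words starting at word address 2.
    for mac, (bank, row) in enumerate(zip(banks_for_read(2), rows_for_read(2))):
        print(f"MAC_{mac} <- bank {bank}, row {row}")
```

Under these assumptions, a read that does not start on a multiple of four spans two rows, which is one way to see why each bank must receive its own address.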
Input memory controller 110 computes required input addresses. These addresses are supplied to each of the input matrix memory banks 101 via corresponding address buffers 102.
Output memory controller 111 computes required output addresses. These addresses are supplied to each of the output matrix memory banks 109 via corresponding address buffers 108.
Other than for table reads, earlier iMX accelerators have no provision to simultaneously read or write non-consecutive memory words. This limitation requires that matrix transposition be carried out one data item at a time.
FIG. 2 illustrates the sequence of operations for matrix transposition using the prior art iMX architecture illustrated in FIG. 1. The method of FIG. 2 is the only method possible with early iMX hardware. This sequence is very inefficient because only one MAC and one memory bank are used on each clock cycle. FIG. 2 illustrates successive reads of the input memory 201, 202, 203, and 204 one bank at a time. FIG. 2 also illustrates corresponding writes to the output memory 211, 212, 213, and 214 one bank at a time.
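The cost of the one-item-at-a-time sequence can be illustrated with a short software model. The following Python sketch is not the iMX implementation. It assumes a row-major matrix stored as a flat list of words and counts one clock cycle per element moved, ignoring pipeline and memory latency. The function name transpose_one_at_a_time is hypothetical.

```python
def transpose_one_at_a_time(words, rows, cols):
    """Transpose a row-major rows x cols matrix one element per step.

    Returns the transposed words (row-major, cols x rows) and the number
    of steps taken. Each step models one read of a single input word and
    one write of a single output word, so only one bank and one MAC
    would be busy per step (simplifying assumption: one step = one cycle).
    """
    out = [0] * (rows * cols)
    cycles = 0
    for r in range(rows):
        for c in range(cols):
            # Element (r, c) of the input becomes element (c, r) of the output.
            out[c * rows + r] = words[r * cols + c]
            cycles += 1
    return out, cycles


if __name__ == "__main__":
    matrix = list(range(16))  # 4 x 4 matrix holding the words 0..15
    transposed, cycles = transpose_one_at_a_time(matrix, 4, 4)
    print(transposed)
    print("cycles:", cycles)
```

In this model a 4 x 4 matrix takes 16 steps, whereas hardware able to move four words per cycle could in principle finish in four. That gap is the inefficiency described above.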