Matrix calculations are increasingly common in electrical and computer systems, particularly in systems with multiple concurrent data paths. For example, matrix calculations are used in conventional equalizers in the Universal Mobile Telecommunications System (UMTS) High-Speed Downlink Packet Access (HSDPA). Matrix calculations are also used in conventional joint detection receivers in Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), conventional Multiple-Input and Multiple-Output (MIMO) technologies, and other conventional technologies. The algorithms implemented in these technologies can be easily expressed in matrix format and implemented by a series of matrix operations, including matrix inversion, multiplication, conjugation, transposition, and so forth.
In conventional architectures, whether implemented as pure hardware or programmable architectures, matrix operations are conventionally realized by loop structures based on scalar operations. The scalar processing for these matrix operations usually incurs tremendous computational load since each matrix element is processed in series. To overcome these computational loads, vector processing architectures are implemented to accelerate the computation.
The basic principle of vector processing is that a set of identical operations is executed in parallel. To avoid a data access bottleneck, the vector architecture usually has vector memory as well, which is organized in lines instead of basic memory units. Organizing the vector memory in lines does not mean that conventional scalar data access cannot be supported.
FIG. 1A illustrates a conventional vector processing architecture 10. In particular, FIG. 1A illustrates the basic principles of vector processing using a vector memory 12. The depicted vector processing architecture 10 includes the vector memory 12 and a vector processor 14. The vector memory 12 is subdivided into several lines 16. Each line 16 has a width, L, which is the same as the instinctive vector width, L, of the whole vector processing architecture 10. Arithmetic and logic operations implemented by the vector processing architecture 10 are based on the instinctive vector width, L. Similarly, memory accessing operations are also based on the instinctive vector width, L.
The vector processing architecture 10 also accommodates operations using vectors which have a vector width that is different from the instinctive vector width, L. For example, the vector processing architecture 10 may implement operations for a vector with a width, K, which is less than the instinctive vector width, L, (i.e., K<L) by masking or padding the elements beyond the K-th element (i.e., the remaining L−K elements). Also, in accessing data on the vector memory 12, the beginning address of a vector can be arbitrary. In general, within the vector memory 12, a “vector” means L elements stored in successive memory addresses. This can be stated as:

$$V = v(s),\ v(s+1),\ \ldots,\ v(s+L-1),$$

where V is a vector, and s is the starting address of the vector. This configuration is consistent with conventional vector architectures within the state of the art of circuit design.
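The vector definition and the masking of narrower vectors can be illustrated with a minimal Python sketch. The width L = 8, the flat list standing in for the vector memory, the pad value 0, and the helper name `load_vector` are all assumptions made for illustration; they are not part of the architecture described above.

```python
L = 8  # instinctive vector width (assumed value for illustration)

def load_vector(memory, s, k=L, pad=0):
    """Return L lanes starting at address s; lanes k..L-1 are masked to `pad`.

    `load_vector` is a hypothetical helper, not an instruction of any
    real vector processor.
    """
    lanes = memory[s:s + L]
    # Pad if the read would run past the end of the modeled memory.
    lanes = lanes + [pad] * (L - len(lanes))
    # Mask out the L-K lanes that are not part of the K-wide vector.
    return [x if i < k else pad for i, x in enumerate(lanes)]

memory = list(range(100, 132))       # flat memory holding 32 elements
full = load_vector(memory, 3)        # V = v(3), v(4), ..., v(10)
short = load_vector(memory, 3, k=5)  # K = 5 < L: last three lanes masked
```

Here the starting address s = 3 is deliberately not a multiple of L, reflecting the statement that the beginning address of a vector can be arbitrary.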
The matrix transposition is one of the most frequently used operations in many vector algorithms. For a given matrix:
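$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix}_{m \times n},$$

the matrix transposition, in the mathematical description, is:

$$A^{T} = \begin{bmatrix} a_{1,1} & a_{2,1} & \cdots & a_{m,1} \\ a_{1,2} & a_{2,2} & \cdots & a_{m,2} \\ \vdots & \vdots & & \vdots \\ a_{1,n} & a_{2,n} & \cdots & a_{m,n} \end{bmatrix}_{n \times m}$$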
Besides the matrix transposition itself, some other matrix operations include the matrix transposition as a sub-operation as well. For example, the Hermitian (conjugate transpose) operation, $A^{H}$, which is widely used in many algorithms, combines matrix element conjugation with matrix transposition.
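As a small illustration of transposition as a sub-operation, the following Python sketch builds the conjugate transpose from a plain transposition followed by element conjugation. The function names and the 2×3 example matrix are assumptions chosen for illustration.

```python
def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def hermitian(a):
    """Conjugate transpose: transpose first, then conjugate each element."""
    return [[x.conjugate() for x in row] for row in transpose(a)]

# A 2x3 complex matrix (arbitrary example values).
A = [[1 + 2j, 3 - 1j, 0 + 1j],
     [2 + 0j, 4 + 4j, 5 - 2j]]

AH = hermitian(A)  # 3x2 result
```

The only step that moves data is `transpose`; the conjugation is an element-wise arithmetic operation that maps naturally onto vector lanes.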
The matrix transposition is an operation that is typically more difficult to implement in vector processing architectures than in scalar processing architectures. From the original and transposed matrices shown above, the only change after the transposition is the arrangement of data elements in the matrix. In contrast to many other matrix operations, the main operations of the matrix transposition are memory array re-organizing, instead of arithmetic and logic operations. Hence, the matrix transposition operation described above is also referred to as a memory array transposition.
In many instances, the execution efficiency of a memory array transposition operation in the vector processing architecture 10 is lower than other kinds of operations. In the memory array transposition operation, the adjacent elements are scattered after the operation execution. In other words, there is not a direct correlation between the integral vector output and adjacent elements in the original memory configuration. Thus, the parallel processing advantages of the vector processing architecture 10 are not efficiently used during the data element relocation operations.
The operations for the memory array transposition in the vector processing architecture 10 normally include three operations for each line within the vector memory 12. In general, these operations include fetching an integral vector, relocating all of the elements inside the integral vector, and moving the elements to target addresses. The operations of relocating and moving the elements are usually iterated as a loop to achieve the transposition.
In more detail, a simple vector reading operation is implemented to load the data vector into a vector register 18 of the vector processor 14. This is a normal operation in the vector processing architecture 10. The target memory address of each element is then determined, and the address, S, is decomposed into two parts, a basic address and an offset, as follows:

$$S = S_{\text{basic\_address}} + S_{\text{offset}}$$
The basic address, $S_{\text{basic\_address}}$, is the largest integer multiple of the instinctive vector width, L, that does not exceed S. The offset, $S_{\text{offset}}$, is the number of remaining elements. Hence, $0 \le S_{\text{offset}} < L$. The data element relocation operation is based on the address offset, $S_{\text{offset}}$. Since $0 \le S_{\text{offset}} < L$, the input and output of the relocating operation are both vectors.
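The address decomposition can be sketched in a few lines of Python. The width L = 16 and the function name `decompose` are assumptions for illustration only.

```python
L = 16  # instinctive vector width (assumed value for illustration)

def decompose(s, width=L):
    """Split address s into (basic address, offset).

    The basic address is the largest multiple of `width` not exceeding s;
    the offset is the remainder, so 0 <= offset < width.
    """
    offset = s % width
    basic = s - offset
    return basic, offset

basic, offset = decompose(37)  # 37 = 32 + 5
```

For S = 37 and L = 16, the basic address is 32 (the start of the third line) and the offset is 5 (the lane within that line).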
In the element moving operation, each data element in the output vector of the element relocating operation is moved to its target address. Due to the data relocation in the previous operation, only the data elements whose targets share the corresponding basic address are moved in a single operation. Usually, the basic addresses for the various data elements are different. This means that the target addresses of these data elements are located in different vectors. Therefore, a single execution of the data moving operation only affects a single element, or possibly a few elements. Often, multiple loops of the vector moving operation are implemented for a single vector.
FIG. 1B illustrates a data element arrangement 20 in various stages of the memory array transposition operations. In particular, FIG. 1B illustrates an output vector memory 22 using the lines 16 shown in the vector memory 12 of FIG. 1A as input for the memory array transposition operations. Several of the individual lines 16, or vectors, are shown, including a first vector 24, a second vector 26, and a last vector 28. The first vector 24 is one of the rows of the output vector memory 22. After the matrix transposition, the first vector 24 becomes the first column of the transposed matrix. Similarly, the second vector 26 becomes the second column of the transposed matrix, and the last vector 28 becomes the last column of the transposed matrix. In conventional vector processing, which deals with row vectors, the elements of each vector are stored, one by one, in a target memory in order to rearrange the elements into a column. This can be achieved by two typical vector processing operations: shift left and masked store. The shift left operation circularly shifts all elements of the vector 30 to the next position on the left. The masked store operation stores one or more elements to the target memory while leaving the remaining elements unstored. After each shift left operation, the target element is in the left-most position and is then stored to the target memory using the masked store operation. As one example, if there are L=8 elements in a vector, then it would take eight cycles of shift left and masked store operations to transpose the vector from a row to a column.
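The shift left and masked store loop can be modeled with a short Python sketch. This is a simplified software model under stated assumptions, not the hardware behavior of any particular processor: the target memory is a flat list with L elements per line, the mask enables only the left-most lane, and each cycle performs the store before the shift. The helper names are hypothetical.

```python
L = 8  # instinctive vector width (assumed value for illustration)

def shift_left(vec):
    """Circularly shift all lanes one position to the left."""
    return vec[1:] + vec[:1]

def masked_store(memory, addr, vec, mask):
    """Store only the lanes whose mask bit is set, starting at addr."""
    for lane, (value, enabled) in enumerate(zip(vec, mask)):
        if enabled:
            memory[addr + lane] = value

def row_to_column(memory, vec, column, line_width=L):
    """Write a row vector into one column of the target memory.

    Returns the number of shift/store cycles used.
    """
    mask = [True] + [False] * (len(vec) - 1)  # left-most lane only
    cycles = 0
    for row in range(len(vec)):
        masked_store(memory, row * line_width + column, vec, mask)
        vec = shift_left(vec)  # bring the next element to the left-most lane
        cycles += 1
    return cycles

target = [None] * (L * L)
cycles = row_to_column(target, list("abcdefgh"), column=2)
```

Each cycle moves exactly one element, so a vector of L = 8 elements needs eight cycles, which matches the count given in the text.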
The execution efficiency of the process shown in FIG. 1B and described above is not high in most cases, even though the instructions are executed in vector format. As described above, only a limited number of elements (e.g., one or a few) are moved with each cycle of the vector instructions.
There are some particular cases which illustrate additional difficulties with conventional memory array transpositions. Sometimes, the dimensions of a matrix are not an integer multiple of the instinctive vector width, L. In this case, the lines of the matrix are not aligned in the vector memory 12. FIG. 2 illustrates a memory array transposition 40 in which the dimensions of a matrix 42 are not an integer multiple of the instinctive vector width, L. In the illustrated example, the instinctive vector width is eight elements, but the matrix width is 10 elements (e.g., elements e11 through e1a). Such an unaligned vector memory layout 44 adds more complication for the address decomposition in the relocating and moving operations described above.
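The misalignment can be made concrete with a small Python sketch that reports which memory lines each matrix row touches. It assumes a row-major layout with rows packed back to back, a matrix width of 10, and L = 8 as in the example; the function name `lines_spanned` is hypothetical.

```python
def lines_spanned(row, matrix_width, L):
    """Return (memory line indices touched by the row, offset of its first element)."""
    start = row * matrix_width          # address of the row's first element
    end = start + matrix_width - 1      # address of the row's last element
    return list(range(start // L, end // L + 1)), start % L

# Matrix width 10, instinctive vector width 8.
layout = [lines_spanned(r, 10, 8) for r in range(5)]
```

Under these assumptions, row 0 starts at offset 0 but row 1 starts at offset 2 and row 3 at offset 6, so the offset of each row's first element drifts from row to row and every row straddles at least two lines.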
Given the difficulties of using vector processing methods for memory array transpositions, scalar methods are often used to simplify the programming and processing parameters. In other words, scalar operations may be used exclusively for relocating elements in a memory array transposition. Using a conventional scalar method, the typical execution can be implemented according to the following pseudo code:
j=0
Loop m
    i=0
    Loop n
        Read (temp, ptr_s++)
        ptr_t=i*m+j
        Write (temp, ptr_t)
        i++
    End Loop
    j++
End Loop
The pseudo code presented above uses two nested loops (i.e., loops m and n). The loop body is simple. Given the nested loops, the loop body executes m×n times. During each iteration, at least three address index update operations, one scalar read operation, and one scalar write operation are performed. This results in a high processing load for the scalar method, especially when the matrix dimensions are large, since the number of elements in the matrix is the product of its dimensions and therefore grows quadratically for a square matrix.
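A runnable Python sketch of the scalar method is given below, with counters for the operations mentioned above. It assumes the source matrix is stored row-major in a flat list and that element (j, i) of the m×n source is written to index i*m + j of the n×m result; the function name and the counter dictionary are illustrative additions.

```python
def scalar_transpose(src, m, n):
    """Transpose an m x n row-major matrix one element at a time.

    Returns the transposed flat list and a tally of scalar operations.
    """
    dst = [None] * (m * n)
    counts = {"read": 0, "write": 0, "index_update": 0}
    ptr_s = 0
    for j in range(m):            # loop over source rows
        for i in range(n):        # loop over source columns
            temp = src[ptr_s]     # scalar read
            counts["read"] += 1
            ptr_s += 1            # index update: source pointer
            ptr_t = i * m + j     # index update: target address
            dst[ptr_t] = temp     # scalar write
            counts["write"] += 1
            # Third index update: the inner loop counter i.
            counts["index_update"] += 3
    return dst, counts

# 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored row-major.
result, counts = scalar_transpose([1, 2, 3, 4, 5, 6], m=2, n=3)
```

The tallies scale with m×n: for this 2×3 example there are six reads, six writes, and eighteen index updates.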
To compare the conventional vector and scalar methods, an example can be given for both conventional methods. Assuming an instinctive vector width of 16 and matrix size of 256×256, and assuming a proper pipeline is made during the read and store operations (i.e., ignoring the delay of the read and store operations), the 256×256 matrix transposition consumes 204,800 cycles for vector processing compared to 196,680 cycles for scalar processing. If the delay of the read and store operations is considered, the number of cycles consumed for vector processing would be even higher.