Displaying images and playing back audio-video data (generally referred to as content data) have gradually become the most popular applications of computing apparatuses, and they demand ever higher processing speed from the microprocessor.
Currently, parallel processing architectures such as single instruction multiple data (SIMD), multiple instruction multiple data (MIMD), and vector or array processing are the dominant approaches to increasing the processing speed of the microprocessor. A vector machine that applies the parallel processing structure of a SIMD processor can reduce computing time when collectively processing a large amount of vector data, such as image data composed of pixels. Vector machines applying SIMD processors are therefore suitable for running image processing and video encoding/decoding applications, which have heavy matrix computation loads.
In the field of image and audio-video processing, transpose encoding and matrix transposing are commonly used techniques for compressing and decompressing image and audio-video data. Transposing a matrix means rearranging the columns of the matrix into rows.
In vector machines, vector transposing is usually carried out through transpose registers. These transpose registers are special register arrays that support horizontal write-in and vertical read-out. As illustrated in FIG. 2, the data held in the register array 200 are stored in row-first, column-second order. When the data are read out of the register array, special routing channels are used to read a vertical data column into a vector. Taking FIG. 2 as an example, data [A0|A1|A2|A3], [B0|B1|B2|B3], [C0|C1|C2|C3], [D0|D1|D2|D3] are first written into the transpose register through its write ports 201a, 201b, 201c, 201d. Data [A0,A1,A2,A3] are then stored in the array cells denoted as 202a, 202b, 202c, 202d; data [B0,B1,B2,B3] are stored in the array cells denoted as 203a, 203b, 203c, 203d; data [C0,C1,C2,C3] are stored in the array cells denoted as 204a, 204b, 204c, 204d; and data [D0,D1,D2,D3] are stored in the array cells denoted as 205a, 205b, 205c, 205d.
When a read operation is carried out, the vector data are read out from the read ports 206a, 206b, 206c, 206d. The data read out are organized in the vertical direction: the first vector read out is formed by concatenating the contents of array cells 202a, 203a, 204a, 205a, i.e. the data read out are [A0,B0,C0,D0], as shown in the drawing. Vector data [A1,B1,C1,D1], [A2,B2,C2,D2], [A3,B3,C3,D3] can be read out in a similar manner. Writing horizontally into the transpose register array and then reading vertically out of it is equivalent to transposing a matrix. In this manner, vector computations can be performed with the aid of matrix transposing regardless of whether the matrix computation runs in the vertical or the horizontal direction.
When a matrix computation in the vertical direction is desired, obtaining a transposed N×N matrix involves N horizontal write operations and N vertical read operations (2×N cycles). More precisely, obtaining a transposed 4×4 matrix requires 4 horizontal vector writes and 4 vertical vector reads, which results in 8 cycles in total.
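The write-in and read-out behaviour described above, together with its 2×N cost, can be illustrated with a small software model. This is only a sketch: the class name `TransposeRegister`, its methods, and the assumption that each vector write or vector read takes exactly one cycle are introduced here for illustration and do not describe any particular hardware.

```python
class TransposeRegister:
    """Toy software model of an N x N transpose register array (hypothetical)."""

    def __init__(self, n):
        self.n = n
        self.cells = [[None] * n for _ in range(n)]
        self.cycles = 0  # assumption: one cycle per vector write or vector read

    def write_row(self, row, vector):
        """Horizontal write-in: store one vector into one row of cells."""
        assert len(vector) == self.n
        self.cells[row] = list(vector)
        self.cycles += 1

    def read_column(self, col):
        """Vertical read-out: gather one cell from every row into a vector."""
        self.cycles += 1
        return [self.cells[row][col] for row in range(self.n)]


def transpose_via_register(matrix):
    """Transpose a square matrix with N row writes followed by N column reads."""
    n = len(matrix)
    reg = TransposeRegister(n)
    for r, vec in enumerate(matrix):
        reg.write_row(r, vec)
    out = [reg.read_column(c) for c in range(n)]
    return out, reg.cycles


rows = [
    ["A0", "A1", "A2", "A3"],
    ["B0", "B1", "B2", "B3"],
    ["C0", "C1", "C2", "C3"],
    ["D0", "D1", "D2", "D3"],
]
transposed, cycles = transpose_via_register(rows)
print(transposed[0])  # ['A0', 'B0', 'C0', 'D0']
print(cycles)         # 8, i.e. 2 x N for N = 4
```

Under the one-cycle-per-access assumption, the model reproduces the 8 cycles stated for the 4×4 case.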
However, for performance-critical algorithm kernels, such as the 6-tap symmetric filtering in the H.264 standard (i.e. advanced video coding for generic audiovisual services), transposing a matrix before filtering can impose a heavy overhead on the efficiency of the algorithm. The desired H.264 symmetric filtering is illustrated in FIG. 3, in which each box represents a pixel in a displayed picture. For example, the boxes denoted as 301a, 301b, 301c, 301d, 301e, 301f, 301g, 301h, 301i contain an array of reference pixels [p0,p1,p2,p3,p4,p5,p6,p7,p8]. The 6-tap symmetric filtering needs to obtain the predicted half-pixel array [p9,p10,p11,p12] contained in boxes 302a, 302b, 302c, 302d from the known pixels [p0,p1,p2,p3,p4,p5,p6,p7,p8] by interpolating with the following equations (1)-(4):
p9 = a0*p0 + a1*p1 + a2*p2 + a2*p3 + a1*p4 + a0*p5  (1)
p10 = a0*p1 + a1*p2 + a2*p3 + a2*p4 + a1*p5 + a0*p6  (2)
p11 = a0*p2 + a1*p3 + a2*p4 + a2*p5 + a1*p6 + a0*p7  (3)
p12 = a0*p3 + a1*p4 + a2*p5 + a2*p6 + a1*p7 + a0*p8  (4)
where p0-p8 are known pixels used as interpolation references and a0, a1, a2 are filtering coefficients. In the H.264 standard, a0=1, a1=−5, a2=20; p9-p12 are the half pixels predicted from the 9 reference pixels p0-p8.
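Equations (1)-(4) amount to sliding a 6-tap window across the nine reference pixels. The following sketch evaluates them literally; the function names and the sample pixel values are hypothetical, and any rounding, normalization, or clipping that a complete codec would apply afterwards is deliberately left out because it does not appear in equations (1)-(4).

```python
# Filter coefficients as given in the text for H.264: a0 = 1, a1 = -5, a2 = 20.
A0, A1, A2 = 1, -5, 20
TAPS = (A0, A1, A2, A2, A1, A0)  # symmetric 6-tap kernel


def six_tap(pixels, start):
    """Weighted sum of six consecutive pixels beginning at index `start`."""
    window = pixels[start:start + 6]
    return sum(c * p for c, p in zip(TAPS, window))


def horizontal_half_pixels(ref):
    """Return [p9, p10, p11, p12] from reference pixels [p0..p8], eqs. (1)-(4)."""
    assert len(ref) == 9
    return [six_tap(ref, i) for i in range(4)]


ref = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # made-up sample values for p0..p8
print(horizontal_half_pixels(ref))  # [1120, 1440, 1760, 2080]
```

Each output needs a different, overlapping window of the same row, which is why the four results do not map directly onto independent SIMD lanes when the pixels sit side by side in one register.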
Normally, pixels p0-p8 are 8-bit words. Thus, each 64-bit vector register can hold 8 pixels. Assume that the contents of the eight 64-bit vector registers v0-v7 are:
v0=[p0|p1|p2|p3|p4|p5|p6|p7]
v1=[q0|q1|q2|q3|q4|q5|q6|q7]
v2=[r0|r1|r2|r3|r4|r5|r6|r7]
v3=[s0|s1|s2|s3|s4|s5|s6|s7]
v4=[ . . . ]
v5=[ . . . ]
v6=[ . . . ]
v7=[ . . . ]
Because the reference pixels p0-p8 are originally organized horizontally, it is difficult to obtain half pixels p9, p10, p11, p12 in parallel with SIMD (Single Instruction Multiple Data) instructions.
In order to exploit the parallelism, the vertical half pixels p9, p13, p14, p15 shown in FIG. 3 can be processed in parallel instead. This, however, necessitates transposing the pixel matrix. After the matrix is transposed, the contents of the eight 64-bit vector registers v0-v7 become:
v0′=[p0|q0|r0|s0|..|..|..|..]
v1′=[p1|q1|r1|s1|..|..|..|..]
v2′=[p2|q2|r2|s2|..|..|..|..]
v3′=[p3|q3|r3|s3|..|..|..|..]
v4′=[p4|q4|r4|s4|..|..|..|..]
v5′=[p5|q5|r5|s5|..|..|..|..]
v6′=[p6|q6|r6|s6|..|..|..|..]
v7′=[p7|q7|r7|s7|..|..|..|..]
Transposing the vector data from its original horizontal organization to a vertical organization facilitates SIMD processing. Half pixels p9, p13, p14, p15 can then be calculated in parallel:
[p9|p13|p14|p15] = a0*v0′ + a1*v1′ + a2*v2′ + a2*v3′ + a1*v4′ + a0*v5′
That is to say, still taking FIG. 3 as an example, p9, p13, p14, p15 are computed from the transposed matrix in the following manner:
p9 = a0*p0 + a1*p1 + a2*p2 + a2*p3 + a1*p4 + a0*p5
p13 = a0*q0 + a1*q1 + a2*q2 + a2*q3 + a1*q4 + a0*q5
p14 = a0*r0 + a1*r1 + a2*r2 + a2*r3 + a1*r4 + a0*r5
p15 = a0*s0 + a1*s1 + a2*s2 + a2*s3 + a1*s4 + a0*s5
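The lane-wise computation on the transposed registers can be emulated in software as follows. In this sketch, plain Python lists stand in for vector registers, only four lanes are modelled (one each for the p, q, r and s rows), and the helper names and sample pixel values are hypothetical. The accumulation loop mirrors a0*v0′ + a1*v1′ + a2*v2′ + a2*v3′ + a1*v4′ + a0*v5′, with each step acting on all lanes at once.

```python
A0, A1, A2 = 1, -5, 20
TAPS = (A0, A1, A2, A2, A1, A0)


def transpose(rows):
    """Turn row vectors into column vectors (v0', v1', ... in the text)."""
    return [list(col) for col in zip(*rows)]


def vscale(k, v):
    """Multiply every lane of vector v by scalar k."""
    return [k * x for x in v]


def vadd(u, v):
    """Lane-wise addition of two vectors."""
    return [a + b for a, b in zip(u, v)]


def filter_transposed(vt):
    """Compute [p9, p13, p14, p15] from the transposed registers v0'..v5'."""
    acc = [0] * len(vt[0])
    for coeff, vec in zip(TAPS, vt[:6]):
        acc = vadd(acc, vscale(coeff, vec))
    return acc


rows = [
    [1, 2, 3, 4, 5, 6, 7, 8],     # p0..p7 (made-up values)
    [8, 7, 6, 5, 4, 3, 2, 1],     # q0..q7
    [0, 0, 10, 10, 0, 0, 0, 0],   # r0..r7
    [5, 5, 5, 5, 5, 5, 5, 5],     # s0..s7
]
result = filter_transposed(transpose(rows))
print(result)  # [112, 176, 400, 160]
```

Each lane of the result equals the scalar 6-tap sum over the first six pixels of the corresponding row, so one sequence of vector operations yields all four half pixels; the cost that remains is the transposition itself.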
However, transposing the matrix into the desired format incurs a number of extra instructions. For example, transposing an N×N pixel matrix costs an overhead of 2×N cycles. Transposing a matrix before filtering can therefore impose a heavy overhead on the efficiency of the algorithm. Thus, a new method that eliminates the 2×N transposing overhead for horizontal symmetric filtering is needed.