With the advance of Integrated Circuit (IC) processes, more computing components and larger-capacity Static Random Access Memory (SRAM) can be integrated on a chip. A high-speed embedded signal processing chip can be designed with multiple computing components and multiple on-chip memories of large capacity and bit width to enable parallel computation and storage. Signal processing algorithms generally organize input/output data as a matrix, and use the matrix as the object of computation. Matrix data are generally stored in a memory by rows or by columns. The R/W ports of a memory have a fixed bit width and are sequentially addressed. When a matrix is stored by rows, the memory can read/write multiple elements in a row of the matrix in parallel at a time, but cannot read/write multiple elements in a column of the matrix in parallel. When a matrix is stored by columns, the memory can read/write multiple elements in a column of the matrix in parallel at a time, but cannot read/write multiple elements in a row of the matrix in parallel.
FIG. 1 is a schematic diagram of the structure of a conventional on-chip memory and its addressing scheme, showing the locations of the elements of a matrix in the conventional on-chip memory when the data type of the matrix is consistent with the memory unit. As shown in FIG. 1, assume that the memory R/W port 110 has a bit width of 4, i.e., 4 elements are stored in one row of the memory 100, and 4 elements having consecutive addresses can be read/written in parallel in one operation. The matrix A has a size of 4×4, and the element at the ith row and jth column of the matrix is denoted as aij (0≦i<4, 0≦j<4). The matrix A is stored by rows starting at address 0. In this case, the memory 100 can read/write the 4 elements in a row of the matrix in parallel at a time. Because the elements in a column of the matrix are distributed over multiple rows 104 of the memory, only one element of a column can be read/written at a time, and parallel read/write of the elements in a column is impossible.
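The addressing scheme of FIG. 1 can be illustrated with a minimal sketch. It assumes a row-major layout starting at address 0 and a port that delivers one memory row of 4 elements per operation, as described above. The names `PORT_WIDTH`, `address`, and `accesses_needed` are hypothetical and chosen only for this illustration; they do not come from the figure.

```python
# Sketch of the conventional addressing scheme described for FIG. 1.
# Assumptions (labels are illustrative, not from the source):
#   - the 4x4 matrix A is stored by rows starting at address 0
#   - one memory operation returns one memory row of PORT_WIDTH elements

PORT_WIDTH = 4  # elements per memory row / per R/W operation
N = 4           # matrix A is N x N


def address(i, j, base=0, n_cols=N):
    """Address of element a_ij when the matrix is stored by rows."""
    return base + i * n_cols + j


def accesses_needed(addresses, port_width=PORT_WIDTH):
    """Number of distinct memory rows touched, i.e. R/W operations needed."""
    return len({a // port_width for a in addresses})


row_addrs = [address(1, j) for j in range(N)]  # elements of matrix row 1
col_addrs = [address(i, 3) for i in range(N)]  # elements of matrix column 3

print(row_addrs, accesses_needed(row_addrs))
print(col_addrs, accesses_needed(col_addrs))
```

Under these assumptions the four elements of a matrix row share one memory row and need a single operation, while the four elements of a matrix column fall in four different memory rows and need four operations.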
In a signal processing system, parallel read/write of matrix elements in a column is often required in addition to parallel read/write of matrix elements in a row. For example, a signal processing algorithm may take three matrices (A, B, D) as input and be expected to produce two matrix multiplication results, C=A×B and E=B×D, while the signal processing system has 4 computation units capable of parallel computation. When C=A×B is calculated, it is necessary to read/write 4 elements in a column of the matrix B in parallel; when E=B×D is calculated, it is necessary to read/write 4 elements in a row of the matrix B in parallel. Accordingly, in addition to parallel read/write of the matrix B by row, parallel read/write of the matrix B by column is also required over the course of the algorithm. Unfortunately, the conventionally structured memory is capable of parallel read/write only by row or only by column. When the memory fails to provide the required 4 operands concurrently in each clock cycle, only one of the 4 computation units can be in an active state, and this inevitably degrades the operational efficiency of the overall system.
Matrices may have various data types. Common data types include the 8-bit byte, the 16-bit short word, the 32-bit integer and single-precision floating-point number, and the 64-bit double-precision floating-point number. The memory units have one fixed data type, and each address corresponds to either 8-bit data or 32-bit data. In order to express all the data types with the most basic memory unit of the memory, a common approach is to concatenate multiple consecutive low-bit-width data into one high-bit-width data. As shown in FIG. 2, assume that the memory unit is an 8-bit byte, and that the matrix has a size of 4×2 and a data type of 16-bit short word. The matrix elements are arranged by rows, and each matrix element is formed by concatenating two consecutive 8-bit bytes. In FIG. 1, the data type of the matrix matches the memory unit. The addresses of the elements of a matrix column are {3, 7, 11, 15}, that is, the addresses of the column are discrete. In FIG. 2, however, the data type of the matrix does not match the memory unit. The addresses of the elements of a matrix column are {2, 3, 6, 7, 10, 11, 14, 15}, that is, the addresses of the column are discrete as a whole, but some of them are consecutive. Therefore, during parallel read/write of matrix row and column elements, it is necessary to take the different data types of the matrix elements into account, and accordingly to use different read/write granularities. Here, “read/write granularity” refers to the number of memory units at consecutive addresses.
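The two address sets quoted above can be reproduced with a short sketch. It assumes a row-major layout starting at address 0, takes the last column of each matrix (column 3 in FIG. 1, column 1 in FIG. 2), and measures granularity as the length of the first run of consecutive addresses. The function names `column_unit_addresses` and `granularity` are hypothetical and introduced only for illustration.

```python
# Sketch: memory-unit addresses occupied by one matrix column, for a matrix
# stored by rows, when each element spans `elem_units` consecutive memory units.
# Assumptions (illustrative only): base address 0, last column of each matrix.


def column_unit_addresses(n_rows, n_cols, col, elem_units, base=0):
    """List the memory-unit addresses of every element in column `col`."""
    addrs = []
    for i in range(n_rows):
        start = base + (i * n_cols + col) * elem_units
        addrs.extend(range(start, start + elem_units))
    return addrs


def granularity(addrs):
    """Length of the leading run of consecutive addresses."""
    run = 1
    for prev, cur in zip(addrs, addrs[1:]):
        if cur != prev + 1:
            break
        run += 1
    return run


# FIG. 1 case: 4x4 matrix, element type equals the memory unit.
fig1 = column_unit_addresses(n_rows=4, n_cols=4, col=3, elem_units=1)

# FIG. 2 case: 4x2 matrix of 16-bit short words in a byte-addressed memory.
fig2 = column_unit_addresses(n_rows=4, n_cols=2, col=1, elem_units=2)

print(fig1, granularity(fig1))
print(fig2, granularity(fig2))
```

With these assumptions the FIG. 1 column yields isolated addresses (granularity 1), whereas the FIG. 2 column yields pairs of consecutive addresses (granularity 2), which is why a single fixed read/write granularity cannot serve both data types.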
Some patent documents have discussed how to perform read/write operations on matrix rows/columns, but none has yet achieved multi-granularity parallel read/write of matrix rows/columns at the level of the SRAM architecture. Patent documents such as U.S. Pat. No. 6,084,771B (“Processor With Register File Accessible By Row Column to Achieve Data Array Transposition”) and CN Patent 200910043343.5 (“Matrix Register File with Separate Row and Column Access Ports”) provide a register file that supports read/write of matrix rows/columns. However, the matrix data are still stored in the memory, so the matrix data must first be loaded from the memory into the register file before matrix rows/columns can be read/written in the register file. Moreover, the register file has a very small capacity, and thus only a small part of the matrix data can be read/written in each operation. Further, these patent documents do not consider how to support different data types. U.S. Pat. No. 7,802,049B2 (“Random Access Memory Have Fast Column Access”) primarily discusses how to rapidly acquire consecutive data from the memory rows of a DRAM, but does not discuss parallel read/write of matrix rows/columns.