The present invention relates to the field of memory architecture and, more specifically, to an n-dimensional hyper-matrix (rectangular data array) with s data elements along each dimension.
The design of a memory architecture for an n-dimensional rectangular data array is a well-known problem, and its scope stretches to a myriad of applications. The particular cases of parallel data access in 2- and 3-dimensional rectangular data arrays are of importance for signal processing applications. Specifically, a memory architecture for 2-dimensional data access is attractive for video, image, and graphics processing, whereas data access in 3-dimensional space is attractive for 3-dimensional graphics and video signal processing.
Many image and video processing algorithms require either row-wise or column-wise access to data in a 2-dimensional data array (an image or a frame of a video sequence). The most relevant applications are lossy compression algorithms for images and video, which use 2-dimensional separable transforms such as the Discrete Cosine Transform (DCT) and the Inverse Discrete Cosine Transform (IDCT). These transforms are an integral part of the compression techniques used in widely accepted video and image compression standards such as MPEG (Moving Picture Experts Group), H.261, H.263, JPEG (Joint Photographic Experts Group), etc. In accordance with the recommendations made in these standards, each image or frame in a video sequence is divided into macroblocks, which are further divided into (8×8) blocks of data. In the encoder, the 2D-DCT operation is applied over each (8×8) block, followed by quantization and entropy coding to achieve compression. In the decoder, a 2D-IDCT operation is performed after the variable length decoding and de-quantization operations. The 2D-DCT (or 2D-IDCT) is a separable transform and can be computed by performing a 1D-DCT (or 1D-IDCT) operation over all the rows of a block followed by a 1D-DCT (or 1D-IDCT) operation over all the columns, or vice versa.
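The separability described above can be illustrated with a minimal sketch. It uses an unnormalised DCT-II written directly from its textbook definition; the function names and the sample block are illustrative assumptions, not taken from the standards or from this disclosure.

```python
import math

def dct_1d(v):
    """Unnormalised 1D DCT-II of a sequence (plain O(N^2) reference form)."""
    n = len(v)
    return [
        sum(v[k] * math.cos(math.pi * (2 * k + 1) * u / (2 * n)) for k in range(n))
        for u in range(n)
    ]

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def dct_rows(block):
    """Apply the 1D DCT to every row of the block."""
    return [dct_1d(row) for row in block]

def dct_2d_rows_then_cols(block):
    # First pass over rows, second pass over columns (via transposition).
    after_rows = dct_rows(block)
    return transpose(dct_rows(transpose(after_rows)))

def dct_2d_cols_then_rows(block):
    # First pass over columns, second pass over rows.
    after_cols = transpose(dct_rows(transpose(block)))
    return dct_rows(after_cols)

# Arbitrary (8x8) sample block, chosen only for illustration.
block = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]
a = dct_2d_rows_then_cols(block)
b = dct_2d_cols_then_rows(block)
max_diff = max(abs(a[r][c] - b[r][c]) for r in range(8) for c in range(8))
```

Both orders give the same result up to floating-point rounding. The explicit `transpose` calls stand in for the step that, in hardware, a memory with both row-wise and column-wise access would perform.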
As shown in FIG. 1, after the first 1D-(I)DCT operation 12 over all the rows (or columns) of the 8×8 block 11, the data is to be fed to the second 1D-(I)DCT block 14 in column-wise (or row-wise) fashion. This requires a memory 13 that allows both row-wise and column-wise access, because after the first 1D-(I)DCT operation 12 the data is written into memory 13 in row-wise (column-wise) fashion, whereas for the second 1D-(I)DCT operation 14 the data is read from the memory 13 in column-wise (row-wise) fashion. For a DSP processor 15 with a SIMD architecture that takes vectors of 4 data elements as operands, each (8×8) block can be divided into four data arrays of size (4×4). For each row (or column) of this (8×8) block, two row-wise (or column-wise) accesses are required, each access fetching four consecutive elements. The present invention provides a scheme that meets this requirement.
Similarly, the 3D-(I)DCT can also be computed using the 1D-(I)DCT, but in this case the transpose memory should allow parallel access to data along all three dimensions. The present invention describes a memory architecture for an n-dimensional data array that allows parallel access to data along any of the n dimensions.
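The access pattern described above can be enumerated with a short sketch. The names `S`, `M`, `row_accesses`, `col_accesses`, and `sub_array_of` are hypothetical and exist only for this illustration; the sketch lists which elements each 4-wide access touches and does not model the memory itself.

```python
S = 4  # assumed SIMD vector width (data elements per access)
M = 8  # side of the (8x8) block

def row_accesses(r):
    """The two 4-element accesses that cover row r of the block."""
    return [[(r, c) for c in range(start, start + S)] for start in range(0, M, S)]

def col_accesses(c):
    """The two 4-element accesses that cover column c of the block."""
    return [[(r, c) for r in range(start, start + S)] for start in range(0, M, S)]

def sub_array_of(access):
    """(block-row, block-col) of the (4x4) sub-array that holds an access."""
    quads = {(r // S, c // S) for r, c in access}
    assert len(quads) == 1, "an aligned access stays inside one sub-array"
    return quads.pop()

# Elements written row-wise after the first pass, and read column-wise
# before the second pass.
written = {e for r in range(M) for acc in row_accesses(r) for e in acc}
read = {e for c in range(M) for acc in col_accesses(c) for e in acc}
```

Writing all rows and reading all columns touch the same 64 elements, each through 16 accesses of four elements, which is why the memory must serve both directions at full width.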
The problem of a 2-dimensional memory architecture allowing row-wise as well as column-wise access is not new but, to the best of the authors' knowledge, there is no record of the same concept being extended to higher dimensions. Several conventional transpose memories have been proposed as solutions for carrying out the 2-dimensional matrix transpose operation.
In U.S. Pat. No. 5,740,340, as is understood, the memory cells are organized as an (s×s) data array. The s rows and s columns are addressed by 2s addresses, and a decoder decodes any one of the said 2s addresses and enables access to said s rows of data and said s columns of data. This solution appears quite restrictive, since it needs a special kind of 2-D memory in which any row or any column can be enabled at a time for accessing. In addition, all enabled locations are accessed at the same time. Extending this architecture would therefore be very complex for large data arrays that are segmented into smaller (s×s) data sub-arrays, because only a complete row (or column) of a data array can be enabled, not part of it. This complexity is not addressed in the disclosure of that patent. Further, the complexity of this scheme is higher because it involves s² banks, as compared to s banks in the present invention. Moreover, this scheme cannot be generalized to n-dimensional data arrays.
U.S. Pat. No. 5,481,487 appears to suggest a different memory architecture, which requires 4 parallel banks to store one (8×8) data array. Each bank stores one of the four quadrants of the data array, each quadrant being a (4×4) data array. This scheme appears to have the following restrictions:
1. Though address and data buses are provided for all the four banks, not all are accessed in parallel.
2. This memory architecture is restrictive in the sense that it implements only a transpose function. If data is written in row (column) order, it can be read only in column (row) order.
3. This scheme is restricted to only one (8×8) block, and cannot be generalized to store larger 2-dimensional data arrays.
4. This architecture can store consecutive (8×8) blocks (in the same memory locations) but with the following restriction. If a first (8×8) block is written in row-wise (column-wise) order then a second block must be written in column-wise (row-wise) order.
5. This scheme may not be generalized for storing n-dimensional data arrays.
In U.S. Pat. No. 4,603,348, a memory architecture has been described for storing a multi-dimensional array. According to this scheme, the n-dimensional array is divided into a number of divisions, which do not overlap. Each such division is defined as an n-dimensional array with 2 elements in each dimension. The number of banks in the proposed architecture is equal to the number of elements in each of these divisions. Each bank has one data element from a given division, hence enabling the parallel access to all elements of a division. This scheme appears to provide access only to a division of an n-dimensional array. In contrast, the scheme disclosed in the present invention provides access to data along any given dimension.
In U.S. Pat. No. 4,740,927, a bit-addressable memory has been proposed in which a 2-dimensional array of bits is divided into partition sectors equal in number to the parallel memory modules (banks) provided. Each memory module has a number of addresses equal to the number of bits in each partition sector. Each partition is divided into several s×s matrices, where s is the number of parallel banks. The logical placement of the bits of these matrices is such that the bits of any row or column lie in different memory modules, providing parallel access along rows and columns. However, the present invention proposes an architecture with less complex address generation logic. A particular case of the proposed architecture, referred to as memory architecture with dyadic permutation, provides address generation logic in which the main operation is a logical EXOR operation, as against the addition operation in the address generation logic proposed in the prior art. Moreover, unlike that scheme, the invention disclosed in this document is much more generic and also holds for dimensions greater than 2.
The present invention provides a novel solution to overcome the disadvantages of the prior art.
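The contrast between addition-based and EXOR-based bank selection can be sketched for the 2-dimensional case. The text above states only that the main operation is an EXOR rather than an addition; the specific mappings below (`bank_add` and `bank_xor`) are assumptions made for illustration and are not presented as the address logic of either disclosure.

```python
def bank_add(row, col, s):
    """Assumed addition-based mapping: needs an adder and a modulo-s step."""
    return (row + col) % s

def bank_xor(row, col, s):
    """Assumed EXOR-based mapping: bitwise only, valid when s is a power of 2."""
    return row ^ col

def conflict_free(bank_fn, s):
    """True if every row and every column of an (s x s) array hits s distinct banks."""
    full = set(range(s))
    rows_ok = all({bank_fn(r, c, s) for c in range(s)} == full for r in range(s))
    cols_ok = all({bank_fn(r, c, s) for r in range(s)} == full for c in range(s))
    return rows_ok and cols_ok

# Bank number of each element of a (4 x 4) array under the EXOR mapping.
layout_xor_4 = [[bank_xor(r, c, 4) for c in range(4)] for r in range(4)]
```

Under these assumptions both mappings place the s elements of any row or column in s different banks; the EXOR form does so without carry propagation, but only for s equal to a power of 2.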
The objective of the present invention is to provide a generalized framework of memory architecture for n-dimensional rectangular data arrays such that parallel access to data along any of the n dimensions is possible. It is claimed that the memory architecture of the present invention is generic and less complex than the architectures discussed in the prior art. It also overcomes the disadvantages of the prior art for 2-dimensional transpose memories. The objective of this invention is achieved by applying a simple, yet effective, method of rearranging (permuting) the elements of the data array while reading data from or writing data to the memory. This rearrangement is the distinguishing feature of this invention. A brief description of the invention follows.
The proposed memory architecture allows parallel access to s data elements along any given dimension of an n-dimensional data array, each dimension having s data elements. For a 2-dimensional (s×s) data array, this means that the memory architecture of the present invention allows parallel access to all s elements in any row or column of the said data array. It is evident that in order to provide parallel access to s data elements, there must be s parallel memory banks. The data of this array is stored in these banks in such a fashion that all s data elements of a vector of data that is parallel to any of the n dimensions lie in different banks.
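The storage property stated above can be checked with a small sketch. The bank rule used here, the sum of the n indices modulo s, is one well-known skewing rule chosen as an assumption for illustration; the text above does not state that this is the permutation used by the invention. The names `bank_of`, `lines_along`, and `all_lines_conflict_free` are hypothetical.

```python
from itertools import product

def bank_of(index, s):
    """Assumed skewing rule: bank = (sum of all n indices) mod s."""
    return sum(index) % s

def lines_along(dim, n, s):
    """Yield every vector of s elements parallel to dimension `dim`."""
    for fixed in product(range(s), repeat=n - 1):
        # Insert the running index k at position `dim` of the fixed indices.
        yield [fixed[:dim] + (k,) + fixed[dim:] for k in range(s)]

def all_lines_conflict_free(n, s):
    """True if each vector along each dimension occupies s different banks."""
    return all(
        len({bank_of(idx, s) for idx in line}) == s
        for dim in range(n)
        for line in lines_along(dim, n, s)
    )
```

For example, `all_lines_conflict_free(3, 3)` covers the n=s=3 case: along a vector only one index changes, so the sum takes s consecutive values and the banks are all different.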
1. More specifically, the present scheme allows parallel access to s-elements along any given dimension of an n-dimensional data array with s-elements in each dimension. For an example with n=s=3, please refer to Sub-array 21 in FIG. 2.
2. The hardware complexity of this scheme is less than that of similar solutions proposed in the past. The reduction in complexity is achieved by introducing a particular type of rearrangement of the data to be read from or written into the memory. This feature distinguishes this scheme from other solutions to the given problem.
3. Unlike the schemes proposed in the prior art, the scheme described in the present invention is not restricted to 2-dimensional data arrays. This scheme is generic for n-dimensional rectangular data arrays.
4. Unlike other similar solutions, this scheme can be extended to cover a larger n-dimensional data array with m (m=st; where t is greater than 1) data elements along each dimension, which can be divided into smaller n-dimensional rectangular data sub-arrays with s-elements along each dimension. Please refer to FIG. 2 for an illustration with n=s=3 and m=6.
5. Further, at the cost of a little more complexity, the scheme can be generalized to access the s data elements in parallel starting from any index within an n-dimensional sub-array. More precisely, the s data elements to be accessed need not start from a boundary of a sub-array and hence may stretch into an adjacent n-dimensional sub-array (refer to Sub-arrays 22, 23, and 24 in FIG. 2).
6. The complexity of the address generation logic in this scheme is reduced significantly if the parameter s is an integer power of 2.
7. This scheme can also be used to access data serially, if addresses are issued serially.
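Items 4 and 5 above can be illustrated with a sketch for the n=s=3, m=6 case, with m taken as s times t. The `place` function below is a hypothetical bank and address assignment chosen for illustration (bank from the sum of the global indices modulo s, address from the remaining index information); it is not presented as the address generation logic of the invention.

```python
from itertools import product

def place(index, s, m):
    """Hypothetical (bank, address) of an element of an array with m elements per dimension."""
    bank = sum(index) % s
    addr = 0
    for i in index[:-1]:           # all indices except the last, as base-m digits
        addr = addr * m + i
    addr = addr * (m // s) + index[-1] // s   # sub-array number along the last dimension
    return bank, addr

def access(start, dim, s):
    """The s consecutive elements along `dim`, beginning at `start`."""
    return [start[:dim] + (start[dim] + k,) + start[dim + 1:] for k in range(s)]

n, s, m = 3, 3, 6
# Every element must receive its own (bank, address) pair.
placements = {place(idx, s, m) for idx in product(range(m), repeat=n)}
# Every access of s elements, aligned to a sub-array boundary or not,
# must touch s different banks.
unaligned_ok = all(
    len({place(idx, s, m)[0] for idx in access(start, dim, s)}) == s
    for dim in range(n)
    for start in product(range(m), repeat=n)
    if start[dim] + s <= m
)
```

An access such as `access((0, 2, 1), 1, 3)` starts in one sub-array and ends in the next, yet still touches three different banks under this assignment. When s is an integer power of 2, the `% s` and `// s` operations reduce to a bit mask and a shift, which is consistent with item 6.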