1. Field of the Invention
The present invention relates to a memory mapping apparatus and method in a video decoder/encoder, and more particularly, to a memory mapping apparatus and method using an independently-accessible memory bank structure.
2. Description of the Related Art
H.264/AVC is a standard recommended by the Joint Video Team (JVT), a cooperative project of ISO/IEC MPEG and ITU-T VCEG (see “Text of ISO/IEC FDIS 14496-10: Information Technology—Coding of audio-visual objects—Part 10: Advanced Video Coding”, ISO/IEC JTC 1/SC 29/WG 11.n5555, March 2003).
FIG. 1 is a schematic block diagram showing a decoder according to the H.264/AVC standard.
In order to speed up the decoder, an entropy decoder (P1) 105, a prediction unit (P2) 110, and a de-block filter (P3) 115 are connected in a pipeline that operates in units of a macroblock. A RISC core (P0) 100 controls the macroblock-level pipeline and parses header information of picture data.
There are two buses used for the H.264/AVC codec. One is an external bus 135 connecting an external buffer (not shown), the RISC core (P0) 100, the prediction unit (P2) 110, and the de-block filter (P3) 115. The other is an internal bus 130 connecting the modules (P0 to P3) 100, 105, 110, 115 and an on-chip SRAM 120. In addition, a parser buffer 140 and a PCI unit 145 are connected to the internal bus 130.
Each of the RISC core (P0) 100 and other modules (P1 to P3) 105, 110, 115 serves as a bus master, so that a wrapper having the bus master for the connected buses is needed. The RISC core (P0) 100 is directly connected to an instruction SRAM (INSTR. SRAM) 155.
Now, operations of the decoder according to the H.264/AVC standard will be described.
A network abstraction layer (NAL) video bit stream is input through a PCI slave interface. The NAL video bit stream is divided into plural pieces of data having a desired size by the RISC core (P0) 100, subjected to entropy-decoding by the entropy decoder (P1) 105, and stored in the parser buffer 140. In order to speed up data processing, the RISC core (P0) 100 and the entropy decoder (P1) 105 store the plural pieces of data in separate parser buffers.
The parser buffer 140 has a size suitable for decoding one macroblock. The data stored in the parser buffer 140 is read in units of a macroblock by the entropy decoder (P1) 105 and subjected to parsing in units of a syntax.
The parsing mainly comprises context adaptive variable length decoding (CAVLD) and context adaptive binary arithmetic decoding (CABAD). A unit for performing CAVLD, that is, a CAVLD unit, supports an Exp-Golomb code operation in the parsing of the RISC core (P0) 100. The data decoded by the entropy decoder (P1) 105 is transferred to the prediction unit (P2) 110. The entropy decoder (P1) 105 and the prediction unit (P2) 110 transmit and receive data through a dual on-chip SRAM 107 disposed therebetween, so that the internal bus 130 is not used for data transmission therebetween. The entropy decoder (P1) 105 accesses the on-chip SRAM 120 via the internal bus 130 to acquire necessary information during the entropy-decoding. In general, the on-chip SRAM 120 has a size of about 16 KB.
The prediction unit (P2) 110 receives data stored in the dual on-chip SRAM 107 and data from an external bus of a memory controller 125. The prediction unit (P2) 110 uses a relatively wide bandwidth to access data of the external buffer. That is, to perform data processing in real time, the prediction unit (P2) 110 needs to access a large amount of data simultaneously. In particular, in the H.264/AVC codec, a larger amount of data needs to be accessed than in other conventional video standards.
After the prediction, the data stored in a dual on-chip SRAM 112 is transmitted to the de-block filter (P3) 115. The de-block filter (P3) 115 calculates a filter coefficient used to reduce a block effect on the restored data obtained from the prediction unit (P2) 110, and stores the calculation result in an SRAM. The calculation result stored in the SRAM is then stored in an external buffer. The transmission of the calculation result from the SRAM to the external buffer may be performed using an independent pipeline.
FIG. 2 shows macroblock and sub-macroblock partitions which are used as units for the prediction.
Referring to FIGS. 1 and 2, the prediction unit (P2) 110 performs prediction in various units of block sizes ranging from a 4×4-byte (minimum-size) partition to a 16×16-byte partition. For example, the sub-macroblock is an 8×8-byte block obtained by dividing the macroblock in units of an 8×8-byte partition. If necessary, the prediction unit (P2) 110 may divide the sub-macroblock into units of an 8×4-byte, 4×8-byte, or 4×4-byte partition used for prediction.
In the H.264 codec, the prediction unit (P2) 110 uses a 6-tap filter and a bilinear interpolation filter for luma and chroma data, respectively, so that boundary data, as well as the partition data, is required to be read from the external buffer.
In the case of the prediction unit (P2) 110 performing motion prediction in units of a 4×4-byte block, the data access amount further increases in comparison with cases of performing motion prediction in units of a 16×16-byte or 8×8-byte block. Therefore, in the case of performing inter-prediction, it is important to perform memory management effectively when the data is read from the external buffer.
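The growth in data access amount for smaller partitions may be illustrated by the following minimal sketch. It is not taken from the decoder of FIG. 1. It assumes one byte per luma sample, a worst case in which the motion vector is fractional in both directions so that the 6-tap filter needs 5 extra samples per dimension (2 before and 3 after the partition), and no reuse of overlapping reference data between partitions. The function names are hypothetical.

```python
def luma_reference_bytes(part_w, part_h, taps=6):
    """Bytes of reference data needed to interpolate one luma partition.

    Assumption: worst-case fractional motion vector, so (taps - 1) extra
    samples are needed in each dimension; one byte per sample.
    """
    margin = taps - 1
    return (part_w + margin) * (part_h + margin)


def macroblock_reference_bytes(part_w, part_h, mb_size=16):
    """Total reference bytes when a macroblock is split into equal partitions.

    Assumption: each partition is fetched independently, with no sharing of
    overlapping boundary data.
    """
    partitions = (mb_size // part_w) * (mb_size // part_h)
    return partitions * luma_reference_bytes(part_w, part_h)


for size in (16, 8, 4):
    print(f"{size}x{size} partitions: {macroblock_reference_bytes(size, size)} bytes")
```

Under these assumptions, a macroblock predicted as one 16×16 partition reads 441 bytes, as four 8×8 partitions 676 bytes, and as sixteen 4×4 partitions 1296 bytes.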
FIGS. 3A to 3C show timing diagrams in a burst mode.
In the burst mode, that is, in a consecutive recording/reproducing scheme, a large amount of data is simultaneously read from the external buffer through direct memory access (DMA) 150, so that bus usage efficiency may be improved.
If a 64-bit (8-byte) data bus is used, FIG. 3A shows a timing diagram to access 64-bit data once; FIG. 3B shows a timing diagram to access 64-bit data four times consecutively; and FIG. 3C shows a timing diagram to access 64-bit data eight times consecutively. To use the burst mode in the cases of FIGS. 3B and 3C, the data blocks to be read or written need to be located consecutively.
If an address of the external buffer where data is stored is requested, the memory controller 125 accesses the data corresponding to the request address, and consecutively accesses the next data corresponding to the next address, which is the request address plus 64 bits, because the data and the next data are consecutively located.
In the cases of FIGS. 3A to 3C, if an address where to-be-accessed data is stored is requested, the memory controller 125 accesses the corresponding data after a predetermined latency. As a result, in the cases of FIGS. 3B and 3C, it takes less time to access the same amount of data than in the case of FIG. 3A, because the total latency incurred in accessing the same amount of data is smaller in the cases of FIGS. 3B and 3C than in the case of FIG. 3A.
Assuming that a latency of a particular external buffer and a time interval taken to access 64-bit data are 6 cycles and 1 cycle, respectively, the total time intervals taken to access data in the cases of FIGS. 3A to 3C are 7 cycles (for 64 bits), 10 cycles (for 64×4=256 bits), and 14 cycles (for 64×8=512 bits), respectively. If 512-bit data is accessed by using the method of FIG. 3A, a time interval of 56 (=7×8) cycles is taken. However, by using the method of FIG. 3B, a time interval of 20 (=10×2) cycles is taken to access the 512-bit data. Therefore, it is possible to improve the bus usage efficiency by consecutively accessing data blocks after one address request.
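The cycle counts above may be reproduced with the following minimal sketch, which uses the stated example values (a latency of 6 cycles per address request and 1 cycle per 64-bit transfer). The function name and parameters are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def access_cycles(total_bits, burst_len, bus_bits=64, latency=6, cycles_per_word=1):
    """Cycles to access total_bits when each address request is followed by
    up to burst_len consecutive bus-width transfers."""
    words = math.ceil(total_bits / bus_bits)      # number of 64-bit transfers
    requests = math.ceil(words / burst_len)       # one latency per request
    return requests * latency + words * cycles_per_word


for figure, burst_len in (("FIG. 3A", 1), ("FIG. 3B", 4), ("FIG. 3C", 8)):
    print(figure, access_cycles(512, burst_len), "cycles for 512 bits")
```

With these values, 512 bits take 56 cycles using single accesses, 20 cycles using four-transfer bursts, and 14 cycles using one eight-transfer burst.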
In general, if data of an external buffer is read through the DMA in the burst mode, a large amount of data may be read efficiently at one time, so that more data is read in the same number of cycles. However, since there is a limit on the amount of data consecutively readable after a single address request, the bus usage efficiency may still be lowered due to the latency involved in accessing data, even in a case where the burst mode is used. In addition, since data is accessed in units of a block, continuity of data is not ensured, so that it is difficult to use the burst mode efficiently.
FIGS. 4 to 6 show a conventional method of storing picture data in a memory and accessing the memory.
More specifically, FIG. 4 shows an operation of storing the picture data in the memory. The picture data having a size of M×N bytes is read in a raster scan scheme and sequentially stored in the memory.
FIG. 5 shows an operation of accessing particular regions of the picture data in the memory of FIG. 4. When a first block is accessed, since the memory locations corresponding to the lines of the block are not adjacent to each other, a new address is requested every time data corresponding to one of the lines is read, so that the latency is incurred many times.
FIG. 6 shows a timing diagram to access particular regions of the picture data in the memory. The latency occurs every time data corresponding to one of the lines is read. Therefore, in the case of accessing the picture data of a 16×16 macroblock, 16 latencies occur.
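The count of 16 latencies may be checked with the following minimal sketch of raster-scan addressing. It assumes one byte per sample and a byte address of y×M+x for the sample at column x and row y of a picture that is M bytes wide. The picture width of 720 and the block position used in the example are hypothetical values chosen only for illustration.

```python
def line_start_addresses(pic_width, x0, y0, block_h):
    """Byte address of the first sample of each line of a block, for a
    picture stored in raster-scan order (address = y * pic_width + x)."""
    return [(y0 + row) * pic_width + x0 for row in range(block_h)]


def count_address_requests(pic_width, x0, y0, block_w, block_h):
    """Number of separate address requests (and hence latencies) needed to
    read a block: a new request is needed whenever a line does not start
    immediately after the end of the previous line."""
    starts = line_start_addresses(pic_width, x0, y0, block_h)
    requests = 1
    for prev, cur in zip(starts, starts[1:]):
        if cur != prev + block_w:
            requests += 1
    return requests


# Hypothetical example: a 16x16 macroblock at (32, 16) in a 720-byte-wide picture.
print(count_address_requests(720, 32, 16, 16, 16))
```

Because consecutive lines of the block start 720 bytes apart rather than 16 bytes apart, every line requires its own address request, which gives 16 requests for the 16 lines.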