1. Field of the Invention
The invention relates generally to a bus architecture for a system-on-a-chip (SOC). More particularly, the invention relates to a dual layer SOC bus architecture adapted for use in high-performance multimedia processing applications.
2. Description of the Related Art
Modern electronic devices increasingly provide users with various multimedia processing capabilities. For example, portable electronic devices such as cellular phones and personal digital assistants (PDAs) allow users to capture, download, display, or otherwise process various forms of multimedia information such as audio and video. As the use of multimedia-enabled devices becomes increasingly widespread, the demand for smaller, faster devices continues to grow. Accordingly, improved designs for multimedia-enabled devices are constantly in demand.
One approach to the design and manufacture of small, high performance electronic devices involves placing all of the necessary system elements within a single integrated circuit (IC). Such an arrangement or implementation of elements is commonly referred to as a system-on-a-chip (SOC). For example, a SOC for an audio processing application may combine an audio receiver, an analog-to-digital converter (ADC), a microprocessor, a memory, and input/output logic on a single IC chip.
One problem associated with conventional SOC architectures is that they are not well adapted to processing data in several commonly used multimedia formats. For example, conventional SOC architectures typically provide sluggish performance and consume excessive power when coding (i.e., encoding and decoding) data in any one of the various Moving Picture Experts Group (MPEG) formats. This is due, at least in part, to the fact that the conventional SOC architectures are easily overwhelmed by the large amount of data that is read from and written to memory during coding procedures. In order to overcome this problem, improved bus architectures designed to accommodate the expanded bandwidth (i.e., data-carrying capacity) requirements of multimedia processing applications are needed.
To better understand the bandwidth requirements for multimedia processing applications, a brief overview of MPEG encoding and decoding will be provided. MPEG is just a selected example. Any one of a number of coding examples might be alternatively presented, but MPEG is a widely understood standard and provides an excellent teaching predicate for the discussion of the invention that follows.
In general, the term “encoding” refers to a process of converting raw, unstructured input data into a structured, or coded format. For example, in the case of MPEG encoding, this process comprises transforming a sequence of input video frames into a sequence of coded, or compressed data frames. The device used to carry out the process of encoding is generically referred to as an encoder. Many different encoder designs are conventionally available to perform MPEG encoding.
The term “decoding” refers to a process of reconstructing the original input data from coded data. For example, in the case of MPEG decoding, the process comprises reconstructing the input video frames based on the coded frames. In most cases, the reconstructed input video frames are not identical to the original input video frames due to information lost during the encoding/decoding process. In such cases, the reconstructed input video frames are approximations of the corresponding originals. The device used to carry out the process of decoding is generically referred to as a decoder. Many different decoder designs are conventionally available to perform MPEG decoding.
The input video frames used in MPEG encoding are typically composed of a collection of pixel values arranged in a row-column format. In most cases, each pixel comprises values for more than one information channel. For example, a pixel may comprise values for red, green, and blue (RGB) color channels. In other cases, the RGB color channels are equivalently expressed as luminance (Y) and chrominance (UV) components. The chrominance values are typically subsampled relative to the luminance values for purposes of bit reduction. For example, four blocks of luminance values may be combined with two equivalently sized blocks of chrominance values to form a single larger block called a “macroblock”. In general, a macroblock may comprise any number of chrominance or luminance blocks of any size. However, for illustrative purposes it will be assumed that a macroblock comprises four 8×8 luminance blocks arranged in a square, and an 8×8 red chrominance block and an 8×8 blue chrominance block subsampled at the middle of the four 8×8 luminance blocks.
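The bit reduction obtained from chrominance subsampling can be illustrated with a short calculation. The following is a minimal sketch in Python based only on the illustrative macroblock layout assumed above (four 8×8 luminance blocks and two 8×8 chrominance blocks); the variable names are hypothetical.

```python
# Sample counts for the illustrative macroblock described above.
BLOCK_SIZE = 8      # each block holds 8 x 8 values
LUMA_BLOCKS = 4     # four Y blocks arranged in a square (16 x 16 pixels)
CHROMA_BLOCKS = 2   # one red and one blue chrominance block

pixels_covered = LUMA_BLOCKS * BLOCK_SIZE * BLOCK_SIZE                # 256 pixels
macroblock_values = (LUMA_BLOCKS + CHROMA_BLOCKS) * BLOCK_SIZE ** 2   # 384 values
rgb_values = 3 * pixels_covered                                       # 768 values

# Ratio of stored values: subsampled macroblock versus full RGB.
print(macroblock_values / rgb_values)  # 0.5
```

Under these assumptions, the macroblock stores half as many values as the same 16×16 pixel area would require with three full-resolution channels.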
MPEG encoding is performed by first dividing the input video frames into three different types: I-frames, P-frames, and B-frames. I-frames are termed intra-coded frames because they are encoded without reference to other frames. P-frames and B-frames are termed inter-coded frames because they are encoded using information from other frames. More specifically, each P-frame is predicted based on a previous I-frame or P-frame, and each B-frame is predicted based on a previous and a next I-frame or P-frame.
Each I-frame in an input video sequence is encoded as a set of quantized discrete cosine transform (DCT) coefficients, while each P-frame and B-frame, on the other hand, is encoded as a set of motion vectors and a corresponding prediction error frame. The process of encoding the I-frames, P-frames, and B-frames will now be explained.
Each input video frame in an input video sequence is designated a priori as an I-frame, a P-frame, or a B-frame. One way to make this designation is to define a repeating sequence of frame types and to perform coding on the input video frames according to the repeating sequence. For example, suppose that the sequence is defined as I1, B2, B3, B4, P5, B6, B7, B8, P9, where “I1” denotes that the first frame in the sequence is an I-frame, “B2” denotes that the second frame in the sequence is a B-frame, and so forth. Accordingly, the first frame in the sequence of input video frames is designated as an I-frame, the second frame a B-frame, and so forth.
Since each P-frame in the sequence is coded with respect to the previous I-frame or P-frame, and each B-frame in the sequence is coded with respect to the previous and next I-frame or P-frame, the input video frames are generally encoded out of order. For example, the frames in the above sequence may be encoded in the order, I1, P5, B2, B3, B4, P9, B6, B7, B8 so that frames B2, B3, and B4 have access to both of frames I1 and P5, as needed for their encoding, and so that frames B6, B7, and B8 have access to frames P5 and P9. In sum, the input video frames are first designated as I-frames, B-frames, and P-frames and then reordered according to a corresponding predetermined sequence before encoding takes place. The coded frames are typically restored to their original order after they have been decoded.
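The reordering described above may be sketched as follows. This is a minimal illustration in Python, assuming that frames are identified by labels such as “I1” or “B2” and that each B-frame is simply held back until the next I-frame or P-frame has been emitted; the function name `encode_order` is hypothetical.

```python
def encode_order(display_order):
    """Reorder frame labels so each B-frame follows both of its reference frames."""
    out = []
    pending_b = []
    for label in display_order:
        if label.startswith("B"):
            # Hold the B-frame until the next I- or P-frame has been emitted.
            pending_b.append(label)
        else:
            # I- or P-frame: emit it, then the B-frames that depend on it.
            out.append(label)
            out.extend(pending_b)
            pending_b = []
    # Trailing B-frames with no following reference frame are emitted last.
    out.extend(pending_b)
    return out


display = ["I1", "B2", "B3", "B4", "P5", "B6", "B7", "B8", "P9"]
print(encode_order(display))
# ['I1', 'P5', 'B2', 'B3', 'B4', 'P9', 'B6', 'B7', 'B8']
```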
An I-frame is encoded using intra-frame DCT coding. Intra-frame DCT coding begins by dividing a frame into small blocks. Typically, each small block comprises an 8×8 block of 8-bit pixel values. Each small block is transformed into a DCT coefficient block using a discrete cosine transform. The DCT coefficient block typically holds the same number of values as the small block, but usually more bits are used to store each value. For example, an 8 pixel by 8 pixel block of 8-bit values may be transformed into an 8×8 DCT coefficient block of 11-bit values. Where a frame comprises pixel values for multiple information channels, small blocks for each channel are typically DCT coded separately.
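The growth in bits per value can be checked numerically. The following is a minimal sketch in Python, assuming the orthonormal scaling of the two-dimensional DCT-II (the scaling actually used by a given encoder is not specified here); the function name `dct_2d` is hypothetical. It is written for clarity rather than speed.

```python
import math

N = 8  # block size assumed in the text


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized form)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


# A flat block of the largest 8-bit value produces only a DC coefficient.
flat = [[255] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0]))  # 2040
```

Under this scaling, the DC coefficient of a flat block of value 255 is 8 × 255 = 2040, which requires 11 bits, consistent with the example above.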
Following intra-frame DCT coding, values stored in each DCT coefficient block are quantized by dividing the values by some amount (usually a multiple of 2) and truncating the result. This usually results in a loss of some information contained in the original I-frame; however, measures are taken to ensure that the loss of information does not significantly impair the resulting image quality for the I-frame. For example, DCT coefficients corresponding to higher frequency image components are typically quantized to a greater degree than those corresponding to lower frequency image components because the human eye is less sensitive to detail near the edges of objects than other portions of an image.
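Quantization and its inverse may be sketched as follows. This is a minimal illustration in Python on a 2×2 block for brevity; the coefficient values and step sizes are hypothetical, chosen only so that the higher frequency positions use larger step sizes.

```python
def quantize(coeffs, steps):
    """Divide each coefficient by its step size and truncate toward zero."""
    return [[int(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, steps)]


def dequantize(levels, steps):
    """Multiply each quantized level by its step size (decoder side)."""
    return [[level * q for level, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, steps)]


coeffs = [[1024, 37], [-21, 9]]   # hypothetical DCT coefficients
steps = [[8, 16], [16, 32]]       # hypothetical step sizes, coarser at high frequency

levels = quantize(coeffs, steps)
restored = dequantize(levels, steps)
print(levels)    # [[128, 2], [-1, 0]]
print(restored)  # [[1024, 32], [-16, 0]]
```

The restored values differ from the original coefficients wherever truncation discarded a remainder, which is the information loss referred to above.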
Finally, after undergoing quantization, each DCT coefficient block is serialized and encoded using variable length coding (VLC). Serialization is performed by reading the values in the DCT coefficient block in a series using a zigzag pattern starting with the direct current (DC) component and continuing from coefficients representing low-frequency image components to coefficients representing higher-frequency image components. For example, coefficients in the matrix
    1  2  3
    4  5  6
    7  8  9
would typically be read out in the order 1, 2, 4, 7, 5, 3, 6, 8, 9.
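The zigzag read-out may be sketched as follows. This is a minimal illustration in Python for a square matrix of any size; the function name `zigzag` is hypothetical.

```python
def zigzag(matrix):
    """Read a square matrix along its anti-diagonals in alternating directions."""
    n = len(matrix)
    out = []
    for s in range(2 * n - 1):
        # Cells on anti-diagonal s satisfy row + col == s.
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            # Even diagonals are traversed from bottom-left to top-right.
            rows = reversed(rows)
        for r in rows:
            out.append(matrix[r][s - r])
    return out


print(zigzag([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# [1, 2, 4, 7, 5, 3, 6, 8, 9]
```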
Variable length coding is performed by grouping together runs of zeros followed by a non-zero value. For example, suppose that the following series is read from the DCT coefficient block using the zigzag pattern: 3, 1, 0, 0, 5, 2, 0, 0, 0. The values are arranged into groups as follows: (3), (1), (0, 0, 5), (2), EOB, where the label EOB stands for “end of block”, and it indicates that the remaining entries in the sequence are all zero.
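The grouping step may be sketched as follows. This is a minimal illustration in Python, assuming that each group is represented as a tuple and that the end-of-block label is represented by the string "EOB"; the function name `group_runs` is hypothetical.

```python
def group_runs(values):
    """Group each run of zeros with the non-zero value that follows it."""
    # Trailing zeros are not grouped; they are covered by the EOB label.
    end = len(values)
    while end > 0 and values[end - 1] == 0:
        end -= 1

    groups = []
    run = []
    for v in values[:end]:
        run.append(v)
        if v != 0:
            groups.append(tuple(run))
            run = []
    groups.append("EOB")
    return groups


print(group_runs([3, 1, 0, 0, 5, 2, 0, 0, 0]))
# [(3,), (1,), (0, 0, 5), (2,), 'EOB']
```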
Once the values are arranged into groups, each group is then substituted with a unique code word from a VLC look-up table. The VLC look-up table has the property that no code word in the table is a prefix for any other code word in the table. Hence, a series of code words generated according to the VLC look-up table can be arranged as a bitstream while still allowing a decoder to determine the beginning (start) and end (finish) of each code word within the bitstream. To illustrate the conversion of the above series into a bitstream, the following look-up table will be used as a simple example. Let the group “(3)” be represented by the code word “000”, let the group “(1)” be represented by the code word “111”, let the group “(0, 0, 5)” be represented by the code word “101”, let the group “(2)” be represented by the code word “110”, and let the label EOB be represented by the code word “01”. Accordingly, the values in the series can be coded by the bitstream “00011110111001”.
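The substitution of code words, and the way the prefix property allows a decoder to find code word boundaries, may be sketched as follows. This is a minimal illustration in Python using only the five-entry example table given above; the names `TABLE`, `vlc_encode`, and `vlc_decode` are hypothetical.

```python
# Example look-up table from the text; no code word is a prefix of another.
TABLE = {
    (3,): "000",
    (1,): "111",
    (0, 0, 5): "101",
    (2,): "110",
    "EOB": "01",
}


def vlc_encode(groups):
    """Concatenate the code word for each group into one bitstream."""
    return "".join(TABLE[g] for g in groups)


def vlc_decode(bits):
    """Recover the groups by matching code words bit by bit."""
    inverse = {code: group for group, code in TABLE.items()}
    groups = []
    word = ""
    for b in bits:
        word += b
        # Because of the prefix property, the first match is the only match.
        if word in inverse:
            groups.append(inverse[word])
            word = ""
    if word:
        raise ValueError("bitstream ends in the middle of a code word")
    return groups


groups = [(3,), (1,), (0, 0, 5), (2,), "EOB"]
bitstream = vlc_encode(groups)
print(bitstream)  # 00011110111001
```

Decoding the bitstream with `vlc_decode` returns the original list of groups without any separator between code words.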
A P-frame is encoded by performing motion estimation on the frame relative to a reference frame in order to generate a set of motion vectors. For P-frames, the reference frame is the previous I-frame or P-frame in the input video sequence and each motion vector denotes estimated motion of a macroblock between the reference frame and the P-frame. For example, a motion vector defines a relative shift between a macroblock in the P-frame and the “best match” for the block in the reference frame.
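The search for a “best match” may be sketched as follows. This is a minimal illustration in Python, assuming an exhaustive search over a small window and assuming the sum of absolute differences (SAD) as the matching cost; neither the search strategy nor the cost function is specified above, and the function names and test frames are hypothetical.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(size) for j in range(size))


def best_motion_vector(cur, ref, cy, cx, size, search):
    """Return (dy, dx, cost) of the best-matching reference block."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            # Skip candidate blocks that fall outside the reference frame.
            if 0 <= ry <= h - size and 0 <= rx <= w - size:
                cost = sad(cur, ref, cy, cx, ry, rx, size)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best[1], best[2], best[0]


def frame_with_patch(top, left):
    """Build a 6 x 6 frame of zeros containing one 2 x 2 patch."""
    frame = [[0] * 6 for _ in range(6)]
    patch = [[10, 20], [30, 40]]
    for i in range(2):
        for j in range(2):
            frame[top + i][left + j] = patch[i][j]
    return frame


ref = frame_with_patch(1, 1)   # patch location in the reference frame
cur = frame_with_patch(2, 3)   # same patch, moved down 1 and right 2
print(best_motion_vector(cur, ref, 2, 3, 2, 2))  # (-1, -2, 0)
```

The returned vector points from the block in the current frame back to its match in the reference frame, with a matching cost of zero in this contrived case.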
The motion vectors are applied to the reference frame to generate a frame “V”, which is an approximation of the P-frame. The motion vectors are applied to the reference frame by shifting each macroblock in the reference frame by an amount indicated by one of the motion vectors. Frame “V” is then subtracted from the P-frame to generate a prediction error frame “E”, and frame “E” is stored along with the motion vectors in order to eventually reconstruct the P-frame.
In reconstructing a frame based on the motion vectors and frame “E”, the motion vectors are applied to the reference frame to generate frame “V” and then frame “E” is added to frame “V” to generate an approximation of the original P-frame. Because frame “E” is used to compensate for error in frame “V”, frame “E” is often referred to as “motion compensation error”. Accordingly, encoding techniques that rely on generating motion vectors and motion compensation error as described above are often referred to as “motion compensated inter-frame coding”.
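The relationship among the reference frame, frame “V”, and frame “E” may be sketched as follows. This is a minimal illustration in Python, assuming 2×2 blocks and assuming that each block of frame “V” is copied from the reference frame at the position displaced by that block's motion vector; the frame contents, the vectors, and the function names are hypothetical. Frame “E” is kept uncoded here, so the reconstruction is exact; in practice frame “E” is itself coded with loss and the result is only an approximation.

```python
def predict(ref, vectors, size):
    """Build frame V by copying displaced blocks from the reference frame."""
    h, w = len(ref), len(ref[0])
    v = [[0] * w for _ in range(h)]
    for (by, bx), (dy, dx) in vectors.items():
        for i in range(size):
            for j in range(size):
                v[by + i][bx + j] = ref[by + dy + i][bx + dx + j]
    return v


def subtract(a, b):
    """Element-wise difference of two frames of equal size."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise sum of two frames of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


ref = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
p_frame = [[1, 2, 1, 3],
           [5, 6, 5, 7]]
# Block position (row, col) -> motion vector (dy, dx) into the reference frame.
vectors = {(0, 0): (0, 0), (0, 2): (0, -2)}

v_frame = predict(ref, vectors, 2)
e_frame = subtract(p_frame, v_frame)      # prediction error, stored by the encoder
reconstructed = add(v_frame, e_frame)     # decoder side

print(e_frame)  # [[0, 0, 0, 1], [0, 0, 0, 1]]
```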
Frame “E” is generally encoded using intra-frame DCT coding, quantization, and VLC. This tends to significantly reduce the number of bits needed to represent frame “E”, especially in cases where frame “V” is very similar to the P-frame, i.e., where the prediction error is small. In these cases, the quantized DCT coefficient blocks corresponding to frame “E” tend to contain large numbers of zeros. As a result, VLC will generally achieve efficient compression for the DCT coefficient blocks.
A B-frame is encoded in a manner very similar to a P-frame. However, motion estimation for a B-frame is performed relative to two reference frames instead of one. The reference frames are the previous and next I-frame or P-frame in the input video sequence and the motion estimation generates motion vectors which are typically averages based on motion estimation performed relative to the two reference frames.
It should be particularly noted that the motion estimation is generally not performed on the original input video frames, but instead it is performed using previously encoded and decoded I-frames, P-frames, and B-frames. In other words, before motion estimation is performed, the input video frames are pre-processed by intra-frame DCT coding and quantization, followed by inverse quantization and inverse DCT coding. This is done so that the frame estimation based on the motion vectors can be repeated in a decoder. Since intra-frame DCT coding and quantization cause the input video frames to lose information, performing motion estimation on the input video frames would lead to unpredictable results in the decoding process. Since MPEG encoding requires motion estimation to be performed on previously encoded/decoded frames, most MPEG encoders include a local decoder used to produce these frames.
It should also be noted that where the motion compensation error for a particular macroblock in a P-frame or B-frame is extremely large, intra-frame DCT coding may be used to encode the macroblock instead of motion compensated inter-frame coding. This prevents drastic changes in the input video sequence from causing poor encoding of the sequence.
The result of MPEG encoding is a compressed bitstream (i.e., compressed image data) that can either be stored in memory or transmitted to a decoder. The bitstream generally includes any VLC coded DCT coefficients and motion vectors corresponding to each frame as well as some additional information used for decoding the frames. The additional information typically includes the type of each frame, the quantization values used for the DCT coefficients, and so forth.
The decoding procedures used for each of the different frame types are generally inverses of the procedures used to encode the frames. For example, an I-frame is decoded by decoding the VLC encoded DCT coefficients using the look-up table, multiplying the resulting DCT coefficients by the quantization values, and then inverse DCT transforming the DCT coefficients to yield a set of pixel values for an output video frame.
Similar inverse procedures are performed on P-frames and B-frames to produce output video frames corresponding to the input video frames. In addition, the P-frames and B-frames are decoded using the motion vectors and the motion compensation error as described above.
Once the decoding procedure is completed, the output video frames are reordered into their original order based on the input video frames.
For simplicity of explanation, several details have been omitted from the explanation of MPEG encoding and decoding. In addition, specific details of various MPEG standards, including MPEG-1, MPEG-2, MPEG-4, and MPEG-7 were also omitted. However, MPEG encoding and decoding are well known procedures and hence the omitted details are available from other sources.
Real-time MPEG encoding and decoding generally requires at least enough bandwidth to achieve a frame rate of several frames per second. Accordingly, each of the several frames is read from an input device and written to memory. Then, blocks within each frame are successively transferred back and forth between memory and an image compression module used for DCT coding, quantization, motion estimation and so forth. These operations can easily consume the available bandwidth of conventional SOC architectures, which usually rely on slower, high density memories such as dynamic random access memory (DRAM) or Flash. The slower, high density memories are used in the SOC architectures because they are cheaper, they take up less space, and they have larger capacities than faster low density memories such as static random access memory (SRAM).
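The scale of the memory traffic involved can be illustrated with a rough calculation. The following is a minimal sketch in Python; the frame size, frame rate, bits per pixel, and number of memory passes per frame are all hypothetical values chosen for illustration and are not taken from the description above.

```python
def frame_bytes(width, height, bits_per_pixel=12):
    """Size of one frame; 12 bits per pixel corresponds to subsampled chrominance."""
    return width * height * bits_per_pixel // 8


def memory_traffic(width, height, fps, passes):
    """Bytes per second moved over the bus if each frame crosses it 'passes' times."""
    return frame_bytes(width, height) * fps * passes


# Hypothetical example: 640 x 480 frames at 30 frames per second,
# with each frame read or written four times during coding.
print(frame_bytes(640, 480))               # 460800
print(memory_traffic(640, 480, 30, 4))     # 55296000
```

Under these assumed figures, coding alone would account for roughly 55 megabytes per second of bus traffic, before any other module is considered.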
FIGS. 1 and 2 are block diagrams illustrating conventional SOC architectures. FIG. 1 shows a conventional single layer SOC bus architecture and FIG. 2 shows a conventional multi-layer SOC bus architecture.
In the conventional single layer SOC bus architecture shown in FIG. 1, a plurality of modules 10 through 80 are connected to a single system bus 12. The term “module” is used here to refer to a particular functional entity within an electronic device. For example, a module may be read to encompass a set of software routines, a particular hardware (e.g., circuit) configuration, and/or some combination thereof. The term “module” may also refer to a collection of functional entities, i.e., multiple modules, or even sub-elements within a module.
Referring to FIG. 1, module 10 comprises a reduced instruction set computer (RISC), module 20 comprises a camera interface, module 30 comprises a moving image compression module, module 40 comprises a still image compression module, module 50 comprises a graphics acceleration module, and module 60 comprises a transfer module adapted to transfer image data to a liquid crystal display (LCD) device. Module 70 comprises a memory controller and module 80 comprises a high density (e.g., DRAM) memory.
The SOC bus architecture shown in FIG. 1 is perhaps the most commonly used SOC bus architecture, at least in part because of its low cost and ease of implementation. However, because the bandwidth demand placed upon bus 12 is determined by summing the bandwidth demand for each connected module, the total available bandwidth may be consumed by the needs of only a small number of the connected modules. In particular, the total available system bus bandwidth may easily be consumed (or exceeded) by the bandwidth demands of the moving image compression module and the camera interface when incoming video is being processed.
As between the exemplary elements shown in FIGS. 1 and 2, and generally throughout the description that follows, like reference numerals indicate like or similar elements. Thus, in the multi-layer SOC bus architecture of FIG. 2, module 10 is connected to a first bus 12-1, modules 20, 30, and 40 are connected to a second bus 12-2, and modules 50 and 60 are connected to a third bus 12-3. The first, second, and third buses are respectively connected to three memory controllers 70-1, 70-2, and 70-3, and the three memory controllers are respectively connected to three high density memories 80-1, 80-2, and 80-3.
By using multiple (e.g., three) layers, the SOC bus architecture shown in FIG. 2 provides more available bandwidth than the single system bus architecture shown in FIG. 1. That is, the total available bandwidth for the system shown in FIG. 2 is the sum of the available bandwidth in each bus layer. By providing more bandwidth, the SOC bus architecture of FIG. 2 is able to effectively support real-time multimedia processing. Unfortunately, however, the multi-layer bus system is expensive and difficult to manufacture. As a result, this system type is not well suited to the manufacture of commercial products where low cost and ease of implementation are important. In addition, the performance improvement gained by using the multi-layer bus architecture of FIG. 2 may, nonetheless, be limited by the access speed of high density memory 80-2, for example, which may be insufficient to accommodate the bandwidth requirements of moving image compression module 30.
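The difference between the single layer and multi-layer arrangements may be illustrated with a simple bandwidth budget. The following is a minimal sketch in Python; the per-module demand figures and the per-bus capacity are hypothetical numbers in arbitrary units, and only the grouping of modules onto buses follows FIGS. 1 and 2.

```python
def saturated(demands, capacity):
    """A bus is saturated when the summed demand of its modules exceeds capacity."""
    return sum(demands.values()) > capacity


# Hypothetical bandwidth demand per module, in arbitrary units.
demand = {
    "RISC 10": 20,
    "camera interface 20": 30,
    "moving image compression 30": 50,
    "still image compression 40": 10,
    "graphics acceleration 50": 25,
    "LCD transfer 60": 15,
}
BUS_CAPACITY = 100  # hypothetical capacity of each bus, same units

# FIG. 1: every module shares one bus.
single_layer = saturated(demand, BUS_CAPACITY)

# FIG. 2: modules are divided among three buses.
layers = [
    ["RISC 10"],
    ["camera interface 20", "moving image compression 30", "still image compression 40"],
    ["graphics acceleration 50", "LCD transfer 60"],
]
multi_layer = [saturated({m: demand[m] for m in layer}, BUS_CAPACITY)
               for layer in layers]

print(single_layer)  # True
print(multi_layer)   # [False, False, False]
```

With these assumed figures the shared bus is oversubscribed while no individual layer is, although the second layer carries most of the load, which is consistent with the observation above that memory 80-2 may still limit performance.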
FIG. 3 is a block diagram showing a conventional bus architecture for a non-SOC computer system. Such a system is disclosed, for example, in U.S. Pat. No. 5,784,592.
Referring to FIG. 3, the computer system enables high-performance multimedia processing by placing a multimedia memory 160 between a standard local bus 120 and a real-time multimedia bus 130B. Multimedia memory 160 provides storage for multimedia devices 142B, 144B, and 146B so that they can process multimedia information without having to contend for access to standard local bus 120.
The operation of the multimedia memory and the multimedia devices is controlled by a central processing unit (CPU) 102 through a chipset 106B. The CPU transfers multimedia data from a main memory 110 to the multimedia memory and sends control signals to the multimedia memory and multimedia devices indicating when to start or stop certain multimedia processing functions and when to send data through bus 120.
The computer system shown in FIG. 3 has at least two limitations. One limitation is the need to fetch the multimedia data from the main memory to the multimedia memory. This adds significant overhead to multimedia processing procedures where the procedures perform fetch operations on a frequent basis. The other limitation, which is related to the first, is the size of the multimedia memory. The multimedia memory is designed to store large amounts of data including code and overflow data from the main memory in addition to the multimedia data. Although the large size of the multimedia memory may help reduce the frequency with which multimedia data is fetched from the main memory, it makes it difficult to implement the suggested system architecture in a small area, such as in a SOC.
Due to at least the above-described limitations apparent in conventional systems, an improved SOC bus architecture is needed for multimedia processing applications.