1. Field of the Invention
This invention relates to digital signal processing (DSP), and more particularly to an extension unit added to a microprocessor for high speed multimedia applications. The extension unit includes an operand routing unit which aligns multiple operands upon an arithmetic logic unit (ALU) in response to specific multimedia-type instructions. Proper ordered arrangement of operands at the ALU enhances the throughput of many image compression algorithms which rely upon repetitive, sequential operations.
2. Description of the Relevant Art
It is well known that conventional computers communicate information primarily through a graphical user interface (GUI). The GUI involves manipulation of complex graphical images, as either still graphic images or full motion video. Current software has spawned numerous multimedia applications which require administering still images or video via the GUI.
Processing still images or video consumes prodigious amounts of storage space within the computer. For example, a 256 color VGA screen image comprises numerous rows of pixels, each pixel consuming a single byte of storage. A partial screen containing 200 rows of 320 pixels therefore consumes a minimum of 64,000 bytes (approximately 64K bytes) of storage. Real time processing of still images (and especially video) thereby requires that the amount of data be reduced. The task of reducing the amount of data necessary to store or transmit one or more digital images is often referred to as "image compression".
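The storage figure above follows from simple arithmetic, sketched here in Python. The function name `frame_bytes` is hypothetical and is used only for illustration.

```python
def frame_bytes(rows, pixels_per_row, bytes_per_pixel=1):
    """Uncompressed size of one image in bytes."""
    return rows * pixels_per_row * bytes_per_pixel

# A 256 color image needs one byte per pixel.
partial_screen = frame_bytes(rows=200, pixels_per_row=320)
print(partial_screen)  # 64000
```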
Image compression can be classified as either lossy or lossless. If the reconstructed image is not identical to the original image, the compression is said to be lossy; otherwise it is lossless. Lossy compression is used where the reconstructed image, while not identical to the original image, nonetheless conveys the essential features of the image. Minor changes may not be perceptible to a human observer, or may not be objectionable for a particular application. Lossy compression can therefore reduce the amount of data further than lossless compression, yet without perceptible defects.
FIG. 1 illustrates a conventional lossy image compression system 10. System 10 is shown applicable to image (i.e., still image or full motion image) compression and decompression. An original image is compressed by an image encoder 12, and the encoded output may be further processed in block 14 using, for example, error correction, encryption, multiplexing, etc. The compressed image can be stored or sent through a communications channel. If forwarded through a communications channel, the compressed data is modulated upon a carrier signal by modulator 16. The data-modulated carrier signal is then forwarded to a decoder via channel 18. If the data is transmitted and requires demodulation, block 20 is used to extract the compressed image which can then be further processed as needed by block 22. Block 22 is used to perform, for example, decryption, demultiplexing, etc. Decoder 24 receives the compressed image having redundant or irrelevant data removed, and thereafter produces a reconstructed image perceptibly similar to the original image.
FIG. 2 illustrates, in further detail, an image encoder 12 used for compressing an image as either a still image or a sequence of images (i.e., full motion video). Upon receiving the image in either RGB or YCrCb format, encoder 12 encodes certain "frames" of a plurality of frames within the sequence of motion images or still images. Frames within a video sequence can be compressed using numerous compression standards, a popular one being the Moving Picture Experts Group (MPEG) standard. MPEG compression involves discerning intracoded frames from non-intracoded frames. An intracoded frame, often called an I-frame, is compressed relative to itself, while non-intracoded frames, often called P-frames and B-frames, are encoded by exploiting temporal redundancy as well as spatial redundancy to reduce the number of bits required for encoding.
Encoding and decoding video presents many challenges to realizing an efficient MPEG compression standard. The intracoded frames are stored, generally in a moderately compressed format. Successive non-intracoded frames are compared with the intracoded frames and the differences are stored. Periodically, such as when a new scene is displayed, a new intracoded frame is stored, and subsequent comparisons begin from this new reference point.
Video compression standards such as MPEG, DVI and Indeo all use the intracoded frame technique. Many compression standards such as MPEG treat various frames within the frame sequence as still images and apply still image compression to those frames. A popular still image compression standard is Joint Photographic Experts Group (JPEG). Encoder 12 illustrates numerous blocks used in MPEG video compression, a portion of which are pertinent to, e.g., JPEG. The JPEG portion of encoder 12 is shown within dashed area 26. Functional blocks within dashed area 26 serve to compress pixel data within blocks of each macro block arising from the original frame or image. The compressed digital data is then forwarded into an embedded decoder 30. Embedded decoder 30 is used in a feedback arrangement, wherein the output of decoder 30 is subtracted from the original frame. Subtraction is shown at block 32, and the output from functional blocks 26 is shown fed into a buffer 34 for subsequent output as compressed intracoded and non-intracoded frames.
In order to avoid having to store or transmit large amounts of information on each pixel within each frame, MPEG reduces the data to that which is pertinent only to intracoded and non-intracoded frames. As seen in the feedback arrangement of FIG. 2, data manipulation must be performed as rapidly as possible on each macro block or frame, preferably in real time. Substantial data reduction (lossy compression) is needed on frames of interest and generally occurs in JPEG blocks 26 and, more specifically, during quantization.
JPEG generally employs three stages of compression. A first stage utilizes a discrete cosine transform (DCT) function 36. DCT is a class of mathematical operations which take a signal and transform it from one type of representation to another. Specifically, DCT converts an array of numbers, which represent signal amplitudes at various points in time and space, into another array of numbers, each of which represents the amplitude of a certain frequency component of the original signal. The resulting array of numbers contains the same number of values as the original array. Using a JPEG format, the DCT is performed on a block of 8×8 picture elements (or "pels") taken from an original image.
Output from DCT 36 is fed to a quantizer 38. Quantization 38 is the lossy stage of data compression: it reduces the number of bits needed to store each value by storing an integer of lessened precision. A quantization matrix, chosen by a code word, reduces the matrix values output from DCT to the indices for the code words. Upon decode, the images are reconstructed using a table look-up procedure, given the code word selected by the quantization algorithm. The International Organization for Standardization (ISO) maintains the quantization code words used by implementers of JPEG code. The quantized matrix can be coded in block 40 using several methods. For example, the quantized values of each block can be arranged in a zig-zag sequence. The zig-zag sequence is then coded using run-length encoding (RLE) followed by entropy coding (which includes the popular Huffman code).
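The quantize, zig-zag and run-length steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the JPEG reference procedure: the uniform step size of 16, the sample coefficients, the `(zeros_before, value)` pair format and the `"EOB"` end-of-block marker are all chosen here for illustration, and entropy (Huffman) coding is omitted.

```python
def quantize(block, steps):
    """Divide each DCT coefficient by its step size and round to an integer."""
    return [[int(round(block[r][c] / steps[r][c])) for c in range(8)]
            for r in range(8)]

def zigzag(block):
    """Read an 8x8 block along anti-diagonals, alternating direction."""
    order = sorted(
        ((r, c) for r in range(8) for c in range(8)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return [block[r][c] for r, c in order]

def run_length(seq):
    """Encode as (zeros_before, value) pairs; trailing zeros become 'EOB'."""
    out, run = [], 0
    for value in seq:
        if value == 0:
            run += 1
        else:
            out.append((run, value))
            run = 0
    out.append("EOB")
    return out

# Hypothetical DCT output with energy in the low-frequency corner.
dct_out = [[0.0] * 8 for _ in range(8)]
dct_out[0][0], dct_out[0][1] = 240.0, -33.0
dct_out[1][0], dct_out[1][1] = 18.0, 50.0
steps = [[16] * 8 for _ in range(8)]  # uniform step size (assumption)

symbols = run_length(zigzag(quantize(dct_out, steps)))
print(symbols)  # [(0, 15), (0, -2), (0, 1), (1, 3), 'EOB']
```

Because quantization drives most high-frequency coefficients to zero, the zig-zag scan groups those zeros at the end of the sequence, where a single end-of-block symbol replaces them.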
Code output from block 40 is a variable length code which generally represents smaller decimal numbers, and can be represented with a correspondingly smaller number of bits depending upon the decimal value. An advantage of using smaller number variable length coding is carried forth within the intracoded and non-intracoded sequence of frames, or more particularly within each macro block of a frame. Accordingly, MPEG involves JPEG-type compression on each selected frame macro block, coupled with frame-by-frame compression using motion estimation, motion compensation and frame classification. Motion estimation, motion compensation and frame classification are relevant only to the decoded pertinent frames which are produced as part of the feedback loop by inverse quantization 42 and inverse DCT 44. After undergoing inverse quantization and inverse DCT, the resulting frames are stored in reference memory 46 where they can thereafter be drawn together and placed within motion estimation block 48. Motion estimation block 48, in combination with intracoded and non-intracoded (i.e., intra/inter) frame classifier block 50, forms the motion estimation/compensation portion of MPEG. Motion compensation is defined as a process of compensating for displacement of moving objects from one frame to another, and motion estimation is the process of estimating the location of corresponding pels within the frames. For each block in the current P-frame, the block in the reference frame (i.e., I-frame) which matches it best is identified by a motion vector. The differences, computed by subtraction block 32, between the pixel values in the matching block in the reference frame and the current block in the current frame are then transformed, quantized and coded by blocks 26.
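The block-matching step described above can be sketched as an exhaustive search. The text does not name a matching criterion; the sum of absolute differences (SAD), the 4×4 block size, the ±2 pel search range and all function names below are assumptions made for illustration.

```python
def sad(cur, ref, cy, cx, ry, rx, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))

def best_motion_vector(cur, ref, cy, cx, n=4, search=2):
    """Search ref around (cy, cx) for the block that best matches cur."""
    height, width = len(ref), len(ref[0])
    best_cost, best_vec = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry <= height - n and 0 <= rx <= width - n:
                cost = sad(cur, ref, cy, cx, ry, rx, n)
                if best_cost is None or cost < best_cost:
                    best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost

# Two 8x8 frames holding the same 2x2 object at different positions.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for i in range(2):
    for j in range(2):
        ref[4 + i][2 + j] = 200  # object in the reference frame
        cur[3 + i][3 + j] = 200  # same object, one row up, one column right

vector, cost = best_motion_vector(cur, ref, cy=2, cx=2)
dy, dx = vector
# Residual that would be passed on to the transform, quantize and code stages.
residual = [[cur[2 + i][2 + j] - ref[2 + dy + i][2 + dx + j]
             for j in range(4)] for i in range(4)]
print(vector, cost)  # (1, -1) 0
```

Here the motion vector points from the current block to its best match in the reference frame, and a perfect match leaves an all-zero residual, which costs very few bits after quantization and coding.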
Blocks 26 used for JPEG functionality, and the various blocks 42-50 used for MPEG decoding, feedback, motion estimation/compensation, and frame classification are generally well documented in the field of image compression. References to many of the blocks shown in FIG. 2 are set forth in numerous disclosures, an exemplary disclosure being Bhaskaran, et al. "Image Compression Standards and Architectures", ACM Multimedia 94, October, 1994, (herein incorporated by reference).
Transformation of a picture element to a DCT output, as well as quantization and coding of that output, requires algorithms unique to multimedia applications. Performing decoding (inverse quantization and inverse DCT) as well as motion estimation and compensation also requires operation-intensive algorithms. Those operations can generally be classified as add, multiply, subtract, shift and accumulate operations, each of which must be performed as quickly as possible in order to make JPEG and MPEG viable compression standards. Dedicated digital signal processors (DSPs) are generally used to carry out those operations in an expeditious manner. DSPs are often included within multimedia devices such as sound cards, speech recognition cards, video capture cards, etc. DSPs function as coprocessors, performing complex and repetitive mathematical computations demanded by the data compression algorithms. DSPs perform specific multimedia-type algorithms more efficiently than general purpose microprocessors.
There are numerous types of DSPs which can perform JPEG and MPEG data compression. For example, the Hewlett Packard Corp. PA-7100LC microprocessor functions not only as a general purpose processor, but also as a DSP with generic multimedia-type instructions added to increase data compression throughput. Compression throughput of the PA-7100LC is primarily limited by the execution time involved in performing DCT or inverse DCT (IDCT). See, e.g., Lee, "Realtime MPEG Video via Software Decompression on a PA-RISC Processor", IEEE, 1995, pp. 186-192 (herein incorporated by reference). Sun Microsystems, Inc. has also devised a multimedia-type instruction set labeled Visual Instruction Set (VIS) which is designed to run on the UltraSPARC™ processor. See, e.g., Kohn, et al., "The Visual Instruction Set (VIS) in UltraSPARC™", IEEE, 1995, pp. 462-469 (herein incorporated by reference); and, Chang-Guo Zhou, "MPEG Video Decoding With The Ultrasparc™ Visual Instruction Set", IEEE, 1995, pp. 470-474 (herein incorporated by reference). Similar to the dedicated multimedia instruction set used by the PA-7100LC, maximum efficiency of the VIS instruction set is limited to particular multimedia applications. For example, the optimized instruction set may be efficient in performing fast Fourier transforms (FFT), motion estimation or Huffman encoding, but may be lacking in other areas, such as the critical operation-intensive IDCT area. Further, while current multimedia instructions offer a fixed performance increase as to existing algorithms, they unfortunately do not always provide scalability to different types of algorithms or to specific algorithms which change over time. As new standards for JPEG, MPEG, DVI, Indeo and H.320 arrive, new algorithms may be needed, and scalability to those operations is critical in achieving viable, real-time compression.
DCT and IDCT form a substantial part of an encode and/or decode algorithm, and certainly contribute numerous operations to data compression. As shown in FIG. 2, DCT and IDCT comprise prevalent portions of an encoder. For an 8×8 block of pixel elements, the DCT transform is generally represented as follows: ##EQU1##
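The equation referenced above is not reproduced in this text. For orientation, the 8×8 forward DCT as commonly defined for JPEG is given below; the symbols f(x,y) for pixel values, F(u,v) for coefficients and C(k) for the normalization factor are assumptions, and the notation of the original equation 1 may differ.

```latex
F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7}
         f(x,y)\,
         \cos\!\left[\frac{(2x+1)\,u\,\pi}{16}\right]
         \cos\!\left[\frac{(2y+1)\,v\,\pi}{16}\right],
\qquad
C(k) =
\begin{cases}
  1/\sqrt{2}, & k = 0 \\
  1,          & k \neq 0
\end{cases}
```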
Equation 1 indicates numerous multiply, add (or subtract), shift, and accumulate operations needed to carry out DCT. According to the article by Bhaskaran, several thousand multiply and add operations are necessary to perform the operations in equation 1. While faster algorithms reduce the operation count, the number of operations still remains daunting when performed on conventional DSPs. Even DSPs which have specialized multiply, add/subtract and accumulate multimedia-type instruction sets still require numerous instruction cycles in order to complete DCT on a matrix of numbers.
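The operation count can be made concrete with a direct evaluation of the 8×8 DCT. The sketch below assumes the standard JPEG form of the transform (normalization 1/4·C(u)·C(v), with C(0) = 1/√2) and counts one multiply-accumulate per input pixel per output coefficient; the names `COS`, `scale` and `dct_8x8_direct` are hypothetical.

```python
import math

N = 8
# COS[x][u] = cos((2x + 1) * u * pi / 16), shared by rows and columns.
COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for u in range(N)]
       for x in range(N)]

def scale(k):
    """Normalization factor C(k): 1/sqrt(2) for the DC index, else 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct_8x8_direct(block):
    """Return (coefficients, multiply_accumulate_count) for an 8x8 block."""
    coeffs = [[0.0] * N for _ in range(N)]
    macs = 0
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += block[x][y] * COS[x][u] * COS[y][v]
                    macs += 1  # one multiply-accumulate per pixel per coefficient
            coeffs[u][v] = 0.25 * scale(u) * scale(v) * acc
    return coeffs, macs

flat = [[8] * N for _ in range(N)]  # a uniform block has only a DC component
coeffs, macs = dct_8x8_direct(flat)
print(round(coeffs[0][0], 6), macs)
```

Each of the 64 coefficients sums over all 64 pixels, so direct evaluation takes 64 × 64 = 4096 multiply-accumulate steps per block, which is consistent with the "several thousand" figure cited above.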
IDCT is carried out not only in an embedded decoder 30 of encoder 12 (shown in FIG. 2), but also in the decoder 24 shown in FIG. 3 at the receiving end of a storage unit or channel. Decoder 24 is shown for illustrative purposes as an MPEG decoder, comprising functional blocks 56-66 which essentially reverse the steps taken by an MPEG compression encoder. Decoder 24 decodes the MPEG header, which provides information regarding the block, macro block, and frame or sequence of frames which follow the header. The variable length encoded pels which follow the header are decoded into fixed length numbers by variable length decoding block 56. A reverse order scan of blocks and macro blocks across the frame, and from frame-to-frame, is performed at block 58. Next, inverse quantization 60 is applied to the inverse scanned numbers to restore them to the original range. Then, an IDCT computation 62 is performed on the blocks in each frame. IDCT converts the frequency domain back to the original spatial domain, and provides the actual pixel values for I-blocks, but only the differences for each pixel for P-blocks and B-blocks. Next, motion compensation is performed for P-blocks and B-blocks. The differences calculated in the IDCT computation are added to the pixels in the reference block as determined by the motion vector, for P-blocks, and to the average of the forward and backward reference blocks, for B-blocks. Motion compensation is shown by reference numeral 64. Memory 66 is periodically updated at each frame within a plurality of frames which represent a reconstructed image.
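The reconstruction of P-blocks and B-blocks described above can be sketched as follows. The function names, the clamping of results to the 8-bit range 0-255 and the round-half-up averaging are assumptions made for illustration; the text specifies only that differences are added to the reference block (P) or to the average of the forward and backward reference blocks (B).

```python
def clamp(value):
    """Limit a reconstructed pel to the 8-bit range (assumption)."""
    return max(0, min(255, value))

def reconstruct_p(diff, ref):
    """P-block: add IDCT differences to the motion-compensated reference."""
    return [[clamp(r + d) for r, d in zip(ref_row, diff_row)]
            for ref_row, diff_row in zip(ref, diff)]

def reconstruct_b(diff, fwd, bwd):
    """B-block: add differences to the average of both reference blocks."""
    return [[clamp((f + b + 1) // 2 + d)
             for f, b, d in zip(fwd_row, bwd_row, diff_row)]
            for fwd_row, bwd_row, diff_row in zip(fwd, bwd, diff)]

# 2x2 blocks keep the example short; real blocks are 8x8.
fwd = [[100, 102], [104, 106]]
bwd = [[110, 100], [100, 110]]
diff = [[-3, 2], [0, 200]]

print(reconstruct_p(diff, fwd))       # [[97, 104], [104, 255]]
print(reconstruct_b(diff, fwd, bwd))  # [[102, 103], [102, 255]]
```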
Regardless of the data compression standard used, encode and decode operations employ lengthy computations, and a substantial number of those computations involve DCT or IDCT operations. Similar to the DCT, IDCT requires a careful selection of operations sequentially applied as multiply, add, subtract, shift and accumulate operations. An IDCT transform function for an 8×8 matrix can be shown as follows: ##EQU2##
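The equation referenced above is not reproduced in this text. For orientation, the 8×8 IDCT as commonly defined for JPEG and MPEG is given below; the symbols F(u,v) for coefficients, f(x,y) for reconstructed pixel values and C(k) for the normalization factor are assumptions, and the notation of the original equation 2 may differ.

```latex
f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7}
         C(u)\, C(v)\, F(u,v)\,
         \cos\!\left[\frac{(2x+1)\,u\,\pi}{16}\right]
         \cos\!\left[\frac{(2y+1)\,v\,\pi}{16}\right],
\qquad
C(k) =
\begin{cases}
  1/\sqrt{2}, & k = 0 \\
  1,          & k \neq 0
\end{cases}
```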
There is no theoretical or mathematical limit on the size of the input array for an IDCT computation. Equation 2 would be the same for transforming an entire image, although the computation time required for that large an array would be prohibitive. As set forth in Mattison, Practical Digital Video With Programming Examples In C (John Wiley & Sons, 1994) pp. 158-178 (herein incorporated by reference), the number of multiplication operations required for each element of a one-dimensional DCT matrix is proportional to the square of the number of elements in the sample array. Accordingly, reducing the array size from a two-dimensional array to a one-dimensional array (e.g., to a 1×8 array) serves to reduce the number of overall computations for each array. The following equation illustrates an IDCT transform function for converting a 1×8 matrix of elements to a 1×8 column of pixels: ##EQU3##
Dividing the original image into smaller one-dimensional blocks helps reduce the number of computations on each array from several thousand to a more manageable number, e.g., 16 multiplications and 26 additions (or subtractions/accumulations). See, e.g., Bhaskaran, "Image Compression Standards and Architectures", pp. 1.012.
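A direct 1×8 IDCT can be sketched as below, together with the matching forward transform used only to check the round trip. The sketch assumes the standard orthonormal 8-point form, f(x) = Σ (C(u)/2)·F(u)·cos((2x+1)uπ/16) with C(0) = 1/√2, and is not the fast algorithm behind the 16-multiplication figure: direct evaluation spends 8 multiplications per output, or 64 per array. Function names and sample values are hypothetical.

```python
import math

def scale(k):
    """Normalization factor C(k): 1/sqrt(2) for the DC index, else 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct_1d(pels):
    """Forward 8-point DCT of a 1x8 array."""
    return [0.5 * scale(u) * sum(pels[x] * math.cos((2 * x + 1) * u * math.pi / 16)
                                 for x in range(8))
            for u in range(8)]

def idct_1d(coeffs):
    """Inverse 8-point DCT: 1x8 coefficients back to 1x8 pels."""
    return [sum(0.5 * scale(u) * coeffs[u] * math.cos((2 * x + 1) * u * math.pi / 16)
                for u in range(8))
            for x in range(8)]

pels = [52, 55, 61, 66, 70, 61, 64, 73]  # sample values, chosen arbitrarily
restored = idct_1d(dct_1d(pels))
print([round(p) for p in restored])  # [52, 55, 61, 66, 70, 61, 64, 73]
```

Because the two-dimensional transform is separable, an 8×8 block can be processed by applying the 1×8 transform to each row and then to each column, which is why a fast one-dimensional kernel matters.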
It is desirable to introduce a DSP which can optimally perform multimedia-type operations in a rapid manner, at or near real time. The multimedia operations would benefit from being executed upon a DSP formed as part of an existing processor, similar to conventional designs but without the scalability limitation. Thus, the desired DSP must be capable of performing current or future-derived mathematical computations using not only an enhanced multimedia-type instruction set but also enhancements to existing hardware. An improved DSP is thereby needed which functions as a hardware and software extension to an existing processor core. Responsive to multimedia instructions, a DSP is needed which allows routing of operands to an arithmetic logic unit (ALU) in accordance with present or future-desired algorithms. An improved DSP is needed which can route multiple operands (i.e., more than two operands) simultaneously from partitioned, non-integer registers to the ALU depending upon any algorithm which might be chosen. The improved DSP must be capable of functioning on algorithms unique to JPEG, MPEG, DVI, Indeo, H.320 and, more specifically, on any future algorithm which requires multiple operations carried out in a structured sequence of simultaneous operations. A popular algorithm for which such a DSP would be particularly useful is one involving IDCT.
Enhancements to existing processors or to existing instruction sets are thereby needed to make MPEG, JPEG, H.320, etc., more viable as data compression standards. It would be desirable to perform as many operation-intensive computations as possible in parallel, and within as few instruction cycles as possible. It would also be beneficial to reorder operands such that operands exist in optimal order for such processing. Each operand within a set of operands must be chosen from one of numerous locations within a non-integer register. Reading from and writing to non-integer registers would avoid bandwidth limitations on existing integer registers, while allowing access to integer registers simultaneous with the multimedia-dedicated (non-integer) registers.
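The idea of selecting and reordering operands from a partitioned register before a parallel ALU operation can be modeled in software. The sketch below is purely illustrative and does not describe the claimed hardware: the 64-bit register width, the four 16-bit lanes, and the names `pack`, `unpack`, `route` and `packed_add` are all assumptions.

```python
LANES = 4        # assumed number of partitions per register
LANE_BITS = 16   # assumed width of each partition
MASK = (1 << LANE_BITS) - 1

def unpack(reg):
    """Split a packed register value into its lanes, lowest lane first."""
    return [(reg >> (LANE_BITS * i)) & MASK for i in range(LANES)]

def pack(lanes):
    """Combine lane values into one packed register value."""
    reg = 0
    for i, value in enumerate(lanes):
        reg |= (value & MASK) << (LANE_BITS * i)
    return reg

def route(reg, selectors):
    """Reorder lanes: output lane i takes input lane selectors[i]."""
    lanes = unpack(reg)
    return pack(lanes[s] for s in selectors)

def packed_add(a, b):
    """Add corresponding lanes independently, wrapping within each lane."""
    return pack((x + y) & MASK for x, y in zip(unpack(a), unpack(b)))

a = pack([1, 2, 3, 4])
reversed_a = route(a, [3, 2, 1, 0])      # operands reordered before the ALU
sums = unpack(packed_add(a, reversed_a))
print(sums)  # [5, 5, 5, 5]
```

Pairing each lane with its mirror image in a single step resembles the sum terms of an IDCT butterfly, which is the kind of structured, simultaneous operation the passage calls for.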