The present invention relates to microprocessors and more specifically to techniques for manipulating vectored data.
Increased computer processing is required to provide for modern digital services. As an example, the Internet has spawned a plethora of multimedia applications for presenting images and playing video and audio content. These applications involve the manipulation of complex data in the form of still graphic images and full motion video. It is commonly accepted that digitized images consume prodigious amounts of storage. For example, a single relatively modest-sized image having 480×640 pixels and a full-color resolution of 24 bits per pixel (three 8-bit bytes per pixel) occupies nearly a megabyte of data. At a resolution of 1024×768 pixels, a 24-bit color image requires 2.3 MB of memory to represent. A 24-bit color picture of an 8.5 inch by 11 inch page, at 300 dots per inch, requires as much as 25 MB of storage. Video images are even more data-intensive, since it is generally accepted that for high-quality consumer applications, images must occur at a rate of at least 30 frames per second. Current proposals for high-definition television (HDTV) call for as many as 1920×1035 or more pixels per frame, which translates to a data transmission rate of about 1.5 billion bits per second. Other advances in digital imaging and multimedia applications such as video teleconferencing and home entertainment systems have created an even greater demand for higher bandwidth and consequently ever greater processing capability.
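The storage and bandwidth figures quoted above follow from simple arithmetic on pixel counts. The following is a minimal sketch of that arithmetic; the function names are illustrative and do not appear in the specification, and 24 bits per pixel and 30 frames per second are taken from the text.

```python
def raw_image_bytes(width, height, bits_per_pixel=24):
    """Uncompressed size, in bytes, of a width x height image."""
    return width * height * bits_per_pixel // 8


def raw_video_bits_per_second(width, height, frames_per_second=30,
                              bits_per_pixel=24):
    """Uncompressed bit rate of a video stream of width x height frames."""
    return width * height * bits_per_pixel * frames_per_second


print(raw_image_bytes(480, 640))              # 921600 (nearly a megabyte)
print(raw_image_bytes(1024, 768))             # 2359296 (about 2.3 MB)
print(raw_video_bits_per_second(1920, 1035))  # 1430784000 (about 1.5 billion)
```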
Traditional lossless techniques for compressing digital image and video information include methods such as Huffman encoding, run length encoding and the Lempel-Ziv-Welch algorithm. These approaches, though advantageous in preserving image quality, are otherwise inadequate to meet the demands of high throughput systems. For this reason, compression techniques which typically involve some loss of information have been devised. They include discrete cosine transform (DCT) techniques, adaptive DCT (ADCT) techniques, and wavelet transform techniques.
The Joint Photographic Experts Group (JPEG) has created a standard for still image compression, known as the JPEG standard. This standard defines an algorithm based on the discrete cosine transform (DCT). An encoder using the JPEG algorithm processes an image in four steps: linear transformation, quantization, run-length encoding (RLE), and Huffman coding. The decoder reverses these steps to reconstruct the image. For the linear transformation step, the image is divided up into blocks of 8×8 pixels and a DCT operation is applied in both spatial dimensions for each block. The purpose of dividing the image into blocks is to overcome a deficiency of the DCT algorithm, which is that the DCT is highly non-local: confining each transform to a small region limits the non-locality to that region. However, this compromise has the disadvantage of producing a tiled, blocky appearance in the reconstructed image.
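The linear transformation step can be illustrated with a short sketch. It assumes the orthonormal DCT-II and an image stored as a list of rows whose dimensions are multiples of eight; the function names are hypothetical, and a direct (non-fast) evaluation is used for clarity rather than speed.

```python
import math

N = 8  # block size named in the text


def dct_1d(vec):
    """Orthonormal DCT-II of a sequence, evaluated directly."""
    n = len(vec)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(vec))
        out.append(scale * total)
    return out


def dct_2d(block):
    """Apply the 1-D DCT along every row, then along every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row order


def split_into_blocks(image, n=N):
    """Yield (top, left, block) for each n x n tile of the image.

    Assumes the image height and width are multiples of n.
    """
    for top in range(0, len(image), n):
        for left in range(0, len(image[0]), n):
            yield top, left, [row[left:left + n]
                              for row in image[top:top + n]]
```

For a block of uniform intensity, all of the energy lands in the single coefficient at position (0, 0), which is why smooth image regions compress well after this step.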
The quantization step is essential to reduce the amount of information to be transmitted, though it does cause loss of image information. Each transform component is quantized using a value selected according to its position in the 8×8 block. This step has the convenient side effect of reducing the abundant small values to zero or other small numbers, which can require much less information to specify.
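A minimal sketch of position-dependent quantization follows. The step sizes used in the usage example are made up for illustration and are not the tables of the JPEG standard; the function names are likewise hypothetical.

```python
def quantize(coeffs, table):
    """Divide each coefficient by the step size at the same position.

    Python's round() resolves exact ties to the nearest even integer.
    """
    return [[int(round(c / q)) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]


def dequantize(levels, table):
    """Approximate reconstruction; the rounding error is not recoverable."""
    return [[level * q for level, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, table)]


# Hypothetical 2 x 2 corner of a coefficient block and of a step-size table.
coeffs = [[800.0, 3.0], [-5.0, 1.0]]
steps = [[16, 11], [12, 14]]
levels = quantize(coeffs, steps)
print(levels)                     # [[50, 0], [0, 0]]
print(dequantize(levels, steps))  # [[800, 0], [0, 0]]
```

The small coefficients collapse to zero and cannot be restored, which is the loss of information mentioned above.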
The run-length encoding step codes runs of same values, such as zeros, to produce codes which identify the number of times to repeat a value and the value to repeat. A single code like "8 zeros" requires less space to represent than a string of eight zeros, for example. This step is justified by the abundance of zeros that usually results from the quantization step.
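The idea can be sketched as a generic run-length coder that emits (count, value) pairs. This is an illustration of the principle only; the exact symbol format used by the JPEG standard differs, and the function names are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into (count, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [tuple(run) for run in runs]


def run_length_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out


print(run_length_encode([50, 0, 0, 0, 0, 0, 0, 0, 0, 3]))
# [(1, 50), (8, 0), (1, 3)]
```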
Huffman coding (a popular form of entropy coding) translates each symbol from the run-length encoding step into a variable-length bit string that is chosen depending on how frequently the symbol occurs. That is, frequent symbols are coded with shorter codes than infrequent symbols. The coding can be done either from a preset table or one composed specifically for the image to minimize the total number of bits needed.
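The following sketch builds a code table for one particular input, corresponding to the case of a table composed specifically for the image. It uses the textbook Huffman construction with a priority queue; the function name and the bit-string representation are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to its bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, partial code table).
    heap = [(count, i, {sym: ""})
            for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # two least frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


codes = huffman_code("aaaaaaaabbbc")
# The most frequent symbol, "a", receives the shortest bit string.
```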
Similarly to JPEG, the Moving Picture Experts Group (MPEG) has promulgated two standards for coding image sequences. The standards are known as MPEG I and MPEG II. The MPEG algorithms exploit the common occurrence of relatively small variations from frame to frame. In the MPEG standards, a full image is compressed and transmitted only once for every 12 frames. These "reference" frames (so-called "I-frames" for intra-frames) are typically compressed using JPEG compression. For the intermediate frames, a predicted frame (P-frame) is calculated and only the difference between the actual frame and each predicted frame is compressed and transmitted.
Any of several algorithms can be used to calculate a predicted frame. The algorithm is chosen on a block-by-block basis depending on which predictor algorithm works best for the particular block. One technique called "motion estimation" is used to reduce temporal redundancy. Temporal redundancy is observed in a movie where large portions of an image remain unchanged from frame to adjacent frame. In many situations, such as a camera pan, every pixel in an image will change from frame to frame, but nearly every pixel can be found in a previous image. The process of "finding" copies of pixels in previous (and future) frames is called motion estimation. Video compression standards such as H.261 and MPEG 1 and 2 allow the image encoder (image compression engine) to remove redundancy by specifying the motion of 16×16 pixel blocks within an image. The image being compressed is broken into blocks of 16×16 pixels. For each block in an image, a search is carried out to find matching blocks in other images that are in the sequence being compressed. Two measures are typically used to determine the match. One is the sum of absolute difference (SAD) which is mathematically written as

$$\sum_{i}\sum_{j}\left|a_i - b_j\right|,$$

and the other is the sum of differences squared (SDS) which is mathematically written as

$$\sum_{i}\sum_{j}\left(a_i - b_j\right)^2.$$
The SAD measure is easy to implement in hardware. However, though the SDS operation requires greater precision to generate, the result is generally accepted to be of superior quality.
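The two match measures and the block search can be sketched as follows. The sketch assumes the conventional block-matching reading of the double sums, in which the indices run over the rows and columns of two equally sized blocks and co-located pixels are compared; that reading, the exhaustive search strategy, and all function and parameter names are assumptions for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between co-located pixels."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def sds(block_a, block_b):
    """Sum of squared differences between co-located pixels."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_match(block, reference, top, left, search_range, measure=sad):
    """Exhaustively search the reference frame around (top, left).

    Returns (cost, dy, dx) for the displacement with the lowest cost,
    or None if no candidate position lies inside the reference frame.
    """
    n = len(block)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = top + dy, left + dx
            if (r < 0 or c < 0 or r + n > len(reference)
                    or c + n > len(reference[0])):
                continue  # candidate falls outside the frame
            candidate = [row[c:c + n] for row in reference[r:r + n]]
            cost = measure(block, candidate)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best
```

Passing `measure=sds` selects the squared measure. Its per-pixel terms grow quadratically, so it needs a wider accumulator than SAD for the same block size.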
For real time, high-quality video image decompression, the decompression algorithm must be simple enough to be able to produce 30 frames of decompressed images per second. The speed requirement for compression is often not as extreme as for decompression, since in many situations, images are compressed offline. Even then, however, compression time must be reasonable to be commercially viable. In addition, many applications require real time compression as well as decompression, such as real time transmission of live events; e.g., video teleconferencing.
Dedicated digital signal processors (DSPs) are the traditional workhorses generally used to carry out these kinds of operations. Optimized for number crunching, DSPs are often included within multimedia devices such as sound cards, speech recognition cards, video capture cards, etc. DSPs typically function as coprocessors, performing the complex and repetitive mathematical computations demanded by the data compression algorithms, and performing specific multimedia-type algorithms more efficiently than their general purpose microprocessor counterparts.
However, the never-ending quest to improve the price/performance ratio of personal computer systems has spawned a generation of general purpose microprocessors which effectively duplicate much of the processing capacity traditionally provided by DSPs. One line of development is the reduced instruction set computer (RISC). RISC processors are characterized by a smaller number of instructions which are simple to decode, and by requiring that all arithmetic/logic operations be performed in a register-to-register manner. Another feature is that there are no complex memory access operations. All memory accesses are register load/store operations, and there is a comparatively small number of relatively simple addressing modes; i.e., only a few ways of specifying operand addresses. Instructions are of only one length, and memory accesses are of a standard data width. Instruction execution is of the direct hardwired type, as compared to microcoding. There is a fixed instruction cycle time, and the instructions are defined to be relatively simple so that they all execute in one or a few cycles. Typically, multiple instructions are simultaneously in various states of execution as a consequence of pipeline processing.
To make MPEG, JPEG, H.320, etc., more viable as data compression standards, enhancements to existing RISC architecture processors and to existing instruction sets have been made. Other modern digital services, such as broadband networks, set-top box CPUs, cable systems, voice-over IP equipment, and wireless products, conventionally implemented using DSP methodology, would also benefit by having increased processing capacity in a single general-purpose processor. More generally, digital filter applications which traditionally are implemented by DSP technology would benefit from the additional processing capability provided by a general-purpose processor having DSP capability.
The instruction set architecture (ISA) of many RISC processors includes single-instruction, multiple-data (SIMD) instructions. These instructions allow parallel operations to be performed on multiple elements of a vector of data with corresponding elements of another vector. These types of vector operations are common to many digital applications such as image processing. Another critical area is in the field of data encryption and decryption systems. Coding of information is important for secured transactions over the Internet and for wireless communication systems.
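The behavior of a SIMD instruction can be modeled in software by treating one wide register as several independent lanes. The sketch below assumes a 64-bit register holding four 16-bit elements; those widths, and the names used, are illustrative assumptions and not a description of any particular instruction set.

```python
LANE_BITS = 16                      # assumed element width
LANES = 4                           # assumed elements per register
LANE_MASK = (1 << LANE_BITS) - 1


def pack(elements):
    """Pack a list of LANES small integers into one register-sized word."""
    word = 0
    for i, e in enumerate(elements):
        word |= (e & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a register-sized word into its LANES elements."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def simd_add(x, y):
    """Add corresponding elements; each lane wraps around independently."""
    return pack([(a + b) & LANE_MASK for a, b in zip(unpack(x), unpack(y))])


result = simd_add(pack([1, 2, 3, 0xFFFF]), pack([10, 20, 30, 1]))
print(unpack(result))  # [11, 22, 33, 0]
```

The last lane wraps to zero without disturbing its neighbors, which is the property that distinguishes a lane-wise operation from an ordinary 64-bit addition.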
Therefore it is desirable to further enhance the performance of the RISC architecture. It is desirable to improve the performance capability of RISC processor cores to provide enhanced multimedia applications and in general to meet the computing power demanded by next generation consumer products. What is needed are enhancements of the ISA for vectored processing instructions. It is also desirable to provide an improved microarchitecture for a RISC-based processor in the areas of vectored data processing.
A method of multiplying 32-bit values includes splitting each multiplicand into two 16-bit values. For each multiplicand, the two 16-bit values can be summed to produce the original 32-bit datum. Thus, each 32-bit value has the form $(a_n + b_n)$. The product is $(a_1 a_2 + a_1 b_2 + a_2 b_1 + b_1 b_2)$. Multiplying the two multiplicands in this manner requires only 16-bit multipliers. The intermediate terms need to be multiplied by powers of two before summing to produce the correct result.
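The decomposition can be checked with a short sketch. It assumes unsigned operands and treats the upper and lower 16-bit halves as the two parts of each multiplicand, so that the upper half carries an implicit factor of 2^16; signed operands would need additional handling that is not shown. The function name is hypothetical.

```python
MASK16 = 0xFFFF


def mul32_via_16(x, y):
    """Unsigned 32 x 32 -> 64-bit product built from four 16 x 16 products."""
    x_hi, x_lo = x >> 16, x & MASK16
    y_hi, y_lo = y >> 16, y & MASK16

    # Each of these fits a 16-bit by 16-bit multiplier.
    hi_hi = x_hi * y_hi
    hi_lo = x_hi * y_lo
    lo_hi = x_lo * y_hi
    lo_lo = x_lo * y_lo

    # Scale the partial products by powers of two before summing.
    return (hi_hi << 32) + ((hi_lo + lo_hi) << 16) + lo_lo


print(mul32_via_16(0x12345678, 0x9ABCDEF0) == 0x12345678 * 0x9ABCDEF0)  # True
```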
In accordance with the invention a processing core includes a multiplication unit comprising first, second, and third inputs for receiving data from a general purpose register file. The multiplication unit further comprises a first selector coupled to receive the inputs, a set of multiply circuits coupled to receive outputs of the first selector, a first, a second and a third transform path, a second selector coupled to receive the transform paths, a compression circuit coupled to receive an output of the second selector and to receive the third input, and an adder circuit coupled to receive outputs of the second selector.
The first selector selects subsets of its inputs and presents them in various sequences to the multiply circuits, depending on the decoded instruction. Each transform path produces a different data transformation on the outputs of the multiply circuits. The second selector selects among the three transform paths and feeds the selected path to the compression circuit, also based on the decoded instruction. The adder circuit is selectively configured to provide four, two or a single full adder, again based on the decoded instruction.
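A configurable adder of this kind can be modeled in software by suppressing carries at the lane boundaries. The sketch below assumes a 64-bit datapath that is treated as four 16-bit adders, two 32-bit adders, or one 64-bit adder; the datapath width, the masking technique, and the function name are assumptions for illustration and are not taken from the specification.

```python
WORD_BITS = 64                       # assumed datapath width
WORD_MASK = (1 << WORD_BITS) - 1


def partitioned_add(x, y, lanes):
    """Add two words as 1, 2 or 4 independent lanes (modular per lane)."""
    if lanes not in (1, 2, 4):
        raise ValueError("lanes must be 1, 2 or 4")
    lane_bits = WORD_BITS // lanes

    # Mask with the most significant bit of every lane set.
    msb = 0
    for i in range(lanes):
        msb |= 1 << ((i + 1) * lane_bits - 1)
    low = WORD_MASK & ~msb

    # Adding only the lower bits lets a carry reach each lane's top bit
    # but never cross into the next lane.
    partial = (x & low) + (y & low)
    # The top bit of each lane is then completed with a carry-free XOR.
    return (partial ^ ((x ^ y) & msb)) & WORD_MASK


x, y = 0x0000_0000_FFFF_FFFF, 0x0000_0000_0000_0001
print(hex(partitioned_add(x, y, 1)))  # 0x100000000
print(hex(partitioned_add(x, y, 2)))  # 0x0
print(hex(partitioned_add(x, y, 4)))  # 0xffff0000
```

The same operands give three different results depending on where carries are allowed to propagate, which is the effect of reconfiguring the adder per instruction.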
The multiply circuits include overflow detection logic. Likewise, the adder circuit includes overflow detection logic. Saturation value generators are provided in the multiply circuits and in the adder circuit to provide saturation upon detecting overflow.
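Saturation on overflow can be sketched as clamping a result to the limits of its format and reporting whether clamping occurred. The sketch assumes signed two's-complement results, with 16 bits as the default width; the width, the returned overflow flag, and the function names are illustrative assumptions.

```python
def saturate(value, bits=16):
    """Clamp value to the signed range of the given width.

    Returns (result, overflowed).
    """
    hi = (1 << (bits - 1)) - 1       # largest representable value
    lo = -(1 << (bits - 1))          # smallest representable value
    if value > hi:
        return hi, True
    if value < lo:
        return lo, True
    return value, False


def saturating_add(a, b, bits=16):
    """Add, then substitute the saturation value on overflow."""
    return saturate(a + b, bits)


def saturating_mul(a, b, bits=16):
    """Multiply, then substitute the saturation value on overflow."""
    return saturate(a * b, bits)


print(saturating_add(30000, 10000))    # (32767, True)
print(saturating_add(-30000, -10000))  # (-32768, True)
print(saturating_mul(100, 100))        # (10000, False)
```

Applying the check after the multiply as well as after the add means an overflowing product is clamped before it can be combined with other terms.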
The configurability of the first selector to present its inputs in different sequences to the multiply circuits creates a flexible circuit for accommodating a variety of instructions. In particular, the same instruction can be implemented for different-sized data formats without having to provide circuitry customized for each format. The dual overflow detection logic also supports multiple data formats. In addition, overflow situations are more accurately handled since overflow detection occurs for intermediate results.