1. Field of the Invention
This invention relates to a processor architecture for video processing tasks such as motion estimation and pixel processing where the processor also incorporates general processing capabilities and further relates to arithmetic logic units and multiply units for such processors.
2. Description of Related Art
General purpose processors commonly have an architecture that allows the processor to perform a wide variety of memory access, arithmetic, logical, and program control operations. The wide variety of operations simplifies (or enables) development of software for a nearly endless variety of tasks. For example, with appropriate software, a general purpose processor can execute programs including operating systems, communication applications, word processing applications, databases, spreadsheets, and games. General purpose processors can also perform multimedia tasks such as video data processing (encoding, decoding, and filtering), audio data processing, and communications data processing. A drawback of general purpose processors is that the processor's architecture may not be efficient for some tasks. For example, video data processing often requires manipulation of large two-dimensional arrays of pixel values. General purpose processors typically handle one pixel value or a few pixel values per instruction and must repeatedly access external memory to retrieve appropriate pixel values just before processing the pixel values.
A processor designed for a specific task (commonly referred to as a digital signal processor or DSP) can be much more efficient at the task and therefore much less expensive than a general purpose processor that provides the same performance when performing the task. An example of a special purpose DSP is an MPEG video decoder that includes logic specifically adapted for decoding an MPEG video data stream. While special purpose DSPs can be very efficient at specific tasks, such DSPs are typically incapable of or unsuited for other tasks. Accordingly, a system for multimedia data processing may require several separate DSPs for the different tasks and may still need a general purpose processor for control functions not implemented on any of the DSPs.
A processor architecture is desired that efficiently performs a variety of video and general processing tasks. Such a processor would ideally provide high performance at minimal expense and would eliminate the need for additional DSPs or a general purpose processor in many multimedia data processing systems.
In accordance with the invention, a video signal processor operates in three modes: a motion estimation mode for searching a search window to find a block that best matches a reference block, a pixel processing mode for processing such as half-pixel interpolation and vertical and horizontal filtering of pixel data, and a general processing mode for general purpose processing including system control and multimedia calculations such as DCTs and FFTs. The processor, by itself, can support the diverse control, video, audio, and modem functions. In one embodiment, the processor includes first and second on-chip memories that have different functions depending on the operating mode. In general processing mode, the first memory is a fast scratch pad memory and the second memory is a register file containing operands for relatively wide (e.g., 32-bit) data paths. In pixel processing mode, the first memory still operates as a scratch pad, but the second memory is a register file containing vector operands with pixel-value-size (e.g., 8-bit) data elements. In motion estimation (search) mode, the first memory is a search window buffer, the second memory stores a reference block of pixel values, and both memories directly provide operands to the processor's data paths.
The processor's data paths may include an arithmetic logic unit (ALU) and a multiply unit, each of which includes multiple slices. The multiple slices operate independently for parallel processing in motion estimation and pixel processing modes and operate cooperatively to provide a larger data path width for general purpose processing. In particular, the multiply unit uses four multipliers to independently perform four parallel multiplications of pixel values or uses the four multipliers cooperatively with an adder to perform a multiplication of larger operands. Each ALU slice includes a pair of adders and operand selection circuits. A line buffer for the ALU enables on-the-fly video data compression and half-pixel interpolation processes on input data, single cycle determination of absolute differences between pixel values, and general arithmetic operations such as addition and subtraction.
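As an illustration only, the independent and cooperative operation of ALU slices can be modeled in software. The sketch below assumes four 8-bit slices, assumes that cooperative operation means passing a carry from each slice to the next, and assumes that independent results wrap around within a slice. None of these details is stated in the summary above, and the name `sliced_add` is hypothetical.

```python
def sliced_add(a, b, cooperative):
    """Add two 32-bit words using four 8-bit slices.

    cooperative=True  : carries ripple between slices (one 32-bit addition).
    cooperative=False : each slice adds its own byte pair and drops the carry
                        (four independent 8-bit additions, wrap-around assumed).
    """
    result = 0
    carry = 0
    for i in range(4):
        slice_a = (a >> (8 * i)) & 0xFF
        slice_b = (b >> (8 * i)) & 0xFF
        total = slice_a + slice_b + (carry if cooperative else 0)
        carry = total >> 8                    # carry out of this slice
        result |= (total & 0xFF) << (8 * i)   # keep the slice's 8-bit result
    return result
```

With the operands `0x01FF01FF` and `0x00010001`, the cooperative case produces the ordinary 32-bit sum, while the independent case treats each byte as a separate pixel-size value.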
In accordance with one embodiment of the invention, an integrated processor includes: a processing circuit; a first memory; and a second memory. The processor operates in a first mode in which the first memory stores pixel values of a search window and the second memory stores pixel values of a reference block for which a matching block in the search window is sought. In this mode both memories can directly provide operands to the processing circuit. The processor operates in a second mode in which the second memory operates as a register file having storage locations identified by register numbers in instructions. In the second mode, the first memory operates as a scratch pad, and the processor has read and write paths for transferring data between the memories in parallel with execution of other instructions. One embodiment of the processing circuit includes an arithmetic logic unit and a multiply unit, each of which includes a plurality of slices that operate independently in the first mode to perform multiple parallel operations on pixel values and operate cooperatively in the second mode to operate on operands that are larger than the pixel values.
In accordance with a further aspect of the invention, a processor includes: an input port for input of pixel data; and an operand selection circuit operable to direct pixel data from the input port to the arithmetic logic unit. Results from the arithmetic logic unit can be written into the first or second memory. In addition, the arithmetic logic unit can perform an on-the-fly compression of pixel data from the input port while writing compressed data to either the first or second memory. In one specific implementation, the on-the-fly compression averages pixels horizontally, vertically, or both horizontally and vertically. The compression permits a hierarchical motion vector search that first uses compressed pixel data and then uses uncompressed pixel data. In particular, a first step of the hierarchical motion vector search searches a compressed search window for a block most similar to a compressed reference block. A second step searches an uncompressed search window that is centered on the area identified in the first step. The hierarchical search permits searches of large search windows using a relatively small search window buffer and reduces processing time by reducing the total number of pixel value comparisons.
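The two-step hierarchical search described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the processor's implementation: compression is taken to be a 2-to-1 average both horizontally and vertically with truncation, similarity is measured by a sum of absolute differences, and the second step examines positions within a hypothetical `radius` of the scaled-up first-step result. All function names are illustrative.

```python
def compress(img):
    """Average each 2x2 neighborhood (assumed compression; truncating divide by 4)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) >> 2
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]

def sad_at(window, ref, top, left):
    """Sum of absolute differences between ref and the window block at (top, left)."""
    return sum(abs(window[top + r][left + c] - ref[r][c])
               for r in range(len(ref)) for c in range(len(ref[0])))

def search(window, ref, tops, lefts):
    """Return (cost, top, left) of the best match among the candidate positions."""
    best = None
    for top in tops:
        for left in lefts:
            cost = sad_at(window, ref, top, left)
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best

def hierarchical_search(window, ref, radius=1):
    # Step 1: exhaustive search of the compressed window with the compressed block.
    cw, cr = compress(window), compress(ref)
    _, ct, cl = search(cw, cr,
                       range(len(cw) - len(cr) + 1),
                       range(len(cw[0]) - len(cr[0]) + 1))
    # Step 2: search the uncompressed window near the area found in step 1.
    max_top = len(window) - len(ref)
    max_left = len(window[0]) - len(ref[0])
    tops = range(max(0, 2 * ct - radius), min(max_top, 2 * ct + radius) + 1)
    lefts = range(max(0, 2 * cl - radius), min(max_left, 2 * cl + radius) + 1)
    return search(window, ref, tops, lefts)
```

In this model the first step compares one quarter as many pixel values per candidate position and visits roughly one quarter as many positions, which is the source of the reduction in pixel value comparisons noted above.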
One embodiment of the arithmetic logic unit includes: a line buffer; and a plurality of slices, where each slice includes a first adder and a second adder. Each adder can perform an addition or a subtraction. In each slice, a first multiplexing circuit for the first adder has input signals including signals representing an associated portion of a first operand, an associated portion of a second operand, and consecutive portions of one of the first and second operands. A second multiplexing circuit for the second adder has input signals including signals representing the associated portion of the first operand, the associated portion of the second operand, data from the line buffer, and results from the first adder. The portions of the operands are typically the size of a pixel value.
For one data compression process, the first multiplexing circuit selects consecutive pixel values as operands for the first adder. For even lines in an image array, the line buffer stores the results from the first adder. For odd lines of the image array, the second multiplexing circuit selects the result from the first adder and a previous result from the line buffer as the operands for the second adder. The resulting sum from the second adder can be shifted to provide an average of four neighboring pixel values in two lines of the image array. For some half-pixel interpolation processes, the first adder stores results to the line buffer and simultaneously provides a sum to the second adder for both even and odd lines. With proper selection of input operands, the ALU can perform a half-pixel interpolation to determine horizontally averaged pixel values, vertically averaged pixel values, or pixel values that are averaged both horizontally and vertically. This permits use of half-pixel motion vectors.
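The compression process just described can be modeled line by line. The sketch below is illustrative only: it assumes an image with even width and height, lines numbered from zero, and a right shift by two bits (truncating, with no rounding term) to form the average. The name `compress_2x2` is hypothetical.

```python
def compress_2x2(image):
    """Average each 2x2 group of pixels using a first adder, a line buffer,
    and a second adder, processing the image one line at a time."""
    width = len(image[0])
    line_buffer = [0] * (width // 2)
    compressed = []
    for y, row in enumerate(image):
        # First adder: sum of consecutive (horizontally adjacent) pixel values.
        first = [row[2 * i] + row[2 * i + 1] for i in range(width // 2)]
        if y % 2 == 0:
            # Even line: hold the partial sums in the line buffer.
            line_buffer = first
        else:
            # Odd line: second adder combines the current first-adder result
            # with the previous result from the line buffer; the shift by 2
            # divides the four-pixel sum by 4.
            compressed.append([(a + b) >> 2 for a, b in zip(first, line_buffer)])
    return compressed
```

Because only one line of partial sums is held at a time, the model needs storage for half a line of sums rather than for the whole input image, which is consistent with performing the compression on the fly as pixel data arrives.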
For determining an absolute difference between two blocks of pixel values, the first operand contains pixel values from a first block, and the second operand contains pixel values from a second block. The first adder determines the difference between a pixel value from the first operand and a pixel value from the second operand, and the second adder determines the difference between the pixel value from the second operand and the pixel value from the first operand. A multiplexer coupled to the adders selects whichever difference is positive. A tree adder in the processor can add the positive results from the different slices together to generate a sum of the absolute differences between pixel values in blocks.
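A software model of this absolute difference scheme is sketched below. It is a minimal illustration, assuming pixel values held as plain integers and a pairwise reduction as the form of the tree adder. The function names are hypothetical.

```python
def abs_diff(a, b):
    """One slice: two subtractions performed side by side, then a selection."""
    d1 = a - b                      # first adder: a minus b
    d2 = b - a                      # second adder: b minus a
    return d1 if d1 >= 0 else d2    # multiplexer keeps the non-negative difference

def tree_add(values):
    """Pairwise (tree) reduction of the per-slice results."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)        # pad so every value has a partner
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0] if values else 0

def sum_abs_diff(block_a, block_b):
    """Sum of absolute differences between two equal-length runs of pixel values."""
    return tree_add(abs_diff(a, b) for a, b in zip(block_a, block_b))
```

Computing both differences and selecting one avoids a separate negate step after a single subtraction, which is one way to obtain the single cycle result mentioned earlier.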
In motion search mode, an addressing system implemented in the processor for the search window buffer and the register file provides adjustable incrementing and address basing that simplifies selection of pixel values corresponding to a particular block in the search window. This simplifies coding of programs for video processing such as performing a search or a determination of the difference between the reference block and a block within the search window.
As another aspect of the invention, a multiply unit includes one or more sets of four multipliers and one or more adders that combine results from an associated set of multipliers. The multipliers in a set, when operating independently, generate four products, for example, four products of 8-bit values. When four multipliers operate cooperatively with the associated adder, the adder combines the results from the four multipliers to generate a product of two double-size operands, for example, the product of two 16-bit operands. To perform the combination, the adder has input ports that are larger than output ports of the multipliers, and the output ports of the multipliers are coupled to bits within the input ports of the adder according to the significance of the product determined by the multiplier. An output circuit for the multiply unit provides output signals from the multipliers when the multiply unit operates in a first mode (e.g., pixel processing mode), and provides an output signal from the adder when the multiply unit operates in a second mode (e.g., general processing mode). The multiply unit further includes an operand selection circuit that selects different portions of operands for each multiplier. The portions selected for a multiplier typically depend on the processor's operating mode.
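The cooperative use of four multipliers can be checked with a short arithmetic model. The sketch below assumes unsigned operands. Signed operands would need sign handling that the summary above does not describe. The shifts stand in for coupling each multiplier's output to the adder input bits of matching significance, and the function names are hypothetical.

```python
def mul8(a, b):
    """One 8-bit by 8-bit multiplier producing a 16-bit product (unsigned assumed)."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b

def mul16_from_four_mul8(x, y):
    """Form a 16-bit by 16-bit product from four 8-bit partial products."""
    x_lo, x_hi = x & 0xFF, (x >> 8) & 0xFF
    y_lo, y_hi = y & 0xFF, (y >> 8) & 0xFF
    lo_lo = mul8(x_lo, y_lo)   # significance 2**0
    lo_hi = mul8(x_lo, y_hi)   # significance 2**8
    hi_lo = mul8(x_hi, y_lo)   # significance 2**8
    hi_hi = mul8(x_hi, y_hi)   # significance 2**16
    # The adder sums the partial products after aligning each one
    # to its bit position in the 32-bit result.
    return lo_lo + (lo_hi << 8) + (hi_lo << 8) + (hi_hi << 16)
```

The identity behind the model is (256·x_hi + x_lo)(256·y_hi + y_lo) = 65536·x_hi·y_hi + 256·(x_hi·y_lo + x_lo·y_hi) + x_lo·y_lo, which is why the adder inputs must be wider than the multiplier outputs.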