Computer processors are well known and widely used for a variety of purposes. One application of computer processors is digital signal processing (DSP). By definition, digital signal processing is concerned with the representation of signals by sequences of numbers or symbols and the processing of these signals. DSP has a wide variety of applications and its importance is evident in such fields as pattern recognition, radio communications, telecommunications, radar, biomedical engineering, and many others.
At the heart of every DSP system is a computer processor that performs mathematical operations on signals. Generally, signals received by a DSP system are first converted to a digital format used by the computer processor. Then the computer processor executes a series of mathematical operations on the digitized signal. The purpose of these operations can be to estimate characteristic parameters of the signal or to transform the signal into a form that is in some sense more desirable. Such operations typically implement complicated mathematics and entail intensive numerical processing. Examples of mathematical operations that may be performed in DSP systems include matrix multiplication, matrix inversion, Fast Fourier Transforms (FFT), autocorrelation and cross-correlation, Discrete Cosine Transforms (DCT), polynomial equations, and difference equations in general, such as those used to approximate Infinite Impulse Response (IIR) and Finite Impulse Response (FIR) filters.
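By way of illustration only, the following sketch shows one of the operations named above, a difference equation of the kind used for an FIR filter, y[n] = b[0]x[n] + b[1]x[n-1] + ... The function name `fir_filter`, the coefficient values, and the choice to treat samples before the start of the signal as zero are assumptions made for this example and are not taken from any particular DSP system.

```python
def fir_filter(samples, coefficients):
    """Apply the difference equation y[n] = sum over k of b[k] * x[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, b in enumerate(coefficients):
            if n - k >= 0:  # assumption: samples before the start are zero
                acc += b * samples[n - k]
        output.append(acc)
    return output


# Hypothetical two-tap moving average applied to a short digitized signal.
smoothed = fir_filter([2.0, 4.0, 6.0, 8.0], [0.5, 0.5])
print(smoothed)
```

Each output sample costs one multiplication and one addition per coefficient, which is why such filters are numerically intensive at high sample rates.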
Computer processors vary considerably in design and function. One aspect of a processor design is its architecture. Generally, the term computer architecture refers to the instruction set and organization of a processor. An instruction set is a group of programmer-visible instructions used to program the processor. The organization of a processor, on the other hand, refers to its overall structure and composition of computational resources, for example, the bus structure, memory arrangement, and number of processing elements. A processing element may be as simple as an adder circuit that sums two values, or it may be as complex as a central processing unit (CPU) that performs a wide variety of different operations.
In a computer, a number of different organizational techniques can be used for increasing execution speed. One technique is execution overlap. Execution overlap is based on the notion of operating a computer like an assembly line with an unending series of operations in various stages of completion. Execution overlap allows these operations to be overlapped and executed simultaneously.
One commonly used form of execution overlap is pipelining. In a computer, pipelining is an implementation technique that allows a sequence of the same operations to be performed on different arguments. The computation to be done for a specific instruction is broken into smaller pieces, i.e., operations, each of which takes a fraction of the time needed to complete the entire instruction. Each of these pieces is called a pipe stage. The stages are connected in a sequence to form a pipeline: arguments of the instruction enter at one end, are processed through the stages, and exit at the other end.
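The pipelining described above can be illustrated by a minimal software simulation. This is a sketch under stated assumptions, not a description of any actual processor: the names `run_pipeline`, `latches`, and `EMPTY` are hypothetical, each stage is assumed to take exactly one clock cycle, and at least one stage is assumed to be present.

```python
EMPTY = object()  # marks a pipe stage that holds no argument in this cycle


def run_pipeline(stages, arguments):
    """Simulate a linear pipeline; return (results in order, cycles used)."""
    pending = list(arguments)
    latches = [EMPTY] * len(stages)  # latches[s] is the input waiting at stage s
    results = []
    cycles = 0
    while pending or any(v is not EMPTY for v in latches):
        if pending:
            latches[0] = pending.pop(0)  # a new argument enters at one end
        # All stages operate in the same cycle, each on a different argument.
        outputs = [stage(v) if v is not EMPTY else EMPTY
                   for stage, v in zip(stages, latches)]
        if outputs[-1] is not EMPTY:
            results.append(outputs[-1])  # a finished result exits at the other end
        latches = [EMPTY] + outputs[:-1]  # every value advances by one stage
        cycles += 1
    return results, cycles


# Three hypothetical stages applied to four arguments.
stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
results, cycles = run_pipeline(stages, [1, 2, 3, 4])
print(results, cycles)
```

In this simulation, four arguments pass through three stages in 4 + 3 - 1 = 6 cycles, whereas processing each argument to completion before starting the next would take 4 × 3 = 12 stage times.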
There are many different architectures, ranging from complex-instruction-set-computer (CISC) to reduced-instruction-set-computer (RISC) based architectures. In addition, some architectures have only one processing element, while others include two or more processing elements. Despite differences in architectures, all computer processors have a common goal, which is to provide the highest performance at the lowest cost. However, the performance of a computer processor is highly dependent on the problem to which the processor is applied, and few, if any, low-cost computer processors are capable of performing the mathematical operations listed above at speeds required for some of today's more demanding applications. For example, MPEG data compression of an NTSC television signal can only be performed using expensive supercomputers or special-purpose hardware.
Many other applications, such as matrix transformations in real-time graphics, require data throughput rates that exceed the capabilities of inexpensive single processors, such as microprocessors and commercially available DSP chips. Instead, these applications require the use of costly multiprocessor or multiple-processor computers. Although multiprocessor computers typically have higher throughput rates, they also include complex instruction sets and are generally difficult to program.
One application which is particularly expensive in terms of the required computing power is the calculation of L1 norms. The L1 norm of a vector x and a vector y_i is defined as follows:

$$\mathrm{L1}(x, y_i) = |x_1 - y_{i1}| + |x_2 - y_{i2}| + |x_3 - y_{i3}| + \cdots + |x_n - y_{in}| \qquad \text{(Equation 1)}$$
where

$$x = (x_1, x_2, \ldots, x_n) \quad \text{and} \quad y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$$
Another way to express equation (1) is given in the following equation (4):

$$\mathrm{L1}(x, y_i) = \sum_{j=1}^{n} |x_j - y_{ij}| \qquad \text{(Equation 4)}$$
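For illustration, equation (1) can be written as a short routine that sums the absolute differences of corresponding components. This quantity is also commonly called the L1 or Manhattan distance between the two vectors. The function name `l1_norm` and the example values are assumptions made for this sketch.

```python
def l1_norm(x, y):
    """Sum of absolute component-wise differences between vectors x and y."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same number of components")
    return sum(abs(xj - yj) for xj, yj in zip(x, y))


# |1 - 4| + |5 - 3| + |2 - 2| = 3 + 2 + 0
distance = l1_norm([1, 5, 2], [4, 3, 2])
print(distance)
```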
The calculation of the L1 norm of two data vectors x and y_i is needed in a large variety of digital signal processing applications. A typical requirement is that the L1 norm is calculated for a very large number of vector pairs. Such a requirement cannot be fulfilled by state-of-the-art microprocessors or digital signal processors. Only supercomputers provide adequate computing power, but they are prohibitively expensive for most applications.
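To give a sense of the workload, the following sketch computes the L1 norm of one vector x against a set of vectors y_i and counts the scalar operations that equation (1) implies for m vector pairs of n components each. The names `l1_norms` and `operation_count`, and the example data, are hypothetical; n is assumed to be at least 1.

```python
def l1_norms(x, candidates):
    """Return the L1 norm of x with each vector y_i in candidates."""
    return [sum(abs(xj - yj) for xj, yj in zip(x, y)) for y in candidates]


def operation_count(n, m):
    """Scalar operations implied by equation (1) for m pairs of n-component vectors."""
    return {
        "subtractions": m * n,
        "absolute_values": m * n,
        "additions": m * (n - 1),  # n terms are joined by n - 1 additions
    }


x = [1, 5, 2]
candidates = [[4, 3, 2], [1, 5, 2], [0, 0, 0]]
distances = l1_norms(x, candidates)
print(distances)
print(operation_count(n=3, m=3))
```

The operation count grows with the product of m and n, so a very large number of vector pairs quickly exceeds what a single sequential processing element can deliver.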
Thus, there is a need for a method for effectively calculating an L1 norm and an improved parallel computer processor implementing such an improved method.