The invention relates to architectures for implementing discrete wavelet transforms (DWTs). It is applicable to any field in which DWTs may be used and is particularly, but not exclusively, concerned with architectures used in the fields of digital signal and image processing, data compression, multimedia, and communications.
A list of documents is given at the end of this description. These documents are referred to in the following by their corresponding numeral in square brackets.
The Discrete Wavelet Transform (DWT) [1]-[4] is a mathematical technique that decomposes an input signal of length N=r×k^m in the time domain by using dilated/contracted and translated versions of a single basis function, named the prototype wavelet. In one particular case N=2^m. DWTs can be performed using Haar wavelets, Hadamard wavelets and wavelet packets. Decomposition by Haar wavelets involves low-pass and high-pass filtering followed by downsampling by two of both resultant bands and repeated decomposition of the low-frequency band to J levels or octaves.
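The decomposition just described can be illustrated with a minimal software sketch. It is not taken from any of the cited documents. The function name haar_dwt, the orthonormal scaling by 1/sqrt(2) and the sign convention of the high-pass branch are assumptions chosen for illustration only.

```python
import math

def haar_dwt(signal, levels):
    """Hypothetical helper: J-octave Haar decomposition of a sequence.

    Returns (approximation, details), where details[j-1] holds the
    high-pass band of octave j. The 1/sqrt(2) scaling is an assumed
    (orthonormal) normalisation.
    """
    if len(signal) % (1 << levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    scale = 1.0 / math.sqrt(2.0)
    approx = list(signal)
    details = []
    for _ in range(levels):
        half = len(approx) // 2
        # Low-pass filtering followed by downsampling by two.
        low = [(approx[2 * i] + approx[2 * i + 1]) * scale for i in range(half)]
        # High-pass filtering followed by downsampling by two.
        high = [(approx[2 * i] - approx[2 * i + 1]) * scale for i in range(half)]
        details.append(high)
        # Only the low-frequency band is decomposed again in the next octave.
        approx = low
    return approx, details

approx, details = haar_dwt([4, 2, 6, 8], 2)
```

Each octave halves the amount of data passed on, so the total number of output coefficients equals the input length N.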
In the last decade, the DWT has often been found preferable to other traditional signal processing techniques since it offers useful features such as inherent scalability, computational complexity of O(N) (where N is the length of the processed sequence), low aliasing distortion for signal processing applications, and adaptive time-frequency windows. Hence, the DWT has been studied and applied to a wide range of applications including numerical analysis [5]-[6], biomedicine [7], image and video processing [1], [8]-[9], signal processing techniques [10] and speech compression/decompression [11]. DWT based compression methods have become the basis of such international standards as JPEG 2000 and MPEG-4.
In many of these applications, real-time processing is required in order to achieve useful results. Even though DWTs possess linear complexity, many applications cannot be handled by software solutions only. DWT implementations using digital signal processors (DSPs) improve computation speeds significantly, and are sufficient for some applications. However, in many applications software DWT implementations on general purpose processors or hardware implementations on DSPs such as TMS320C6x are too slow. Therefore, the implementation of the DWT by means of dedicated very large scale integrated (VLSI) Application Specific Integrated Circuits (ASICs) has recently captivated the attention of a number of researchers, and a number of DWT architectures have been proposed [12]-[24]. Some of these devices have been targeted to have a low hardware complexity. However, they require at least 2N clock cycles (cc's) to compute the DWT of a sequence having N samples. Nevertheless, devices have been designed having a period of approximately N cc's (e.g., the three architectures in [14] when they are provided with doubled hardware, the architecture A1 in [15], the architectures in [16]-[18], the parallel filter in [19], etc.). Most of these architectures use the Recursive Pyramid Algorithm (RPA) [26], or similar scheduling techniques, in order both to reduce the memory requirement and to employ only one or two filter units, independently of the number of decomposition levels (octaves) to be computed. This is done by producing each output at the “earliest” instance that it can be produced [26].
Architectures [17], [18] consist of only two pipeline stages where the first pipeline stage implements the first DWT octave and the second stage implements all of the following octaves based on the RPA. Even though the architectures of [17] and [18] operate at approximately 100% hardware utilisation for a large enough number of DWT octaves, they have complex control and/or memory requirements. Furthermore, because they employ only two pipelining stages, they have relatively low speeds. The highest throughput achieved in conventional architectures is N=2^m clock cycles for implementing a 2^m-point DWT. Approximately 100% hardware utilisation and higher throughput is achieved in previously proposed architectures [31], [32].
The demand for low power VLSI circuits in modern mobile/visual communication systems is increasing. Improvements in VLSI technology have considerably reduced the cost of the hardware. Therefore, it is often worthwhile reducing the period, even at the cost of increasing the amount of hardware. One reason is that low-period devices consume less power. For instance, a device D having a period T=N/2 cc's can be employed to perform processing which is twice as fast as a device D′ having a period T′=N cc's. Alternatively, if the device D is clocked at a frequency f then it can achieve the same performance as the device D′ clocked at a frequency f′=2f. Therefore, for the device D the supply voltage (linear with respect to f) and the power dissipation (linear with respect to f^2) can be reduced by factors of 2 and 4 respectively with respect to those of the device D′ [27].
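The arithmetic of this example can be checked with a short sketch. It assumes exactly the scaling model stated above (supply voltage proportional to f, power dissipation proportional to f^2) and nothing more. The function name reduction_factors and its parameter names are hypothetical.

```python
def reduction_factors(period_fast_cc, period_slow_cc):
    """Hypothetical helper: factors by which the faster device can lower its
    clock frequency, supply voltage and power dissipation while matching the
    throughput of the slower device.

    Assumes voltage scales linearly with frequency and power with the
    square of frequency, as stated in the text.
    """
    # Ratio f / f' needed for equal processing time: fewer cycles per
    # transform means a proportionally lower clock is sufficient.
    freq_ratio = period_fast_cc / period_slow_cc
    return {
        "frequency": 1.0 / freq_ratio,
        "voltage": 1.0 / freq_ratio,       # linear in f
        "power": 1.0 / freq_ratio ** 2,    # quadratic in f
    }

# Device D with period N/2 versus device D' with period N, for N = 1024.
factors = reduction_factors(512, 1024)
```

For the two devices in the example, the sketch gives reduction factors of 2 for the voltage and 4 for the power, in agreement with the figures quoted above.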
High throughput architectures typically make use of pipelining or parallelism in which the DWT octaves are implemented with a pipeline consisting of similar hardware units (pipeline stages). Even though pipelining has already been exploited by existing DWT architectures (e.g., those in [12], [23]-[24]), the fastest pipelined designs need at least N time units to implement an N-point DWT.
Most of the known designs for implementation of DWTs are based on the tree-structured filter bank representation of DWT shown in FIG. 1 where there are several (J) stages (or octaves) of signal decomposition each followed by down-sampling by a factor of two. As a consequence of downsampling, the amount of data input to each subsequent decomposition stage is half the amount input to the immediately previous decomposition stage. This makes the hardware of decomposition stages in a typical pipelined device designed to implement DWT using the tree-structured approach heavily under-utilised, since the stage implementing the octave j=1, . . . , J is usually clocked at a frequency 2^(j−1) times lower than the clock frequency used in the first octave [24]. This under-utilisation comes from a poor balancing of the pipeline stages when they implement the DWT octaves and leads to a low efficiency.
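The extent of this under-utilisation can be estimated with a simple sketch. It assumes, purely for illustration, that every pipeline stage contains the same amount of hardware and that the stage for octave j is busy for a fraction 1/2^(j−1) of the time, in line with the halving of input data described above. The function names stage_inputs and average_utilisation are hypothetical.

```python
from fractions import Fraction

def stage_inputs(n, octaves):
    """Number of samples entering each octave j = 1..J for an n-point DWT."""
    return [n >> (j - 1) for j in range(1, octaves + 1)]

def average_utilisation(octaves):
    """Mean utilisation over J identical stages (an assumed model).

    Stage j is taken to be busy for 1/2**(j-1) of the time, so the mean
    is (2 - 2**(1-J)) / J.
    """
    per_stage = [Fraction(1, 2 ** (j - 1)) for j in range(1, octaves + 1)]
    return sum(per_stage) / octaves

inputs = stage_inputs(16, 3)      # samples entering octaves 1, 2 and 3
mean_use = average_utilisation(3) # fraction of the hardware kept busy
```

Under these assumptions the average utilisation falls as the number of octaves grows, since each added stage is busy for only half as long as the one before it.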
In [30] a pipeline architecture has been proposed based on the tree-structured filter bank representation which achieves approximately 100% hardware utilisation and a throughput of N/2=2^(m−1) clock cycles for a 2^m-point DWT. This involves a J-stage pipeline using, as far as possible, half as many processing units from one stage to the next stage.
Known parallel or pipelined architectures essentially depend on DWT parameters such as input length, the number of octaves, the length and, in some cases, the actual coefficient values of the low-pass and high-pass filters. For larger values of these parameters, these architectures can be very large. Furthermore, it is only possible to implement a DWT with fixed parameters within a given hardware realization of a given architecture. However, in JPEG 2000, a DWT is separately applied to tiles of an image, in which the sizes of the tiles may vary from one to 2^32−1. The number of octaves of decomposition may vary from 0 to 255 for different tiles. Thus it is desirable to have a device capable of implementing DWTs with varying parameters or, in other words, a unified device that is relatively independent of the DWT parameters. Designing such a device is straightforward in the case of serial architectures. It is not so straightforward in the case of parallel or pipelined architectures.
Most of the conventional architectures [12]-[26] employ a number of multipliers and adders proportional to the length of the DWT filters. Even though some architectures [17], [18] are capable of implementing DWTs with a varying number of octaves, their efficiency decreases rapidly as the number of octaves increases.