1. Field of the Invention
The invention relates to a method and arrangement for distributing the computing load in data processing systems during execution of block-based computing instructions, as well as a corresponding computer program and a corresponding computer-readable storage medium, which can be used to uniformly distribute the computing load in processors for periodically occurring computing operations. A particular application lies in the field of digital processing of multimedia signals, in particular audio signals, video signals and the like.
2. Description of the Related Art
The described method can be used, inter alia, for computing the time-discrete convolution, which due to its versatility has gained considerable importance in the field of digital processing of audio signals. The time-discrete convolution can be used to implement any type of linear time-invariant system of arbitrary complexity by mathematically linking the impulse response of the system to the audio signal. The time-discrete convolution is particularly important in the simulation of reverberations of rooms, because reverberations are difficult to model due to their complexity and because the impulse response of a room can be obtained relatively easily. Impulse responses of rooms may sometimes be quite long, which causes the time-discrete convolution to be a very computing-intensive operation.
A computing instruction, which can be applied to the input signal in the time domain, is directly obtained from the definition of the convolution operation. The complexity for computing the time-discrete convolution scales linearly with the length of the impulse response. Depending on the power of the employed processor, the time-discrete convolution for long impulse responses can therefore frequently no longer be computed in real time.
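By way of illustration only, the time-domain computing instruction can be sketched as follows. The function name `direct_convolve` is an assumption introduced for this sketch and does not appear in the cited art. The inner loop runs once per impulse response coefficient, which is why the effort per sampled value grows linearly with the length of the impulse response.

```python
def direct_convolve(x, h):
    """Time-discrete convolution computed directly from its definition.

    x: input signal (list of numbers), h: impulse response (list of numbers).
    Every input sample is multiplied with every coefficient of h, so the
    work per sample is len(h) multiply-accumulate operations.
    """
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


# Small usage example with an arbitrary signal and a two-tap impulse response.
result = direct_convolve([1, 2, 3], [1, 1])
```

Here `result` holds the four output values of convolving a three-sample signal with a two-tap impulse response.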
A more efficient approach for computing the convolution is the so-called fast convolution, which is based on the equivalence of convolution in the time domain and multiplication in the frequency domain. Accordingly, impulse response and audio signal are transformed into the frequency domain, where they are multiplied, and the result is back-transformed into the time domain. The computing complexity can be significantly reduced by using the fast Fourier transform (FFT), because the computing complexity per sampled value no longer scales linearly with the length of the impulse response, but instead only logarithmically. Because the entire audio signal must be available for the fast convolution before the transformation can be started, the fast convolution in the aforedescribed form is not suitable for real-time computations.
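The fast convolution can be sketched, purely as an example, with a simple recursive radix-2 FFT. The names `fft` and `fast_convolve` are assumptions for this sketch; a practical system would use an optimized FFT routine. Note that the whole input signal `x` is passed in at once, which reflects the limitation for real-time use described above.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    The inverse transform is left unnormalized; the caller divides by the length.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fast_convolve(x, h):
    """Convolve x with h by multiplication in the frequency domain."""
    m = len(x) + len(h) - 1          # length of the linear convolution
    size = 1
    while size < m:                  # zero-pad to the next power of two
        size *= 2
    X = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    H = fft([complex(v) for v in h] + [0j] * (size - len(h)))
    Y = fft([a * b for a, b in zip(X, H)], inverse=True)
    return [v.real / size for v in Y[:m]]
```

Up to rounding errors, `fast_convolve` returns the same values as a direct evaluation of the convolution sum.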
Other methods are disclosed in the publication Kulp, Barry D.: “Digital Equalization Using Fourier Transform Techniques”, 85th AES Convention, Paper Number 2694; 1988, wherein impulse response and audio signal can be divided into several blocks for performing a fast convolution segment by segment. The convolution with a segment of the impulse response will be referred to hereinafter as “segment convolution”. The time-discrete convolution can be efficiently computed in real time by suitable selection of the segmentation pattern.
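The principle of the segment convolution can be illustrated by the following sketch, which is not taken from the cited publication. The impulse response is divided into segments of an assumed length `seg_len`, each segment is convolved with the input signal, and the partial results are added with a delay equal to the offset of the segment. For brevity, each segment convolution is computed here in the time domain with a helper `convolve`; in practice each segment convolution would be carried out as a fast convolution.

```python
def convolve(x, h):
    """Plain time-domain convolution, used here as a stand-in for a fast convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def segmented_convolve(x, h, seg_len):
    """Convolve x with h by summing delayed segment convolutions."""
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(h), seg_len):
        segment = h[start:start + seg_len]   # one segment of the impulse response
        part = convolve(x, segment)          # one "segment convolution"
        for n, v in enumerate(part):
            y[start + n] += v                # delay by the offset of the segment
    return y


# Arbitrary example data: both approaches yield the same output signal.
x = [1, 2, 3, 4]
h = [1, 0, -1, 2, 5]
same = segmented_convolve(x, h, seg_len=2) == convolve(x, h)
```

Because the convolution is linear, the sum of the delayed segment convolutions equals the convolution with the complete impulse response, so `same` is true for this example.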
A disadvantage of this method is the distribution of the computing load. As shown in FIG. 1, “input and output phases” (brick-pattern hatching) alternate in this method with “computing phases” (wavy hatching). During the input and output phases, sampled values from the input signal stream are collected for the subsequent computing phase and the output signal stream is filled with the result from the preceding computing phase. The next computing phase always commences when N sampled values have been collected, so that a further FFT with the block size 2*N can be performed. (With the overlap-add and overlap-save methods, the FFT length must be twice the size of the individual time segments.)
A load distribution of this type is very unfavorable for real-time processing, because the processing power of the executing system must be dimensioned such that the system is not overloaded during the computing phases. As a result, the processing power must be significantly higher than would otherwise be required with a uniform capacity utilization of the system.
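The effect on dimensioning can be made concrete with a rough sketch. The cost model used below (a burst of 2*N*log2(2*N) operations at the end of every block of N sampled values, and no load in between) is an assumption chosen only for illustration, as are the names `block_load_profile`, `peak` and `average`.

```python
import math


def block_load_profile(block_size, num_blocks):
    """Return an assumed load value for every sample period.

    During the input and output phase the load is taken as zero; when
    block_size samples have been collected, one FFT of length
    2 * block_size is assumed to cost fft_len * log2(fft_len) operations.
    """
    fft_len = 2 * block_size
    burst = fft_len * math.log2(fft_len)     # assumed cost of one computing phase
    profile = []
    for _ in range(num_blocks):
        profile.extend([0.0] * (block_size - 1))  # collecting samples
        profile.append(burst)                     # computing phase
    return profile


profile = block_load_profile(block_size=8, num_blocks=4)
peak = max(profile)                    # load the processor must be able to handle
average = sum(profile) / len(profile)  # load under uniform capacity utilization
```

Under this model the peak load exceeds the average load by a factor equal to the block size, which indicates how much processing power is held in reserve solely for the computing phases.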
A method for improved load distribution is disclosed in the publication Gardner, William G.: “Efficient Convolution without Input-Output Delay”, JAES Volume 43, Issue 3, pp. 127-136; March 1995, which computes the different segment convolutions asynchronously with respect to the processing of the input and output signal stream. The input and output phases are hereby separated from the computing phases. However, because the input and output phases need the results from the computing phases, a synchronization must be performed at certain times in spite of the asynchronous computation so as to ensure that the sampled values to be outputted are completely computed at the time they are outputted. Such synchronization can sometimes be time-consuming and is frequently not compatible with the boundary conditions which govern execution of the method. For example, if the method is executed on a computer with a multithreading operating system, then the real-time thread must never have to wait for another thread, because the real-time capability of the system could then no longer be guaranteed. In particular, the distribution of the computing capacity would then be left to the operating system and could not be influenced by the implementation of the method. In addition, according to this method, only the computing phases of different segment convolutions are temporally interleaved with one another. In particular, this method does not optimally utilize the available computing capacity for arrangements with a limited number of segment convolutions.
Another method for improved load distribution in block-based algorithms is described in the publication US 2003/0046637 A1. This system, which is a decoder, takes as its input a sequence of blocks of data. A certain predefined integer number of these blocks form a code word that needs to be decoded in a number of decoding steps. For each step, a specific algorithm is used, and the steps have to be processed in chronological order. According to US 2003/0046637 A1, each algorithm is broken down into the same number of sub-steps as there are blocks in a code word. In this case, every time the system receives one new code block, it will process one sub-step of one of the several decoding steps for one of the already received code words. This results in a constant computing load over time, where an entire block is the smallest unit which is considered. A load distribution within a block is not considered. However, because only one sub-step is processed every time a code block arrives, the overall latency of the system is equal to X times the code word length, where X is the number of decoding steps. In a real-time system, latency is often a crucial parameter and needs to be minimized; as a result, this approach can be unsuitable.