1. Field of the Invention
The invention relates to control flow and memory management optimization, and more particularly to control flow which is optimized in a digital system-level description for faster system simulation and for storage requirements within hardware or software design implementation/synthesis.
2. Description of the Background Art
Many real-time multi-dimensional signal processing (RMSP) algorithms, such as those used in speech, image, and video processing, front-end telecom, and numerical computing systems, exhibit a large amount of control flow, including many loops, and multi-dimensional signals. This is especially so if the applications are described in an applicative or functional way using conventional computer languages like Silage, Signal, or even the industrial standards VHDL and Verilog. The presence of multi-dimensional (M-D) signals and complex nested loops in these applications heavily affects the memory cost for the final hardware (architecture) realization. This leads to a severe memory bottleneck when such algorithms have to be mapped from a behavioral specification into some realization.
A similar problem occurs in mapping applications on predefined instruction-set processors or when the system designer wants to verify his initial RMSP specification by means of pseudo-exhaustive (software-based) simulation on a workstation or (parallel) digital signal processing (DSP) board. Here, the limitations on available memory size and bandwidth require a careful study of the memory organization for RMSP. The available (physical) memory on the processor in the personal computer, workstation, or emulation board is generally insufficient to contain all the intermediate signals. As a result, swapping will occur which heavily influences the elapsed time for the simulation.
Hence, for both hardware and software, the most dominant effect on area and power related to the processing of N-bit multi-dimensional data or signals lies in the memory organization. Memory is a major cost issue when such a specification is mapped on (application-specific) hardware with traditional synthesis techniques. Generally, between 50 and 80% of the area cost in customized architectures for RMSP is due to memory units, i.e., single or multi-port random access memories (RAMs), pointer-addressed memories, and register files.
In terms of power consumption, the transfer count directly influences the number of transitions of the large capacitance induced by the path between the "arithmetic processing" and the memory. The maximal number of words alive directly relates to the memory size (and thus area) but indirectly also influences power consumption because the capacitive load, which toggles on every transition, will increase at least proportionally to the memory size. This direct memory-related cost accounts for a large part of the total power cost in most systems. Indirectly, this factor becomes even larger when the effect on the clock distribution network is incorporated.
The problem of excessive memory use is not unique to applicative specification languages. It applies equally well to non-optimized procedural specifications in languages such as C, Fortran, Pascal, and procedural VHDL.
With these problems in mind, it is important to reduce the size of the M-D signal storage in physical memory during system-level hardware synthesis, software compilation, and simulation/emulation. When describing an algorithm involving multi-dimensional signals in a non-procedural (applicative) language, such as Silage, or equivalently, when a complete optimized ordering of the original description is not provided by the designer, optimizing memory management becomes theoretically intractable. One way of solving the problem under these constraints is by means of an accurate data-flow analysis preceding the storage minimization. These methods lie in the domain of optimizing (parallel) compiler theory. However, the published methods handle only the detection of the dependencies, because in parallelism detection, a simplified data-flow analysis providing a "yes/no" answer is sufficient. In contrast, minimizing the total storage cost requires knowledge about the exact number of dependencies. Consequently, besides the qualitative aspects, the data-flow analysis must be provided with quantitative capabilities. Moreover, the compiler approaches are not sufficient for many irregular image and speech processing applications. Finally, the compiler-oriented approaches have not investigated automated steering mechanisms to modify the control flow in order to arrive at a better description, e.g., one with a lower storage requirement. The latter task is even more difficult than the data-flow analysis as such.
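The distinction between a qualitative ("yes/no") and a quantitative dependency analysis can be illustrated with a minimal Python sketch. The loop nest B[i][j] = A[i][j] + A[i-1][j], the function names, and the brute-force enumeration are hypothetical and chosen only for illustration; they are not taken from any of the published methods referred to above, and explicit enumeration of this kind does not scale to realistic loop nests.

```python
from itertools import product


def dependencies(n_rows, n_cols):
    """Enumerate (producer, consumer) pairs for a hypothetical loop nest:

        for i in 1 .. n_rows-1, for j in 0 .. n_cols-1:
            B[i][j] = A[i][j] + A[i-1][j]

    Each read of an element of A by a definition of B is one dependency.
    """
    deps = []
    for i, j in product(range(1, n_rows), range(n_cols)):
        deps.append((("A", i, j), ("B", i, j)))
        deps.append((("A", i - 1, j), ("B", i, j)))
    return deps


def has_dependency(n_rows, n_cols):
    """Qualitative answer: does B depend on A at all?"""
    return len(dependencies(n_rows, n_cols)) > 0


def count_dependencies(n_rows, n_cols):
    """Quantitative answer: how many individual dependencies exist?"""
    return len(dependencies(n_rows, n_cols))


if __name__ == "__main__":
    print(has_dependency(4, 3))
    print(count_dependencies(4, 3))
```

The qualitative answer is the same for every non-trivial iteration space, whereas the count grows with the iterator ranges, which is the information a storage estimate would need.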
Memory problems also exist in RMSP compilation from the system-level specification description (e.g., Silage) to a target language (e.g., C). Generally a compiled-code technique gives rise to much better results for RMSP evaluation than event-driven simulators. Unfortunately, it requires a preprocessing step involving data-flow analysis which is very complex for true M-D signals. The few existing memory management approaches that deal with the compilation of (applicative) RMSP specifications are based on explicit unrolling of loops to turn them into standard scalar compilation, or a preprocessing step based on symbolic simulation to analyze the dependencies. The former approach is not sufficient to handle even medium-size loop nests. The latter does not require the execution of the actual operations on the signals, but only the investigation of the production (in the left-hand side of a definition) and consumption (in the right-hand side of a definition) of signals. This may be effective for some audio, speech, and telecom applications but becomes a problem when the loop depth increases beyond three and when the iterator range exceeds a few hundred, as occurs in image and video processing.
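A symbolic simulation that tracks only the production and consumption of signal elements can be sketched as follows. This is a simplified Python illustration under stated assumptions, not a reproduction of any existing tool: the schedule representation, the function names, and the example definitions A[i] and B[i] = f(A[i]) are all hypothetical. The sketch walks through a statement ordering without executing any arithmetic and reports the maximal number of intermediate words alive at the same time.

```python
def max_words_alive(schedule):
    """Return the peak number of intermediate signal elements alive.

    schedule is a list of (produced, consumed) pairs in execution order,
    where produced is one element name and consumed is a tuple of element
    names. An element is alive from its production until its last
    consumption; elements that are never consumed are treated as outputs
    and are not counted.
    """
    # First pass: record the last step at which each element is read.
    last_use = {}
    for step, (_, consumed) in enumerate(schedule):
        for elem in consumed:
            last_use[elem] = step

    # Second pass: sweep the schedule and track the set of live elements.
    alive = set()
    peak = 0
    for step, (produced, consumed) in enumerate(schedule):
        if produced in last_use:
            alive.add(produced)
        peak = max(peak, len(alive))
        for elem in consumed:
            if last_use[elem] == step:
                alive.discard(elem)
    return peak


def separate_loops(n):
    """All A[i] are produced first, then all B[i] = f(A[i]) are computed."""
    produce = [(("A", i), ()) for i in range(n)]
    consume = [(("B", i), (("A", i),)) for i in range(n)]
    return produce + consume


def merged_loop(n):
    """Each A[i] is consumed by B[i] immediately after it is produced."""
    schedule = []
    for i in range(n):
        schedule.append((("A", i), ()))
        schedule.append((("B", i), (("A", i),)))
    return schedule


if __name__ == "__main__":
    print(max_words_alive(separate_loops(100)))
    print(max_words_alive(merged_loop(100)))
```

In this toy case the ordering with two separate loops keeps every element of A alive at once, while the merged ordering needs storage for a single element. Because the sketch visits every iteration explicitly, its run time grows with the product of the iterator ranges, which reflects the scaling problem noted above for deep loop nests with large ranges.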
In a high-level synthesis context, little work has been performed in the area of memory management for M-D signals. Efforts in this field have concentrated on memory allocation and in-place storage, and on address cost reduction. The conventional schemes have focused on the execution of the individual loop transformations based on M-D data-flow analysis, on foreground register allocation, on system simulators intended for applications where the loops can be (implicitly) unrolled, or on the actual memory allocation/in-place storage reduction/address generation tasks for background memories. In the PHIDEO Scheduler, only the starting times for the statements are optimized for data-path and memory cost during scheduling, whereas the periods and the lengths of the original stream signals remain unaffected. See P. Lippens, J. van Meerbergen, A. van der Werf, W. Verhaegh, B. McSweeney, J. Huisken, O. McArdle, "PHIDEO: a silicon compiler for high speed algorithms", Proc. European Design Autom. Conf., Amsterdam, The Netherlands, pp. 436-441, February 1991. Global control flow optimizations intended to reduce the total memory cost for application-specific architectures have not been addressed.