Many optimization techniques can be applied to software programs to improve performance. Most optimizations take the form of a transformation, or a series of transformations, of the program's structure to improve the exploitation of instruction-level parallelism and/or data locality. However, for a given program or algorithm, there are myriad possible transformations to choose from. Which of the possible transformations is best depends heavily on the architecture of the target processor and the features of the target system's memory hierarchy (e.g., the data cache). This means that many libraries of optimized software functions are not portable between different processor architectures, or even between members of the same central processing unit (CPU) family. Code must often be re-optimized every time the target platform changes. Such optimizations are particularly important in the presence of deep pipelines, which are common when floating-point arithmetic is used. There has been much research into architecture-adaptive code optimization. One approach is to enhance the compiler with an accurate model of the target architecture so that the effects of different transformations can be predicted. In practice, such models are difficult to devise and even more difficult to maintain.
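As an illustration of the kind of structural transformation discussed above, the following minimal sketch compares a straightforward matrix multiply with a tiled (blocked) version of the same loops. It is not drawn from any particular library: the function names and the `tile` parameter are hypothetical and chosen only for this example. Both versions compute the same result; the tiled version merely visits the data in a different order, and the tile size is exactly the sort of parameter whose best value would depend on the cache of the target platform.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, restructured to work on tile x tile blocks.

    `tile` is a hypothetical tuning parameter; its best value would
    depend on the memory hierarchy of the target machine.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):              # block of rows of a and c
        for kk in range(0, m, tile):          # block of the shared dimension
            for jj in range(0, p, tile):      # block of columns of b and c
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a_ik = a[i][k]        # reused across the inner loop
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

In an interpreted language the restructuring is unlikely to yield a speedup by itself; the sketch is intended only to show that the transformation preserves the result while changing the memory access pattern.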
A common computational requirement in mobile communication systems based on multiple-input, multiple-output (MIMO) techniques is matrix triangulation, which is the process of reducing a matrix to a form in which the elements below the leading diagonal are zero. Matrix triangulation techniques are also employed in radar, sonar, and other beamforming-related applications. There are several algorithms for matrix triangulation, including QR decomposition, singular-value decomposition, and Cholesky factorization. Each of these algorithms has several variant implementations, but a unifying feature is that the computational structure is triangular. The triangular structure arises because, once a matrix element has been reduced to zero, that element takes no further part in the calculations. Another common feature is a requirement for a wide dynamic range in the intermediate calculations. This makes floating-point number representation desirable, which in turn increases the importance of code optimization for achieving high performance.
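The triangular computational structure can be seen in a minimal sketch of one possible triangulation method, namely reduction by Givens rotations, which is one of the variants used to compute a QR decomposition. The function name `givens_triangularize` is hypothetical and the code is illustrative only; it is not presented as the method of any particular implementation. Note that the innermost loop starts at column `j`: elements already reduced to zero are skipped, so the amount of work shrinks with each column.

```python
import math


def givens_triangularize(a):
    """Return an upper-triangular copy of `a`, reduced by Givens rotations."""
    r = [[float(x) for x in row] for row in a]   # work on a copy
    rows, cols = len(r), len(r[0])
    for j in range(min(rows, cols)):             # column being cleared
        for i in range(rows - 1, j, -1):         # zero r[i][j], bottom up
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue                         # both entries already zero
            c, s = x / h, y / h                  # rotation cosine and sine
            # Columns left of j are already zero in both rows, so they
            # take no further part: the loop starts at j, not at 0.
            for k in range(j, cols):
                top, bot = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * top + s * bot
                r[i][k] = -s * top + c * bot
            r[i][j] = 0.0                        # remove rounding residue
    return r
```

Because each rotation is orthogonal, the product of the transposed result with the result equals the product of the transposed input with the input, up to rounding error; this gives a simple check on the output. The division by `h` and the accumulation of many rotations also hint at why a wide dynamic range is useful in the intermediate values.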
Present approaches to obtaining high performance from matrix triangulation algorithms in software are either to hand-optimize the code for a specific architecture, or to attempt to model the architecture within a sophisticated compilation environment capable of performing directed loop transformation. The former approach is labor-intensive and not future-proof; the latter is barely feasible given the current state of the art (such compilers do exist, but are either research projects or very expensive and/or domain-specific). While it is possible to design custom hardware to implement matrix triangulation algorithms, in most wireless systems the utilization would not be high enough to justify the cost of the custom hardware. Accordingly, there exists a need in the art for a method and apparatus for producing optimized matrix triangulation routines.