1. Field of the Invention
The present invention relates to a computer architecture and a method of mapping operations to the architecture. In particular, the present invention relates to a hierarchical reconfigurable computer architecture.
2. Discussion of the Related Art
The complexity of digital electronic products is increasing rapidly, but at the same time electronic product producers wish to reduce the time to market of their products, and to lower costs. Much of the time associated with bringing a product to market is spent in validating and testing hardware implementations.
Reconfigurable architectures provide a means of reducing the time to market by allowing designers to postpone commitment to a certain design until after silicon fabrication. Furthermore, updated designs can be loaded during the lifetime of a device to perform new functionality not envisaged at the time of first marketing the product. FPGAs (Field Programmable Gate Arrays) are an example of a reconfigurable architecture that operates at bit level and uses lookup tables, but they are unable to meet the high processing power requirements of modern designs. A new form of reconfigurable architecture has been proposed that is coarse-grained and comprises multiple processors, each operating at approximately word level, for example 12 or 16 bits, and each producing one or more words at its output.
A number of coarse-grained architectures have been proposed. The technical paper titled “MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications” (IEEE Transactions on Computers, vol. 49, no. 5, May 2000) describes a model for a reconfigurable computing system, targeted at applications with inherent data-parallelism. The proposed architecture is a SIMD (Single Instruction Multiple Data) architecture. The architecture comprises an eight by eight reconfigurable cell array divided into four blocks, or quadrants, each comprising four by four cells. Within the reconfigurable cell array, each cell may communicate directly with its four nearest neighbors. Some degree of second level connectivity is provided at the intra-quadrant level, wherein each cell can access the output of any other cell in its row or column within the same quadrant. Inter-quadrant express lanes provide further connectivity between cells in adjacent quadrants, allowing cells of a given row to output values to the cells of the same row in a different quadrant. Likewise, cells in a given column may output data directly to cells in the same column of a different quadrant.
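The first two levels of connectivity described above can be sketched as follows. This is a minimal illustration and not code from the MorphoSys paper: the function names are hypothetical, the array edges are assumed not to wrap around, and row and column access is assumed to be confined to a cell's own four by four quadrant. The express lanes are omitted.

```python
N = 8  # the array is N x N cells
Q = 4  # each quadrant is Q x Q cells


def quadrant(r, c):
    """Return the (row, column) index of the quadrant holding cell (r, c)."""
    return (r // Q, c // Q)


def nearest_neighbors(r, c):
    """First level: the up, down, left and right neighbors.

    Assumption: no wrap-around at the array edges.
    """
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return {(i, j) for i, j in candidates if 0 <= i < N and 0 <= j < N}


def quadrant_row_col(r, c):
    """Second level: every other cell in the same row or column,
    restricted to the cell's own quadrant (an assumption of this sketch)."""
    r0, c0 = (r // Q) * Q, (c // Q) * Q
    same_row = {(r, j) for j in range(c0, c0 + Q)}
    same_col = {(i, c) for i in range(r0, r0 + Q)}
    return (same_row | same_col) - {(r, c)}


def direct_sources(r, c):
    """Cells whose output cell (r, c) can read without express lanes."""
    return nearest_neighbors(r, c) | quadrant_row_col(r, c)
```

Under these assumptions a corner cell such as (0, 0) reads from six other cells, while a cell on a quadrant boundary such as (3, 3) reads from eight, two of which lie in adjacent quadrants. Any other transfer needs an express lane or several hops.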
Each reconfigurable cell comprises two input multiplexers, an ALU (Arithmetic Logic Unit) and multiplier block, a shift register, an output register, and a register file, all controlled by a context register. The context register in each cell receives context words from a central context memory, these words containing the signals for controlling the cell hardware.
The MorphoSys system has a number of disadvantages. Due to the design of the hardware in each cell, scheduling of operations within the cells is hard to control. The SIMD architecture having a central context memory is not suitable for irregular algorithms such as those used for deblocking filters. The array structure of the reconfigurable cells as well as the column and row interconnections between the cells is limiting for some requirements, and also reduces the scalability of the hardware as linear enlargement or reduction in size of the hardware is difficult.
There is a need for an efficient method of mapping designs to a coarse-grained architecture. The MorphoSys paper does not discuss in detail methods for mapping operations onto the MorphoSys architecture. However, the technical paper titled “DRESC: a Retargetable Compiler for Coarse-Grained Reconfigurable Architectures” discusses a compiling tool called DRESC (Dynamically Reconfigurable Embedded System Compiler), which is able to parse, analyze, transform, and schedule plain C source code to a family of compiler-friendly coarse-grained reconfigurable architectures. An architecture is proposed comprising an array of functional units and register files having nearest neighbor or column and row interconnectivity. The compiler itself comprises a Modulo Scheduling Algorithm stage that receives graphs representing both the program and the architecture, and then attempts to map the program graph to the architecture graph and to perform scheduling so as to achieve optimal performance with respect to all dependencies.
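By way of illustration only, the quantity that a modulo scheduler tries to minimize is the initiation interval, that is, the number of cycles between the starts of successive loop iterations. The sketch below computes the usual textbook lower bound on that interval. It is not taken from the DRESC paper: the function names are hypothetical, the functional units are assumed to be interchangeable, and each recurrence is assumed to be given as a (total latency, iteration distance) pair.

```python
from math import ceil


def res_mii(num_ops, num_units):
    """Resource bound: num_ops operations per iteration must share
    num_units interchangeable functional units (an assumption)."""
    return ceil(num_ops / num_units)


def rec_mii(cycles):
    """Recurrence bound: each dependence cycle, given as
    (total latency, iteration distance), limits how closely
    successive iterations can overlap. Returns 1 if there are none."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def min_ii(num_ops, num_units, cycles=()):
    """Lower bound on the initiation interval of a modulo schedule."""
    return max(res_mii(num_ops, num_units), rec_mii(cycles))
```

For example, a loop body of 20 operations on 16 units cannot start a new iteration more often than every 2 cycles, and a recurrence of latency 7 spanning 2 iterations raises that bound to 4. The bound ignores routing between units, which is the part that makes mapping to an array architecture costly.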
A disadvantage of the DRESC compiler is that, by performing mapping and scheduling as two separate steps, the compiler is slow and inefficient. Furthermore, this compiler is not able to process large applications because the complexity grows exponentially. When the complexity of the architecture increases, the execution time increases sharply and it may be impossible to find a satisfactory solution, or even any solution at all.
The technical paper titled “Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine” (International Conference on Architectural Support for Programming, pp. 45-57) presents an alternative coarse-grained architecture and compiler. The architecture comprises an array of tiles, each comprising a five-stage pipeline, interconnected over a pipelined, point-to-point network. Each node in the array comprises a switch connected to its processor and to its four neighbors. The compiler includes a data partitioning stage, a data and instruction placing stage, a communication code generating stage, and an event scheduling stage. Partitioning is performed to maximize instruction level parallelism.
The architecture of the RAW machine is not applicable to ASIC (Application Specific Integrated Circuit) designs, as the application granularity is that of a workstation (multi-task), each of the nodes in the RAW architecture being a full RISC (Reduced Instruction Set Computer) computer, and each being assigned a task. Thus the RAW machine is complex and demanding in resources, and is not easily scalable based on the tasks it is to perform. Furthermore, scheduling in RAW is dynamic, there being some asynchronism in the execution of the tasks, which also adds complexity. The compiler in RAW fails to tackle the problem of efficiently routing data within the network, and instead opts for a dynamic routing scheme in which no relevance is given to the distance between computers.
All of the architectures described above are further disadvantageous in that an increase in the number of processing nodes implies a significant increase in the distance between nodes, either due to an increase in the number of switches that data needs to traverse in the RAW machine, or due to the limited row and column interconnects of the MorphoSys proposal. Longer connections use more energy, and these architectures do not provide an efficient structure for reducing the distance between nodes, and thus lack scalability.
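The growth in distance can be illustrated with a simple model. The sketch below assumes a square array in which data moves only between nearest neighbors, and counts the hops between every pair of distinct nodes. It is a simplification that does not reproduce the exact interconnect of either the RAW machine or MorphoSys, and the function name is hypothetical.

```python
from itertools import combinations


def mesh_hops(n):
    """Return (mean, maximum) hop count between distinct nodes of an
    n x n nearest-neighbor mesh, for n >= 2.

    Assumption: one hop per step to an adjacent node, no wrap-around.
    """
    nodes = [(r, c) for r in range(n) for c in range(n)]
    dists = [
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1), (r2, c2) in combinations(nodes, 2)
    ]
    return sum(dists) / len(dists), max(dists)


if __name__ == "__main__":
    for n in (4, 8, 16):
        mean, worst = mesh_hops(n)
        print(f"{n * n:4d} nodes: mean {mean:.2f} hops, worst {worst} hops")
```

In this model the worst-case distance is 2(n - 1) hops, so the longest path doubles, approximately, each time the number of nodes is quadrupled, and the mean distance grows at a similar rate.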