A reconfigurable architecture in the present context is understood to refer to modules or units (VPUs) having a configurable function and/or interconnection, in particular integrated modules having a plurality of arithmetic and/or logic and/or analog and/or memory and/or internal/external interconnecting modules in one or more dimensions interconnected directly or via a bus system.
Conventional types of such modules include, for example, systolic arrays, neural networks, multiprocessor systems, processors having a plurality of arithmetic units and/or logic cells and/or communicative/peripheral cells (IO), interconnection and network modules such as crossbar switches, and conventional modules of FPGA, DPGA, Chameleon, XPUTER, etc. Reference is made in this connection to the following patents and patent applications: P 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483 A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1, DE 198 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1, PCT/DE 00/01869, DE 100 36 627 A1, DE 100 28 397 A1, DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516, EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1, DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, PCT/EP 02/02398, DE 102 40 000, DE 102 02 044, DE 102 02 175, DE 101 29 237, DE 101 42 904, DE 101 35 210, EP 01 129 923, PCT/EP 02/10084, DE 102 12 622, DE 102 36 271, DE 102 12 621, EP 02 009 868, DE 102 36 272, DE 102 41 812, DE 102 36 269, DE 102 43 322, EP 02 022 692, DE 103 00 380, DE 103 10 195, EP 02 001 331, and EP 02 027 277. The full content of these documents is herewith incorporated for disclosure purposes.
The architecture mentioned above is used as an example for clarification and is referred to below as a VPU. This architecture is composed of any, typically coarse-grained, arithmetic cells, logic cells (including memories), and/or memory cells and/or interconnection cells and/or communicative/peripheral (IO) cells (PAEs), which may be arranged in a one-dimensional or multi-dimensional matrix (PA). The matrix may have different cells of any design; the bus systems are also understood to be cells here. A configuration unit (CT), which stipulates the interconnection and function of the PA through configuration, is assigned to the matrix as a whole or to parts thereof. A finely granular control logic may be provided. Various methods are known for coupling reconfigurable processors with standard processors. They usually involve a loose coupling. In many regards, the type and manner of coupling still need further improvement; the same is true for compiler methods and/or operating methods provided for joint execution of programs on combinations of reconfigurable processors and standard processors.
The limitations of conventional processors are becoming more and more evident. The growing importance of stream-based applications makes coarse-grain dynamically reconfigurable architectures an attractive alternative. See, e.g., R. Hartenstein, R. Kress, & H. Reinig, “A new FPGA architecture for word-oriented datapaths,” Proc. FPL '94, Springer LNCS, September 1994, at 849; E. Waingold et al., “Baring it all to software: Raw machines,” IEEE Computer, September 1997, at 86-93; PACT Corporation, “The XPP Communication System,” Technical Report 15 (2000); see generally http://www.pactcorp.com. Such architectures combine the performance of ASICs, which are very risky and expensive (development and mask costs), with the flexibility of traditional processors. See, for example, J. Becker, “Configurable Systems-on-Chip (CSoC),” (Invited Tutorial), Proc. of XV Brazilian Symposium on Integrated Circuit Design (SBCCI 2002), (September 2002).
The datapaths of modern microprocessors reach their limits because they use static instruction sets. In spite of the possibilities that exist today in VLSI development, the basic concepts of microprocessor architectures are the same as 20 years ago. The main processing unit of modern conventional microprocessors, the datapath, follows in its actual structure the same style guidelines as its predecessors. Although the development of pipelined architectures or superscalar concepts in combination with data and instruction caches increases the performance of a modern microprocessor and allows higher clock frequencies, the main concept of a static datapath remains. Therefore, each operation is a composition of the basic instructions that the processor in use provides. The benefit of the processor concept lies in its ability to execute strongly control-dominant applications. Data- or stream-oriented applications are not well suited to this environment. Sequential instruction execution is not the right target for that kind of application and needs high bandwidth because instructions and data are permanently retransmitted from and to memory. This handicap is often eased by the use of caches in various stages. A sequential interconnection of filters that perform data manipulation without writing back the intermediate results would provide the right optimization and reduce the required bandwidth. In practice, this kind of filter chain should be constructed logically and configured during runtime. Existing approaches to extending instruction sets use static modules that are not modifiable during runtime.
Customized microprocessors or ASICs are optimized for one special application environment. It is nearly impossible to use the same microprocessor core for another application without losing the performance gain of this architecture.
A new approach to a flexible, high-performance datapath concept is needed, one that allows the functionality to be reconfigured and makes the core largely application-independent without losing the performance needed for stream-based applications.
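The bandwidth argument for a chain of filters can be illustrated with a minimal sketch. It is not taken from the cited art: the function names `run_with_writeback` and `run_chained`, the three example filters, and the simple cost model (one transfer per word read from or written to memory) are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

Filter = Callable[[int], int]


def run_with_writeback(data: List[int], filters: List[Filter]) -> Tuple[List[int], int]:
    """Apply one filter at a time over the whole buffer.

    Every intermediate result is written back to memory and read again
    by the next filter. Returns the result and the transfer count.
    """
    transfers = 0
    buffer = list(data)
    for f in filters:
        next_buffer = []
        for word in buffer:
            transfers += 1              # read the operand from memory
            next_buffer.append(f(word))
            transfers += 1              # write the intermediate result back
        buffer = next_buffer
    return buffer, transfers


def run_chained(data: List[int], filters: List[Filter]) -> Tuple[List[int], int]:
    """Pass each word through the whole filter chain before storing it.

    Intermediate values never leave the (hypothetical) datapath, so each
    word is read once and written once.
    """
    transfers = 0
    output = []
    for word in data:
        transfers += 1                  # read the input word once
        for f in filters:
            word = f(word)              # intermediate value stays in the chain
        output.append(word)
        transfers += 1                  # write only the final result
    return output, transfers


# The chain is an ordinary list, so it can be assembled or changed at runtime.
chain: List[Filter] = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = [1, 2, 3, 4]

result_a, transfers_a = run_with_writeback(data, chain)
result_b, transfers_b = run_chained(data, chain)
print(result_a == result_b, transfers_a, transfers_b)
```

Under this cost model, four words and three filters need 24 transfers with write-back and 8 transfers when chained, while both variants produce the same result.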
When using a reconfigurable array, it is desirable to optimize the way in which the array is coupled to other units, e.g., to a processor if the array is used as a coprocessor. It is also desirable to optimize the way in which the array is configured.
Further, WO 00/49496 discusses a method for execution of a computer program using a processor that includes a configurable functional unit capable of executing reconfigurable instructions, which can be redefined at runtime. A problem with conventional processor architectures exists if a coupling of, for example, sequential processors is needed and/or technologies such as data streaming, hyper-threading, multi-threading, multi-tasking, execution of parts of configurations, etc., are to be a useful way of enhancing performance. Techniques discussed in prior art, such as WO 02/50665 A1, do not allow for a sufficiently efficient way of providing for a data exchange between the ALU of a CPU and the configurable data processing logic cell field, such as an FPGA, DSP, or other such arrangement. In the prior art, the data exchange is effected via registers. In other words, it is necessary to first write data into a register sequentially, then retrieve them sequentially, and restore them sequentially as well.
Another problem exists if an external access to data is requested in known devices used, inter alia, to implement functions in the configurable data processing logic cell field, DFP, FPGA, etc., that cannot be processed sufficiently on a CPU-integrated ALU. Accordingly, the data processing logic cell field is practically used to allow for user-defined opcodes that can process data more efficiently than is possible on the ALU of the CPU without further support by the data processing logic cell field. In the prior art, the coupling is generally word-based, not block-based. A more efficient data processing, in particular more efficient than possible with a close coupling via registers, is highly desirable.
Another method for the use of logic cell fields that include coarse- and/or fine-granular logic cells and logic cell elements provides for a very loose coupling of such a field to a conventional CPU and/or a CPU-core in embedded systems. In this regard, a conventional sequential program can be executed on the CPU, for example a program written in C, C++, etc., wherein the instantiation or the data stream processing by the fine- and/or coarse-granular data processing logic cell field is effected via that sequential program. However, a problem exists in that for programming said logic cell field, a program not written in C or another sequential high-level language must be provided for the data stream processing. It is desirable to allow for C-programs to run both on a conventional CPU-architecture as well as on the data processing logic cell field operated therewith, in particular, despite the fact that a quasi-sequential program execution should maintain the capability of data-streaming in the data processing logic cell fields, whereas simultaneously the capability exists to operate the CPU in a not too loosely coupled way.
It is already known to provide for sequential data processing within a data processing logic cell field. See, for example, DE 196 51 075, WO 98/26356, DE 196 54 846, WO 98/29952, DE 197 04 728, WO 98/35299, DE 199 26 538, WO 00/77652, and DE 102 12 621. Partial execution is achieved within a single configuration, for example, to reduce the amount of resources needed, to optimize the time of execution, etc. However, this does not lead automatically to allowing a programmer to translate or transfer high-level language code automatically onto a data processing logic cell field as is the case in common machine models for sequential processes. The compilation, transfer, or translation of a high-level language code onto data processing logic cell fields according to the methods known for models of sequentially executing machines is difficult.
In the prior art, it is further known that configurations that each effect a different function on a respective part of the array can be executed simultaneously on the processing array and that one or more configurations can be changed at run-time without disturbing other configurations. Methods and hardware-implemented means are known that ensure that partial configurations to be loaded onto the array can be executed without deadlock. Reference is made to DE 196 54 593, WO 98/31102, DE 198 07 872, WO 99/44147, DE 199 26 538, WO 00/77652, DE 100 28 397, and WO 02/13000. This technology permits a certain degree of parallelism and, given suitable forms and interrelations of the configurations or partial configurations, a certain kind of multitasking/multi-threading, in particular in such a way that planning, i.e., scheduling and/or time use planning control, can be provided. Furthermore, time use planning control means and methods are known from the prior art that, at least under a corresponding interrelation of configurations and/or assignment of configurations to certain tasks and/or threads to configurations and/or sequences of configurations, allow for multi-tasking and/or multi-threading.
With respect to a design of logic cell fields, reference is made here to the XPP architecture and previously published patent applications as well as more recent patent applications by the present applicant, these documents being fully incorporated herewith for disclosure purposes. The following documents should thus be mentioned in particular: DE 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483 A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1, DE 198 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1, PCT/DE 00/01869 (now U.S. Pat. No. 8,230,411), DE 100 36 627 A1, DE 100 28 397 A1, DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516, EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1, DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, PCT/EP 02/02398 (now U.S. Pat. No. 7,581,076), DE 102 40 000, DE 102 02 044, DE 102 02 175, DE 101 29 237, DE 101 42 904, DE 101 35 210, EP 01 129 923, PCT/EP 02/10084 (now U.S. Pat. No. 7,577,822), DE 102 12 622, DE 102 36 271, DE 102 12 621, EP 02 009 868, DE 102 36 272, DE 102 41 812, DE 102 36 269, DE 102 43 322, EP 02 022 692, EP 02 001 331, and EP 02 027 277.
One problem in traditional approaches to reconfigurable technologies is encountered when the data processing is performed primarily on a sequential CPU using a configurable data processing logic cell field or the like and/or when data processing involving a plurality of processing steps and/or extensive processing steps to be performed sequentially is desired.
There are known approaches which are concerned with how data processing may be performed on both a CPU and a configurable data processing logic cell field.
WO 00/49496 describes a method for executing a computer program using a processor which includes a configurable functional unit capable of executing reconfigurable instructions, whose effect is redefinable in runtime by loading a configuration program, this method including the steps of selecting combinations of reconfigurable instructions, generating a particular configuration program for each combination, and executing the computer program. Each time an instruction from one of the combinations is needed during execution and the configurable functional unit is not configured using the configuration program for this combination, the configuration program for all the instructions of the combination is to be loaded into the configurable functional unit. In addition, a data processing device having a configurable functional unit is known from WO 02/50665 A1, where the configurable functional unit is used to execute instructions according to a configurable function. The configurable functional unit has a plurality of independent configurable logic blocks for executing programmable logic operations to implement the configurable function. Configurable connecting circuits are provided between the configurable logic blocks and both the inputs and outputs of the configurable functional unit. This allows optimization of the distribution of logic functions over the configurable logic blocks.
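The on-demand loading step described for WO 00/49496 can be sketched roughly as follows. This is only an illustrative model, not the cited implementation: the class `ConfigurableFunctionalUnit`, the combination names `arith` and `bits`, and the example instructions are hypothetical, and a configuration program is modeled simply as a dictionary of Python functions.

```python
from typing import Callable, Dict, Optional

Operation = Callable[[int, int], int]


class ConfigurableFunctionalUnit:
    """Toy model: the unit holds the configuration of one combination at a time."""

    def __init__(self, combinations: Dict[str, Dict[str, Operation]]) -> None:
        # combination name -> {reconfigurable instruction -> behavior}
        self.combinations = combinations
        # reverse map: which combination does each instruction belong to?
        self.owner = {
            instr: name
            for name, combo in combinations.items()
            for instr in combo
        }
        self.loaded: Optional[str] = None   # currently loaded combination
        self.loads = 0                      # number of configuration loads

    def execute(self, instruction: str, a: int, b: int) -> int:
        needed = self.owner[instruction]
        if self.loaded != needed:
            # The unit is not configured for this combination: load the
            # configuration for all instructions of the combination.
            self.loaded = needed
            self.loads += 1
        return self.combinations[needed][instruction](a, b)


# Hypothetical combinations chosen ahead of execution.
combinations: Dict[str, Dict[str, Operation]] = {
    "arith": {
        "absdiff": lambda a, b: abs(a - b),
        "addsat": lambda a, b: min(a + b, 255),
    },
    "bits": {
        "xorshift": lambda a, b: (a ^ b) << 1,
    },
}

unit = ConfigurableFunctionalUnit(combinations)
program = [("absdiff", 7, 3), ("addsat", 250, 10), ("xorshift", 6, 3), ("absdiff", 2, 9)]
results = [unit.execute(*step) for step in program]
print(results, unit.loads)
```

In this run the second instruction reuses the already loaded `arith` combination, whereas switching to `bits` and back to `arith` each triggers a reload. This is why the selection of combinations matters for the number of configuration loads.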
One problem with traditional architectures occurs when coupling is to be performed and/or technologies such as data streaming, hyperthreading, multithreading and so forth are to be utilized in a logical and performance-enhancing manner. A description of an architecture is given in “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Dean M. Tullsen, Susan J. Eggers et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, May 1996.
Hyperthreading and multithreading technologies have been developed in view of the fact that modern microprocessors gain their efficiency from many specialized functional units, from functional units operated as deep pipelines, and from deep memory hierarchies; this allows high frequencies in the function cores. However, because the memory arrangements are strictly hierarchical, cache misses carry a major penalty: owing to the difference between core frequencies and memory frequencies, many core cycles may elapse before data is read out of the memory. Furthermore, problems occur with branches and in particular with incorrectly predicted branches. It has therefore been proposed that a switch be performed between different tasks as a simultaneous multithreading (SMT) procedure whenever an instruction is not executable or does not use all functional units.
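The benefit of switching to another task while an instruction cannot proceed can be illustrated with a deliberately simplified cycle count. The model below is an assumption made for illustration only: a single-issue core, threads given as lists of stall cycles per instruction (for example, memory latency), and the hypothetical function names `cycles_sequential` and `cycles_switching`. A real SMT processor issues from several threads within the same cycle, which this sketch does not capture.

```python
from typing import List


def cycles_sequential(threads: List[List[int]]) -> int:
    """Run the threads one after another.

    Each instruction takes one issue cycle plus its stall cycles, during
    which the core does nothing else.
    """
    return sum(1 + stall for thread in threads for stall in thread)


def cycles_switching(threads: List[List[int]]) -> int:
    """Issue, in every cycle, the next instruction of the first ready thread.

    A thread whose last instruction is still stalled is skipped, so its
    stall cycles can be overlapped with work from other threads.
    """
    pc = [0] * len(threads)          # next instruction index per thread
    ready_at = [0] * len(threads)    # cycle at which each thread may issue again
    remaining = sum(len(thread) for thread in threads)
    cycle = 0
    while remaining:
        for i, thread in enumerate(threads):
            if pc[i] < len(thread) and ready_at[i] <= cycle:
                ready_at[i] = cycle + 1 + thread[pc[i]]
                pc[i] += 1
                remaining -= 1
                break                # single issue: one instruction per cycle
        cycle += 1
    # Execution ends when the last outstanding stall has completed.
    return max(ready_at, default=0)


# Two hypothetical threads; the value 3 stands for a three-cycle stall.
threads = [[0, 3, 0], [0, 0, 3]]
print(cycles_sequential(threads), cycles_switching(threads))
```

For these two threads the sequential run takes 12 cycles and the switching run takes 8, because the stall of the first thread is hidden behind instructions of the second.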
The technology of the above-cited exemplary documents (not by the present applicant) involves, among other things, an arrangement in which configurations are loadable into a configurable data processing logic cell field, but in which data exchange between the ALU of the CPU and the configurable data processing logic cell field, whether an FPGA, DSP or the like, takes place via registers. In other words, data from a data stream must first be written sequentially into registers and then stored in these registers sequentially again. Another problem occurs when there is to be external access to data, because even then there are still problems in the timing of data processing relative to the ALU, in the allocation of configurations, and so forth. Traditional arrangements, such as those known from protective rights not held by the present applicant, are used, among other things, for processing functions in the configurable data processing logic cell field, DFP, FPGA or the like, which are not efficiently processable on the ALU of the CPU. The configurable data processing logic cell field is thus used in practical terms to permit user-defined opcodes which allow more efficient processing of algorithms than would be possible on the ALU of the CPU without configurable data processing logic cell field support.
In the related art, as has been recognized, coupling is thus usually word-based but not block-based, as would be necessary for data streaming processing. It is initially desirable to permit more efficient data processing than would be the case with close coupling via registers.
Another possibility for using logic cell fields made up of logic cells and logic cell elements having a coarse and/or fine granular structure involves a very loose coupling of such a field to a traditional CPU and/or a CPU core in embedded systems. A traditional sequential program, e.g., a program written in C, C++ or the like, may run on a CPU or the like, data stream processing calls being instantiated by this program on the finely and/or coarsely granular data processing logic cell field. It is then problematic that in programming for this logic cell field, a program not written in C or another sequential high-level language must be provided for data stream processing. It would be desirable here for C programs or the like to be processable both on the traditional CPU architecture and on a data processing logic cell field operated jointly with it, i.e., for a data streaming capability to be maintained in quasi-sequential program processing using the data processing logic cell field in particular, while CPU operation, in particular using a coupling which is not too loose, remains possible at the same time.
It is also already known that within a data processing logic cell field system such as that known in particular from PACT02 (DE 196 51 075.9-53, WO 98/26356, now U.S. Pat. No. 6,728,871), PACT04 (DE 196 54 846.2-53, WO 98/29952 (no US)), PACT08 (DE 197 04 728.9, WO 98/35299 (no US)), PACT13 (DE 199 26 538.0, WO 00/77652, now U.S. Pat. No. 8,230,411), PACT31 (DE 102 12 621.6-53, PCT/EP 02/10572, now U.S. Pat. No. 8,429,385), sequential data processing may also be provided within the data processing logic cell field. However, partial processing within a single configuration, performed for example to save resources or to optimize timing, does not enable a programmer to implement a piece of high-level language code automatically and easily on a data processing logic cell field, as is the case with traditional machine models for sequential processors. Implementation of high-level language code on data processing logic cell fields according to the models for sequentially operating machines still remains difficult.
It is also known from the related art that multiple configurations, each triggering a different mode of functioning of array parts, may be processed simultaneously on the processor array (PA) and that a switch in one or more configurations may take place without any disturbance in others during runtime. Methods and arrangements for their implementation in hardware are known which ensure that processing of partial configurations to be loaded into the field may be performed without a deadlock. Reference is made here in particular to the patent applications pertaining to the FILMO technology, e.g., PACT05 (DE 196 54 593.5-53, WO 98/31102 (no US)); PACT10 (DE 198 07 872.2, WO 99/44147, now U.S. Pat. No. 6,480,937, WO 99/44120, now U.S. Pat. No. 6,571,381); PACT13 (DE 199 26 538.0, WO 00/77652, now U.S. Pat. No. 8,230,411); PACT17 (DE 100 28 397.7, WO 02/13000, now U.S. Pat. No. 7,003,660); PACT31 (DE 102 12 621.6, WO 03/036507, now U.S. Pat. No. 8,429,385). This technology already permits parallelization to a certain extent and, with appropriate design and allocation of the configurations, also permits a type of multitasking/multithreading in which planning, i.e., scheduling and/or time use planning control, is provided. Time use planning control arrangements and methods are thus known per se from the related art, allowing multitasking and/or multithreading at least with appropriate allocation of configurations to individual tasks and/or threads to configurations and/or configuration sequences.