1. Field of the Invention
The invention relates to Heterogeneous Multiprocessor Network on Chip Devices, preferably containing Reconfigurable Hardware Tiles, Methods and Operating Systems for Control thereof, said Operating Systems handling run-time traffic management and task migration.
2. Description of the Related Technology
In order to meet the ever-increasing design complexity, future sub-100 nm platforms will consist of a mixture of heterogeneous computing resources (processing elements, or PEs), further denoted as tiles or nodes. [R. Tessier, W. Burleson, “Reconfigurable Computing for Digital Signal Processing: A Survey”, VLSI Signal Processing 28, p 7-27, 2001.] These loosely coupled (i.e. without locally shared memory) programmable/reconfigurable tiles will be interconnected by a configurable on-chip communications fabric or a Network-on-Chip (NoC). [S. Kumar, A. Jantsch, M. Millberg, J. Öberg, J. Soininen, M. Forsell, K. Tiensyrjä, and A. Hemani, “A network on chip architecture and design methodology,” in Proceedings, IEEE Computer Society Annual Symposium on VLSI, April 2002.] [A. Jantsch and H. Tenhunen, “Will Networks on Chip Close the Productivity Gap”, Networks on Chip, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2003, pages 3-18.] [L. Benini, G. DeMicheli, “Networks on Chips: A new SOC paradigm?”, IEEE Computer magazine, January 2002.] [William J. Dally, Brian Towles, “Route packets, not wires: on-chip interconnection networks,” DAC 2001, p 684-689.]
Dynamically managing the computation and communication resources of such a platform is a challenging task, especially when the platform contains a special PE type such as fine-grain reconfigurable hardware (RH). Compared to the traditional PEs, RH operates in a different way, exhibiting its own distinct set of properties.
The (beneficial) use of a (flexible) Network-on-Chip to interconnect multiple heterogeneous resources has been illustrated before. [S. Kumar, A. Jantsch, M. Millberg, J. Öberg, J. Soininen, M. Forsell, K. Tiensyrjä, and A. Hemani, “A network on chip architecture and design methodology,” in Proceedings, IEEE Computer Society Annual Symposium on VLSI, April 2002.] [T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins: Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGAs. Proc. 12th Int. Conf. on Field-Programmable Logic and Applications, Springer LNCS 2438 pages 795-805, Montpellier, September 2002.]
In order to execute multiple heterogeneous applications, an operating system is required. Nollet et al. give a general overview of the different operating system components [V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins, “Designing an Operating System for a Heterogeneous Reconfigurable SoC”, Proc. RAW 2003, Nice, April 2003].
In the field of operating systems, Singhal classifies the system depicted in FIG. 3A as a master-slave configuration. [Mukesh Singhal and Niranjan G. Shivaratri. “Advanced Concepts in Operating Systems: Distributed, Database and Multiprocessor Operating Systems”. McGraw-Hill Series in Computer Science. McGraw-Hill, New York, 1994, pages 444-445].
Dally advises the use of NoCs in Systems-on-Chips (SoCs) as a replacement for top-level wiring because they outperform it in terms of structure, performance and modularity. Because reconfigurable SoCs are targeted, there is an additional reason to use NoCs: they allow dynamic multitasking and provide HW support to an operating system for reconfigurable systems [W. J. Dally and B. Towles: Route Packets, Not Wires: On-Chip Interconnection Networks, Proc. Design Automation Conference, June 2001.].
Simmler addresses “multitasking” on FPGAs (Field Programmable Gate Arrays). However, in this system only one task runs on the FPGA at a time. To support “multitasking”, it foresees the need for task preemption, which is done by reading back the configuration bitstream. The state of the task is extracted by computing the difference between the read-back bitstream and the original one, which has the disadvantages of being architecture dependent and adding run-time overhead [H. Simmler, L. Levinson, R. Manner: Multitasking on FPGA Coprocessors. Proceedings 10th Intl. Conf. Field Programmable Logic and Applications, pages 121-130, Villach, August 2000.]. The need for high-level task state extraction and real dynamic heterogeneous multitasking is addressed in U.S. Ser. No. 10/453,899, fully incorporated by reference.
Rijpkema discusses the integration of best-effort and guaranteed-throughput services in a combined router. [E. Rijpkema et al.: Trade Offs in the Design of a Router with both Guaranteed and Best-Effort Services for Networks On Chip. Proc. DATE 2003, pages 350-355, Munich, March 2003.]
Nollet et al. explain the design of the SW part of an operating system for reconfigurable systems by extending a Real-Time OS with functions to manage the reconfigurable SoC platform. They introduce two-level task scheduling in reconfigurable SoCs. The top-level scheduler dispatches tasks to schedulers local to their respective processors (HW tiles or ISP). Local schedulers order in time the tasks assigned to them. Task relocation is controlled in SW by the top-level scheduler. [V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins, “Designing an Operating System for a Heterogeneous Reconfigurable SoC”, Proc. RAW 2003, Nice, April 2003] and U.S. patent application Ser. No. 10/453,899, fully incorporated by reference.
Mignolet et al. present a design environment that allows development of applications featuring tasks relocatable on heterogeneous processors. A common HW/SW behavior, required for heterogeneous relocation, is obtained by using a unified HW/SW design language such as OCAPI-XL. OCAPI-XL allows automatic generation of HW and SW versions of a task with an equivalent internal state representation. [J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins: Infrastructure for Design and Management of Relocatable Tasks in a Heterogeneous Reconfigurable System-on-Chip. Proc. DATE 2003, pages 986-992, Munich, March 2003] and U.S. patent application Ser. No. 10/453,899, fully incorporated by reference.
It has been previously demonstrated that using a single NoC enables dynamic multitasking on FPGAs. [T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins: Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGAs. Proc. 12th Int. Conf. on Field-Programmable Logic and Applications, Springer LNCS 2438 pages 795-805, Montpellier, September 2002.] and U.S. patent application Ser. No. 10/453,899, fully incorporated by reference.
Experimentation on a first setup with a combined data and control NIC showed some limitations in the dynamic task migration mechanism. During the task-state transfer, the OS has to ensure that pending messages, stored in the network and its interfaces, are redirected in order to the computation resource the task has been relocated to. This process requires synchronization of communication and is not guaranteed to work on the first platform. Indeed, OS Operation and Management (OAM) communication and application data communication are logically distinguished on the NoC by using different tags in the message header. Because application data can congest the packet-switched NoC, there is no guarantee that OS OAM messages, such as those ensuring the communication synchronization during task relocation, arrive in time. [T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins: Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGAs. Proc. 12th Int. Conf. on Field-Programmable Logic and Applications, Springer LNCS 2438 pages 795-805, Montpellier, September 2002.]
Guerrier et al. provide a structure to re-order the received packets. [Pierre Guerrier, Alain Greiner, “A Generic Architecture for On-Chip Packet-Switched Interconnections”, Proc. DATE 2000, pages 250-256]
Run-time task migration is not a new topic and has been studied extensively for multicomputer systems since the beginning of the 1980s. The algorithms developed for those systems are not suitable for a Network-on-Chip environment. The tiles in a NoC only have a limited amount of memory. In addition, the NoC communication protocol differs significantly from the general protocols used for computer communication. These general protocols provide a lot of flexibility, but very low performance. Due to the specific characteristics of an on-chip network, such as a very low error rate and higher bandwidth, a NoC communication protocol will provide a different trade-off between performance and flexibility [S. Kumar, “On packet switched networks for on-chip communication” In A. Jantsch and H. Tenhunen, editors, Networks on Chip, chapter 5, pages 85-106. Kluwer Academic Publishers, February 2003]. In addition, the granularity of task mapping will be different. Most likely, a tile will not contain a full-blown application. Instead, a tile will only contain a single or a few tasks belonging to that application. In contrast to the multicomputer environment, this does not pose a problem, since the extremely tight coupling of the processing elements allows heavily communicating tasks to be mapped on different computing resources.
When benchmarking task migration mechanisms, the following properties will allow us to compare different mechanisms. The ideal task migration mechanism should have:
Minimal reaction time. The reaction time is defined as the time elapsed between selecting a task for migration and the moment the task is actually ready to migrate (i.e. it has reached its switchpoint).
Minimal freeze time. The migration mechanism should cause as little interruption as possible to the execution of the migrating task (and hence to the entire application). This means that the freeze time, illustrated by FIG. 19, needs to be minimized. This can be achieved on the one hand by minimizing the time needed to capture and transfer the task state, and on the other hand by minimizing the effort required to maintain message consistency.
Minimal residual dependencies. Once a migrated task has started executing on its new tile, it should no longer depend in any way on its previous tile. These residual dependencies are undesirable because they waste both communication and computing resources.
Minimal system interference. Besides causing minimal interference to the execution of the migrating task, the migration mechanism should avoid interference with other applications executing in the NoC or with the system as a whole.
Maximum scalability. This property determines how the migration mechanism copes with an increasing number of tasks and tiles in the NoC.
Assessment of Existing Message Consistency Mechanisms
The message consistency component of the migration mechanism described by Russ et al. [S. H. Russ, J. Robinson, M. Gleeson, J. Figueroa, “Dynamic Communication Mechanism Switching in Hector”, Mississippi State Technical Report No. MSSU-EIRS-ERC-97-8, September 1997.] is based on using end-of-channel messages and an unexpected message queue. In this case, communication consistency is preserved by emptying the unexpected message queue before accepting any messages that arrive after completion of the migration process.
A similar technique to preserve communication consistency is described by Stellner [G. Stellner, “CoCheck: Checkpointing and Process Migration for MPI”, Proceedings of the 10th International Parallel Processing Symposium, Honolulu, HI, April 1996.][G. Stellner, “Consistent Checkpoints of PVM Applications”, Proceedings of the First European PVM Users Group Meeting, Rome, 1994.] The migrating task sends a special synchronization message to the other tasks of the application. In turn, these tasks send a ready message to each other. Messages that still arrive before the last ready message are buffered. In order to ensure message consistency, the migrated task is served with the buffered messages first.
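The buffering scheme described above can be illustrated with a minimal Python sketch. The class name `MigratingTaskInbox` and its methods are hypothetical and are not taken from CoCheck; the sketch only assumes that every peer eventually sends one ready message and that messages arriving before the last ready message must be served first.

```python
from collections import deque


class MigratingTaskInbox:
    """Hypothetical inbox of a migrating task (illustration only)."""

    def __init__(self, peers):
        self.waiting_for = set(peers)  # peers whose ready message is outstanding
        self.buffered = deque()        # messages that arrived before the last ready
        self.live = deque()            # messages that arrived after the last ready

    def on_ready(self, peer):
        # A peer has acknowledged the synchronization message.
        self.waiting_for.discard(peer)

    def on_message(self, msg):
        # Until the last ready message arrives, incoming messages are buffered.
        if self.waiting_for:
            self.buffered.append(msg)
        else:
            self.live.append(msg)

    def receive(self):
        # The migrated task is served with the buffered messages first.
        if self.buffered:
            return self.buffered.popleft()
        if self.live:
            return self.live.popleft()
        return None
```

Note that the `buffered` queue in this sketch is unbounded: its size depends on how many messages are in flight when migration starts.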
These mechanisms are not applicable in a NoC. Due to the extremely limited amount of message buffer space it is impossible to store all incoming messages after a task reached its migration point. This implies that messages might remain buffered in the communication path as shown in FIG. 18. Adding more buffer space to accommodate these messages is not an option, because on-chip memory is expensive and the maximum amount of required storage is application dependent.
The Amoeba distributed operating system [C. Steketee, W. Zhu, P. Moseley, “Implementation of Process Migration in Amoeba.”, Proceedings of the 14th Conference on Distributed Computing Systems, pages 194-201, Poland, June 1994.] offers a different way of dealing with the communication consistency issue: the consistency is built into the communication protocol. Incoming messages will be rejected while a task is migrating. The message source will be notified by a “task is migrating” or a “not here” reply message. This will trigger a lookup mechanism to determine the new location of the migrated task. In contrast to the previously described techniques, this technique does not require buffer space to queue the incoming messages during freeze time, which avoids a memory penalty when the number of messages is not known upfront.
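The reject-and-lookup behavior described above can be sketched as follows. This is an illustration under stated assumptions, not Amoeba's actual protocol or API: the `Tile` class, the `send` helper, the `lookup` callback and the `max_attempts` limit are all hypothetical names introduced here.

```python
class Tile:
    """Hypothetical tile that rejects messages instead of buffering them."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}         # task id -> list of delivered messages
        self.migrating = set()  # task ids currently frozen for migration

    def deliver(self, task_id, msg):
        if task_id in self.migrating:
            return "task is migrating"  # rejected, nothing is queued
        if task_id not in self.tasks:
            return "not here"           # task has left this tile
        self.tasks[task_id].append(msg)
        return "ok"


def send(msg, task_id, cached_tile, lookup, max_attempts=3):
    """Send msg; on rejection, look up the task location and retransmit.

    Returns the accepting tile and the number of transmissions used.
    """
    tile = cached_tile
    for attempt in range(1, max_attempts + 1):
        if tile.deliver(task_id, msg) == "ok":
            return tile, attempt
        tile = lookup(task_id)  # rejection triggers the lookup mechanism
    raise RuntimeError("message could not be delivered")
```

Each rejection in this sketch costs one extra transmission of the same message.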
This technique is also not suited for a Network-on-Chip, since dropping and retransmitting packets reduces network performance and increases power dissipation [W. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks”, in Proceedings of 38th Design Automation Conference (DAC), pages 684-689, Las Vegas, June 2001.]. To ensure reliable communication in a task-transparent way, this technique also requires (costly) additional on-chip functionality [A. Radulescu, K. Goossens, “Communication Services for Networks on Chip”, SAMOS II, pages 275-299, Samos, Greece, July 2002.]. Furthermore, dropping messages potentially leads to out-of-order message delivery. Special message re-order functionality combined with extra buffer space is needed to get messages back in order in a task-transparent way.
As explained, upon reaching a migration point, the task has to check whether there is a pending switch request. In case of such a request, task migration needs to be initiated. One of the issues is the performance overhead this checking incurs during normal execution (i.e. when there is no pending switch request). Currently, the two main techniques to check for a pending switch request are:
Polling for a switch request. In this case, polling points are introduced into the execution code (into the source code by the programmer or into the object code by the compiler) where the task has a migration point. This technique is completely machine-independent, since the architectural differences will be taken care of by the compiler in one way or another. However, this technique potentially introduces a substantial performance cost during normal execution due to the continuous polling. This technique is used by the task migration mechanisms implemented by [A. J. Ferrari, S. J. Chapin, and A. S. Grimshaw. Process Introspection: A Heterogeneous Checkpoint/Restart Mechanism Based on Automatic Code Modification. Technical Report CS-97-05, Department of Computer Science, University of Virginia.] [H. Jiang, V. Chaudhary, “Compile/run-time support for thread migration”, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pages 58-66, April 2002.].
Dynamic modification of code (self-modification of code). Here the code is altered at run-time to introduce the migration-initiation code upon a switch request. This way, these techniques can avoid the polling overhead. These techniques have their own downsides: changing the code will most likely require a flush of the instruction cache, and changing an instruction sequence the processor is currently executing can have unpredictable effects. This kind of technique is used by [Prashanth P. Bungale, Swaroop Sridhar and Vinay Krishnamurthy, “An Approach to Heterogeneous Process State Capture/Recovery, to Achieve Minimum Performance Overhead During Normal Execution,” Proceedings of the 12th International Heterogeneous Computing Workshop (HCW 2003), held as part of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), Nice, France, Apr. 22, 2003.] [P. Smith, N. Hutchinson, “Heterogeneous Process Migration: The Tui System”, Software: Practice and Experience, 28(6), 611-639, May 1998.].
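The polling technique described above can be sketched in Python as follows. The function `run_task`, the `switch_requested` callback and the layout of the `state` dictionary are hypothetical and only serve to show where the polling cost arises; they do not come from any of the cited systems.

```python
def run_task(items, switch_requested, state=None):
    """Sum the items, polling for a switch request at each migration point.

    switch_requested is a hypothetical callback that returns True when the
    OS has asked the task to migrate. The returned state is sufficient to
    resume the task elsewhere.
    """
    if state is None:
        state = {"next": 0, "total": 0}
    while state["next"] < len(items):
        # Migration point: this check is executed on every iteration,
        # even when no switch request is pending.
        if switch_requested():
            return "migrate", state
        state["total"] += items[state["next"]]
        state["next"] += 1
    return "done", state
```

A task that returns `"migrate"` can be resumed by calling `run_task` again with the returned `state`, for example on another processing element. The check at the top of the loop is the overhead that self-modifying approaches try to avoid.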
The communication QoS services offered by the AEthereal NoC are detailed in [A. Radulescu, K. Goossens, “Communication Services for Networks on Chip”, SAMOS, p 275-299, 2002]. The AEthereal system contains both an end-to-end flow control mechanism and a bandwidth reservation mechanism. The flow control mechanism ensures that a producer can only send messages when there is enough buffer space at the consumer side. In case no flow control was requested at connection setup, the packets are dropped according to a certain policy. The bandwidth reservation mechanism provides guarantees on bandwidth as well as on latency and jitter by reserving a number of fixed-size TDMA slots for a connection. The routing is based on the use of time-slot tables. In order to avoid wasting time-slots (i.e. bandwidth), it is possible to define part of the connection (e.g. request command messages) as best effort, while the other part (e.g. the data stream resulting from the command) enjoys guaranteed throughput. However, in order to allocate a time-slot for a single connection, the required time-slot needs to be available in every router along the path [Edwin Rijpkema, Kees G. W. Goossens, Andrei Radulescu, John Dielissen, Jef L. van Meerbergen, P. Wielage, E. Waterlander, “Trade Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip”, DATE 2003, p 350-355]. So finding a suitable (compile-time) time-slot allocation for all NoC connections is computationally intensive and requires heuristics that potentially provide sub-optimal solutions. Creating an optimal run-time time-slot allocation scheme requires a global (i.e. centralized) time-slot view, which is slow and not scalable. In contrast, distributed run-time slot allocation is scalable, but lacks a global view, resulting in suboptimal resource allocations. Further research [J. Dielissen, A. Rădulescu, K. Goossens, E. Rijpkema, “Concepts and Implementation of the Philips Network-on-Chip”, IP/SoC, 2003], however, revealed that the time-slot table present in every AEthereal router takes up 25% of the router area. The control logic to enable this local time-slot table takes up another 25%. Since initial on-chip networks will be small, the AEthereal authors opted for a centralized approach that does not require a time-slot table in every router.
Classic computer networks expose an entire spectrum of QoS classes with best effort service on one end and deterministic guaranteed QoS on the other end. In between, there is predictive QoS and statistical QoS. Here, the QoS calculation is based on the past behavior/workload or on a stochastic value, respectively. Although with these techniques the requested QoS can be temporarily violated, they improve the usage of communication resources with respect to the deterministic guaranteed QoS. This is why AEthereal combines best effort with guaranteed throughput. Reisslein et al. detail a statistical QoS technique based on regulating the amount of traffic a node can inject into Internet-like packet-switched networks.
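The time-slot constraint described above for AEthereal, namely that a slot must be free in every router along the path, can be illustrated with a small first-fit sketch using a centralized view. The function names, the table layout, and in particular the assumption that a flit occupying slot s in one router occupies slot s+1 (modulo the table size) in the next router are illustrative assumptions, not details taken from the text above.

```python
def find_slot(path, tables, num_slots):
    """Return the first start slot that is free on every hop, or None.

    path is the ordered list of router names; tables maps each router
    name to the set of slot indices already reserved in that router.
    Assumption: the slot index advances by one per hop.
    """
    for start in range(num_slots):
        if all((start + hop) % num_slots not in tables[router]
               for hop, router in enumerate(path)):
            return start
    return None


def reserve(path, tables, num_slots):
    """Reserve one slot per hop for a connection; return the start slot."""
    start = find_slot(path, tables, num_slots)
    if start is None:
        return None  # no slot is free in every router along the path
    for hop, router in enumerate(path):
        tables[router].add((start + hop) % num_slots)
    return start
```

For example, with four slots, router A holding slot 0 and router B holding slot 2, a connection over A, B, C can only start in slot 2. The sketch inspects the tables of all routers at once, which is the global view that limits scalability; a first-fit choice like this one is also not guaranteed to leave room for later connections.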