A major advance in electronic computation has been the development of systems that can perform multiple operations simultaneously. Such systems are said to perform parallel processing. Recently, cell processors have been developed to implement parallel processing on electronic devices ranging from handheld game devices to mainframe computers. A typical cell processor has a main memory, a power processor element (PPE), and up to 8 additional processor elements referred to as synergistic processing elements (SPEs). Each SPE is typically a single chip or part of a single chip containing a processor known as a synergistic processor unit (SPU) and a local memory. All of the SPEs and the PPE can access the main memory, e.g., through a memory flow controller (MFC). The SPEs can perform parallel processing of operations in conjunction with a program running on the main processor. The local memory associated with each SPU is relatively small, currently about 256 kilobytes in one common implementation. This memory must be managed by software, which transfers code and data to and from the local SPE memories.
The SPEs have a number of advantages in parallel processing applications. For example, the SPEs are independent processors that can execute code with minimal involvement from the PPE. Each SPE has a high direct memory access (DMA) bandwidth to RAM. An SPE can typically access the main memory faster than the PPE. In addition, each SPE has relatively fast access to its associated local store. The SPEs also have limitations that can make it difficult to optimize SPE processing. For example, the SPEs have no coherent memory and no hardware cache. In addition, common programming models do not work well on SPEs.
A typical SPE process involves retrieving code and/or data from the main memory, executing the code on the SPU to manipulate the data, and outputting the data to main memory or, in some cases, to another SPU. To achieve high SPU performance, it is desirable to optimize the above SPU process in relatively complex processing applications. For example, in applications such as computer graphics processing, SPUs typically execute tasks thousands of times per frame.
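By way of illustration only, the retrieve, execute, and output cycle described above can be sketched as a small simulation. The sketch below is not code for an actual cell processor: the names run_on_spu and invert, the 16-kilobyte chunk size, and the use of byte-array copies in place of DMA transfers are all assumptions made for the example. It shows only that data larger than the working chunk must be moved through the local store piece by piece under software control.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local store size cited in the text


def run_on_spu(main_memory, src, dst, length, kernel, chunk_bytes=16 * 1024):
    """Process `length` bytes at main_memory[src:] into main_memory[dst:].

    Returns the number of simulated transfers (one in and one out per chunk).
    """
    if chunk_bytes > LOCAL_STORE_BYTES:
        raise ValueError("chunk does not fit in the local store")
    transfers = 0
    for offset in range(0, length, chunk_bytes):
        n = min(chunk_bytes, length - offset)
        # 1. Retrieve: copy a chunk from main memory into the local store.
        local_store = bytes(main_memory[src + offset:src + offset + n])
        # 2. Execute: the code operates only on data held in the local store.
        result = kernel(local_store)
        if len(result) != n:
            raise ValueError("kernel must return as many bytes as it received")
        # 3. Output: write the result back to main memory.
        main_memory[dst + offset:dst + offset + n] = result
        transfers += 2
    return transfers


def invert(chunk):
    """Hypothetical per-chunk operation: invert every byte."""
    return bytes(255 - b for b in chunk)
```

In this sketch, processing 40 kilobytes with the default chunk size takes three chunks and therefore six simulated transfers. Repeating such a cycle thousands of times per frame is what makes the cost of each step significant.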
One prior art task management system used for cell processors is based on a software concept referred to as “threads”. A “thread” generally refers to a part of a program that can execute independently of other parts. Operating systems that support multithreading enable programmers to design programs whose threaded parts can execute concurrently. When a thread is interrupted, a context switch may swap out the contents of an SPE's local storage to the main memory and substitute 256 kilobytes of data and/or code into the local storage from the main memory, where the substitute data and code are processed by the SPU. A context switch is the computing process of storing and restoring the state of an SPE or PPE (the context) such that multiple processes can share a single resource.
A typical context switch involves stopping a program running on a processor and storing to the main memory the values of the registers and the program counter, plus any other operating system specific data that may be necessary. For example, to prevent a single process from monopolizing use of a processor, certain parallel processor programs perform a timer tick at intervals ranging from about 60 ticks per second to about 100 ticks per second. If the process running on the processor is not completed, a context switch is performed to save the state of the processor and a new process (often the task scheduler or “kernel”) is swapped in. As used herein, the kernel refers to a central module of the operating system for the parallel processor. The kernel is typically the part of the operating system that loads first, and it remains in main memory. Typically, the kernel is responsible for memory management and for process and task management.
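The timer-tick behavior described above can be illustrated with a minimal simulation. The sketch assumes a simple round-robin policy, and the names Context, Process, and run_with_timer_ticks, along with the single "acc" register, are hypothetical and chosen only for the example. The scheduling loop stands in for the kernel, and a dictionary stands in for contexts stored in main memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Context:
    """Saved processor state: program counter plus register values."""
    pc: int = 0
    registers: dict = field(default_factory=dict)


@dataclass
class Process:
    name: str
    total_steps: int  # hypothetical amount of work, in steps
    context: Context = field(default_factory=Context)


def run_with_timer_ticks(processes, steps_per_tick):
    """Round-robin scheduling driven by a simulated timer tick.

    Returns (completion_order, number_of_context_switches).
    """
    ready = deque(processes)
    saved = {}  # stands in for contexts stored in main memory
    finished, switches = [], 0
    while ready:
        proc = ready.popleft()
        # Restore the context saved earlier, or start from the initial one.
        ctx = saved.pop(proc.name, proc.context)
        # Run until the process completes or the timer tick fires.
        steps = min(steps_per_tick, proc.total_steps - ctx.pc)
        ctx.pc += steps
        ctx.registers["acc"] = ctx.registers.get("acc", 0) + steps
        if ctx.pc >= proc.total_steps:
            finished.append(proc.name)
        else:
            # Not completed at the tick: save state, swap in the next process.
            saved[proc.name] = ctx
            switches += 1
            ready.append(proc)
    return finished, switches
```

Each pass through the else branch corresponds to one context switch, so shortening the interval between ticks increases the number of saves and restores for the same total work.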
Frequent context switches can be quite computationally intensive and time-consuming, particularly for processors that have a large number of registers. As used herein, a register refers to a special, high-speed storage area within a processor. Typically, data must be represented in a register before it can be processed. For example, if two numbers are to be multiplied, both numbers must be in registers, and the result is also placed in a register. The register may alternatively contain the address of a memory location where data is to be stored rather than the actual data itself. Registers are particularly advantageous in that they can typically be accessed in a single cycle. Program compilers typically make use of as many software-configurable registers as are available when compiling a program.
One prior art task management system used for cell processors is known as SPU Threads. SPU Threads operates by regarding the SPUs in a cell as processors for threads. A context switch in SPU Threads may swap out the contents of an SPU's local storage to the main memory and substitute 256 kilobytes of data and/or code into the local storage from the main memory, where the substitute data and code are processed by the SPU. Context switches are usually computationally intensive, and much of the design of operating systems is directed at optimizing the use of context switches.
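As a rough illustration of why such swaps are costly, the memory traffic they generate can be estimated from the figures given above. The calculation below rests on two assumptions not stated in the text: that every context switch copies the full 256-kilobyte local store out and a full 256 kilobytes back in, and that a switch occurs on every timer tick. The function name is hypothetical.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local store size cited in the text


def swap_traffic_bytes_per_second(switches_per_second, num_spus=1):
    """Estimate main-memory traffic caused by full local-store swaps."""
    per_switch = 2 * LOCAL_STORE_BYTES  # one copy out, one copy in
    return per_switch * switches_per_second * num_spus
```

Under these assumptions, 100 switches per second on a single SPU amounts to 52,428,800 bytes per second of traffic, and 60 switches per second across 8 SPUs amounts to 251,658,240 bytes per second, all of it spent moving state rather than doing useful work.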
Unfortunately, interoperating with SPU Threads is not an option for high-performance applications. Applications based on SPU Threads have large bandwidth requirements and are processed from the PPE. Consequently, applications based on SPU Threads are not autonomous and tend to be slow. Because SPU Threads are managed from the PPE, SPU context switching (swapping out the process currently running on an SPU for another waiting process) takes too long. Avoiding PPE involvement in SPU management can lead to much better performance for certain applications.
To overcome these problems, a system referred to as SPU Runtime System (SPURS) was developed. In SPURS, the memory of each SPU has loaded into it a kernel that performs scheduling of tasks handled by the SPU. Groups of these tasks are referred to as Tasksets. SPURS is described in PCT Application PCT/JP2006/310907, to Keisuke Inoue and Seiji Murata, filed May 31, 2006 and entitled “METHOD AND APPARATUS FOR SCHEDULING IN A MULTI-PROCESSOR SYSTEM”; in US Patent Application Publication No. 20050188373, to Keisuke Inoue, Tatsuya Iwamoto and Masahiro Yasue, filed Feb. 20, 2004 and entitled “METHOD AND APPARATUS FOR TASK MANAGEMENT IN A MULTI-PROCESSOR SYSTEM”; in US Patent Application Publication No. 20050188372, to Keisuke Inoue and Tatsuya Iwamoto, filed Feb. 20, 2004 and entitled “METHOD AND APPARATUS FOR PROCESSOR TASK MIGRATION IN A MULTI-PROCESSOR SYSTEM”; and in U.S. Provisional Patent Application No. 60/650,153, to Keisuke Inoue and Masahiro Yasue, filed Feb. 4, 2005 and entitled “PROCESSOR TASK MIGRATION OVER A NETWORK IN A MULTI-PROCESSOR SYSTEM”, the disclosures of all four of which are incorporated herein by reference.
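The general idea of a scheduling kernel resident on each SPU can be illustrated with the following simplified sketch. It is not the SPURS interface and does not reflect the implementations described in the applications cited above. The names SpuKernel, schedule_one, and run_taskset are hypothetical, tasks are assumed to run to completion, and the SPUs are simulated one after another rather than concurrently.

```python
from collections import deque


class SpuKernel:
    """Hypothetical per-SPU scheduler resident in that SPU's local memory."""

    def __init__(self, spu_id):
        self.spu_id = spu_id
        self.completed = []

    def schedule_one(self, taskset):
        """Take the next ready task from the taskset and run it to completion."""
        if not taskset:
            return False
        name, fn, arg = taskset.popleft()
        self.completed.append((name, fn(arg)))
        return True


def run_taskset(taskset, num_spus):
    """Let each simulated SPU kernel pull tasks until the taskset is empty."""
    kernels = [SpuKernel(i) for i in range(num_spus)]
    # Each kernel makes its own scheduling decision in turn; no central
    # dispatcher assigns tasks to SPUs. A list (not a generator) is passed to
    # any() so that every kernel gets a turn in every round.
    while any([k.schedule_one(taskset) for k in kernels]):
        pass
    return kernels
```

For example, a taskset of five tasks run on two simulated SPUs is drained with the first kernel taking three tasks and the second taking two, without any component outside the kernels deciding which SPU runs which task.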
It is within this context that embodiments of the present invention arise.