A major advance in electronic computation has been the development of systems that can perform multiple operations simultaneously. Such systems are said to perform parallel processing. One type of parallel processing system, known as a Cell processor, has been developed to implement parallel processing on electronic devices ranging from handheld game devices to mainframe computers. A typical Cell processor has a main memory, a power processor element (PPE) and up to 8 additional processor elements referred to as synergistic processing elements (SPE). Each SPE is typically a single chip or part of a single chip containing a processor known as a synergistic processor unit (SPU) and a local memory. All of the SPEs and the PPE can access the main memory, e.g., through a memory flow controller (MFC). The SPEs can perform parallel processing of operations in conjunction with a program running on the PPE. The local memory associated with each SPU is relatively small, currently about 256 kilobytes in one common implementation. This memory must be managed by software to transfer code and data to/from the local SPE memories.
The SPEs have a number of advantages in parallel processing applications. For example, the SPEs are independent processors that can execute code with minimal involvement from the PPE. Each SPE has a high direct memory access (DMA) bandwidth to RAM. An SPE can typically access the main memory faster than the PPE. In addition, each SPE has relatively fast access to its associated local store. The SPEs also have limitations that can make it difficult to optimize SPE processing. For example, the SPEs have no coherent memory and no hardware cache. In addition, many common programming models do not work well on SPEs.
A typical SPE process involves retrieving code and/or data from the main memory, executing the code with the SPU to manipulate the data, and outputting results of the manipulation of the data to main memory or, in some cases, another SPU. To achieve high SPU performance, it is desirable to optimize the above SPU process in relatively complex processing applications. For example, in applications such as computer graphics processing, SPUs typically execute tasks thousands of times per frame.
One prior art task management system used for Cell processors and other types of processors is based on a software concept referred to as “threads”. A “thread” generally refers to a part of a program that can execute independently of other parts. Operating systems that support multithreading enable programmers to design programs whose threaded parts can execute concurrently. When a thread is interrupted, a context switch may swap out the contents of an SPE's local storage to the main memory and substitute 256 kilobytes of data and/or code into the local storage from the main memory, where the substitute data and code are processed by the SPU. A context switch is the computing process of storing and restoring the state of an SPE or PPE (the context) such that multiple processes can share a single resource.
A typical context switch involves stopping a program running on a processor and storing the values of the registers and program counter, plus any other operating system specific data that may be necessary, in the main memory. For example, to prevent a single process from monopolizing use of a processor, certain parallel processor programs perform a timer tick at intervals ranging from about 60 ticks per second to about 100 ticks per second. If the process running on the processor is not completed, a context switch is performed to save the state of the processor, and a new process (often the task scheduler or “kernel”) is swapped in. As used herein, the kernel refers to a central module of the operating system for the parallel processor. The kernel is typically the part of the operating system that loads first, and it remains in main memory. Typically, the kernel is responsible for memory management and for process and task management.
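The save-and-restore sequence described above can be illustrated with a minimal sketch. This is a toy simulation under stated assumptions, not an actual operating system mechanism: the names `Context`, `Processor`, `run`, and `steps_per_tick` are hypothetical, the processor has only four registers, and each unit of work simply advances the program counter by one.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Saved state of one process (hypothetical, simplified layout)."""
    registers: list = field(default_factory=lambda: [0] * 4)
    program_counter: int = 0


class Processor:
    """Toy processor: four registers plus a program counter."""

    def __init__(self):
        self.registers = [0] * 4
        self.program_counter = 0

    def save(self) -> Context:
        # Copy the live state out to "main memory".
        return Context(list(self.registers), self.program_counter)

    def restore(self, ctx: Context) -> None:
        # Load a previously saved state back into the processor.
        self.registers = list(ctx.registers)
        self.program_counter = ctx.program_counter


def run(work, steps_per_tick=3):
    """Round-robin the processes in `work` (name -> steps needed).

    Returns the completion order and the number of times an unfinished
    process had its state saved at a timer tick.
    """
    cpu = Processor()
    saved = {name: Context() for name in work}
    ready = list(work)
    finished, switches = [], 0
    while ready:
        name = ready.pop(0)
        cpu.restore(saved[name])
        for _ in range(steps_per_tick):
            if cpu.program_counter == work[name]:
                break
            cpu.registers[0] += 1       # stand-in for real computation
            cpu.program_counter += 1
        if cpu.program_counter == work[name]:
            finished.append(name)
        else:
            # Timer tick arrived before completion: save state, requeue.
            saved[name] = cpu.save()
            ready.append(name)
            switches += 1
    return finished, switches
```

In this sketch, a process needing five steps is interrupted once when the tick allows three steps at a time, while a process needing two steps completes without having its state saved.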
Frequent context switches can be quite computationally intensive and time consuming, particularly for processors that have a lot of registers. As used herein, a register refers to a special, high-speed storage area within a processor. Typically, data must be represented in a register before it can be processed. For example, if two numbers are to be multiplied, both numbers must be in registers, and the result is also placed in a register. The register may alternatively contain the address of a memory location where data is to be stored rather than the actual data itself. Registers are particularly advantageous in that they can typically be accessed in a single cycle. Program compilers typically make use of as many software-configurable registers as are available when compiling a program.
One prior art task management system used for Cell processors is known as SPU Threads. A “thread” generally refers to a part of a program that can execute independently of other parts. Operating systems that support multithreading enable programmers to design programs whose threaded parts can execute concurrently. SPU Threads operates by regarding the SPUs in a Cell processor as processors for threads. A context switch may swap out the contents of an SPU's local storage to the main memory and substitute 256 kilobytes of data and/or code into the local storage from the main memory, where the substitute data and code are processed by the SPU. A context switch is the computing process of storing and restoring the state of an SPU or PPE (the context) such that multiple processes can share a single resource. Context switches are usually computationally intensive, and much of the design of operating systems is directed to optimizing the use of context switches.
Unfortunately, interoperating with SPU Threads is not an option for high-performance applications. Applications based on SPU Threads have large bandwidth requirements and are processed from the PPU. Consequently, applications based on SPU Threads are not autonomous and tend to be slow. Because SPU Threads are managed from the PPU, SPU context switching (swapping out the process currently running on an SPU for another waiting process) takes too long. Avoiding PPU involvement in SPU management can lead to much better performance for certain applications.
To overcome these problems a system referred to as SPU Runtime System (SPURS) was developed. In SPURS, the memory of each SPU has loaded into it a kernel that performs scheduling of tasks handled by the SPU. Groups of these tasks are referred to as Tasksets. SPURS is described in PCT Application Publication number WO2007020739 to Keisuke Inoue and Seiji Murata entitled “SCHEDULING METHOD, AND SCHEDULING DEVICE”, and in US Patent Application Publication No. 20050188373, to Keisuke Inoue, Tatsuya Iwamoto and Masahiro Yasue entitled “METHOD AND APPARATUS FOR TASK MANAGEMENT IN A MULTI-PROCESSOR SYSTEM”, and in US Patent Application Publication No. 20050188372 to Keisuke Inoue and Tatsuya Iwamoto entitled “METHOD AND APPARATUS FOR PROCESSOR TASK MIGRATION IN A MULTI-PROCESSOR SYSTEM” and in US Patent Application Publication No. 20060190942 to Keisuke Inoue and Masahiro Yasue entitled “PROCESSOR TASK MIGRATION OVER A NETWORK IN A MULTI-PROCESSOR SYSTEM”, the disclosures of all four of which are incorporated herein by reference.
In traditional multithreading, an operating system (OS) scheduler will pre-empt any running thread to make a scheduling decision. The OS may manage any number of software threads that are configured by applications. Processors have a fixed number of hardware threads that is often fewer than the number of software threads in an application. Consequently, the hardware threads must be shared by the software threads, e.g., by time-slicing or cooperative yielding. The number of hardware threads determines how many software threads can run concurrently. One hardware thread can process only one software thread at a time. The number of hardware threads is dependent on the type of processor hardware involved. For example, the PPE of the Cell processor has two hardware threads. Co-processors and cache may be shared among different software threads. Each thread has associated with it a context, which contains information relevant to the state of execution of the thread. Such information may include the values stored in registers, a program counter value, and the like.
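The sharing of a fixed number of hardware threads among a larger number of software threads by time-slicing can be sketched as follows. This is an illustrative round-robin model only; the function name `time_slice` and its parameters are hypothetical, and a real OS scheduler would also account for priorities, blocking, and cooperative yielding.

```python
from collections import deque


def time_slice(software_threads, hardware_threads, slices):
    """Return, for each time slice, the software threads that run in it."""
    ready = deque(software_threads)
    schedule = []
    for _ in range(slices):
        # Each hardware thread runs at most one software thread at a time.
        count = min(hardware_threads, len(ready))
        running = [ready.popleft() for _ in range(count)]
        schedule.append(running)
        # Pre-empted threads rejoin the back of the ready queue.
        ready.extend(running)
    return schedule
```

With three software threads and two hardware threads (the number of hardware threads the text gives for the PPE), no slice ever runs more than two software threads, and every software thread receives processor time within two slices.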
Current implementations of SPU Threads (also referred to as SPE Threads) for the Cell processor apply a traditional thread model to SPUs. The overhead for context switching these SPU Threads can be very high compared to traditional threads because the entire SPE Local Store (currently 256 KB) and SPU registers (currently about 2 KB) must be saved and restored by the PPE. Certain SPU Threads need to run on a regular basis, e.g., in certain Cell processor-based video game applications a high priority SPU Thread runs every 16 ms. Currently, if a group of ganged SPU Threads is running on all available SPUs, the entire gang must be swapped out even though the high priority SPU Thread may require only one SPU. The purpose of the ganged SPU Thread Group is to allow the SPU Threads to communicate safely with each other's SPU Local Stores. If the group were split up into individual SPU Threads, communication between SPU Local Stores would not be supported by certain operating systems.
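The scale of this overhead can be estimated with simple arithmetic. The sketch below uses the figures given in the text (a 256 KB Local Store and about 2 KB of registers per SPU) and assumes, for illustration only, that each swap writes one outgoing context to main memory and reads one incoming context of the same size. The function names and the gang size of eight are hypothetical choices, with eight taken from the stated maximum number of SPEs.

```python
KIB = 1024
LOCAL_STORE_BYTES = 256 * KIB  # Local Store size given in the text
REGISTER_BYTES = 2 * KIB       # "about 2 KB" of SPU registers per the text


def context_bytes():
    """Bytes of state held by one SPU."""
    return LOCAL_STORE_BYTES + REGISTER_BYTES


def swap_traffic_bytes(spus_swapped):
    """Bytes moved when `spus_swapped` SPUs are each context switched.

    Assumes one save of the outgoing context plus one load of the
    incoming context per SPU.
    """
    return 2 * context_bytes() * spus_swapped


single = swap_traffic_bytes(1)   # only the SPU the urgent thread needs
ganged = swap_traffic_bytes(8)   # an entire gang occupying eight SPUs
print(single, ganged)
```

Under these assumptions, swapping a full gang of eight moves eight times the data that swapping the single required SPU would, which is the cost the ganging requirement imposes each time the high priority thread must run.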
In addition to being detrimental to performance, context switches are often unnecessary. For example, an application often has its context resident in main memory. The space available in SPU Local Store is usually not big enough to store everything an SPU will need. Consequently some data and/or code may be stored in main memory where it is managed by the application. It is therefore often redundant for the operating system to do the context switch.
It is within this context that embodiments of the present invention arise.