A major advance in electronic computation has been the development of systems that can perform multiple operations simultaneously. Such systems are said to perform parallel processing. Recently, cell processors have been developed to implement parallel processing on electronic devices ranging from handheld game devices to mainframe computers. A typical cell processor has a power processor unit (PPU) and up to 8 additional processors referred to as synergistic processing units (SPUs). Each SPU is typically a single chip or part of a single chip containing a main processor and a co-processor. All of the SPUs and the PPU can access a main memory, e.g., through a memory flow controller (MFC). The SPUs can perform parallel processing of operations in conjunction with a program running on the main processor. A small local memory (typically about 256 kilobytes) is associated with each of the SPUs. This memory must be managed by software, which transfers code and data to and from the local SPU memories.
The SPUs have a number of advantages in parallel processing applications. For example, the SPUs are independent processors that can execute code with minimal involvement from the PPU. Each SPU has a high direct memory access (DMA) bandwidth to RAM. An SPU can typically access the main memory faster than the PPU. In addition, each SPU has relatively fast access to its associated local store. The SPUs also have limitations that can make it difficult to optimize SPU processing. For example, the SPUs have no coherent memory and no hardware cache. In addition, common programming models do not work well on SPUs.
A typical SPU process involves retrieving code and/or data from the main memory, executing the code on the SPU to manipulate the data, and outputting the data to main memory or, in some cases, another SPU. To achieve high SPU performance, it is desirable to optimize the above SPU process in relatively complex processing applications. For example, in applications such as computer graphics processing, SPUs typically execute tasks thousands of times per frame.
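The retrieve, execute, and output steps described above can be sketched as follows. This is a minimal illustration only, assuming that main memory and the local store are modeled as plain byte buffers; the names `run_task`, `kernel`, `src`, and `dst` are hypothetical and do not correspond to any actual cell processor interface.

```python
# Toy model of one SPU task: fetch from main memory, process locally, write back.
# All names here are hypothetical; real hardware would use DMA transfers.
LOCAL_STORE_SIZE = 256 * 1024  # bytes, following the "about 256 kilobytes" figure


def run_task(main_memory: bytearray, src: int, dst: int, size: int, kernel) -> None:
    """Copy `size` bytes in, apply `kernel` to them, and copy the result out."""
    if size > LOCAL_STORE_SIZE:
        raise ValueError("working set does not fit in the local store")

    # Step 1: retrieve data from main memory (stands in for a DMA transfer in).
    local_store = bytearray(main_memory[src:src + size])

    # Step 2: execute code on the SPU to manipulate the data.
    result = bytes(kernel(local_store))

    # Step 3: output the data to main memory (stands in for a DMA transfer out).
    if dst + len(result) > len(main_memory):
        raise ValueError("result does not fit in main memory at dst")
    main_memory[dst:dst + len(result)] = result
```

Because every step is explicit, the software, rather than a hardware cache, decides what occupies the local store at any moment.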
One prior art task management system used for cell processors is based on a software concept referred to as “threads”. A “thread” generally refers to a part of a program that can execute independently of other parts. Operating systems that support multithreading enable programmers to design programs whose threaded parts can execute concurrently. When a thread is interrupted, a context switch may swap out the contents of an SPU's local storage to the main memory and substitute 256 kilobytes of data and/or code into the local storage from the main memory, where the substitute data and code are processed by the SPU. A context switch is the computing process of storing and restoring the state of an SPU or PPU (the context) such that multiple processes can share a single resource.
A typical context switch involves stopping a program running on a processor and storing the values of the registers and the program counter, plus any other operating system specific data that may be necessary. For example, to prevent a single process from monopolizing use of a processor, certain parallel processor programs perform a timer tick at intervals ranging from about 60 ticks per second to about 100 ticks per second. If the process running on the processor is not completed, a context switch is performed to save the state of the processor and a new process (often the task scheduler or “kernel”) is swapped in. As used herein, the kernel refers to a central module of the operating system for the parallel processor. The kernel is typically the part of the operating system that loads first, and it remains in main memory. Typically, the kernel is responsible for memory management and for process and task management.
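By way of illustration, the save and restore described above may be sketched as follows. This is a simplified model under stated assumptions: the four-register `Processor`, the `Context` record, and the function names are hypothetical, and operating system specific data is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Context:
    """Saved state of one process: register values and program counter."""
    registers: List[int]
    program_counter: int


@dataclass
class Processor:
    """Hypothetical processor with a tiny register file for illustration."""
    registers: List[int] = field(default_factory=lambda: [0] * 4)
    program_counter: int = 0


def tick_interval_ms(ticks_per_second: float) -> float:
    """Time between timer ticks, in milliseconds."""
    return 1000.0 / ticks_per_second


def context_switch(cpu: Processor, saved: Dict[str, Context],
                   outgoing: str, incoming: str) -> None:
    """Store the outgoing process's state, then load the incoming one's."""
    # Save: copy every register and the program counter out of the processor.
    saved[outgoing] = Context(list(cpu.registers), cpu.program_counter)
    # Restore: a process that has never run starts from a zeroed context.
    fresh = Context([0] * len(cpu.registers), 0)
    nxt = saved.get(incoming, fresh)
    cpu.registers = list(nxt.registers)
    cpu.program_counter = nxt.program_counter
```

At 60 to 100 ticks per second, `tick_interval_ms` gives a tick roughly every 10 to 17 milliseconds, and each tick that interrupts an unfinished process incurs the full register copy in both directions.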
Frequent context switches can be quite computationally intensive and time consuming, particularly for processors that have a lot of registers. As used herein, a register refers to a special, high-speed storage area within the processor. Typically, data must be represented in a register before it can be processed. For example, if two numbers are to be multiplied, both numbers must be in registers, and the result is also placed in a register. The register may alternatively contain the address of a memory location where data is to be stored rather than the actual data itself. Registers are particularly advantageous in that they can typically be accessed in a single cycle. Program compilers typically make use of as many software-configurable registers as are available when compiling a program.
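The multiplication example above can be illustrated with a toy load, multiply, and store sequence. The three-register file and the function name `multiply_via_registers` are assumptions made purely for illustration.

```python
from typing import List


def multiply_via_registers(memory: List[int], addr_a: int, addr_b: int,
                           addr_out: int) -> List[int]:
    """Multiply two values from memory, passing every operand through a register."""
    registers = [0] * 3                          # hypothetical 3-register file
    registers[0] = memory[addr_a]                # load first operand
    registers[1] = memory[addr_b]                # load second operand
    registers[2] = registers[0] * registers[1]   # result also lands in a register
    memory[addr_out] = registers[2]              # store result back to memory
    return registers
```

Every value the operation touches is live in a register at some point, which is why all of them form part of the state a context switch must preserve.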
The number of registers that a processor has and the size of each register (number of bits) affect the power and speed of the processor. For example, a 32-bit processor is one in which each register is 32 bits wide. Therefore, each processor instruction can manipulate 32 bits of data. Although large register sizes allow faster processing, larger registers take longer to store during a context switch, particularly if there are a large number of them. For example, in certain types of cell processors, the SPU may have 128 registers that are each 128 bits wide. If all these registers are used by one context, storing the contents of the registers can take a lot of time, even if the contents of the registers can be stored on the SPU local store. However, the SPU local store is relatively small and it may be necessary to store the contents of the registers in main memory, which takes even more time. Thus, it is desirable to avoid such context switches.
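The amount of state involved can be worked out directly from the figures above. The following sketch computes only the size of the register file; it makes no claim about how long the transfer takes. The 32-register comparison case is a hypothetical example, not a figure from any particular processor.

```python
LOCAL_STORE_BYTES = 256 * 1024  # "about 256 kilobytes" per SPU


def register_file_bytes(num_registers: int, bits_per_register: int) -> int:
    """Total bytes needed to hold every register of one context."""
    return num_registers * bits_per_register // 8


# SPU case from the text: 128 registers, each 128 bits wide.
spu_context_bytes = register_file_bytes(128, 128)        # 2048 bytes

# Hypothetical comparison: 32 registers, each 32 bits wide.
small_context_bytes = register_file_bytes(32, 32)        # 128 bytes

# Share of the local store consumed by one saved SPU register context.
local_store_share = spu_context_bytes / LOCAL_STORE_BYTES  # 0.0078125
```

Under these assumptions, a full SPU register context is 2048 bytes, sixteen times the hypothetical 32-by-32 case, and each saved context occupies 1/128 of the local store.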
One prior art technique for avoiding context switches is to split the available registers for a processor amongst multiple threads. Since threads can operate independently, the available registers may be divided up amongst the various threads of a program. For example, 128 registers for an SPU may be divided into two or more groups (e.g., two groups of 64, four groups of 32, etc.). The different groups of registers may be explicitly assigned to different program threads at compile time and these different program threads may run on the SPU simultaneously. The contents of registers assigned to a particular software thread need not be swapped out, e.g., during direct memory access (DMA). Unfortunately, each group of registers has to be explicitly assigned to a thread at compile time since the use of registers is not indexed. Consequently, this technique does not allow general threads to be reassigned to different registers during runtime.
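The fixed division of registers described above may be sketched as follows. This is an illustrative model only: the function `partition_registers` and the thread names are hypothetical, and an actual compiler would make the assignment while generating code rather than through a lookup table.

```python
from typing import Dict, List


def partition_registers(num_registers: int, thread_names: List[str]) -> Dict[str, range]:
    """Assign each thread a fixed, equal, non-overlapping block of registers."""
    count = len(thread_names)
    if count == 0 or num_registers % count != 0:
        raise ValueError("registers must divide evenly among the threads")
    group = num_registers // count
    # The mapping is computed once, mirroring a compile-time assignment;
    # nothing in this model allows it to change while the threads run.
    return {name: range(i * group, (i + 1) * group)
            for i, name in enumerate(thread_names)}
```

For example, `partition_registers(128, ["a", "b"])` gives thread `a` registers 0 to 63 and thread `b` registers 64 to 127. Because each thread only ever touches its own block, switching between the two requires no register save, but the blocks cannot be rebalanced at runtime.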
In some prior art techniques, certain special-purpose registers, such as stack pointers, are physically assigned in hardware to the kernel. However, even in these techniques, the contents of general purpose registers (i.e., registers that are configurable in software) must be stored by a context switch when control of the processor is handed over to the kernel.
Thus, there is a need in the art for a task scheduling method that avoids excessive use of context switches.