The Cell Broadband Engine Architecture defines a new processor structure based upon the 64-bit Power Architecture technology, but with unique features directed toward distributed processing and media-rich applications. The Cell Broadband Engine Architecture defines a single-chip multiprocessor consisting of one or more Power Processing Elements (PPEs) and multiple high-performance SIMD Synergistic Processor Elements (SPEs).
The IBM Software Development Toolkit (SDK) for Cell Broadband Engine (Cell BE) is a complete package of tools that gives developers first-hand experience with the Cell BE processor. The SDK is composed of development tool chains, software libraries and sample source code, a system simulator, and a Linux kernel, which together fully support the capabilities of the Cell BE.
FIG. 1 is a block diagram of the structure of the Cell Broadband Engine. As shown in FIG. 1, the Cell BE has one PPE (PowerPC Processor Element) and eight SPEs (Synergistic Processor Elements). The PPE is a 64-bit PowerPC Architecture core. It is fully compliant with the 64-bit PowerPC architecture and runs 32-bit and 64-bit operating systems and applications. The SPEs are independent processors, each running its own separate application programs. The PPE and the SPEs communicate with each other, and with the main storage and I/O devices, through an Element Interconnect Bus (EIB).
FIG. 2 is a block diagram of the structure of the PPE. As shown in the figure, the PPE contains two main components, a Power Processing Unit (PPU) and a Power Processor Storage Subsystem (PPSS).
FIG. 3 is a block diagram of the structure of the SPE. As shown in the figure, the SPE contains two components, a synergistic processor unit (SPU) and a memory flow controller (MFC). The MFC contains a DMA controller which supports DMA transfer.
The PPU accesses the main storage with load and store instructions that go between a private register file and the main storage. However, the SPUs access the main storage with direct memory access (DMA) commands that go between the main storage and a private local store used to store both instructions and data. SPU instruction-fetches and load and store instructions access this private local store, rather than the shared main storage.
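The split between main storage and the private local store can be illustrated with a small host-side model. This is only a sketch under stated assumptions: the type spe_model_t and the functions dma_get, dma_put and spu_increment are hypothetical names introduced here, and memcpy merely stands in for a DMA transfer. None of this is the MFC programming interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* local store size stated in the text */

/* Hypothetical model of one SPE: the SPU can only touch local_store directly. */
typedef struct {
    uint8_t local_store[LOCAL_STORE_BYTES];
} spe_model_t;

/* Returns 1 if [ls_offset, ls_offset + size) lies inside the local store. */
static int ls_range_ok(size_t ls_offset, size_t size)
{
    return ls_offset <= LOCAL_STORE_BYTES &&
           size <= LOCAL_STORE_BYTES - ls_offset;
}

/* Stand-in for a DMA "get": main storage -> local store. Returns 0 on success. */
static int dma_get(spe_model_t *spe, size_t ls_offset,
                   const uint8_t *main_storage, size_t size)
{
    if (!ls_range_ok(ls_offset, size))
        return -1;  /* transfer would not fit in the local store */
    memcpy(spe->local_store + ls_offset, main_storage, size);
    return 0;
}

/* Stand-in for a DMA "put": local store -> main storage. Returns 0 on success. */
static int dma_put(const spe_model_t *spe, size_t ls_offset,
                   uint8_t *main_storage, size_t size)
{
    if (!ls_range_ok(ls_offset, size))
        return -1;
    memcpy(main_storage, spe->local_store + ls_offset, size);
    return 0;
}

/* SPU-side work: loads and stores touch only the local store. */
static void spu_increment(spe_model_t *spe, size_t ls_offset, size_t size)
{
    assert(ls_range_ok(ls_offset, size));
    for (size_t i = 0; i < size; i++)
        spe->local_store[ls_offset + i]++;
}
```

In this model, data in main storage is first copied in with dma_get, processed in place by spu_increment, and copied back with dma_put; the SPU-side routine never dereferences a main-storage pointer.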
The PPE and each SPE communicate through three main mechanisms supported by the MFC of that SPE: mailboxes, signal notification registers, and DMA transfers. Mailboxes are queues for exchanging 32-bit messages. Two mailboxes are provided for sending messages from the SPE to the PPE, and one mailbox is provided for sending messages from the PPE to the SPE. Signal notification registers are used to send signal notifications from the PPE to the SPE. DMA transfers between the local store of the SPE and the main storage can be initiated by the SPU of that SPE, by the PPE, or by another SPE.
Each SPU contains a RISC core, and a 256 KB, software-controlled local store for instructions and data. The SPUs support a special SIMD instruction set, and rely on asynchronous DMA transfers to move data and instructions between the main storage and their local stores.
A PPE program starts an SPE program by creating a thread on the SPE using, for example, a spe_create_thread call, which invokes an SPU runtime management library. The spe_create_thread call loads the program image into the SPE local store (LS), sets up the SPE environment, starts the SPE program, and then returns a pointer to the SPE's new thread ID. These steps incur considerable overhead. The following is exemplary pseudocode for spe_create_thread:
speid_t spe_create_thread(spe_program_handle handle)
{
    create a directory called /spu/spe-xxx (xxx is a unique name) to represent
        the SPE which will execute the program identified by handle;
    create a file /spu/spe-xxx/mem to represent the local store of the SPE;
    create a file /spu/spe-xxx/mbox to represent a mailbox channel to the SPE;
    create a file /spu/spe-xxx/sig1 to represent a signal notification channel 1 to the SPE;
    create a file /spu/spe-xxx/sig2 to represent a signal notification channel 2 to the SPE;
    write the content of the program into the file /spu/spe-xxx/mem;
    set the instruction register of the SPE to point to the beginning of
        the program;
    start the SPE which will execute the program.
}
In this exemplary spe_create_thread, a directory is created first, and then a series of files is created in that directory. The content of the program is then written into the file /spu/spe-xxx/mem, thus loading the program image into the local store of the SPE. The instruction register of the SPE is set to point to the beginning of the program image, and the SPE is started to execute the program image. The spe_create_thread call is therefore expensive, and the PPE spends considerable time setting up the environment of the program.
On the other hand, the SPE has only a 256 KB local store, so the program image size cannot exceed 256 KB, and the programmer must take care to respect this limit. The programmer therefore has to divide a large program into separate pieces, each of which is a standalone SPE image. After the execution of one SPE image finishes, the SPE is released and waits to be called again. The following procedure is then repeated for each piece:
    1. The PPU calls spe_create_thread to start an SPE running;
    2. The SPE runs the program image;
    3. After the run finishes, the SPE is released.
It follows that, if the program is very large, the PPE needs to call spe_create_thread frequently, and the cumulative overhead becomes very heavy.
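How the setup overhead grows with program size can be estimated with a simple model: one spe_create_thread call per standalone image, each paying the same setup cost. The sketch below is illustrative only. The function names images_needed and total_setup_cost are hypothetical, and both the per-call cost and the usable capacity per piece are assumed inputs rather than measured values; in practice a piece would have to be smaller than 256 KB, because the local store also holds data.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* local store size stated in the text */

/* Number of standalone SPE images needed when each image may hold at most
 * piece_capacity bytes of the program (ceiling division). */
static size_t images_needed(size_t program_bytes, size_t piece_capacity)
{
    assert(piece_capacity > 0 && piece_capacity <= LOCAL_STORE_BYTES);
    return (program_bytes + piece_capacity - 1) / piece_capacity;
}

/* Cumulative setup overhead, assuming one spe_create_thread call per image
 * and a fixed, hypothetical cost per call (in any time unit). */
static double total_setup_cost(size_t program_bytes, size_t piece_capacity,
                               double cost_per_call)
{
    return (double)images_needed(program_bytes, piece_capacity) * cost_per_call;
}
```

Under this model the overhead scales linearly with the number of pieces: a 1 MB program split into 256 KB pieces needs four calls, and one additional byte of program forces a fifth.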
Obviously, there exists a demand in the art for speeding up the program image loading and running.