1) Field of the Invention
The present invention relates to a technology for creating load modules for a program that is executed by a multiprocessor computer system.
2) Description of the Related Art
Most present-day computer systems are provided with a plurality of processors to which parts of a program are distributed in order to enhance processing efficiency. Such multiprocessors can be broadly categorized into shared-memory multiprocessors and distributed-memory multiprocessors.
FIG. 1 is a schematic diagram of a computer system that employs a shared-memory multiprocessor system. Each of the n number of processor elements (hereinafter, “PE”) 100 has a processor 101 and a cache 102.
The cache 102 has a cache memory that is much smaller than the main memory but can be read from and written to at high speed. The cache 102 reads from or writes to the cache memory or the main memory in response to a read/write request from the processor 101. When doing so, the cache 102 keeps in the cache memory a copy of the contents (value) of the main-memory area that was read or written, in order to exploit the locality of reference at the time of program execution. Subsequent reading and writing can therefore be carried out speedily by accessing the cache memory instead of the main memory.
FIG. 2 is a schematic diagram of a computer system that employs a distributed-memory multiprocessor system. The n number of processor elements (PE #1 to PE #n) 200, each of which includes a processor 201 and a memory 202, are connected via an interconnection network 203.
FIG. 3 is a schematic diagram of memory space definition in the computer system shown in FIG. 2. Each processor 201 reads from and writes to the memory 202 of its own processor element 200.
In the systems that utilize distributed-memory multiprocessors, programs based on single-program multiple-data (SPMD) programming are mainly executed by using a transmitting mechanism, such as a message-passing interface (MPI).
FIG. 4 shows a sample program. The program is distributed to the n number of memories 202 and is executed by the respective processors 201. Even though a single program is being executed, the process branches according to the identification number (ID) of the processor element 200, so that parallel processing by the n number of processor elements 200 takes place.
For instance, in the sample program of FIG. 4, ‘my_rank’ is the ID. In the processor element other than that in which my_rank=0, the process under ‘if’ is executed. In the processor element in which my_rank=0, the process under ‘else’ is executed.
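The branching on 'my_rank' described above can be illustrated with a minimal sketch. The code below is not the program of FIG. 4; it is a hypothetical stand-in that simulates n processor elements in a single Python process, and the names spmd_program and run_all, as well as the returned strings, are assumptions made only for illustration.

```python
# Hypothetical stand-in for an SPMD program (not the actual program of FIG. 4).
def spmd_program(my_rank, n):
    """The same program text is executed by every processor element."""
    if my_rank != 0:
        # Branch taken by every processor element other than my_rank == 0.
        return f"PE {my_rank}: executes the 'if' branch"
    else:
        # Branch taken only by the processor element with my_rank == 0.
        return f"PE {my_rank}: executes the 'else' branch for {n - 1} other PEs"


def run_all(n):
    # Simulate n processor elements, each running the identical program.
    return [spmd_program(rank, n) for rank in range(n)]


results = run_all(4)
for line in results:
    print(line)
```

Every simulated processor element holds the whole function, including the branch it never takes, which is the property of single-program multiple-data programming that the later discussion of memory requirements relies on.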
FIG. 5 is a flowchart that explains process steps of a load-module creation for the sample program shown in FIG. 4. First, a source code of the program is converted into an assembly code using a compiler (steps S501 to S503). An object is created from the assembly code using an assembler (steps S504 to S506). Plural objects are linked using a linker to create a load module for the program (steps S507 to S510).
(1) The shared-memory multiprocessor system needs to solve the problem of preservation of cache consistency as described in detail below:
Even though the processing speed of the system is enhanced by providing a cache 102 for each processor in a multiprocessor system, there is a disadvantage to it. When plural cache memories are involved, there is a possibility that the value of the memory area designated by the same address may not match between the cache memories and the main memory. As a result, when any of the processors accesses any memory area of the main memory, there is no guarantee that the latest value stored in that memory area is returned. This is what is known as the cache coherence problem.
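The mismatch between cache memories and the main memory can be reproduced with a toy model. The sketch below assumes write-through private caches with no consistency mechanism at all; the class name Cache, the address 0x100, and the stored values are hypothetical and chosen only for illustration.

```python
class Cache:
    """Toy private cache without any consistency mechanism (illustrative only)."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.lines = {}  # address -> locally cached copy

    def read(self, addr):
        if addr not in self.lines:
            # Miss: fetch from main memory and keep a copy.
            self.lines[addr] = self.main_memory[addr]
        # Hit: served from the cache without accessing main memory.
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.main_memory[addr] = value  # assumed write-through policy


main_memory = {0x100: 1}
cache_a = Cache(main_memory)  # cache of one processor
cache_b = Cache(main_memory)  # cache of another processor

first_read_b = cache_b.read(0x100)  # second processor caches the value 1
cache_a.write(0x100, 2)             # first processor updates the same address
stale_read_b = cache_b.read(0x100)  # second processor still sees its old copy

print(first_read_b, main_memory[0x100], stale_read_b)
```

After the write, the main memory holds the new value while the second cache continues to return the old one, which is exactly the situation a cache consistency mechanism is meant to prevent.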
Conventionally, the coherence problem was countered by providing a physical mechanism called a ‘cache consistency mechanism’. This mechanism is based on a cache consistency protocol that monitors the location of data (hereinafter “shared data”) read and written by different processes of a program, prevents the use of stale cached data that predates an update, and thereby preserves cache consistency.
FIG. 6 is an explanatory drawing that shows a memory map in the case in which cache consistency is preserved using the cache consistency mechanism. A text area 600 holds instruction strings of a program, and a data area 601 holds data (both private and shared data) that is read or written by the program.
Both the areas, that is, the text area 600 and the data area 601, are cache target areas. In other words, data in the text area 600 and the data area 601 can be copied to the cache memory. Consequently, the shared data is copied to the cache memory of each of the plural processors that execute part of the program, and the values in all the cache memories are kept consistent with those of the main memory by the cache consistency mechanism.
However, using hardware as a cache consistency mechanism for maintaining consistency between the values of the cache memories and the main memory is complex and is bound to make the processor circuitry bulky.
This did not pose much of a problem in the past as shared-memory multiprocessors were mainly used in high-end products. However, if shared-memory multiprocessors are to be made popular by providing them in printers, digital cameras, digital televisions, and the like, it is imperative that the processors are not made bulky or heavy for the sole purpose of maintaining cache consistency. Also, the product cost should not go up because of the number of processors used.
(2) The distributed-memory multiprocessor system needs to solve the problem of resolving addresses that straddle memory spaces as described in detail below:
The system employing the distributed-memory multiprocessor shown in FIG. 2 was conventionally built using plural chips (and plural boards) due to limitations in the semiconductor integrated circuit technology that existed in the past. However, due to advancements in the semiconductor technology in recent years, it has become possible to pack plural processor elements 200 in one chip.
Conventionally, when it was not possible to pack plural processor elements in one chip, data transfer was done by a packet transmission system. However, when plural processor elements are packed in one chip, the data exchange between the processor elements 200 via the interconnection network 203 can be speedily performed by employing a shared memory for storing and loading of data. A system provided with a shared memory that plural processors can read from and write to is called a distributed shared-memory multiprocessor system.
FIG. 7 is a schematic diagram of a computer system that employs a distributed shared-memory multiprocessor system. Unlike the distributed-memory multiprocessor system shown in FIG. 2, the distributed shared-memory multiprocessor system has two types of memory 702, namely, a shared memory (SM) that can be accessed by processors of other processor elements as well, and a local memory (LM) that can be accessed by only that processor which is contained in the same processor element.
FIG. 8 is an explanatory drawing that shows an example of memory space definition in the distributed shared-memory multiprocessor system shown in FIG. 7. The shared memory of the first processor element (PE #1) is allocated in an overlapping manner in the memory space of the processor element PE #0 and the processor element PE #1.
Let us assume that the shared memory of the processor element PE #1 is allocated at the address 0x3000 in the memory space of the processor element PE #0 and at the address 0x2000 in the memory space of the processor element PE #1. With this assumption, when the processor element PE #0 writes data to the address 0x3000, the processor element PE #1 can read the same data from the address 0x2000, thus effecting data transfer between the processor element PE #0 and the processor element PE #1.
The shared memory of each of the processor elements PE #1 through PE #n is allocated in the memory space of the processor element PE #0. Therefore, the processor element PE #0 is capable of referring to or altering the data in the shared memory of the other processor elements. However, as the memory of the other processor elements is not allocated in the memory space of the processor elements PE #1 through PE #n, these processor elements can refer to or alter data in only their own local memory and shared memory.
Like the computer system using the distributed-memory multiprocessor system, the computer system employing the distributed shared-memory multiprocessor system can also execute the program, shown in FIG. 4, based on single-program multiple-data programming.
However, whether it is a distributed-memory multiprocessor system or a distributed shared-memory multiprocessor system, the entire program is distributed to each of the processor elements, even though only a part of the program is executed by each of the processors. Since the entire program needs to be stored in each processor element, the memory requirement of each processor element increases, which results in an increase in cost.
The problem of storing the entire program in all the processor elements can be circumvented, at least in the distributed-memory multiprocessor system, by creating programs based on multiple-program multiple-data programming (MPMD) instead of single-program multiple-data.
Unlike the single-program multiple-data programming in which a program resides in all the processor elements, in multiple-program multiple-data based programming, separate programs to be executed by specific processor elements are created. FIG. 9 is a sample program executed by the processor element PE #0 and FIG. 10 is a sample program executed by the processor elements PE #1 through PE #n. As the program to be executed by a particular processor element is exclusive to that processor element, the memory requirement can be reduced to that extent. The load modules of these programs are created according to the sequence of steps shown in the flowchart in FIG. 5.
On the other hand, in the distributed shared-memory multiprocessor system, data stored in an area is accessed by plural processor elements. The address in the memory space of the area being accessed is different for each processor element. Consequently, when resolving addresses using the linker, the address has to be changed for each processor element even though the same area is accessed. However, this function is not available in the conventional linker.
As a result, all the programs that can be run in a computer system with the distributed shared-memory multiprocessor system can only be created by single-program multiple-data programming. Consequently, in the distributed shared-memory multiprocessor system, even though there may be portions of the program that will not be executed by a particular processor element, the entire program needs to be distributed to all the processor elements, necessitating more memory.
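To make the missing function concrete, the sketch below contrasts resolving a symbol against a single base address with resolving it separately for each processor element. It is an illustrative model only, not an actual linker and not the method of any particular tool; the base addresses reuse the allocation assumed for FIG. 8, and the symbol name shared_buf, its offset, and the function names are hypothetical.

```python
# Base address of the same shared area as seen from each processor element
# (values taken from the allocation assumed for FIG. 8).
SHARED_AREA_BASE = {0: 0x3000, 1: 0x2000}

# Hypothetical symbol placed in the shared area, with its offset in that area.
SYMBOL_OFFSET = {"shared_buf": 0x10}


def resolve_single(symbol, link_base):
    """One base address for the whole link, hence one resolved address."""
    return link_base + SYMBOL_OFFSET[symbol]


def resolve_per_pe(symbol, pe):
    """Resolution that depends on which processor element runs the code."""
    return SHARED_AREA_BASE[pe] + SYMBOL_OFFSET[symbol]


single = resolve_single("shared_buf", SHARED_AREA_BASE[0])
per_pe = {pe: resolve_per_pe("shared_buf", pe) for pe in SHARED_AREA_BASE}

print(hex(single))
print({pe: hex(addr) for pe, addr in per_pe.items()})
```

With a single resolved address, the value is correct for only one of the two processor elements; the per-element table shows the two different addresses that the same symbol would need.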