1. Field of the Invention
The present invention relates to a processor, a coprocessor, and an expansion board having such processors mounted thereon which are used in an information processing apparatus such as a personal computer.
2. Description of the Related Art
Current processor architectures are oriented toward the reduced instruction set computer (simply called a RISC system). The RISC system is characterized in that it contains no complicated instructions and keeps its instructions constant in length. Since each instruction is simple, it is processed quickly. Since each instruction has a constant length, the RISC system can read an instruction with a single access to memory. These features enhance the processing speed.
In order to improve processing performance, RISC processors currently tend to employ a super scalar system. In the super scalar system, the operation units inside the processor are dynamically scheduled by hardware while the operations are being executed. This system has the advantage that existing software resources can be used without any change. It has the disadvantage that, since the scheduling is done during execution, only limited information is available for scheduling, and the parallel level therefore cannot be raised very much.
To achieve higher performance than the super scalar system, a very long instruction word system (simply called a VLIW system) has been proposed. In this system, the operation units contained in the processor are statically scheduled by software when the program is compiled, a group of instructions to be executed in parallel is gathered into one instruction set, and the processor reads and executes that instruction set at one time when it runs the program. The VLIW system does not need hardware for the scheduling operation and has the advantage that the parallel level can be increased. On the other hand, it has the disadvantage that the instruction length becomes larger since plural instructions are gathered into one set.
The super scalar system and the VLIW system are introduced in "VLIW : The wave of the Future?" MICROPROCESSOR REPORT, pages 18 to 21, Feb. 14, 1994.
In general, a program contains a portion of high parallel level and a portion of low parallel level. A simple example of each portion will be described with reference to FIGS. 1 and 2. In these Figures, each row represents one process. For example, a first row process 800a shown in FIG. 1 indicates the flow of adding 1 to the content of a variable X0 and assigning the result to a variable Y0. As written, these processes are executed one by one.
FIG. 1 shows a portion of high parallel level. In this portion, the processes are independent of one another, so that they can be executed in parallel. For example, the value of a variable X1 used in a second row process 800b is determined before the execution of the first row process 800a. Hence, the first row process 800a and the second row process 800b can be executed in parallel.
On the other hand, FIG. 2 shows a portion of low parallel level. In this portion, the value used by each process is calculated by the immediately preceding process. Hence, the current process cannot be started until the preceding process is terminated. For example, the value of a variable X1 used in the second row process 810b is not defined until the first row process 810a has been executed. Normally, the second row process 810b is not allowed to start before the first row process 810a is terminated.
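The difference between the two portions can be illustrated by a minimal sketch. The two programs below are hypothetical stand-ins written in the spirit of FIGS. 1 and 2 (the Figures themselves are not reproduced here), and the function name `earliest_steps` is an assumption introduced only for illustration. The sketch tracks only the dependence of a process on a value produced by an earlier process.

```python
# Each process is (destination, source) and means "destination = source + 1".
# Both programs are hypothetical examples, not the contents of FIGS. 1 and 2.
independent = [("Y0", "X0"), ("Y1", "X1"), ("Y2", "X2"), ("Y3", "X3")]
chained = [("X1", "X0"), ("X2", "X1"), ("X3", "X2"), ("X4", "X3")]


def earliest_steps(program):
    """Return, for each process, the earliest step at which it may start."""
    ready_at = {}  # variable name -> first step at which its value is available
    steps = []
    for dest, src in program:
        # A source never written by the program is available from step 0.
        step = ready_at.get(src, 0)
        steps.append(step)
        ready_at[dest] = step + 1
    return steps


print(earliest_steps(independent))  # [0, 0, 0, 0]
print(earliest_steps(chained))      # [0, 1, 2, 3]
```

In the independent program every process may start at step 0, whereas in the chained program each process must wait for the one before it.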
FIG. 3 shows VLIW system instructions (simply called VLIW instructions) transformed from the program shown in FIG. 1. FIG. 4 shows VLIW instructions transformed from the program shown in FIG. 2. These VLIW instructions enable the execution of up to four processes at one time. In these Figures, one row corresponds to one VLIW instruction. The instructions are executed in time order from top to bottom. The VLIW instructions shown in FIG. 3 are transformed from a portion of high parallel level (see FIG. 1). In these instructions, one instruction can execute four processes at a time. Hence, these VLIW instructions realize four times the performance of a conventional processor having only a single processing unit.
On the other hand, the VLIW instructions shown in FIG. 4 are transformed from a portion of low parallel level (see FIG. 2). In these instructions, one instruction enables the execution of only one process at a time. Hence, even a processor employing the VLIW system (simply called a VLIW processor) that is capable of executing four processes at a time realizes only the same performance as a processor having only a single processing unit. Since the instruction length has to be constant, a no-operation instruction (simply called an NOP), which indicates that no operation is performed, must be inserted into each portion having no process to be executed. Hence, the instruction is larger than the content of the actual process.
When the VLIW processor executes a general program, the proportion of NOPs becomes very high. This means that the NOPs occupy much of the main storage of an information processing apparatus having a VLIW processor (simply called a VLIW system) or much of an instruction cache memory (simply called a cache memory) located inside the VLIW processor. The space of the main storage is wasted by the NOPs, or the volume of the cache memory must be made very large. This results in problems such as the following: the performance of the VLIW processor is not as high as expected, the VLIW system becomes too costly, and the VLIW processor chip becomes too large and too costly.
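The effect of the constant instruction length can be sketched as follows. This is a simplified illustration, not the compiler scheduling of any actual VLIW system: the slot count `WIDTH`, the greedy packing in `pack_vliw`, the helper `nop_ratio`, and the two example programs are all assumptions made for this sketch.

```python
WIDTH = 4    # assumed number of processes per VLIW instruction
NOP = "NOP"  # placeholder for a slot with no process to execute


def pack_vliw(program, width=WIDTH):
    """Greedily pack (dest, src) processes into constant-length instructions."""
    ready_at = {}  # variable -> index of the first instruction that may read it
    words = []     # words[i] holds the processes placed in instruction i
    for dest, src in program:
        row = ready_at.get(src, 0)
        # Move to a later instruction if the earliest possible one is full.
        while row < len(words) and len(words[row]) >= width:
            row += 1
        while len(words) <= row:
            words.append([])
        words[row].append(f"{dest}={src}+1")
        ready_at[dest] = row + 1
    # Pad every instruction with NOPs up to the constant length.
    return [w + [NOP] * (width - len(w)) for w in words]


def nop_ratio(words):
    """Fraction of all slots that are occupied by NOPs."""
    slots = [slot for word in words for slot in word]
    return slots.count(NOP) / len(slots)


# Hypothetical programs in the spirit of FIGS. 1 and 2.
independent = [("Y0", "X0"), ("Y1", "X1"), ("Y2", "X2"), ("Y3", "X3")]
chained = [("X1", "X0"), ("X2", "X1"), ("X3", "X2"), ("X4", "X3")]

for word in pack_vliw(chained):
    print(word)
print(nop_ratio(pack_vliw(independent)))  # 0.0
print(nop_ratio(pack_vliw(chained)))      # 0.75
```

Under these assumptions the independent program fits into a single instruction with no NOP, while the chained program needs four instructions in which three of every four slots are NOPs.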
The problems of the VLIW processor are described in "Basic Arrangement of Reconstructed VLSI Computer based on Execution Delay", Reports of Information Processing Society, Computer Architecture, Nos. 89 to 13, pages 87 to 93, Jul. 19, 1991, Information Processing Society.
The main storage of the VLIW system may be used effectively by not loading the NOPs into the main storage. The technique of saving memory volume by deleting the NOPs from the main storage is briefly described in "A VLIW Architecture for a Trace Scheduling Compiler" IEEE, TRANSACTION ON COMPUTERS, VOL 37, No. 8, pages 967 to 979, August 1988.
In order to reduce the volume of the cache memory contained in the VLIW processor, there has been proposed a technique that takes the steps of compressing the instruction when it is stored in the cache memory, reading the compressed instruction out of the cache memory, and decompressing the instruction. This technique is briefly described in "Phillips Hopes to Displace DSPs with VLIW", MICROPROCESSOR REPORT, pages 12 to 15, Dec. 5, 1994, Micro Design Resources.
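One simple way to picture such compression is to keep a bit mask of the occupied slots and store only the non-NOP operations. The sketch below is a hypothetical scheme chosen for illustration; it is not the encoding used in the cited report, and the names `compress` and `decompress` are assumptions.

```python
NOP = "NOP"


def compress(word):
    """Return (mask, ops); bit i of mask is 1 when slot i holds a real operation."""
    mask = 0
    ops = []
    for i, slot in enumerate(word):
        if slot != NOP:
            mask |= 1 << i
            ops.append(slot)
    return mask, ops


def decompress(mask, ops, width):
    """Rebuild the constant-length instruction by re-inserting the NOPs."""
    remaining = iter(ops)
    return [next(remaining) if (mask >> i) & 1 else NOP for i in range(width)]


word = ["X1=X0+1", NOP, NOP, NOP]
mask, ops = compress(word)
print(mask, ops)                       # 1 ['X1=X0+1']
print(decompress(mask, ops, 4) == word)  # True
```

In this sketch only the mask and one operation are stored for the example instruction instead of four slots, at the price of a decompression step on every read.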
The system of deleting the NOPs from the main storage can neither lower the cost nor enhance the performance of the processor, because the volume of the cache memory contained in the VLIW processor is unchanged.
Further, the system of compressing the instruction when it is stored in the cache memory located inside the processor and decompressing the instruction when it is read out of the cache memory causes a large loss at a branch point. This occurs because the decompressing stage is inserted into the pipeline for executing the instruction, which makes the pipeline deeper.
That is, one or two cycles are consumed by the decompression of the instruction together with the wire delay inside the processor chip. Hence, the pipeline for executing the instruction is extended by one or two stages. If the instructions are executed in sequential order, the extension is often negligible. If the execution order of the instructions is changed by a branch instruction, however, a period occurs in which no instruction is executed. In general, the deeper the execution pipeline, the longer this period.
This is a more serious problem in a processor that executes instructions in parallel, such as a VLIW processor. Assuming that the period in which no instruction is executed extends over two cycles, a conventional processor that executes a single process at a time loses the opportunity to execute only two processes in the worst case. A VLIW processor that executes four processes at a time, however, loses the opportunity to execute as many as eight processes in the worst case. The more processes are executed in parallel, the greater this loss.
In addition, in exchange for the reduced cache memory, additional hardware for decompressing the instruction is required.
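The worst-case figures above follow from multiplying the number of idle cycles by the number of processes that could have been executed per cycle. A minimal sketch of that arithmetic, with the hypothetical function name `lost_processes`:

```python
def lost_processes(idle_cycles, processes_per_cycle):
    """Worst-case number of processes that could not be executed."""
    return idle_cycles * processes_per_cycle


print(lost_processes(2, 1))  # 2, processor executing one process at a time
print(lost_processes(2, 4))  # 8, VLIW processor executing four at a time
```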