This application is based on an application Ser. No. 10-337186 filed in Japan, the content of which is hereby incorporated by reference.
(1) Field of the Invention
The present invention relates to a processor, compiling apparatus, and compile program recorded on a recording medium, and especially relates to technologies of reducing the number of execute cycles in parallel processing by the processor.
(2) Description of the Related Art
As apparatuses with built-in microprocessors have improved in function and speed, a microprocessor (referred to as a “processor” in this specification) with higher processing performance has been required.
For improved throughput of a plurality of instructions on a processor, pipeline control is adopted. Pipeline control is described below. An instruction is divided into a plurality of unit instructions that are to be executed in succession. The process of executing one instruction is likewise divided into a plurality of successive smaller processes (referred to as “stages” in this specification). The processor has executing units (hardware), each of which corresponds to a different stage. To execute the instruction, each of the unit instructions is executed in turn by a different executing unit at a different stage. When two instructions are executed in succession, each of the unit instructions of the second instruction is executed by a different executing unit, one stage behind the first instruction. By doing so, a plurality of instructions are executed in parallel.
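The overlap described above can be made concrete with a small timing model. The following Python sketch is illustrative only: the five stage names, the function names, and the assumption of one stage per cycle without stalls are hypothetical and are not taken from this specification.

```python
# Hypothetical stage names; the specification does not fix the number of stages.
STAGES = ["IF", "DEC", "EX", "MEM", "WB"]


def stage_occupancy(num_instructions, stages=STAGES):
    """Return {cycle: {stage: instruction index}} for an ideal pipeline.

    Assumes one stage per cycle and no stalls, so instruction i occupies
    stage s in cycle i + s (all indices zero-based).
    """
    table = {}
    for i in range(num_instructions):
        for s, name in enumerate(stages):
            table.setdefault(i + s, {})[name] = i
    return table


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish all instructions in the ideal pipeline."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    # Print which instruction occupies which stage in each cycle.
    for cycle, busy in sorted(stage_occupancy(2).items()):
        print(cycle, busy)
```

Under these assumptions, two instructions complete in six cycles, because the second instruction trails the first by one stage, whereas executing them strictly one after the other would take ten.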
For further improved performance, parallel processing is adopted at the level of individual instructions. Parallel processing at instruction level is the simultaneous execution of a plurality of instructions in one machine cycle. It is realized by either dynamic scheduling or static scheduling.
One representative example of the parallel processing at instruction level by dynamic scheduling is the superscalar system. According to the superscalar system, the operations described below are executed when a plurality of instructions are executed on a processor. The instruction codes are decoded. Then, an instruction issuing control unit (hardware) of the processor analyzes the dependency relations of the plurality of instructions using the decoded instruction codes and judges whether the instructions can be executed in parallel. The processor executes instructions in parallel that can be executed in parallel.
On the other hand, one representative example of the static scheduling is the VLIW (Very Long Instruction Word) system. According to the VLIW system, the operations described below are executed. At the time of the generation of the execution code, the dependency relations among the plurality of instructions are analyzed using the compiler and the like. According to the analysis, instruction codes are moved to generate an instruction stream that is more efficiently executed. Generally, a plurality of instructions that can be simultaneously executed are described in an instruction supply unit of fixed length (referred to as a “packet” in this specification) in the VLIW system.
In each of the scheduling systems, hazards due to data dependency relations are avoided in the parallel processing of instructions. More specifically, using the names of the registers that the instructions refer to for data and store data in, control is performed so that an instruction to store a value in a register and an instruction to refer to the stored value are not issued in the same cycle. According to the dynamic scheduling, the instruction issuing control unit performs control so that the two instructions are executed not in parallel but in serial. On the other hand, according to the static scheduling, the compiler schedules instructions at the time of compiling so that a group of instructions that are issued in the same cycle does not include instructions that have data dependency relations.
Recently, an increasing number of processors have adopted media processing instructions that deal with data whose size is larger than that of data dealt with by basic instructions as well as basic instructions for signal processing performance improvement. In the media processing instruction, a plurality of pieces of data are stored in a register whose length is larger than the length of registers used for basic instructions. The plurality of pieces of data are processed in parallel for the improvement of the signal processing performance. Some processors adopting the media processing instruction are not equipped with registers specifically for the media processing instruction. Instead, in those processors, the registers are shared for the basic instruction and the media processing instruction and data is written in part of the registers for the basic instruction.
When the dependency relations among a plurality of instructions are analyzed in those processors by referring to the register names shown in the instruction codes according to the instruction issuing control method that has been described, an instruction to update the upper half of one register and an instruction to update the lower half of the register are executed in serial, since the same register name in the instruction codes is treated as a data dependency relation between the instructions. This is problematic. Here, the data dependency relation refers to the dependency relation between an instruction to store data in a resource and another instruction to refer to the stored data.
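The problem can be illustrated with a short sketch. The Python code below is a hypothetical model, not the control logic of any actual processor: the `Instr` record, the 32-bit register width, the half-register masks, and the rule that a conflict requires at least one store are assumptions introduced for illustration. The first check compares register names only, as in the conventional method, and so reports a conflict even though the two instructions touch disjoint bits.

```python
from dataclasses import dataclass

UPPER_HALF = 0xFFFF0000  # assumed 32-bit register, upper 16 bits
LOWER_HALF = 0x0000FFFF  # assumed 32-bit register, lower 16 bits


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction describing one register access."""
    reg: str        # register name as it appears in the instruction code
    mask: int       # bits of the register that are actually accessed
    is_store: bool  # True if the instruction stores data in the register


def conflicts_by_name(a, b):
    """Conventional check: same register name and at least one store."""
    return a.reg == b.reg and (a.is_store or b.is_store)


def touches_common_bits(a, b):
    """True only if the two accesses share at least one bit of one register."""
    return a.reg == b.reg and (a.mask & b.mask) != 0


write_hi = Instr("R0", UPPER_HALF, True)
write_lo = Instr("R0", LOWER_HALF, True)

print(conflicts_by_name(write_hi, write_lo))    # True: issued in serial
print(touches_common_bits(write_hi, write_lo))  # False: no bits are shared
```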
It is accordingly the object of the present invention to provide a processor, a compiling apparatus, and a compile program recorded on a recording medium that reduce the number of execute cycles when parallel processing is performed in a processor that executes a plurality of instructions in one cycle.
The above-mentioned object may be achieved by a processor that processes a plurality of instructions in one cycle, the processor may include: A) a register; B) an instruction fetching unit for fetching the plurality of instructions that include at least a first instruction and a second instruction from an external program, the first instruction including a first access indication for accessing a first area, which is at least part of an area in the register, the second instruction including a second access indication for accessing a second area, which is at least part of the area in the register, wherein when the first area is a whole of the register, the second area is the part of the register, when the second area is the whole of the register, the first area is the part of the register, and at least one of the first and second access indications is for storing data in at least the part of the register; C) a decoding unit for decoding each of the fetched instructions and outputting at least decoded information on the register and on areas in the register in one cycle, the decoded information including at least information on the register and on the first and second areas; and D) an access unit for accessing the first and second areas according to the decoded information in one cycle.
In the processor, an instruction to access the first part of one register and another instruction to access the second part of the same register in a program can be executed in one cycle. As a result, the number of execute cycles is reduced compared with a conventional processor.
The above-mentioned object may be also achieved by the processor, wherein the first area, which is an object of the first access indication, and the second area, which is an object of the second access indication, are parts of the register and have no overlap, the first instruction includes an indication for storing data in the first area and the second instruction includes an indication for referring to data in the second area, and the access unit stores data in the first area and refers to data in the second area in one cycle.
In the processor, an instruction to store data in the first part of one register and another instruction to refer to data in the second part in the same register can be executed in one cycle. As a result, the number of execute cycles is reduced compared with a conventional processor.
The above-mentioned object may be also achieved by the processor, wherein the first area, which is an object of the first access indication, and the second area, which is an object of the second access indication, are parts of the area in the register and have no overlap, the first instruction includes an indication for storing data in the first area and the second instruction includes an indication for storing data in the second area, and the access unit stores data in the first and second areas in one cycle.
In the processor, an instruction to store data in the first part of one register and another instruction to store data in the second part in the same register can be executed in one cycle. As a result, the number of execute cycles is reduced compared with a conventional processor.
The above-mentioned object may be also achieved by the processor, wherein the first area, which is an object of the first access indication, and the second area, which is an object of the second access indication, have an overlap, which is a third area, the first instruction includes an indication for storing data in the first area and the second instruction includes an indication for storing data in the second area, and the access unit stores data in the first area excluding the third area, the second area excluding the third area, and the third area in one cycle.
In the processor, an instruction to store data in part of one register and another instruction to store data in part or the whole of the same register can be executed in one cycle. As a result, the number of execute cycles is reduced compared with a conventional processor, in which data is written in one register by only one instruction in one cycle.
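As a rough illustration of storing data in several areas of one register in the same cycle, the Python sketch below merges two masked stores into a single 32-bit value. The register width, the function name `store_areas`, the representation of areas as bit masks, and the rule that the second instruction's data is stored in the overlapping third area are all assumptions made for this example; the summary above does not state which data the third area receives.

```python
WIDTH_MASK = 0xFFFFFFFF  # assumed 32-bit register


def store_areas(old, first, second):
    """Apply two masked stores to one register value in a single step.

    `first` and `second` are (mask, data) pairs. Bits covered by both
    masks form the third area; this sketch assumes the second store
    supplies the data for that area.
    """
    m1, d1 = first
    m2, d2 = second
    third = m1 & m2                 # overlap of the two areas
    only1 = m1 & ~third             # first area excluding the third area
    only2 = m2 & ~third             # second area excluding the third area
    keep = WIDTH_MASK & ~(m1 | m2)  # bits that neither store touches
    return (old & keep) | (d1 & only1) | (d2 & only2) | (d2 & third)


# Non-overlapping halves of the same register.
print(hex(store_areas(0x00000000,
                      (0xFFFF0000, 0x12340000),
                      (0x0000FFFF, 0x00005678))))  # 0x12345678

# Whole register and upper half: the upper half is the third area.
print(hex(store_areas(0xAAAAAAAA,
                      (0xFFFFFFFF, 0x11112222),
                      (0xFFFF0000, 0x9ABC0000))))  # 0x9abc2222
```

A different priority rule for the third area would change only the last term of the return expression.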
The above-mentioned object may be also achieved by the processor, wherein the decoding means may include: A) an instruction decoding unit for decoding a plurality of instructions of the fetched instructions and outputting at least decoded information on the register and on areas in the register in one cycle, the decoded information according to indications for decoding instructions, the instruction decoding unit for stopping decoding an instruction in the fetched instructions according to an indication for stopping decoding the instruction in one cycle, wherein the plurality of fetched instructions include at least the first and second instructions, and wherein the decoded information includes at least the information on the register and on the first and second areas; and B) an instruction issuance control unit for controlling the instruction decoding unit by outputting an indication for decoding an instruction for each of the fetched instructions in one cycle so that the instruction decoding unit decodes the fetched instructions, the instruction issuance control unit for controlling the instruction decoding unit by receiving the decoded information that includes at least the information on the register and on the first and second areas after the instruction decoding unit decodes the fetched instructions, by judging whether the first and second areas are the same area, and by outputting an indication for stopping decoding the second instruction to the instruction decoding unit when it is judged that the first and second areas are the same area so that the instruction decoding unit stops decoding the second instruction.
In the processor, when the same part of one register is accessed by two instructions, it is considered that there is a data dependency relation between the two instructions, and the decoding of one of the instructions is stopped. As a result, when different parts of one register are accessed by two instructions, the two instructions can be executed in one cycle. Accordingly, the possibility that two instructions are executed in parallel is enhanced, and the number of execute cycles is reduced compared with a conventional processor.
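The judgment made by the instruction issuance control unit can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions: the `Decoded` record, the string labels for areas, and the function name are hypothetical, and "the same area" is read literally as the same area of the same register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoded:
    """Hypothetical decoded information for one instruction."""
    reg: int   # register number
    area: str  # assumed labels: "upper", "lower", or "whole"


def instructions_to_issue(first, second):
    """Return how many of the two fetched instructions issue in this cycle.

    If both access the same area of the same register, decoding of the
    second instruction is stopped and only the first is issued.
    """
    same_area = first.reg == second.reg and first.area == second.area
    return 1 if same_area else 2


print(instructions_to_issue(Decoded(0, "upper"), Decoded(0, "lower")))  # 2
print(instructions_to_issue(Decoded(0, "upper"), Decoded(0, "upper")))  # 1
```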
The above-mentioned object may be also achieved by a compiling apparatus that generates object codes from a source program described in a high-level language, the compiling apparatus may include: A) a storage unit for storing the source program; B) an execution code generating unit for reading the source program from the storage unit and performing translation processing on the read source program to generate an executive program, the executive program including at least one executive instruction, the executive instructions including information on a register; C) an instruction scheduling unit for rearranging the executive instructions according to information included in the executive instructions on areas that are parts of an area in the register so that a plurality of executive instructions that are to be executed in parallel are adjacent to each other; and D) an object code generating unit for generating the object codes according to the rearranged executive instructions.
In the compiling apparatus, a plurality of executive instructions are rearranged in units of parts of registers that are to be accessed by the executive instructions. As a result, when object codes that have been output from the compiling apparatus are executed in the object processor, the possibility that a plurality of executive instructions are executed in parallel is enhanced, and the number of execute cycles is reduced. Accordingly, the compiling apparatus can generate execute codes that are executed in fewer execute cycles.
The above-mentioned object may be also achieved by the compiling apparatus, wherein the instruction scheduling unit includes: A) a dependency relation analysis unit for generating dependency relation information that indicates dependency relations between the executive instructions according to the order in which the executive instructions are arranged and the information on the areas that are parts of the area in the register; B) an instruction rearrangement unit for determining groups, each containing at least one instruction, that are to be executed in parallel according to the dependency relation information and rearranging the executive instructions; and C) an execution boundary adding unit for adding parallel execution information to each of the determined groups that indicates whether instructions are to be executed in parallel.
In the compiling apparatus, the data dependency relations between a plurality of executive instructions are analyzed, parallel execution information is added to each group of instructions that are to be executed in parallel, and the executive instructions are rearranged in units of parts of registers that are to be accessed by the executive instructions. As a result, when object codes that have been output from the compiling apparatus are executed in the object processor, the processor easily detects groups of instructions that are executed in parallel using the parallel execution information, the possibility that a plurality of executive instructions are executed in parallel is enhanced, and the number of execute cycles is reduced. Accordingly, the compiling apparatus can generate execute codes that are executed in fewer execute cycles.
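The three steps above (dependency analysis per register area, rearrangement into groups, and addition of boundary information) can be sketched in Python. This is a minimal illustration under stated assumptions, not the scheduling method of the embodiment: the `Access` and `Instr` records, the bit-mask representation of areas, the greedy grouping strategy, the two-instruction group width, and the end-of-group flag are all hypothetical.

```python
from dataclasses import dataclass

HI, LO, FULL = 0xFFFF0000, 0x0000FFFF, 0xFFFFFFFF  # assumed 32-bit areas


@dataclass(frozen=True)
class Access:
    reg: str        # register name
    mask: int       # area of the register, as a bit mask
    is_store: bool  # True for a store, False for a reference


@dataclass(frozen=True)
class Instr:
    name: str
    accesses: tuple  # tuple of Access records


def depends(earlier, later):
    """Dependency analysis: True if the two instructions access overlapping
    areas of one register and at least one of the accesses is a store."""
    return any(a.reg == b.reg and (a.mask & b.mask) and (a.is_store or b.is_store)
               for a in earlier.accesses for b in later.accesses)


def schedule(instrs, width=2):
    """Greedy rearrangement into groups of at most `width` instructions.

    Returns (instruction, end_of_group) pairs; the flag plays the role of
    the parallel execution information marking each group boundary.
    """
    remaining = list(range(len(instrs)))
    result = []
    while remaining:
        group = []
        for pos, i in enumerate(remaining):
            if len(group) == width:
                break
            # Ready only if no earlier unscheduled instruction conflicts.
            if not any(depends(instrs[j], instrs[i]) for j in remaining[:pos]):
                group.append(i)
        remaining = [i for i in remaining if i not in group]
        for k, i in enumerate(group):
            result.append((instrs[i], k == len(group) - 1))
    return result


program = [
    Instr("A", (Access("R0", HI, True),)),                              # store upper half of R0
    Instr("B", (Access("R0", HI, False), Access("R1", FULL, True))),    # refer to upper half
    Instr("C", (Access("R0", LO, True),)),                              # store lower half of R0
    Instr("D", (Access("R0", LO, False), Access("R2", FULL, True))),    # refer to lower half
]

for ins, end in schedule(program):
    print(ins.name, "end of group" if end else "")
```

In this example the stores to the upper and lower halves of `R0` (A and C) are placed in one group and the two references (B and D) in the next, so the four instructions form two groups. A comparison by register name alone would treat all four as mutually dependent and produce four groups.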