1. Field of the Invention
The present invention relates to a system module and a data relay method in the system module.
2. Description of the Related Art
There has been proposed a multiprocessor system in which system modules including a plurality of processors are connected with each other by a bus. For example, Japanese Patent Application Laid-open No. 2001-167069 discloses such a system. FIG. 1 is a block diagram of an example of a multiprocessor system. In this example, two system modules 10a and 10b including a multiprocessor are connected with each other via a bus 31. The two system modules have the same configuration, and each includes two central processing units (CPUs) 11 each including a cache memory, a main memory 12 including a dual inline memory module (DIMM) or the like, a memory access controller (MAC) 13 that controls an access signal to the main memory 12 and the like, and a system controller 14 that relays a packet when a data packet is transmitted across the system modules 10a and 10b. The system controller 14, the CPU 11, and the MAC 13 are connected with each other via a bus 21.
Described below is a conventional data packet transmission method between the system modules 10a and 10b in the multiprocessor system. Data sent from a processor such as the CPU 11 or the MAC 13 in one system module 10 to the MAC 13 or the CPU 11 in another system module 10 is packetized and transmitted to the system controller 14. Upon receipt of the packet, the system controller 14 obtains a priority (right of use) for the bus 31 in order to transmit the packet to the destination system module 10, and then transmits the packet received from the processor to the destination system module 10. Upon receiving the packet, the system controller 14 in the destination system module 10 transmits the packet to the destination processor based on destination information of the packet. When transmission of all data (packets) has finished, the bus 31 is released.
In such a data packet transmission method, for example, when a fault occurs in a CPU 11a or a bus 21a during packet transmission from the CPU 11a in the system module 10a to a system controller 14a, and the system controller 14a cannot normally receive the packets from the CPU 11a to the end, the system controller 14a remains indefinitely unable to finish transmitting the packets from the CPU 11a onto the bus 31 connecting the system controllers 14a and 14b. Therefore, transmission of the packets from the system controller 14a to the system controller 14b is interrupted. On the side of the system controller 14a, because packet transmission from the CPU 11a cannot be completed, the bus 31 to the system controller 14b cannot be released. As a result, a data packet having no direct relation to the faulty part, such as one on a route from a CPU 11b in the system module 10a through the system controller 14a and the system controller 14b in the system module 10b to a CPU 11c, cannot be transmitted. That is, the bus 31 connecting the system controllers 14a and 14b is blocked due to the fault in the CPU 11a or the bus 21a between the CPU 11a and the system controller 14a, thereby making packet transmission impossible in the entire system.
To avoid such blocking of the bus 31, there is a method in which, when the system controller 14a detects an error during transmission of the packet from the CPU 11a, transmission of the packet from the system controller 14a to the system controller 14b is discontinued. That is, when an error is detected during transmission of the packet from a certain CPU 11, transmission of the packet from that CPU 11 is discontinued, the discontinued packet is treated as an abnormal packet, and a first packet of different data transmitted from another CPU 11 to the system controller 14 is transmitted after it.
FIG. 2 is a timing chart when the data packet is transmitted between the system controllers by using this method. The timing chart depicts a status where (1) a normal transmission process of first data 100A from the CPU 11a in the system module 10a to the system module 10b, (2) an abnormal transmission process of second data 100B from the CPU 11a in the system module 10a to the system module 10b, and (3) a normal transmission process of third data 100C from the CPU 11b in the system module 10a to the system module 10b are performed continuously. In the timing chart, latency in the system controller 14 is set to 5τ (cycles).
First, a case (1) that the first data 100A is transmitted from the CPU 11a in the system module 10a to the system module 10b is explained. In this case, it is assumed that the first data 100A transmitted from the CPU 11a to the system module 10b is divided into five packets and transmitted. At time [1], a first packet of the first data 100A is transmitted from the CPU 11a to the bus 21a, and the last fifth packet is transmitted to the bus 21a at time [5]. Because the latency in the system controller 14 is 5τ, the first packet is transmitted to the bus 31 connecting the system controllers 14a and 14b at time [6], and the last fifth packet is transmitted to the bus 31 at time [10].
Case (2), in which the second data 100B is transmitted from the CPU 11a in the system module 10a to the system module 10b, is explained next. Also in this case, it is assumed that the second data 100B transmitted from the CPU 11a to the system module 10b is divided into five packets and transmitted. At time [7], a first packet of the second data 100B is transmitted from the CPU 11a to the bus 21a, and the fourth packet is transmitted to the bus 21a at time [10]. Thereafter, it is assumed that the fifth packet has not reached the system controller 14a due to a fault in the CPU 11a or the bus 21a. At this time, the first packet is transmitted to the bus 31 connecting the system controllers 14a and 14b at time [12], and the fourth packet is transmitted to the bus 31 at time [15].
Case (3), in which the third data 100C is transmitted from the CPU 11b in the system module 10a to the system module 10b, is explained next. Also in this case, it is assumed that the third data 100C transmitted from the CPU 11b to the system module 10b is divided into five packets and transmitted. At time [8], a first packet of the third data 100C is transmitted from the CPU 11b to a bus 21b connecting the CPU 11b and the system controller 14a, and the fifth packet is transmitted to the bus 21b at time [12]. The system controller 14a receives the packets from the CPU 11b; however, at the time of reception, the bus 31 connecting the system controllers 14a and 14b is being used for the data transmission from the CPU 11a to the system module 10b in (2), and therefore the received packets are temporarily stored in a buffer.
After the system controller 14a has received the fourth packet of the second data 100B, the fifth packet does not arrive, for example, at an expected timing. Therefore, the system controller 14a determines that a fault has occurred in the CPU 11a or the bus 21a connecting the CPU 11a and the system controller 14a, and discontinues the data transmission from the CPU 11a. At time [16], the system controller 14a starts to transmit the first packet of the third data 100C to the bus 31, and at time [20], transmits the fifth packet to the bus 31. Data transmission between the system modules 10a and 10b is performed in this manner.
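The timing of FIG. 2 described above can be checked with a small simulation. The following is a minimal sketch in Python, not part of the conventional system itself; the function name `cut_through_schedule`, the one-packet-per-cycle model, and the assumption that the bus 31 is released in the cycle immediately after the last delivered packet of the discontinued data are all illustrative assumptions chosen to match the times given in the text.

```python
LATENCY = 5  # controller latency in cycles (5 tau in the timing chart)

def cut_through_schedule(transfers, latency=LATENCY):
    """Return {data name: cycles at which its packets appear on bus 31}.

    transfers: list of (name, arrival_cycles) in the order the bus is granted.
    Each packet is forwarded `latency` cycles after it arrives on the local
    bus, or as soon as bus 31 becomes free if it is still occupied (the
    packet is assumed to wait in a buffer in the meantime).
    """
    bus_free = 0  # first cycle at which bus 31 is available
    schedule = {}
    for name, arrivals in transfers:
        slots = []
        for t in arrivals:
            slot = max(t + latency, bus_free)
            slots.append(slot)
            bus_free = slot + 1
        schedule[name] = slots
    return schedule

fig2 = cut_through_schedule([
    ("100A", [1, 2, 3, 4, 5]),     # all five packets arrive
    ("100B", [7, 8, 9, 10]),       # fifth packet never arrives
    ("100C", [8, 9, 10, 11, 12]),  # buffered behind 100B
])
for name, slots in fig2.items():
    print(name, slots)
```

Under these assumptions the sketch yields bus cycles 6 to 10 for the first data 100A, 12 to 15 for the truncated second data 100B, and 16 to 20 for the third data 100C, in agreement with the times stated for FIG. 2.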
In such a method of discontinuing the data transmission, however, the system controller 14b receives an abnormal packet from the CPU 11a and a normal packet from the CPU 11b in succession. More specifically, not all the packets from the CPU 11a are delivered before a packet from the CPU 11b is delivered, and therefore the packets are received in violation of the protocol. That is, from the viewpoint of the system controller 14b, a fault appears to have occurred between the system controllers 14a and 14b. As a result, even in this case, the fault in the CPU 11a or the transmission line (the bus 21a) between the CPU 11a and the system controller 14a affects the entire system.
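The receiving side's view of this situation can be illustrated as follows. This is a hedged sketch only: the text does not specify how the system controller 14b recognizes a protocol violation, so the packet representation `(data_id, seq)`, the fixed expected count of five packets, and the function name `find_incomplete_data` are hypothetical.

```python
def find_incomplete_data(packets, expected=5):
    """Return the ids of data whose packets stopped before `expected` arrived.

    packets: sequence of (data_id, seq) tuples in the order they are seen on
    bus 31 by the receiving controller (hypothetical representation).
    """
    incomplete = []
    current, count = None, 0
    for data_id, _seq in packets:
        if data_id != current:
            # A new data item begins; the previous one must have been complete.
            if current is not None and count < expected:
                incomplete.append(current)
            current, count = data_id, 0
        count += 1
    if current is not None and count < expected:
        incomplete.append(current)
    return incomplete

# Stream seen by the system controller 14b in FIG. 2, cases (2) and (3):
stream = [("100B", i) for i in range(1, 5)] + [("100C", i) for i in range(1, 6)]
violations = find_incomplete_data(stream)
print(violations)
```

With the stream of FIG. 2, only the second data 100B is reported as incomplete, which is the condition the receiving controller would perceive as a fault on the bus 31.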
Further, the system including a plurality of CPUs 11 as shown in FIG. 1 can be used in a state where the system is logically divided for each CPU 11 by partitioning or the like. In such a case, it is not desirable from the viewpoint of system reliability that the bus commonly used by the CPUs is blocked due to a fault in the CPU 11a, or that errors occur in a chain reaction.
A reliable method for solving the above problems in the conventional technology is that a chip for relaying, such as the system controller 14, discards data having an abnormal packet, after having received the data packets from each processor in the system module 10 to the end, and transmits the data having only the normal packets to the chip (the system controller 14) on the next path.
FIG. 3 is an example of a timing chart when the transmission process of the packet to the system module is performed after all the packets from the processor have been received. It is also assumed in the timing chart that processes of (1) to (3) are performed as in FIG. 2. In the timing chart, the latency in the system controller 14 is set to 5τ (cycles).
First, the case (1) that the first data 100A is transmitted from the CPU 11a in the system module 10a to the system module 10b is explained. In this case, it is assumed that the first data 100A transmitted from the CPU 11a to the system module 10b is divided into five packets and transmitted. At time [1], a first packet of the first data 100A is transmitted from the CPU 11a to the bus 21a, and the last fifth packet is transmitted to the bus 21a at time [5]. In this transmission method, because the system controller 14a transmits only data including normal packets after having received all the packets constituting the data, and the latency in the system controller 14 is 5τ, after having received the fifth packet normally, transmission of the first packet is started and the first packet is transmitted to the bus 31 connecting the system controllers 14a and 14b at time [10], and the last fifth packet is transmitted to the bus 31 at time [14].
Case (2), in which the second data 100B is transmitted from the CPU 11a in the system module 10a to the system module 10b, is explained next. Also in this case, it is assumed that the second data 100B transmitted from the CPU 11a to the system module 10b is divided into five packets and transmitted. At time [7], a first packet of the second data 100B is transmitted from the CPU 11a to the bus 21a, and the fourth packet is transmitted to the bus 21a at time [10]. Thereafter, it is assumed that the fifth packet has not reached the system controller 14a due to a fault in the CPU 11a or the bus 21a connecting the CPU 11a and the system controller 14a. At this time, after the system controller 14a has received the fourth packet of the second data 100B, the fifth packet does not arrive within the expected time. Therefore, the system controller 14a determines that there is an error, and discards the received first to fourth packets as the abnormal second data 100B. Accordingly, the second data 100B is not transmitted to the bus 31 connecting the system controllers 14a and 14b.
Case (3), in which the third data 100C is transmitted from the CPU 11b in the system module 10a to the system module 10b, is explained next. Also in this case, it is assumed that the third data 100C transmitted from the CPU 11b to the system module 10b is divided into five packets and transmitted. At time [8], a first packet of the third data 100C is transmitted from the CPU 11b to the bus 21b connecting the CPU 11b and the system controller 14a, and the fifth packet is transmitted to the bus 21b at time [12]. The system controller 14a receives the first to fifth packets from the CPU 11b normally. After the last fifth packet has been received, the transmission process of the third data 100C is started; at time [17], the first packet of the third data 100C is transmitted to the bus 31 connecting the system controllers 14a and 14b, and the last fifth packet is transmitted to the bus 31 at time [21].
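The timing of FIG. 3 can be reproduced in the same way. The following Python sketch is illustrative only; the function name `store_and_forward_schedule`, the one-packet-per-cycle model, and the representation of discarded data as an empty list are assumptions not taken from the text.

```python
def store_and_forward_schedule(transfers, latency=5):
    """Return {data name: cycles at which its packets appear on bus 31}.

    transfers: list of (name, arrival_cycles, expected_count).
    Data is forwarded only after all expected packets have arrived; data with
    missing packets is discarded and never occupies bus 31.
    """
    bus_free = 0  # first cycle at which bus 31 is available
    schedule = {}
    for name, arrivals, expected in transfers:
        if len(arrivals) < expected:
            schedule[name] = []  # abnormal data: discarded by the controller
            continue
        # Forwarding starts `latency` cycles after the LAST packet arrives.
        start = max(arrivals[-1] + latency, bus_free)
        slots = list(range(start, start + expected))
        schedule[name] = slots
        bus_free = slots[-1] + 1
    return schedule

fig3 = store_and_forward_schedule([
    ("100A", [1, 2, 3, 4, 5], 5),
    ("100B", [7, 8, 9, 10], 5),      # fifth packet never arrives
    ("100C", [8, 9, 10, 11, 12], 5),
])
for name, slots in fig3.items():
    print(name, slots)
```

Under these assumptions the first data 100A occupies the bus 31 at cycles 10 to 14, the second data 100B is never transmitted, and the third data 100C occupies cycles 17 to 21, matching the times given for FIG. 3.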
This method is desirable from the viewpoint of reliability; however, the latency in the packet transfer increases because the packets can be transmitted to another chip (the system controller 14) only after all the packets have been received. That is, even when there is no error in the CPU 11 or the bus 21 connecting the CPU 11 and the system controller 14, the latency increases as the size of data to be transmitted increases, because the system controller 14 needs to wait for arrival of all the data.
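The dependence of this added latency on data size can be made concrete with a short calculation. The sketch below assumes one packet per cycle, an otherwise idle bus 31, and the 5-cycle controller latency of the timing charts; the function name `last_packet_cycle` is hypothetical.

```python
def last_packet_cycle(first_arrival, n_packets, latency=5, store_and_forward=False):
    """Cycle at which the last packet of a data item appears on bus 31."""
    last_arrival = first_arrival + n_packets - 1
    if store_and_forward:
        # Forwarding begins only after the last packet has arrived, so the
        # whole data item (n_packets cycles) is still to be sent.
        return last_arrival + latency + (n_packets - 1)
    # Immediate relay: each packet leaves `latency` cycles after it arrives.
    return last_arrival + latency

for n in (5, 50, 500):
    immediate = last_packet_cycle(1, n)
    waited = last_packet_cycle(1, n, store_and_forward=True)
    print(n, immediate, waited, waited - immediate)
```

For the five-packet first data 100A this gives cycle 10 for the method of FIG. 2 and cycle 14 for the method of FIG. 3. In this model the penalty equals the number of packets minus one, so it grows linearly with the size of the data.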