1. Field of the Invention
This invention relates to a technique for speeding up the execution of a program in a simulation system through parallelization.
2. Description of the Related Art
Recently, multiprocessor systems have been used in the fields of scientific computation, simulation and the like. In such a system, an application program generates multiple processes and allocates the processes to individual processors. As an example, these processors go through a procedure while communicating with each other using a shared memory space.
In the field of simulation, which has developed particularly rapidly in recent years, there is simulation software for mechatronics plants such as robots, automobiles and airplanes. Thanks to advances in electronic components and software technology, most parts of a robot, an automobile, an airplane or the like are electronically controlled through wire connections laid out like a network of nerves, a wireless LAN and the like.
Although these mechatronics products are mechanical devices in nature, they also incorporate large amounts of control software. Therefore, the development of such a product requires much time, enormous costs and a large pool of manpower to develop a control program and to test the program.
As a conventional technique for such a test, there is HILS (Hardware In the Loop Simulation). Particularly, an environment for testing all the electronic control units (ECUs) in an automobile is called full-vehicle HILS. In the full-vehicle HILS, a test is conducted in a laboratory according to a predetermined scenario by connecting a real ECU to a dedicated hardware device emulating an engine, a transmission mechanism, or the like. The output from the ECU is input to a monitoring computer, and further displayed on a display unit to allow a person in charge of the test to check if there is any abnormal action while viewing the display.
However, in HILS, the dedicated hardware device is used, and the device and the real ECU have to be physically wired. Thus, HILS involves a lot of preparation. Further, when a test is conducted by replacing the ECU with another, the device and the ECU have to be physically reconnected, requiring even more work. Further, since the test uses the real ECU, it takes time to conduct the test, resulting in an immense amount of time to test many scenarios. In addition, the hardware device for emulation of HILS is generally very expensive.
A recently introduced technique using software without using such an expensive emulation hardware device is called SILS (Software In the Loop Simulation). Using this technique, components to be mounted in the ECU, such as a microcomputer and an I/O circuit, a control scenario, and all plants such as an engine and a transmission, are configured by using a software simulator. This enables the test to be conducted without the hardware of the ECU.
As a system for supporting such a configuration of SILS, there is, for example, the simulation modeling system MATLAB®/Simulink® available from The MathWorks, Inc. In the case of using MATLAB®/Simulink®, functional blocks indicated by rectangles are arranged on a screen through a graphical interface as shown in FIG. 1, and a flow of processing as indicated by arrows is specified, thereby enabling the creation of a simulation program. The diagram of these blocks represents processing for one time step of simulation, and this processing is repeated a predetermined number of times so that the time-series behavior of the system to be simulated can be obtained.
Thus, when the block diagram of the functional blocks or the like is created on MATLAB®/Simulink®, it can be converted to source code of an equivalent function in an existing computer language, such as C language, using the function of Real-Time Workshop®. This C source code is compiled such that simulation can be performed as SILS on another computer system.
FIG. 1 is a diagram schematically showing a loop of typical functional blocks of MATLAB®/Simulink®. The functional blocks are roughly divided into blocks having an internal state and blocks without any internal state. In FIG. 1, hatched blocks A and B are blocks having an internal state and blocks a, b and c without hatching are blocks without any internal state. In the blocks without any internal state, output data is calculated directly from input data as shown in FIG. 2(a).
On the other hand, in the blocks having an internal state, a value obtained by performing a predetermined calculation on the previous input data is held as internal data 202 as shown in FIG. 2(b), and output data is calculated using the internal data 202. Thus, the current input data is not used to calculate the current output data; instead, after the current output data has been calculated, it is held as the internal data 202 for use in calculating the next output data.
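The distinction between the two kinds of blocks can be illustrated with a minimal Python sketch. The class names `StatelessBlock` and `StatefulBlock`, the gain function and the one-step delay are hypothetical choices made only for illustration; they are not part of MATLAB®/Simulink®.

```python
class StatelessBlock:
    """Block without internal state: output is computed directly from the input."""

    def __init__(self, fn):
        self.fn = fn

    def output(self, x):
        return self.fn(x)


class StatefulBlock:
    """Block with internal state: output uses only data held from the previous input."""

    def __init__(self, initial_state, out_fn, update_fn):
        self.state = initial_state      # plays the role of internal data 202
        self.out_fn = out_fn
        self.update_fn = update_fn

    def output(self):
        # The current input is deliberately not a parameter here.
        return self.out_fn(self.state)

    def update(self, x):
        # Called after the current output has been calculated.
        self.state = self.update_fn(x)


# Hypothetical example blocks: a gain of 2 and a one-step delay.
gain = StatelessBlock(lambda x: 2 * x)
delay = StatefulBlock(0, out_fn=lambda s: s, update_fn=lambda x: x)

trace = []
for x in [1, 2, 3]:
    y_delay = delay.output()    # depends only on the stored state
    delay.update(x)             # current input is kept for the next step
    trace.append((gain.output(x), y_delay))

print(trace)
```

In this sketch the delay block's output at each step equals the input of the previous step, whereas the gain block's output always reflects the current input.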
The following describes processing for the structure of the block diagram shown in FIG. 1. Here, it is assumed that the output of block A is f1, the output of block a is f2, the output of block B is f3, the output of block b is f4 and the output of block c is f5. f1 is input to block a, f2 is input to block B, f3 is input to block b, f4 is input to block c, and f5 is input to block A. Since block A and block B are blocks having an internal state, direct inputs f5 and f2 are not used to calculate f1 and f3, respectively. This is written in the following pseudo-code:
    while (ts < EOS) {
        // output
        f1 = Aout(sA)
        f2 = a(f1)
        f3 = Bout(sB)
        f4 = b(f3)
        f5 = c(f4)
        // update state
        sA = Ain(f5)
        sB = Bin(f2)
        // update time
        ts++
    }
The above pseudo-code shows that the loop is repeated until time ts reaches EOS (end of simulation). In this code, Aout( ) is a function for causing the block A to calculate output based on the internal state, Ain( ) is a function for causing the block A to calculate an internal-state variable based on the input, a( ) is a function for causing the block a to calculate output based on the input, and so on.
As seen from this pseudo-code, the block A uses its internal state to calculate output whereas the block a uses the output of the block A. Here, the outputs of the blocks B, b and c are not used.
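The loop described by the pseudo-code can be written as a runnable Python sketch. The bodies of the block functions are arbitrary placeholders assumed only so that the loop produces concrete numbers; the source does not specify what the blocks compute.

```python
def simulate(eos, sA=0, sB=0):
    """Run the five-block loop for `eos` time steps and record f1..f5 per step."""
    # Hypothetical block functions (placeholders, not from the source).
    Aout = lambda s: s          # output of block A from its internal state
    Bout = lambda s: s          # output of block B from its internal state
    a = lambda x: x + 1         # stateless block a
    b = lambda x: 2 * x         # stateless block b
    c = lambda x: x + 3         # stateless block c
    Ain = lambda x: x           # next internal state of block A
    Bin = lambda x: x           # next internal state of block B

    history = []
    ts = 0
    while ts < eos:
        # output
        f1 = Aout(sA)
        f2 = a(f1)
        f3 = Bout(sB)
        f4 = b(f3)
        f5 = c(f4)
        # update state
        sA = Ain(f5)
        sB = Bin(f2)
        history.append((f1, f2, f3, f4, f5))
        # update time
        ts += 1
    return history


print(simulate(3))
```

Note that f1 and f3 are computed from sA and sB only, so within one time step the chain f1, f2 never reads any of f3, f4, f5, and vice versa.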
On the other hand, the blocks B, b and c use neither the output of the block A nor that of the block a. This suggests that the group A, a and the group B, b, c can be executed in parallel. As shown in FIG. 3, the processes of A and a and the processes of B, b and c are allocated, preferably, to different processors or cores and executed in parallel; the system then inputs the output of the block a into the block B and the output of the block c into the block A before advancing to the next parallel execution. In other words, when each flow that ends at a block having an internal state is erased, the resulting disconnected portions become executable in parallel within a single iteration.
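The execution scheme just described, in which two segments run in parallel and exchange outputs at the end of each iteration, might be sketched as follows. The block functions and the names `segment_A`, `segment_B` and `simulate_parallel` are hypothetical, and Python threads are used only to show the structure; in CPython they would not by themselves give a speed-up for CPU-bound blocks.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stateless block functions (placeholders, not from the source).
def a(x): return x + 1
def b(x): return 2 * x
def c(x): return x + 3


def segment_A(sA):
    """Blocks A and a: needs only the internal state of A."""
    f1 = sA            # Aout assumed to return the state unchanged
    return a(f1)       # f2


def segment_B(sB):
    """Blocks B, b and c: needs only the internal state of B."""
    f3 = sB            # Bout assumed to return the state unchanged
    return c(b(f3))    # f5


def simulate_parallel(eos, sA=0, sB=0):
    states = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(eos):
            # The two segments share no data within one iteration.
            fut_a = pool.submit(segment_A, sA)
            fut_b = pool.submit(segment_B, sB)
            f2, f5 = fut_a.result(), fut_b.result()
            # Exchange at the iteration boundary: f5 feeds A, f2 feeds B.
            sA, sB = f5, f2
            states.append((sA, sB))
    return states


print(simulate_parallel(3))
```

Waiting on both futures before updating the states acts as the synchronization point between iterations, which is the only place where the two segments communicate.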
However, it is often the case that a model cannot be divided completely, i.e., parallelization is not possible just by erasing the flows that end at blocks having an internal state. For example, in the case of FIG. 1, all blocks remain connected, and as a result, parallelization cannot be performed at all. This occurs because the portions do not become disconnected when there is a block without any internal state that combines two or more signals, and this tendency is likely to prevail in many models. Therefore, high parallelism cannot be expected from the above simple method alone.
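The simple method and its limitation can be made concrete with a short sketch that erases every flow ending at a block having an internal state and then groups the remaining blocks into connected components. The function name `parallel_segments` and both example models are assumptions for illustration: the first model is the five-block loop of the pseudo-code, and the second is a hypothetical variant with an extra stateless block m that combines the outputs of a and b.

```python
def parallel_segments(blocks, stateful, flows):
    """Erase flows that end at stateful blocks; return the connected components."""
    kept = [(src, dst) for src, dst in flows if dst not in stateful]

    # Union-find over the blocks, treating kept flows as undirected links.
    parent = {blk: blk for blk in blocks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for src, dst in kept:
        parent[find(src)] = find(dst)

    groups = {}
    for blk in blocks:
        groups.setdefault(find(blk), set()).add(blk)
    return sorted(sorted(g) for g in groups.values())


stateful = {"A", "B"}

# Five-block loop from the pseudo-code: A -> a -> B -> b -> c -> A.
loop = [("A", "a"), ("a", "B"), ("B", "b"), ("b", "c"), ("c", "A")]
split = parallel_segments(["A", "a", "B", "b", "c"], stateful, loop)

# Hypothetical variant: stateless block m combines the outputs of a and b.
merged_flows = [("A", "a"), ("a", "B"), ("a", "m"), ("B", "b"),
                ("b", "m"), ("m", "c"), ("c", "A")]
merged = parallel_segments(["A", "a", "B", "b", "c", "m"], stateful, merged_flows)

print(split)
print(merged)
```

In the variant, the single stateless block m links the two chains, so only one component remains and the simple method yields no parallelism.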
Japanese Patent Application Publication No. 2003-91422 relates to a method of automatically converting a non-parallelized source code having a multi-loop structure into a parallelized source code executable by multiple processors. It discloses a program P for automatically generating ultra-parallelized source code for multiple repetition processing, that is, for automatically generating parallelized source codes executable in parallel by m processors (where m is an integer equal to two or more) from a non-parallelized source code SC including an n-fold nested loop (where n is an integer equal to two or more). In the program, an initial value expression for each loop of the n-fold loop of the non-parallelized source code SC is rewritten to an initial value expression Sj represented by using m consecutive integers iak (k=0, . . . , m−1), which start from 0 and are given to the m processors to uniquely identify each processor, and incremental values δj each specified for each loop j (j=1, . . . , n). Using the rewritten initial value expression Sj and the incremental values δj, a CPU realizes a function for converting the n-fold loop structure into a structure capable of being processed by the m processors in a shared manner.
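As a rough illustration of the general idea of sharing a nested loop among m processors by rewriting loop start values and increments, the sketch below gives processor k the outer-loop start value k and the increment m. This cyclic distribution is an assumption chosen for simplicity; it is not claimed to be the exact rewriting disclosed in the cited publication, and the function name `iterations_for` is hypothetical.

```python
def iterations_for(k, m, n_outer, n_inner):
    """Iterations of a 2-fold nested loop assigned to processor k out of m.

    Assumed scheme: the outer loop starts at the processor id k and
    advances by m; the inner loop is left unchanged.
    """
    return [(i, j) for i in range(k, n_outer, m) for j in range(n_inner)]


m, n_outer, n_inner = 3, 7, 2
per_processor = [iterations_for(k, m, n_outer, n_inner) for k in range(m)]

# Together the processors cover the original iteration space exactly once.
covered = sorted(it for part in per_processor for it in part)
original = [(i, j) for i in range(n_outer) for j in range(n_inner)]

print(per_processor[0])
print(covered == original)
```

Such a scheme divides independent iterations among processors; it does not by itself address dependences carried from one iteration to the next.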
Published Japanese Translation of PCT International Application Publication No. JP-T-2007-511835 discloses that a network processor is configured into a D-stage processor pipeline, a sequential network application program is transformed into multiple D-pipeline stages, and the D-pipeline stages are executed in parallel within the D-stage processor pipeline. In this case, for example, the transformation of the sequential application program is performed by modeling the sequential network program as a flow network model and selecting from the flow network model a plurality of preliminary pipeline stages.
However, these conventional techniques do not mention any technique for enhancing parallelism within an iteration between functional blocks that depend upon each other across a loop (loop-carried dependence).