1. Field of the Invention
The present invention relates to a technique for reducing program execution time by parallelizing processes in a simulation system.
2. Description of Related Art
In recent years, a so-called multiprocessor system, including multiple processors, has been widely used in fields such as scientific computation and simulation. In such a system, an application program generates multiple processes and assigns the processes to individual processors. Then, the processors perform the processes in parallel while communicating with each other by using, for example, a shared memory space.
Simulation is one of the fields to which such multiprocessor systems are applied. A simulation system uses software to simulate mechatronics plants such as a robot, a vehicle, or an airplane. Advances in electronic components and software technology have made it possible to control a major part of such a machine electronically, by means of wire connections, a wireless LAN, or the like spread throughout the machine like nerves.
Although such a machine is fundamentally a mechanical device, it has massive control software installed therein. Accordingly, in product development, a great deal of time, cost, and manpower is required for developing and testing the control programs.
Hardware in the loop simulation (HILS) is a technique that has been conventionally used for such tests. In particular, an environment for testing the electronic control units (ECU) of an entire vehicle is called full-vehicle HILS. In full-vehicle HILS, actual ECUs are connected to a special hardware device for emulating an engine mechanism or a transmission mechanism, for example, in a laboratory. Tests are then carried out for predetermined scenarios. Outputs from the ECUs are inputted to a monitoring computer, and are then displayed on a display. Thus, the test operator checks for any abnormal operation while looking at the display.
However, HILS requires a special hardware device, and physical wiring needs to be made between the special hardware device and the actual ECUs. Thus, HILS involves much advance preparation. In addition, when a test is to be performed by replacing ECUs with different ones, the wiring needs to be physically rearranged, which takes time and effort. Moreover, since HILS uses actual ECUs, the tests must run in real time. Accordingly, when tests are performed for many scenarios, a large amount of time is required. Furthermore, a hardware device for HILS emulation is generally extremely expensive.
To address the disadvantages of HILS, a technique called software in the loop simulation (SILS), which relies on software rather than on an expensive emulation hardware device, has recently been proposed. In SILS, the microcomputer mounted in the ECU, the input/output circuit, the control scenarios, and plants such as an engine and a transmission are all emulated by a software simulator. With this technique, a test can be carried out without using actual ECU hardware.
An example of a system for supporting implementation of SILS is MATLAB®/Simulink®, which is a simulation modeling system available from The MathWorks, Inc. By using MATLAB®/Simulink®, a simulation program can be created by arranging functional blocks on a display through a graphical interface, and then specifying process flows as shown by arrows in FIG. 1. The block diagram represents a process in one time-step of the simulation. Time-series behaviors of a system to be simulated can be obtained by iterative execution of this process a predetermined number of times.
When a block diagram including the functional blocks and the like is created by MATLAB®/Simulink®, each function can be transformed into a source code describing an equivalent function in a known computer language, such as C language, by a function of Real-Time Workshop®. By compiling the C source code, a simulation can be performed as an SILS in a different computer system.
FIG. 1 shows a schematic diagram of a loop of typical functional blocks in MATLAB®/Simulink®. Functional blocks are mainly classified into blocks with internal state and blocks without internal state. In FIG. 1, hatched blocks A and B are blocks with internal state, and non-hatched blocks a, b and c are blocks without internal state.
In blocks without internal state, output data is calculated immediately from input data and then is outputted as shown in FIG. 2A.
On the other hand, in blocks with internal state, a value obtained by certain computing on previously inputted data is held as internal data 202, and output data is calculated by use of the internal data 202, as shown in FIG. 2B. To be more specific, currently inputted data is not used for calculation of data to be currently outputted, but is held as the internal data 202 for calculation of the next output data, after completion of the calculation of data to be currently outputted.
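The difference between the two kinds of blocks can be illustrated with a minimal Python sketch. The class names `Gain` and `UnitDelay`, their methods, and the input values are hypothetical and chosen only for illustration; they are not taken from MATLAB®/Simulink®.

```python
class Gain:
    """Block without internal state: output is computed from the current input."""

    def __init__(self, k):
        self.k = k

    def output(self, u):
        return self.k * u


class UnitDelay:
    """Block with internal state: output is computed from the held state only."""

    def __init__(self, initial=0.0):
        self.state = initial

    def output(self):
        # The current input is not consulted here.
        return self.state

    def update(self, u):
        # The current input is stored for the next time-step.
        self.state = u


gain = Gain(2.0)
delay = UnitDelay(initial=0.0)
gain_outputs = []
delay_outputs = []
for u in [1.0, 2.0, 3.0]:
    gain_outputs.append(gain.output(u))
    delay_outputs.append(delay.output())  # output first ...
    delay.update(u)                       # ... then update the state

print(gain_outputs)   # [2.0, 4.0, 6.0]
print(delay_outputs)  # [0.0, 1.0, 2.0]
```

In this sketch the output of the block with internal state lags its input by one time-step, which is why the flow into such a block does not constrain the order of calculation within a single time-step.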
A description is given of a configuration of the block diagram shown in FIG. 1. Here, reference letter f1 denotes output from block A; f2, output from block a; f3, output from block B; f4, output from block b; and f5, output from block c. In this case, f1 is inputted into block a; f2, into block B; f3, into block b; f4, into block c; and f5, into block A. However, the blocks A and B have internal states, and thus do not directly use inputs f5 and f2 to calculate f1 and f3, respectively, as described above. The following shows a pseudo code describing the above:
 while (ts < EOS) {
     // Output
     f1 = Aout(SA)
     f2 = a(f1)
     f3 = Bout(SB)
     f4 = b(f3)
     f5 = c(f4)
     // Update state
     SA = Ain(f5)
     SB = Bin(f2)
     // Update time
     ts++
 }
The pseudo code above shows that the while loop is repeated until the time ts reaches the end of simulation (EOS). In the code, for example, Aout( ) is a function with which the block A calculates its output from its internal state; Ain( ) is a function with which the block A calculates its internal state variable from its input; and a( ) is a function with which the block a calculates its output from its input.
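For reference, the loop can be written as a runnable Python sketch. The bodies of the block functions below are arbitrary placeholders assumed only for illustration, since the actual computations of the blocks are not specified; only the order of the calls follows the pseudo code.

```python
# Placeholder block functions (hypothetical; chosen only so the loop can run).
def a_out(sa): return sa        # Aout: output of block A from its state
def a_in(f5): return f5 + 1     # Ain: next state of block A from its input
def b_out(sb): return sb        # Bout: output of block B from its state
def b_in(f2): return f2         # Bin: next state of block B from its input
def a(f1): return 2 * f1        # block a (no internal state)
def b(f3): return f3 + 1        # block b (no internal state)
def c(f4): return 3 * f4        # block c (no internal state)


def simulate(eos, sa=0, sb=0):
    """Run the loop for eos time-steps and record (f1, ..., f5) per step."""
    trace = []
    ts = 0
    while ts < eos:
        # Output
        f1 = a_out(sa)
        f2 = a(f1)
        f3 = b_out(sb)
        f4 = b(f3)
        f5 = c(f4)
        # Update state
        sa = a_in(f5)
        sb = b_in(f2)
        trace.append((f1, f2, f3, f4, f5))
        # Update time
        ts += 1
    return trace


trace = simulate(3)
print(trace)  # [(0, 0, 0, 1, 3), (4, 8, 0, 1, 3), (4, 8, 8, 9, 27)]
```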
As seen from the pseudo code, in order to calculate outputs, the block A uses its internal state, and the block a uses the output from the block A. These calculations do not use output from the blocks B, b and c.
On the other hand, the blocks B, b and c do not use the outputs from the blocks A and a, either. This suggests that the process for A and a and the process for B, b and c can be executed in parallel. As shown in FIG. 3, the system preferably assigns the process for A and a and the process for B, b and c to different processors or cores, and executes the two processes in parallel. Subsequently, the system inputs the output from the block a to the block B, and the output from the block c to the block A. Thereafter, the system proceeds to the next parallel execution. In other words, the processes obtained by erasing the flow into each block with internal state can be executed in parallel within a single iteration.
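The execution scheme just described might be sketched in Python as follows. This is only an illustration under stated assumptions: the block computations are arbitrary placeholders, the function names `strand_1`, `strand_2`, and `simulate_parallel` are hypothetical, and a thread pool stands in for separate processors or cores (in CPython, threads illustrate the structure but do not by themselves speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def strand_1(sa):
    """Process for blocks A and a (placeholder computations)."""
    f1 = sa          # Aout(SA)
    f2 = 2 * f1      # a(f1)
    return f2


def strand_2(sb):
    """Process for blocks B, b and c (placeholder computations)."""
    f3 = sb          # Bout(SB)
    f4 = f3 + 1      # b(f3)
    f5 = 3 * f4      # c(f4)
    return f5


def simulate_parallel(eos, sa=0, sb=0):
    history = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(eos):
            # Both processes run in parallel within one iteration.
            fut1 = pool.submit(strand_1, sa)
            fut2 = pool.submit(strand_2, sb)
            f2, f5 = fut1.result(), fut2.result()
            # Exchange the results: c -> A and a -> B, then go to the next step.
            sa = f5 + 1  # Ain(f5)
            sb = f2      # Bin(f2)
            history.append((f2, f5))
    return history


history = simulate_parallel(3)
print(history)  # [(0, 3), (8, 3), (8, 27)]
```

The calls to `result()` act as the synchronization point at the end of each iteration: neither state is updated until both processes have finished.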
However, in many cases, simply erasing the flow into each block with internal state does not divide a model sufficiently, and thus does not enable parallelization. For example, in the case shown in FIG. 13, all the blocks remain connected to each other even after the flows are erased, so that no parallelization is obtained at all. This occurs because a block without internal state that receives and merges two or more signals prevents the blocks from being divided. Many models exhibit this phenomenon. Accordingly, a high degree of parallelization cannot be expected from the simple method described above alone.
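The limitation can be reproduced with a small Python sketch that erases every flow into a block with internal state and then collects the groups of blocks that remain connected. The function name `split_after_erasing` is hypothetical. The first graph follows the loop of FIG. 1; the second is an assumed example with a merging block `m` and is not a reproduction of FIG. 13.

```python
def split_after_erasing(blocks, edges, stateful):
    """Erase flows into stateful blocks, then return the connected groups."""
    kept = [(src, dst) for src, dst in edges if dst not in stateful]
    neighbors = {blk: set() for blk in blocks}
    for src, dst in kept:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    groups, seen = [], set()
    for blk in blocks:
        if blk in seen:
            continue
        stack, group = [blk], []
        seen.add(blk)
        while stack:  # depth-first search over the remaining flows
            cur = stack.pop()
            group.append(cur)
            for nxt in neighbors[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups


# Loop of FIG. 1: A -> a -> B -> b -> c -> A, with A and B holding state.
fig1 = split_after_erasing(
    ["A", "a", "B", "b", "c"],
    [("A", "a"), ("a", "B"), ("B", "b"), ("b", "c"), ("c", "A")],
    {"A", "B"},
)

# Assumed example: stateless block m merges the signals from a and b.
merged = split_after_erasing(
    ["A", "a", "B", "b", "m"],
    [("A", "a"), ("B", "b"), ("a", "m"), ("b", "m"), ("m", "A"), ("m", "B")],
    {"A", "B"},
)

print(fig1)    # [['A', 'a'], ['B', 'b', 'c']]
print(merged)  # [['A', 'B', 'a', 'b', 'm']]
```

In the second graph the merging block `m` keeps both chains in a single group, so erasing the flows into A and B yields nothing that can be run in parallel.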
Japanese Patent Application Publication No. 2003-91422 relates to a method for automatically converting a non-parallelized source code having a multiple loop structure into a parallelized source code executable by multiple processors and discloses an automatic generation program P of massively-parallelized source code for multiple iterative processing. This program P automatically generates a parallelized source code executable in parallel by m processors (m is an integer of 2 or more) from a non-parallelized source code including an n-fold nested loop (n is an integer of 2 or more). The program P causes a CPU to implement a function to transform the n-fold loop structural part into a structure of processes divided to be executable by the m processors. For this transformation, an initial value formula of each of the n-fold loops of a non-parallelized source code SC is rewritten to an initial value formula Sj expressed by using m continuous integers iak (k=0, . . . , m−1) and an incremental value δj defined for each iteration of a loop j (j=1, . . . , n). Here, the integers iak start from 0 and are assigned to the m processors to uniquely identify the m processors. Then, the n-fold loop structural part is transformed by using the rewritten initial value formula Sj and the incremental value δj.
Japanese Patent Application Publication No. 2007-511835 discloses that a network processor is configured into a D-stage processor pipeline, a sequential network application program is transformed into D-pipeline stages, and the D-pipeline stages are executed in parallel within the D-stage processor pipeline. In the transformation of a sequential application program, for example, the sequential network program is modeled as a flow network model and multiple preliminary pipeline stages are selected from the flow network model.
These conventional techniques, however, do not suggest any technique for enhancing parallelization within one iteration for functional blocks having loop-carried dependence.
Hence, the inventors of the present application proposed a technique for enhancing parallelization within one iteration for functional blocks, in the specification of commonly owned Japanese Patent Application No. 2009-251044, “Parallelization Method, System and Program.” That specification refers to a set of functional blocks executed in parallel as a strand, and the term is used herein with the same meaning.
The technique described in the specification of commonly owned Japanese Patent Application No. 2009-251044 has enhanced the parallelization. However, since the algorithm described therein does not necessarily take the sizes of the generated strands into consideration, the calculation times of the strands may become unbalanced. In this case, the strand requiring the maximum calculation time determines the total parallel processing time, and thus limits the speedup of the processing.
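The effect of such an imbalance can be illustrated with a short Python sketch. It assumes, for illustration only, that each strand runs on its own processor and that communication overhead is ignored; the function names and the cost figures are hypothetical.

```python
def parallel_time(strand_costs):
    """Time of one parallel step: the slowest strand dominates."""
    return max(strand_costs)


def speedup(strand_costs):
    """Sequential time divided by parallel time."""
    return sum(strand_costs) / parallel_time(strand_costs)


unbalanced = [70, 10, 10, 10]  # one large strand and three small ones
balanced = [25, 25, 25, 25]    # same total work, evenly divided

print(parallel_time(unbalanced), speedup(unbalanced))
print(parallel_time(balanced), speedup(balanced))
```

Both divisions contain the same total amount of work, but under these assumptions the unbalanced one takes 70 time units per step against 25 for the balanced one, which is why the sizes of the strands matter as well as their number.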