The present invention is in the general field of timing-related bug detectors which aim at detecting data races in multi-threaded computer program applications.
A general computer program is a list of statements, instructions, and commands to be executed properly and in a well ordered fashion. The operating system (OS hereafter) is the computer software that manages all the activities taking place in the computer. The OS is responsible for running the program on the computer.
The task of the computer program is the results it generates during its execution.
The computer may have more than one processor available to fulfill the OS needs and requirements. The OS might allocate more than one processor to execute a given program. If one processor is allocated to run the program, then the program's instructions are executed one-by-one in a well ordered fashion to generate the expected results. This sequential run of the program generates the sequential results, which are the results that the program is designed to generate. The order in which the program's instructions are executed in the sequential run is the sequential order of the program's instructions. A computer program may be split into several structures, each consisting of several instructions of the computer program. Each of the program's structures usually, but not necessarily, has a well defined task.
In general, a program can be described as a set of structures, along with their respective relationships and interconnections. In addition, due to the nature of these interconnections, a program can also be described as a hierarchy of several levels. In this case, the program's set of structures is distributed over these levels, where each structure is connected to one or more of the structures located in the level above it. This hierarchy is defined by the order in which these structures are to be executed. FIG. 1A illustrates a naive example of a hierarchical program (90) where each level consists of one structure, and FIG. 1B illustrates another, more complex, hierarchical program (20) where level (22) contains two parallel structures (A (24) and B (26)), and level (28) contains two parallel structures (C (30) and D (32)).
A general computer program may contain two or more parallel structures, as is exemplified in FIG. 1B. In the more general case, a program's structure may include several levels each containing two or more parallel structures.
A thread is a sequence of structures that are to be executed one after the other in a sequential fashion. Thus, a thread may consist of a sequence of structures that belong to consecutive levels in the hierarchy, and which are connected to each other. The results that are generated during the execution of this well ordered sequence of a program's structures are the thread's results. Reverting now to FIG. 1A, the program consists of only one thread, whose first structure is the begin structure (12) and whose last structure is the end structure (14) of the program. This thread is called a total thread seeing that it concerns only one total order.
In the other example of FIG. 1B, the program has two different threads {A, C} and {B, D}. A thread is a program segment defined to execute as a 'light' program, with its own local variables, possibly, but not necessarily, on a different processor. Thus, if a partition of the structures is given, a thread is an assignment of each partition structure to a processor. The partition should meet the requirement that its structures can be ordered in an order that does not contradict the order that is defined by the hierarchy. For example, with reference to FIG. 1B, {A, C} and {B, D} is an adequate partition considering that A→C and B→D do not contradict the hierarchy of FIG. 1B, and accordingly the assignment of {A, C} to a first processor and {B, D} to a second processor is feasible.
In addition to the fact that the thread consists of several structures executed one after the other, the thread is also associated with a well defined memory domain. A cell is the smallest unit of the memory that the computer program refers to. The thread's memory domain is the part of the computer memory to which the thread writes data and/or from which it reads data.
Therefore, a thread is defined by the following three major components:
(1) The sequence order of the thread's structures.
(2) The thread's memory domain. This memory domain or parts of it may be used also by other threads.
(3) The output domain where the thread writes its relevant results. The output domain is never used in a "read mode".
The thread's execution trace is a list of all its sequential structures' instructions that were executed during its full execution. Similarly, the program's execution trace is a list of all the program instructions that were executed during the full execution of the program. Here, each instruction is accompanied by:
(1) The time at which it was executed (the statement's execution time stamp),
(2) The ID of the thread that has executed this instruction, and
(3) The map of each thread's memory domain at each of the time stamps.
Part of the program's execution trace is the memory trace, which is the list of the memory maps, each taken at a different time, ordered sequentially.
If the program contains, at one of its points, N parallel structures, then it can be split into at most N parallel threads. Therefore, if a multi-processor computer is available to execute this program, then the OS can allocate each of the parallel threads to a different processor. Alternatively, in the case of a single-processor architecture, the OS can simulate the allocation of threads to respective processors.
Two parallel threads are connected to each other if parts of their memory domains overlap. These parts make up the two threads' overlap memory domain, or common memory. At a specific memory cell that belongs to the overlap memory domain of two threads, the following scenarios may occur:
(1) both threads write into this cell
(2) one writes into the cell and the other reads information out of it
(3) both read data from this memory cell
A data race between two parallel threads is the situation where the two threads are connected and both contain scenario (1) and/or scenario (2) on their overlap common memory. In this case, the two parallel threads compete, regardless of whether they are implemented in a single-processor or multi-processor architecture.
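The three scenarios and the resulting notion of a competing cell can be illustrated with a minimal sketch. It assumes, purely for illustration, that each thread's accesses are already summarized as a set of cells read and a set of cells written; the names `ThreadAccess` and `competing_cells` are hypothetical and do not come from the invention.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadAccess:
    """Hypothetical summary of the cells one thread reads and writes."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def competing_cells(a, b):
    """Return the cells of the overlap domain on which scenario (1) or (2) occurs."""
    write_write = a.writes & b.writes                          # scenario (1)
    read_write = (a.writes & b.reads) | (a.reads & b.writes)   # scenario (2)
    # Cells that both threads only read (scenario (3)) are deliberately excluded.
    return write_write | read_write


ta = ThreadAccess("TA", reads={"x", "y"}, writes={"z"})
tb = ThreadAccess("TB", reads={"y"}, writes={"x", "z"})
cells = competing_cells(ta, tb)
print(sorted(cells))
```

Here cell `y` is read by both threads and so is not reported, while `x` (read by TA, written by TB) and `z` (written by both) are.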
A competing point of two competing threads is a memory cell which belongs to their overlap memory domain and on which there is a data race. Two threads may have more than one competing point. For example, assume that structures A and B in FIG. 1B belong to two connected threads, TA and TB respectively. In the case of scenario (2), if TA reads and TB writes to the same competing point, then TA can get the value of the contents of the competing point either before or after TB wrote values into this common cell, depending on the order of execution. Thus, when terminated, TA might contain different values in its memory domain for the different cases that might take place.
When parallel structures are allocated to different parallel processors, and if no synchronization exists, the parallel processors can start and end the execution of their allocated structures at undetermined times, giving rise to different possible interleavings among the parallel structures and consequently among the parallel threads. In the case that the parallel threads compete, the results of one or more of the threads may differ from the sequential results of the program, which is obviously undesired. Thus, in general, the existence of a competing point in a multi-threaded parallel program is a source of inconsistency in its results. Depending on the OS activities taking place at the same time that the program is executed, different results can be obtained for different runs of the program. Therefore, by using appropriate system mechanisms, usually known as synchronization calls, the connected threads can be synchronized at each relevant competing point. The synchronization calls are sometimes implemented as library calls and sometimes as programming language primitives (as is the case in the Java language).
Based on this, a data race occurs when parallel structures are not synchronized, leading to results which depend on the schedule by which the OS executes these parallel structures, or on the schedule by which the OS activates the processors that execute their associated structures.
Two different runs of two connected threads are equivalent if their two respective memory traces are identical. The execution of a program is unique if all the runs of its connected threads are equivalent to each other and, of course, equivalent to the sequential result of the program.
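The execution trace, the memory trace, and the equivalence of two runs can be sketched as follows. The record layout and the names `TraceEntry`, `memory_trace`, and `equivalent` are assumptions made for illustration only; the text does not prescribe any particular representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    """Hypothetical execution-trace record for one executed instruction."""
    timestamp: int      # execution time stamp of the instruction
    thread_id: str      # ID of the thread that executed it
    instruction: str
    memory_map: tuple   # snapshot of the memory domain as (cell, value) pairs


def memory_trace(trace):
    """The list of memory maps, ordered by time stamp."""
    return [entry.memory_map for entry in sorted(trace, key=lambda e: e.timestamp)]


def equivalent(run_a, run_b):
    """Two runs are equivalent if their memory traces are identical."""
    return memory_trace(run_a) == memory_trace(run_b)


# Run A: TA writes x before TB copies it into y.
run_a = [
    TraceEntry(1, "TA", "x = 1", (("x", 1), ("y", 0))),
    TraceEntry(2, "TB", "y = x", (("x", 1), ("y", 1))),
]
# Run B: TB copies x into y before TA writes x.
run_b = [
    TraceEntry(1, "TB", "y = x", (("x", 0), ("y", 0))),
    TraceEntry(2, "TA", "x = 1", (("x", 1), ("y", 0))),
]
print(equivalent(run_a, run_a), equivalent(run_a, run_b))
```

The two runs use the same instructions but end with different memory maps, so under this definition they are not equivalent.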
If two runs of a program that use the same input give rise to different results, then the program has a data race in respect of at least one of its competing points, and one of the following conclusions holds true:
neither of the results is the correct one;
one of the results is the correct one, and it is not known which one it is;
it is not known, in general, which thread gave rise to which result, as the trace can be at a different abstraction level;
all the results are correct, as the race might be intentional, e.g., in order to improve performance.
A sync control is an OS synchronization service used to enforce order among competing structures (or portions thereof). A sync service is applied to the entire structure (i.e., a series of instructions) or to a subset of the specified set of instructions, including the specific case of only one instruction. The sync service synchronizes the connected structures and includes, as a rule, two basic controls: lock and unlock. Whenever the OS, for the benefit of a given thread, locks a memory cell, any other thread that needs access to the memory cell is put on hold until the OS unlocks the cell. After unlocking the cell, the OS may lock it again for the benefit of another thread. The processes of locking and unlocking memory cells by the OS are well defined to the OS before the program starts its execution.
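As a minimal sketch of the lock and unlock controls, the following uses Python's `threading.Lock` as a stand-in for the OS sync service described above; the shared variable `counter` and the function `add_many` are hypothetical. Each thread holds the lock while it reads and updates the shared cell, so any other thread needing the cell waits until the lock is released.

```python
import threading

counter = 0                    # the shared memory cell
cell_lock = threading.Lock()   # stand-in for the OS sync service


def add_many(n):
    """Increment the shared cell n times, locking it around each update."""
    global counter
    for _ in range(n):
        with cell_lock:        # lock on entry, unlock on exit
            counter += 1       # read-modify-write on the competing point


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

Because every update is performed under the lock, the final value does not depend on how the four threads are interleaved.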
The sync control is seemingly the ideal solution which copes with the possible inconsistencies in a multi-thread computer program as it synchronizes the connected structures and imposes a predefined sequential order which brings about one result.
Regretfully, in a multi-threaded computer program it is quite common that even a proficient programmer/developer will fail to identify all existing racing points and consequently will fail to introduce the appropriate sync controls in the program. As specified above, this may lead to an interleaving sequence or sequences that bring about inconsistent results which are different from those anticipated by the programmer. Normally, the larger the level of parallelism (number of interleavings), the higher the prospects for obtaining inconsistent results (this situation is referred to also as time related (TR) bugs). Obtaining inconsistent results in successive runs of a computer program may lead to dire consequences in multi-threaded computer program applications incorporated in, say, military oriented applications or medical related applications (e.g., a computer application which monitors the operation of medical equipment for intensive care purposes).
Various solutions have been proposed in accordance with the prior art in order to cope with the inconsistent results obtained in running a multi-threaded program. The most straightforward approach is to conduct so called "stress tests", where the program under test is constrained to operate in varying operational conditions and the program's execution trace and/or results are logged and compared. In the case of a discrepancy between two or more runs, one can assume that a data race has been encountered in respect of at least one memory cell. This naive approach has some significant limitations. For one, even if a data race is encountered, it is difficult to identify the specific interleaving which gave rise to the defective result, since no data is provided as to the exact scheduling order of the structures to the parallel processors by the OS. Moreover, regardless of whether a data race has been encountered or not, it is not guaranteed that, even under a very demanding stress test, all possible interleavings for a given partial order occur. This being the case, the stress test can never be regarded as sufficiently reliable, considering that those interleavings which were not encountered may lead to inconsistent results. It should be noted that the partial order is normally determined by the input (i.e. different partial orders may be defined by respective different inputs).
In the Assure™ (Assure is a trademark of Kuck and Associates, Inc.) User's Manual Version 1.0, Document #9801002, it was suggested to monitor the entire memory and intercept any data read (R) and data write (W) to a memory cell. Any read/write conflict that is encountered is analyzed in order to determine whether or not there exists a data race in respect of this cell.
Reference is also made to "Eraser: A Dynamic Data Race Detector for Multi-Threaded Programs" by Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro and Thomas Anderson.
The most obvious shortcoming of the specified techniques is that every access to the memory is analyzed, thus posing undue overhead considering that only a few memory cells may indeed be subject to a data race. Moreover, even if a given memory cell is suspected of being subject to a data race, it is required to ascertain whether or not the "suspected" memory cell is in the scope of a sync control command. If in the affirmative (i.e., it is within the scope of a sync command), then it does not constitute a competing memory cell. For a better understanding of the foregoing, consider the following sequence of instructions:
f( )
{
    lock( )
    h( )
}
h( )
{
    l( )
}
l( )
{
    X=3
}
As shown, function f( ) calls function h( ), which is synchronized by a lock( ) synchronization command. h( ) in its turn calls function l( ), in which the variable X is assigned the value 3. Since X resides (indirectly) in the scope of the synchronized function h( ), it may not constitute a competing cell. However, according to the specified techniques, the test is triggered only when the variable X is accessed (i.e., when the command X=3 is executed). At this stage, according to the prior art techniques, it is very difficult and time consuming to realize that there is no need to check X (for determining whether or not it is subject to a data race) considering that X (i.e., the memory cell being representative of X) is under the scope of a lock( ) synchronization command.
There are known in the art formal verification techniques (refer to, e.g., 'Model Checking for Programming Languages Using VeriSoft' by Patrice Godefroid). This category of tools can apply formal methods to verify properties of concurrent programs, such as race conditions. Experience shows that they are only applicable to relatively small software applications.
There is accordingly a need in the art for providing testing tools and appropriate methodologies to help increase the confidence that a program is free of timing related (TR) bugs that stem from data races in respect of common memory.
The invention aims at providing an automatic detection tool for detecting TR bugs, i.e. Time Related Bug detector (hereafter TRBD), which is a new concurrent testing tool for testing the concurrent aspects of a multi-threading program (hereafter MTP).
The TRBD provides sufficient confidence in the program's correctness in terms of TR bugs that are related to unexpected data races.
According to a first aspect of the invention, there is provided a multi-threaded computer program partitioned into structures of which at least one structure is parallel to at least one other structure. The multi-threaded computer program is executed in a multi or single processor environment under the control of an OS which utilizes a scheduler (optionally replaceable scheduler).
Preferably, the TRBD has a private scheduler that partially or fully replaces the OS scheduler.
The TRBD runs the program successively and during each cycle the private scheduler synchronizes the structures according to a given partial order. Thus, in a first run cycle a given interleaving is implemented that meets a given partial order. In the next run cycle, a different interleaving is implemented that meets the same partial order. This procedure of successively running the program is continued until all the interleavings that meet the specified partial order are covered and results are obtained in respect of each separate run.
The TRBD has a mechanism to detect discrepancies between the so obtained results. In the case that all the results are identical for the same input, this indicates with a high degree of confidence that the computer program is data race free. If, on the other hand, there appears to be a discrepancy between one (or possibly more than one) of the results obtained in a given cycle (or cycles) as compared to the other result(s), this indicates not only that there exists a data race, but also the specific interleaving which gave rise to the defective results.
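The run-and-compare procedure can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the private scheduler itself: threads are modeled as lists of hypothetical step functions, the partial order is simply each thread's own program order, and the helper names `interleavings` and `run` are invented for the example. Every interleaving is executed deterministically on a fresh memory, and the results are compared.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    n, m = len(a), len(b)
    for slots in combinations(range(n + m), n):
        slot_set = set(slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slot_set else next(ib) for i in range(n + m)]


def run(schedule):
    """Execute one interleaving on a fresh memory and return the result."""
    memory = {"x": 0}   # common memory
    local = {}          # thread-local values
    for step in schedule:
        step(memory, local)
    return memory["x"]


# Two threads, each performing an unsynchronized increment of x.
def ta_read(mem, loc): loc["ta"] = mem["x"]
def ta_write(mem, loc): mem["x"] = loc["ta"] + 1
def tb_read(mem, loc): loc["tb"] = mem["x"]
def tb_write(mem, loc): mem["x"] = loc["tb"] + 1


results = {}
for schedule in interleavings([ta_read, ta_write], [tb_read, tb_write]):
    key = tuple(step.__name__ for step in schedule)
    results[key] = run(schedule)

distinct = set(results.values())
race_suspected = len(distinct) > 1
for key, value in results.items():
    print(key, "->", value)
```

Of the six interleavings, only the two in which one thread finishes before the other starts leave `x` at 2; the remaining four leave it at 1. The discrepancy flags a suspected race, and the dictionary keys identify which interleavings produced each result.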
Those versed in the art will readily appreciate that an underlying premise of the invention is that different results obtained in two interleavings of the same partial order indicate, with a high degree of confidence, that there exists a race. As will be explained in greater detail below, in the specific case of Java™ (Java is a trademark of Sun Microsystems), in order to meet the specified underlying premise, the interleavings of a given partial order that are subjected to the method steps of the invention are a priori selected so that they meet the so called release consistency requirement. Put differently, in Java, had one or more of the interleavings (of a given partial order) that are subject to the technique of the invention not met the release consistency requirement, and assuming that different results are obtained for different interleavings, this would not necessarily indicate a race condition.
The indication of the relevant interleaving that is associated with a given result, which is suspected to result from a run where a data race occurred, assists the programmer/developer in identifying the common memory cell or cells which are subject to competition (and which were overlooked by the programmer when he/she incorporated sync commands in the program), and thereby in rendering the computer program "race free" with a higher degree of confidence.
It should be noted that in many real-time applications programmers tend to limit the use of sync commands only to those cases where they consider it absolutely necessary, in order to optimize the program performance. This optimizing approach is risky since one or more program sections which necessitate synchronization may be overlooked. The TRBD tool of the invention may be employed in order to overcome or substantially reduce this limitation. Thus, for example, in the case of a Java program the programmer may utilize the TRBD tool of the invention for accomplishing program optimization. In the case of inconsistent results (which suggest that a race has been encountered), the programmer can modify the program by moving the acquire and/or release sync commands (that correspond to the specified lock and unlock commands) a few program statements forward or backward and repeatedly use the tool until a TR-free program is obtained. Accordingly, a repeated use of the tool on the corrected program helps to check whether the optimization is correct.
There are various known per se techniques which may be utilized to compare the results obtained in different cycles.
Accordingly, the present invention provides for, in a computer system running under the control of an OS having a scheduler; the computer system further includes a multi-threaded computer program that is partitioned into structures of which at least one structure is parallel to at least one other structure,
a Time-Related-Bug-Detector (TRBD) method for detecting data races between parallel structures in respect of common memory structures, comprising:
(a) coupling a private scheduler to the OS;
(b) running the program in a few cycles and, during each cycle of program run, the private scheduler synchronizing the structures according to a respective interleaving of a partial order and for each cycle logging the respective full or partial results of the program, until substantially every possible interleaving of said partial order has been tested;
(c) comparing the results, and in the case that they are identical indicating that said program is race free in a degree of confidence, otherwise indicating that said program is susceptible to at least one data race in respect to a common memory.
In the context of the invention, a first structure is parallel to a second structure if the former commences execution before the latter terminates execution or vice versa. Common memory should be construed as any memory unit including but not limited to the smallest memory unit (e.g. a given memory address, or memory cell) which is accessible to the processor. Memory should be construed as any physical storage medium.
Computer program should be construed as encompassing any computer code (and its associated data) adapted to be executed on a processor (multi-threaded environment on a single processor) or processors, regardless of the physical arrangement of the code.
The term results refers typically (although not necessarily) to the input-output relation (i.e. outputs obtained for a given input), or to the program's execution trace after so called conditional switch points (see below), as the case may be.
By one embodiment, the private scheduler is implemented in accordance with the concurrent testing tool described in "Timing-Dependent Bugs", by Michael Factor, Eitan Farchi and Yoram Talmor, published in Software Testing Analysis and Review CD, 1998 (referred to herein also as king scheduler).
The operation of a TRBD system or method in accordance with the first aspect of the invention requires the obtainment of a partial or full set of results (i.e. output-input relation) in response to running respective interleavings of the same partial order of the computer program. It should be noted in this connection that, generally, a given partial order is determined by the input that is fed to the computer program. In other words, different inputs may give rise to different partial orders.
In some real life applications, it is difficult to obtain and log results, or, alternatively, even if results (or partial results) are obtained, it is difficult to determine the difference between them. A non-limiting example of the latter is a graphic user interface (GUI) application where the "result" of the program is portrayed on the screen and it is difficult to indicate the differences between the screens generated by respective different runs of the computer program application.
In accordance with a second aspect of the invention, and similar to the first aspect, the Time-Related-Bug-Detector (TRBD) system and method synchronizes the structures in the manner specified. However, instead of analyzing the output-input results (in the sense specified above) of the computer program application in respective different runs (interleavings), the program's execution trace (constituting also "results") after so called conditional switch points is logged and compared to the trace obtained in successive (and previous) runs that meet the same partial order. In the case that the trace is consistent in respect of all the switch points in each one of the interleavings, then the program is data race free with a high degree of confidence. Otherwise, there exists a data race.
Conditional switch point, in this context, is any instruction in the program where a condition is tested and the program switches to an execution of a command depending upon the result of the condition. Typical, yet not exclusive, examples of conditional switch points (in the C++ programming language) are if statements, do while statements and others.
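Logging and comparing the targets taken at conditional switch points can be sketched in the same toy style. The step functions, the string labels used as target locations, and the helper `run_and_log` are all hypothetical; a real tool would record actual program locations under the control of the private scheduler.

```python
def run_and_log(schedule):
    """Execute one interleaving and return the targets taken at switch points."""
    memory = {"flag": 0}   # common memory
    trace = []             # logged target locations
    for step in schedule:
        step(memory, trace)
    return tuple(trace)


def writer_set_flag(mem, trace):
    mem["flag"] = 1


def reader_branch(mem, trace):
    # Conditional switch point: log which target the program switches to.
    if mem["flag"]:
        trace.append("reader:then")
    else:
        trace.append("reader:else")


# The two interleavings of the parallel writer and reader structures.
schedules = [
    [writer_set_flag, reader_branch],
    [reader_branch, writer_set_flag],
]
traces = [run_and_log(s) for s in schedules]
consistent = len(set(traces)) == 1
print(traces, consistent)
```

The reader takes a different branch depending on whether the writer ran first, so the logged targets differ between the two runs and the program is flagged as susceptible to a data race on `flag`, without any need to compare final outputs.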
Accordingly by this aspect the invention provides for: in a computer system running under the control of an OS having a scheduler; the computer system further includes a multi-threaded computer program that is partitioned into structures of which at least one structure is parallel to at least one other structure, the program includes at least one conditional switching command where the program tests a condition and switches to a different target location depending upon the result of said condition,
a Time-Related-Bug-Detector (TRBD) method for detecting data races between parallel structures in respect of common memory structures, comprising:
(a) coupling a private scheduler to the OS;
(b) running the program a few times and, during each cycle of program run, the private scheduler synchronizing the structures according to a respective interleaving of a partial order and for each cycle logging the at least one target location that the program switches to in response to the execution of the at least one conditional switching command, until substantially every possible interleaving of a partial order has been tested;
(c) comparing the target locations obtained in the cycles of executions and in the case that they are identical indicating that said program is race free in a degree of confidence, otherwise indicating that said program is susceptible to at least one data race in respect to a common memory.
Still further, the invention provides for a storage medium storing at least one computer file holding data being representative of a Time-Related-Bug-Detector (TRBD) computer program that can be applied to a multi-threaded computer program which is partitionable into structures of which at least one structure is parallel to at least one other structure; the TRBD computer program is capable of detecting data races between parallel structures in respect of common memory structures, by executing the steps that include:
(a) coupling a private scheduler to an Operating System;
(b) running in a computer system the multi-threaded program in a few cycles and, during each cycle of program run, the private scheduler synchronizing the structures according to a respective interleaving of a partial order and for each cycle logging the respective full or partial results of the multi-threaded program, until substantially every possible interleaving of the partial order has been tested;
(c) comparing the results, and in the case that they are identical indicating that said multi-threaded program is race free in a degree of confidence, otherwise indicating that said program is susceptible to at least one data race in respect to a common memory.