FIG. 1 illustrates a control flow graph for a computer program. In the control flow graph of FIG. 1, there are ten computer instructions or ten segments of code (referred to also as basic blocks of computer code) represented as nodes in a directed graph. The ten nodes of FIG. 1 are labeled "a" through "j" and correspond to ten different basic blocks of computer code. In the control flow graph of FIG. 1, the computer instruction(s) in basic block "a" are executed first in time in the execution path of the computer program. Since basic block "a" is the endpoint of a feedback path or looping path from basic block "j" back to basic block "a", basic block "a" may contain, for example, a while loop instruction, a for loop instruction, a repeat instruction, a do loop, or a like looping structure. Alternatively, basic block "j" can contain a branch instruction which has a destination address of the beginning of basic block "a".
After the basic block "a" is executed, sequential execution results in basic block "b" being executed following every execution of basic block "a" as illustrated in the control flow graph of FIG. 1. Execution flow will split in one of two directions after basic block "b" is executed depending upon a software condition. Therefore, basic block "b" contains an if-then-else instruction or a like flow construct which involves branching down one of two distinct execution flow paths. If one condition or set of constraints is detected in the basic block "b", then the basic block "c" is executed. If another condition or set of constraints is determined to exist in basic block "b", then the basic block "d" is executed. In either case, exactly one of "c" or "d" is executed after "b" is executed as illustrated in FIG. 1. Both basic blocks "c" and "d" converge back to basic block "e" in a manner similar to an if-then-else flow control. In other words, after executing either "c" or "d", the code contained in basic block "e" will be executed.
From basic block "e" or node "e" of the directed graph of FIG. 1, execution flow continues so that basic block "f" is executed. The basic blocks "f", "g", "h" and "i" of FIG. 1 are of a construct very similar to basic blocks "b", "c", "d" and "e" discussed above, and therefore these two sets of basic blocks are executed in a similar or identical execution flow manner. Once the basic block "j", which is a loop termination point as discussed above, determines that no more loops need to be made through the nodes of FIG. 1, then the execution flow of the computer program exits the construct of FIG. 1 via the exit path from node "j".
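By way of illustration only, the control flow graph described above may be sketched as an adjacency list. The names cfg and single_pass_paths are hypothetical and are not taken from the figures, and the edge from "i" to "j" is assumed from the execution sequences discussed with reference to FIG. 2. The sketch enumerates the execution paths that a single pass of the loop from "a" to "j" can take.

```python
# Hypothetical adjacency-list form of the control flow graph of FIG. 1.
# Each key is a basic block; each value lists the blocks that may execute next.
cfg = {
    "a": ["b"],
    "b": ["c", "d"],   # two-way branch
    "c": ["e"],
    "d": ["e"],        # "c" and "d" converge at "e"
    "e": ["f"],
    "f": ["g", "h"],   # second two-way branch
    "g": ["i"],
    "h": ["i"],        # "g" and "h" converge at "i"
    "i": ["j"],        # assumed edge, see lead-in
    "j": ["a"],        # loop-back edge; the exit path is not modeled here
}


def single_pass_paths(graph, start="a", end="j"):
    """Return every block sequence from start to end for one loop pass."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + node
        if node == end:
            # Stop at the loop termination block; do not follow the back edge.
            paths.append(prefix)
            return
        for successor in graph[node]:
            walk(successor, prefix)

    walk(start, "")
    return paths


print(single_pass_paths(cfg))
# ['abcefgij', 'abcefhij', 'abdefgij', 'abdefhij']
```

Because each of the two branches has two outcomes, a single pass follows one of four possible paths.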
The execution flow of the computer program illustrated in FIG. 1 can be analyzed to determine efficient rearrangement of computer basic blocks in memory so that software executes in an efficient manner. In order to do so, FIG. 2 illustrates that an execution tracing routine is performed to collect data from the execution of the computer program graphically illustrated in FIG. 1. This trace process creates a trace data file in memory. The trace data file illustrated in FIG. 2 records the time-sequential execution flow of the computer program graphically illustrated as basic blocks of code in FIG. 1. The trace data stores block execution order in a time sequential manner. Spaces (" ") are used in FIG. 2 to separate different executed passes of the loop a-j from each other.
Therefore, in order to create the trace file in FIG. 2, an empty trace data file is first created and execution of the basic blocks "a" through "j" begins. The time sequential order of the basic blocks executed in a first loop through basic blocks "a" through "j" is {abcefgij}. In other words, in the first loop, recorded in a left-hand side of FIG. 2, the {b-c} path is taken in FIG. 1 and the {f-g} path is taken in FIG. 1, resulting in the blocks {abcefgij} being executed in a time sequential order. The basic block "j" directs the execution flow back to basic block "a", and the second loop sequence in FIG. 2 is also {abcefgij}. Therefore, the same instruction sequence {abcefgij} is executed twice in a row, one right after another, in a time sequential manner via the loop from block "j" to block "a". This time sequential execution flow is continually recorded for a period of time and stored in the trace data file for further analysis at a subsequent time.
A computer is then able to graphically model the computer software as illustrated in FIG. 3 by analyzing the trace data of FIG. 2. It is important to note that when first executing the computer program containing blocks a-j to generate the trace data file in FIG. 2, the computer has no idea of the execution flow of the software as illustrated in FIG. 1. The trace file of FIG. 2 is analyzed to obtain the execution flow structure of FIG. 3 which also contains the same information contained in FIG. 1.
The directed graph of FIG. 3 is constructed by scanning the trace data in FIG. 2 from left to right and analyzing pairs of basic blocks that are adjacent to each other in time. Initially, no data structure is present when the algorithm begins (FIG. 3 is blank in a starting state). The algorithm then takes the first pair of basic blocks in FIG. 2, which is the pair "ab". In FIG. 3, a node "a" is created, a node "b" is created and an edge "ab" from node "a" to node "b" is created with a weight or count of 1. In a second access to the data of FIG. 2, the pair "bc" is next analyzed. Since the node "b" has been previously created in FIG. 3, the computer simply creates a node "c" and an edge "bc" from "b" to "c" with a weight of 1. When a pair is encountered whose edge already exists, the weight of that edge is incremented by one. This creation and interconnection of nodes and edges, and the incrementing of edge weights as further pairs are encountered, continues for the entire data segment illustrated in FIG. 2 to result in the completed data structure illustrated in FIG. 3. As illustrated in FIG. 3, the basic block "b" follows basic block "a" nine times in FIG. 2 whereas basic block "c" follows basic block "b" only five times in FIG. 2, as is evident from the weights on the edge "ab" connecting nodes "a" and "b" and the edge "bc" connecting nodes "b" and "c" illustrated in FIG. 3.
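The pairwise scan described above may be sketched as follows. This is a minimal sketch only. The trace shown is hypothetical, since the data of FIG. 2 is not reproduced in the text; it was chosen merely to agree with the counts stated above ("ab" nine times, "bc" five times) and with the first two loops both being {abcefgij}. The function name build_edge_weights is likewise an assumption.

```python
from collections import Counter


def build_edge_weights(trace_passes):
    """Count how often each block immediately follows another in the trace."""
    # The spaces in the trace only separate loop passes, so the passes are
    # joined into one time-sequential string before scanning.
    sequence = "".join(trace_passes)
    weights = Counter()
    # Scan left to right over adjacent pairs; a new pair starts at weight 1
    # and a repeated pair has its weight incremented.
    for src, dst in zip(sequence, sequence[1:]):
        weights[(src, dst)] += 1
    return weights


# Hypothetical nine-pass trace (assumption, not the actual FIG. 2 data).
trace = (["abcefgij"] * 2 + ["abdefhij"] * 4
         + ["abcefgij"] * 2 + ["abcefhij"])

weights = build_edge_weights(trace)
print(weights[("a", "b")], weights[("b", "c")], weights[("b", "d")])
# 9 5 4
```

The nodes of the graph need not be stored separately in this sketch, since every node appears as an endpoint of at least one counted edge.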
Once the data structure of FIG. 3 is created from the trace file of FIG. 2, a method illustrated in the flowchart of FIG. 4 can be performed to analyze the data structure of FIG. 3 to determine an efficient manner of ordering basic blocks in memory so that cache performance may be improved and pipeline flushing may be minimized resulting in improved processor performance. The efficient output order of basic blocks (the output file resulting from the method of FIG. 4) is illustrated in FIG. 5. In order to discuss FIG. 4 of the prior art restructuring method, it is important to refer to FIG. 5, which is the output of the method of FIG. 4.
Initially, the method of FIG. 4 begins via an initialization step 100 which prepares for the formation of a sequence chain of reordered basic blocks of instructions. In step 102, the not-yet-selected node of FIG. 3 having the highest exiting path/edge value is selected. In FIG. 3, the nodes "a", "e", and "i" are tied for the highest path/edge value, which is 9. Nine is the greatest edge value in FIG. 3. In this case of a tie, the first node in the execution flow, which is "a" in this case, is selected arbitrarily. The basic block "a" is then placed in a restructured computer file as illustrated in step 1 of FIG. 5. An execution chain (a sequential list of basic block(s)) is then created with the beginning of the chain being set to the node determined in step 102 (which in this case is node "a"). Therefore, step 106 sets the beginning of the chain to the node "a" in FIG. 3. Step 108 is used to determine which of the nodes "a" through "j" are reachable from node "a" in FIG. 3. From node "a" only one node is reachable, and that node is node "b" in FIG. 3. Therefore, step 108 (in a first loop) produces a single node which is node "b".
Node "b" is then analyzed in step 110, and since the set of nodes determined in the latest execution of step 108 contains only the node "b", the node "b" is selected in the step 110 as being the node with the highest path value. In step 110, node "b" is then inserted into the restructured computer file of FIG. 5 as illustrated in step 2 of FIG. 5. The restructured computer file now contains the instruction chain or sequence "ab".
Step 108 then determines that nodes "c" and "d" are reachable from node "b" as illustrated in FIG. 3. Step 110 then analyzes "c" and "d" and determines that node "c" has a path value of 5 and node "d" has a path value of 4. Therefore, steps 108 and 110 in FIG. 4 insert the basic block "c" into the restructured data file of FIG. 5 after block "b". Step 3 of FIG. 5 illustrates that node "d" is ignored and is not inserted into the chain of FIG. 5 at this point in time since node "d" did not have the highest weight value. Continuing from node "c", basic block "e" (represented by node "e" in FIG. 3) is inserted in a step 4 of FIG. 5 using the algorithm of FIG. 4. Node "f" is then inserted in a step 5 of FIG. 5 using the process outlined in FIG. 4. Between nodes "h" and "g" in FIG. 3, steps 108-110 will determine that node "h" has a greater path value from node "f" than node "g" and insert basic block "h" after block "f" in a step 6 of FIG. 5. Code represented by node "i" is then inserted via step 7 of FIG. 5, and "j" is inserted via a step 8 in FIG. 5. Once node "j" is inserted in step 8, there are no more unselected nodes which can be reached from node "j" in FIG. 3 since node "a" has already been analyzed and inserted into FIG. 5 in step 1 of FIG. 5. Therefore, step 108 sends the control of FIG. 4 back to step 102, and step 102 finds a new unselected node which has the highest weight value. In summary, by step 8 of a left portion of FIG. 5, the chain of blocks {abcefhij} is now fully sequentially inserted into the restructured computer file as illustrated graphically via a region 90 illustrated in a left portion of FIG. 5.
Returning to steps 102-106, the only remaining unselected nodes in FIG. 3 are "d" and "g", which have equal edge weight values. Therefore, by default, node "d", which is the earlier node, is chosen via the process of FIG. 4. Node "d" is inserted via step 9 in FIG. 5. Since the node "e" is reachable from node "d" in FIG. 3 but has already been previously selected (see step 4 of FIG. 5) and placed into the file of FIG. 5, step 108 determines that there is nothing more to process from node "d" and step 102 is once again executed. The only node remaining is node "g", which is inserted in a step 10 of FIG. 5.
Therefore, when a compiler is ordering the basic blocks of the program flow illustrated in FIG. 3, the final ordering of instructions or basic blocks in memory is performed as illustrated in step 10 of FIG. 5 with the goal of attempting to improve processor performance.
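The greedy chaining of steps 102 through 110 may be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation of FIG. 4. The edge weights are hypothetical values chosen to agree with the weights stated above (9 for "ab", 5 for "bc", 4 for "bd", "fh" greater than "fg", "d" and "g" tied). Ties are broken by alphabetical label, which is assumed here to stand in for "the earlier node in the execution flow". The name greedy_block_order is also an assumption.

```python
# Hypothetical edge weights consistent with the values stated for FIG. 3.
weights = {
    ("a", "b"): 9, ("b", "c"): 5, ("b", "d"): 4, ("c", "e"): 5,
    ("d", "e"): 4, ("e", "f"): 9, ("f", "g"): 4, ("f", "h"): 5,
    ("g", "i"): 4, ("h", "i"): 5, ("i", "j"): 9, ("j", "a"): 8,
}


def greedy_block_order(edge_weights):
    """Order blocks by repeatedly following the heaviest adjacent edge."""
    nodes = sorted({node for edge in edge_weights for node in edge})
    successors = {node: {} for node in nodes}
    for (src, dst), count in edge_weights.items():
        successors[src][dst] = count

    def heaviest_exit(node):
        return max(successors[node].values(), default=0)

    selected, order = set(), []
    while len(order) < len(nodes):
        # Step 102: pick the unselected node with the heaviest exiting edge.
        # max() returns the first maximal item, so sorted input gives the
        # alphabetical tie-break assumed in the lead-in.
        current = max((n for n in nodes if n not in selected),
                      key=heaviest_exit)
        while current is not None:
            # Append the block to the restructured output chain.
            selected.add(current)
            order.append(current)
            # Step 108: unselected nodes reachable from the current node.
            reachable = sorted(d for d in successors[current]
                               if d not in selected)
            # Step 110: follow the heaviest edge, or end the chain if none.
            current = max(reachable,
                          key=lambda d: successors[current][d],
                          default=None)
    return "".join(order)


print(greedy_block_order(weights))
# abcefhijdg
```

Under these assumptions the sketch reproduces the ordering walked through above: the chain {abcefhij} first, followed by "d" and then "g". Note that each decision consults only the edges leaving the current node.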
However, the prior art method illustrated in FIGS. 1-5 is flawed. By looking at FIG. 2, one can easily determine that if the path {bc} is taken, it is most likely that the path {fg} is also taken in conjunction with path {bc}. One can also determine that if the path {bd} is taken, then the path {fh} is also more likely to be taken. In other words, the correlation between path {bc} and path {fg} is high, and the correlation between path {bd} and path {fh} is likewise high. Therefore, the most efficient organization of basic blocks in step 10 of FIG. 5 would be to couple the path {bc} with {fg} in some serial order or couple the path {bd} with {fh} in some serial order. However, the algorithm illustrated via prior art FIGS. 4 and 5 results in the path {bc} being coupled and serially positioned with the path {fh} (see this illustrated graphically in the right portion of FIG. 5). This choosing of the wrong pairs, to the detriment of CPU execution performance, results because the prior art algorithm of FIG. 4 does not look ahead to more distant nodes and paths in the data structure of FIG. 3 but only looks at directly adjacent pairs of basic blocks or nodes in FIG. 3. The result is that the prior art of FIGS. 4 and 5 performs basic block restructuring in a limited fashion which obtains limited performance benefit. Therefore, it is more advantageous to design a basic block restructuring process which identifies these correlations between more distant paths and performs improved sequencing of instructions to result in fewer cache misses, fewer external memory accesses, fewer page misses, fewer pipeline flushes and/or stalls, and increased program execution speed.
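The correlation discussed above, which adjacent-pair edge weights cannot capture, may be made visible by counting, for each loop pass, which branch is taken after "b" together with which branch is taken after "f". The following is a sketch for illustration only. The trace is hypothetical (the data of FIG. 2 is not reproduced in the text) and the name branch_pair_counts is an assumption.

```python
from collections import Counter


def branch_pair_counts(trace_passes):
    """Count the joint outcomes of the branch at "b" and the branch at "f"."""
    counts = Counter()
    for one_pass in trace_passes:
        # The block executed immediately after "b" is either "c" or "d".
        after_b = one_pass[one_pass.index("b") + 1]
        # The block executed immediately after "f" is either "g" or "h".
        after_f = one_pass[one_pass.index("f") + 1]
        counts[(after_b, after_f)] += 1
    return counts


# Hypothetical nine-pass trace (assumption, not the actual FIG. 2 data).
trace = (["abcefgij"] * 2 + ["abdefhij"] * 4
         + ["abcefgij"] * 2 + ["abcefhij"])

counts = branch_pair_counts(trace)
print(counts[("c", "g")], counts[("d", "h")], counts[("c", "h")])
# 4 4 1
```

In this hypothetical trace, {bc} occurs together with {fg} in four passes and together with {fh} in only one, yet a process that considers only the individual edge weights (5 for "fh" versus 4 for "fg") would still place "h" after "f" in the chain beginning with {bc}.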