This application is related to our copending patent application entitled METHOD AND APPARATUS FOR ANALYZING CONTROL FLOW, filed of even date herewith and assigned to the assignee hereof.
This application is related to our copending patent application entitled METHOD AND APPARATUS FOR SEQUENCING COMPUTER INSTRUCTION EXECUTION IN A DATA PROCESSING SYSTEM, filed of even date herewith and assigned to the assignee hereof.
This invention generally relates to compiler and profiler technology for microprocessors and specifically relates to sequencing instructions for optimal data processor execution.
FIG. 1 illustrates a control flow graph for a computer program. In the control flow graph of FIG. 1, there are ten computer instructions or ten segments of code (referred to also as basic blocks of computer code) represented as nodes "a"-"j" in a directed graph. The ten nodes of FIG. 1 are labeled "a" through "j" and correspond to ten different basic blocks of computer code. In the control flow graph of FIG. 1, the computer instruction(s) in basic block "a" are executed first in time in the execution path of the computer program. Since basic block "a" is the endpoint of a feedback path or looping path from basic block "j" back to basic block "a", basic block "a" may contain, for example, a while loop instruction, a for loop instruction, a repeat instruction, a do loop, or a like looping structure, or basic block "j" can contain a branch instruction which has a destination address of the beginning of basic block "a".
After the basic block "a" is executed, sequential execution results in basic block "b" being executed following every execution of basic block "a" as illustrated in the control flow graph of FIG. 1. Execution flow will split in one of two directions after basic block "b" is executed depending upon a software condition. Therefore, basic block "b" contains either an if-then-else instruction or a like flow construct which involves branching down one of two distinct execution flow paths. If one condition or set of constraints is detected in the basic block "b", basic block "c" is executed. If another condition or set of constraints is determined to exist in basic block "b", then the basic block "d" is executed. In either case, one of "c" or "d" is executed at a time after "b" is executed as illustrated in FIG. 1. Both basic blocks "c" and "d" converge back to basic block "e" in a manner similar to an if-then-else flow control. In other words, after executing either "c" or "d", the code contained in basic block "e" will be executed.
From basic block "e" or node "e" of the directed graph of FIG. 1, execution flow continues so that basic block "f" is executed. The basic blocks "f", "g", "h" and "i" of FIG. 1 are of a construct very similar to basic blocks "b", "c", "d" and "e" discussed above, and therefore these two sets of basic blocks are executed in a similar or identical execution flow manner. Once the basic block "j", which is a loop termination point as discussed above, determines that no more loops need to be made through the nodes of FIG. 1, then the execution flow of the computer program exits the construct of FIG. 1 via the exit path from node "j".
The execution flow of the computer program illustrated in FIG. 1 can be analyzed to determine efficient rearrangement of computer basic blocks in memory so that software executes in an efficient manner. In order to do so, FIG. 2 illustrates that an execution tracing routine is performed to collect data from the execution of the computer program graphically illustrated in FIG. 1. This trace process creates a trace data file in memory. The trace data file illustrated in FIG. 2 records the time-sequential execution flow of the computer program graphically illustrated as basic blocks of code in FIG. 1. The trace data stores block execution order in a time sequential manner. Spaces (" ") are used in FIG. 2 to separate different executed passes of the loop a-j from each other.
Therefore, in order to create the trace file in FIG. 2, an empty trace data file is first created and execution of the basic blocks a-j begins. The time sequential order of the basic blocks executed in a first loop through basic blocks "a" through "j" is {abcefgij}. Therefore, in a first loop, recorded in a left-hand side of FIG. 2, the {b-c} path is taken in FIG. 1 and the {f-g} path is taken in FIG. 1, resulting in the blocks {abcefgij} being executed in a time sequential order. The basic block "j" directs the execution flow back to basic block "a", and the second loop sequence in FIG. 2 is {abcefgij}. Therefore, the same instruction sequence {abcefgij} is executed twice in a row, one right after another, in a time sequential manner via the loop from block "j" to block "a". This time sequential execution flow is continually recorded for a period of time and stored in the trace data file for further analysis at a subsequent time.
A computer is then able to graphically model the computer software as illustrated in FIG. 3 by analyzing the trace data of FIG. 2. It is important to note that when first executing the computer program containing blocks a-j to generate the trace data file in FIG. 2, the computer has no idea of the execution flow of the software as illustrated in FIG. 1. The trace file of FIG. 2 is analyzed to obtain the execution flow structure of FIG. 3 which also contains the same information contained in FIG. 1.
The directed graph of FIG. 3 is constructed by scanning the trace data in FIG. 2 from left to right and analyzing pairs of basic blocks that are adjacent to each other in time. Initially, no data structure is present when the algorithm begins (FIG. 3 is blank in a starting state). The algorithm then takes the first pair of basic blocks in FIG. 2, which is the pair "ab". In FIG. 3, a node "a" is created, a node "b" is created and an edge "ab" from node "a" to node "b" is created with a weight or count of 1. In a second access to the data of FIG. 2, the pair "bc" is next analyzed. Since the node "b" has been previously created in FIG. 3, the computer simply creates a node "c" and an edge "bc" from "b" to "c" with a weight of 1. This interconnection and/or creation of nodes and edges and the incrementing of weights of the edges between nodes as further pairs of nodes are encountered continues for the entire data segment illustrated in FIG. 2 to result in the completed data structure illustrated in FIG. 3. As illustrated in FIG. 3, the basic block "b" follows basic block "a" nine times in FIG. 2 whereas basic block "c" follows basic block "b" only five times in FIG. 2, as evident from the weights on the edge "ab" connecting nodes "a" and "b" and the edge "bc" connecting nodes "b" and "c" illustrated in FIG. 3.
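The pairwise scan described above may be sketched as follows. This is an illustrative sketch only and not part of the prior art method as filed. The function name build_edge_weights and the nine-pass trace string are assumptions; FIG. 2 is not reproduced in this text, so the trace shown is a hypothetical one chosen only to be consistent with the first two passes and the edge weights stated above (nine for edge "ab", five for edge "bc").

```python
from collections import Counter

def build_edge_weights(trace):
    """Count how often each basic block immediately follows another.

    `trace` is a string of single-letter block names; spaces only
    separate loop passes and carry no timing information.
    """
    blocks = trace.replace(" ", "")
    weights = Counter()
    # Scan left to right over time-adjacent pairs, as described for FIG. 3.
    for src, dst in zip(blocks, blocks[1:]):
        weights[(src, dst)] += 1  # creates the edge on first sight, else increments
    return weights

# Hypothetical trace (assumption): nine passes through the loop a-j.
TRACE = " ".join(["abcefgij"] * 4 + ["abcefhij"] + ["abdefhij"] * 4)

weights = build_edge_weights(TRACE)
print(weights[("a", "b")], weights[("b", "c")], weights[("b", "d")])
```

A counter keyed by (source, destination) pairs stands in for the nodes and weighted edges of FIG. 3; a node exists implicitly once it appears in any key.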
Once the data structure of FIG. 3 is created from the trace file of FIG. 2, a method illustrated in the flowchart of FIG. 4 can be performed to analyze the data structure of FIG. 3 to determine an efficient manner of ordering basic blocks in memory so that cache performance may be improved and pipeline flushing may be minimized resulting in improved processor performance. The efficient output order of basic blocks (the output file resulting from the method of FIG. 4) is illustrated in FIG. 5. In order to discuss FIG. 4 of the prior art restructuring method, it is important to refer to FIG. 5, which is the output of the method of FIG. 4.
Initially, the method of FIG. 4 begins via an initialization step 100 which prepares for the formation of a sequence chain or reordered basic blocks of instructions. In step 102, the not-yet-selected node in FIG. 3 with the highest exiting path/edge value is selected. In FIG. 3, the nodes "a", "e", and "i" are tied in numerical value for the highest path value, where this path/edge value is 9 in FIG. 3. Nine is the greatest edge value in FIG. 3. In this case of a tie, the first node in the execution flow, which is "a" in this case, is selected arbitrarily. The basic block "a" is then placed in a restructured computer file as illustrated in step 1 of FIG. 5. An execution chain (a sequential list of basic block(s)) is then created with the beginning of the chain being set to the node determined in step 102 (which in this case is node "a"). Therefore, step 106 sets the beginning of the chain to the node "a" in FIG. 3. Step 108 is used to determine which nodes a-j are reachable from node "a" in FIG. 3. From node "a" only one node is reachable and that node is node "b" in FIG. 3. Therefore, step 108 (in a first loop) produces a single node which is node "b".
Node "b" is then analyzed in step 110, and since the set of nodes determined in the latest execution of step 108 contains only the node "b", the node "b" is selected in the step 110 as being the node with the highest path value. In step 110, node "b" is then inserted into the restructured computer file of FIG. 5 as illustrated in step 2 of FIG. 5. The restructured computer file now contains the instruction chain or sequence "ab".
Step 108 then determines that nodes "c" and "d" are reachable from node "b" as illustrated in FIG. 3. Step 110 then analyzes "c" and "d" and determines that node "c" has a path value of 5 and node "d" has a path value of 4. Therefore, steps 108 and 110 in FIG. 4 insert the basic block "c" into the restructured data file of FIG. 5 after block "b", and step 3 of FIG. 5 illustrates that node "d" is ignored and is not inserted into the chain of FIG. 5 at this point in time since node "d" did not have the highest weight value. Continuing from node "c", basic block "e" (represented by node "e" in FIG. 3) is inserted in a step 4 of FIG. 5 using the algorithm of FIG. 4. Node "f" is then inserted in a step 5 of FIG. 5 using the process outlined in FIG. 4. Between nodes "h" and "g" in FIG. 3, steps 108-110 will determine that node "h" has a greater path value from node "f" than node "g" and insert basic block "h" after block "f" in a step 6 of FIG. 5. Code represented by node "i" is then inserted via step 7 of FIG. 5, and "j" is inserted via a step 8 in FIG. 5. Once node "j" is inserted in step 8, there are no more unselected nodes which can be reached from node "j" in FIG. 3 since node "a" has already been analyzed and inserted into FIG. 5 in step 1 of FIG. 5. Therefore, step 108 sends the control of FIG. 4 back to step 102 and step 102 finds a new unselected node which has the highest weight value. In summary, by step 8 of a left portion of FIG. 5, the chain of blocks {abcefhij} is now fully sequentially inserted into the restructured computer file as illustrated graphically via a region 90 illustrated in a left portion of FIG. 5.
Returning to steps 102-106, the only remaining unselected nodes in FIG. 3 are "d" and "g", which have equal edge weight values, and therefore, by default, node "d", which is the earlier node, is chosen via the process of FIG. 4. Node "d" is inserted via step 9 in FIG. 5. Since the node "e" is reachable from node "d" in FIG. 3 but has already been previously selected (see step 4 of FIG. 5) and placed into the file of FIG. 5, step 108 determines that there is nothing more to process from node "d" and step 102 is once again executed. The only node remaining is node "g", and the method of FIG. 4 determines that node "g" should be inserted in a step 10 of FIG. 5.
Therefore, when a compiler is ordering the basic blocks of the program flow illustrated in FIG. 3, the final ordering of instructions or basic blocks in memory is performed as illustrated in step 10 of FIG. 5 with the goal of attempting to improve processor performance.
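The chaining procedure walked through above may be sketched as follows. This is an illustrative reading of steps 102-110 and not a definitive implementation of FIG. 4. The function name greedy_layout is an assumption, the tie-breaking rule (alphabetical order standing in for "the earlier node") is an assumption, and the edge weights not quoted in the text (for example "ce", "de", "gi", "hi" and "ja") are hypothetical values chosen to be consistent with the weights that are quoted.

```python
# Edge weights of the directed graph. Values for "ab", "bc", "bd" follow
# the text; the remaining values are assumptions consistent with it.
EDGES = {
    ("a", "b"): 9, ("b", "c"): 5, ("b", "d"): 4, ("c", "e"): 5,
    ("d", "e"): 4, ("e", "f"): 9, ("f", "g"): 4, ("f", "h"): 5,
    ("g", "i"): 4, ("h", "i"): 5, ("i", "j"): 9, ("j", "a"): 8,
}

def greedy_layout(edges):
    """Order blocks by repeatedly following the heaviest adjacent edge."""
    nodes = sorted({n for edge in edges for n in edge})

    def best_exit(node):
        # Highest weight among edges leaving `node` (0 if it has none).
        return max((w for (s, _), w in edges.items() if s == node), default=0)

    placed = []
    while len(placed) < len(nodes):
        # Step 102: pick the unselected node with the highest exiting edge.
        # max() keeps the first maximal item, so ties go to the earlier name.
        remaining = [n for n in nodes if n not in placed]
        current = max(remaining, key=best_exit)
        placed.append(current)  # step 106: start a new chain here
        while True:
            # Step 108: unselected nodes reachable from the chain's tail.
            succ = [(w, d) for (s, d), w in edges.items()
                    if s == current and d not in placed]
            if not succ:
                break  # chain is finished; return to step 102
            # Step 110: extend the chain along the heaviest edge.
            _, current = max(succ, key=lambda pair: pair[0])
            placed.append(current)
    return "".join(placed)

layout = greedy_layout(EDGES)
print(layout)  # abcefhijdg
```

Under these assumptions the sketch reproduces the ordering of step 10 of FIG. 5: the chain {abcefhij} followed by "d" and then "g".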
However, the prior art method illustrated in FIGS. 1-5 is flawed. By looking at FIG. 2, one can easily determine that if the path {bc} is taken, it is most likely that the path {fg} is also taken in conjunction with path {bc}. One can also determine that if the path {bd} is taken, then the path {fh} is also more likely to be taken. In other words, the correlation between paths {bc} and {fg} is high, and the correlation between paths {bd} and {fh} is high. Therefore, the most efficient organization of basic blocks in step 10 of FIG. 5 would be to couple the path {bc} with {fg} in some serial order or couple the path {bd} with {fh} in some serial order. However, the algorithm illustrated via prior art FIGS. 4 and 5 results in the path {bc} being coupled and serially positioned with the path {fh} (see this illustrated graphically in the right portion of FIG. 5). This choosing of the wrong pairs, to the detriment of CPU execution performance, results because the prior art algorithm of FIG. 4 does not look ahead to more distant nodes and paths in the data structure of FIG. 3 but only looks at directly adjacent pairs of basic blocks or nodes in FIG. 3. The result is that the prior art of FIGS. 4 and 5 performs basic block restructuring in a limited fashion which obtains limited performance benefit. Therefore, it is more advantageous to design a basic block restructuring process which identifies these correlations between more distant paths and performs improved sequencing of instructions to result in fewer cache misses, fewer external memory accesses, fewer page misses, fewer pipeline flushes and/or stalls, and increased program execution speed.
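The correlation between distant paths discussed above may be made concrete by counting, per loop pass, which branch was taken at block "b" together with which branch was taken at block "f". The sketch below is illustrative only. The function name joint_branch_counts is an assumption, and because FIG. 2 is not reproduced in this text, the trace is a hypothetical nine-pass trace chosen to agree with the stated edge weights and with the correlations described.

```python
from collections import Counter

# Hypothetical trace (assumption): nine loop passes separated by spaces.
TRACE = " ".join(["abcefgij"] * 4 + ["abcefhij"] + ["abdefhij"] * 4)

def joint_branch_counts(trace):
    """Count how often each b-branch occurs together with each f-branch."""
    counts = Counter()
    for loop_pass in trace.split():
        first = "bc" if "bc" in loop_pass else "bd"    # branch taken at block b
        second = "fg" if "fg" in loop_pass else "fh"   # branch taken at block f
        counts[(first, second)] += 1
    return counts

counts = joint_branch_counts(TRACE)
for pair in [("bc", "fg"), ("bc", "fh"), ("bd", "fg"), ("bd", "fh")]:
    print(pair, counts[pair])
```

With this assumed trace, {bc} occurs together with {fg} in four passes and {bd} with {fh} in four passes, whereas the pairing {bc} with {fh} produced by the adjacent-pair method occurs in only one pass of nine. Pairwise edge weights alone cannot reveal this, since they discard which earlier branch preceded each later branch.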