The present invention relates generally to decomposing PBIs from i-streams in a compiled program and to converting the i-streams to BLI- or non-BLI-streams before prefetching and fetching the BLI- or non-BLI-streams for look-ahead branch prediction with the PBIs in a sequential and/or parallel manner. More specifically, it relates to hiding the latency of branch prediction and to increasing instruction fetch bandwidth, which is the number of instructions fetched per BL microprocessor clock cycle. The invention also relates to identifying the PBIs, NPBIs, and NBIs. In particular, a PBI represents a BLI-stream comprising the PBI at the beginning followed by a single or plurality of NPBIs and/or NBIs. The PBI contains information for predicting the branch operation, for obtaining the branch target location of the PBI, and other information if necessary.
The invented branch look-ahead (BL) system includes a branch look-ahead compilation (BLC) software system that decomposes i-streams, which generally contain pairs of prediction-required branches and branch target instructions, and vice versa. In addition, the BLC software system creates a PBI to represent a BLI-stream as a single prediction-required branch instruction for predicting the next path if necessary. In particular, the BLI-stream comprises the branch instructions together with non-branch instructions in a loop or a subroutine. The BLC software system relocates any PBI at the end of the i-stream to the first location of the BLI-stream so that the PBI is fetched before, or concurrently with, any other instructions of the BLI-stream. Therefore, a PBI is fetched for predicting a taken or not-taken branch operation and for obtaining the branch target address needed to take a branch.
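The relocation step described above can be illustrated with a minimal sketch. It assumes a simplified instruction record in which a boolean flag marks a prediction-required branch; the names `Instr`, `needs_prediction`, and `to_bli_stream`, and the assembly mnemonics, are hypothetical and are not part of the disclosed BLC software system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instr:
    text: str                       # assembly text (illustrative only)
    needs_prediction: bool = False  # assumption: True marks a prediction-required branch


def to_bli_stream(i_stream: List[Instr]) -> List[Instr]:
    """Move a trailing prediction-required branch to the first location."""
    if i_stream and i_stream[-1].needs_prediction:
        # The branch leads the stream; the other instructions keep their order.
        return [i_stream[-1]] + i_stream[:-1]
    # No trailing prediction-required branch: the stream is left unchanged.
    return list(i_stream)


loop = [
    Instr("add r1, r1, r2"),
    Instr("sub r3, r3, 1"),
    Instr("bne r3, r0, loop", needs_prediction=True),
]
bli = to_bli_stream(loop)
```

In this sketch the branch becomes the first element of `bli`, while the two non-branch instructions remain in their original relative order.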
The invented BLC software system generates BLI-streams comprising PBIs and associated NPBIs and/or NBIs from the compiled program, such as an assembly program. The BLI-streams are sequentially and/or concurrently prefetched and/or fetched through separate paths of the branch look-ahead instruction memory (BLIM) systems if necessary. In general, a PBI initiates access to a single or plurality of NPBIs and/or NBIs. Thus, the NPBIs and/or NBIs are prefetched or fetched only after the PBI is prefetched or fetched. This results in look-ahead branch prediction and sequential and/or concurrent instruction prefetching and fetching while hiding taken-branch latency.
The BLC software system composes a PBI comprising an associated opcode that identifies it as a prediction-required branch, such as a conditional branch, and other information, including the last instruction of the associated BLI-stream, the branch target location, and/or other information, for prefetching and fetching the next BLI- or non-BLI-streams.
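One possible way to picture the composition of a PBI is the following sketch. It assumes a word-addressed instruction memory and represents the last instruction of the associated BLI-stream by its address; the field names, the `compose_pbi` and `next_stream_address` helpers, and the `bne` mnemonic are illustrative assumptions, not the disclosed encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PBI:
    opcode: str     # identifies a prediction-required branch (hypothetical mnemonic)
    last_addr: int  # assumption: address of the last instruction of the BLI-stream
    target: int     # branch target location


def compose_pbi(opcode: str, body_addrs: List[int], target: int) -> PBI:
    """Build a PBI from the addresses of the instructions it represents."""
    if not body_addrs:
        raise ValueError("a BLI-stream needs at least one NPBI or NBI")
    return PBI(opcode=opcode, last_addr=body_addrs[-1], target=target)


def next_stream_address(pbi: PBI, taken: bool) -> int:
    """Start of the next stream: the target if taken, else the fall-through."""
    # Assumption: the fall-through stream starts one word after the last instruction.
    return pbi.target if taken else pbi.last_addr + 1


pbi = compose_pbi("bne", body_addrs=[0x100, 0x101, 0x102], target=0x100)
```

Under these assumptions, the two fields are enough to locate the start of the next stream on either branch outcome.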
The BL system apparatus and method are designed for enhancing the bandwidth of fetching the BLI- or non-BLI-streams, hiding the latencies of BLI cache access, hiding branch prediction latencies, and improving the overall performance of the BL microprocessors. The invented BL system uses a branch look-ahead instruction prefetching (BLIP) system and a branch look-ahead instruction fetching (BLIF) system integrated with a single or plurality of concurrently accessible hierarchical BLIM systems.
The invented BLIP/BLIF systems prefetch and/or fetch a single or plurality of instructions in BLI- or non-BLI-streams concurrently for branch prediction and/or instruction decode by the BL microprocessors, while delivering a single or plurality of BLI- or non-BLI-streams in their compatible fetching order to the BL microprocessors for instruction decode and execution after predicting each PBI. The BLIP/BLIF systems prefetch and fetch instructions in BLI- or non-BLI-streams from the single or plurality of concurrently accessible main BLI memories via a single or plurality of levels of concurrently accessible BLI caches and deliver the instructions of the BLI- or non-BLI-streams to the BL microprocessors.
The invented BLIP/BLIF systems are capable of branch look-ahead prefetching of the single or plurality of instructions of BLI- or non-BLI-streams from the locations of the main BLI memories via the single or plurality of levels of BLI caches by obtaining, from the instructions of the BLI- or non-BLI-streams, a single or plurality of addresses of a single or plurality of locations in the main BLI memories and/or BLI caches. The BLIP system prefetches the next prospective BLI- or non-BLI-streams from both the taken and not-taken branch paths and continuously prefetches instructions of the BLI- or non-BLI-streams from a single or plurality of next paths while the BLIF system fetches the instructions of the BLI- or non-BLI-streams to the BL microprocessors.
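The both-path prefetching described above can be sketched as a bounded walk over the successors of each stream. The sketch assumes that each stream is identified by its start address and records the start addresses of its taken and not-taken successors; the `Stream` record, the `prefetch_candidates` function, and the addresses are hypothetical and model only the selection of streams to prefetch, not the memory hierarchy.

```python
from typing import Dict, List, NamedTuple, Optional


class Stream(NamedTuple):
    taken: Optional[int]      # start of the taken-path stream (None if no branch)
    not_taken: Optional[int]  # start of the not-taken (fall-through) stream


def prefetch_candidates(streams: Dict[int, Stream], start: int, depth: int) -> List[int]:
    """List the stream addresses to prefetch, following both paths up to depth."""
    seen: List[int] = []
    frontier = [start]
    for _ in range(depth):
        next_frontier: List[int] = []
        for addr in frontier:
            stream = streams.get(addr)
            if stream is None:
                continue  # stream not known yet; nothing to follow
            for succ in (stream.taken, stream.not_taken):
                if succ is not None and succ not in seen:
                    seen.append(succ)
                    next_frontier.append(succ)
        frontier = next_frontier
    return seen


streams = {
    0x00: Stream(taken=0x40, not_taken=0x10),
    0x10: Stream(taken=0x80, not_taken=0x20),
    0x40: Stream(taken=None, not_taken=0x50),
}
one_level = prefetch_candidates(streams, start=0x00, depth=1)   # [0x40, 0x10]
two_levels = prefetch_candidates(streams, start=0x00, depth=2)  # adds 0x50, 0x80, 0x20
```

The `depth` parameter is an assumption of the sketch; it stands in for however many next paths the prefetching system follows ahead of the fetching system.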
The BL system apparatus and method for the BL microprocessors permit hiding a number of taken-branch prediction latencies while providing compatible instruction prefetching and fetching. In addition, the BL system apparatus and method for the BL microprocessors allow fragmenting an i-stream into a single or plurality of fragmented instructions in order to prefetch and fetch multiple instructions of the same i-stream quickly and in parallel while continuously providing code compatibility. Alternatively, the BLC software system directly produces the BLI- or non-BLI-streams from high-level language programs.
The BL system apparatus and method effectively utilize available instruction caches in terms of cache size, power consumption, and operational speed. The invention also prefetches, in a look-ahead manner, the PBIs, NPBIs, and NBIs on both of the prospective paths in the program flow, concurrently or sequentially, before fetching and branch-predicting the PBIs and fetching the NPBIs and NBIs concurrently or sequentially. Furthermore, the invention fetches the PBIs, NPBIs, and NBIs in an accurate manner by fetching them from the BLI caches. Since the PBIs do not change any operation results, the NPBIs and NBIs provide compatibility if they are fetched and executed in the same or a compatible order. Therefore, moving the PBIs in the program from the last locations of the i-streams to the first locations still maintains the important information regarding the order of the NPBIs and NBIs. The PBIs, however, are fetched to a branch predictor a single or plurality of cycles in advance for predicting which i-stream to fetch next.
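The ordering argument above can be checked with a small sketch: if the PBIs are set aside, the remaining instructions of the original i-stream and of the reordered stream should appear in the same order. The sketch assumes instructions are plain strings and that a caller-supplied predicate identifies PBIs; `non_pbi_order`, `is_pbi`, and the `bne` prefix test are hypothetical illustrations only.

```python
from typing import Callable, List


def non_pbi_order(stream: List[str], is_pbi: Callable[[str], bool]) -> List[str]:
    """Return the instructions that are not PBIs, in their stream order."""
    return [instr for instr in stream if not is_pbi(instr)]


def is_pbi(instr: str) -> bool:
    # Assumption: in this toy example only 'bne' is a prediction-required branch.
    return instr.startswith("bne")


original = ["add r1, r1, r2", "sub r3, r3, 1", "bne r3, r0, loop"]
reordered = ["bne r3, r0, loop", "add r1, r1, r2", "sub r3, r3, 1"]

same_order = non_pbi_order(original, is_pbi) == non_pbi_order(reordered, is_pbi)
```

Here `same_order` evaluates to `True`, which mirrors the statement that relocating a PBI to the first location leaves the order of the other instructions intact.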
Through this invention, developers can decompose their own compatible and ciphered instructions into PBIs, NPBIs, and NBIs and prefetch and fetch them sequentially and/or concurrently from the main BLI memories via the levels of the BLI caches. More specifically, a single or plurality of branch prediction results are obtained by look-ahead prefetching and/or fetching of the next PBIs and the associated NPBIs and NBIs to a single or plurality of the BL microprocessors, which predict branches in advance and dynamically decode and execute the instructions in a compatible order.