The present invention generally relates to transforming each basic block in a program compiled by prior arts into a pair of an LI and a single or plurality of CIs during compilation with the BBC. However, some LIs are not required to have any associated CI. The invention also relates to identifying the various LIs according to the types of the basic blocks, including (1) basic blocks ending with conditional branches, (2) basic blocks of subroutines, (3) basic blocks ending with unconditional branches, (4) instruction segments without any flow control, where an instruction segment is a sequence of instructions, and (5) any other single or plurality of consecutive instructions in the program.
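The five basic-block categories above can be illustrated with a small sketch. The mnemonics, category sets, and return strings below are hypothetical examples for illustration only, not part of the invention's encoding:

```python
# Hypothetical sketch: classify a basic block into one of the categories
# listed above by inspecting its last instruction. Mnemonic sets are
# illustrative (RISC-style names), not prescribed by the invention.
CONDITIONAL = {"beq", "bne", "blt", "bge"}     # conditional branches
UNCONDITIONAL = {"jmp", "j"}                   # unconditional branches
SUBROUTINE = {"call", "jal", "ret"}            # subroutine entry/exit

def classify_block(opcodes):
    """opcodes: list of mnemonics forming one basic block."""
    last = opcodes[-1]
    if last in CONDITIONAL:
        return "conditional-branch block"
    if last in SUBROUTINE:
        return "subroutine block"
    if last in UNCONDITIONAL:
        return "unconditional-branch block"
    return "instruction segment"               # no flow control at all
```

A block ending in `beq` would thus be transformed to a conditional-branch LI, while a straight-line run of arithmetic instructions becomes an instruction-segment LI.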
The invention relates to prefetching and fetching a single or plurality of the LIs and the associated CIs in a different order from the order of the instructions in the basic block found in prior arts. Thereby, the invention allows performing lookahead branch prediction by prefetching and fetching the LIs, in a sequential or parallel manner, before the associated CIs. More specifically, the lookahead branch prediction relates to hiding the branch prediction latency and reducing the instruction fetch cycles.
The invented CAL memory system includes the BBC to transform the basic blocks in a program found in prior arts. In addition, the BBC creates a pair of an LI and the associated single or plurality of CIs from a basic block. More specifically, the pair of the LI and the associated CIs is created from a loop or a subroutine. The pair of the LI and the associated CIs permits prefetching and fetching in a lookahead manner, in which the LI of a conditional branch at the end of the basic block is prefetched and fetched prior to the associated CIs transformed from the remaining instructions located before the last instruction in the same basic block. A single or plurality of LIs and the associated CIs are allocated to the LI and CI main memories for sequential and/or concurrent prefetching and/or fetching via separate paths of the LI and CI main memories and caches.
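The LI/CI pairing described above can be sketched as follows. The field names (`opcode`, `num_cis`) are hypothetical placeholders for the LI encoding, which the invention defines separately:

```python
# Hypothetical sketch of the BBC transformation: the branch instruction at
# the end of a basic block becomes the LI, and the remaining instructions
# become the associated CIs. Field names are assumptions for illustration.
def transform_block(block):
    """block: list of instructions; the last one is assumed to be the
    flow-control instruction terminating the basic block."""
    li = {"opcode": block[-1],         # the branch becomes the LI
          "num_cis": len(block) - 1}   # lets the fetcher know how many CIs follow
    cis = block[:-1]                   # non-branch body instructions become CIs
    return li, cis
```

Because the LI is stored separately from its CIs, it can be fetched first, enabling the lookahead branch prediction described above.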
The BBC creates an LI comprising an opcode to identify the type of the LI, the encoded branch target address, and compatibility information, including an identifier of the last CI of the associated CIs, a loop identifier, and/or other information for prefetching and fetching the next LI and the associated CIs.
The invented CAL memory system apparatus is designed for reducing the delivery cycles of the necessary LIs and the associated CIs by hiding the access latencies of the LI caches and the branch prediction latencies of the LIs. Thereby, the overall performance of the processors is enhanced. The invented CAL memory system is designed for performing lookahead prefetching and fetching with the associated prefetchers and fetchers.
The invented CAL memory system apparatus prefetches and/or fetches a single or plurality of LIs and/or CIs concurrently for branch prediction of the LIs and/or for decoding of the CIs. More specifically, the CAL memory system apparatus reorders the fetched LIs and associated CIs before decoding them and forwarding the decoding results of the reordered CIs and LIs to the backend processors, in order to provide compatibility of the transformed LIs and CIs.
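Because each LI is fetched before its associated CIs, the fetched stream must be reordered back to the original program order before decoding. A minimal sketch, assuming each LI carries a count of its associated CIs (an assumed field, as in the earlier illustration):

```python
# Hypothetical sketch: restore program order from a lookahead-fetched
# stream in which each LI precedes its CIs. Each LI is a dict with an
# assumed "num_cis" field; CIs are plain instruction entries.
def reorder_for_decode(fetched):
    """fetched: e.g. [li1, ci1a, ci1b, li2, ci2a, ...]; returns the stream
    in original program order (CIs first, their LI's branch last)."""
    out = []
    i = 0
    while i < len(fetched):
        li = fetched[i]
        n = li["num_cis"]
        out.extend(fetched[i + 1:i + 1 + n])  # CIs move back in front
        out.append(li["opcode"])              # branch returns to block end
        i += 1 + n
    return out
```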
The invented CAL memory system apparatus prefetches and fetches LIs and/or CIs from the single or plurality of concurrently accessible LI and/or CI main memories via a single or plurality of levels of concurrently accessible LI and CI caches to the frontend processors. More specifically, the CAL memory system apparatus is capable of prefetching the single or plurality of LIs and/or CIs in a lookahead manner from the locations of the LI main memories to the locations of the single or plurality of levels of the LI caches, or from the locations of the upper-level LI caches to the locations of the lower-level LI caches. The CAL memory system apparatus produces a single or plurality of LI addresses from the accessed LIs if needed and uses them for accessing the LIs from a single or plurality of addressed locations in the LI main memories and/or the LI caches. In addition, the next prospective LIs and the associated CIs from both the taken-branch and fall-through paths are prefetched while the LI and CI prefetch operations are repeated a limited number of times. More specifically, the LI and CI prefetch operations are resumed if any LI or CI cache miss occurs, where a cache miss is a state in which the data requested is not found in the cache memory.
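The dual-path, limited-repetition prefetch described above can be sketched as a bounded traversal of both prospective successor LIs. The data layout (a map from an LI address to its taken-branch and fall-through successor addresses) and the depth bound are illustrative assumptions:

```python
# Hypothetical sketch: prefetch LIs along both the taken-branch and
# fall-through paths, repeating the operation a limited number of times.
def lookahead_prefetch(li_memory, start, depth, cache):
    """li_memory maps an LI address to (taken_target, fall_through_addr).
    Copies up to `depth` levels of prospective LIs into `cache`."""
    frontier = [start]
    for _ in range(depth):                  # repeat a limited number of times
        next_frontier = []
        for addr in frontier:
            if addr in cache or addr not in li_memory:
                continue                    # already prefetched, or no such LI
            cache[addr] = li_memory[addr]   # bring the LI into the LI cache
            taken, fall = li_memory[addr]
            next_frontier += [taken, fall]  # follow both prospective paths
        frontier = next_frontier
    return cache
```

On a cache miss, the same routine would be resumed from the missing address, consistent with the resumption behavior described above.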
The CAL memory system apparatus provides functional compatibility while enhancing the performance of the processors by performing lookahead branch prediction and lookahead prefetching/fetching of the LIs and CIs. More specifically, the multi-cycle taken-branch prediction latencies are overlapped with the lengthy CI fetch cycles. Thus, the branch prediction latencies are hidden. The LI and CI cache miss latencies are also overlapped by launching prefetch and fetch operations early and by the short access times of the LI/CI caches. Furthermore, the CAL memory system apparatus fragments a long basic block into a plurality of short instruction streams to prefetch and fetch LIs and CIs in order to increase parallel accesses and to decrease the overall prefetch and fetch cycles. Alternatively, the BBC directly produces the LIs and the associated CIs from a high-level language program found in prior arts.
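The latency-hiding claim above amounts to simple cycle accounting: as long as fetching a block's CIs takes at least as many cycles as predicting its LI's taken branch, the prediction latency adds nothing to the critical path. The cycle counts below are illustrative numbers, not figures from the invention:

```python
# Illustrative cycle accounting for hiding the branch prediction latency
# under the CI fetch cycles; the specific numbers are assumptions.
predict_latency = 3   # cycles to predict the LI's taken branch
ci_fetch_cycles = 5   # cycles to fetch the block's associated CIs

# Prior-art sequential cost: fetch the block, then predict its branch.
sequential = ci_fetch_cycles + predict_latency
# Lookahead cost: the LI is fetched first, so its branch prediction
# proceeds concurrently with the CI fetch of the same block.
lookahead = max(ci_fetch_cycles, predict_latency)
hidden = sequential - lookahead   # prediction cycles removed from the path
```

With these assumed numbers, all three prediction cycles are hidden; the saving per block is `min(predict_latency, ci_fetch_cycles)` whenever the two operations fully overlap.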
The CAL memory system apparatus adaptively utilizes the available instruction caches in terms of cache access time, size, and energy consumption. In addition, the invention is able to fetch the various types of LIs and the associated CIs in an accurate manner by fetching only the CIs, unlike prior arts in which additional instructions are often fetched because of misalignment between the number of instructions found in the basic block and the number of instructions stored in the cache block or accessed via the instruction fetch bus width.
Through this invention, one can decompose their own compatible and ciphered instructions into various LIs and the associated CIs and prefetch and fetch the LIs and CIs sequentially and/or in parallel from the LI and CI main memories via the levels of the LI and CI caches.