This application relates generally to data processing systems, and more specifically, to instruction fetching in data processing systems.
As data processing systems are becoming more widely used for a variety of applications, both speed and cost are becoming greater concerns. The goal in most designs is to reduce latency in order to improve speed and performance. For example, in many data processing systems, a central processing unit (CPU) increases instruction fetching efficiency by incorporating a number of instruction buffers and a wider data bus to memory. As the width of these instruction buffers and data buses increases, the bandwidth of data transfers increases, thus allowing for more efficient CPU pipeline utilization. For example, a CPU may utilize a 32-bit bus which allows for 32-bit accesses. Therefore, for a processor having a 16-bit instruction length, two instructions may be accessed each cycle from a device that supports 32-bit accesses. However, in such data processing systems, a need also exists to access instructions from devices, such as memories, supporting only 16-bit accesses. Devices having 16-bit access ports are generally cheaper and easier to manufacture than devices having 32-bit access ports, since smaller port sizes allow for smaller packages. In the case of these 16-bit devices, the increased bandwidth offered by the 32-bit data buses internal to the data processing system may present a performance penalty rather than a performance improvement when the CPU requests a pair of 16-bit instructions, since the 16-bit device is not capable of supplying a pair of instructions with the same latency as a single instruction.
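The width mismatch described above can be sketched as a simple latency calculation. The following is a minimal illustrative sketch, not part of the application: the function name, the one-cycle-per-access assumption, and the bit widths used are all hypothetical, chosen only to show why a request wider than the device's port takes proportionally more external accesses.

```python
# Hypothetical latency sketch (not from the application): the number of
# external accesses needed to satisfy one internal fetch request, assuming
# each external access completes in one bus cycle and a request wider than
# the device port is split into sequential narrower accesses.
def accesses_per_fetch(request_bits: int, port_bits: int) -> int:
    # Ceiling division: a 32-bit request over a 16-bit port needs 2 accesses.
    return -(-request_bits // port_bits)

# A 32-bit fetch from a device with a 32-bit port completes in one access...
assert accesses_per_fetch(32, 32) == 1
# ...but the same 32-bit fetch from a 16-bit device needs two accesses, so
# the CPU sees the pair of 16-bit instructions only after the second access.
assert accesses_per_fetch(32, 16) == 2
```

Under these assumptions, the wide internal bus doubles the external latency of each instruction pair fetched from the narrow device, which is the penalty the passage above describes.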
For example, FIG. 1 illustrates, in timing diagram form, the operation of a data processing system having a CPU utilizing 16-bit instructions coupled to a 32-bit internal data bus, a 16-bit external data bus, and a 16-bit external memory device. In this case, the CPU requests and fetches two instructions during each instruction access, since the internal data bus supports 32-bit fetches. In many instruction sequences, however, pipeline stalls occur because two instructions must be accessed before the fetched instructions are returned to the CPU. For example, as illustrated in FIG. 1, a pair of instructions located at addresses 0 and 2 is accessed during the first two cycles by placing address 0 on the internal address bus (INT ADDR) and requesting a 32-bit fetch. The requested address corresponds to an external 16-bit memory; thus two 16-bit fetches must be performed (to addresses 0 and 2, respectively) in order to satisfy the CPU's request. In the instruction stream illustrated in the table of FIG. 1, the first two instructions, stored at addresses 0 and 2, are a branch (BRANCH) and instruction 1 (INST 1), respectively. Once the branch and instruction 1 are placed on the external data bus (EXT DATA) by the device being accessed, they are provided to the CPU, as shown in FIG. 1, via the internal data bus (INT DATA). Therefore, the CPU does not begin to decode the branch instruction until both the branch and instruction 1 have been fetched from the accessed device.
While the branch is in the decode stage of the CPU pipeline, an access of the next two instructions has already been initiated, as illustrated by INT ADDR receiving address 4, indicating that an access of address 4 has begun. No data is returned to the CPU until both instructions 2 and 3 (INST 2 and INST 3), corresponding to addresses 4 and 6, respectively, are placed on the external data bus. However, before the access of addresses 4 and 6 completes, the branch is decoded and a target address is generated. Because the branch instruction causes a change of flow in the instruction execution stream, the prefetched instructions 2 and 3 (located at addresses 4 and 6, respectively) will be discarded and are not executed. Since the fetches of addresses 4 and 6 were already initiated, the CPU is stalled until both instructions 2 and 3 are fetched. Therefore, the fetch of instructions 2 and 3 introduces stall 2 into the CPU pipeline. Only after the access of instructions 2 and 3 completes can the access of the target instruction (TARGET) of the branch, located at address 10, begin. Furthermore, the target of the branch is not received until after both the target and target 2 instructions (at addresses 10 and 12) have been placed on the external data bus and returned to the CPU, since a pair of instructions was requested, thus introducing stall 4 into the CPU pipeline.
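The branch scenario above can be summarized with a small arithmetic sketch. This is a hypothetical illustration, not taken from the application: the function, constants, and one-cycle-per-access timing are assumptions, used only to show how a discarded prefetched pair delays the target fetch when instructions are always requested in pairs.

```python
# Hypothetical sketch (not from the application) of the FIG. 1 branch case:
# the CPU always requests 32-bit instruction pairs, so when the branch is
# decoded, any in-flight pair fetch at the sequential addresses must still
# complete before the fetch of the branch target can begin. Each 16-bit
# external access is assumed to take one bus cycle.
ACCESS_CYCLES = 1          # cycles per 16-bit external access (assumption)
HALFWORDS_PER_REQUEST = 2  # one 32-bit request = two 16-bit accesses

def cycles_until_target(pending_pair_fetches: int) -> int:
    # All pending (soon-to-be-discarded) pair fetches must finish first,
    # then the target instruction pair itself is fetched.
    pending = pending_pair_fetches * HALFWORDS_PER_REQUEST * ACCESS_CYCLES
    target = HALFWORDS_PER_REQUEST * ACCESS_CYCLES
    return pending + target

# With the INST 2 / INST 3 pair already in flight when the branch decodes,
# its two wasted bus cycles precede the target pair's two cycles: four
# cycles of latency instead of the two a clean target fetch would need.
assert cycles_until_target(1) == 4
assert cycles_until_target(0) == 2
```

The difference between the two cases corresponds to the wasted bus cycles behind stall 2 in FIG. 1; stall 4 arises separately because the target is held until its paired second instruction also arrives.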
The introduction of stalls 1 through 4 into the CPU pipeline results in increased latency and decreased performance of the data processing system. FIG. 1 illustrates one example of the latencies introduced into a data processing system; however, similar latencies arise in many data processing systems utilizing similar instruction fetches, especially when attempting to interface a data processing device with an external device having a smaller access port than the width of the data processing device's internal data bus. Therefore, a need exists for improved instruction fetching in order to reduce latency and achieve a more efficient data processing system.