Referring to FIG. 1, a typical computer system includes a microprocessor (10) having, among other things, a CPU (12), a memory controller (14), and an on-board cache memory (16). The microprocessor (10) is connected to an external cache memory (22) and a main memory (18), both of which hold data and program instructions to be executed by the microprocessor (10). Internally, execution of program instructions is carried out by the CPU (12). Data needed by the CPU (12) to carry out an instruction is fetched by the memory controller (14) and loaded into internal registers (20) of the CPU (12). A memory queue (not shown) maintains a list of outstanding memory requests. The memory controller (14) adds requests to the memory queue and also loads registers with values from the memory queue. Upon command from the CPU (12), the memory controller (14) searches for the data first in the on-board cache (16), then in the external cache memory (level 2 cache) (22), and finally in the slow main memory (18).
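By way of illustration only, the search order described above may be sketched as follows. The function name lookup and the use of dictionaries to stand in for the on-board cache (16), the external cache memory (22), and the main memory (18) are assumptions made for this sketch and do not appear in the figures.

```python
from typing import Dict, Optional, Tuple


def lookup(address: int,
           on_board_cache: Dict[int, int],
           external_cache: Dict[int, int],
           main_memory: Dict[int, int]) -> Tuple[str, Optional[int]]:
    """Search each memory level in order and report where the data was found."""
    # Levels are listed from fastest to slowest, matching the search order
    # described for the memory controller.
    levels = (
        ("on-board cache", on_board_cache),
        ("external cache", external_cache),
        ("main memory", main_memory),
    )
    for name, level in levels:
        if address in level:
            return name, level[address]
    # Hypothetical fallback: the address is not present at any level.
    return "not found", None
```

In this sketch a request that misses in both caches falls through to the main memory, which is the slowest of the three levels.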
Physically, different kinds of memory differ significantly in their performance characteristics. Such characteristics include: the time to read or write data at a particular location in memory; the total volume of information that can be stored; and the unit cost of storing a given piece of information. To optimize performance, a memory is generally organized into a hierarchy with the highest-performing and most expensive devices at the top, and with progressively lower-performing and less costly devices in succeeding layers. For example, cache memories, commonly Static Random Access Memory (SRAM), belong to the higher-performing group. In contrast, main memories, commonly Dynamic Random Access Memory (DRAM), belong to the lower-performing group.
A memory may be considered as a two-dimensional array including a number of memory cells. Each cell holds one bit of information and is identified uniquely by its row and column addresses. The addresses are derived through row and column decoders according to instructions. FIG. 2 shows an example of a cache memory configuration. When a CPU needs data, the memory controller looks for the data in the cache memory. The instructions are fed into inputs (120, 134) of a row decoder (122) and a column decoder (124), which derive the addresses for the data. After the data is found in the memory (132), all or a part of the data may be selected for a specific operation according to the instruction. If data is to be written at the addresses, a control unit (126) selects a write unit (128), which feeds the data through an input line (data_in) to write the data at the addresses. If data is to be read from the addresses, the control unit (126) selects a read unit (130), which reads the data out through an output line (data_out). Then, the data may be processed or transferred to the CPU through various elements in the microprocessor.
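By way of illustration only, the row and column addressing described above may be sketched as follows. The class name CellArray and the convention of splitting a flat address into a row and a column by integer division are assumptions made for this sketch; the actual decoders (122, 124) may derive the addresses differently.

```python
class CellArray:
    """A two-dimensional array of one-bit memory cells."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        self.cells = [[0] * cols for _ in range(rows)]

    def decode(self, address: int) -> tuple:
        """Derive (row, column) from a flat address (assumed convention)."""
        if not 0 <= address < self.rows * self.cols:
            raise IndexError("address out of range")
        return divmod(address, self.cols)

    def write(self, address: int, bit: int) -> None:
        """Store one bit at the decoded location (the data_in path)."""
        row, col = self.decode(address)
        self.cells[row][col] = bit & 1

    def read(self, address: int) -> int:
        """Return the bit at the decoded location (the data_out path)."""
        row, col = self.decode(address)
        return self.cells[row][col]
```

For example, in an array with 4 rows and 8 columns, address 10 decodes to row 1, column 2 under the assumed convention.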
FIG. 3 shows an example of data transfer from a cache memory to another element in the microprocessor. In this example, data in an SRAM (32) is transferred to a stretcher (STR) (140), which adjusts the timing of the signal. That is, the STR (140) shrinks or extends the signal carrying the data to adjust the timing during data transfer. After the timing is adjusted, the data is transferred to a multiplexer (MUX) (34). At the MUX (34), a part of the data may be selected using a select signal (36). The chosen data is then transferred to an aligner (38), which arranges the data in the appropriate order and, if necessary, may assign a unique extension to the data bits according to the instructions. Assigning a unique extension is explained below. Then, the aligner (38) transfers the data (40) to the other element in the microprocessor.
The data transfer may vary depending on the memory configuration. For example, a cache memory may be divided into banks. A bank is a memory block that typically is arranged to match the bit width of the data bus. A data bus is a path used to transfer data in a microprocessor. In this configuration, data from a cache memory may be transferred along multiple paths for each of the banks.
Referring to FIG. 4, a cache memory is divided into four banks (150, 152, 154, 156), and each bank outputs 64-bit data. The 64-bit data may be divided into four 16-bit data arrays. For example, Bank 1 outputs four arrays to the STR (140), which may extend or shrink the signal of the four arrays to adjust the timing of the data transfer. After this process is complete, the four arrays are transferred to the MUX (34). A select signal (36) chooses one of the four arrays at the MUX (34). Finally, the chosen 16-bit data is transferred to the aligner (38).
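By way of illustration only, the division of a 64-bit bank output into four 16-bit arrays and the selection of one array at the MUX (34) may be sketched as follows. The function names split_into_arrays and mux, and the ordering of the arrays from the lowest-order bits upward, are assumptions made for this sketch. The timing adjustment performed by the STR (140) is not modeled.

```python
from typing import List


def split_into_arrays(word64: int) -> List[int]:
    """Split a 64-bit bank output into four 16-bit arrays, lowest bits first."""
    return [(word64 >> (16 * i)) & 0xFFFF for i in range(4)]


def mux(arrays: List[int], select: int) -> int:
    """Return the one 16-bit array chosen by the select signal (0 to 3)."""
    if not 0 <= select < len(arrays):
        raise ValueError("select signal out of range")
    return arrays[select]


# Example: a bank outputs 0x1111222233334444 and the select signal is 2.
arrays = split_into_arrays(0x1111222233334444)
chosen = mux(arrays, 2)  # 0x2222 under the assumed ordering
```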
Thus, in this example, 16 bits of data are transferred from one of the four banks. In the same manner, 16 bits of data are transferred from each bank at a time. Therefore, in this example, 64 bits of data in total are transferred to the aligner (38). Then, the aligner (38) arranges the 64-bit data according to the instructions before transferring the data to another element in the microprocessor. If the 64-bit data must be converted to another type, the aligner (38) assigns a unique extension to the data. For example, if the 64-bit data must be converted into 32 bits, the aligner (38) may assign a 32-bit extension to the data. This process is known as signing data bits.
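By way of illustration only, the assembly of four 16-bit transfers into 64-bit data and the assignment of an extension may be sketched as follows. The description above does not specify the extension scheme. This sketch assumes conventional sign extension, in which the upper bits are filled with copies of the most significant bit of a narrower value. The function names assemble and sign_extend, and the placement of the first bank in the lowest-order bits, are likewise assumptions.

```python
from typing import List


def assemble(chunks: List[int]) -> int:
    """Concatenate four 16-bit chunks into one 64-bit word.

    The first chunk is placed in the lowest-order bits (assumed ordering).
    """
    word = 0
    for i, chunk in enumerate(chunks):
        word |= (chunk & 0xFFFF) << (16 * i)
    return word


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Widen a from_bits value to to_bits by replicating its top bit."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    if (value >> (from_bits - 1)) & 1:
        # Top bit is set: fill the extension bits with ones.
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value
```

Under these assumptions, a 32-bit value with its top bit set receives a 32-bit extension of ones when widened to 64 bits, while a value with its top bit clear receives an extension of zeros.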
The latency of the above system is generally determined by the signing process, because that process consumes the most time during the data transfer.