FIG. 1 is a simplified block diagram of a conventional microprocessor system 100 having a central processing unit (CPU) 110 coupled to a memory system 120. CPU 110 includes an address generator 112, a data aligner 114, and various pipelines and execution units (not shown). Address generator 112 provides a memory address ADDR to memory system 120. Memory address ADDR is used to activate a row of memory system 120. In general, a memory address includes a row portion that forms a row address for memory system 120. The remaining bits of the memory address designate a specific portion of the memory row. For clarity, the description herein assumes that the bottom row of memory system 120 has a row address of 0. Each successive row has a row address that is one row offset more than the previous row. The row offset depends on the number of words on each row of the memory. Furthermore, memory system 120 is described as being 64 bits wide and is conceptually divided into four 16 bit half words. CPU 110 uses data aligner 114 to load data from or store data to memory system 120. Specifically, data aligner 114 couples a 64 bit internal data bus I_DB to memory system 120 using four 16 bit data buses DB0, DB1, DB2, and DB3. Conceptually, internal data bus I_DB contains four 16 bit data half words that can be reordered through data aligner 114.
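For illustration only, the split of memory address ADDR into a row address and a position within the row can be sketched as follows. The sketch assumes that addresses count 16 bit half words, so that the row offset is 4; the description above does not fix the addressing unit. The names HALF_WORDS_PER_ROW and split_address are hypothetical and do not appear in FIG. 1.

```python
# Assumption: addresses are in units of 16 bit half words, so a
# 64 bit wide row holds 4 half words and the row offset is 4.
HALF_WORDS_PER_ROW = 4


def split_address(addr):
    """Split a half-word address into (row_address, half_word_index).

    row_address is the address of the first half word of the row,
    so the bottom row has row address 0 and each successive row is
    one row offset (HALF_WORDS_PER_ROW) higher.
    """
    half_word_index = addr % HALF_WORDS_PER_ROW
    row_address = addr - half_word_index
    return row_address, half_word_index
```

Under these assumptions, split_address(7) yields row address 4 (the second row) and half word index 3 (the last half word of that row).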
CPU 110 may access memory system 120 with multiple store and load instructions of different data widths. For example, CPU 110 may support instructions that work with 8, 16, 32, 64, 128, 256 or 512 bit data widths. Furthermore, CPU 110 may support storing and loading of multiple data words simultaneously using a single access. For example, CPU 110 may write four 16 bit data words simultaneously as a single 64 bit memory access.
The ability to access data having different data widths may result in unaligned data. As illustrated in FIG. 1, memory system 120 contains data sets A, B, C, D, E, and F. Each data set is stored as one or more half words in memory system 120. For example, data set A includes half words A0, A1, A2, and A3. Data set B includes half word B0. Data set C includes half words C0 and C1. Data set D includes half words D0, D1, D2, and D3. Data set E includes half words E0 and E1. Data set F includes half words F1, F2, F3, and F4 (not shown). Data set A, which is located completely in row 0, is aligned data and can easily be retrieved in one memory access. However, data set D is located in both row 1 and row 2. To retrieve data set D, a conventional CPU 110 must access memory system 120 twice: first to retrieve half word D0 in row 1, and then to retrieve half words D1, D2, and D3 in row 2.
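The cost of unaligned data can be made concrete with a short sketch that lists which rows a data set occupies, and therefore how many accesses a conventional single ported memory needs. The starting positions used below are assumptions inferred from the description (data set A filling row 0, half word D0 being the last half word of row 1); the function name rows_touched is hypothetical.

```python
# Assumption: each row of the 64 bit wide memory holds four
# 16 bit half words, and positions count half words from row 0.
HALF_WORDS_PER_ROW = 4


def rows_touched(start, length):
    """Return the row numbers occupied by a data set.

    start  -- position of the first half word of the data set
    length -- number of half words in the data set
    A conventional memory needs one access per row returned.
    """
    first_row = start // HALF_WORDS_PER_ROW
    last_row = (start + length - 1) // HALF_WORDS_PER_ROW
    return list(range(first_row, last_row + 1))


# Assumed layout: A occupies positions 0-3, D occupies positions 7-10.
accesses_for_a = len(rows_touched(0, 4))  # aligned: one row
accesses_for_d = len(rows_touched(7, 4))  # unaligned: rows 1 and 2
```

With this assumed layout, data set A needs one access, while data set D spans rows 1 and 2 and needs two, which matches the two-step retrieval described above.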
Because memory bandwidth is one of the main factors limiting the performance of microprocessor system 100, requiring multiple memory accesses to retrieve a single data set greatly decreases the performance of microprocessor system 100. For microprocessor system 100, unaligned accesses to memory system 120 can decrease performance by up to fifty percent, because a data set that spans two rows requires two memory accesses instead of one. Replacing memory system 120 with a dual ported memory can eliminate the need for two memory accesses. However, dual ported memories greatly increase the silicon cost (i.e., area) of the memory system as well as the power consumption of the memory system. Furthermore, dual ported memories typically have slower access times than single ported memories. Hence, there is a need for a method or system that provides fast unaligned access to a memory system without requiring high power utilization or large silicon area.