FIG. 1(a) is a simplified block diagram of a conventional microprocessor system 100a having a central processing unit (CPU) 110 coupled to a memory system 120. CPU 110 includes an address generator 112, a data aligner 114, and various pipelines and execution units (not shown). Address generator 112 provides a memory address ADDR to memory system 120. Memory address ADDR is used to activate a row of memory system 120. In general, a memory address includes a row portion that forms a row address for memory system 120. The remaining bits of the memory address designate a specific portion of the memory row. For clarity, the description herein assumes that the bottom row of memory system 120 has a row address of 0. Each successive row has a row address that is one more than the previous row. Furthermore, memory system 120 is described as being 64 bits wide, with each row conceptually divided into four 16 bit half words. CPU 110 uses data aligner 114 to load data from or store data to memory system 120. Specifically, data aligner 114 couples a 64 bit internal data bus I_DB to memory system 120 using four 16 bit data buses DB0, DB1, DB2, and DB3. Conceptually, internal data bus I_DB contains four 16 bit data half words that can be reordered through data aligner 114.
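The split of a memory address into a row portion and the remaining bits can be illustrated with a minimal sketch. The sketch assumes byte addressing, which the description does not state; the function name split_address and the constants are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Byte addressing is an ASSUMPTION; the description
# states only that memory system 120 is 64 bits wide with 16 bit half words.
ROW_BYTES = 8         # one 64 bit row
HALF_WORD_BYTES = 2   # one 16 bit half word


def split_address(addr: int):
    """Return (row address, half-word index within the row) for a byte address."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    row = addr // ROW_BYTES                            # row portion of the address
    half_word = (addr % ROW_BYTES) // HALF_WORD_BYTES  # remaining bits select a half word
    return row, half_word


print(split_address(0))    # (0, 0): first half word of the bottom row
print(split_address(14))   # (1, 3): last half word of row 1
```

Under these assumptions, the low three address bits select a position within a row and the upper bits form the row address.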
CPU 110 may access memory system 120 with multiple store and load instructions of different data widths. For example, CPU 110 may support instructions that work with 8, 16, 32, 64, 128, 256, or 512 bit data widths. Furthermore, CPU 110 may support storing and loading of multiple data words simultaneously using a single access. For example, CPU 110 may write four 16 bit data words simultaneously as a single 64 bit memory access.
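As a rough illustration of combining four 16 bit data words into a single 64 bit access, the following sketch packs and unpacks the words. The placement of the first word in the least significant 16 bits is an assumption, and the function names are hypothetical.

```python
# Illustrative sketch only. Placing word 0 in the least significant 16 bits
# is an ASSUMPTION; the description does not specify an ordering.
def pack_half_words(half_words):
    """Pack exactly four 16 bit values into one 64 bit integer."""
    if len(half_words) != 4:
        raise ValueError("expected exactly four half words")
    word = 0
    for i, hw in enumerate(half_words):
        if not 0 <= hw <= 0xFFFF:
            raise ValueError("half word out of 16 bit range")
        word |= hw << (16 * i)  # half word i occupies bits 16*i .. 16*i + 15
    return word


def unpack_half_words(word):
    """Split a 64 bit integer back into four 16 bit values."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


packed = pack_half_words([0x1111, 0x2222, 0x3333, 0x4444])
print(hex(packed))  # 0x4444333322221111
```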
The ability to access data having different data widths may result in unaligned data. As illustrated in FIG. 1, memory system 120 contains data sets A, B, C, D, E, and F. Each data set is stored as one or more half words (i.e., 16 bit units) in memory system 120. For example, data set A includes half words A0, A1, A2, and A3. Data set B includes half word B0. Data set C includes half words C0 and C1. Data set D includes half words D0, D1, D2, and D3. Data set E includes half words E0 and E1. Data set F includes half words F1, F2, F3, and F4 (not shown). Data set A, which is located completely in row 0, is aligned data and can easily be retrieved in one memory access. However, data set D is located in both row 1 and row 2. To retrieve data set D, CPU 110 must access memory system 120 twice: first to retrieve half word D0 in row 1, and then to retrieve half words D1, D2, and D3 in row 2.
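The number of memory accesses needed for a data set can be illustrated with a short sketch. It assumes that the half words are packed consecutively from row 0, so that data set A begins at half-word position 0 and data set D begins at half-word position 7 (the last half word of row 1); these positions are inferred from the description above rather than read from the figure, and the function name is hypothetical.

```python
# Illustrative sketch only. Half words are numbered consecutively from the
# first half word of row 0; the start positions used below are ASSUMED.
HALF_WORDS_PER_ROW = 4  # a 64 bit row holds four 16 bit half words


def rows_touched(start_half_word, count):
    """Return the row addresses covered by `count` consecutive half words."""
    if start_half_word < 0 or count < 1:
        raise ValueError("invalid start or count")
    first_row = start_half_word // HALF_WORDS_PER_ROW
    last_row = (start_half_word + count - 1) // HALF_WORDS_PER_ROW
    return list(range(first_row, last_row + 1))


print(rows_touched(0, 4))  # [0]     data set A: one access
print(rows_touched(7, 4))  # [1, 2]  data set D: two accesses
```

With a single-ported memory that activates one row per access, the length of the returned list equals the number of accesses required.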
Because memory bandwidth is one of the main factors limiting the performance of microprocessor system 100a, requiring multiple memory accesses to retrieve a single data set greatly decreases the performance of microprocessor system 100a. Replacing memory system 120 with a dual ported memory can eliminate the need for two memory accesses. However, dual ported memories greatly increase the silicon cost (i.e., area) of the memory system as well as the power consumption of the memory system.
Furthermore, as illustrated in FIG. 1(b), some microprocessor systems, such as microprocessor system 100b, include a cache 130 to increase memory performance. As is well known in the art, caches are generally small, fast memories that store recently used data so that repeated accesses to the data can be performed very quickly. In general, when data is read from or written to the main memory (i.e., memory system 120), a copy is also saved in cache 130 along with the memory address of the data. Cache 130 monitors subsequent reads and writes and determines whether the requested data is already in cache 130. When the data is already in cache 130 (i.e., a cache hit), the data in cache 130 is used rather than the data in memory system 120. Because data in cache 130 can be accessed faster than data in memory system 120, the performance of the overall system is improved. Furthermore, data is generally transferred between memory system 120 and cache 130 in a cache line, which is generally several times larger than a memory access of CPU 110. Using large cache lines generally improves cache hit ratios because data that is in close proximity in memory is generally used together. Furthermore, large cache lines improve burst transfers on buses for write back and refilling. While caches that support aligned access are straightforward and well known in the art, caches supporting unaligned access have even larger problems than those described above with respect to memory system 120, because the cache lines are larger and in general more lines are read at the same time.
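The hit and miss behavior described above, and the way an unaligned access can span two cache lines, can be illustrated with a toy model. The 32 byte line size, the byte addressing, the unlimited capacity, and the class name ToyCache are all assumptions made for illustration; the sketch does not model the data itself, write back, or replacement.

```python
# Illustrative toy model only. Line size, byte addressing, and unlimited
# capacity are ASSUMPTIONS; this is not a description of cache 130 itself.
class ToyCache:
    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes
        self.lines = set()  # line numbers currently held in the cache

    def access(self, addr, size):
        """Return [(line number, hit?)] for every cache line the access covers."""
        first = addr // self.line_bytes
        last = (addr + size - 1) // self.line_bytes
        results = []
        for line in range(first, last + 1):
            hit = line in self.lines
            self.lines.add(line)  # on a miss, the whole line is refilled
            results.append((line, hit))
        return results


cache = ToyCache()
print(cache.access(0, 8))   # [(0, False)]            first access misses
print(cache.access(8, 8))   # [(0, True)]             nearby data hits in the same line
print(cache.access(28, 8))  # [(0, True), (1, False)] unaligned access spans two lines
```

In this model, the second access hits only because the first access brought in the entire line, and the third access must consult two lines because it crosses a line boundary.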
Hence, there is a need for a method or system that provides fast unaligned access to a memory system without requiring high power utilization or large silicon area.