The present invention relates generally to computer memory, and more particularly to packed loading and storing of data to/from memory.
Computer systems often require a considerable amount of high-speed memory, such as random access memory (RAM), to hold information, such as data and programs, when a computer is powered and operational. Memory device demands have continued to grow as computer systems have increased in performance and complexity.
Communication from a main processor to locations on memory devices can involve relatively long data access times and latency. For example, a memory access by the main processor can take several hundred cycles, including the time to determine that the data is not in cache (for memory reads), the time to traverse from a processor core of the main processor to I/O and across a module or other packaging, the arbitration time to establish a channel to memory in a multi-processor/shared memory system, and the time to get the data into or out of a memory cell.
A vector processor may support multiple memory accesses in parallel. When the data types being accessed have different widths, particularly widths smaller than a nominal width, gathering and storing the data in parallel can result in inefficient use of memory, such as leaving potentially unusable gaps between data elements stored to and read from memory.
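By way of illustration only, the gaps described above can be sketched in a few lines of Python. The sketch assumes a nominal width of 8 bytes and little-endian byte order. The names NOMINAL_WIDTH, store_unpacked, and store_packed are hypothetical and chosen for this example; they do not describe any particular processor or memory system.

```python
# Illustrative sketch only. The nominal width and function names are assumptions.
NOMINAL_WIDTH = 8  # bytes per memory slot (assumed, e.g. a 64-bit nominal width)


def store_unpacked(values, elem_width):
    """Place each element in its own nominal-width slot, padding the remainder."""
    out = bytearray()
    for v in values:
        out += v.to_bytes(elem_width, "little")
        out += bytes(NOMINAL_WIDTH - elem_width)  # unused gap after the element
    return bytes(out)


def store_packed(values, elem_width):
    """Place elements back to back with no gaps between them."""
    out = bytearray()
    for v in values:
        out += v.to_bytes(elem_width, "little")
    return bytes(out)


values = [1, 2, 3, 4]                  # four 2-byte elements
unpacked = store_unpacked(values, 2)
packed = store_packed(values, 2)
print(len(unpacked), len(packed))      # 32 bytes unpacked versus 8 bytes packed
```

Under these assumptions, four 2-byte elements occupy 32 bytes when each is placed in its own nominal-width slot, but only 8 bytes when stored contiguously, so three quarters of the unpacked region is gap.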