1. Field of the Invention
This invention relates in general to the field of microelectronics, and more particularly to a technique for incorporating non-temporal memory attribute control at the instruction level into an existing microprocessor instruction set architecture.
2. Description of the Related Art
Since microprocessors were first fielded in the early 1970's, their use has grown exponentially. Originally applied in the scientific and technical fields, microprocessor use has moved over time from those specialty fields into commercial consumer fields that include products such as desktop and laptop computers, video game controllers, and many other common household and business devices.
Along with this explosive growth in use, the art has experienced a corresponding technology pull that is characterized by an escalating demand for increased speed, expanded addressing capabilities, faster memory accesses, larger operand size, more types of general purpose operations (e.g., floating point, single-instruction multiple data (SIMD), conditional moves, etc.), and added special purpose operations (e.g., digital signal processing functions and other multi-media operations). This technology pull has resulted in an incredible number of advances in the art, such as extensive pipelining, super-scalar architectures, cache structures, out-of-order processing, burst access mechanisms, branch prediction, and speculative execution, all of which have been incorporated in microprocessor designs. Quite frankly, a present day microprocessor is an amazingly complex and capable machine in comparison to its 30-year-old predecessors.
But unlike many other products, there is another very important factor that has constrained, and continues to constrain, the evolution of microprocessor architecture. This factor, legacy compatibility, furthermore accounts for much of the complexity that is present in a modern microprocessor. For market-driven reasons, many producers have opted to retain all of the capabilities that are required to ensure compatibility with older, so-called legacy application programs as new designs are provided which incorporate new architectural features.
Nowhere has this legacy compatibility burden been more noticeable than in the development history of x86-compatible microprocessors. It is well known that a present day virtual-mode 32-/16-bit x86 microprocessor is still capable of executing 8-bit, real-mode, application programs which were produced during the 1980's. And those skilled in the art will also acknowledge that a significant amount of corresponding architectural “baggage” is carried along in the x86 architecture for the sole purpose of supporting compatibility with legacy applications and operating modes. Yet while in the past developers have been able to incorporate newly developed architectural features into existing instruction set architectures, the means whereby use of these features is enabled—programmable instructions—are becoming scarce. More specifically, there are no more “spare” instructions in certain instruction sets of interest that provide designers with a way to incorporate newer features into an existing architecture.
In the x86 instruction set architecture, for example, there are no remaining undefined 1-byte opcode states. All 256 opcode states in the primary 1-byte x86 opcode map are taken up with existing instructions. As a result, x86 microprocessor designers must presently make a choice either to provide new features or to retain legacy compatibility. If new programmable features are to be provided, then they must be assigned to opcode states in order for programmers to exercise those features. And if spare opcode states do not remain in an existing instruction set architecture, then some of the existing opcode states must be redefined to provide for specification of the new features. Thus, legacy compatibility is sacrificed in order to make way for new feature growth.
One particular problem area that concerns microprocessor designers today relates to the efficient employment of cache structures by application programs. As cache technologies have evolved, more and more features have been provided that allow system programmers to control when and how memory caches are employed in a system. Early cache control features only provided an on/off capability. By setting bits in an internal register of a microprocessor, or by asserting certain external signal pins on its package, designers could enable caching of memory or they could render an entire memory space uncacheable. Uncacheable memory references (i.e., loads/reads and stores/writes) are always provided to a system memory bus and thus incur the latencies commensurate with external bus architectures. Conversely, cacheable memory references are provided to the system memory bus only when a cache miss occurs (i.e., when the object of a memory reference is either not present or is not valid within internal cache). Cache features have enabled application programs to experience dramatic improvements in execution speed, particularly for those programs that make repeated references to the same data structure in memory.
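The contrast drawn above between uncacheable and cacheable references can be illustrated with a minimal sketch. It is a toy model only: the class `SimpleCache`, the function `bus_traffic`, the unbounded cache capacity, and the address trace are all assumptions made for illustration and do not describe any particular microprocessor.

```python
class SimpleCache:
    """Toy model that counts how many references reach the system memory bus."""

    def __init__(self, caching_enabled=True):
        self.caching_enabled = caching_enabled  # the early on/off control
        self.lines = set()       # addresses currently valid in the cache
        self.bus_accesses = 0    # references that went out on the bus

    def reference(self, address):
        if self.caching_enabled and address in self.lines:
            return "hit"         # served from the cache, no bus traffic
        self.bus_accesses += 1   # a miss or an uncacheable reference uses the bus
        if self.caching_enabled:
            self.lines.add(address)  # fill the cache on a miss (capacity unbounded here)
            return "miss"
        return "uncached"


def bus_traffic(caching_enabled, addresses):
    """Run a trace of addresses and return the number of bus accesses."""
    cache = SimpleCache(caching_enabled)
    for address in addresses:
        cache.reference(address)
    return cache.bus_accesses


# Hypothetical program that references the same four addresses 100 times.
trace = [0x1000, 0x1004, 0x1008, 0x100C] * 100
print("bus accesses with caching enabled: ", bus_traffic(True, trace))
print("bus accesses with caching disabled:", bus_traffic(False, trace))
```

In this model the cached run goes to the bus only for the four initial misses, while the uncached run goes to the bus for all 400 references, which is the behavior that makes repeated references to the same data structure so much faster when caching is enabled.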
More recent microprocessor architecture improvements have allowed system designers to more precisely control how cache features are employed. These improvements permit the designers to define the properties of a range of addresses within a microprocessor's address space in terms of how references to those addresses are executed by the microprocessor with regard to its cache hierarchy. Generally speaking, references to those addresses can be defined as uncacheable, write combining, write through, write back, or write protected. These properties are known as memory attributes, or memory traits. For example, store references to an address having a write back attribute are provided to cache and are speculatively executed, whereas store references to a different address having an uncacheable trait are sent to the system bus and are not speculatively executed.
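The assignment of memory traits to address ranges can be sketched as a simple lookup. The range boundaries, the names `RANGES`, `trait_of`, and `store_destination`, and the default trait are hypothetical; only the five trait names and the two store behaviors come from the description above, and the remaining traits are deliberately left unmodeled.

```python
from enum import Enum


class Trait(Enum):
    UNCACHEABLE = "uncacheable"
    WRITE_COMBINING = "write combining"
    WRITE_THROUGH = "write through"
    WRITE_BACK = "write back"
    WRITE_PROTECTED = "write protected"


# Hypothetical address ranges (start inclusive, end exclusive) and their traits.
RANGES = [
    (0x0000_0000, 0x000A_0000, Trait.WRITE_BACK),
    (0x000A_0000, 0x000C_0000, Trait.UNCACHEABLE),
]


def trait_of(address, default=Trait.UNCACHEABLE):
    """Return the trait of the range containing the address (default is assumed)."""
    for start, end, trait in RANGES:
        if start <= address < end:
            return trait
    return default


def store_destination(address):
    """Where a store is sent in this toy model; only two traits are modeled."""
    trait = trait_of(address)
    if trait is Trait.WRITE_BACK:
        return "cache"
    if trait is Trait.UNCACHEABLE:
        return "system bus"
    return "not modeled"


print(store_destination(0x0000_1000))  # address inside the write back range
print(store_destination(0x000B_8000))  # address inside the uncacheable range
```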
It is not within the scope of the present application to provide an in-depth description of memory attributes and how specific attributes are processed by a microprocessor with regard to its cache. It is sufficient herein to understand that the state of the art enables designers to assign a memory attribute to a region of memory and that all subsequent memory references to addresses within that region will be executed according to the cache policy associated with the prescribed memory attribute.
Although present day microprocessor designs allow different regions of memory to be assigned different memory traits, the designs are limited in two significant respects. First, microprocessor instruction set architectures restrict execution of instructions for defining/changing memory traits to a privilege level that is inaccessible by user-level applications. Accordingly, when a desktop/laptop microprocessor boots up, its operating system establishes the memory traits for virtual memory space prior to invocation of any user-level application program. The user-level applications are thus precluded from changing the memory traits of the host system. Second, the level of granularity provided by a present day microprocessor for establishing memory traits is page level at best. In conventional architectures that allow memory paging, the memory attributes of each memory page are defined by the operating system within page directory/table entries. Hence, all references to addresses within a particular page will employ the memory attribute assigned to the particular page during execution of the associated memory access operation.
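Both limitations can be seen in a small sketch. It assumes 4 KiB pages, a dictionary standing in for page directory/table entries, and a numeric privilege level in which 0 denotes the operating system; the names `page_traits`, `set_trait`, `trait_for`, `hot_table`, and `scratch` are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Hypothetical stand-in for page table entries: page number -> memory trait.
page_traits = {0x10: "write back", 0x11: "uncacheable"}


def set_trait(page_number, trait, privilege_level):
    """Change a page's trait; only privilege level 0 (assumed OS level) may do so."""
    if privilege_level != 0:
        raise PermissionError("memory traits cannot be changed at user level")
    page_traits[page_number] = trait


def trait_for(address):
    """Every address inherits the trait of the page that contains it."""
    return page_traits[address // PAGE_SIZE]


hot_table = 0x10000  # hypothetical frequently referenced structure
scratch = 0x10800    # hypothetical incidental structure on the same page

# Both structures fall in page 0x10, so they necessarily share one trait.
print(trait_for(hot_table), "|", trait_for(scratch))

try:
    set_trait(0x10, "uncacheable", privilege_level=3)  # user-level attempt
except PermissionError as error:
    print("rejected:", error)
```

Even if the privileged change were permitted, it would alter the trait of every address in the page, including the frequently referenced structure, which is the granularity limitation described above.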
For many applications, the above control features have allowed user-level programs to experience marked improvements in execution speed, but the present inventors have noted that other applications are limited because present day memory trait controls are not available for employment at the user level, and furthermore because memory attributes can only be established with page-level granularity. For example, a user program that makes repeated accesses to a first data structure will suffer when an incidental reference to a second data structure occurs, under the conditions where the cache entries of the first data structure must be flushed to provide space within the cache for the second data structure. Because operating systems have no a priori knowledge of the frequency of references to data structures by user-level application programs, application data spaces are typically assigned a write back trait, thus setting up the conditions for the above noted conflict. And an application programmer has no means to alter the assigned trait to force the incidental reference to go to the memory bus (e.g., assign an uncacheable trait to the second data structure), thereby precluding the conflict.
Within the art, data that is repeatedly accessed by an application program is referred to as temporal data and data associated with incidental references is called non-temporal data. One skilled in the art will also appreciate that filling up a cache with non-temporal data (i.e., cache pollution) is very disadvantageous. Consequently, more recent advances in the art have provided existing instruction sets with a limited set of non-temporal store instructions that allow application programmers to move data from internal registers to memory without polluting the cache. However, no means currently exists whereby a programmer can direct that a memory reference specified by an existing instruction (e.g., an instruction prescribing an arithmetic or logical operation that employs one or more memory operands) be executed non-temporally, thus bypassing cache altogether.
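The effect of cache pollution, and the benefit of letting an incidental reference bypass the cache, can be illustrated with a toy simulation. The least-recently-used replacement policy, the eight-line capacity, the access pattern, and the names `LRUCache` and `temporal_hits` are assumptions chosen for illustration; real cache hierarchies are considerably more elaborate.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> True, oldest entry first
        self.hits = 0
        self.misses = 0

    def access(self, address, non_temporal=False):
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        if non_temporal:
            return  # bypass: serviced by the bus without allocating a line
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[address] = True


def temporal_hits(bypass, rounds=10):
    """Count hits on the temporal data when incidental data does or does not bypass."""
    cache = LRUCache(capacity=8)
    hot = list(range(8))  # temporal data that exactly fills the cache
    hits_on_hot = 0
    for r in range(rounds):
        for address in hot:
            before = cache.hits
            cache.access(address)
            hits_on_hot += cache.hits - before
        # Incidental pass over eight addresses that are never referenced again.
        for address in range(1000 + 8 * r, 1008 + 8 * r):
            cache.access(address, non_temporal=bypass)
    return hits_on_hot


print("temporal hits, incidental data cached:  ", temporal_hits(bypass=False))
print("temporal hits, incidental data bypassed:", temporal_hits(bypass=True))
```

Under these assumptions the temporal data never hits when the incidental data is cached, because each incidental pass evicts the entire working set, whereas with bypass every round after the first hits on all eight temporal addresses (72 hits over ten rounds).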
Therefore, what is needed is an apparatus and method that incorporate instruction level non-temporal memory reference control features into an existing microprocessor architecture having a completely full opcode set, where incorporation of the memory reference control features allows a conforming microprocessor to retain the capability to execute legacy application programs while concurrently providing application programmers with the capability to specify non-temporal memory accesses.