This invention pertains generally to the field of computer system device write operations and more particularly to system, apparatus, method, and computer program for performing multiple write operations of data and/or commands from a processor or other command or data source to a hardware device in a manner that the processor's or executing procedure's intended order of receipt by the device is preserved.
Heretofore, programs and/or processes that generate data and/or commands under programmatic control, such as in a device driver program, applications program, or the like, have frequently been forced to employ a conservative memory management strategy when the target for the command or data is a hardware device, so that the intended order of receipt of data or commands by the device is assured. Hardware devices (such as printer devices, modems, graphics processors, and the like, to name a few) may be problematic because such devices do or may respond immediately upon receipt of the particular command or data item, and may not typically wait to receive all of the data or commands that will be sent from a processor, microprocessor, or computing system. Waiting to receive all the data or commands would provide an opportunity to sort the data or commands into the temporal order intended by the application executing on the computing system and being communicated to the hardware device. In some instances, it would not even be possible for the hardware device to reconstruct the intended time order, as insufficient information is provided to the device respecting the intended temporal order. Often the temporal order is an indicator of the identity of particular data or commands, so that out-of-order receipt causes the data or command to be interpreted by the receiving device in an unintended manner. A memory, on the other hand, can typically wait for all of the anticipated data and/or commands to arrive and, if required, restore them to the proper temporal order before they are accessed.
While this conservative approach (sometimes referred to as sequential, in-order, or strong memory management) may be applicable to some hardware devices, it unfortunately results in some degradation in performance for such devices, typically manifested as a reduction of available bandwidth. On the other hand, if a less conservative memory management strategy (sometimes referred to as out-of-order or weak memory management) could be employed for hardware devices, then performance sacrifices could be minimized.
In the embodiment of a computer system 102 illustrated in FIG. 1, level 1 (L1) cache memory 252 is coupled to processor 250 via a bus 258, and level 2 (L2) cache 254 is coupled to processor 250 by bus 256. Bridge circuits as are known in the art may be interposed between these structures. The inventive structure and method described hereinafter are also applicable to multi-processor environments and multi-processor computers; however, we use the term processor or CPU generally to refer to single processor environments, dual-processor environments, and other multiple processor environments and computer or information processing systems. Caches 252, 254 serve to provide temporary memory storage for processing that may or will be needed for near-term execution cycles within the processor. For non-short term storage the system memory 278 would generally be used rather than caches 252, 254. The use of a cache memory in association with a processor 250 in a computing system 102 of the type illustrated in FIG. 1 is known, and is not described further.
System memory 278 may, for example, comprise solid-state addressable Random Access Memory (RAM), of which there are many conventional varieties, and is used to store commands, addresses, data, and procedures for use by the computer system 102. System memory 278 may, for example, store all or portions of hardware drivers for operating devices 290, 292, 110 and in the inventive graphic processor 210 described above.
Processor 250 is also connected to a write buffer 204 by address bus (ADDR) 260 and data bus (DAT) 262. Write buffer 204 is interposed between processor 250 and memory controller 268, which controls the flow of command/control/address/data between write buffer 204 and either system memory 278 or devices attached to one or more peripheral busses, such as a graphics processor 110 on an Accelerated Graphics Port (AGP) Bus 286, or Device "A" 290 or Device "B" 292 on a Peripheral Component Interconnect (PCI) Bus 288. Devices "A" or "B" could, for example, comprise printers, cameras or other sensors, modems, secondary processors, other graphics processors, and any other conventionally known computer device or system.
It should also be understood that such devices need not be PCI Bus compatible devices, but may also include, for example, AGP Bus, SCSI, ISA, Universal Serial Bus (USB), Fibre Channel, FireWire, or other compatible devices, and that such devices may be configured to operate internal to a computer housing, such as within a slot on the computer motherboard, or as external peripheral devices connected by cable or wireless connection. The types of computer system devices or hardware devices include the types used for IBM compatible personal computers (PCs), Macintosh PowerMac, Power PC, iMac, and the like computers made by Apple Computer, workstations (such as, for example, the Sun Microsystems SPARC workstation), specialized microprocessors, or even mainframe type computer systems.
Processor 250 may be of the type having internal or external caches, with or without chipsets connecting to I/O or graphics processor buses, or where multiple processors are connected tightly or distributively, sharing or not sharing memory. Such a microprocessor may, for example, implement RISC, CISC, VLIW, or other instruction sets and may support speculative execution or like advanced processing concepts. For example, the Intel Pentium, Intel Pentium II, Intel Pentium III, Intel Merced, ARM, Advanced Micro Devices K6, Advanced Micro Devices K6-3 or K7, Compaq Alpha, IBM Power PC, Sun Microsystems SPARC, Silicon Graphics (SGI) MIPS or any other processor, microprocessor or CPU may be used. Systems may also include a plurality of the same or different processors.
Of particular interest are the Intel Pentium® II and III microprocessors (and other successor processors that utilize the functionality), which utilize fast writes and uncached write combine operations. Other modern processors also generate results out-of-order, for example as a result of speculative execution, branch operations, parallel processing, and the like. Generally, uncached write operations refer to program-generated data written directly to system memory, rather than to an L1 or L2 cache. This may also be called uncached speculative write combining (USWC), and part of the address space of the processor may be specified to be of the USWC type. The advantage of USWC-type memory is the ability to receive out-of-order write operations shortly after the processor generates a write operation, avoiding synchronization with other write operations, thereby increasing processing throughput.
Write buffer 204 is of conventional type and may, for example, be implemented with a static RAM. Usually, processor 250, L1 cache 252, and write buffer 204 are formed on a single common substrate within a single chip. Write buffer 204 may be envisioned as including a plurality (for example "n") of cache lines 205 for temporarily storing command/address/data sent from processor 250 to memory controller 268 and ultimately to either system memory 278 or other input/output or peripheral devices, including for example device "A" 290, device "B" 292, or hardware device 110.
In the embodiment illustrated in FIG. 1, the hardware device includes a hardware device processor 134 (such as a graphics pipeline of a graphics processor), and a First-In-First-Out (FIFO) memory 120 interposed between the hardware device processor 134 and AGP bus 286, which communicates information from the host processor 207 to the hardware device processor 134. FIFO memories or buffers are known in the art and are not described further here, except in order to distinguish conventional structure or operation from the inventive structure, operation, and method. Conventional structures, lines, signals, and the like that are not central to understanding the invention are implied but are not shown in the drawings, to avoid obscuring the invention.
We now describe some problems associated with out-of-order generation of datum (including data and commands) by the computer system. In high-performance computer systems, there is a desire to execute instructions as rapidly and efficiently as possible. This often means that either intermediate or final "results" are generated out-of-order from the order they will be used, or out-of-order relative to the desired order of receipt by some other process or device. Usually, if the results are only to be written to a memory, such as to system memory 278, the order in which such results (datum) are generated is not important, since either the subsequent process can wait until all results have been generated, or the results (datum) will be retrieved from memory in the order desired. Usually, the results are written to particular address locations and proper ordering is inherent in reading the final memory contents at the completion of the process. So for example, if it is ultimately desired to read the contents of memory locations 001h-008h (h=hexadecimal) in order of ascending address location, but the contents of these memory locations were generated in the order 002h, 001h, 005h, 006h, 004h, 003h, 008h, 007h, it is only necessary to read the results from memory in the proper ascending order after the values have been written to memory.
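By way of illustration only, the memory example above may be sketched in Python as follows. The sketch is not part of the described system; the dictionary standing in for system memory and the stored values are assumptions chosen only to show that address-keyed writes tolerate any generation order.

```python
# Hypothetical illustration: results written to memory out of order are
# still read back correctly, because each write names its own address.

# Order in which the results were generated (taken from the example above).
generation_order = [0x002, 0x001, 0x005, 0x006, 0x004, 0x003, 0x008, 0x007]

memory = {}  # stand-in for system memory, keyed by address
for address in generation_order:
    # The stored value is arbitrary; here it simply encodes the address.
    memory[address] = f"result-{address:03X}"

# Reading in ascending address order recovers the intended sequence,
# regardless of the order in which the writes arrived.
read_back = [memory[address] for address in range(0x001, 0x009)]
```

The reader imposes the order after all writes have completed, which is precisely the opportunity that a device responding immediately to each write does not have.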
However, a problem arises in a computer system where the processor 250 treats a device, such as graphics processor 110, or devices "A" or "B", as memory. This paradigm is sometimes referred to as the "memory mapped I/O" model. In a system using memory mapped I/O, devices are addressed at certain reserved address ranges on the main memory bus, and these addresses therefore cannot be used for system memory. When memory mapped I/O is used, it may not be possible for the processor or memory controller to treat datum destined for system memory differently from datum destined for the I/O devices. This problem arises when the operation of a device depends on the correct order of receipt of commands or data and there is no opportunity to delay the expression of a received data or command item until it is reordered.
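A minimal sketch of memory mapped I/O address decoding is given below for illustration. The address ranges, the target names, and the function `route_write` are hypothetical and do not correspond to any actual address map of the system of FIG. 1.

```python
# Hypothetical address map; the ranges are illustrative assumptions only.
DEVICE_RANGES = {
    "graphics_processor": range(0xA000, 0xB000),
    "device_a": range(0xB000, 0xB100),
}


def route_write(address):
    """Return the name of the target that a write to `address` reaches."""
    for name, addresses in DEVICE_RANGES.items():
        if address in addresses:
            return name
    # Any address outside the reserved ranges falls through to memory.
    return "system_memory"
```

In this sketch the issuing side uses one and the same write operation for every target; only the address determines whether the datum lands in memory or in a device, which is why the processor may be unable to apply stricter ordering to device writes alone.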
In one simple example of this problematic situation, consider a printer or printing device that prints each character as it is received. The order in which each character is received is important to the correct operation of the printer device. If the intended characters and/or words "dog ran down the street" are received out of order, the printer might print each letter as it is received and erroneously print "god ran down the street", "street ran down the dog", "the street ran down god", or something entirely unintelligible. Preserving order is important.
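The printer example may be sketched as follows, for illustration only. The class `ImmediatePrinter` and the helper `send` are hypothetical names introduced for this sketch; the delivery orders are chosen by hand to stand in for in-order and out-of-order arrival.

```python
class ImmediatePrinter:
    """Hypothetical device that prints each character as it is received."""

    def __init__(self):
        self.page = ""

    def receive(self, character):
        # No buffering and no reordering: the character is printed at once.
        self.page += character


def send(text, delivery_order):
    """Deliver the characters of `text` in the given index order."""
    printer = ImmediatePrinter()
    for index in delivery_order:
        printer.receive(text[index])
    return printer.page


intended = "dog"
in_order = send(intended, [0, 1, 2])   # characters arrive as intended
scrambled = send(intended, [2, 1, 0])  # same characters, reversed arrival
```

Both calls deliver exactly the same three characters; only the arrival order differs, and the device has no information from which to recover the intended word.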
The out-of-order result is due at least in part to the use of cached or uncached write combine mode in a system where the processor cannot determine, or does not determine, that it is writing to an I/O device or other device where order may be important, rather than to a memory.
One conventional approach to eliminate the occurrence of the out-of-order result is to apply a so-called "strong memory model" to the I/O access rather than a so-called "weak memory model". Conventionally, a strong memory model assumes that all reads and writes from all processors are in sequential order and, as a result, the I/O devices will receive them in the same order in which they have been issued. In a weak memory model, there is an assumption that memory reads and writes can go out-of-order from the order in which they are issued by the same or other processors, so that synchronization and reordering are required on the receiving side to ensure correct processing at the receiver. There is somewhat of a continuum between the strong and weak memory models, so that intermediate levels of performance (and problems) may be realized.
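The distinction between the two models may be illustrated by the following sketch. It assumes, purely for illustration, that each write carries a sequence tag that a receiver can sort on; the function names and the tags are hypothetical, and a hardware device that acts on each write immediately generally has no such reordering stage.

```python
def strong_delivery(writes):
    """Strong model: writes arrive in exactly the order they were issued."""
    return list(writes)


def weak_delivery(writes, arrival_order):
    """Weak model: writes may arrive in any order (given here by index)."""
    return [writes[index] for index in arrival_order]


def reorder_at_receiver(tagged_writes):
    """Receiver-side reordering using an assumed (sequence, value) tag."""
    ordered = sorted(tagged_writes, key=lambda tagged: tagged[0])
    return [value for _, value in ordered]


issued = ["cmd0", "cmd1", "cmd2", "cmd3"]
tagged = list(enumerate(issued))                 # attach sequence tags
arrived = weak_delivery(tagged, [1, 0, 3, 2])    # out-of-order arrival
restored = reorder_at_receiver(arrived)          # receiver restores order
```

The strong model pays for ordering on the issuing side, while the weak model defers the cost to a receiver that must be able to wait for, and sort, everything it is sent.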
Therefore, absent some additional mechanism for preserving order, datum or other results may reach a device out of order from that intended by the process generating the results, for example, out of the order intended by the applications program or device driver.
One such order preserving mechanism, applicable to a limited class of situations but which does not solve the problem for reasons described hereinafter, is the "write fence". A write fence is a special processor operation (included in some Intel processors) or command in the form of an instruction that asserts signals between the processor and the write buffer, or otherwise communicates with the processor and the write buffer, to signify that the later (second) write block on one side of the write fence is to be held (not sent to the I/O device) until the earlier (first) write block has been sent to the I/O device. Here, earlier (first) refers to the intended programmatic order and later (second) refers to subsequent programmatic order, rather than to the actual temporal order of the result. A conventional write fence command is a low level (usually an assembly language code level) primitive that does not typically exist in high level programming languages.
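The behavior of a write fence may be modeled, under stated assumptions, by the sketch below. It is a software simulation only and does not use any actual processor instruction; the marker `FENCE`, the function `drain`, and the random shuffling used to represent a weakly ordered write buffer are all hypothetical.

```python
import random

FENCE = object()  # hypothetical marker standing in for a write fence


def drain(write_stream, rng):
    """Deliver writes block by block; order inside a block is arbitrary."""
    delivered = []
    block = []
    for item in write_stream:
        if item is FENCE:
            # Everything issued before the fence is sent before anything
            # issued after it. The fence itself is consumed here and is
            # never forwarded to the device.
            rng.shuffle(block)
            delivered.extend(block)
            block = []
        else:
            block.append(item)
    rng.shuffle(block)  # final block after the last fence
    delivered.extend(block)
    return delivered


stream = ["a0", "a1", "a2", FENCE, "b0", "b1", "b2"]
out = drain(stream, random.Random(0))
```

In this model every write of the first block precedes every write of the second block in `out`, but nothing constrains the order of writes within either block, and the fence marker never appears in the delivered sequence.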
This type of fence can provide some order preservation between write blocks, but unfortunately, a fence written by the conventional write fence command, while present in the instruction memory, does not get sent to an I/O or hardware device, is therefore not visible to such I/O devices, and cannot be used to solve the problems in these conventional systems. Also, even if the write fence could be seen by the hardware I/O device, the write fence would not generally assist in maintaining temporal order or result identity within a single cache line, and some mis-ordering or scrambling would still occur.
For a system in which a particular device, which benefits from receiving data and commands in the temporal order intended by the device driver or other program, is coupled to a processor 250 and write buffer 204 having conventional design, there therefore remains a need for system, apparatus, and method that maintains the ordering intended by the software or firmware driver program.