This invention relates generally to the organization of digital computing apparatus and more specifically to the recording or logging of memory updates in support of auditing, recovery, device output, networking and multiple cache consistency maintenance.
Data Logging
Data logging or logging is the recording of data values generated by, and concurrent to, the execution of some generating process. Logging is used in database management systems to record updates to the database to facilitate recovery in the case of a system crash or media failure. It is also used to record the previous value of a record before an update so that the change can be undone. Logging is used in computer simulation to provide a record of the activity of the simulated system for subsequent processing and analysis, as an aid to debugging the system, and to implement rollback in so-called optimistic simulation where it may be necessary to undo actions that have progressed too far ahead in virtual time. Also, logging is used to record actions performed by a system so that these actions can be monitored and reviewed as part of security auditing.
Logging has been implemented in two well-established ways. First, the program can be modified, either manually by a programmer or by a language processor, to generate a write operation to a log buffer area for every data write operation occurring in the program. This approach is commonly used with database management systems, program debugging, and simulation systems. FIG. 1 shows this organization. An application process 101 performs a data write operation to the program data area 103. In conjunction with this data write operation, the application process 101 also performs a log data write to the log buffer area 105 consisting of the datum and possibly the address of the data write operation.
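The first approach can be illustrated with a minimal sketch, assuming a dictionary stands in for the program data area and a list for the log buffer area. The names LoggedStore and replay are hypothetical and do not come from FIG. 1; the sketch only shows that each data write is paired with a log write of the address and datum, and that the log alone suffices to rebuild the data area.

```python
class LoggedStore:
    """Program data area plus a log buffer (hypothetical illustration)."""

    def __init__(self):
        self.data = {}   # program data area: address -> datum
        self.log = []    # log buffer area: (address, datum) records in write order

    def write(self, address, datum):
        # The data write and its companion log write are issued together.
        self.data[address] = datum
        self.log.append((address, datum))


def replay(log):
    """Rebuild a data area by reapplying logged writes in order."""
    state = {}
    for address, datum in log:
        state[address] = datum
    return state


store = LoggedStore()
store.write(0x10, 7)
store.write(0x14, 9)
store.write(0x10, 8)
recovered = replay(store.log)
```

Every write costs two stores in this arrangement, which is the per-write overhead inherent in program-generated logging.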
In the second approach, the memory mapping hardware can be set to write-protect the area of memory to be logged, normally with the aid of the operating system. Subsequently, a write operation to this area of memory causes a trap to a trap handler software routine which determines the address and datum being written, transfers that information to the log buffer, completes the write operation, and allows the program to continue execution. Mechanisms to implement this approach are provided in commercial operating systems such as the ATT Unix system by the mmap, mprotect, and signal facilities.
Basic Cache Design
Computer system designers have recognized the advantages of employing a fast cycle time cache memory unit or cache intermediate between the longer cycle time main memory and the processing unit or units. Caches are structured as a plurality of fixed-sized lines, the unit of transfer between the cache and main memory, with an associative table or directory, with a status control word or tag per line, indicating the validity of the line, read and write permissions, its associated main memory address, and possibly other information. Additional directory information has been used to ensure consistency between the cache information and main memory and possibly other caches in the system. The cache control module manages the contents of the cache memory unit and the cache directory in response to memory references by its associated processor(s) and bus or network interconnect. Each cache line holds one memory block, a portion of memory of size equal to the size of the cache line, and starting on an address that is a multiple of this size. A processor is coupled to a cache so that memory references are handled by the cache, rather than requiring the memory references to be communicated to the main memory system.
FIG. 2 shows this basic cache structure. A processor 202 refers to an address as part of a read or write operation by signaling the address and the operation over bus interconnect 203 to cache memory unit 204, in particular cache controller 206. Cache controller 206 looks up the address in cache directory 208 to locate the corresponding status control word 210. If the corresponding status control word 210 is present in cache directory 208, referred to as a cache hit, this status control word 210 indicates the corresponding cache line 216 in the cache line memory 212 containing the referenced datum. If the operation is a write operation, the datum to be written is transmitted from processor 202 over data busses 214 and 203 into cache line 216. The datum value may or may not be propagated over bus 218 to the memory system (not shown), depending on whether that cache line is handled as write-through or write-back, as described in the next section.
If the corresponding status control word 210 is not present in cache directory 208, referred to as a cache miss, the cache control module 206 selects a cache line 216 to hold this memory block, writes the data currently contained in this line back to memory if it has been modified since the last such writeback, loads the required data from memory into this line, sets the corresponding status control word 210 to reflect the new contents of the cache, and allows the processor 202 to continue, completing the operation as for a cache hit described above.
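The hit and miss handling described above can be sketched as follows. This is a simplified model under stated assumptions: the cache is direct-mapped rather than associative, lines are handled as write-back, and the sizes LINE_SIZE and NUM_LINES as well as the class name Cache are illustrative choices, not values from FIG. 2.

```python
LINE_SIZE = 4  # words per cache line (illustrative value)
NUM_LINES = 2  # lines in the cache (illustrative value)


class Cache:
    def __init__(self, memory):
        self.memory = memory  # list of words standing in for main memory
        # One directory entry (status control word) per line.
        self.directory = [{"valid": False, "dirty": False, "block": None}
                          for _ in range(NUM_LINES)]
        self.lines = [[0] * LINE_SIZE for _ in range(NUM_LINES)]
        self.hits = 0
        self.misses = 0

    def _lookup(self, address):
        block = address // LINE_SIZE
        index = block % NUM_LINES
        tag = self.directory[index]
        if tag["valid"] and tag["block"] == block:
            self.hits += 1
            return index
        # Cache miss: write back the victim line if modified, then load the block.
        self.misses += 1
        if tag["valid"] and tag["dirty"]:
            base = tag["block"] * LINE_SIZE
            self.memory[base:base + LINE_SIZE] = self.lines[index]
        base = block * LINE_SIZE
        self.lines[index] = self.memory[base:base + LINE_SIZE]
        tag.update(valid=True, dirty=False, block=block)
        return index

    def read(self, address):
        index = self._lookup(address)
        return self.lines[index][address % LINE_SIZE]

    def write(self, address, datum):
        index = self._lookup(address)
        self.lines[index][address % LINE_SIZE] = datum
        self.directory[index]["dirty"] = True


memory = list(range(16))
cache = Cache(memory)
cache.write(1, 100)    # miss: loads block 0, then modifies it in the cache
value = cache.read(1)  # hit on the same line
cache.read(9)          # miss: block 2 maps to line 0 and evicts modified block 0
```

In this sketch main memory sees the value 100 only when the modified line is evicted by the final read, not at the time of the write.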
Multi-level cache structures have also been used. The first level or L1 cache is directly connected to the processor to handle each memory access. The first level cache is connected to a second level cache (L2), rather than being connected to main memory. When a memory reference misses in the L1 cache, the L1 cache controller sends a request for the referenced block of memory to the L2 cache. If the L2 cache contains the referenced memory block, the L2 cache returns this data to the L1 cache, the same as the case of a cache hit described above. Otherwise, the reference is a miss at the L2 cache, and is handled the same as described for the cache miss case above. In particular, the L2 cache controller requests the data from the next lower level of cache or memory to which it is connected. Multi-level caching has been described in several published reports; one recent report is "ParaDiGM: A Highly Scalable Shared Memory Multi-Computer Architecture" by Cheriton et al. in IEEE Computer, February 1991.
Cache Writing Methods
Write operations modify the state of the cache, potentially making it inconsistent with the copy of the corresponding location in main memory. Two common approaches are used to ensure that main memory is eventually updated by writes to a cache line. In the first approach, the write-through approach, any write operation to the cache causes the cache to immediately write this same value into the memory system. Thus, memory is always current and the cache serves only to handle read operations in place of the memory system.
In the second approach, the write-back approach, write operations only immediately update the contents of the associated cache line, not main memory. The status control word for the cache line is then set to indicate that the cache line has been modified. When a cache line is to be invalidated (because it is to be used for a different portion of memory), or any other need arises to make the main memory consistent with the cache, the data in the cache line is then written back to the memory block associated with this cache line.
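The difference in memory traffic between the two approaches can be made concrete with a small sketch. It assumes a cache holding a single line and counts memory write transactions for a sequence of write addresses; the function name memory_writes and the example trace are hypothetical.

```python
def memory_writes(write_addresses, line_size, policy):
    """Count memory write transactions for a one-line cache under a policy."""
    resident = None   # block currently held in the single cache line
    dirty = False     # "modified" flag from the status control word
    count = 0
    for address in write_addresses:
        block = address // line_size
        if block != resident:
            if policy == "write-back" and dirty:
                count += 1          # modified line written back on replacement
            resident, dirty = block, False
        if policy == "write-through":
            count += 1              # every write goes to memory immediately
        else:
            dirty = True            # only mark the line as modified
    if policy == "write-back" and dirty:
        count += 1                  # final write-back to make memory consistent
    return count


trace = [0, 1, 2, 3, 0, 1, 8, 9]
through = memory_writes(trace, 4, "write-through")
back = memory_writes(trace, 4, "write-back")
```

Note that a write-through transaction carries one datum while a write-back transaction carries a whole line, so the counts compare numbers of transactions rather than bytes transferred.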
Caches may provide a fixed or selectable address range, or set of address ranges, that is non-cached. When a processor writes to a non-cached portion of memory, the write operation is passed through immediately to the main memory system and does not affect the state of the cache. Read operations to addresses within the designated non-cached ranges are handled similarly, bypassing the cache and retrieving the requested data from the memory system.
Non-cached regions of memory are frequently used to access I/O devices such as network interfaces, graphical frame buffers, and various storage device interfaces. The non-cached behavior ensures that all writes are immediately transferred to the device. It also ensures that read operations return the current state of status registers and other device states that may change without the processor modifying this state, an example being a device completion flag. Non-cached access is also used on occasion to handle large areas of memory that are written but not read immediately, such as a log buffer. This approach avoids using a large number of cache lines to hold this data, which would frequently force the cache to discard other data that it is accessing, causing a significant performance penalty.
Caches and processors may provide an explicit mechanism for forcing modified cache lines to be written back or flushed to main memory. Explicit flushing can be used in place of non-cached access in some cases. The cache lines are written and then the modified data is flushed back to the memory to update the device or memory, converging to the situation achieved with a non-cached memory approach. However, this approach is frequently infeasible because the cache may write back a memory block at any time to make space for other memory blocks being referenced by the processor activity, so explicit cache flushing is not guaranteed to cause write back in the required order of data transfer.
Finally, caches may provide write buffers that accept the data to be written, either by the processor directly on a non-cached or write-through operation, or as part of a cache line writeback, normally allowing the processor or cache to proceed before the data has been transferred to the memory system. Write buffers are an established mechanism for minimizing the delay on the cache and processor operation imposed by writing data to the memory system. Write buffers have been used directly as part of the processor to absorb write data, as part of write-through caches and as part of write-back caches to absorb cache line write data.
Memory Copy and Rollback Support
An operating system facility known as copy-on-write (and also as deferred copy) provides a functionality with some relevance to logging. With a copy-on-write facility, a virtual memory source segment is logically copied to a destination segment by mapping the first segment to the destination segment, using virtual memory mapping hardware. The actual copy only occurs on a page basis at the point that a process first attempts to write to a page in the destination segment. At that time, the source page corresponding to the referenced page is copied to the destination segment before the write completes, modifying the destination segment. Subsequent read and write operations within this page go directly to the page in the destination segment. Thus, the copy-on-write facility "logs" modifications to the source segment but on the granularity of a page. That is, the modified page and its modifications are recorded in the destination segment area. This mechanism for logging and copying suffers from the major disadvantage of the page being an excessively large unit. A scheme was proposed by Cheriton et al. (The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation, Int. Symposium on Computer Architecture, 1988) that provided this facility at the cache line level. However, that proposal relied on software-controlled virtually addressed caches, not a common computer system architecture.
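The page-granularity behavior of copy-on-write can be sketched as follows, assuming pages are modeled as Python lists and the mapping hardware as a dictionary lookup. The class name CowSegment and the value of PAGE_SIZE are hypothetical.

```python
PAGE_SIZE = 4  # words per page (illustrative value)


class CowSegment:
    """Destination segment that shares source pages until first written."""

    def __init__(self, source_pages):
        self.source = source_pages   # list of pages, each a list of words
        self.private = {}            # page number -> privately copied page

    def read(self, address):
        page, offset = divmod(address, PAGE_SIZE)
        # Reads go to the private copy if one exists, else to the source page.
        return self.private.get(page, self.source[page])[offset]

    def write(self, address, datum):
        page, offset = divmod(address, PAGE_SIZE)
        if page not in self.private:
            # First write to this page: copy the whole source page.
            self.private[page] = list(self.source[page])
        self.private[page][offset] = datum


source = [[0, 1, 2, 3], [4, 5, 6, 7]]
dest = CowSegment(source)
dest.write(5, 50)
```

A single one-word write here copies an entire page, which illustrates why the page is described above as an excessively large unit for logging.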
An application-level variant of this copy-on-write mechanism has been used to implement checkpointing and rollback, or resetting to the checkpoint. (See Li et al., Concurrent real-time checkpoint for parallel programs, 2nd ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, March 1990 as well as S. Feldman and C. Brown, IGOR: A system for Program Debugging via Reversible Execution, ACM SIGPLAN Notices, January 1989.) In this scheme, a checkpoint consists of copying the modified pages, identified by the virtual memory mechanism, to a separate memory area or to secondary storage, and rollback consists of restoring the address space to the state in a previous checkpoint. In particular, the memory mapping unit is only used to trap to software on a write operation, at which point a full page copy is performed. Checkpoints are created by a conventional copy of each modified page and rollback is performed by copying back all pages modified since the last checkpoint, thereby incurring a significant copying cost.
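A simplified sketch of page-based checkpointing and rollback follows. It assumes the write trap is modeled by a check inside the write method and that a full page is saved on the first write after each checkpoint; the class name CheckpointedMemory is hypothetical, and the cited systems may differ in detail.

```python
class CheckpointedMemory:
    def __init__(self, num_pages, page_size):
        self.pages = [[0] * page_size for _ in range(num_pages)]
        self.page_size = page_size
        self.saved = {}  # page number -> copy of the page as of the checkpoint

    def checkpoint(self):
        # Start a new interval; pages are saved on their first write.
        self.saved = {}

    def write(self, address, datum):
        page, offset = divmod(address, self.page_size)
        if page not in self.saved:
            # Stand-in for the write trap: copy the full page before modifying it.
            self.saved[page] = list(self.pages[page])
        self.pages[page][offset] = datum

    def rollback(self):
        # Copy back every page modified since the last checkpoint.
        for page, contents in self.saved.items():
            self.pages[page] = list(contents)
        self.saved = {}


mem = CheckpointedMemory(num_pages=2, page_size=4)
mem.write(1, 11)
mem.checkpoint()
mem.write(1, 99)
mem.write(6, 66)
pages_copied = len(mem.saved)
mem.rollback()
```

Two one-word writes cause two full page copies in this example, and rollback copies both pages back, which is the copying cost noted above.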
Device I/O and Network Switching Techniques
Conventional device output is performed by operating system calls that copy the data from the application memory area to the device. In some cases, the device interface is accessed as a portion of the memory address space, allowing the processor to copy the application data directly into the device interface. In other cases, the application data are copied to a system memory area from which the data are then transferred by the hardware device controller using direct memory access (DMA). The former scheme normally requires that the processor access the device memory area as uncached memory and transfer the data one word at a time over the memory bus. The latter normally requires an extra copy of the data over the memory bus, once to transfer to the system area and a second transfer as part of the DMA operation. These copy operations impose a significant performance overhead when used with high-performance output devices.
Device input is conventionally performed by transferring the data to a system memory area and then copying the data to the application memory area as part of a system call requesting the data, thus incurring two copies of the data. The copy from the device to system memory is required because the data cannot be stored in the limited memory on the device interface, in the case of so-called programmed I/O (considering that the application may not have yet requested the data), or because the device is using DMA to transfer to memory and the system cannot easily direct the DMA into the application memory (because of paging complications or because the application has not allocated the memory yet). Again, the extra copies impose a performance overhead when used with high-performance input devices.
Some device interfaces allow the device to be mapped into the application address space and accessed directly, a graphics framebuffer being a familiar example. However, it is difficult to implement safe shared access to a device using mapping because the device memory area cannot generally be divided into page units such that the pages mapped into one application only contain data for that application. This is particularly true for network input, where the interface cannot conventionally determine which application is to receive the arriving data without substantial software intervention.
Network switching can be viewed as a special case of device I/O in which the device input is directed to the output of another device, namely the network output port. There is a broad range of literature and products in network switching. Basically, an addressed data unit or packet is received over a communication link and the receiving computer or switching device determines the output link on which to transmit this packet, and proceeds to transmit this packet on this output link. There are several basic approaches to effecting this switching.
First, using a switch with a computer processor, the processor is notified of packet arrival, examines the packet and possibly other routing information to determine the output line for the packet, and copies the packet from its input location to a transmission buffer for the output line, instigating the transmission of the data packet on the output line. With this approach, the switching and packet movement from input to output is accomplished by the processor operation according to software instructions. This scheme entails essentially device input and device output as described above, but perhaps eliminating the step of copying to and from a separate application memory area.
Second, specialized hardware can be provided that connects the input line to the output line, either statically by prior setup as with circuit switching or dynamically in response to parsing of the packet. The ATT Batcher-Banyan switching fabric is one example of a mechanism designed to be realized as such specialized hardware.
Finally, the input line device can write the input packet over a switching bus, with some designating address, such that one or more output devices recognize said address and copy the data so transmitted out on their associated transmission line. The ATT Bell Laboratories Datakit networking technology is an example of a specialized network using this approach of short bus switching, which, in the case of this particular example, uses circuit switching and one-word packets.
In summary, this prior art principally achieves a software implementation of switching or an implementation in specialized hardware. No prior art shows how to support network switching by an extension of general-purpose computer system structures, as provided herein.
Cache Consistency Techniques
Cache consistency problems arise because, with caching, there is potentially a separate copy of the data in both the cache and main memory. A mechanism is normally needed to ensure that the processor(s) and I/O devices read and update these multiple copies in a consistent fashion so that read operations return the latest data, and write operations take effect in basically the order they were issued. There may also be multiple caches to support multiple processors and multiple levels of access to memory, further complicating the consistency problem.
There are two basic established approaches to consistency. In the first approach, the cache uses the write-through cache operation described above. When the processor writes a datum to the cache, the datum is transferred through immediately to the memory system, notifying other caches holding a copy of this datum, either explicitly or implicitly (by the other caches observing this write operation on a shared bus) of the update. Other caches holding a copy of this datum either update the copy of this datum in their cache with the written datum value or else invalidate the cache line in their cache containing the copy of this datum, thereby forcing a subsequent read access to this address to retrieve the updated datum value from the next level of memory, either main memory or a lower-level cache.
The write-through cache imposes an excessive write traffic load on the memory bus and memory store and requires a high-speed so-called bus snooping mechanism to update other caches in response to these write operations. As a refinement, IBM U.S. Pat. No. 4,442,487 describes an arrangement wherein a cache is made selectively write-through by incorporating a shared bit per status control word in the cache directory that is set when the bus snooping mechanism observes another cache loading the same memory block or when it is notified on a cache line load that the memory block is already present in another cache. This approach reduces the write traffic with small caches but has been observed to incur excessive write traffic with larger cache lines because a significant amount of data is resident in multiple caches after a period of execution time.
In the second approach, the so-called write-back or copyback approach, ownership information is maintained about cache lines, either by each cache or by the memory system. When a processor writes a datum to an address, the cache first checks that it has exclusive ownership of the associated cache line before allowing the write operation to proceed. If it does have exclusive ownership, the operation proceeds as above. If the processor does not have exclusive ownership, the write operation and the processor are suspended, the cache requests exclusive ownership of the cache line from the memory system, loading the cache line if not present in the cache, and then allows the write operation and the processor to proceed once the line is present and it holds exclusive ownership. A cache may be requested to relinquish ownership of a cache line or may relinquish ownership when replacing a cache line as part of loading a second cache line. In both cases, a cache line that is modified is written back in its entirety to main memory (and/or other caches in the system) before the cache releases its exclusive ownership of the memory block. When a processor reads a datum from an address, the cache first checks that it has ownership of the corresponding cache line, thereby ensuring it has the most recent data in the cache line. If it does have ownership, the operation proceeds as above. If it does not have ownership of the cache line, the read operation and the processor are suspended, the cache requests ownership of the cache line from the memory system, loading the cache line if not present in the cache, and then allows the read operation and the processor to proceed once the line is present and it holds ownership. A read operation can either use the same exclusive ownership required for write operations or an additional shared ownership mode which allows multiple processors and caches to read the data concurrently.
The latter scheme effectively implements the well-known readers/writers algorithm used in operating systems to maintain consistency of data structures. Additional cache directory information, such as ownership and "modified" flags are added to the cache status control words in the cache directory to implement these protocols.
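The ownership protocol with a shared mode can be sketched in simplified form. The sketch assumes one value per memory block, a central directory standing in for the memory system, and an exclusive owner that gives up its line entirely (rather than downgrading to shared) when another cache reads the block. The class names Directory and Cache are hypothetical, and suspension of the processor is not modeled.

```python
class Directory:
    """Per-block ownership records kept on behalf of the memory system."""

    def __init__(self):
        self.memory = {}     # block -> value held by main memory
        self.owner = {}      # block -> cache holding exclusive ownership
        self.sharers = {}    # block -> set of caches holding shared ownership


class Cache:
    def __init__(self, directory):
        self.dir = directory
        self.lines = {}      # block -> [value, modified flag]

    def _release(self, block):
        # Relinquish ownership, writing the line back first if modified.
        value, modified = self.lines.pop(block)
        if modified:
            self.dir.memory[block] = value

    def read(self, block):
        d = self.dir
        if block not in self.lines:
            holder = d.owner.pop(block, None)
            if holder is not None:
                holder._release(block)       # exclusive owner writes back
            self.lines[block] = [d.memory.get(block, 0), False]
            d.sharers.setdefault(block, set()).add(self)
        return self.lines[block][0]

    def write(self, block, value):
        d = self.dir
        if d.owner.get(block) is not self:
            holder = d.owner.pop(block, None)
            if holder is not None:
                holder._release(block)       # previous owner writes back
            for sharer in d.sharers.pop(block, set()):
                if sharer is not self:
                    sharer.lines.pop(block, None)   # invalidate shared copies
            d.owner[block] = self
        self.lines[block] = [value, True]


d = Directory()
a = Cache(d)
b = Cache(d)
a.write(0, 5)      # a acquires exclusive ownership of block 0
seen = b.read(0)   # a writes back and releases; b acquires shared ownership
a.write(0, 6)      # b's shared copy is invalidated; a owns block 0 again
```

As in the readers/writers discipline, any number of caches may hold a block in shared mode, but a write proceeds only after all other copies have been released or invalidated.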
Write-back caches are recognized to handle infrequently shared data with less bus traffic than write-through caches but produce excessive read and write traffic when there are higher degrees of write sharing. This traffic is required to maintain consistency of the copies in the various caches and memory. The Motorola MC 68040 microprocessor provides a cache directory flag that selects either write-back or write-through behavior on a page-by-page basis.
To reduce the overhead of consistency maintenance, such as the delay in acquiring exclusive ownership of a memory block, various research reports have proposed using various "relaxed" forms of consistency mechanisms, in which a strictly sequential ownership regime is not enforced or required. For example, a computer system may only ensure that ownership transfer has been fully synchronized at the point that locks are acquired or locks are released, the so-called release consistency developed by the DASH Project at Stanford University. Even more extreme, a cache can ignore the ownership protocol altogether and simply reread cached data from memory periodically to bring it up to date, with no coordination with other caches.
In general, the prior art in logging, caching, network switching and cache consistency has addressed the narrow aspects of their problem domains, and failed to provide a uniform mechanism that is applicable to the broader range of high-performance logging, device output and cache consistency.