1. Field of the Invention
The present invention relates to computer systems and methods in which list data are shared by software running concurrently on one or more processors. More particularly, the invention concerns an improved system and method that allows lock-free lookups of list elements while efficiently permitting concurrent update operations in which list elements are moved from one list to another.
2. Description of the Prior Art
By way of background, shared data elements that are members of a linked list sometimes need to be moved from one list to another while maintaining consistency for the benefit of data consumers who may be concurrently performing lookups on the same data. This situation arises in the context of in-memory file system tree images used by operating systems to perform file name lookups for locating files maintained on block storage devices. When a file's name is changed and/or the file is moved from one directory to another (referred to as a “rename” operation), its corresponding entry in the file system tree image will often move between lists. For example, in a typical directory entry cache, directory entry elements (representing files) are assigned to doubly-linked circular directory lists. Each such list is headed by a parent directory entry whose files are represented by the directory entries in the list. Relocating a file from one directory to another will cause its directory entry to move from one directory list to another. Similarly, in a directory entry hash table, directory entries are assigned to hash chains (lists) according to a hash algorithm based on their name and the name of their parent directory. A directory entry will typically move from one hash chain to another whenever its file's name is changed or the file is relocated to another directory.
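The hash chain assignment described above can be illustrated with a minimal sketch. It is not taken from any particular operating system: the names dentry_hash, dentry_bucket, and NBUCKETS are hypothetical, the parent directory is represented by an assumed numeric identifier rather than its name, and an FNV-1a-style hash stands in for whatever hash algorithm a real directory entry cache uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 1024u /* hypothetical number of hash chains */

/*
 * FNV-1a-style hash of a file name, seeded with an identifier for the
 * parent directory (an assumption made for this sketch).
 */
static uint32_t dentry_hash(uint64_t parent_id, const char *name)
{
    uint32_t h = 2166136261u ^ (uint32_t)(parent_id ^ (parent_id >> 32));

    assert(name != NULL);
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h ^= *p;          /* mix in one byte of the name */
        h *= 16777619u;   /* FNV prime */
    }
    return h;
}

/* Index of the hash chain that holds the entry (parent_id, name). */
static size_t dentry_bucket(uint64_t parent_id, const char *name)
{
    return dentry_hash(parent_id, name) % NBUCKETS;
}
```

Because the chain index depends on both the parent and the name, changing either one during a rename will usually select a different chain, which is why the directory entry has to move between lists.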
Techniques must be used to perform these list operations without impacting readers who may be concurrently performing lookups on the same file. Moreover, in computing environments conforming to the POSIX (Portable Operating System Interface) standard, the list manipulations must be performed atomically. This atomicity requirement is illustrated in the context of the POSIX rename( ) system call by considering the situation where the rename( ) operation races with concurrent lookups of the old file name and the new file name. If a lookup of the new name succeeds, then every subsequent lookup of the old name must fail. Similarly, if a lookup of the old name fails, then every subsequent lookup of the new name must succeed. Note that a “subsequent” lookup must start after a preceding lookup completes. This is summarized in the following table, in which the term “failed” signifies a failure to open the file being renamed:
TABLE 1: POSIX rename( ) atomicity conditions for Rename (“old”, “new”)
If open (“old”) failed,            | If open (“new”) succeeded,
Then open (“new”) must succeed.    | Then open (“old”) must fail.
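The two conditions in Table 1 can be restated as predicates over a pair of sequential lookups, where the second lookup starts only after the first completes. The following is a minimal sketch for illustration only; the type and function names are hypothetical.

```c
#include <stdbool.h>

/* Outcome of an attempt to open the file being renamed. */
enum lookup_result { LOOKUP_FAILED, LOOKUP_SUCCEEDED };

/*
 * First column of Table 1: if open("old") failed, a subsequent
 * open("new") must succeed. Returns true when the pair is allowed.
 */
static bool old_then_new_ok(enum lookup_result old_first,
                            enum lookup_result new_second)
{
    return !(old_first == LOOKUP_FAILED && new_second == LOOKUP_FAILED);
}

/*
 * Second column of Table 1: if open("new") succeeded, a subsequent
 * open("old") must fail. Returns true when the pair is allowed.
 */
static bool new_then_old_ok(enum lookup_result new_first,
                            enum lookup_result old_second)
{
    return !(new_first == LOOKUP_SUCCEEDED && old_second == LOOKUP_SUCCEEDED);
}
```

Each predicate rejects exactly one combination of outcomes; every other combination is consistent with an atomic rename.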
The atomicity requirements for the POSIX rename( ) system call are the same whether a file is being renamed to a new name or on top of a pre-existing file. In the latter case, an “early” attempt to open the new filename (i.e., before the rename( ) operation returns) will fail to open the renamed file, but will instead open the pre-existing file. This race condition is in all ways equivalent to the race condition where the file is being renamed to a new name. Therefore, for simplicity, the ensuing discussion will consider only the case where a file is renamed to a new name.
There are a number of prior-art algorithms that permit atomic rename( ) operations by relying on locks held during the lookup operations. This is undesirable because directory cache lookups are extremely common, and such operations should be lock-free if possible. There are also lock-free synchronization techniques that provide the desired semantics and avoid locking in the lookups. However, rename( ) operations implemented with these techniques are extremely costly, requiring duplication of the entire data structure (which for a hash table can contain hundreds of thousands of elements, even on small desktop systems). Furthermore, even though the lookups are lock-free, they use atomic operations that perform write operations, thereby inflicting costly cache misses on lookups running on other processors.
Another mutual exclusion technique, known as read-copy update, permits shared data to be accessed for reading without the use of locks, writes to shared memory, memory barriers, atomic instructions, or other computationally expensive synchronization mechanisms, while still permitting the data to be updated concurrently. The technique is well suited to multiprocessor computing environments in which the number of read operations (readers) accessing a shared data set is large in comparison to the number of update operations (updaters), and wherein the overhead cost of employing other mutual exclusion techniques (such as locks) for each read operation would be high.
The read-copy update technique implements data updates in two phases. In the first (initial update) phase, the actual data update is carried out in a manner that temporarily preserves two views of the data being updated. One view is the old (pre-update) data state that is maintained for the benefit of operations that may be currently referencing the data. The other view is the new (post-update) data state that is available for the benefit of operations that access the data following the update. In the second (deferred update) phase, the old data state is removed following a “grace period” that is long enough to ensure that all executing operations will no longer maintain references to the pre-update data.
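The two phases can be sketched in user-space C11 as follows. This is a simplified illustration under stated assumptions, not an actual read-copy update implementation: a single updater is assumed, the names shared_item, reader_lookup, update_publish, and reclaim_after_grace_period are hypothetical, and grace period detection is outside the sketch (the caller is assumed to know when no reader can still hold the old pointer).

```c
#include <stdatomic.h>
#include <stdlib.h>

struct item { int value; };

/* Pointer through which readers reach the current version. */
static _Atomic(struct item *) shared_item;

/* Reader: a single load, with no locks and no writes to shared memory. */
static const struct item *reader_lookup(void)
{
    return atomic_load_explicit(&shared_item, memory_order_acquire);
}

/*
 * Phase 1 (initial update): copy the current version, modify the copy,
 * and publish it. The old version is handed back through old_out and
 * stays intact for readers that already hold a reference to it.
 * Returns 0 on success, -1 if allocation fails.
 */
static int update_publish(int new_value, struct item **old_out)
{
    struct item *cur = atomic_load_explicit(&shared_item, memory_order_relaxed);
    struct item *fresh = malloc(sizeof *fresh);

    if (fresh == NULL)
        return -1;
    if (cur != NULL)
        *fresh = *cur;            /* start from the pre-update state */
    fresh->value = new_value;     /* apply the modification to the copy */
    *old_out = atomic_exchange_explicit(&shared_item, fresh,
                                        memory_order_acq_rel);
    return 0;
}

/*
 * Phase 2 (deferred update): free the old version. The caller must
 * invoke this only after a grace period has elapsed.
 */
static void reclaim_after_grace_period(struct item *old)
{
    free(old);
}
```

Between the two phases both views coexist: a reader that fetched the pointer before the exchange still sees the old value, while any reader that starts afterward sees the new one.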
Traditional read-copy-update manipulation of list data leaves the old data element in place in the list, creates a new copy with the desired modifications, and then atomically inserts the new copy in place of the old element into the same list. This is impractical for the POSIX rename( ) operation. Here, the old element must be atomically removed and a new element inserted, not necessarily in the same place that the old one occupied, but likely into a different list. File system operations further complicate traditional read-copy update due to the existence of long-lived references to the old list element (directory entry representing the file) that is to be removed following a grace period. It is often difficult or even infeasible to determine where these references are located, because many different parts of an operating system kernel or of dynamically loaded kernel modules might at any time acquire a reference to the list element. Thus, there is no effective method for tracking down all the possible references to the old element.
A possible work-around would be to have read-copy update duplicate an entire file system tree data structure, apply the change to the duplicate, and atomically replace the original with the new copy by switching pointers. However, as in the case of lock-free synchronization, this latter approach is hopelessly inefficient for directories containing large numbers of files, and is even less well suited to systems that maintain a hash table to cache filename/directory mappings. As stated, it is not unusual for even small desktop machines to cache more than 100,000 such mappings. Making a new duplicate copy of this table for each rename( ) operation is clearly undesirable. Another alternative, creating a copy of a single hash chain, is not feasible because the rename( ) operation will normally move a directory entry to some other hash chain. It is also not possible to atomically create a copy of only the affected pair of hash chains with the instructions available on commodity microprocessors.
In sum, given current commodity microprocessor instruction sets, along with the undesirability of duplicating large list structures, it is not practical to atomically move an element from one list to another using traditional read-copy update techniques. If the POSIX rename( ) operation is not performed atomically, there will be a short but non-zero duration when the renamed directory entry will not be on any list. This time duration can be expanded by interrupts, ECC (Error Correction Code) errors in memory or caches, or by many other events that can occur in current microprocessors and operating systems. In a multiprocessor system, it is possible that some other process might be able to perform a lookup on the old name followed by the new name during this time interval and observe both failing, thus violating the required POSIX semantics as shown in the first column of Table 1.
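The window in which the entry is on no list can be made concrete with a small single-threaded sketch. It uses singly-linked chains for brevity rather than the doubly-linked circular lists mentioned earlier, and all names (struct entry, chain_lookup, naive_rename_unlink, naive_rename_relink) are hypothetical. The naive rename is deliberately split into its two steps so that the state between them can be observed.

```c
#include <stddef.h>
#include <string.h>

struct entry {
    const char *name;
    struct entry *next;
};

/* Return the entry called name on this chain, or NULL if absent. */
static struct entry *chain_lookup(struct entry *head, const char *name)
{
    for (struct entry *e = head; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Unlink victim from the chain if it is present. */
static void chain_remove(struct entry **head, struct entry *victim)
{
    for (struct entry **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            victim->next = NULL;
            return;
        }
    }
}

/* Link e at the front of the chain. */
static void chain_insert(struct entry **head, struct entry *e)
{
    e->next = *head;
    *head = e;
}

/* Step 1 of a non-atomic rename: leave the old chain and change names. */
static void naive_rename_unlink(struct entry **old_chain, struct entry *e,
                                const char *new_name)
{
    chain_remove(old_chain, e);
    e->name = new_name;
}

/* Step 2 of a non-atomic rename: join the new chain. */
static void naive_rename_relink(struct entry **new_chain, struct entry *e)
{
    chain_insert(new_chain, e);
}
```

Any lookup that runs after naive_rename_unlink and before naive_rename_relink finds the file under neither the old name nor the new name, which is the inconsistent state that an atomic move must rule out.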
Accordingly, a need exists for an efficient lock-free technique for atomically moving shared list elements from one list to another. It would be particularly desirable to provide a solution to the foregoing problem using existing aspects of the conventional read-copy update technique but with modifications thereto to facilitate inter-list movement of list elements with the required atomicity.