A fundamental issue in data locking is the need to synchronize certain code paths in a kernel. These code paths, called critical sections, require protection against concurrent or re-entrant execution, together with proper ordering with respect to other events occurring in the kernel. The typical result of missing or improper locking is a race condition. Even a simple operation such as the C++ expression “i++” (incrementing the variable “i”) is dangerous if the variable is shared. Consider the common case of a multi-processor system in which one processor reads i, then another processor reads i, then both processors increment the value they read, and then both processors write the result back to memory. If i was originally 2, it should now be 4, but in this case i is in fact 3, because the second write overwrites the first.
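The lost update described above can be reproduced deterministically by modeling each processor as a private register and replaying the read, increment, and write steps in a fixed order. The following is a minimal single-threaded sketch in C, not kernel code; the names struct cpu, cpu_load, cpu_inc, cpu_store, racy_increment_twice and serialized_increment_twice are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one processor: a private register that holds
 * the value it last read from shared memory. */
struct cpu { int reg; };

static void cpu_load(struct cpu *c, const int *mem)
{
    assert(c != NULL && mem != NULL);
    c->reg = *mem;              /* read i */
}

static void cpu_inc(struct cpu *c)
{
    c->reg += 1;                /* increment the private copy */
}

static void cpu_store(const struct cpu *c, int *mem)
{
    assert(c != NULL && mem != NULL);
    *mem = c->reg;              /* write i back to memory */
}

/* Interleaving from the text: both processors read before either writes,
 * so one of the two increments is lost. */
int racy_increment_twice(int start)
{
    int i = start;
    struct cpu a, b;

    cpu_load(&a, &i);
    cpu_load(&b, &i);
    cpu_inc(&a);
    cpu_inc(&b);
    cpu_store(&a, &i);
    cpu_store(&b, &i);          /* overwrites the value stored by a */
    return i;
}

/* Same work with the two read-modify-write sequences serialized, which is
 * the ordering a lock around the critical section would enforce. */
int serialized_increment_twice(int start)
{
    int i = start;
    struct cpu a, b;

    cpu_load(&a, &i);
    cpu_inc(&a);
    cpu_store(&a, &i);

    cpu_load(&b, &i);
    cpu_inc(&b);
    cpu_store(&b, &i);
    return i;
}
```

Starting from 2, racy_increment_twice returns 3 and serialized_increment_twice returns 4, matching the values in the example above.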
Operating system kernels, such as the Linux kernel, provide two kinds of locks, read-locks and write-locks, used by readers and writers, respectively. Typically, multiple threads can safely read data concurrently, as long as nothing modifies the data while it is being read. Therefore, there can be multiple concurrent readers (each holding a read-lock), but only a single writer (holding a write-lock), and no reader may run concurrently with that writer.
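The admission rules above (many readers, or exactly one writer and no readers) can be sketched with a single atomic lock word. This is an illustrative sketch using C11 atomics, not the Linux kernel implementation; the type rw_sketch_t, its encoding of the lock word, and all function names are assumptions made for the example. Only non-blocking "try" operations are shown so that the rules can be checked from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word encoding (an assumption of this sketch):
 *    0  -> free
 *   n>0 -> held by n readers
 *   -1  -> held by one writer */
typedef struct { atomic_int state; } rw_sketch_t;

void rw_sketch_init(rw_sketch_t *l)
{
    atomic_init(&l->state, 0);
}

/* A reader is admitted as long as no writer holds the lock. */
bool rw_sketch_try_read_lock(rw_sketch_t *l)
{
    int s = atomic_load(&l->state);
    while (s >= 0) {
        /* On failure, s is refreshed with the current lock word. */
        if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return true;
    }
    return false;               /* a writer holds the lock */
}

void rw_sketch_read_unlock(rw_sketch_t *l)
{
    int prev = atomic_fetch_sub(&l->state, 1);
    assert(prev > 0);           /* caller must hold a read-lock */
    (void)prev;
}

/* A writer is admitted only when there are no readers and no writer. */
bool rw_sketch_try_write_lock(rw_sketch_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

void rw_sketch_write_unlock(rw_sketch_t *l)
{
    int prev = atomic_exchange(&l->state, 0);
    assert(prev == -1);         /* caller must hold the write-lock */
    (void)prev;
}
```

With this encoding, two successive rw_sketch_try_read_lock calls both succeed, a rw_sketch_try_write_lock call fails until both readers have unlocked, and while the write-lock is held every read or write attempt fails.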
Traditionally, read-write locks are implemented in Linux using spinlocks. Read-write locks scale well in terms of performance in a “mostly-readers” case, wherein data structures in critical sections are infrequently modified. On the other hand, when a writer wants to modify data structures in a critical section, a write-lock is taken, and all readers and subsequent writers are blocked until the write-lock is relinquished. When there is a moderate to high number of writers, known as a “mostly-writers” case, the read-write lock implementation performs very poorly due to writer starvation, wherein writers waiting for a write-lock have to wait until all the readers relinquish the lock. In short, read-write locks scale well only in the “mostly-readers” case, at the cost of poor performance when writers are present.
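Writer starvation can be illustrated with a small deterministic model. The sketch below is a simulation in C, not a lock implementation, and rests on stated assumptions: time is divided into rounds, one new reader arrives at the start of each of the first `arrivals` rounds, each reader holds the lock for `hold` rounds, new readers are admitted even while a writer is waiting, and the writer can take the lock only in a round with no active readers. The function name writer_acquire_round and its parameters are hypothetical.

```c
#include <assert.h>

/* Returns the first round in which the waiting writer can acquire the
 * lock, or -1 if it is still waiting after max_rounds rounds.
 *
 * Model assumptions (hypothetical, for illustration only):
 *  - reader r arrives at round r, for r = 0 .. arrivals - 1;
 *  - reader r holds the lock during rounds r .. r + hold - 1;
 *  - the writer needs a round with zero active readers. */
int writer_acquire_round(int arrivals, int hold, int max_rounds)
{
    assert(arrivals >= 0 && hold >= 1 && max_rounds >= 0);

    for (int round = 0; round < max_rounds; round++) {
        int readers = 0;

        /* Count the readers whose hold interval covers this round. */
        for (int r = 0; r < arrivals; r++) {
            if (r <= round && round < r + hold)
                readers++;
        }

        if (readers == 0)
            return round;       /* all readers have relinquished the lock */
    }
    return -1;                  /* writer starved for the whole run */
}
```

Under these assumptions, writer_acquire_round(5, 3, 100) returns 7: the writer waits until the last of the five overlapping readers has finished. With readers arriving in every round, as in writer_acquire_round(100, 3, 100), the reader count never drops to zero and the function returns -1.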
Additionally, the read-write lock implemented in a standard Linux kernel is not optimized for cache access and causes many cache invalidations while spinning, especially when writers are also present. This high incidence of cache invalidations is mainly attributed to the fact that even blocked, spinning readers and writers keep modifying the cache to check whether the lock has been released. As is known in the field, checking whether the lock has been released involves performing a write operation on the lock and then testing whether the lock is available. This write invalidates the cached copy of the memory associated with the lock on every CPU accessing that memory. This behavior has become a significant source of performance degradation in newer Intel processors, such as a Sandy Bridge processor, which has the latest cache-sync protocols and other hardware optimizations for cache behavior. These cache invalidations degrade performance even further when per-CPU locks are used in multi-core architectures like the Sandy Bridge (with 2 sockets of 8 cores each), because a dramatic increase in the number of spinning locks results in more frequent cache modifications. Furthermore, this increase makes read-write locks even harder to acquire.
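The write-on-every-check behavior described above can be sketched as a spin loop in which each probe of the lock is an atomic exchange. This is a simplified illustration using C11 atomics, not the Linux kernel code; spin_sketch_t, its rmw_count instrumentation field, and the bounded spin are assumptions added so that the number of writes issued while the lock is held can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  locked;     /* 0 = free, 1 = held */
    atomic_long rmw_count;  /* hypothetical instrumentation: number of
                               write operations issued on the lock word */
} spin_sketch_t;

void spin_sketch_init(spin_sketch_t *l)
{
    atomic_init(&l->locked, 0);
    atomic_init(&l->rmw_count, 0);
}

/* Tries to take the lock, spinning at most max_spins times.
 * Every probe is an atomic exchange, that is, a write to the lock word,
 * even when the lock is already held and its value does not change.
 * On a multi-processor system each such write forces the other CPUs to
 * drop their cached copy of the lock's cache line. */
bool spin_sketch_lock_bounded(spin_sketch_t *l, long max_spins)
{
    assert(max_spins >= 0);

    for (long n = 0; n < max_spins; n++) {
        atomic_fetch_add(&l->rmw_count, 1);
        if (atomic_exchange(&l->locked, 1) == 0)
            return true;    /* lock was free and is now ours */
    }
    return false;           /* still held after max_spins probes */
}

void spin_sketch_unlock(spin_sketch_t *l)
{
    atomic_store(&l->locked, 0);
}

long spin_sketch_writes(spin_sketch_t *l)
{
    return atomic_load(&l->rmw_count);
}
```

In this sketch, an uncontended acquisition costs one write to the lock word, while a waiter that spins 1000 times on a held lock issues 1000 further writes without ever acquiring it.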
There is therefore a need for an optimization of spinlock-based read-write locks that scales well in terms of performance in both mostly-readers and mostly-writers cases, and also optimizes utilization of the CPU cache to reduce cache invalidations, as compared to conventional techniques.