Producer-consumer relationship is at the heart of many computing systems. In a producer-consumer (P-C) model, there are different entities or processes (i.e., different producer-consumer pairs) that operate on the same data one after another, in a chain-like fashion, with each entity/process performing a different functionality (“functionality” as used herein refers to how a computer system manages transactions based on various settings or parameters). As a result, data is transferred between the different processes. In such a relationship, a producer thread calls a “producer method” to generate one or more data elements and place the data elements into a region of memory shared between a producer thread and a consumer thread. A consumer thread calls a “consumer method” to read the data elements and “consume” the data elements. A data element may represent a pointer to the area where the processed data items are stored in main memory. The consumer method accesses a pointer and performs multiple address translations in order to access data items in memory shared between the producer and consumer. The following example illustrates how producer-consumer relationship works to process data packets. In a networked storage server, an incoming data packet typically goes through the following processing layers: Ethernet driver→TCP/IP Layer→Network File System (NFS) protocol. In a producer-consumer model, there are at least three different threads of execution for the three functionalities (e.g., Ethernet driver processing, TCP/IP processing, and NFS processing). In order to receive the incoming request, the Ethernet driver acts as a consumer to a network interface card (NIC), which acts as the producer (e.g., it produces data elements, which the Ethernet driver consumes). Next, the Ethernet driver acts as a producer to a TCP/IP stack, which consumes data elements produced by the Ethernet driver. As a request traverses up the network stack, TCP/IP acts as a producer to the higher layer protocols (such as NFS or CIFS), which act as consumers, and so forth. Since the movement of data between the threads of execution is an integral functionality, efficiency of the producer-consumer communication is critical to the performance of a storage system or any other embedded system (i.e., a special purpose computer system designed to perform one or more dedicated functions). In contrast, in a non-producer-consumer model, the functionalities of different entities/processes are all carried out by a single process. Such a process first picks up a data packet from the network using the Ethernet driver functionality, then performs TCP/IP processing, and then performs NFS processing successively without much parallelism in processing.
Multi-core systems are widely used to process data packets. A processor core refers to a complete processing unit (registers, Arithmetic Logic Unit (ALU), Memory Mapping Unit (MMU), cache memories, etc), several of which may be co-located on a single chip (die/socket). The number of cores on a socket is product specific. For example, some of the products by Intel Corporation, of Santa Clara, Calif., have dual-core, quad-core processors etc.
A multi-core system combines two or more independent processor cores into a single package composed of a single integrated circuit (IC), called a die, or more dies packaged together. Typically, CPU cores are equipped with fast on-chip multi-level caches. For example, a CPU core may include two on-chip caches L1 and L2 for both data and instructions. L2 is generally much larger than L1, but has access times much slower than that of L1. In addition to these on-chip caches, the CPU cores might have a third-level larger L3 cache.
A multi-core processor implements multi-processing in a single physical package. In a multi-core environment, each of the producer and consumer processes may run on a different core, thereby providing several advantages. One of the advantages of executing each of the producer and consumer threads on a different core enables parallelism between the consumer and producer threads so that more than one process can be executed at the same time. Furthermore, running producer and consumer processes on different cores may eliminate context switching overhead between the producer and consumer processes, which would be the case if they were to run on the same core. As is known in the art, a process is an instance of a computer program that is being sequentially executed. Context switching is performed when a process is loaded into a processor. Execution context information for each process may include data loaded into CPU registers, memory mapping information associated with the process (such as memory page tables), and/or other information related to a process.
As discussed above, when producer and consumer processes are executed on different cores and communicate over a shared memory mechanism, the producer process writes to some locations in the shared memory region and the consumer process reads from those locations in the shared memory. Typically, a process is executed in a virtual address space created for that process. All processes use the memory mapping information available as part of its execution context to do the translation from virtual to physical addresses. Such a translation is done by the process on memory access using special hardware mechanism called a memory mapping unit (MMU) (not shown in Figures). However, to use the appropriate memory mapping translation tables for a process, the MMU needs to be loaded (programmed) with the appropriate address of the starting location of the memory mapping table. This address is usually part of the process' context maintained by the operating system.
In a shared memory based producer-consumer communication, the producer and consumer processes may not have mapped to the shared memory at the same offset in their respective virtual address spaces. In this case, the virtual addresses need to be translated between the producer and consumer processes. The addresses pointed to by the producer process need to be in a form that is understood by the consumer process. Since the producer process is only executed in a virtual address space and can understand virtual addresses, these virtual addresses cannot be passed directly to the consumer process because the consumer process cannot translate the producer's virtual address to the physical address. According to one communication mechanism, the producer process passes relative addresses of the pointers to the consumer process. According to another communication mechanism for passing addresses, a producer finds appropriate location in the consumer's address space where the memory is mapped and sends addresses relative to the start of the mapped region. The passed pointers are relative to the start of the memory region shared between a consumer process and a producer process. The consumer process is entrusted to convert the relative addresses to the appropriate virtual addresses (based on where the shared memory region is mapped in its virtual address space) before accessing the data.
As noted earlier, before the consumer process can access data in memory, it needs to perform multiple translations of virtual to physical addresses. Such a translation entails multiple memory lookups depending on the processor architecture (e.g., 32 bit or 64-bit) and the size of the address. For example, for 64-bit architectures, multiple levels of page tables are accessed before the final translation can be done. The entries corresponding to each level of page tables need to be accessed and cached. After performing virtual-to-physical address translation, once the consumer thread accesses the data itself, there would be a compulsory miss in level 1 (L1) cache at the core on which the consumer thread is executed, since data elements produced by the producer process are cached at a core where the producer process is executed. At that time, the data item is fetched from further down in the memory hierarchy (e.g., main memory). Multiple translations (commonly referred to as pointer “swizzling”) thus require extensive memory accesses as a result of compulsory cache misses. Compulsory cache misses hurt the efficiency of the producer-consumer communication in a multi-core system. This, in turn, impacts overall system performance.
Accordingly, what is a needed is a mechanism that reduces existing inefficiencies of producer-consumer communication mechanism in multi-core systems.