Field
The disclosure generally relates to multi-processor computer systems, and more specifically, to methods and systems for routing and processing data in multi-processor computer systems.
Description of the Related Art
Multi-processor computer systems allow concurrent processing of multiple parallel processes. Some applications can be parallelized efficiently among the processors in a multi-processor computer system. For instance, some applications can be parallelized by dividing different tasks into sub-processes called threads. Threads may perform operations on different data at the same time. However, one thread may sometimes need to operate on an intermediary or final output of another thread. When two threads must often wait for each other to share information, they can be said to have high data dependency. Conversely, when threads rarely need to wait for information from other threads, they can be said to have low data dependency. Applications that have low data dependency between threads are often desirable because they can process more data in parallel for longer periods of time. Nevertheless, a great number of applications have high data dependency between threads. This can occur, for example, when each piece of data must be compared to each other piece of data in a dataset. Thus, when data dependency is high, a significant portion of the dataset may need to be accessible in memory. Accordingly, for processing operations with high data dependency, the process of transferring data between threads can significantly delay computation. This delay is often exacerbated when each thread is running on a physically separate hardware node, as is common in multi-processor computer systems. In such systems, inter-node input/output (IO) operations can often constitute a significant bottleneck to the data processing rate of the system, also known as throughput. Memory hops can range from as little as 1-2 nanoseconds using non-uniform memory architecture (NUMA) in local CPU/memory sets to multiple milliseconds when accessing a storage area network (SAN) over various network fabrics. 
Because processors are often idle while they wait for data to be delivered, throughput bottlenecks can represent a significant waste of time, energy, and money.
FIG. 1 shows a multi-processor system 110 including multiple nodes 120 connected by a network 130 to each other and to a shared memory 140. Nodes 120 can be logically discrete processing components characterized by separated memory systems. In some implementations, nodes 120 can be physically discrete systems, such as servers that have local memory storage and processing capabilities. In the illustrated system 110, there are N nodes 120. Although only three nodes are shown, there may be any number of nodes 120. Each node 120 includes at least one processor 150 and a cache 160. Although only one processor 150 is shown, each node 120 can include any number of processors 150. Similarly, the processor 150 can include any number of processor cores. Processor cores represent the parts of the processor 150 that can independently read and execute instructions. Thus, in one example, two processor cores can simultaneously run two processing threads. In some implementations, node 120 can include a total of four processor cores. In some implementations, node 120 can include a total of eight or more processor cores.
Multi-processor systems such as multi-processor system 110 are typically used in operations that process vast amounts of data. For example, the US Postal Service, with a peak physical mail volume of approximately 212 billion pieces annually in 2007, is one of the world's largest users of high-volume data processing. Each physical mail piece is handled multiple times on automated equipment, and each automated event produces data scan records. Even when physical mail volumes decrease, additional tracking and performance metrics have increased the number of mail tracking scans per physical mail piece. Thus, daily mail piece scan volumes can exceed 4 billion records. Each of these records is processed by a multi-processor system such as system 110. When mail records are processed, the system detects duplicate records by comparison to billions of previous records up to many months old. The system is also responsible for finding and removing the oldest mail records when storage capacity is reached, querying mail records for report generation, and other similar tasks. This example demonstrates the magnitude of the problem of efficiently processing data records in a system such as the multi-processor system 110.
Processing in a multi-processor system can include a row insertion operation. Conventionally, the row insertion may have been performed as follows: Incoming records would be routed in parallel to nodes 120 or specific processors 150 based on a criterion such as, for example, load-balancing. For example, under one load-balancing method, the incoming records would be routed to a processor 150 chosen from a set of available processors on a round-robin basis, without considering such factors as the location of related records. Additionally, database insertion processes would be scheduled on the processors 150. Upon receiving an incoming record, a processor 150 would then search for the record in the database. The search might require accessing data not stored in the local cache 160. Such a search might extend to a storage area network (SAN). Accordingly, the processor 150 might locate the requisite data on a remote node and transfer the data over the network 130 to the local node for comparison. In some implementations, the processor 150 may compare the incoming record with every record in the database. Thus, the processor 150 would transfer a significant amount of data over the network 130 to the local node. If no matches were found, the processor 150 would insert the record into the database.
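The conventional flow described above can be illustrated with a minimal Python sketch. It is not taken from any actual system: the function names, the dictionary of per-node partitions, and the way remote reads are counted are all assumptions made for illustration only.

```python
from itertools import cycle


def route_round_robin(records, num_nodes):
    """Assign each record to a node in round-robin order, ignoring data locality."""
    nodes = cycle(range(num_nodes))
    return [(next(nodes), record) for record in records]


def insert_if_new(node_id, record, partitions):
    """Search every partition for the record, then insert it locally if absent.

    `partitions` is a hypothetical mapping of node id -> set of stored records.
    Returns (inserted, remote_reads), where remote_reads counts the records that
    had to be pulled from other nodes for the comparison.
    """
    remote_reads = 0
    for owner, stored in partitions.items():
        if owner != node_id:
            # Assumption: the whole remote partition crosses the network.
            remote_reads += len(stored)
        if record in stored:
            return False, remote_reads
    partitions[node_id].add(record)
    return True, remote_reads
```

In this sketch, the number of remote reads grows with the size of the partitions held on other nodes, which mirrors the observation that a processor comparing against every record transfers a significant amount of data over the network.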
At the same time, however, another processor 150 on another node 120 would be concurrently performing the same tasks on a different record. Thus, it is possible that two processors 150, operating on two matching records, could simultaneously attempt insertion into the same memory location. This can be referred to as a race condition, and can occur as follows: First, a first processor would determine that a first record has no match. Next, a second processor would determine that a second record has no match. Note that although the first and second records may or may not match, neither has been successfully inserted into the database yet. Subsequently, the first processor inserts the first record into the database. Finally, the second processor, having already determined that there is no record match, inserts the second record into the database. In order to ensure a race condition does not cause identical records to be inserted into the database, each processor 150 can obtain exclusive access to the insertion memory location, via a mechanism such as a lock. A number of different locking mechanisms are known in the art. Establishing and relinquishing memory locks can themselves require data transfers over the network 130. Thus, as memory blocks are locked, unlocked, and transferred back and forth over the relatively slow network 130, a significant amount of processing time can be wasted.
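The ordering of events in the race condition, and the effect of a lock, can be sketched as follows. This is an illustrative, single-machine model only: the `RecordStore` class and its methods are hypothetical names, and an in-process `threading.Lock` stands in for whatever locking mechanism a real multi-node system would use.

```python
import threading


class RecordStore:
    """Toy stand-in for the database; rows are kept in a plain list."""

    def __init__(self):
        self.rows = []
        self._lock = threading.Lock()

    def has_match(self, record):
        return record in self.rows

    def insert(self, record):
        self.rows.append(record)

    def insert_if_absent(self, record):
        """Perform the check and the insert as one step under exclusive access."""
        with self._lock:
            if record in self.rows:
                return False
            self.rows.append(record)
            return True


def unsafe_interleaving(store, first, second):
    """Replay the problematic order: both checks finish before either insert."""
    first_has_no_match = not store.has_match(first)
    second_has_no_match = not store.has_match(second)
    if first_has_no_match:
        store.insert(first)
    if second_has_no_match:
        store.insert(second)
```

Calling `unsafe_interleaving` with two matching records leaves both in the store, whereas concurrent calls to `insert_if_absent` with matching records leave only one. In a distributed setting the lock itself would require network traffic, which is the cost the passage above describes.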
The multi-processor system 110 can incorporate a number of techniques to improve efficiency and cost-effectiveness. For example, the shared memory 140 can be organized hierarchically. Hierarchical memory organization can allow the system 110 to utilize a mix of memory media with different performance and cost characteristics. Thus, the system 110 can simultaneously exploit small amounts of faster, expensive memory for high-priority tasks and large amounts of slower, cheaper memory for other tasks. Accordingly, the shared memory 140 can be physically implemented with several different storage media, which may be spread out in multiple locations. For example, the processors 150 might store infrequently used data on a relatively cheap and slow disk drive in a storage area network (SAN, not shown). At the same time, the shared memory 140 can also be partially distributed amongst the nodes 120. The caches 160 can include local copies (caches) of data in the shared memory 140. The processor 150 can locally cache the data in a relatively fast and expensive dynamic random access memory (DRAM, not shown). The DRAM can be shared with other processors on a processing module. Typically, when the processor 150 requires more data, it will first look in the local cache 160, which usually has a relatively low latency. For example, DRAM latency is typically measured in nanoseconds. If the data sought is not located in the local cache, a memory manager might have to retrieve the data from the SAN over the network 130. Because the SAN might be located far away, the memory manager might have to request the data over a relatively slow interconnect, such as Ethernet. SAN requests have much higher latency, typically measured in milliseconds. The relatively low speed of the interconnect, combined with the additional latency of slower storage media, often results in significant performance degradation when data is not found in the local cache (a “cache miss”). 
Thus, most systems attempt to keep information that is accessed frequently in the local cache.
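The cost of a cache miss in a two-tier hierarchy can be illustrated with a short sketch. The latency constants are assumed round numbers chosen only to reflect the nanosecond and millisecond orders of magnitude mentioned above, and the `TieredMemory` class is a hypothetical model rather than a description of any particular memory manager.

```python
# Assumed, illustrative latencies (not measurements from any real system).
LOCAL_CACHE_LATENCY_NS = 100        # DRAM-class access: nanoseconds
SAN_LATENCY_NS = 5_000_000          # SAN-class access: milliseconds


class TieredMemory:
    """Local cache in front of a slower backing store (modeled as a dict)."""

    def __init__(self, san):
        self.san = dict(san)
        self.cache = {}
        self.elapsed_ns = 0
        self.misses = 0

    def read(self, key):
        # The local cache is always consulted first.
        self.elapsed_ns += LOCAL_CACHE_LATENCY_NS
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fetch from the SAN and keep a local copy.
        self.misses += 1
        self.elapsed_ns += SAN_LATENCY_NS
        value = self.san[key]
        self.cache[key] = value
        return value
```

Under these assumed figures, a single miss costs as much as tens of thousands of local hits, which is why keeping frequently accessed data in the local cache matters so much.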
When a process runs on a multi-processor computer system such as system 110, it is typically scheduled to run on the next available node 120. However, the next available node 120 may not be the same node on which the process was last run. Under a hierarchical memory model as described above, the data the process has recently accessed will likely reside in a cache on the node on which the process was last run. This tendency can be called cache persistence. In order to take advantage of cache persistence in multi-processor environments, processes can be assigned an affinity to one or more processors. Processes given such affinity are preferentially scheduled to run on certain processors. Thus, affinitized processes are more likely to run on a processor that already has important process information in its local cache. However, affinity does not eliminate the problem of cache misses, particularly when applications have high data dependency between threads. Cache misses can persist in systems where the shared memory 140 is partially distributed amongst the nodes 120. One example of such a system is called a cache coherent system, which maintains consistency between the portions of the shared memory 140 that are distributed amongst the nodes 120. In a cache coherent system, for example, an affinitized process may be programmed to compare incoming data to data previously processed on another node 120. The affinitized process may also be programmed to modify that data. In order to maintain memory consistency, the data is typically transferred between nodes 120. Thus, even though much of the data to be processed may be contained in the local cache 160, the data transfer between nodes 120 due to high data dependency can still represent a significant throughput bottleneck.
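The difference between next-available scheduling and affinitized scheduling can be modeled with a small simulation. The sketch below is an assumption-laden illustration: the function names are hypothetical, a run is treated as "warm" only when it lands on the node where the same process ran last, and no real scheduler or operating system affinity interface is involved.

```python
def count_warm_runs(placements):
    """Count runs that land on the node where the same process last ran."""
    last_node = {}
    warm = 0
    for process, node in placements:
        if last_node.get(process) == node:
            warm += 1
        last_node[process] = node
    return warm


def next_available(processes, num_nodes, runs):
    """Place each run on whichever node comes up next, ignoring history."""
    placements = []
    slot = 0
    for _ in range(runs):
        for process in processes:
            placements.append((process, slot % num_nodes))
            slot += 1
    return placements


def affinitized(processes, num_nodes, runs):
    """Pin each process to a single home node for all of its runs."""
    home = {p: i % num_nodes for i, p in enumerate(processes)}
    return [(p, home[p]) for _ in range(runs) for p in processes]
```

With two processes, three nodes, and three runs each, the next-available model in this sketch never returns a process to its previous node, while the affinitized model does so on every run after the first. The sketch does not capture inter-node data dependency, which is the reason affinity alone does not remove the bottleneck.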
Typically, systems such as the USPS mail system described above are already using the fastest hardware practicable. Thus, it is not feasible to clear the throughput bottleneck with, for example, a faster network 130. Similarly, because the bottleneck occurs between nodes 120, adding additional nodes will not provide the desired increase in throughput. At the same time, it is not typically a viable option to decrease the rate of incoming data. For example, it is probably not acceptable for the Post Office to delay the mail, or associated reporting, to accommodate computer bottlenecks. Within such systems, the locality of memory is dominated by its “electron distance,” or the distance an electron would have to travel over an electrical path in order to reach the memory. For example, a processor 150 accessing a local cache 160 could have an “electron distance” on the order of millimeters. On the other hand, a processor 150 accessing memory located on another node 120 or over a SAN could have an “electron distance” on the order of meters. Accordingly, it is desirable to resolve the bottleneck at a system-architecture level. In attempting to solve this problem, others have attributed a throughput limit to the need for remote data access. However, systems and methods described herein are capable of addressing this remote data access bottleneck in an unanticipated manner.