This description relates to computing in parallel processing environments.
FPGAs (Field Programmable Gate Arrays) and ASICs (Application Specific Integrated Circuits) are two exemplary approaches for implementing customized logic circuits. The cost of building an ASIC includes the cost of verification, the cost of physical design and timing closure, and the NRE (non-recurring engineering) costs of creating mask sets and fabricating the ICs. Due to the increasing costs of building an ASIC, FPGAs have become increasingly popular. Unlike an ASIC, an FPGA is reprogrammable in that it can be reconfigured for each application. Similarly, as protocols change, an FPGA design can be changed even after the design has been shipped to customers, much like software can be updated. However, FPGAs are typically more expensive, often costing 10 to 100 times more than an ASIC. FPGAs also typically consume more power than an ASIC performing comparable functions, and their performance can be 10 to 20 times worse than that of an ASIC.
Multicore systems (e.g., tiled processors) use parallel processing to achieve some features of both ASICs and FPGAs. For example, some multicore systems are power efficient like an ASIC because they use custom logic for some functions, and reconfigurable like FPGAs because they are programmable in software.
Software defined networking allows the network data plane to be implemented in an external server. The forwarding plane, sometimes called the data plane, defines the part of the router architecture that decides how to forward packets arriving on an inbound interface.
Modern servers use virtual functions that are implemented in add-in accelerator cards to serve multiple virtual machines (VMs) running on a single host processor. This topology is commonly referred to as SR-IOV (single root I/O virtualization).
When many packet flows target a single output port, bandwidth management is employed to implement quality of service and policy guarantees.
When multiple processors communicate across a fabric such as PCI Express or Ethernet, the processors typically use shared first in first out memory devices (FIFOs) to send messages. These FIFOs require mutual exclusion (MUTEX) locks to support many-to-one, one-to-many, or many-to-many transfers. The MUTEX locks allow the sender to check for full access, acquire a slot to send messages, and write an entry to a destination without another sender interfering with the transfer. Similarly, a receiver may require a MUTEX lock in order to check for a non-empty FIFO and grab the next valid entry. MUTEX locks can become bottlenecks in high performance systems because they require exclusivity: only a single sender or receiver can be performing a transfer at a given time, and all others are required to wait. This waiting by agents to acquire a lock is commonly referred to as “spinning.”
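The locking pattern described above can be illustrated with a minimal software sketch. It is only a model under stated assumptions: Python threads stand in for the communicating processors, and the names `LockedFifo`, `try_send`, `try_receive`, and `run_demo` are hypothetical and do not come from this description.

```python
import threading
import time
from collections import deque


class LockedFifo:
    """Bounded FIFO shared by many senders and receivers (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.lock = threading.Lock()  # the MUTEX guarding every access

    def try_send(self, message):
        # Only one agent can hold the lock; every other agent waits here.
        with self.lock:
            if len(self.entries) >= self.capacity:
                return False  # FIFO is full, the caller must retry
            self.entries.append(message)
            return True

    def try_receive(self):
        with self.lock:
            if not self.entries:
                return None  # FIFO is empty
            return self.entries.popleft()


def run_demo(num_senders=4, messages_per_sender=100, capacity=8):
    """Many-to-one transfer: several senders, one receiver."""
    fifo = LockedFifo(capacity)
    received = []
    total = num_senders * messages_per_sender

    def sender(sender_id):
        for seq in range(messages_per_sender):
            while not fifo.try_send((sender_id, seq)):
                time.sleep(0)  # yield, then retry until a slot frees up

    def receiver():
        while len(received) < total:
            item = fifo.try_receive()
            if item is None:
                time.sleep(0)  # nothing to read yet
            else:
                received.append(item)

    threads = [threading.Thread(target=sender, args=(i,))
               for i in range(num_senders)]
    threads.append(threading.Thread(target=receiver))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Every call to `try_send` or `try_receive` serializes on the single lock, so adding senders does not add throughput in this model; that serialization is the bottleneck the paragraph above describes.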
Complex digital integrated circuits (ICs) require precise coordination of the timing among many different paths in order to function correctly, especially at relatively high clock frequencies. In modern integrated circuit processing technologies, e.g., 40 nm, 28 nm, and 22 nm process generations, there can be significant process variability among transistors and conductors on circuit paths, which affects the relative timing of clock and data signals. Such process variations can limit the maximum clock frequency of an IC and/or, in some cases, cause functional errors during operation.
A Load instruction tells a processor core to take the memory address in one register and load the value stored at that memory location into a second register, the destination register. The cache or memory system can take from one to hundreds or more clock cycles to return the value from memory to the processor core. To avoid stalling during that time, the processor core marks the destination register of the load as not-ready until the value is returned from memory. This is done by keeping a ready bit for each register. The processor core continues to execute instructions. If an instruction tries to use a register that is marked as not-ready, the processor core stalls until the ready bit for that register is changed to ready, indicating that the value was returned from memory. Processors use different methods to avoid stalling in this case. For example, out-of-order processors with compilers find other instructions to execute where the input registers are ready and run those. This uses more hardware than an in-order processor. Another technique is speculative execution, in which the processor switches into a speculative mode instead of stalling and speculatively executes instructions, but does not change the state stored in any registers until the processor commits the results of the speculative execution.
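The ready-bit mechanism described above can be sketched as a small cycle-counting model of an in-order core. This is a simplified illustration, not a description of any particular processor: the instruction tuple format, the function name `simulate`, and the choice to also stall when the destination register is still awaiting a load are all assumptions of the sketch.

```python
def simulate(program, num_regs=8):
    """Return (total_cycles, stall_cycles) for an in-order core model.

    Each instruction is a tuple (op, dst, srcs, latency), where op is
    "load" or "alu", dst is the destination register index, srcs is a
    tuple of source register indices, and latency is the number of
    cycles a load takes to return its value (ignored for "alu").
    """
    ready = [True] * num_regs  # one ready bit per register
    pending = {}               # register -> cycle when its load returns
    cycle = 0
    stalls = 0

    for op, dst, srcs, latency in program:
        while True:
            # Loads whose value has come back mark their register ready.
            for reg, due in list(pending.items()):
                if due <= cycle:
                    ready[reg] = True
                    del pending[reg]
            # Assumption of this sketch: the destination is checked too,
            # so an outstanding load is never overwritten.
            if all(ready[r] for r in (*srcs, dst)):
                break
            cycle += 1   # a needed register is not-ready: stall
            stalls += 1

        if op == "load":
            ready[dst] = False              # mark destination not-ready
            pending[dst] = cycle + latency  # value arrives later
        cycle += 1  # the instruction issues this cycle

    return cycle, stalls
```

For example, a three-cycle load into register 1 followed directly by an instruction that reads register 1 stalls for two cycles in this model, while placing one independent instruction between them reduces the stall to one cycle. The core keeps working on instructions that do not need the not-ready register.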
With most common shared memory multiprocessor memory ordering models, when a processor core X writes a memory location M, processor core X is permitted to observe its own write to memory location M before other processors observe the write operation to the memory location M.
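One mechanism that commonly produces this behavior is a per-core store buffer from which the core forwards its own pending writes to its own reads. The description above does not name a mechanism, so the following is only an illustrative model; the class `Core` and its methods `write`, `read`, and `drain` are hypothetical.

```python
class Core:
    """Processor core with a private store buffer (hypothetical model)."""

    def __init__(self, shared_memory):
        self.memory = shared_memory  # dict shared by all cores
        self.store_buffer = []       # pending (address, value) writes

    def write(self, address, value):
        # The write is buffered locally and is not yet globally visible.
        self.store_buffer.append((address, value))

    def read(self, address):
        # Forward from the youngest matching buffered write, if any.
        for buffered_address, value in reversed(self.store_buffer):
            if buffered_address == address:
                return value
        return self.memory.get(address, 0)

    def drain(self):
        # Pending writes reach shared memory in program order.
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()
```

In this model, after core X calls `write(M, 1)`, X reads 1 from M immediately, while another core still reads the old value until X's buffer drains.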