There are various known related arts pertaining to packet transfer. A technology of efficiently processing a packet received from a network is disclosed as one of such related arts.
PTL 1 discloses an example of a cache control device. The cache control device described in PTL 1 includes the following components.
An area setting means sets a write area on a cache memory. The write area is an area into which a predetermined amount of received data can be written.
When processing has been performed on part or all of the received data written into the write area, an area deletion means deletes the area into which the processed received data was written.
When received data has been processed, the area setting means newly sets an area having a capacity equal to that of the deleted area at the location where subsequently received data is to be written. The newly set area is an area into which the subsequently received data can be written. The subsequently received data is data received after the predetermined amount of received data has been received.
A received data determination means determines whether or not an amount of data exceeding the writing capacity of the write area is received.
A received data write means writes received data into the write area. Further, the received data write means writes received data, an amount of which exceeds the write capacity of the write area, into a main memory.
A data read means prefetches the received data from the main memory.
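The interplay of the means listed above can be sketched roughly as follows. This is a minimal illustration only, not the implementation of PTL 1: the class name WriteAreaSketch, its methods, and the byte-count accounting are hypothetical, and the "prefetch" is modeled simply as moving overflow data into space freed by processing.

```python
class WriteAreaSketch:
    """Toy model of a fixed-capacity cache write area with overflow to main memory.

    All names here are hypothetical; PTL 1 does not specify this code.
    """

    def __init__(self, capacity):
        self.capacity = capacity   # predetermined amount the write area can hold
        self.cache_area = []       # received data held in the cache write area
        self.main_memory = []      # received data that exceeded the write capacity
        self.used = 0              # bytes currently occupied in the write area

    def write(self, data):
        """Write received data into the write area, or into main memory on overflow."""
        fits = self.used + len(data) <= self.capacity
        # Keep arrival order: once data has overflowed, later data also goes to memory.
        if fits and not self.main_memory:
            self.cache_area.append(data)
            self.used += len(data)
            return "cache"
        self.main_memory.append(data)
        return "main_memory"

    def process_one(self):
        """Process the oldest cached data, free its area, and reuse the freed space."""
        data = self.cache_area.pop(0)
        self.used -= len(data)     # the area holding the processed data is deleted
        # Prefetch overflow data from main memory into the newly available space.
        while self.main_memory and self.used + len(self.main_memory[0]) <= self.capacity:
            nxt = self.main_memory.pop(0)
            self.cache_area.append(nxt)
            self.used += len(nxt)
        return data
```

In this sketch the amount of cache in use never exceeds the configured capacity, which corresponds to the constant cache usage that PTL 1 is said to maintain.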
PTL 1 describes that, with the configuration described above, the cache control device is able to eliminate the processing delay accompanying memory access and increase processing performance, while keeping the amount of cache memory usage constant and without missing packets.
PTL 1 describes that the processing delay accompanying memory access is eliminated because the memory access processing described below is reduced.
A first memory access process to be reduced is a process of writing from a network interface card (NIC) into a memory. A second memory access process to be reduced is a process of loading from the memory by a central processing unit (CPU). A third memory access process to be reduced is a process of writing data back into the memory when the data stored in a cache is evicted as a result of the second memory access process.
With respect to the aforementioned technology of “prefetching received data from a main memory”, the technologies disclosed in NPL 1 and NPL 2 will be described below step by step with reference to the drawings.
FIG. 1 illustrates a general configuration of a computer 107 according to the related arts. Referring to FIG. 1, the computer 107 includes a CPU 200 and a main memory 300. The computer 107 further includes an input/output (I/O) controller 407. The computer 107 further includes Peripheral Component Interconnect (PCI)-Express (registered trademark) devices 650. The computer 107 further includes storage devices 660.
The I/O controller 407 is a controller controlling various interfaces to various I/O devices. The I/O controller 407 includes a PCI-Express root complex 440. Further, the I/O controller 407 includes a Serial Advanced Technology Attachment (SATA) interface 441. The I/O controller 407 further includes other I/O interfaces 442. The I/O controller 407 provides the CPU 200 with an access function to I/O devices connected through these interfaces.
In a relatively old computer, for example, a memory controller is built into an I/O controller, and a main memory is connected to the I/O controller. However, in a recent computer such as the computer 107, the function of a memory controller (not illustrated) is commonly integrated into the CPU 200.
The PCI-Express root complex 440 is located at the highest layer of an I/O fabric configured with PCI-Express, and is an interface for connecting the CPU 200 and the PCI-Express device 650.
The SATA interface 441 is an interface for connecting a storage device, such as a hard disk drive (HDD) or a solid state drive (SSD).
The other I/O interfaces 442 are, for example, a universal serial bus (USB) interface, a Personal System (PS)/2 interface, a serial port interface, a parallel port interface, or the like. In other words, the other I/O interfaces 442 are various interfaces for connecting various pieces of peripheral equipment to the computer 107.
The PCI-Express device 650 is a device connected to an I/O fabric configured with PCI-Express, and is referred to as an endpoint in the PCI-Express specification. As an example of such a device, a device for network connection such as a NIC is mentioned. As another example, a device for connecting a storage device such as redundant arrays of independent disks (RAID) is mentioned. As another example, a device with an audio visual (AV) function such as a sound card or a video capture card is mentioned.
The storage device 660 is a storage device in conformance with the ATA standard. As an example, an HDD, an SSD, a Blu-ray (registered trademark) drive, or the like is mentioned.
FIG. 2 illustrates a configuration of a computer 108 taking a different form from the computer 107. Referring to FIG. 2, the computer 108 differs from the computer 107 in that the PCI-Express root complex 440 is built into the CPU 200 and the PCI-Express device 650 is directly connected to the CPU 200. Furthermore, the computer 108 differs from the computer 107 in that the CPUs 200 are directly interconnected with each other.
FIG. 3 illustrates an example of a configuration of the CPU 200. The CPU 200 mainly includes a processing core 220, cache memories 223 (a local cache 222 and a shared cache 224), a cache controller 221, and a memory controller 225. In FIG. 3, illustration of the PCI-Express root complex 440 included in the CPU 200 illustrated in FIG. 2 is omitted.
As illustrated in FIG. 3, it is common that the CPU 200 includes a plurality of the processing cores 220. Each processing core 220 includes the cache controller 221 and the local cache 222.
The processing core 220 executes various instructions.
The cache memory 223 stores data used by the processing core 220.
The local cache 222 is a cache memory that may be exclusively used by each processing core 220.
The shared cache 224 is a cache memory shared between the processing cores 220. The shared cache 224 is generally located at the lowest layer of the cache memory hierarchy, and therefore is also referred to as a last-level cache.
The memory controller 225 controls the main memory 300 illustrated in FIGS. 1 and 2.
The cache controller 221 controls access to the cache memory 223. In a common computer, a cache memory has a capacity of about one hundredth of that of a main memory, and is often capable of storing only part of the data, instructions, or the like required for certain processing. Accordingly, of the data read from the main memory 300, the cache memory 223 stores only the data required by the processing core 220 at the time. Consequently, the cache controller 221 controls replacement of data stored in the cache memory 223, write-back of data to the main memory 300, and the like.
In a computer network (hereinafter referred to as a network) such as Ethernet (registered trademark), when receiving a packet, the computer 107 and the computer 108 operate as follows. First, packet data received by a NIC (PCI-Express device 650) is stored in the main memory 300 included in the computer 107 or the computer 108. Second, when the CPU 200 processes the received packet, the packet data is read from the main memory 300.
The cache controller 221 checks whether data of a processing target packet is stored in any of the cache memories 223 before accessing the main memory 300. When the required data is not stored in the cache memories 223 (a cache miss case), the cache controller 221 reads the packet data from the main memory 300 through the memory controller 225.
Next, the cache controller 221 stores the read packet data into one of the cache memories 223. A layer of the cache memory 223 into which the data are stored depends on the specification of the cache controller 221. In a case that the CPU 200 processes a packet, when the packet being a processing target is already stored in the cache memory 223 (a hit case), the CPU 200 is able to process the packet without accessing the main memory 300.
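The hit case and the cache miss case described above can be illustrated with the following minimal sketch. It assumes a single cache layer and a least-recently-used replacement policy; both are assumptions made for illustration, since the actual layer and replacement behavior depend on the specification of the cache controller 221. The class name CacheSketch and its members are hypothetical.

```python
from collections import OrderedDict


class CacheSketch:
    """Toy single-level cache in front of a main memory (hypothetical model)."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory   # dict mapping address -> packet data
        self.capacity = capacity         # maximum number of cached entries
        self.lines = OrderedDict()       # cached entries, oldest use first
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        """Return the data at addr, accessing main memory only on a cache miss."""
        if addr in self.lines:
            # Hit case: no access to main memory is needed.
            self.hits += 1
            self.lines.move_to_end(addr)
            return self.lines[addr]
        # Cache miss case: read from main memory, then store the data in the cache.
        self.misses += 1
        data = self.main_memory[addr]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used entry (assumed policy)
        self.lines[addr] = data
        return data
```

Reading the same address twice in this sketch produces one miss followed by one hit, and only the miss touches the main memory.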
The difference between a hit case and a cache miss case in access to the cache memory 223 is whether or not the main memory 300 is accessed. In the recent CPU 200, however, the difference in delay between access to the cache memory 223 and access to the main memory 300 has widened to a factor of several tens to a hundred. Therefore, it is desirable that data processed by the CPU 200 be stored in the cache memory 223 whenever possible.
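The effect of this latency gap can be made concrete with a simple weighted average. The sketch below assumes a simplified model in which a hit costs only the cache latency and a miss costs only the main memory latency; the function name and the numbers are illustrative, with the 1:100 ratio taken from the upper end of the range mentioned above.

```python
def average_access_time(hit_rate, cache_latency, memory_latency):
    """Expected access time when a fraction hit_rate of accesses hit the cache.

    Simplified model (an assumption): a miss costs memory_latency only.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_latency + (1.0 - hit_rate) * memory_latency


# Illustrative values: cache access = 1 time unit, main memory access = 100 units.
mostly_hits = average_access_time(0.9, 1.0, 100.0)   # roughly 10.9 units
all_hits = average_access_time(1.0, 1.0, 100.0)      # 1 unit
```

Under these assumed numbers, even a 10 percent miss rate makes the average access about ten times slower than the all-hit case, which is why keeping packet data in the cache matters.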
Consequently, the CPU 200 has a prefetch function of reading data required for a processing target from the main memory 300 into the cache memory 223 before the CPU 200 requires the data. The prefetch function includes hardware prefetch, which the CPU 200 performs automatically, and software prefetch, which an operating system (OS) or an application instructs. By prefetching data from the main memory 300 to the cache memory 223, it is possible to store the data in the cache memory 223 before the CPU 200 requires the data.
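The benefit of prefetching can be sketched with a small simulation that counts cache misses over a sequence of accesses. The model is deliberately simplified and hypothetical: the cache is unbounded, a prefetch completes instantly, and the function name count_misses and the parameter prefetch_distance are introduced here for illustration only.

```python
def count_misses(addresses, prefetch_distance=0):
    """Count cache misses when the address needed prefetch_distance steps ahead is prefetched.

    prefetch_distance=0 disables prefetching. Assumes an unbounded cache and
    prefetches that complete before the next access (simplifying assumptions).
    """
    cache = set()
    misses = 0
    for i, addr in enumerate(addresses):
        if addr not in cache:
            misses += 1          # data had to be fetched from main memory on demand
            cache.add(addr)
        ahead = i + prefetch_distance
        if prefetch_distance > 0 and ahead < len(addresses):
            cache.add(addresses[ahead])   # prefetch data before it is required
    return misses
```

For eight distinct addresses, this sketch reports eight misses without prefetching and a single miss with a prefetch distance of one, since every later access finds its data already in the cache.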
Currently, a function of giving a hint to data transmitted from a PCI-Express device is defined in the PCI-Express specification. The hint is information used for properly processing packet data received by the CPU 200. This function is referred to as transaction layer packet (TLP) processing hints (TPH). This function has been incorporated in the specification since PCI-Express 2.1. The specification of the TPH is described in detail in NPL 1.
The TPH is a function of writing a hint (for example, a tag called a steering tag) into a specific field in a header of a TLP packet transferred through a PCI-Express fabric based on a memory address of the transfer destination of the TLP packet and the like. How to handle the steering tag (hint) depends on a device on the reception side of the TLP packet (the CPU 200, the I/O controller 407, or the I/O controller 408).
For example, Intel (registered trademark) Corporation implements a function called Direct Cache Access (DCA) on the CPU 200, the I/O controller 407, or the I/O controller 408. The DCA is a function by which the CPU 200 hardware-prefetches data from a PCI-Express device based on the hint.
A NIC supporting the DCA adds a steering tag as a hint to a TLP packet, and transmits the TLP packet to the CPU 200, the I/O controller 407, or the I/O controller 408. According to NPL 2 (Section 7.5 in particular), in the case of the DCA, a steering tag stores either (1) or (2) described below: (1) a flag indicating whether the DCA is valid or invalid, and an ID indicating a target cache, and (2) a target core ID based on an Advanced Programmable Interrupt Controller (APIC) ID.
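Form (1) of the steering tag, a valid/invalid flag combined with a target ID, can be illustrated by the bit-packing sketch below. The bit positions and field widths are hypothetical and chosen only for illustration; they are not taken from NPL 1 or NPL 2, which define the actual encodings.

```python
# Hypothetical layout for illustration only (not the layout defined in NPL 2):
DCA_ENABLE_BIT = 0x01    # bit 0: whether the hint is valid
TARGET_ID_SHIFT = 1      # bits 1..4: ID indicating the target cache
TARGET_ID_MASK = 0x0F


def pack_steering_tag(enabled, target_id):
    """Combine a valid flag and a target ID into one small integer tag."""
    if not 0 <= target_id <= TARGET_ID_MASK:
        raise ValueError("target_id out of range for the assumed 4-bit field")
    flag = DCA_ENABLE_BIT if enabled else 0
    return (target_id << TARGET_ID_SHIFT) | flag


def unpack_steering_tag(tag):
    """Recover (enabled, target_id) from a tag built by pack_steering_tag."""
    enabled = bool(tag & DCA_ENABLE_BIT)
    target_id = (tag >> TARGET_ID_SHIFT) & TARGET_ID_MASK
    return enabled, target_id
```

The receiving side would decode such a tag to decide whether to act on the hint and which cache to target; how it does so is left to the receiving device, as noted above.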
Further, according to NPL 2, in the Intel 82599 10 Gigabit Ethernet (registered trademark) Controller, there are three modes, (1), (2), and (3) described below: (1) mode of writing a hint (steering tag: target core ID based on APIC ID) into a TLP packet of a receive descriptor written back to the main memory 300 (referred to as “PCIe TLP” in NPL 2), (2) mode of writing a hint into a TLP packet to the header buffer of a received Ethernet packet, and (3) mode of writing a hint into the payload of a received Ethernet packet.
The payload part of the TLP packet is temporarily stored in the main memory 300. Subsequently, the CPU 200, the I/O controller 407, or the I/O controller 408 performs hardware-prefetching of the payload part to the cache memory 223 based on the steering tag information. The hardware prefetching provides efficient processing of a packet received from a network.
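The overall receive path, in which the payload is stored in main memory and then prefetched into a cache selected by the steering tag, can be sketched as follows. The class HostSketch and its methods are hypothetical, and the sketch assumes, for simplicity, that the steering tag is directly the index of the target core's cache.

```python
class HostSketch:
    """Toy model of a host receiving TLP payloads with optional steering tags."""

    def __init__(self, num_cores):
        self.main_memory = {}                              # address -> payload
        self.caches = [dict() for _ in range(num_cores)]   # one cache per core

    def receive_tlp(self, addr, payload, steering_tag=None):
        """Store the payload in main memory, then prefetch it if a hint is present."""
        self.main_memory[addr] = payload          # payload is first stored in main memory
        if steering_tag is not None:
            # Assumption: the tag is the index of the target core's cache.
            self.caches[steering_tag][addr] = self.main_memory[addr]

    def core_read(self, core, addr):
        """Read packet data on a core; report whether the cache already held it."""
        if addr in self.caches[core]:
            return self.caches[core][addr], "hit"
        data = self.main_memory[addr]
        self.caches[core][addr] = data
        return data, "miss"
```

In this sketch, a packet received with a steering tag is found in the target core's cache on first access, whereas a packet received without one incurs a miss and an access to main memory.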