The International Organization for Standardization (ISO) has established the Open Systems Interconnection (OSI) Reference Model. The OSI Reference Model provides a network design framework that allows equipment from different vendors to communicate. More specifically, the OSI Reference Model organizes the communication process into seven distinct but interrelated categories in a layered sequence. Layer 1 is the Physical Layer. It deals with the physical means of sending data. Layer 2 is the Data Link Layer. It is associated with procedures and protocols for operating the communications lines, including the detection and correction of message errors. Layer 3 is the Network Layer. It determines how data is transferred between computers. Layer 4 is the Transport Layer. It defines the rules for information exchange and manages end-to-end delivery of information within and between networks, including error recovery and flow control. Layer 5 is the Session Layer. It deals with dialog management and controlling the use of the basic communications facility provided by Layer 4. Layer 6 is the Presentation Layer. It is associated with data formatting, code conversion, and compression and decompression. Layer 7 is the Application Layer. It addresses functions associated with particular application services, such as file transfer, remote file access and virtual terminals.
FIG. 1A is a block diagram illustrating an exemplary conventional L5 command sequence. The L5 level allows the transfer of application buffers between different applications, which is known as a dialog. Referring to FIG. 1A, an exemplary L5 command sequence may be to move N bytes of memory 11 starting at virtual address A of system 1 to system 2 via the network 10. The N bytes of memory 11 may occupy N bytes of memory 12 starting at virtual address B at system 2. An indication that this task is complete may also be requested from system 1.
FIG. 1B is a block diagram illustrating L1 through L5 processing of the conventional L5 command of FIG. 1A. Referring to FIG. 1B, the left column illustrates protocol layer sequences 17, 19, 21, 23, and 25 for protocol layers L5, L4, L3, L2 and L1, respectively, whereas the right column provides a pictorial view of the data as it is processed and prepared for transmission. The first step 15 illustrates the conventional L5 command of FIG. 1A to move a specific piece of host virtual memory (1) to a specific location in another host. The L5 processing encodes the data about the request into a header (2a) and trailer (2b) for the L5 data. The header and trailer may be allocated as new virtual memory areas and may be passed as three or more separate send requests to the L4 layer.
The L4 processing layer may segment the send requests to fit within limitations of the lower layers. It then allocates additional virtual memory areas for individual L4 headers (3) for each segment. Each header, along with the virtual memory pieces needed to complete the segment, is passed down to the L3 layer. In our example, three packets of data are passed down, the first having three virtual memory areas, the second having two virtual memory areas, and the third having three virtual memory areas.
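By way of illustration only, the segmentation described above may be sketched as follows. The function name, the area names, the 1000-byte segment limit and the 20-byte L4 header length are hypothetical values chosen for the example and are not taken from FIG. 1B. The sketch splits the L5 header, data and trailer areas into segments and places a newly allocated L4 header area at the front of each segment.

```python
from typing import List, Tuple

# A piece is (area name, offset within the area, length in bytes).
Piece = Tuple[str, int, int]

L4_HEADER_LEN = 20  # assumption: illustrative header size only


def segment_send_requests(areas: List[Tuple[str, int]],
                          max_payload: int) -> List[List[Piece]]:
    """Split (name, length) areas into segments of at most max_payload bytes.

    Each returned segment is a gather list: one new L4 header area followed
    by the pieces of the virtual memory areas that fill that segment.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    segments: List[List[Piece]] = []
    current: List[Piece] = []
    room = max_payload

    def close_segment() -> None:
        header = (f"l4_header_{len(segments)}", 0, L4_HEADER_LEN)
        segments.append([header] + current)

    for name, length in areas:
        offset = 0
        while offset < length:
            take = min(room, length - offset)
            current.append((name, offset, take))
            offset += take
            room -= take
            if room == 0:  # segment is full, pass it down
                close_segment()
                current, room = [], max_payload
    if current:  # final, partially filled segment
        close_segment()
    return segments


# Hypothetical L5 request: 16-byte header, 2500 bytes of data, 4-byte trailer.
example = segment_send_requests(
    [("l5_header", 16), ("payload", 2500), ("l5_trailer", 4)], 1000)
areas_per_packet = [len(segment) for segment in example]  # [3, 2, 3]
```

With these assumed sizes, the three resulting packets carry three, two and three virtual memory areas, respectively, which matches the example described above.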
The L3 processing layer allocates yet another virtual memory space for each packet and formats the L3 headers (4), and normally the L2 header for Ethernet packets, into the virtual memory areas. Each group is now passed down to the L2 processing layer. The L2 processing layer may collect the data from the received virtual memory areas and may produce a continuous data stream. Encapsulation for packets may be required at the L2 processing level, as well as guaranteed inter-packet gap (IPG) time. The L2 processing converts the packets received from the L3 processing level into a continuous data stream 28, which may be communicated to the L1 processing level. The L1 processing level converts the continuous data stream 28 received from the L2 processing into the appropriate electrical or optical signal pulses to travel on the network media.
In conventional communication systems that process packetized data, various different NIC technologies have been implemented in different chips, with each chip adapted to process a particular layer. In this regard, specific chips have been developed to handle L2 processing, specific chips have been developed for L4 offload processing, and specific chips have been developed to handle L5 offload processing.
FIG. 1C is a block diagram of a conventional virtual memory system for mapping physical memory. Referring to FIG. 1C, the physical system memory 31 is broken into equal-sized memory pages of size 2^N. These pages may comprise the physical memory allocation units for the system. As memory is allocated to an application, the lower translation page table (TPT) 35 is built to list the physical memory pages currently allocated to the application. The application may also be characterized by an upper TPT entry 33 that may point to the lower TPT entry 35.
Access to a particular virtual memory location, at a specific virtual address, by the application is mapped by first finding the upper TPT entry 33 for the application or buffer (step 1). This upper TPT entry 33 may contain a pointer to the start of the lower TPT entry 35 (step 2). The upper TPT entry 33 may also specify a fixed virtual offset or virtual base to subtract from the virtual address before further processing. An index into the lower TPT entry may be determined from the upper bits of the virtual address. This is possible because the page sizes are consistent and of size 2^N (step 3). Once the lower TPT entry is determined, its value may point to the start of one of the physical memory pages (step 4). The lower bits that are not used in determining the lower TPT entry index position may be utilized to index into the selected physical page to determine the physical address of the desired data (step 5). Accesses to the complete virtual address space of the buffer may be mapped. In this way, a completed virtual memory view 37 may be generated.
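For illustration, steps 1 through 5 above may be sketched as follows. The class name, field names, page size (2^12 bytes) and addresses are assumptions made for the example and do not appear in FIG. 1C.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpperTptEntry:
    virtual_base: int     # fixed virtual base subtracted before lookup
    lower_tpt: List[int]  # start address of each allocated physical page


def translate_paged(upper: UpperTptEntry, virtual_address: int,
                    page_shift: int) -> int:
    """Map a virtual address to a physical address for 2^page_shift pages."""
    offset = virtual_address - upper.virtual_base
    if offset < 0:
        raise ValueError("virtual address is below the virtual base")
    index = offset >> page_shift                 # step 3: upper bits
    within = offset & ((1 << page_shift) - 1)    # step 5: lower bits
    if index >= len(upper.lower_tpt):
        raise IndexError("virtual address is outside the mapped buffer")
    page_start = upper.lower_tpt[index]          # step 4: physical page
    return page_start + within


# Hypothetical buffer of three 4096-byte pages scattered in physical memory.
entry = UpperTptEntry(virtual_base=0x10000,
                      lower_tpt=[0x7000, 0x3000, 0x9000])
physical = translate_paged(entry, 0x11234, page_shift=12)  # 0x3234
```

Because the page size is a power of two, the index and the in-page offset are obtained with a shift and a mask rather than a division.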
While page-based virtual memory systems are common in general purpose computer systems, more customized systems, such as storage servers, may utilize a block organization system. FIG. 1D is a block diagram of a conventional block organization system for mapping physical memory. Referring to FIG. 1D, the block organization system is similar to the virtual memory page system of FIG. 1C, except that the basic allocation unit may be any consistent size for a particular task.
Access to a particular location in the physical memory 41 for a task is mapped by first finding the task entry in the task list 43 (step 1). The task entry may contain a pointer to the start of the task block list 45 (step 2). The task entry may also specify a specific amount of the first block that is not included in the task, for example, the first block offset. The first block offset may be added to the original location. An index into the block list may be determined by dividing the address by the block size. The division result will be the index into the block list 45 (step 3). The value from the task block list 45 may determine the start of the physical block to be accessed within the physical system memory 41 (step 4). The remainder of the division may provide the index into the selected block to finally determine the physical address of the desired data (step 5). Accesses to the complete block system space of the buffer may be mapped. In this way, a completed block system memory view 47 may be generated.
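For illustration, the block lookup above may be sketched as follows. The class name, field names, the 3000-byte block size and the addresses are assumptions made for the example and do not appear in FIG. 1D. Since the block size need not be a power of two, the sketch uses a division and its remainder in place of a shift and a mask.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskEntry:
    first_block_offset: int  # part of the first block not included in the task
    block_list: List[int]    # start address of each physical block


def translate_block(task: TaskEntry, location: int, block_size: int) -> int:
    """Map a task-relative location to a physical address."""
    if location < 0:
        raise ValueError("location must be non-negative")
    adjusted = location + task.first_block_offset
    index, within = divmod(adjusted, block_size)  # step 3 and step 5
    if index >= len(task.block_list):
        raise IndexError("location is outside the task's blocks")
    return task.block_list[index] + within        # step 4 plus remainder


# Hypothetical task using three 3000-byte blocks, skipping the first 500 bytes.
task = TaskEntry(first_block_offset=500, block_list=[60000, 12000, 30000])
physical = translate_block(task, 2700, block_size=3000)  # 12200
```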
FIG. 1E is a block diagram illustrating conventional L2 processing on a NIC. The NIC 103 is connected to a host memory 101 via an interface bus 105. Referring to FIG. 1E, the NIC 103 may comprise a receive (RX) data engine 119, an on-chip RX buffer descriptor pre-read memory 121, a transmit (TX) data engine 115 and an on-chip TX buffer descriptor pre-read memory 117. The host memory 101 may comprise TX buffers 107, RX buffers 109, a TX buffer descriptor queue 111 and a RX buffer descriptor queue 113. Network data may be communicated with the NIC via the transmit flow 123 and the receive flow 125.
For L2 processing, the basic requirement is to transmit from an endless stream of host physical address TX buffers 107, each with variable length. Data may be received into an endless, constantly replenished stream of host physical address RX buffers 109. During each of the transmit and receive cycles, one packet may go into one or more of the sequential TX buffers 107 and the RX buffers 109, respectively. Since there is only one transmit flow 123 and one receive flow 125, normal practice is to simply read ahead the TX buffer descriptor queue 111 and the RX buffer descriptor queue 113 for the physical address spaces. Data packets may then be transmitted, or received, based on the availability of network bandwidth. All addressing information for the buffers may already be residing on chip. Sometimes more than one stream is supported in either or both directions, but the number of streams during L2 processing is normally limited to 16 or fewer.
L2 devices are normally the most competitively priced networking devices on the market. The cost of an L2 device implementation is that application software, or the host based drivers, must do extensive formatting, segmentation, and re-assembly of network traffic to create/process complete L2 packet streams. This consumes expensive host CPU bandwidth and host memory bandwidth. An important feature of the L2 on-chip RX buffer pre-read memory 121 is that it is filled with each new buffer descriptor entry from the RX buffer descriptor queue 113 only once. Similarly, the on-chip TX buffer pre-read memory 117 is filled with each new buffer descriptor entry from the TX buffer descriptor queue 111 only once.
Data from the receive flow 125 is initially stored in the RX data engine 119. It may then be distributed in the host RX buffers 109 using RX buffer descriptor queue information stored on the on-chip RX buffer descriptor pre-read memory 121. Data in the transmit flow 123 is moved from the host TX buffers 107 into the TX data engine 115 using TX buffer descriptor queue information stored on the on-chip TX buffer descriptor pre-read memory 117. The buffer descriptor entries will stay on the NIC 103 in the dedicated pre-read memories 117 and 121 until the buffer pointed to by the corresponding descriptor is emptied (during a TX operation) or filled (during a RX operation). Once the buffer descriptor is consumed, the on-chip memory (121 or 117) may be reused for a new buffer descriptor from its corresponding queue.
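For illustration, the receive-side use of the pre-read memory may be sketched as follows. The class, method and field names, the two-slot capacity and the buffer sizes are assumptions made for the example. The sketch further assumes that a partially filled buffer is retired together with its packet, which the description above does not specify. Each descriptor is fetched from the host queue exactly once, and its on-chip slot is reused as soon as the descriptor is consumed.

```python
from collections import deque
from typing import Deque, List, Tuple

Descriptor = Tuple[int, int]  # (host physical address, buffer length)


class PreReadMemory:
    """Fixed number of on-chip slots holding read-ahead buffer descriptors."""

    def __init__(self, host_queue: Deque[Descriptor], slots: int) -> None:
        self.host_queue = host_queue
        self.slots = slots
        self.on_chip: Deque[Descriptor] = deque()
        self.fetch_count = 0  # number of descriptor reads from host memory
        self.refill()

    def refill(self) -> None:
        """Read ahead from the host queue while free slots remain."""
        while len(self.on_chip) < self.slots and self.host_queue:
            self.on_chip.append(self.host_queue.popleft())
            self.fetch_count += 1

    def consume(self) -> Descriptor:
        """Retire the oldest descriptor and reuse its slot."""
        if not self.on_chip:
            raise LookupError("no buffer descriptor available")
        descriptor = self.on_chip.popleft()
        self.refill()
        return descriptor


def receive_packet(pre_read: PreReadMemory,
                   packet_length: int) -> List[Tuple[int, int]]:
    """Place one packet into one or more sequential RX buffers."""
    placements = []
    remaining = packet_length
    while remaining > 0:
        address, length = pre_read.consume()
        used = min(length, remaining)
        placements.append((address, used))
        remaining -= used
    return placements


rx_queue = deque([(0x1000, 1024), (0x2000, 1024),
                  (0x3000, 1024), (0x4000, 1024)])
rx_pre_read = PreReadMemory(rx_queue, slots=2)
placed = receive_packet(rx_pre_read, 1500)  # [(0x1000, 1024), (0x2000, 476)]
```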
L2 processing has traditionally been achieved by having a driver running on the main system processor converting virtual memory areas into a list of physical memory address/length pairs. In this regard, the TX buffer descriptor queue comprises physical memory information. The buffers 107 pointed to by each physical address/length pair are portions of the virtual memory areas generated by the higher level network stacks. For L2 offload level, the TX buffer descriptor queue 111 may be read by the NIC into the on-chip TX buffer pre-read memory 117 and fed to the TX data engine 115. Final L2 formatting is achieved as the data is communicated out from the TX data engine 115.
FIG. 2 is a block diagram illustrating conventional L4 processing. The NIC 203 is connected to a host memory 201 via an interface bus 205. Referring to FIG. 2, the NIC 203 may comprise a RX data engine 219 and a TX data engine 215. The NIC may be connected to an off-chip RX buffer descriptor pre-read memory 221, and an off-chip TX buffer descriptor pre-read memory 217. The host memory 201 may comprise a set of TX buffers 207, RX buffers 209, a TX buffer descriptor queue 211 and a RX buffer descriptor queue 213, all associated with a specific L4 connection being processed by the NIC. Many thousands of these connections may need to be processed by the NIC. Data may be communicated with the NIC via the transmit flow 223 and the receive flow 225.
There are different implementations for L4 offload buffer management processing. In one implementation, each supported connection requires a pair of transmit/receive buffer streams. During a transmit and a receive operation, host physical buffers of variable length may be used, for example, the TX buffers 207 and the RX buffers 209 on the host memory 201. Buffer descriptor information is stored in the TX buffer descriptor queue 211 and the RX buffer descriptor queue 213. During L4 processing, the NIC 203 and the host memory 201 may support many data connections. For example, the number of TX buffer descriptor queues and RX buffer descriptor queues that may be managed is normally counted in the thousands, with some L4 devices supporting 64K queues and more. Compared to the market for L2 types of devices, the market for L4 types of devices has been less price competitive. With regard to data processing, the L4 types of devices simply pre-read the buffers, just as the L2 types of devices do. However, one TX/RX buffer set is required for each connection during L4 processing. This requires large amounts of memory for storage on the NIC, normally requiring multiple external memory devices to implement (i.e., external memory attached to the NIC). For example, an off-chip TX buffer descriptor pre-read memory 217 and an off-chip RX buffer descriptor pre-read memory 221 may be utilized, together with the RX data engine 219 and the TX data engine 215 on the NIC 203, in order to manage the transmit flow 223 and the receive flow 225.
During L4 processing, a driver running on a main system processor converts the output from the L5 processing into TX buffer descriptor queue 211 entries. The main difference between L4 and L5 processing is that in order to perform the L4 offload, the NIC must keep track of data for each connection individually, so the number of queues and amount of queued information can be significant. The data pointed to by the queue entries are the portions of virtual memory space specified by the L5 processing level. For L4 offload level, the TX buffer descriptor queue 211 for each connection must be read and subsequent processing may be performed for levels 4, 3, 2, and 1.
FIG. 3 is a block diagram illustrating conventional L5 processing. The NIC 303 is connected to a host memory 301 via an interface bus 305. Referring to FIG. 3, the NIC 303 may comprise a RX data engine 325 and a TX data engine 323. The NIC 303 may be connected to an off-chip translation page table (TPT) memory 311. The off-chip TPT memory 311 may comprise a plurality of upper translation page tables 313 and a plurality of lower translation page tables 315. The host memory 301 may comprise application virtual memory buffers, contained in equal size host memory pages or blocks 307. Network data may be communicated with the NIC via the transmit flow 327 and the receive flow 329. Data between the NIC 303 and the host memory 301 may be exchanged via the communication bus 305.
L5 offload buffer management processing is more structured than L2 and L4 processing. Normally, the buffers 307 on the host memory 301, for any L5 transmit or receive operations, are described by a two level page table system—lower level and upper level. The upper level is normally called the upper level translation page table (Upper TPT) and identifies a specific buffer for a single or multiple operations. The upper TPT entry identifies a starting virtual address and length for the application buffer, just as the application sees the buffer, as well as the location in the lower TPT entries, where the physical page locations may be located. The upper TPT entry may also include certain access permissions for the buffer.
The lower level is normally referred to as the lower level translation page table (lower TPT). The lower TPT is a simple page, or block, table, where each entry points to a standard sized system memory page or block. Each of the lower TPT entries 315 may have multiple upper TPT entries 313 pointing to it, where each upper TPT entry may describe a different application buffer, and these application buffers may or may not overlap. This type of L5 buffer description corresponds directly with modern virtual memory general purpose processing systems, such as the systems in FIGS. 1C and 1D, which organize data memory as a large pool of fixed size buffers or storage servers.
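For illustration, the two level description above may be sketched as follows. The class and field names, the 4096-byte page size, the single read/write permission flag and the addresses are assumptions made for the example and do not appear in FIG. 3. Two upper TPT entries describe overlapping application buffers that share entries of one lower TPT, and each access is checked against the buffer bounds and permissions before it is resolved. The sketch returns only the physical address of the first byte of an access.

```python
from dataclasses import dataclass
from typing import List

PAGE_SHIFT = 12  # assumption: 4096-byte host memory pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


@dataclass
class UpperTptEntry:
    virtual_start: int  # starting virtual address as the application sees it
    length: int         # buffer length in bytes
    lower_start: int    # index of the buffer's first lower TPT entry
    page_offset: int    # byte offset of the buffer start within that page
    writable: bool      # simplified access permission


def resolve(upper: UpperTptEntry, lower_tpt: List[int],
            virtual_address: int, nbytes: int, write: bool) -> int:
    """Return the physical address of the first byte of a checked access."""
    if write and not upper.writable:
        raise PermissionError("buffer does not permit writes")
    end = upper.virtual_start + upper.length
    if virtual_address < upper.virtual_start or virtual_address + nbytes > end:
        raise ValueError("access is outside the application buffer")
    offset = virtual_address - upper.virtual_start + upper.page_offset
    index = upper.lower_start + (offset >> PAGE_SHIFT)
    return lower_tpt[index] + (offset & PAGE_MASK)


# One shared lower TPT listing three physical pages.
lower_tpt = [0x40000, 0x18000, 0x7C000]
# buf_a covers pages 0 and 1; buf_b starts halfway into page 1 and
# ends halfway into page 2, so the two buffers overlap in page 1.
buf_a = UpperTptEntry(0x1000, 8192, lower_start=0, page_offset=0,
                      writable=True)
buf_b = UpperTptEntry(0x9000, 4096, lower_start=1, page_offset=0x800,
                      writable=False)

addr_a = resolve(buf_a, lower_tpt, 0x2010, 4, write=False)  # 0x18010
addr_b = resolve(buf_b, lower_tpt, 0x9900, 4, write=False)  # 0x7C100
```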
In conventional L5 offload systems, the page table system is uploaded and ready in the NIC before it is needed for L5 processing. Each page table is used for the duration of time it is needed and then unloaded from the NIC when it will no longer be used. The L5 processing method may produce very high performance for high-cost NIC implementations. However, it requires a very large amount of memory for all the page tables needed to be loaded for data processing. L5 processing, therefore, may be very expensive as it often requires multiple devices for implementation. For example, an off-chip TPT memory 311, with upper TPT entries 313 and lower TPT entries 315 may be needed, together with a TX data engine 323 and an RX data engine 325 on the NIC 303, in order to manage the transmit flow 327 and the receive flow 329.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.