The present invention relates to memory management and data communication networks. More particularly, the present invention relates to an apparatus and method for reducing the required number of queuing memory access cycles using a distributed queue structure in devices such as data communication network devices having multiple queues.
As is known to those of ordinary skill in the art, a network is a communication system that allows users to access resources on other computers and exchange messages with other users. A network is typically a data communication system that links two or more computers and peripheral devices. It allows users to share resources on their own systems with other network users and to access information on centrally located systems or systems that are located at remote offices. It may provide connections to the Internet or the networks of other organizations. The network typically includes a cable that attaches to network interface cards ("NICs") in each of the devices within the network. Users may interact with network-enabled software applications to make a network request (such as to get a file or print on a network printer). The application may also communicate with the network software, which may then interact with the network hardware to transmit information to other devices attached to the network.
Many techniques and devices are known to those of ordinary skill in the art for transmitting data between nodes in a network. For example, data may be transmitted through multiple intermediate network connection devices, such as routers and switches, located between a source node and a destination node. These intermediate network communication devices may contain one or more queues that temporarily store data awaiting transmission to another node or network communication device in the network. In networks that transmit data using the Internet Protocol ("IP"), best-effort service is typically provided by the various network nodes. Best-effort service does not provide any Quality of Service ("QOS") guarantees for a particular data stream. Instead, best-effort service transmits data in the order it was received using a network communication device's available bandwidth.
Network communication devices that support QOS or other resource allocation techniques typically use multiple queues in which each queue is associated with a particular QOS or a particular data flow. A portion of the device's resources, such as bandwidth, are allocated to a particular queue within the device.
FIG. 1 is a block diagram illustrating an exemplary network 100 connecting a user 110 and a particular web page 120. FIG. 1 is an example that may be consistent with any type of network known to those of ordinary skill in the art, including a Local Area Network ("LAN"), a Wide Area Network ("WAN"), or a combination of networks, such as the Internet.
When a user 110 connects to a particular destination, such as a requested web page 120, the connection from the user 110 to the web page 120 is typically routed through several internetworking devices such as routers 130-A-130-I. Routers are typically used to connect similar and heterogeneous network segments into internetworks. For example, two LANs may be connected across a dial-up line, an integrated services digital network ("ISDN") line, or a leased line via routers. Routers may also be found throughout the internetwork known as the Internet. End users may connect to a local Internet service provider ("ISP") (not shown).
As shown in FIG. 1, multiple routes are possible to transmit information between user 110 and web page 120. Networks are designed such that routers attempt to select the best route between computers such as the computer where user 110 is located and the computer where web page 120 is stored. For example, based on a number of factors known to those of ordinary skill in the art, the route defined by following routers 130-A, 130-B, 130-C, and 130-D may be selected. However, the use of different routing algorithms may result in the selection of the route defined by routers 130-A, 130-E, 130-F, and 130-G, or possibly even the route defined by routers 130-A, 130-B, 130-H, 130-I, 130-F, and 130-G. A detailed discussion of the aspects of routing algorithms that determine the optimal path between two nodes on a network is not necessary for the purposes of the present invention, and such a discussion is not provided here so as not to overcomplicate the present disclosure.
Routers such as routers 130-A-130-I typically transfer information along data communication networks using formatted data packets. For example, when a "source" computer system (e.g., computer 110 in FIG. 1) wishes to transmit information to a "destination" computer system (e.g., computer 120 in FIG. 1), it generates a packet header in an appropriate format which typically includes the address of the source and destination end system, and then fills the remainder of the packet with the information to be transmitted. The complete data packet is then transmitted to the router attached to (and responsible for) the source computer system, which forwards it toward the destination computer system. Packets transmitted among the routers themselves (typically referred to as "control packets") are similarly formatted and forwarded.
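The packet structure just described can be sketched in software. The sketch below is purely illustrative; the class and field names are invented and do not correspond to an actual IP header layout, which carries many additional fields (version, TTL, checksum, and so on).

```python
from dataclasses import dataclass

# Illustrative model of a formatted data packet: a header carrying the
# source and destination end-system addresses, plus the payload.
@dataclass
class DataPacket:
    source: str        # address of the source end system
    destination: str   # address of the destination end system
    payload: bytes     # the information to be transmitted

packet = DataPacket(source="110", destination="120", payload=b"request")
print(packet.destination)  # → 120
```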
When a router receives a data packet, it reads the data packet's destination address from the data packet header, and then transmits the data packet on the link leading most directly to the data packet's destination. Along the path from source to destination, a data packet may be transmitted along several links and pass through several routers, with each router on the path reading the data packet header and then forwarding the data packet on to the next "hop."
To determine how data packets should be forwarded, each router is typically aware of the locations of the network's end systems (i.e., which routers are responsible for which end systems), the nature of the connections between the routers, and the states (e.g., operative or inoperative) of the links forming those connections. Using this information, each router can compute effective routes through the network and avoid, for example, faulty links or routers. A procedure for performing these tasks is generally known as a "routing algorithm."
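A route computation of the kind just described can be illustrated with a minimal shortest-path sketch. The router names echo FIG. 1, but the link costs and the function name are invented for illustration; real routing protocols differ considerably in how they learn and weigh this information. An inoperative link is simply absent from the map, so faulty links are avoided.

```python
import heapq

def compute_route(links, source, destination):
    """Minimal shortest-path route computation (Dijkstra's algorithm).

    links maps a router to a dict of {neighbor: cost}. Returns the
    list of routers on the best route, or None if unreachable.
    """
    # Each frontier entry is (cost so far, current router, path taken).
    frontier = [(0, source, [source])]
    visited = set()
    while frontier:
        cost, router, path = heapq.heappop(frontier)
        if router == destination:
            return path
        if router in visited:
            continue
        visited.add(router)
        for neighbor, link_cost in links.get(router, {}).items():
            if neighbor not in visited:
                heapq.heappush(
                    frontier, (cost + link_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical costs, loosely modeled on FIG. 1: two candidate routes
# from 130-A to 130-D, of which the cheaper one is selected.
links = {
    "130-A": {"130-B": 1, "130-E": 1},
    "130-B": {"130-C": 1},
    "130-C": {"130-D": 1},
    "130-E": {"130-F": 5},
    "130-F": {"130-D": 5},
}
print(compute_route(links, "130-A", "130-D"))
# → ['130-A', '130-B', '130-C', '130-D']
```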
FIG. 2 is a block diagram of a sample router 130 suitable for implementing an embodiment of the present invention. For the purpose of explanation, the present invention is described as embodied in a router. However, those of ordinary skill in the art will recognize that various other network communication devices such as switches (including asynchronous transfer mode (ATM) switches and IP switches), data servers, and similar devices may embody the teachings of the present invention. In a particular embodiment of the invention, router 130 is an Internet Protocol ("IP") router. However, those of ordinary skill in the art will recognize that the present invention can be used with various other protocols.
Referring to FIG. 2, router 130 is shown to include a master control processing unit ("CPU") 210, low and medium speed interfaces 220, and high speed interfaces 230. The CPU 210 may be responsible for performing such router tasks as routing table computations and network management. It may include one or more microprocessor integrated circuits selected from complex instruction set computer ("CISC") integrated circuits, reduced instruction set computer ("RISC") integrated circuits, or other commercially available processor integrated circuits. Non-volatile RAM and/or ROM may also form a part of CPU 210. Those of ordinary skill in the art will recognize that there are many alternative ways in which such memory can be coupled to the system.
Interfaces 220 and 230 are typically provided as interface cards. Generally, they control the transmission and reception of data packets over the network, and sometimes support other peripherals used with router 130. Throughout the description of this invention, the term "data packet" shall be understood to include any grouping of one or more data elements of any size, including data cells, data bytes, and the like. In a particular embodiment of the invention, router 130 is an IP router capable of handling IP data packets. In this embodiment, IP data packets associated with different IP data flows are buffered in different queues. This buffering of IP data packets can be performed on a per service class basis or a per data flow basis.
Examples of interfaces that may be included in the low and medium speed interfaces 220 are a multiport communications interface 222, a serial communications interface 224, and a token ring interface 226. Examples of interfaces that may be included in the high speed interfaces 230 include a fiber distributed data interface ("FDDI") 232 and a multiport Ethernet interface 234. Each of these interfaces (low/medium and high speed) may include (1) a plurality of ports appropriate for communication with the appropriate media, and (2) an independent processor, and in some instances (3) volatile RAM. The independent processors may control such communication intensive tasks as packet switching and filtering, and media control and management. By providing separate processors for the communication intensive tasks, this architecture permits the master CPU 210 to efficiently perform routing computations, network diagnostics, security functions, and other similar functions.
The low and medium speed interfaces are shown to be coupled to the master CPU 210 through a data, control, and address bus 240. High speed interfaces 230 are shown to be connected to the bus 240 through a fast data, control, and address bus 250, which is in turn connected to a bus controller 260. The bus controller functions are typically provided by an independent processor.
Although the system shown in FIG. 2 is an example of a router suitable for implementing an embodiment of the present invention, it is by no means the only router architecture on which the present invention can be implemented. For example, an architecture having a single processor that handles communications as well as routing computations would also be acceptable. Further, other types of interfaces and media known to those of ordinary skill in the art could also be used with the router.
At a higher level of abstraction, FIG. 3 is a block diagram illustrating a model of a typical router system that is applicable in the context of the present invention. As shown in FIG. 3, a networking device such as a router 130 may be modeled as a device having a plurality of input interfaces 310a-310n, each having a corresponding input interface queue 320a-320n. Each input interface 310 receives a stream 330a-330n of data packets 340a-340z, with each data packet 340 typically arriving at a variable rate and typically having a variable length (usually measured in bytes). In addition to the data "payload" in each packet, each packet contains header information, which typically includes a source address and a destination address. Currently, the dominant protocol for transmitting such data packets is the Internet Protocol ("IP"). However, as will be described more fully in subsequent portions of this document, embodiments of the present invention can be implemented using any routable protocol known to those of ordinary skill in the art.
As each new data packet 340 arrives on an interface 310k, it is written into a corresponding input interface queue 320k, waiting for its turn to be processed. Scheduling logic 350 determines the order in which input interfaces 310a-310n should be "polled" to find out how many data packets (or equivalently, how many bytes of data) have arrived on a given interface 310k since the last time that interface 310k was polled. Scheduling logic 350 also determines the amount of data that should be processed from a given interface 310k during each "polling round." When scheduling logic 350 determines that a particular data packet 340i should be processed from a particular input interface queue 320k, scheduling logic 350 transfers the data packet 340i to subsequent portions of the networking device (shown as dashed block 355) for further processing. Eventually, data packet 340i is written into one of a plurality of output queues 360a-360q, at the output of which the data packet 340i is finally transmitted from the networking device via the corresponding output interface 370a-370q. Fundamentally, then, the packet forwarding component of a router performs the function of examining the source and destination address of each data packet and identifying one from among a plurality of output interfaces 370a-370q on which to transmit each data packet.
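The polling behavior of scheduling logic 350 can be illustrated with a simple round-robin sketch. Round-robin is only one of many disciplines such scheduling logic might implement, and all names below are invented for illustration; the "quantum" parameter plays the role of the per-round data budget described above.

```python
from collections import deque

def round_robin_poll(input_queues, quantum):
    """One possible polling discipline: visit each input interface
    queue in turn, processing up to `quantum` data packets from it
    per polling round, until every queue is drained."""
    processed = []
    while any(input_queues):
        for queue in input_queues:
            # Process at most `quantum` packets from this interface
            # during the current polling round.
            for _ in range(min(quantum, len(queue))):
                processed.append(queue.popleft())
    return processed

queues = [deque(["a1", "a2", "a3"]), deque(["b1"]), deque(["c1", "c2"])]
print(round_robin_poll(queues, 1))
# → ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```

Note that with a quantum of one packet, each polling round takes at most one packet per interface, so no interface can monopolize the device.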
In the router model illustrated in FIG. 3, each queue is associated with one of the router's input or output interface ports. However, as mentioned earlier, it is also possible to associate a queue with a particular "session," with a "flow," or with any other category or classification of data stream. In the context of the present invention, therefore, a "queue" is simply an ordered list of elements waiting to be processed. A "flow" is a stream of data traveling between two endpoints across a network (for example, from one LAN station to another). Multiple flows can be transmitted on a single circuit. As those of ordinary skill in the art will recognize, the number of queues in a network device can be very large in implementations where each flow can be associated with a queue.
In a queuing control design realized by a hardware memory structure, the number of memory accesses to the queue within a certain amount of time is limited by the bandwidth of the memory. Typically, updating a queuing event such as the arrival or departure of a queue data element requires two memory access cycles: one to read the current status and one to write the updated values. In such a configuration, completing the service of a queuing event (i.e., the arrival and departure of a data element) requires four memory access cycles. Such service time requirements limit the throughput of the queue. If data elements arrive faster than the queue can be serviced, the queue will overflow.
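The throughput limit just described can be made concrete with some arithmetic. The memory access rate below is a hypothetical figure chosen purely for illustration; the cycle counts follow the four-cycle accounting given above.

```python
def max_event_rate(accesses_per_second, cycles_per_event):
    """Upper bound on queuing events (one arrival plus one departure)
    per second, given a memory bandwidth and a fixed cycle cost per
    event. Arrivals above this rate will overflow the queue."""
    return accesses_per_second / cycles_per_event

# Hypothetical memory sustaining 200 million access cycles per second:
rate = 200e6
print(max_event_rate(rate, 4))  # four cycles per event → 50000000.0
print(max_event_rate(rate, 2))  # halving the cycle cost doubles the bound
```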
As mentioned earlier, in networking design, it is often necessary to implement a large number of queues on shared memories for high-speed data processing. Reducing the number of queue access cycles (and, hence, the queue service time) while maintaining the queuing functions is critical to achieving the desired performance.
A typical memory architecture 400 for a network device employing multiple queues is illustrated in FIG. 4. As shown in FIG. 4, controller 405 may be a microprocessor, a microcontroller, or any other suitable equivalent device, and may be implemented as one or more integrated circuits. Controller 405 is coupled to a Queue Memory 410 via address bus 412 and data bus 414. Controller 405 is also coupled to Datalink Memory 420 via address bus 422 and data bus 424. Finally, controller 405 comprises a Free List Register Memory 430. The various components shown in FIG. 4 may be implemented as one or more integrated circuits. It should be noted that, as shown in FIG. 4, the hardware architecture depicted implies that only a single access to Queue Memory 410 can be performed at any given time. Similarly, only a single access to Datalink Memory 420 can be performed at any given time. Free List Register 430 is typically implemented as a register or as some other type of rapidly accessible memory, such that accesses to the Free List Register 430 are not considered to be costly in terms of total memory bandwidth. Nothing precludes the possibility that Free List Register 430 is implemented independently of controller 405, so long as Free List Register 430 can be accessed rapidly by controller 405.
Referring now to FIGS. 4 and 5, Queue Memory 410 comprises a set of Head and Tail pointers, with one Head pointer and one Tail pointer per queue. Each queue within Queue Memory 410 is typically implemented by associating the Head and Tail pointers of the queue with a set of data link information, maintained in a linked list memory structure (such as Datalink Memory 420), that tracks the stored data. Data Storage Memory 425 is the memory structure used to store the actual data elements. For the sake of explanation, an example is provided herein, with reference to FIGS. 4-7.
In a typical queuing construct, a Queue Memory 410 records the Head and Tail information for the beginning and end positions of the queue. The data elements between the Head and Tail pointers are maintained in a linked list memory (e.g., Datalink Memory 420 shown in FIGS. 4 and 5). As those of ordinary skill in the art will recognize, Datalink Memory 420 provides the location of the next data element in the queue, while the actual data elements are stored in Data Storage Memory 425. In the example shown in FIG. 5, there are five data elements: a, b, c, d, and e. The Free pointer indicates the beginning ("head") of the remaining free link locations.
When a new data element arrives, the Queue Memory 410 is first read to obtain the current Tail position (in this example, the value 101). The current Free location will be used to store the newly arriving data. Next, to establish the link, the current value of the Free pointer (103 in this example) is written to the Datalink Memory 420 at the current Tail position. This Free pointer value, which now is the new Tail pointer, is then written to the Tail record of Queue Memory 410. To obtain a new Free location, the Datalink Memory 420 is read at the current Free pointer location to obtain the next available pointer (having a value of 105 in the example shown in FIG. 5). This value becomes the new Free pointer. Therefore, two memory access cycles are required for each of the Queue Memory 410 and the Datalink Memory 420 (for a total of four memory access cycles) when receiving a new data element.
A flow chart of the typical data element reception process just described is provided at FIG. 6. At step 610, Queue Memory 410 is read to obtain the current Tail pointer. At step 620, the value of the current Free pointer is written to Datalink Memory 420 at the current Tail pointer location. At step 630, Datalink Memory 420 is read at the current Free pointer location to obtain the new Free pointer location. At step 640, the current Free pointer location is stored in Queue Memory 410 as the new Tail pointer. Finally, at step 650, the current Free pointer (stored in the Free List Register 430 shown in FIG. 4) is set to equal the new Free pointer. As mentioned earlier, step 650 is typically not very costly in terms of memory bandwidth, because the Free List Register 430 is typically implemented as a register or other rapidly accessible type of memory. Therefore, ignoring step 650, four memory access cycles are required to receive each data element: one Queue Memory read cycle, one Datalink Memory write cycle, one Datalink Memory read cycle, and one Queue Memory write cycle.
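The reception flow of FIG. 6 can be modeled as a brief software sketch. The model below is purely illustrative (all variable names are invented); it tallies accesses to Queue Memory and Datalink Memory to make the four-cycle cost visible, and uses the FIG. 5 example values (Tail pointer 101, Free pointer 103). Data Storage Memory accesses are not tallied, since they are not counted among the four cycles.

```python
SIZE = 256
accesses = {"queue_mem": 0, "datalink": 0}
queue_mem = {"head": 2, "tail": 101}   # Queue Memory 410 (one queue)
datalink = [None] * SIZE               # Datalink Memory 420
storage = [None] * SIZE                # Data Storage Memory 425
datalink[103] = 105                    # free links: 103 -> 105 -> ...
free = 103                             # Free List Register 430

def receive(element):
    global free
    # Step 610: read Queue Memory to obtain the current Tail pointer.
    tail = queue_mem["tail"]; accesses["queue_mem"] += 1
    # The current Free location stores the newly arriving data.
    storage[free] = element
    # Step 620: write the current Free pointer to Datalink Memory at
    # the current Tail location, establishing the link.
    datalink[tail] = free; accesses["datalink"] += 1
    # Step 630: read Datalink Memory at the Free pointer to obtain
    # the next available free location.
    new_free = datalink[free]; accesses["datalink"] += 1
    # Step 640: the old Free pointer becomes the new Tail pointer.
    queue_mem["tail"] = free; accesses["queue_mem"] += 1
    # Step 650: update the Free List Register (a register access,
    # not counted against memory bandwidth).
    free = new_free

receive("f")
print(queue_mem["tail"], free, accesses)
# → 103 105 {'queue_mem': 2, 'datalink': 2}
```

Running the sketch reports two accesses to each memory, matching the four-cycle total discussed above.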
To transmit a data element, Queue Memory 410 is first read to obtain the current Head pointer (having a value of 2 in the example shown in FIG. 5). Datalink Memory 420 is then read at this position to obtain the location of the next data element, after which the current Free pointer value is written at the current Head location. In the example shown in FIG. 5, the value 103 will be written to location 2. At this point, the new Head pointer (having a value of 5 in the example) is written back to the Queue Memory 410 Head record. Finally, the new Free pointer is set to the old Head position (having a value of 2 in the example). Thus, the old Head pointer is now "returned" to the Free link pointers. The Free pointers now start from location 2, then point to location 103, then to location 105, etc. Therefore, as the example illustrates, two memory access cycles are also required for each of the Queue Memory 410 and the Datalink Memory 420 (for a total of four memory access cycles) when transmitting a data element from the queue.
A flow chart of the typical data element transmission process just described is provided at FIG. 7. At step 710, Queue Memory 410 is read to obtain the old Head pointer value. At step 720, Datalink Memory 420 is read at the old Head pointer location to obtain the next data element location. At step 730, the Free pointer is written to the Datalink Memory 420 at the old Head pointer location. At step 740, the next data element location is stored in the Queue Memory 410 as the new Head pointer. Finally, at step 750 the Free pointer is set to equal the old Head pointer value. As was the case with step 650 shown in FIG. 6, step 750 is not considered to be very costly in terms of memory bandwidth. Therefore, ignoring step 750, the data element transmission process requires four memory access cycles: one Queue Memory read cycle, one Datalink Memory read cycle, one Datalink Memory write cycle, and one Queue Memory write cycle.
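Similarly, the transmission flow of FIG. 7 can be modeled as a software sketch, again with invented names, using the FIG. 5 example values (Head pointer 2, next element at location 5, Free pointer 103).

```python
SIZE = 256
accesses = {"queue_mem": 0, "datalink": 0}
queue_mem = {"head": 2, "tail": 101}   # Queue Memory 410 (one queue)
datalink = [None] * SIZE               # Datalink Memory 420
storage = [None] * SIZE                # Data Storage Memory 425
datalink[2] = 5                        # element at 2 is followed by 5
storage[2] = "a"
free = 103                             # Free List Register 430

def transmit():
    global free
    # Step 710: read Queue Memory to obtain the old Head pointer.
    old_head = queue_mem["head"]; accesses["queue_mem"] += 1
    # Step 720: read Datalink Memory at the old Head to obtain the
    # next data element location.
    next_loc = datalink[old_head]; accesses["datalink"] += 1
    # Step 730: write the Free pointer to Datalink Memory at the old
    # Head, returning that location to the free links.
    datalink[old_head] = free; accesses["datalink"] += 1
    # Step 740: store the next element location as the new Head.
    queue_mem["head"] = next_loc; accesses["queue_mem"] += 1
    # Step 750: the old Head becomes the new Free pointer (register).
    element = storage[old_head]
    free = old_head
    return element

print(transmit(), queue_mem["head"], free)  # → a 5 2
```

As with reception, the tally shows two accesses to each memory, for the four-cycle total the text describes.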
In summary, to completely process a data element (i.e., to receive a data element and to transmit a data element) in a typical queuing implementation, a total of four memory access cycles are required for each of the Queue Memory 410 and the Datalink Memory 420. This is illustrated in Table I, below.
As mentioned earlier, the number of memory accesses to the queue within a certain amount of time is limited by the bandwidth of the memory. Therefore, reducing the number of memory access cycles required to process each data element would increase the data processing capability of a data network device such as a router. Unfortunately, no current mechanism exists to facilitate such a reduction in the number of memory access cycles required. As will be described in more detail below, the present invention provides a distributed queuing architecture that significantly reduces the number of memory access cycles required to process each data element.
According to aspects of the present invention, to reduce the number of memory access cycles required to process each data element, the queue and data link structures are implemented on separate memories. Instead of a single memory structure, a queue is maintained using separate Receive and Transmit Queues. Similarly, the data memory linked list is separated into a Data Queue Link and a Data Stack Link. Compared with existing approaches, the novel queuing structure according to aspects of the present invention reduces the number of required memory access cycles by half when processing a typical data element arrival and departure. It provides a scheme to more efficiently utilize the queuing memory bandwidth and to increase the data throughput. Moreover, the method is scalable and can be implemented for a large number of queues. These and other features and advantages of the present invention will be presented in more detail in the following specification of the invention and in the associated figures.
To reduce the number of memory access cycles required to process each data element in a data networking device having one or more queues and a corresponding set of data link structures, the queue and data link structures are implemented on separate memories. Each queue is maintained using separate receive and transmit queue structures. Similarly, the data memory linked list is separated into a data queue link and a data stack link. Each of these four memories comprises its own address and data bus, and all four memories may be accessed simultaneously by a controller. In a general case, processing a complete data transmission event (i.e., a data element arrival and a data element departure) may be performed with a latency of at most three steps. In the first step, the transmit queue is read to obtain the old head pointer. In the second step, the following three sub-steps are performed simultaneously: (1) the receive queue is read to obtain the current tail pointer, (2) the data stack link is read at the current free pointer position to obtain the new free pointer, and (3) the data queue link memory is read at the old head pointer address obtained in the first step to obtain the next data element location. The data values obtained from performing the first and second steps are used either as addresses or as data values in the third step. In the third step, the following four sub-steps may be performed simultaneously: (1) the free pointer is stored in the receive queue as the new tail pointer, (2) the next data element location is written to the transmit queue as the new head pointer, (3) the free pointer is stored in the data queue link memory at the current tail pointer location, and (4) the free pointer is written to the data stack link memory at the old head pointer location. Various modifications to the above sequence of steps are possible.
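The three-step sequence just described can be sketched as a software analogue. In hardware the sub-steps within the second and third steps proceed simultaneously on four independent memories; the sequential sketch below merely groups them, so each memory sees at most one read and one write per complete event (half the four cycles of the conventional structure). All names are invented, the example values echo FIG. 5, and the free-list bookkeeping in the final two lines reflects one plausible interpretation of the text.

```python
SIZE = 256
receive_q = {"tail": 101}         # Receive Queue (tail pointers)
transmit_q = {"head": 2}          # Transmit Queue (head pointers)
data_queue_link = [None] * SIZE   # forward links between queued elements
data_stack_link = [None] * SIZE   # linked stack of free locations
data_queue_link[2] = 5            # element at 2 is followed by 5
data_stack_link[103] = 105        # free stack: 103 -> 105 -> ...
free = 103                        # free pointer register

def process_event():
    """One complete event (a data element arrival plus a departure),
    grouped into the three steps described above."""
    global free
    # Step 1: read the Transmit Queue to obtain the old head pointer.
    old_head = transmit_q["head"]
    # Step 2: three reads, one per memory, issued in parallel in
    # hardware.
    tail = receive_q["tail"]               # (1) current tail pointer
    new_free = data_stack_link[free]       # (2) new free pointer
    next_loc = data_queue_link[old_head]   # (3) next element location
    # Step 3: four writes, one per memory, issued in parallel in
    # hardware.
    receive_q["tail"] = free               # (1) free becomes new tail
    transmit_q["head"] = next_loc          # (2) next element is new head
    data_queue_link[tail] = free           # (3) link arrival at old tail
    data_stack_link[old_head] = new_free   # (4) chain old head into the
    free = old_head                        #     free stack (register)

process_event()
print(receive_q["tail"], transmit_q["head"], free)  # → 103 5 2
```

In this sketch each of the four memories is touched at most twice per event (one read, one write), versus four accesses to each of the Queue Memory and Datalink Memory in the conventional structure.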