1. Field of Invention
The present invention relates to scheduling and arbitrating events in computing and networking, and more particularly to the use of the data structure known as a pile for high-speed scheduling and arbitration of events in computing and networking.
2. Description of Related Art
Data structures known as heaps have been used previously to sort a set of values in ascending or descending order. Rather than storing the values in a fully sorted fashion, the values are “loosely” sorted such that the technique allows simple extraction of the lowest or greatest value from the structure. Exact sorting of the values in a heap is performed as the values are removed from the heap; i.e., the values are removed from the heap in sorted order. This makes a heap useful for sorting applications in which the values must be traversed in sorted order only once.
The properties of a heap data structure are as follows.
    P1. A heap is a binary tree, or a k-ary tree where k>2.
    P2. A heap is a balanced tree; i.e., the depth of the tree for a set of values is bounded to log_k(N), where N is the number of elements in the tree, and where k is described above.
    P3. The values in a heap are stored such that a parent node is always of higher priority than all of its k descendent nodes. Higher priority means “higher priority to be removed from the heap”.
    P4. A heap is always left (or right) justified and only the bottom level may contain “holes” (a lack of values) on the right (or left) side of that level.
Property P2 is a reason that heaps are a popular method of sorting in systems where the sorted data must be traversed only once. The bounded depth provides a deterministic search time whereas a simple binary or k-ary tree structure does not.
Property P3 dictates that the root node of the tree always holds the highest priority value in the heap. In other words, it holds the next value to be removed from the heap since values are removed in sorted order. Therefore, repeatedly removing the root node removes the values in the heap in sorted order.
FIG. 1 is a conventional architectural diagram illustrating a tree-based heap data structure 10, with a level 0 of heap, a level 1 of heap, a level 2 of heap, and a level 3 of heap. Tree-like data structures such as heaps are typically depicted and implemented as a series of nodes and pointers to nodes. Each node comprises a value to be sorted. In the level 0 of heap, a node 11 stores a value of 5. In the level 1 of heap, a node 12 stores a value of 22, and a node 13 stores a value of 10. In the level 2 of heap, a node 14 stores a value of 26, a node 15 stores a value of 23, a node 16 stores a value of 24, and a node 17 stores a value of 17. In the level 3 of heap, a node 18 stores a value of 27, and a node 19 stores a value of 38.
FIG. 2 is a conventional architectural diagram illustrating an array-based heap data structure 20. It is well known in the art that balanced trees, such as heaps, may be constructed with arrays. The array-based heap data structure 20 eliminates the need to keep forward and backward pointers in the tree structure.
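The way an array can replace pointers may be illustrated with a minimal sketch. It assumes a 0-based, level-order layout and assumes that a lower value means higher priority, as the values of FIG. 1 suggest; the exact layout of FIG. 2 is not reproduced here, and the helper names `parent`, `children`, and `satisfies_p3` are hypothetical.

```python
# Level-order array holding the FIG. 1 values (node 11 first, node 19 last).
# Assumption: 0-based indexing; a lower value means higher priority.
heap = [5, 22, 10, 26, 23, 24, 17, 27, 38]


def parent(i, k=2):
    """Index of the parent of position i in a k-ary array heap."""
    return (i - 1) // k


def children(i, n, k=2):
    """Indices of the children of position i that exist in a heap of n values."""
    first = k * i + 1
    return [c for c in range(first, first + k) if c < n]


def satisfies_p3(values, k=2):
    """Check property P3: no child is of higher priority than its parent."""
    return all(values[parent(i, k)] <= values[i] for i in range(1, len(values)))
```

Because parent and child positions are computed from the index alone, no forward or backward pointers need to be stored.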
FIG. 3 is a conventional flow diagram illustrating the process of a heap remove operation 30. Once a root node 11 is removed, a “hole” is created in the root node position 11. To fill the hole in the root node 11, the bottom-most, right-most value (BRV) 12 is removed from the heap and is placed in the hole in the root node 11. Then, the BRV and the k descendent nodes are examined and the highest priority value, if not the BRV itself, is swapped with the BRV. This continues down the heap. This comparison and swapping of values is known as the “percolate” operation.
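The remove operation just described may be sketched as follows. This is an illustrative software version only, assuming a binary heap stored in a 0-based level-order array and assuming that a lower value means higher priority; the function name `heap_remove` is hypothetical.

```python
def heap_remove(heap):
    """Remove and return the highest priority (lowest) value of a binary array heap."""
    if not heap:
        raise IndexError("remove from an empty heap")
    root = heap[0]
    brv = heap.pop()  # bottom-most, right-most value (BRV)
    if heap:
        heap[0] = brv  # fill the hole in the root with the BRV
        i, n = 0, len(heap)
        while True:
            best = i
            # Examine the BRV and its descendant nodes for the highest priority value.
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and heap[c] < heap[best]:
                    best = c
            if best == i:
                break  # the BRV already has the highest priority here
            heap[i], heap[best] = heap[best], heap[i]
            i = best  # continue percolating down the heap
    return root
```

Applied to the FIG. 1 values `[5, 22, 10, 26, 23, 24, 17, 27, 38]`, the call returns 5 and leaves `[10, 22, 17, 26, 23, 24, 38, 27]`; repeated calls return the remaining values in ascending order.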
FIG. 4 is a conventional flow diagram illustrating the process for a heap insert operation 40. To add a value to be sorted into the heap, a slightly different kind of percolate operation is performed. The first hole 41 to the right of the bottom-most, right-most value is identified, and the new value is inserted there. This value is compared to the value in its parent node. If the new value is of higher priority than the parent value, the two values swap places. This continues until the new value is of lower priority, or until the root of the tree is reached. That is, the percolate continues up the tree structure rather than down it.
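The upward percolate of the insert operation may be sketched in the same way. The sketch assumes a binary heap in a 0-based level-order array in which a lower value means higher priority; the function name `heap_insert` is hypothetical.

```python
def heap_insert(heap, value):
    """Insert value into a binary array heap, percolating it up toward the root."""
    heap.append(value)  # the first hole to the right of the BRV
    i = len(heap) - 1
    while i > 0:
        p = (i - 1) // 2  # index of the parent node
        if heap[i] >= heap[p]:
            break  # the new value is not of higher priority than its parent
        heap[i], heap[p] = heap[p], heap[i]
        i = p  # continue percolating up the tree
```

Inserting 7 into the FIG. 1 values `[5, 22, 10, 26, 23, 24, 17, 27, 38]` swaps it past 23 and 22 and stops below the root, giving `[5, 7, 10, 26, 22, 24, 17, 27, 38, 23]`.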
The described methods of adding and removing values to and from a heap inherently keep a heap balanced: no additional data structures or algorithms are required to balance a heap. This means that heaps are as space-efficient as binary or k-ary trees even though the worst-case operational performance of a heap is better than that of a simple tree.
A third operation is also possible: “swap”. A swap operation consists of a remove operation whereby the BRV is not used to fill the resultant hole in the root node 11. Instead, a new value is immediately re-inserted. The percolate operation performed is identical to that of the remove case.
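A swap may be sketched as a remove in which the new value, rather than the BRV, fills the hole in the root. The sketch assumes a binary heap in a 0-based level-order array in which a lower value means higher priority; the function name `heap_swap` is hypothetical.

```python
def heap_swap(heap, new_value):
    """Remove the root of a binary array heap and re-insert new_value in its place."""
    if not heap:
        raise IndexError("swap on an empty heap")
    root = heap[0]
    heap[0] = new_value  # the new value fills the hole; the BRV is left untouched
    i, n = 0, len(heap)
    while True:
        best = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and heap[c] < heap[best]:
                best = c
        if best == i:
            break
        heap[i], heap[best] = heap[best], heap[i]
        i = best  # same downward percolate as in the remove case
    return root
```

Swapping 20 into the FIG. 1 values returns 5 and leaves `[10, 22, 17, 26, 23, 24, 20, 27, 38]`. Python's standard `heapq.heapreplace` performs the same kind of operation on a list-based binary heap.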
Because the percolate operations for remove and for insert traverse the data structure in different directions, parallelism and pipelining of the heap algorithm are inefficient and difficult, respectively.
High-speed implementations of heaps seek to find a way to execute the heap algorithm in hardware rather than in a software program. One such implementation is described in U.S. Pat. No. 5,603,023. This implementation uses a number of so-called “macrocells,” each consisting of two storage elements. Each storage element can store one value residing in a heap. The two storage elements in a macrocell are connected to comparison logic such that the greater (or lesser) of the two can be determined and subsequently be output from the macrocell. A single so-called “comparing and rewriting control circuit” is connected to each macrocell so the comparisons between parent nodes and child nodes can be accommodated. In every case, both child nodes of a given parent are in the same macrocell, and the parent is in a different macrocell.
The shortcomings of the heap data structure and of previous implementations are described in the following points:
    S1. Efficient pipelined heaps cannot be implemented due to opposing percolate operations.
        There are two completely different percolate operations described in the previous section: one is used to remove values from the heap in sorted order, and one is used to insert new values into the heap. The former operation percolates downward from the top of the heap, whereas the latter operation percolates upward from the bottom of the heap.
        A pipelined hardware operation is similar to an assembly line in a factory. In a pipelined heap (if such a structure existed), one insertion or removal operation would go through several stages to complete the operation, while another operation would be in the previous stage. Each operation goes through all the stages. That is, if stage S_j is currently processing operation i, stage S_{j-1} is currently processing operation i+1, stage S_{j-2} is currently processing operation i+2, and so on.
        However, since some operations flow through the heap in one direction (e.g., insertion), whereas other operations flow through the heap in the other direction (e.g., removal), an efficient pipeline that supports a mix of the two operations is difficult to construct. This is because a removal operation needs to have current, accurate data in the root node (property P3, section 4.1) before it can begin, but an insertion of a new value percolates from the bottom up (see section 4.1). Thus, an insert operation must run to completion before a subsequent removal operation can be started. This is the direct opposite of a pipeline.
        A unidirectional heap that operates only top-down is in the public domain. To operate in this fashion, the insert operation computes a path through the heap to the first unused position in the heap. Additionally, a simple method is proposed for tracking this first unused position. However, this tracking method assumes that heap property P4 holds. Although this property holds true for a traditional heap, removal of this property is desirable to eliminate shortcoming S2, described below. Thus, a unidirectional heap structure suitable for high-speed pipelining does not exist in the current state of the art.
    S2. Pipelined implementations of heaps are difficult to construct in high-speed applications due to the specifics of the “remove & percolate” operation.
        The operation that removes values from a heap in sorted order leaves a “hole” in the root node once the highest priority value has been removed. This hole is filled with the bottom-most, right-most value in the heap.
        In order to fill the hole caused by a remove operation, a hardware implementation of a heap must read the memory system associated with the current bottom of the tree to get the last value of the tree. This requires (a) that the location of the bottom always be known, and (b) that all the RAM systems, except the tree root, run faster than otherwise necessary. When each of the log_k(N) tree levels of the heap has a dedicated RAM system, the required speedup is two times the speed otherwise required. (Placing the log_k(N) tree levels of the heap in separate RAMs is the most efficient way to implement a pipelined heap, if such a thing existed, since it has the advantage of using the lowest speed RAMs for any given implementation.)
        Point (b) states that “all” memory systems must be faster because the bottom of the heap can appear in any of the log_k(N) memories.
        Point (b) states that the memory must be twice as fast because the RAM is read first to get the value to fill the hole. The RAM may then be written to account for the fact that the value has been removed. Later, if the downward percolation reaches the bottom level, the RAM will be again read and (potentially) written. Thus, a single operation may cause up to 4 accesses to RAM. Only 2 accesses are necessary if the remove operation is optimized to avoid reading and writing the bottom-most level to get the bottom-most, right-most value.
    S3. A conventional design may not be fully pipelined. That is, since there is only one “comparing and rewriting control circuit,” and since this circuit is required for every parent-child comparison in a percolate operation, it is difficult to have multiple parent-child comparisons from multiple heap-insert or heap-remove operations being processed simultaneously. This means that an insert or remove operation must complete before a new one is started.
    S4. A conventional design is structured so that it takes longer to remove values from deeper heaps than from shallower heaps.
    S5. A conventional design is incapable of automatically constructing a heap. An external central processor must repeatedly interact with the design to build a sorted heap. (Once the heap is correctly constructed, however, the values may be removed in order without the intervention of the central processor.)
    S6. A conventional design employs so-called “macrocells” that contain two special memory structures. Each macrocell is connected to a single so-called “comparing and rewriting control circuit” that is required to perform the parent-child comparisons required for percolate operations.
        This structure means that a macrocell is required for every pair of nodes in the heap, which in turn means that:
            The structure does not efficiently scale to large heaps since large quantities of these special memory structures consume more area on a silicon die than would a traditional RAM memory sized to hold the same number of heap values.
            The structure is costly to rework into a k-ary heap where k>2 since comparison logic grows more complex with the number of values being compared.
    S7. A conventional design does nothing to prevent the painful problem of using a value from the bottom of the heap to fill the root node during a remove operation. The conventional design provides dedicated hardware to facilitate this nuance of heaps.
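The assembly-line behavior referred to in shortcoming S1 may be illustrated with a small sketch. It assumes an idealized pipeline in which one operation enters stage 0 on every cycle and every stage takes exactly one cycle; the function name `pipeline_occupancy` and its parameters are hypothetical and do not describe any particular hardware.

```python
def pipeline_occupancy(num_stages, num_ops, cycle):
    """Return, for each stage j, the operation it holds at the given cycle (or None).

    Assumption: operation i enters stage 0 at cycle i and advances one stage per cycle.
    """
    occupancy = []
    for j in range(num_stages):
        op = cycle - j  # the operation that entered j cycles ago is now in stage j
        occupancy.append(op if 0 <= op < num_ops else None)
    return occupancy
```

At cycle 5 of a four-stage pipeline the result is `[5, 4, 3, 2]`: the stage holding operation i is always one stage ahead of the stage holding operation i+1. This overlap is only possible when every operation moves through the stages in the same direction, which is what the opposing percolate operations of a conventional heap prevent.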
Scheduling and arbitration are common techniques in the field of computing and networking, used wherever a series of events must occur in a particular order. The order of events is typically determined by a number assigned to each event, based on desired start time, desired end time, or some other criterion. These events are typically stored in an event queue and are executed in ascending or descending order of the assigned values. Schedulers often use several separate event queues to maintain order amongst a related set of events.
In computing and networking, these events are often periodic. This means that once the event has occurred, it is rescheduled to occur again sometime in the future. There are currently many techniques for scheduling events in computing and networking, each relying on some type of sorting technique. Events may be sorted initially (scheduling), leaving the dispatching entity to simply dispatch events in the given order; or the events may be dispatched in order by an entity that examines all of the events or a sub-set of events to determine the next event to dispatch, or the “winning” event (arbitration).
In one solution, an arbiter or a scheduler performs a linear search or linear sort algorithm over a small number of events. This solution can be implemented in both hardware and software, but does not scale well as the number of events increases. In addition, various data structures, such as heaps and binary search trees, can be used for scheduling and arbitration. Although the use of these data structures can be faster than simply performing a linear search, there are still many drawbacks.
If the number of events is small, hardware implementations of a scheduler can exploit parallelism to quickly examine all events and select the winner. Trees of such hardware logic can be constructed to increase the number of events that may be arbitrated. Unfortunately, the cost in power and die area on an integrated circuit becomes extremely great as the number of elements to compare increases. In addition, the arrangement of comparators in trees carries with it inherent propagation delays, making this solution impractical for high-speed applications with a large number of events.
A systolic array is another implementation suitable only for hardware. Unfortunately, like the comparator trees, systolic arrays require a considerable amount of hardware, costing a large amount of die area on an integrated circuit. In addition, if multiple event queues are required, each queue must be sized for the worst case number of events, even though it may be impossible to fully populate all the queues simultaneously, thus leading to greater hardware inefficiencies.
One of the most commonly used data structures for scheduling and arbitration is known as a “calendar.” A calendar consists of a timeline and a pointer. Each entry (time-slot) in the timeline contains a list of all events that should occur at that time. As time advances, the pointer is incremented to reference the appropriate time-slot.
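A calendar of this kind may be sketched as an array of per-slot event lists plus a pointer. The sketch is illustrative only: the class and method names are hypothetical, and the circular wrap-around of the timeline is an assumption not stated above.

```python
from collections import deque


class Calendar:
    """A timeline of time-slots, each holding a list of events, plus a pointer."""

    def __init__(self, num_slots):
        self.slots = [deque() for _ in range(num_slots)]
        self.pointer = 0  # references the current time-slot

    def schedule(self, event, time_slot):
        """Add an event to the list for its time-slot (assumed circular timeline)."""
        self.slots[time_slot % len(self.slots)].append(event)

    def tick(self):
        """Return the events due in the current time-slot, then advance the pointer."""
        due = list(self.slots[self.pointer])
        self.slots[self.pointer].clear()
        self.pointer = (self.pointer + 1) % len(self.slots)
        return due
```

Scheduling and finding the current slot each take a constant number of steps, independent of the number of events stored.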
For many of today's computing and networking applications, speed of execution is absolutely critical. Linear searching has an execution time of O(N), while heaps and binary trees have an execution time of O(log N). Thus, as the number of events that must be scheduled grows, the time it takes to arbitrate amongst them increases. This makes such techniques unsuitable for many high-speed applications. Moreover, heaps, binary trees, and linear sorts cannot take advantage of pipelining to increase speed of execution.
Although calendars operate with an execution time of O(1), the storage space required for implementation grows rapidly as scheduling resolution increases. Since the storage space for calendars grows linearly with the scheduling precision of the calendar, it is very expensive and hardware inefficient to support a high scheduling precision over long periods of time.
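The linear growth in storage can be made concrete with a short calculation. The figures below are purely illustrative assumptions (a one-second scheduling horizon and resolutions of one millisecond and one microsecond), and the function name `calendar_slots` is hypothetical.

```python
def calendar_slots(horizon_ns, resolution_ns):
    """Number of time-slots needed to cover the horizon at the given resolution.

    Both arguments are integers in nanoseconds; the result is rounded up.
    """
    return -(-horizon_ns // resolution_ns)  # integer ceiling division


ONE_SECOND_NS = 1_000_000_000
slots_at_1_ms = calendar_slots(ONE_SECOND_NS, 1_000_000)  # 1,000 slots
slots_at_1_us = calendar_slots(ONE_SECOND_NS, 1_000)      # 1,000,000 slots
```

Under these assumptions, making the resolution one thousand times finer multiplies the number of time-slots, and hence the storage for the timeline, by one thousand.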
Moreover, because calendars are based on the concept of ever-increasing time, when multiple events occupy the same timeslot, time must stall while all events are dispatched. However, there are cases when an event takes a non-zero amount of time to complete, and where time cannot simply stop, such as when scheduling traffic on the Internet. In such cases when multiple events occupy the same timeslot, only one event can be dispatched, while the remaining events must be moved to the next available timeslot. This adds complexity to the algorithm as well as increased accesses to RAM, causing the execution time to increase significantly, thus rendering calendars unsuitable for certain high-speed applications.
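The extra work caused by a shared timeslot may be sketched as follows. The sketch assumes a circular timeline stored as a list of lists and assumes that deferred events are placed ahead of events already in the next timeslot; both choices, and the function name `dispatch_one`, are hypothetical.

```python
def dispatch_one(slots, pointer):
    """Dispatch at most one event from the current timeslot and defer the rest.

    Returns (event, next_pointer); event is None if the timeslot was empty.
    """
    current = slots[pointer]
    event = current[0] if current else None
    leftover = current[1:]  # events that could not be dispatched in this timeslot
    slots[pointer] = []
    nxt = (pointer + 1) % len(slots)
    # Assumption: deferred events go ahead of events already in the next timeslot.
    slots[nxt] = leftover + slots[nxt]
    return event, nxt
```

Each deferred event costs an additional read and write of the timeline, which is the source of the extra RAM accesses noted above.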
A similar problem occurs when multiple priorities are used in the calendar to create a scheduler that gives priority to certain queues. When multiple events from multiple queues are placed in the same calendar timeslot, the calendar must do some additional work to determine which event should be serviced next. Furthermore, when the remaining events are moved to the next timeslot, additional work must be done to sort these entries in priority order with respect to any existing entries. An alternative to sorting is to have parallel timeslots, one for each priority that the calendar supports. This reduces algorithmic complexity and processing time, but it multiplies the storage space by the number of supported priorities.
Calendars do not handle “work conserving” scheduling and arbitration without a penalty of either time or storage. “Work conserving” has meaning when events are scheduled according to time. Work conserving means that as long as there is an event to dispatch, an event will be dispatched if it is the next winner, even though its previously calculated service time has not yet arrived. To provide a work conserving scheduler with a calendar, either the algorithm must run very fast in order to move the pointer through the timeslots until a scheduled event is found, or additional supporting data structures, which consume additional storage space and add algorithmic complexity, are required to quickly find the next event. The memory accesses to the additional storage space can cause the algorithm to run more slowly, making it unsuitable for some applications.
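The time penalty of the first option may be illustrated with a minimal sketch that scans forward from the pointer until a scheduled event is found. It assumes a circular timeline stored as a list of lists; the function name `next_event_work_conserving` is hypothetical.

```python
def next_event_work_conserving(slots, pointer):
    """Scan forward from the pointer to the first timeslot that holds an event.

    Returns (event, slot_index, slots_scanned); event is None if the calendar is empty.
    """
    n = len(slots)
    for scanned in range(n):
        idx = (pointer + scanned) % n
        if slots[idx]:
            return slots[idx].pop(0), idx, scanned + 1
    return None, pointer, n
```

The number of timeslots scanned grows with the gap between the pointer and the next scheduled event, so a sparsely populated calendar must either be scanned very quickly or be supplemented with a separate index of occupied timeslots.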