The processing of queries, such as in databases or in general data processing, can be a very time- and processor-intensive task. As such, it is often desirable to introduce at least some level of parallel processing to these tasks. Typically, in conventional solutions, a query is parsed into a tree containing data operators, and then branches or entire subtrees of this tree are duplicated and run in parallel. This allows, for example, one subtree to operate on one portion of the data while an identical but separate instance of the subtree operates on another portion of the data. The results of the parallel executions of the subtrees are generally combined. This is known as the Volcano query-processing model.
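The duplication of a subtree over partitions of the data can be illustrated with a minimal sketch. The operator classes (Scan, Filter), the round-robin partitioning, and the even-number predicate are hypothetical choices made only for illustration; they are not taken from any particular database system.

```python
from concurrent.futures import ThreadPoolExecutor


class Scan:
    """Leaf operator: yields rows from an in-memory list (hypothetical source)."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class Filter:
    """Parent operator: pulls rows from its child on demand and keeps matches."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row


def build_subtree(partition):
    # Every partition gets an identical but separate instance of the subtree.
    return Filter(Scan(partition), lambda row: row % 2 == 0)


def run_parallel(data, degree):
    # Assumption: a simple round-robin split of the input into `degree` parts.
    partitions = [data[i::degree] for i in range(degree)]
    with ThreadPoolExecutor(max_workers=degree) as pool:
        parts = list(pool.map(lambda p: list(build_subtree(p)), partitions))
    # Combine the results of the parallel subtree executions.
    combined = []
    for part in parts:
        combined.extend(part)
    return combined
```

Calling run_parallel(list(range(10)), 3) runs three copies of the same Filter-over-Scan subtree, each on its own slice of the input, and concatenates their outputs. The order of the combined rows depends on the partitioning, not on the original input order.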
Such conventional parallelization models, however, have several drawbacks. It is common for database queries to have dependencies between the data operators. For example, when data is requested from an operator (for example, a join), that operator (e.g., the parent operator) must request data from its child operators. Additional synchronization may also be required. For example, a parallel hash join typically must build a hash table locally for the portion of the input seen by each thread of execution, and then the hash tables must be merged while all other threads wait. Inter-operator calls at different levels are not simple to execute, and require a high degree of resources to maintain synchronization. Additionally, the potential for deadlocks is high, because the structure of operators/calls is different for each query type, and dependencies between different levels of the operator tree/call stack can exist simultaneously in ways that are not easy to predict. Furthermore, some variants of the Volcano query-processing model allow operators to call their child operators in any order, according to the needs of the parent operator, introducing additional dependencies.
Intra-query parallelism solutions that are currently implemented provide that the only communication between operators occurs when parent operators request an action (typically the supplying of rows) from their child operators. This simplifies reasoning about arbitrarily complex operator trees, and only requires system developers to think about the local behavior of each operator. However, as briefly described earlier, there is a need for coordination of the various branches of a parallel operator. This coordination is provided by the operator on a single branch (the “master”), which is specially initialized for this purpose. Parallel operators each operate on their own thread, and a parallel plan is optimized and built with a maximum parallel degree chosen by an optimizer. Each parallel branch then has its own tree of operators, which mirrors the tree of its siblings. For example, a tree of operators can include one or more exchange operators. An exchange operator can exchange data across process and processor boundaries. When the first fetch is performed on a cursor and the fetch reaches the exchange operator, the exchange operator determines how many worker threads are available to be used by the plan and initializes these worker threads, one per branch (up to the maximum degree of the plan).
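The start-up behavior of an exchange operator described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of any real system: the Exchange class, the make_branch callback, and the available_threads argument are hypothetical names, and rows are passed up through an in-process queue rather than across process boundaries.

```python
import queue
import threading

_DONE = object()  # sentinel a worker posts when its branch is exhausted


class Exchange:
    """Hypothetical exchange operator: starts one worker thread per branch."""

    def __init__(self, make_branch, max_degree):
        self.make_branch = make_branch  # builds the row iterator for one branch
        self.max_degree = max_degree    # maximum degree chosen by the optimizer
        self.workers = []
        self._out = queue.Queue()

    def _work(self, index, degree):
        try:
            for row in self.make_branch(index, degree):
                self._out.put(row)
        finally:
            self._out.put(_DONE)  # always signal completion, even on error

    def rows(self, available_threads):
        # Nothing below runs until the first fetch (the first next() call).
        degree = max(1, min(self.max_degree, available_threads))
        self.workers = [
            threading.Thread(target=self._work, args=(i, degree))
            for i in range(degree)
        ]
        for worker in self.workers:
            worker.start()
        finished = 0
        while finished < degree:
            item = self._out.get()
            if item is _DONE:
                finished += 1
            else:
                yield item
        for worker in self.workers:
            worker.join()


def make_branch(index, degree):
    # Assumption: each branch reads its own round-robin slice of a small table.
    data = list(range(12))
    return iter(data[index::degree])


exchange = Exchange(make_branch, max_degree=3)
result = sorted(exchange.rows(available_threads=8))
```

Here eight threads are available but the plan's maximum degree is three, so three workers are started. Rows arrive at the exchange in whatever order the branches produce them, which is why the example sorts the result before using it.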
In one specific example embodiment, each worker thread (e.g., ExchangeRequest) then proceeds (more or less) independently, using a model of fetching rows from its child operators, processing them, and passing them up to the Exchange. In this case, some of the parallel operators should be synchronized.
One reason for synchronization is to reflect an actual data dependency. For example, a merged hash table can only be built once all the branches contributing to it have built their portion of it; no branch can probe a merged hash table until all branches have finished building and one thread has performed the merge. There can also be instances where synchronization is an artifact of the implementation. For example, each thread is responsible for deleting every object it creates, and only those objects. Furthermore, it can only delete those objects once the rest of the threads are done accessing them.
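The build, merge, and probe dependency can be made concrete with a small sketch. It uses a reusable barrier in place of the named semaphores a real system might use, and the function name and the (key, value) row layout are assumptions for illustration only.

```python
import threading
from collections import defaultdict


def parallel_hash_join(build_parts, probe_parts):
    """Join (key, value) rows; one branch per pair of input partitions."""
    degree = len(build_parts)  # assumes len(probe_parts) == len(build_parts)
    local_tables = [None] * degree
    merged = defaultdict(list)
    results = [None] * degree
    barrier = threading.Barrier(degree)

    def branch(i):
        # Phase 1: build a hash table for this branch's portion of the input.
        table = defaultdict(list)
        for key, value in build_parts[i]:
            table[key].append(value)
        local_tables[i] = table

        # Wait until every branch has finished building. Barrier.wait()
        # returns a distinct index per thread, so exactly one thread merges.
        if barrier.wait() == 0:
            for part in local_tables:
                for key, values in part.items():
                    merged[key].extend(values)

        # No branch may probe until the merge is complete.
        barrier.wait()

        # Phase 2: probe the merged table with this branch's probe rows.
        results[i] = [
            (key, build_value, probe_value)
            for key, probe_value in probe_parts[i]
            for build_value in merged.get(key, [])
        ]

    threads = [threading.Thread(target=branch, args=(i,)) for i in range(degree)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [row for part in results for row in part]
```

The two barrier waits correspond to the two dependencies in the text: the merge cannot start until all local builds are done, and no probe can start until the merge is done. While the single merging thread works, every other thread sits idle at the second barrier.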
In one specific example embodiment, all of this synchronization is implemented by large numbers of specialized, named semaphores (typically condition variables) within each of the parallel operators. A large number of bugs are caused by unexpected interactions between these coordinating semaphores and the cleanup of objects accessed by all threads. These bugs are typically deadlocks, but also include crashes. Fixes for the deadlocks can be applied, but these often introduce new faults into the code that later show up as new bugs. Either the fix to the deadlock is too aggressive, in which case faults are encountered where an item that needs to be synchronized is no longer synchronized, or new deadlocks are introduced but pushed up or down one level of the code.
One issue is that the synchronization patterns and the interactions between the synchronization requirements of different operators (especially if they are at different levels of a plan) are very hard to predict. The use of a master branch to control shared state between all of the sibling branches is one of the problem areas. This design means that not all branches can be fetched from equally; the master branch depends on its parent(s) fetching from it in a certain order, relative to its siblings. However, some operators have their own ordering requirements and do not know about the ordering requirements of their children.
Another weakness is that the processing that is performed at a lower level of the branch tree can be required even if the upper level of the branch does not use it (either because its evaluation was short-circuited or because it hit a runtime error). This is because all branches typically use the results of shared processing that is performed by lower levels. This can be handled by utilizing pipeline parallelism, where each region of a tree runs in a separate thread, so processing is performed at lower levels of a parallel branch even if the upper levels of that particular branch did not request it. Regions of the tree can be imposed by the synchronization points. For example, in FIG. 4, subtree 204, 208, and 212 form one such region. Part of the work performed by a parallel join hash operator belongs to the region under the parallel join hash operator (the part that belongs to the build side). This has its own weakness, however, in that lower levels of the branch can end up performing work that is not needed.
A further characteristic of current implementations is a lack of clean separation between static and dynamic portions of a plan. This is not itself a source of bugs, but it does require the stateful and stateless portions of execution objects to be more closely tied than necessary, which increases code complexity. The static information persists across multiple executions of a cursor, but the objects storing the static information are duplicated for each branch of a parallel plan, thus keeping many versions of the static plan context. By contrast, dynamic objects that are created during a fetch typically only endure while the cursor is still fetching.
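For contrast, the distinction between static and dynamic plan information can be illustrated with a short sketch of what a cleaner separation might look like. This is not a description of the current implementations discussed above; the class names, fields, and the open_cursor helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticPlan:
    """Static context: immutable, persists across executions of a cursor."""
    table: str
    columns: tuple
    max_degree: int


@dataclass
class BranchState:
    """Dynamic context: created for a fetch, discarded when fetching ends."""
    plan: StaticPlan      # shared reference to the one static plan, not a copy
    branch_index: int
    rows_fetched: int = 0


def open_cursor(plan, degree):
    # One dynamic state object per branch; all of them share a single plan.
    return [BranchState(plan, i) for i in range(min(degree, plan.max_degree))]


plan = StaticPlan("orders", ("id", "total"), max_degree=4)
first = open_cursor(plan, 2)
first[0].rows_fetched += 10      # dynamic state changes during a fetch
second = open_cursor(plan, 2)    # a later execution starts with fresh state
```

In this sketch every branch of every execution refers to the same StaticPlan object, so there is only one version of the static context, while the per-branch counters live in short-lived BranchState objects.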