The present exemplary embodiments relate generally to search algorithms. They find particular application in conjunction with graph searches, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
Graph searching accounts for much of the heavy lifting in many areas of high-performance computing and artificial intelligence (AI), such as planning, scheduling, combinatorial optimization, and model checking. This is because these tasks generally involve searching graphs whose size grows exponentially in the depth of the search. Thus, finding an optimal or even approximate solution can take a long time, and the ability to scale up to larger problems depends to a large degree on the speed of the underlying graph-search algorithm.
One ubiquitous approach to speeding up graph searching is to efficiently utilize the increasing number of parallel processing units available in modern systems, such as multiple multi-core CPUs and GPUs. Under this approach, the prime challenge to efficiency is duplicate detection, specifically the overhead of communicating potential duplicates to all involved processes. Most existing parallel graph search algorithms circumvent this issue by ignoring duplicates, so that communication is restricted to distributing the root states of local searches and their termination signals. This is acceptable as long as problem graphs are trees, which lend themselves conveniently to parallelization: the topology of a tree guarantees that there is only one unique path from the root to any node, and thus no duplicates will be encountered. However, for most search problems, the most natural and succinct representation of the search space is not a tree; rather, it is a graph having many alternative paths between a pair of nodes. Failing to consider duplicates in graphs having multiple ways of reaching the same node can cause the search space to become exponentially large. Furthermore, in the worst case, the presence of duplicates can render the searches of all but one participating process superfluous (e.g., when the root nodes of all other processes happen to be duplicates of nodes in that one process), leading these algorithms to perform (in some cases exponentially) worse than state-of-the-art single-threaded algorithms.
A traditional method of addressing duplicates involves storing global Open and Closed lists to check for duplicates. However, this method may suffer from prohibitive communication and/or synchronization overhead in parallel search, since efforts must be made to avoid race conditions among multiple processing units. Further, even if the Open and Closed lists are divided into smaller pieces and distributed across different processors, significant communications overhead can occur, if, for example, one processor generates nodes that belong to a different processor.
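The traditional global Open/Closed-list method can be illustrated, in its serial form, by a minimal best-first-search sketch. All names here (the toy graph, the combined best-cost table) are hypothetical illustrations, not part of any particular algorithm described above; the point is only to show where duplicate checks occur, and hence what must be synchronized in a parallel setting.

```python
import heapq

def best_first_search(start, goal, successors, h):
    """Serial best-first search with global duplicate detection.
    `successors(state)` yields (next_state, cost) pairs; `h` is a heuristic.
    `best_g` plays the role of the combined Open/Closed duplicate table:
    in a parallel search, every access to it would need synchronization."""
    open_heap = [(h(start), 0, start)]   # entries are (f, g, state)
    best_g = {start: 0}                  # best known cost-to-reach per state
    closed = set()
    while open_heap:
        f, g, state = heapq.heappop(open_heap)
        if state == goal:
            return g
        if state in closed:
            continue                     # stale duplicate entry on Open
        closed.add(state)
        for nxt, cost in successors(state):
            g2 = g + cost
            if nxt in best_g and best_g[nxt] <= g2:
                continue                 # duplicate reached no more cheaply
            best_g[nxt] = g2
            heapq.heappush(open_heap, (g2 + h(nxt), g2, nxt))
    return None                          # goal unreachable
```

On a small graph with multiple paths to the same node (e.g., A→B→C→D versus A→C→D), the duplicate checks prevent C and D from being expanded more than once.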
A class of parallel graph search algorithms using a hash function to distribute the search nodes among multiple processors (or cores) mitigates some of the foregoing concerns. One such example is the PRA* algorithm (for more information, see Matthew P. Evett et al., PRA*: Massively Parallel Heuristic Search, J. Parallel Distrib. Comput. 25(2), 133-143 (1995)). However, since general purpose hash functions are static and do not adapt to a particular problem instance, these algorithms are generally incapable of exploiting problem-specific structures for improved parallel efficiency.
To illustrate, assume a 100 machine cluster having perfect load balancing (i.e., each machine gets 1% of the total workload). When the successors of a node are generated, there is a 99% chance that they belong to machines other than the one that generated them, since the hash function would distribute these newly generated successors equally among all 100 machines. In general, the number of machine-to-machine communication channels needed for PRA* (or any parallel algorithm that uses a regular hash function to distribute search nodes among machines) is on the order of the number of machines squared. For a cluster of one thousand machines, PRA* needs roughly 1,000,000 one-way machine-to-machine channels, which are difficult to sustain in a high-performance computing network.
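The scatter effect of a static hash partition can be checked numerically. The sketch below uses an arbitrary cryptographic hash as the (non-locality-preserving) partition function and synthetic state names; both are hypothetical stand-ins, chosen only to show that a generic hash leaves roughly 1/NUM_MACHINES of successors on their parent's machine.

```python
import hashlib

NUM_MACHINES = 100

def owner(state):
    """Static hash partition (illustrative): which machine owns a state."""
    return int(hashlib.md5(state.encode()).hexdigest(), 16) % NUM_MACHINES

def cross_machine_fraction(num_states=10_000, branching=4):
    """Fraction of generated successors that hash to a machine other than
    the one that generated them (synthetic state/successor names)."""
    cross = total = 0
    for i in range(num_states):
        home = owner(f"s{i}")
        for j in range(branching):
            total += 1
            cross += owner(f"s{i}/succ{j}") != home
    return cross / total

print(f"{cross_machine_fraction():.1%} of successors are non-local")
```

With 100 machines the non-local fraction comes out near 99%, matching the 1% chance that a hashed successor lands on its parent's machine.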
Parallel structured duplicate-detection overcomes this problem by using a locality-preserving abstraction to extract parallelism. Namely, using a state-space projection function, a state space of a graph is statically divided into disjoint regions, each forming an abstract state. Then, two abstract states y and y′ are connected by an abstract edge (or abstract operator) if and only if (a) there exists a pair of states x and x′ such that y and y′ are the images (abstractions) of x and x′ under the state-space projection function, respectively, and (b) x′ is a direct successor of x in the original state space. The state-space projection function is selected in such a way that the successors of any state mapping to a disjoint region are guaranteed to map to only a small subset of regions (i.e., preserving the locality of the search graph) and that such mapping can be computed very efficiently (e.g., by simply ignoring some state variables or by shrinking the domain sizes of some state variables).
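A projection of the kind described above can be sketched on a hypothetical 3x3 sliding-tile domain: projecting a state onto the blank's position alone ignores all tile variables, yet the successors of any state map to at most 4 of the 9 abstract states, so the projection preserves locality. The function names below are illustrative.

```python
def project(state):
    """State-space projection: ignore all tile variables, keep only the
    blank's position (index of 0 in the tuple)."""
    return state.index(0)

def successors(state):
    """The blank swaps with an orthogonally adjacent tile on a 3x3 board."""
    b = state.index(0)
    r, c = divmod(b, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r2, c2 = r + dr, c + dc
        if 0 <= r2 < 3 and 0 <= c2 < 3:
            s = list(state)
            b2 = 3 * r2 + c2
            s[b], s[b2] = s[b2], s[b]
            yield tuple(s)

def abstract_graph():
    """Induced abstract graph: abstract successors of each abstract state.
    One representative state per blank position suffices, since this
    projection depends only on where the blank is."""
    succ = {}
    for b in range(9):
        tiles = list(range(1, 9))
        tiles.insert(b, 0)
        succ[b] = {project(n) for n in successors(tuple(tiles))}
    return succ
```

The resulting abstract graph is simply the 3x3 grid: corner abstract states have 2 abstract successors, edge states 3, and the center 4, out of 9 abstract states in total.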
By mapping each encountered node to its corresponding abstract node, the abstract graph can be used to efficiently determine a duplicate-detection scope for each node. That is, potential duplicates can be detected in the set of all nodes mapping to abstract nodes that are successors of the abstract node to which the currently expanding node maps. Any two nodes with pairwise disjoint duplicate-detection scopes can then be expanded in parallel without any need for communication. Through the use of coarse abstractions (i.e., a large number of nodes mapping to the same abstract node), a layer of the search graph can be expanded with very little communication overhead by assigning abstract nodes with disjoint neighborhoods to different processes.
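The scope computation and the disjointness test can be sketched over a hypothetical 3x3 grid abstract graph (each abstract node adjacent to itself and its orthogonal neighbors); the graph and function names are illustrative, not taken from any particular PSDD implementation.

```python
def grid_abstract_succ():
    """Hypothetical abstract graph: 3x3 grid with self-loops."""
    succ = {}
    for b in range(9):
        r, c = divmod(b, 3)
        nbrs = {b}  # self-loop: a successor may map to the same abstract node
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < 3 and 0 <= c2 < 3:
                nbrs.add(3 * r2 + c2)
        succ[b] = nbrs
    return succ

ABSTRACT_SUCC = grid_abstract_succ()

def scope(y):
    """Duplicate-detection scope of abstract node y: the abstract nodes in
    which duplicates of any node mapping to y could be generated."""
    return ABSTRACT_SUCC[y]

def can_expand_in_parallel(y1, y2):
    """Nodes mapping to y1 and y2 need no communication iff their
    duplicate-detection scopes are disjoint."""
    return scope(y1).isdisjoint(scope(y2))
```

Here, nodes mapping to opposite corners of the grid (abstract nodes 0 and 8) can be expanded in parallel, while nodes mapping to the two top corners (0 and 2) cannot, because both scopes contain abstract node 1.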
However, even with the most sophisticated locality-discovering algorithm, there is no guarantee that such a local structure exists in any given problem. This is so notwithstanding that many planning problems have been shown to possess the appropriate local structure that can be leveraged by parallel structured duplicate detection (PSDD), as well as by other locality-aware search algorithms. The search graph of the well-known Hidden Markov Model (HMM), illustrated in FIG. 1, is one example of a problem lacking local structure.
HMM decoding seeks to compute the most probable sequence of hidden states that results in a sequence of observed events, and, as can be seen, the search graph has a layered structure (a layer can correspond to all the states the system can be in at time point ti, for example). Because any node in one layer has all the nodes in the next layer as its successors, the graph has no locality at all between any two consecutive layers. Not surprisingly, PSDD cannot be applied in this case: a successor node in the next layer could be generated from any node in the current layer, which prevents PSDD from partitioning the search space in a way that would allow parallel node expansions.
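The layered HMM search corresponds to the classic Viterbi recurrence, sketched below in log space. The toy two-state model in the test is purely illustrative; the point is that every hidden state in one layer is a predecessor of every state in the next, yet keeping only the best entry per state (immediate duplicate elimination) bounds each layer at the number of hidden states.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence for an observation sequence.
    Each layer keeps exactly one best entry per hidden state, i.e.,
    duplicates are merged as soon as they are generated."""
    # layer for the first observation
    V = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V2, path2 = {}, {}
        for s2 in states:
            # every state in the previous layer is a predecessor of s2;
            # only the most probable predecessor survives
            best = max(states, key=lambda s1: V[s1] + math.log(trans_p[s1][s2]))
            V2[s2] = V[best] + math.log(trans_p[best][s2]) + math.log(emit_p[s2][o])
            path2[s2] = path[best] + [s2]
        V, path = V2, path2
    return path[max(states, key=lambda s: V[s])]

# Hypothetical two-state model for illustration.
STATES = ('Healthy', 'Fever')
START = {'Healthy': 0.6, 'Fever': 0.4}
TRANS = {'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
         'Fever':   {'Healthy': 0.4, 'Fever': 0.6}}
EMIT = {'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
        'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}}

print(viterbi(('normal', 'cold', 'dizzy'), STATES, START, TRANS, EMIT))
```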
Formally, the locality of an abstraction can be expressed as the ratio between the maximum out-degree and the size of the induced abstract graph. An abstraction captures the locality of the original search graph if the ratio of the corresponding abstract graph is minimal. Abstract graphs that are fully connected with self-loops, such as those for HMMs, have a ratio of 1 and possess no locality, since the set of successors of any abstract state is the entire set of abstract states.
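The ratio just defined is straightforward to compute; the sketch below contrasts a fully connected, HMM-like abstract graph (ratio 1, no locality) with a hypothetical 3x3 grid abstract graph with self-loops (ratio 5/9).

```python
def locality_ratio(abstract_succ):
    """Maximum out-degree divided by the number of abstract nodes.
    A ratio of 1 (fully connected with self-loops) means no locality."""
    return max(len(s) for s in abstract_succ.values()) / len(abstract_succ)

# HMM-like abstract graph: every abstract state reaches every abstract state.
hmm_like = {y: set(range(6)) for y in range(6)}

# Grid-like abstract graph (illustrative): self-loop plus orthogonal neighbors.
grid = {}
for b in range(9):
    r, c = divmod(b, 3)
    grid[b] = {b} | {3 * r2 + c2
                     for r2, c2 in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= r2 < 3 and 0 <= c2 < 3}
```

The grid's center node dominates the ratio (self-loop plus four neighbors, out of nine abstract nodes), so the abstraction retains substantial locality; the HMM-like graph retains none.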
Another technique, called delayed duplicate detection (DDD), could in principle allow parallel node expansions in this case. However, it has the drawback that duplicates are not eliminated as soon as they are generated. This is particularly problematic for HMMs, because the number of duplicates generated and stored for a layer of the search graph is then equal to the number of hidden states squared, as opposed to just the number of hidden states if duplicates were eliminated immediately. For systems with a large number of hidden states, the difference can be enormous.
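The storage gap per layer is simple arithmetic: each of the H nodes in one layer generates all H nodes of the next, so a layer holds H squared entries before delayed duplicate elimination, versus H entries with immediate elimination. The size below is a hypothetical figure chosen only for illustration.

```python
H = 100_000  # hypothetical number of hidden states

with_ddd = H * H  # delayed duplicate detection: all generated nodes stored
with_idd = H      # immediate duplicate detection: one entry per hidden state

print(f"{with_ddd // with_idd:,}x more nodes stored per layer under DDD")
```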
Accordingly, it would be advantageous to have an algorithm that enables large-scale parallel search with immediate duplicate detection and low synchronization overhead for problems that do not admit simple decomposition schemes.