Data-intensive computing systems are nowadays increasingly based on data that is structured as a graph, i.e. a collection of nodes and edges interconnecting the nodes. For example, graph databases exist in the prior art which use graph structures with nodes, edges and properties to represent and store data. In such graph data structures, the data is typically queried using query graphs. In other words, the user formulates a query in the form of a graph with at least two nodes, and the system then fills the formulated query graph with those data items in the database which fulfill the search conditions of the query graph (cf. e.g. http://en.wikipedia.org/wiki/Graph_databases).
To efficiently process graph queries, various approaches have been developed in the prior art (cf. e.g. http://en.wikipedia.org/wiki/Bidirectional_search). However, a common obstacle of the known approaches is that processing a graph query typically involves a large amount of redundant data, which consumes vast amounts of memory and thus leads to poor performance.
In relation to the general field of graph data processing, U.S. Pat. No. 7,702,620 relates to a system and method for ranked keyword search on graphs. The approach disclosed therein deals with ranked keyword search on top of a graph-structured database, and with quickly finding the closest keywords across prepared/indexed graphs of words, which is improved by prioritized words and blocks. However, the patent does not provide a general approach for efficiently processing graph queries.
Generally speaking, a graph query is a data structure that indicates at least a source node, a target node, and one or more edges there-between.
Each node and edge may comprise one or more relation conditions, which represent search conditions of the query graph. According to the approach predominantly followed in the prior art, such a graph query is processed as follows: For a given source node, each outgoing edge is examined and the respective relation conditions are evaluated towards the connected target node(s), which results in an intermediate set of result items. This approach is repeated for each source node. When all source nodes have been processed, the respective intermediate result sets are intersected in order to retrieve the final result of the graph query. As the person skilled in the art will appreciate, this prior art approach consumes vast amounts of memory due to the temporary storage of the intermediate result sets, which also has a negative effect on the efficiency of the graph query processing.
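The prior art approach described above can be illustrated by the following minimal sketch, simplified to a single source node connected to the target node by several edges. The representation of an edge as a dictionary from source item IDs to target item IDs, and the names `traverse_forward` and `intersect_all_edges`, are assumptions made purely for illustration and are not taken from any particular system.

```python
from typing import Dict, List, Set

# Hypothetical representation (an assumption, not from the text): an edge maps
# each source item ID to the set of target item IDs it is related to.
Edge = Dict[int, Set[int]]


def traverse_forward(source_items: Set[int], edge: Edge) -> Set[int]:
    """Collect every target item reachable from the source items via one edge."""
    reached: Set[int] = set()
    for source in source_items:
        reached |= edge.get(source, set())
    return reached


def intersect_all_edges(source_items: Set[int], edges: List[Edge]) -> Set[int]:
    """Conventional approach: one intermediate set per edge, then intersect."""
    if not edges:
        return set()
    # All intermediate result sets are held in memory at the same time.
    intermediate_sets = [traverse_forward(source_items, edge) for edge in edges]
    return set.intersection(*intermediate_sets)


edges = [
    {1: {10, 11}, 2: {12}},
    {1: {11}, 2: {12, 13}},
]
result = intersect_all_edges({1, 2}, edges)
```

In this sketch, one intermediate set is materialized per edge before the intersection is computed, which is the source of the memory consumption noted above.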
It is therefore the technical problem underlying certain example embodiments to provide an approach for processing a graph query which is more efficient and consumes less memory and computing resources, thereby at least partly overcoming the above explained disadvantages of the prior art.
According to one example aspect, this problem is solved by a computer-implemented method for processing a graph query. The graph query serves for retrieving data items from a data source by indicating at least a source node representing one or more source item types, a target node representing one or more target item types, and a plurality of edges between the source node and the target node, wherein each edge comprises one or more relation conditions, wherein each relation condition defines a mapping between items of one of the source item types and items of one of the target item types. In the embodiment of claim 1, the method comprises the following steps:
a. selecting a first edge of the plurality of edges;
b. traversing the selected first edge from the source node to the target node in accordance with the one or more relation conditions of the first edge to produce an intermediate set of result items, wherein the intermediate set of result items comprises the items of the data source which belong to the at least one target item type and which fulfill the corresponding one or more relation conditions;
c. selecting a further edge of the plurality of edges;
d. traversing the selected further edge from the target node to the source node in accordance with the one or more relation conditions of the selected further edge, and deleting items from the intermediate set of result items produced in step b. which do not fulfill the corresponding one or more relation conditions;
e. repeating steps c. and d. for each further edge of the plurality of edges; and
f. returning the intermediate set of result items as the result of the graph query.
Accordingly, the embodiment defines a particularly efficient way of processing a graph query in order to retrieve data items from a data source, wherein the graph query involves a plurality of edges. In this context, the term “data source” is understood as any type of source for data items, such as a database (e.g. relational database, object database, graph database), a cache, a file system, or even a software or hardware sensor or application producing data (e.g. streaming data). Contrary to the prior art approach explained further above, the method does not evaluate each edge in the direction from the source node to the target node and then intersect the intermediate result sets into a final result. Rather, certain example embodiments propose to evaluate only a first one of the plurality of edges in the direction from source to target node (also referred to as “forward filtering” hereinafter), which results in an intermediate set of result items. Then, the method evaluates each further edge of the plurality of edges in the inverse direction, i.e. from the target node towards the source node (also referred to as “backward filtering” hereinafter). During this process, each item in the intermediate set of result items (which was retrieved during the forward filtering) is checked as to whether it meets the relation conditions of the other edges, and if not, the item is deleted from the intermediate result set. Certain example embodiments thus fundamentally deviate from the prior art approach in that only one single intermediate result set is generated during the forward filtering phase, which is then successively reduced (in that “false results” are deleted therefrom) during the backward filtering. As a result, the present inventive method requires far less memory and computing resources, because fewer intermediate results are generated.
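The forward filtering and backward filtering described above may be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class `Edge`, its `forward` and `backward` indexes, and the function `process_graph_query` are hypothetical names, and each edge is assumed to carry a single relation condition given as (source ID, target ID) pairs.

```python
from typing import Dict, Iterable, List, Set, Tuple


class Edge:
    """Hypothetical edge: one relation condition given as (source, target) ID pairs."""

    def __init__(self, pairs: Iterable[Tuple[int, int]]) -> None:
        self.forward: Dict[int, Set[int]] = {}   # source ID -> target IDs
        self.backward: Dict[int, Set[int]] = {}  # target ID -> source IDs
        for source, target in pairs:
            self.forward.setdefault(source, set()).add(target)
            self.backward.setdefault(target, set()).add(source)


def process_graph_query(source_items: Set[int], edges: List[Edge]) -> Set[int]:
    if not edges:
        return set()
    first, *further = edges  # step a. (here simply the first edge in the list)

    # Step b. (forward filtering): build the single intermediate result set.
    intermediate: Set[int] = set()
    for source in source_items:
        intermediate |= first.forward.get(source, set())

    # Steps c. to e. (backward filtering): traverse each further edge from
    # target to source and delete items that reach no source item.
    for edge in further:
        for item in list(intermediate):
            if edge.backward.get(item, set()).isdisjoint(source_items):
                intermediate.discard(item)

    return intermediate  # step f.


edges = [
    Edge({(1, 10), (1, 11), (2, 12)}),
    Edge({(1, 11), (2, 12), (2, 13), (3, 10)}),
]
result = process_graph_query({1, 2}, edges)
```

In this sketch only one set is ever allocated for intermediate results; item 10 is produced by the forward filtering but deleted during the backward filtering because the second edge relates it only to source item 3.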
In one aspect of certain example embodiments, the selected first edge may comprise a plurality of relation conditions, and step b. above (i.e. the forward filtering) may comprise producing the intermediate set of result items as the union of the results of each relation condition. Accordingly, the method is able to efficiently process particularly complex graph queries in which the edges may have more than one relation condition, i.e. search criterion, attached thereto. Preferably, the above steps c.-e. (i.e. the backward filtering) are performed iteratively for each item in the intermediate set of result items.
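Forward filtering over an edge with several relation conditions might look as follows. The sketch assumes, for illustration only, that each relation condition is a dictionary from source item IDs to target item IDs; the name `forward_filter` is hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical representation: one relation condition maps source IDs to target IDs.
RelationCondition = Dict[int, Set[int]]


def forward_filter(
    source_items: Set[int], relation_conditions: List[RelationCondition]
) -> Set[int]:
    """Return the union of the target items produced by each relation condition."""
    intermediate: Set[int] = set()
    for condition in relation_conditions:
        for source in source_items:
            intermediate |= condition.get(source, set())
    return intermediate


# An edge carrying two relation conditions.
first_edge = [{1: {10}}, {1: {11}, 2: {12}}]
intermediate_ids = forward_filter({1, 2}, first_edge)
```

Accumulating into one set means that the per-condition results are never stored separately, which is in line with keeping a single intermediate set of result items.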
In another aspect of certain example embodiments, selecting a first edge of the plurality of edges in step a. may comprise selecting the edge which, when traversed in step b., is expected to result in a minimal intermediate set of result items. In other words, the edges are preferably ordered according to their expected impact on the size of the intermediate set of result items, and the forward filtering is applied to that edge which is expected to result in a minimal intermediate set of result items. This way, the memory consumption of certain example embodiments can be reduced considerably further. Preferably, the selected edge is the edge which is connected to the fewest source items.
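One possible way to order the edges is sketched below, using the number of connected source items as the estimate of the expected intermediate set size. The dictionary representation of an edge and the names `count_connected_sources` and `order_edges` are assumptions for illustration; other estimates could equally be used.

```python
from typing import Dict, List, Set

# Hypothetical representation: an edge maps source item IDs to target item IDs.
Edge = Dict[int, Set[int]]


def count_connected_sources(edge: Edge) -> int:
    """Number of source items that have at least one related target item."""
    return sum(1 for targets in edge.values() if targets)


def order_edges(edges: List[Edge]) -> List[Edge]:
    """Order edges so that the one connected to the fewest source items is first."""
    return sorted(edges, key=count_connected_sources)


edges = [
    {1: {10}, 2: {11}, 3: {12}},
    {1: {10}},
    {1: {10}, 2: {11}},
]
ordered = order_edges(edges)
first_edge = ordered[0]  # candidate for forward filtering in step b.
```

The remaining entries of `ordered` would then be the further edges processed by backward filtering.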
In more complex scenarios, the graph query may indicate a target node comprising a plurality of target item types, and the above-explained steps a. and b. may be performed iteratively for each of the plurality of target item types, so that the intermediate set of result items comprises the items of the data source which belong to any of the plurality of target item types. In this scenario, steps c.-e. are preferably performed iteratively for each of the plurality of target item types.
In a preferred implementation of the inventive graph query processing method, the intermediate set of result items stores IDs representing the respective items in the data source. In other words, instead of directly storing the respective data items in the intermediate set of result items (which may consume more or less memory depending on their structure), only IDs, i.e. unique identifiers of the data items, are stored. Since an ID may be implemented as a simple integer number, this aspect further minimizes the memory consumption of certain example embodiments.
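The ID-based intermediate set can be illustrated with the following sketch. The record layout in `data_source` and the helper `resolve` are hypothetical and serve only to show that the filtering operates on integers, with the full records being looked up once the result is returned.

```python
from typing import Dict, List, Set

# Hypothetical data source: full records keyed by an integer ID.
data_source: Dict[int, Dict[str, str]] = {
    10: {"type": "Document", "title": "Draft"},
    11: {"type": "Document", "title": "Spec"},
    12: {"type": "Document", "title": "Manual"},
}


def resolve(ids: Set[int], source: Dict[int, Dict[str, str]]) -> List[Dict[str, str]]:
    """Look up the full records for the given IDs, in ascending ID order."""
    return [source[item_id] for item_id in sorted(ids)]


# The intermediate set holds only integers, not the records themselves.
intermediate_ids: Set[int] = {10, 11, 12}
intermediate_ids.discard(10)  # backward filtering deletes an ID, not a record
records = resolve(intermediate_ids, data_source)
```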
Certain example embodiments also concern a computer program comprising instructions for implementing any of the above-described methods.
Lastly, a system for processing a graph query is also provided, wherein the graph query serves for retrieving data items from a data source by indicating at least a source node representing one or more source item types, a target node representing one or more target item types, and a plurality of edges between the source node and the target node, wherein each edge comprises one or more relation conditions, wherein each relation condition defines a mapping between items of one of the source item types and items of one of the target item types, wherein the system comprises:
a. means for selecting a first edge of the plurality of edges;
b. means for traversing the selected first edge from the source node to the target node in accordance with the one or more relation conditions of the first edge to produce an intermediate set of result items, wherein the intermediate set of result items comprises the items of the data source which belong to the at least one target item type and which fulfill the corresponding one or more relation conditions;
c. means for selecting a further edge of the plurality of edges;
d. means for traversing the selected further edge from the target node to the source node in accordance with the one or more relation conditions of the selected further edge, and for deleting items from the intermediate set of result items which do not fulfill the corresponding one or more relation conditions;
e. wherein the selecting of a further edge of the plurality of edges and the traversing of the selected further edge from the target node to the source node are repeated for each further edge of the plurality of edges; and
f. means for returning the intermediate set of result items as the result of the graph query.
Further advantageous modifications of the system of certain example embodiments are defined in further dependent claims.