Nowadays, data-intensive computer systems typically store their data in databases and database systems are provided which allow storing the data into the database, as well as accessing the stored data.
Accessing data of a database is typically performed by formulating queries in a query language and having the database system process the query. Since complex databases typically store excessive amounts of data, sometimes even millions of individual data objects, many approaches have been developed for increasing the performance of the query processing to allow users to retrieve the searched data as fast as possible.
The most common conventional database systems are so-called relational database management systems (RDBMS). A common language for interacting with data stored in such RDMBS is the Structured Query Language (SQL) commonly known in the art. The theoretical concept of a relational database is based on a set of mathematical terms, wherein the most important is called relation. A relation is a mathematical description of a table and a relational database may be considered as a set of tables. Each row (also referred to as tuple) of a table comprises a data record (hereinafter also referred to as “data object” or “database object”), wherein the record comprises one or more attributes. The attributes correspond to the columns of the table. FIG. 7 shows a simple example of a relational database having a table “Book”. As can be seen, the table “Book” stores in this example five data objects (rows) representing five books. Each book has an ID, a name and a genre code, which are stored in the columns of the table.
In order to retrieve data from such a relational database, queries are typically formulated in a declarative manner. This means that the query does not specify the steps the RDBMS is supposed to perform in order to retrieve the desired result. Rather, the query only specifies, in the form of a pattern, the properties of the desired result by means of one or more search conditions. A simple example of a query would thus be “Give me all books which have the term ‘history’ in their name”. As known in the art, more complex queries can be formulated on multiple attributes, which may be combined using logical operators.
Besides simple queries involving search conditions directed only to one type of object (i.e. to one single database table), more complex queries oftentimes involve multiple objects having relations defined therebetween. As an example, consider the query “Give me all books which have the term ‘history’ in their name and the genre ‘Science’”. As can be seen, this query involves two separate tables of the exemplary database of FIG. 7 (namely the tables “Book” and “Genre”), wherein each data object in “Book” is related to one or more data objects in “Genre” and vice versa. Such queries involving multiple objects and the relations therebetween are hereinafter referred to as “relational queries.”
Large databases are typically stored on hard drives or other suitable permanent storage devices. As a result, traditional relational databases typically suffer from the relatively slow disk storage access for processing queries to the database. One main reason for the hard drive access being so slow is the random access to the tables. In other words, entries of the tables are typically not accessed in sequence, but rather in random order (depending on the information searched), which requires many movements of the head of the hard disk.
Some traditional relational databases try to improve performance by reading entire blocks from disk to a cache memory, hoping other entries in the block might be needed later on. However, because following relations in a query typically results in random access to the database tables, this strategy performs poorly on larger databases with many small rows.
Other commercially available products employ so-called No-SQL databases. Most vendors of such databases do not offer any support for relational queries at all. Others try to resolve relationships by creating special indexes for each relational query which can possibly be performed. This approach, however, is limited to scenarios with a small and fixed set of conceivable relational queries and is hardly practical in more complex scenarios.
Lastly, also in-memory databases exist, which can quickly process relationships because all data is loaded in a fast memory, such as a cache or random access memory (RAM). However, this approach requires large quantities of memory, which greatly limits the scalability do to the hardware limitations in modern computer systems.
Relational database systems (RDBMS) can be considered to be the most prominent architecture used nowadays. While RDBMS are generally able to perform complex relational queries, there are scenarios in which they deliver poor results. In particular, performance breaks down as soon as the system has to perform random disk access(es). This is typically the case when the following criteria are met: data involved in querying does not fit into memory (e.g. RAM), queries operate on many rows, and/or queries use joins across multiple tables.
Under these conditions, the person skilled in the art will appreciate that a RDBMS will have to resolve many foreign keys using row lookups in a non-cached table, causing a large number of random disk accesses. On mechanical hard drives, this will cause slow head movements, resulting in long query execution times and poor disk access efficiency.
It is therefore the technical problem underlying certain example embodiments to provide an approach for processing queries (in particular relational queries which take into account relationships between the data objects to be retrieved) which is more efficient and allows a faster retrieval of the searched data, thereby at least partly overcoming the above explained disadvantages of the prior art.
This problem is according to one aspect of the invention solved by a computer-implemented method of processing a relational query. In the embodiment of claim 1, the relational query addresses a plurality of data objects linked by one or more relationships and the method comprises the following steps:    a. for each data object addressed by the relational query, computing at least one result in accordance with at least one index structure, the index structure being stored in a storage device of the database system; and    b. merging the results computed in step a. in accordance with at least one translation data structure to produce a final result of the relational query, the translation data structure representing the one or more relationships between the data objects and being stored in a memory of the database system.
Accordingly, the embodiment defines a particularly efficient approach for processing relational queries, which is on the one hand based on traditional indexing technologies. Such index technologies are used in certain example embodiments for computing the results of the individual data objects addressed by the relational query. Preferably, the at least one index structure of certain example embodiments is an inverted index, which is used to store attributes of data objects and allows a quick retrieval of all objects which have specific values in specific attributes. Such inverted indexes are known in the prior art (cf. e.g. Wikipedia “Inverted index”. Furthermore, each data object stored in the database preferably comprises a unique identifier (ID), which is returned as result of the index lookup operation.
The above-described indexing technologies allow performing search queries with conditions on multiple attributes of a single data object which can be combined using logical operators. These queries are typically resolved by splitting them up into individual index lookups. Object identifiers returned by the individual lookups are then merged, as indicated by the logical operators in the query to get the final result (e.g. if the query connects two search conditions with a logical “AND”, the merged result is the intersecting set of the partial results of the index lookups, a logical “OR” corresponds to union set of the partial results, and so on).
As the skilled person will appreciate, these queries can be executed with good performance, since indexes can quickly produce all object identifiers for a given value and the merging of object identifiers according to the query structure requires no additional disk access. However, these search queries are limited to combinations of attributes of one single object. They especially cannot handle scenarios where a plurality of objects which have relationships between them are involved and wherein these relationships have to be considered in the query, i.e. they cannot handle relational queries, as concerned by certain example embodiments.
The mechanism proposed by certain example embodiments solves this problem by combining, i.e. merging, multiple search queries as described above by defining relational conditions between them, thus allowing to perform sub-graph searches heeding the relationships as well as object attributes.
This is achieved by introducing a translation mechanism which can translate object identifiers across relations. This translation mechanism is used in the merging step of certain example embodiments, where the computed results corresponding to the individual data objects are combined/merged in accordance with special translation data structures, which represent the relationships between the data objects stored in the database. As the person skilled in the art in database programming will appreciate, the term “computing at least one result” also encompasses the “empty set” as a result, since a database query might well produce no results for any query node or even the entire query.
Translation allows converting results from query conditions across relationships defined in the query, making them easily comparable and therefore searchable. Importantly, the translation data structures are kept in memory of the database system and only describe which object is related to which other object. Preferably, the translation data structures utilize primitive and compact identifiers and space efficient lookup techniques, such as binary search. Because no object data is kept in memory (note that the corresponding index structures are stored in a storage device of the database system), the translation structure is much smaller than the on-disk index in many scenarios.
In summary, certain example embodiments advantageously combine the advantages of storage devices (e.g. a hard disk of the underlying computer system, or any other type of permanent storage) and memory (i.e. fast memory, such as a random access memory (RAM) of the underlying computer system) while at the same time avoiding the disadvantages of both types of storage/memory. As the person skilled in the art of computer science will appreciate, the advantage of storage devices is its huge capacity and decent data streaming performance. Its disadvantage is its slow seek time, delivering poor random I/O performance. Memory, on the other hand, has limited capacity, but outstanding random I/O performance. In certain example embodiments, object data only exists on the storage device (by means of the index structures) and all random access is moved to the memory at the same time.
Accordingly, during processing the relational query, data on disk is only accessed in its indexed form for reading results for individual query conditions. This means that disk data access cost is almost linear to the number of query conditions, just like in conventional index search algorithms (good disk data streaming performance greatly reduces the impact of larger data graphs where each index entry has more result object identifiers stored). On the other hand, the handling of the random access pattern caused by following relationships is completely left to the translation mechanism. This mechanism is isolated from any object data and solely performs object ID translations according to the translation data held in memory.
In one aspect of the present invention, the method comprises the further step of representing the relational query as a query tree, wherein each data object addressed by the relational query is represented as a node of the query tree and wherein each relationship between the data objects is represented by an edge in the query tree. Representing the relational query in a tree format is a prerequisite of particularly advantageous aspects for processing the query, as will be explained further below. Most relational queries can be directly represented as query trees. However, if the relational query corresponds to a more generic connected graph, the method preferably produces a spanning tree based on the connected graph, as will be explained in the detailed description further below.
Preferably, the method further comprises the step of performing the above-described computing step and/or the merging step as part of a first pass of processing the query tree, wherein the first pass is performed recursively for each node of the query tree. During the first processing pass, the merging step preferably comprises merging the results of the current node with the results of the child node(s) of the current node. Most preferably, the method also comprises performing a second merging step as part of a second pass of processing the query tree, wherein the second pass is performed recursively for each node of the query tree, and wherein the second merging step preferably comprises merging the results of the current node with the results of the parent node of the current node. Accordingly, in a preferred embodiment of the invention, the query tree is traversed two times (i.e. in a first and a second pass), wherein the partial results of the individual data objects are first merged in the direction of the parent node and then in the direction of the respective child nodes. Moreover, the first pass and optionally the second pass of processing the query tree may traverse the query tree starting at the root node in a depth-first manner. This approach guarantees that at first, valid results are obtained for parent nodes of leaf nodes and afterwards for their parent and so forth all the way to the root node during the first pass. Determining the result for a parent based on its children delivers a result which is only missing the relation from the parent of the parent. Once the root node is reached, however, the result for it is complete and correct, because the root node has no parent of its own. The second pass will simply apply the missing parent relation to the children going downward from the root node. It is noted that the second pass may be performed equally well in a depth-first manner or a breadth-first manner.
Alternatively or additionally, for two given data objects which are addressed by the relational query and which are linked by a relationship, the translation data structure may comprise a mapping of an identifier of the first data object to an identifier of the second data object in accordance with the relationship. If the relationship is represented as a data object in a database of the database system, the translation data structure may comprise a mapping of an identifier of the first data object to an identifier of the second data object and to an identifier of the relationship. As already mentioned above, the provided translation data structures, by means of mapping the identifiers (IDs) of the respective data objects and/or relationships, is particularly compact, yet efficient.
Certain example embodiments also concern a computer program comprising instructions for implementing any of the above-described methods. Lastly, also a database system is provided, the database system comprising an interface, adapted for receiving relational queries which address a plurality of data objects linked by one or more relationships; a storage device, adapted for storing at least one index structure; a memory, adapted for storing at least one translation data structure, the translation data structure representing the one or more relationships between the data objects; and a control logic, adapted for, for each data object addressed by the relational query, computing at least one result in accordance with the at least one index structure, and for merging the computed results in accordance with the at least one translation data structure to produce a final result of the relational query.
Further advantageous modifications of embodiments of the database system of certain example embodiments are defined in further dependent claims.