With the increasing impact of the World Wide Web, a lot of attention has been given to XML (Extensible Markup Language), which has become a universally accepted standard for exchanging data over the Web. XML is a format for representing semi-structured (i.e. irregularly structured) data in textual form. XML documents comprise hierarchically nested collections of elements which may be accessed following the document's tree structure. XML query languages such as XQuery and XPath use path expressions to traverse these semi-structured data and access specific nodes. Thus, the task of navigating through irregularly structured graphs is of central importance in processing XML queries. In order to make the processing of queries efficient, targeted access to the data stored in the XML document is required; this may be achieved by introducing an appropriate index structure which supports query processing and improves its performance.
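By way of illustration only, the following minimal sketch shows how a path expression selects nodes in an irregularly structured document. The sample document, its element names and the variable names are assumptions made for this example; the standard-library ElementTree module is used, which supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the second <book> has no <isbn> child,
# which is the kind of irregularity typical of semi-structured data.
DOC = """
<library>
  <book id="1"><title>A</title><isbn>111</isbn></book>
  <book id="2"><title>B</title></book>
</library>
"""

root = ET.fromstring(DOC)

# Path expressions walk the tree from the root element downwards.
titles = [node.text for node in root.findall("./book/title")]
isbns = [node.text for node in root.findall("./book/isbn")]

print(titles)
print(isbns)
```

Both books contribute a title, whereas only the first contributes an ISBN; the path expression simply returns nothing for the book in which the node is absent.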
Generally, an index is a data structure that improves the speed of operations on a database table, such as executing queries against the database. Existing database management systems, such as relational and object-oriented database systems, generally comprise mechanisms for rapidly retrieving data based on key fields in the database. These mechanisms include index structures based on B-trees as well as indices constructed around key fields that are frequently queried in order to enable fast searching on these fields. Similar indexing mechanisms may be applied to XML documents, thus creating so-called XML values indices comprising information about existing nodes and the data which they contain. These XML values indices support and speed up XML document handling and querying, such as extracting or comparing elements or attributes of XML documents. However, these indices, adopted from relational databases with rigid structures, may exhibit inadequacies when applied to semi-structured data such as XML documents. In particular, they are incapable of dealing with documents in which specific elements or attributes are missing, i.e. are non-present.
Note that this situation cannot occur in structured (relational) databases, since non-presence of a data value is indicated as a database table entry with a NULL, N/A, etc. indicator; thus, in a relational database, there is always an entry (i.e. a column in a specific row) for a data field, even when the data field has no value. When the database table is indexed, the non-present values are automatically indexed using the standard indexing procedures of relational databases since a corresponding indicator (NULL, N/A, etc.) is present in the database rows.
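The relational case may be sketched as follows, using an in-memory SQLite database purely as an example. The table name, column names and sample rows are assumptions made for illustration. Whether a given engine actually uses the index for a NULL lookup depends on the engine; the sketch only shows that the row carries an explicit NULL entry that a query can address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE INDEX idx_person_email ON person(email)")

# Bob has no e-mail address; the column still exists in his row and holds NULL.
conn.executemany(
    "INSERT INTO person (id, name, email) VALUES (?, ?, ?)",
    [(1, "Ann", "ann@example.org"), (2, "Bob", None)],
)

# Non-presence is directly queryable because it is represented in the row.
missing_email = [
    row[0]
    for row in conn.execute("SELECT name FROM person WHERE email IS NULL ORDER BY id")
]
print(missing_email)
```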
In contrast, an XML document can contain zero, one or multiple nodes at different hierarchy levels. Since there is no NULL indicator, the non-presence of a node on a given hierarchy level is not automatically indexed. Thus, a regular values index, of the kind that is used on fully structured data, is not capable of detecting the non-existence of a specific node when applied to semi-structured data such as an XML document. Such an index does not support searches for documents which do not meet certain criteria, i.e. do not have specific values.
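The limitation may be illustrated with a deliberately simplified values index, modelled here as a mapping from (path, value) pairs to document identifiers. This is a sketch under stated assumptions, not the index of any particular product; the sample documents and the helper names `build_values_index` and `docs_with_path` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical collection: document d2 has no <email> node at all.
DOCS = {
    "d1": "<person><name>Ann</name><email>ann@example.org</email></person>",
    "d2": "<person><name>Bob</name></person>",
}


def _walk(node, path, doc_id, index):
    """Record the text value of each node under its path, then recurse."""
    value = (node.text or "").strip()
    if value:
        index[(path, value)].add(doc_id)
    for child in node:
        _walk(child, path + "/" + child.tag, doc_id, index)


def build_values_index(docs):
    """Map (path, value) -> set of ids of documents containing that value."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        root = ET.fromstring(text)
        _walk(root, "/" + root.tag, doc_id, index)
    return index


def docs_with_path(index, path):
    """Collect ids of documents that have any indexed value under `path`."""
    found = set()
    for (indexed_path, _value), doc_ids in index.items():
        if indexed_path == path:
            found |= doc_ids
    return found


values_index = build_values_index(DOCS)
print(docs_with_path(values_index, "/person/email"))
```

Document d2 leaves no entry of any kind under `/person/email`, so the index alone cannot answer a query for documents lacking that node; the missing documents could only be found by consulting the full document collection.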
A method for a full-text index on XML data, including wildcard search support and partial matches, is described in US 2008/0010313 A1. This method, however, does not support searching documents for the non-presence of certain nodes and thus cannot solve the problem outlined above.
Aside from efforts directed at developing advanced regular (values) indices for XML documents, a variety of XML indexing techniques, some of them based on the concept of a so-called structural index, have been suggested and developed for tackling semi-structured data. A structural index keeps track of all paths within the document structure, notably the hierarchical relation of its nodes (elements and attributes). The article “Index structures for matching XML twigs using relational query processors” by Z. Chen et al., Data & Knowledge Engineering 60 (2007), pp. 283-302, gives an overview of various XML path indices, relational join indices as well as object-oriented path indices. This article also introduces novel indexing structures directed at finding documents with specific path/value combinations. These, however, do not lend themselves to solving the problem of efficiently spotting documents not having specific properties.
IBM's DB2 comprises a path index which registers all paths within a database table but does not keep track of which specific path occurs in which specific documents within the table. This path index, constituting a subset of the generalized structural index outlined above, is therefore not capable of detecting and reflecting non-presence of data within specific documents and cannot be used for queries directed at documents which do not meet certain criteria.
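The consequence of registering paths without their document association may be sketched as follows. The sketch is a conceptual simplification and makes no claim about how DB2 implements its path index; the sample documents and the function name `build_path_registry` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: document d2 has no <email> node.
DOCS = {
    "d1": "<person><name>Ann</name><email>ann@example.org</email></person>",
    "d2": "<person><name>Bob</name></person>",
}


def build_path_registry(docs):
    """Collect every path occurring anywhere in the collection, without doc ids."""
    paths = set()
    for text in docs.values():
        root = ET.fromstring(text)
        stack = [(root, "/" + root.tag)]
        while stack:
            node, path = stack.pop()
            paths.add(path)
            for child in node:
                stack.append((child, path + "/" + child.tag))
    return paths


path_registry = build_path_registry(DOCS)
print(sorted(path_registry))
```

The registry confirms that `/person/email` occurs somewhere in the collection, but it holds no information from which to derive that d2 lacks this path.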
US 2003/0212662 A1 describes a method for looking up paths identified by regular path expressions and then finding the related data. The XML data is stored in regular relational tables. The problem of indexing data contained in the XML documents—and, notably, of detecting documents which do not contain specific nodes or specific paths—is not addressed.
U.S. Pat. No. 7,287,023 suggests usage of a structural index which indexes the entire document and thereby remembers which paths are present and where. However, this structural index will generally not make use of any XML values index that may already be in place. Rather, the structural index is an additional index which may require considerable amounts of storage space since it indexes all paths within the entire document.
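A structural index of this general kind may be sketched as a mapping from each path to the documents containing it. The sketch below is a conceptual illustration only and does not reproduce the method of the cited patent; the sample documents and the function name `build_structural_index` are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical collection: document d2 has no <email> node.
DOCS = {
    "d1": "<person><name>Ann</name><email>ann@example.org</email></person>",
    "d2": "<person><name>Bob</name></person>",
}


def build_structural_index(docs):
    """Map every path in every document to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        root = ET.fromstring(text)
        stack = [(root, "/" + root.tag)]
        while stack:
            node, path = stack.pop()
            index[path].add(doc_id)
            for child in node:
                stack.append((child, path + "/" + child.tag))
    return index


structural_index = build_structural_index(DOCS)

# Non-presence follows from a set difference against all document ids.
docs_without_email = set(DOCS) - structural_index.get("/person/email", set())
print(docs_without_email)
```

Non-presence becomes answerable, but only because every path of every document is recorded, including paths that are never queried; this is the storage overhead noted above.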
In summary, while prior art XML values indices furnish values of data stored in specific nodes of an XML document, they are incapable of detecting non-presence of certain nodes within XML documents. On the other hand, XML path indices reflect the hierarchical structure of the XML documents but are inefficient at (or incapable of) yielding specific values. However, users usually want to index a certain path within an XML document as well as query for specific values stored within this document. In order to satisfy these two requirements, both an XML values index and an XML structural index are needed, thus duplicating the required resources, i.e. computational expenditure as well as storage space. This is clearly wasteful.
Thus, there is a need for a method that handles non-presence of nodes in semi-structured data documents, e.g. XML documents, in such a way that information on the non-presence is consistently represented within a corresponding index. In situations in which an XML values index has already been put in place, the method should build upon this XML values index, thus saving computation time and minimizing storage space.