Filtering Expression Sets
In the context of event-based and content-based subscription systems, conditions are defined which, when met, trigger an action. For example, a subscriber can define rules in the form of expressions that specify a state of content that, when met, triggers transmission of content to the subscriber. Using a database management system as an underlying engine for an event-based subscription system, a subscriber can register queries with the system that represent conditional expressions on the content of the events. Generally in this context, an event refers to some quantifiable set of information, and the expressions are related to the content of that event. In such a subscription or similarly functioning system, a potentially very large set of queries, i.e., an expression set on the content, is registered to manage the publication of desired content data. When a given data item becomes available, these conditional expressions are filtered to find those expressions that match the given data item. The data for which the expressions are filtered could be, for example, a set of name-value pairs, an XML (Extensible Markup Language) document, or a combination of both.
A simple but inefficient approach to the task of filtering expression sets is to test all of the expressions in a given set for each data item. However, this approach is scalable neither for a large set of expressions nor for a high rate of events. Therefore, most commercial systems pre-process the expression set and create in-memory matching networks (i.e., specialized data structures) that group matching predicates in the expression set and share the processing cost across multiple expressions.
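The cost of the exhaustive approach can be seen in a minimal sketch. The expression identifiers, the lambda predicates, the sample data items, and the `naive_filter` helper below are all hypothetical; they only illustrate that the number of evaluations grows as the number of expressions multiplied by the number of data items.

```python
from typing import Callable, Dict, List

DataItem = Dict[str, object]
Expression = Callable[[DataItem], bool]


def naive_filter(expressions: Dict[str, Expression], item: DataItem) -> List[str]:
    """Test every registered expression against one data item."""
    return [expr_id for expr_id, expr in expressions.items() if expr(item)]


# Hypothetical expression set on name-value pairs.
expressions: Dict[str, Expression] = {
    "e1": lambda d: d.get("type") == "book" and d.get("price", 0) < 20,
    "e2": lambda d: d.get("type") == "book" and d.get("price", 0) < 50,
    "e3": lambda d: d.get("type") == "cd",
}

# Hypothetical stream of data items.
items: List[DataItem] = [
    {"type": "book", "price": 30},
    {"type": "cd", "price": 10},
]

evaluations = 0
results = []
for item in items:
    results.append(naive_filter(expressions, item))
    evaluations += len(expressions)  # every expression is evaluated for every item
```

Note that `e1` and `e2` both re-test `type == "book"` for every item; nothing is shared between them, which is the redundancy that matching networks are meant to remove.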
Matching networks rely on the fact that a conditional expression on scalar data can be decomposed into independent predicates and a decision tree can be constructed by assigning each predicate to a node in the tree. Thus, matching networks are decision trees in which each node represents a predicate group in a given expression set. Data flows from a parent node to its children only if the data evaluates to true for the predicate represented by the parent node. A path from the root of the decision tree to a leaf node represents all the conjunctions in an expression. The leaf nodes in the tree are labeled with expression identifiers and, if a data item passes the predicate test on a leaf node, the corresponding expressions are considered true for that data item. Many variants of matching networks (such as RETE, TREAT and Gator networks) are in use in existing systems.
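The decision-tree idea can be sketched as a small predicate tree in which expressions that begin with the same predicates share nodes. This is a simplified illustration and not an implementation of RETE, TREAT or Gator; the `MatchingTree` class, the `(attribute, operator, constant)` predicate encoding, and the sample expressions are assumptions made for the example. Sharing here depends on the order in which the caller lists the predicates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical predicate encoding: (attribute, operator, constant).
Predicate = Tuple[str, str, object]


def holds(pred: Predicate, item: Dict[str, object]) -> bool:
    """Evaluate one predicate against a set of name-value pairs."""
    attr, op, const = pred
    if attr not in item:
        return False
    value = item[attr]
    if op == "==":
        return value == const
    if op == "<":
        return value < const
    if op == ">":
        return value > const
    raise ValueError(f"unsupported operator: {op}")


@dataclass
class Node:
    children: Dict[Predicate, "Node"] = field(default_factory=dict)
    expr_ids: List[str] = field(default_factory=list)  # expressions ending here


class MatchingTree:
    def __init__(self) -> None:
        self.root = Node()

    def add(self, expr_id: str, conjuncts: List[Predicate]) -> None:
        """Insert a conjunction; common predicate prefixes reuse existing nodes."""
        node = self.root
        for pred in conjuncts:
            node = node.children.setdefault(pred, Node())
        node.expr_ids.append(expr_id)

    def match(self, item: Dict[str, object]) -> List[str]:
        """Data flows to a child only if the child's predicate is true."""
        matched: List[str] = []
        stack = [self.root]
        while stack:
            node = stack.pop()
            matched.extend(node.expr_ids)
            for pred, child in node.children.items():
                if holds(pred, item):
                    stack.append(child)
        return sorted(matched)


tree = MatchingTree()
tree.add("e1", [("type", "==", "book"), ("price", "<", 20)])
tree.add("e2", [("type", "==", "book"), ("price", "<", 50)])
tree.add("e3", [("type", "==", "cd")])
matched = tree.match({"type": "book", "price": 30})
```

In this sketch `e1` and `e2` share the `type == "book"` node, so that predicate is evaluated once per data item regardless of how many expressions begin with it.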
In existing systems, any operation that requires filtering of expressions and related information requires significant custom coding and degrades performance. Furthermore, the number of expressions is limited because the corresponding matching networks must fit in main memory, changes to expressions are costly, and users are unable to adjust filtering strategies to the structure and use of the expressions and related data.
XPath Expressions
XPath is a language for addressing XML documents. XPath also provides basic facilities for manipulation of strings, numbers and booleans. XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document.
XPath models an XML document as a tree of nodes. There are different types of nodes, including element nodes, attribute nodes and text nodes. The XPath data model is described in detail in Section 5 (“Data Model”) of “XML Path Language (XPath)” (version 1.0), a W3C (World Wide Web Consortium) Recommendation dated 16 Nov. 1999.
The primary syntactic construct in XPath is the expression, which is evaluated to yield an object. XPath expressions are described in Section 3 (“Expressions”) of “XML Path Language (XPath)” (version 1.0). One important kind of expression is a location path. A location path selects a set of nodes relative to a context node. The result of evaluating an expression that is a location path is the node-set containing the nodes selected by the location path. Location paths can recursively contain expressions that are used to filter sets of nodes. The semantics of location paths are described in Section 2 (“Location Paths”) of “XML Path Language (XPath)” (version 1.0).
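As a small illustration of a location path selecting a node-set relative to a context node, the following sketch uses Python's standard `xml.etree.ElementTree` module. That module implements only a limited subset of XPath 1.0 and does not accept absolute paths on an element, so the paths are written relative to the root element, which acts as the context node. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for illustration.
doc = ET.fromstring(
    "<PUBLICATION>"
    '<AUTHOR name="SCOTT"/>'
    '<AUTHOR name="ANDY"/>'
    "<TITLE>Example</TITLE>"
    "</PUBLICATION>"
)

# Location path relative to the context node (the PUBLICATION element):
# selects all AUTHOR children.
authors = doc.findall("AUTHOR")

# The same location path with a predicate that filters the node-set
# by the value of the 'name' attribute.
scott = doc.findall("AUTHOR[@name='SCOTT']")

author_names = [a.get("name") for a in authors]
scott_names = [a.get("name") for a in scott]
```

The second call shows how a predicate inside a location path narrows the node-set produced by the step it is attached to.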
In the case of content-based subscription systems, the techniques used for constructing decision trees for expressions on non-XML data are not directly applicable to XPath expressions defined on XML data. In the absence of an efficient evaluation engine, each XPath expression has to be tested on each XML data item separately to determine whether or not it evaluates to true. However, this approach is also not scalable for a large set of expressions or for a high rate of events.
One approach to grouping a large set of XPath expressions defined for expected XML data and for sharing the evaluation costs across multiple expressions is described in “Efficient Filtering of XML Documents for Selective Dissemination of Information” (Mehmet Altinel and Michael J. Franklin; Proceedings of the 26th VLDB Conference, Cairo, Egypt, 2000), in which in-memory finite state machines are built for each XPath expression.
To build such a finite state machine, each XPath expression is decomposed into a set of path nodes that correspond to elements in an XML document and that serve as states in the state machine. The state information also includes the relative and/or absolute level of the path node (element) within the XML document. A hash index is built on a set of states corresponding to multiple XPath expressions, using the element name as the hash key. For each hash key, the states are maintained as one or more linked lists. In order to match an XML document against a set of XPath expressions, a document parser looks up the element name in the hash index every time a new element is encountered, and the list of corresponding nodes is checked for a match with respect to the level of the element. For each node that passes the check, the next node in the corresponding state machine is activated. If the node that passed the check is the last node in the state machine for an XPath expression, then the expression is considered a match for the XML document.
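The procedure described above can be sketched in a heavily simplified form. The sketch assumes that every expression is an absolute path made only of child steps (for example `/a/b/c`), so each path node carries an absolute level; it does not handle descendant steps, wildcards, relative levels, or predicates, and it is not the implementation from the cited paper. The names `PathNode`, `build_index` and `match_document`, and the sample expressions, are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


class PathNode:
    """One state of the machine for one expression."""

    def __init__(self, expr_id, name, level, is_last):
        self.expr_id = expr_id
        self.name = name
        self.level = level      # absolute level the element must appear at
        self.is_last = is_last  # True for the final state of the expression
        self.next = None        # following state in the same machine


def build_index(expressions):
    """Build the hash index (element name -> path nodes) and the start states."""
    index = defaultdict(list)
    start_states = set()
    for expr_id, path in expressions.items():
        names = path.strip("/").split("/")
        prev = None
        for i, name in enumerate(names):
            node = PathNode(expr_id, name, i + 1, i == len(names) - 1)
            index[name].append(node)
            if prev is None:
                start_states.add(node)
            else:
                prev.next = node
            prev = node
    return index, start_states


def match_document(xml_text, expressions):
    index, start_states = build_index(expressions)
    active = set(start_states)
    matched = set()
    activated_by_open = []  # per open element: states it activated
    level = 0
    stream = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            level += 1
            newly_active = []
            # Look up the element name in the hash index and check levels.
            for node in index.get(elem.tag, ()):
                if node in active and node.level == level:
                    if node.is_last:
                        matched.add(node.expr_id)
                    elif node.next not in active:
                        active.add(node.next)
                        newly_active.append(node.next)
            activated_by_open.append(newly_active)
        else:
            # Leaving an element deactivates the states it activated.
            for node in activated_by_open.pop():
                active.discard(node)
            level -= 1
    return matched


expressions = {
    "q1": "/PUBLICATION/AUTHOR",
    "q2": "/PUBLICATION/TITLE/SUB",
    "q3": "/AUTHOR",
}
document = "<PUBLICATION><AUTHOR name='SCOTT'/><TITLE>x</TITLE></PUBLICATION>"
result = match_document(document, expressions)
```

In this sketch `q3` is not matched even though an `AUTHOR` element occurs, because its only state requires level 1 and the element appears at level 2. The attribute `name` plays no part in the lookup at all.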
The evaluation techniques used in the preceding approach rely on the level of the elements in the XML document and not on any predicates on the attributes of the elements. Therefore, any predicates on one or more attributes of an element are checked linearly when the node for the corresponding element is active in the state machine. Hence, for a large set of XPath expressions that differ from each other only in the predicate on an attribute, this approach is equivalent to evaluating each XPath expression on the XML document linearly. For example, two XPath expressions, /PUBLICATION/AUTHOR[@name="SCOTT"] and /PUBLICATION/AUTHOR[@name="ANDY"], are grouped only on the basis of the <PUBLICATION> and <AUTHOR> elements, and the predicate on the ‘name’ attribute is checked linearly for both expressions.
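The linear predicate check can be illustrated with a short sketch. The `check_attribute_predicates` helper and the encoding of each predicate as an `(attribute, expected value)` pair are assumptions made for the example; the point is only that one comparison is performed per expression once the shared element node is active.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (expression text, (attribute name, expected value)).
Candidate = Tuple[str, Tuple[str, str]]


def check_attribute_predicates(
    element_attrs: Dict[str, str], candidates: List[Candidate]
) -> Tuple[List[str], int]:
    """Check each candidate's attribute predicate one by one."""
    matched = []
    comparisons = 0
    for expr_text, (attr_name, expected) in candidates:
        comparisons += 1
        if element_attrs.get(attr_name) == expected:
            matched.append(expr_text)
    return matched, comparisons


# Both expressions reach the same AUTHOR state; only the predicate differs.
candidates: List[Candidate] = [
    ('/PUBLICATION/AUTHOR[@name="SCOTT"]', ("name", "SCOTT")),
    ('/PUBLICATION/AUTHOR[@name="ANDY"]', ("name", "ANDY")),
]

matched, comparisons = check_attribute_predicates({"name": "ANDY"}, candidates)
```

With N expressions that differ only in the attribute value, the loop performs N comparisons for every matching element, which is the linear behavior described above.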
Due to the extensive use of main memory and many data movement operations during evaluation, this technique may not scale well. Also, the existence of element node filters that contain path expressions, i.e., an XPath expression nested within an attribute predicate of an XPath expression, further complicates the prior evaluation process.
Based on the foregoing, it is clearly desirable to provide an improved mechanism for managing expressions, such as XPath expressions, in a database system. In addition, there is a need for a mechanism that provides the ability to filter XPath expressions in conjunction with predicates on non-XML data.