A publish-subscribe paradigm involves publishers, who generate content and feed it into the system; subscribers, who specify the content that is of interest to them; and an infrastructure (the system) for matching subscriber interests with published content and delivering matched content to the subscribers.
A publish/subscribe (pub/sub) system generally maintains a database of subscriptions, where each subscription is stored as a Boolean expression composed of predicates over attributes. When a publisher generates content that matches a subscription stored in the database, the content can be provided to the subscriber. This matching of content to subscription can be referred to as an event. When an event occurs, the pub/sub system can report all subscriptions in its database that are matched or satisfied by the event. The customers who posted these matching subscriptions may then be notified.
For example, each subscription in the pub/sub system of a diverse online vendor may describe the conditions that a customer has for purchasing a product. A potential customer may post a set of conditions as a subscription to the vendor's pub/sub system in order to search for a product defined by the set of conditions (which may be in the form of a Boolean expression defining a product by its attributes). As a specific example, a customer may subscribe to content related to a camera by posting a subscription indicating item, price, manufacturer, and zoom. Then, when an event occurs, that is, when a publisher/vendor indicates that a product matches (or falls within a range of) the subscription, the pub/sub system reports all subscriptions in its database that are matched (or satisfied) by the event. Customers who posted these matching subscriptions may then be notified.
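By way of a non-limiting illustration, the camera example can be sketched in Python as follows. The sketch assumes that each subscription is a conjunction of simple comparison predicates and that an event is a mapping from attribute names to values. The names Predicate, matches, and match_all, the subscriber identifiers, and the sample values are hypothetical and are not taken from any particular pub/sub system. Every subscription is tested against the event in turn, which is the brute-force baseline that index structures aim to improve upon.

```python
import operator
from dataclasses import dataclass

# Comparison operators supported by this sketch (an assumed, minimal set).
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Predicate:
    attribute: str
    op: str
    value: object

    def holds(self, event):
        # An event that lacks the attribute cannot satisfy the predicate.
        if self.attribute not in event:
            return False
        return OPS[self.op](event[self.attribute], self.value)


def matches(predicates, event):
    # A subscription (a conjunction) is satisfied only if every predicate holds.
    return all(p.holds(event) for p in predicates)


def match_all(subscriptions, event):
    # Brute force: evaluate every stored subscription against the event.
    return [sid for sid, preds in subscriptions.items() if matches(preds, event)]


# Hypothetical subscriptions posted by two customers.
subscriptions = {
    "alice": [
        Predicate("item", "=", "camera"),
        Predicate("price", "<=", 300),
        Predicate("manufacturer", "=", "Acme"),
        Predicate("zoom", ">=", 8),
    ],
    "bob": [
        Predicate("item", "=", "camera"),
        Predicate("price", "<=", 150),
    ],
}

# Hypothetical event published by a vendor.
event = {"item": "camera", "price": 250, "manufacturer": "Acme", "zoom": 10}
matched = match_all(subscriptions, event)  # only "alice" is satisfied
```

In this sketch the cost of processing one event grows with the total number of stored predicates, which is why such a linear scan does not scale to millions of subscriptions.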
Pub/sub systems are used in diverse applications with varied performance requirements. For example, in some applications, events occur at a much higher rate than the posting/removal of subscriptions; in other applications, the subscription rate may be much higher than the event rate; and in yet other applications, the two rates may be comparable. Optimal performance in each of these scenarios may result from deploying a different data structure for the subscriptions or a different tuning of the same structure. Many commercial applications of pub/sub systems have thousands of attributes and millions of subscriptions, so scalability in terms of both the number of attributes and the number of subscriptions is critical.
The problem of rapidly evaluating a large number of predicates against specified events has been studied extensively in the literature. Yan and Garcia-Molina proposed the use of indexes to speed up the evaluation of a collection of Boolean expressions and developed SIFT (T. W. Yan and H. Garcia-Molina, The SIFT Information Dissemination System. ACM TODS, 1999), which is a system based on indexing. Later, various researchers proposed decision trees and index structures for this problem. The proposed approaches can be divided into two main categories: counting-based approaches and approaches based on partitioning subscriptions into subsets (partitioning-based). Counting-based pub/sub systems build an inverted index structure from the subscriptions and minimize the number of predicate evaluations, while partitioning-based systems minimize evaluations by recursively eliminating the subscriptions that cannot be satisfied.
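As a hedged illustration of the counting-based category, the following Python sketch builds an inverted index from distinct predicates to the subscriptions that contain them. It is a generic counting scheme written for this discussion, not the algorithm of SIFT or of any other cited system. The class name CountingIndex, the (attribute, operator, value) tuple encoding, and the sample subscriptions are assumptions, and the values compared for a given attribute are assumed to be of one consistent type. Each distinct predicate is evaluated at most once per event, and a subscription is reported when its count of satisfied predicates equals the number of predicates it contains.

```python
import operator
from collections import defaultdict

# Comparison operators supported by this sketch (an assumed, minimal set).
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


class CountingIndex:
    def __init__(self):
        # attribute -> (op, value) -> ids of subscriptions containing that predicate
        self.by_attribute = defaultdict(lambda: defaultdict(set))
        # subscription id -> number of distinct predicates in the subscription
        self.size = {}

    def add(self, sub_id, predicates):
        distinct = set(predicates)
        self.size[sub_id] = len(distinct)
        for attribute, op, value in distinct:
            self.by_attribute[attribute][(op, value)].add(sub_id)

    def match(self, event):
        counts = defaultdict(int)
        for attribute, actual in event.items():
            # Only predicates on attributes present in the event are examined,
            # and a predicate shared by many subscriptions is evaluated once.
            for (op, value), sub_ids in self.by_attribute.get(attribute, {}).items():
                if OPS[op](actual, value):
                    for sub_id in sub_ids:
                        counts[sub_id] += 1
        # A conjunction is matched when all of its predicates were satisfied.
        return sorted(s for s, c in counts.items() if c == self.size[s])


# Hypothetical subscriptions; "item = camera" is shared by two of them.
index = CountingIndex()
index.add("alice", [("item", "=", "camera"), ("price", "<=", 300)])
index.add("bob", [("item", "=", "camera"), ("price", "<=", 150)])
index.add("carol", [("item", "=", "lens")])

matched = index.match({"item": "camera", "price": 250})  # only "alice"
```

A subscription with no predicates is never reported by this sketch, because it never accumulates a count; a fuller treatment would handle that case explicitly.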
One partitioning-based system is BE-Tree, developed by Sadoghi and Jacobsen (M. Sadoghi and H.-A. Jacobsen, BE-Tree: An Index Structure to Efficiently Match Boolean Expressions over High-dimensional Discrete Space, SIGMOD 2011). BE-Tree partitions subscriptions defined on a high-dimensional space using a two-phase space-cutting technique, space partitioning and space clustering, to group the expressions with respect to the range of values for the various attributes. Experimental results reported by Sadoghi and Jacobsen indicate that BE-Tree outperforms state-of-the-art pub/sub systems such as SCAN (T. W. Yan and H. Garcia-Molina, Index Structures for Selective Dissemination of Information Under the Boolean Model, ACM TODS 1994), SIFT (T. W. Yan and H. Garcia-Molina, The SIFT Information Dissemination System. ACM TODS, 1999), Propagation (F. Fabret, H.-A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha, Filtering algorithms and implementation for fast pub/sub systems, SIGMOD 2001), Gryphon (M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chandra, Matching events in a content-based subscription system, PODC 1999), and k-index (S. Whang, C. Brower, J. Shanmugasundaram, S. Vassilvitskii, E. Vee, R. Yerneni, and H. Garcia-Molina, Indexing Boolean Expressions, VLDB, 2009). BE-Tree, however, is limited to attributes whose values are discrete and for which the range of discrete attribute values is pre-specified. So, BE-Tree is unable to cope with real-valued attributes, string-valued attributes, and discrete-valued attributes with unknown range. Additionally, BE-Tree employs a clustering policy that is ineffective when many subscriptions have a range predicate such as low ≤ ai ≤ high, where ai is an attribute and the clustering criterion p that is used for the BE-Tree lies between low and high. In this case, all such subscriptions fall into the same cluster, and event processing is considerably slowed.
One counting-based system is Siena, developed by Carzaniga et al. (A. Carzaniga, D. Rosenblum, and A. Wolf, Design and evaluation of a wide-area event notification service. ACM Trans. on Computer Systems, 19, 3, 2001, 332-383; A. Carzaniga and A. L. Wolf, Forwarding in a Content-Based Network, ACM SIGCOMM 2003). Siena is a pub/sub system that uses a counting algorithm to find matching subscriptions. It maintains an index of attribute names and types, which is implemented using ternary search tries. Unlike BE-Tree, Siena is not limited to discrete-valued attributes from a pre-specified finite domain. Further, Siena is able to work with attributes of type string and supports operators such as prefix, suffix, and substring on this datatype. Siena, however, does not support incremental updates (i.e., subscription posting and deletion), and so updates must be done in batch mode.
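For readers unfamiliar with ternary search tries, the following Python sketch shows the general data structure applied to a small index from attribute names to type labels. It is a textbook-style illustration written for this discussion and does not reproduce Siena's implementation. The class name TernarySearchTrie, its put and get methods, and the sample attribute names and type labels are hypothetical. Each node holds one character and three links: one for keys whose current character is smaller, one for keys whose current character is larger, and one for keys that continue past a matching character.

```python
class _Node:
    __slots__ = ("ch", "lo", "eq", "hi", "value", "has_value")

    def __init__(self, ch):
        self.ch = ch
        self.lo = None   # subtree for keys whose current character is smaller
        self.eq = None   # subtree for keys that continue after this character
        self.hi = None   # subtree for keys whose current character is larger
        self.value = None
        self.has_value = False  # True only if a key ends at this node


class TernarySearchTrie:
    def __init__(self):
        self.root = None

    def put(self, key, value):
        if not key:
            raise ValueError("key must be a non-empty string")
        self.root = self._put(self.root, key, 0, value)

    def _put(self, node, key, i, value):
        ch = key[i]
        if node is None:
            node = _Node(ch)
        if ch < node.ch:
            node.lo = self._put(node.lo, key, i, value)
        elif ch > node.ch:
            node.hi = self._put(node.hi, key, i, value)
        elif i < len(key) - 1:
            node.eq = self._put(node.eq, key, i + 1, value)
        else:
            node.value = value
            node.has_value = True
        return node

    def get(self, key, default=None):
        if not key:
            return default
        node, i = self.root, 0
        while node is not None:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i < len(key) - 1:
                node = node.eq
                i += 1
            else:
                return node.value if node.has_value else default
        return default


# Hypothetical attribute-name index mapping each name to a type label.
attributes = TernarySearchTrie()
attributes.put("price", "int")
attributes.put("producer", "string")
attributes.put("zoom", "int")
```

Because keys with a common prefix share nodes along the middle links (here, "price" and "producer" share the nodes for "pr"), such a structure stores many similar attribute names compactly while comparing a single character at each step.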