The present invention generally relates to filtering digital information and, more specifically, to a publish and subscribe system and method for efficiently providing interested subscribers with XML documents.
Information accumulates explosively and, thanks to advances in computing and communications technologies, much of it is digitized and made available through the Internet or intranets. Often, though, there is too much information, and the user is overwhelmed.
Traditional search and pull systems are known within the art. A traditional search and pull system 11 for locating information of interest to a user is shown in FIG. 1. Such systems typically have a distributed network of computers 10 with information objects contained therein. In order to obtain objects of interest, a user computer 12 first needs to consult a search engine 14 to find the locations of the objects in the distributed network of computers 10, unless the locations were previously determined and saved (or memorized by the user). Once the locations of the objects of interest are found, the user computer 12 needs to contact the corresponding computers 10 in the distributed network to retrieve the objects. To find locations using the search engine 14, a user computer 12 may input simple keywords, and the search engine 14 searches objects in the distributed network of computers 10 to find those that contain the keywords. This search and pull process may have to be repeated every time new objects are needed.
Publish and subscribe systems are designed to ease this problem by disseminating information only to interested users. A publish and subscribe system facilitates an information push paradigm in which users express long-term interest in (“subscribe to”) information objects continuously released (“published”) by some external agent. The success of a large scale publish and subscribe system relies both on the efficient identification of interested users for each information object and on the prompt delivery of the object to all interested users.
Also known within the art are publish and subscribe systems that typically use simple keyword matching techniques for the identification task. FIG. 2 depicts a typical publish and subscribe system 13, wherein a user specifies interest in, or “subscribes” to, information through a publish and subscribe engine, and the publish and subscribe engine delivers, or “publishes”, to the user the information that matches his or her interest. Whereas previously users received information as a result of isolated searches, the publish and subscribe system continuously delivers items that meet certain criteria to the user. These systems 13 also have a distributed network of computers 10 which contain information objects. However, a user computer 12 may register its interests with a publish and subscribe engine 16, which utilizes matching techniques for the identification task. When new information objects arrive (from the distributed network of computers 10), the publish and subscribe engine 16 matches them against the user profiles (interests) it holds to identify interested users, and delivers each matching object to the interested user computers 12 via, e.g., email. Traditionally, the match mechanism is based on simple keywords.
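By way of illustration only, the keyword-based matching described above may be sketched as follows. The class name KeywordPubSubEngine, its methods, and the sample subscriptions are hypothetical and are not taken from any particular prior art system; the sketch assumes whitespace-separated keywords and returns the set of interested users in place of an actual delivery step such as email.

```python
from collections import defaultdict


class KeywordPubSubEngine:
    """Toy publish and subscribe engine using simple keyword matching.

    Hypothetical illustration only; not an implementation of any cited system.
    """

    def __init__(self):
        # Inverted index: keyword -> set of user ids subscribed to it.
        self._subscribers_by_keyword = defaultdict(set)

    def subscribe(self, user_id, keywords):
        """Record a long-term interest in objects containing any keyword."""
        for keyword in keywords:
            self._subscribers_by_keyword[keyword.lower()].add(user_id)

    def publish(self, text):
        """Return the users whose profile matches the new information object."""
        interested = set()
        for word in set(text.lower().split()):
            interested |= self._subscribers_by_keyword.get(word, set())
        return interested


engine = KeywordPubSubEngine()
engine.subscribe("alice", ["xml", "database"])
engine.subscribe("bob", ["weather"])
matched = engine.publish("New XML filtering engine released")
```

In this sketch the profiles are stored once and every newly published object is matched against them, in contrast to the search and pull approach in which the user repeats the query each time.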
However, XML has become popular for information encoding and exchange. XML allows for a rich information format and opens the door to allowing structure-based subscription in addition to traditional keyword-based subscription. The increasing use of XML as an information encoding and exchange standard demands a more sophisticated mechanism that takes structural information of XML documents into account.
Yan et al., “Index structures for selective dissemination of information under the Boolean model”, in Transactions on Database Systems (TODS), ACM Press, 19 (2): 332–364, 1994, proposed a number of indexing mechanisms to support content-based subscription using both the Boolean model (Boolean connection of keywords) and the vector model (as in Information Retrieval). The information objects to be published are plain text documents with no specific structure assumed.
Fabret et al., “Filtering algorithms and implementation for very fast publish/subscribe,” in Proceedings of ACM SIGMOD (Special Interest Group on Management of Data) 2001 Conference, proposed fast algorithms and efficient caching mechanisms to handle millions of subscriptions. Each subscription profile consists of a conjunction of (attribute, comparison operator, value) predicates. The information objects to be published are events, where each event is a conjunction of (attribute, value) pairs.
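For illustration, the subscription and event model described above may be sketched as follows. This is a naive linear scan under assumed names (OPERATORS, matches, interested_users) and assumed sample data; it shows only the data model and does not reproduce the fast algorithms or caching mechanisms of the cited work.

```python
import operator

# Comparison operators a predicate may use (an assumed, illustrative set).
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(subscription, event):
    """True if every (attribute, operator, value) predicate holds for the event."""
    for attribute, op, value in subscription:
        if attribute not in event:
            return False
        if not OPERATORS[op](event[attribute], value):
            return False
    return True


def interested_users(subscriptions, event):
    """Naively test each subscription in turn and return matching users, sorted."""
    return sorted(user for user, sub in subscriptions.items() if matches(sub, event))


# Hypothetical subscriptions: each is a conjunction of predicates.
subscriptions = {
    "alice": [("symbol", "=", "IBM"), ("price", "<", 100)],
    "bob": [("symbol", "=", "IBM"), ("volume", ">=", 5000)],
}
# Hypothetical event: a conjunction of (attribute, value) pairs.
event = {"symbol": "IBM", "price": 95, "volume": 1200}
result = interested_users(subscriptions, event)
```

Note that both the subscriptions and the events in this model are flat; neither carries any nesting comparable to the element hierarchy of an XML document.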
However, neither of these approaches considers complex structure within the information objects, such as the element hierarchy in XML documents. Chen et al., “NiagaraCQ: A scalable continuous query system for Internet databases,” in Proceedings of ACM SIGMOD (Special Interest Group on Management of Data) 2000 Conference, introduced the XML-based continuous query system NiagaraCQ. Queries are posed using the XML-QL language. The goal of that system is to identify common predicate signatures in order to minimize redundant query evaluation. However, common element hierarchical structures are not exploited.
Altinel and Franklin, “Efficient filtering of XML documents for selective dissemination of information” in Proceedings of 2000 International Conference on Very Large Data Bases (VLDB), pages 53–64, modeled each XPath expression as a finite state automaton and focused on processing many automata simultaneously. However, no attempt was made to eliminate redundant processing of common XPath components.
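As a minimal sketch of the general idea of treating a path expression as a finite state automaton driven by the elements of a document, consider the following. It is not the implementation of the cited work: it assumes a restricted path syntax with only the child axis (/), the descendant axis (//), and the wildcard (*), ignores namespaces, and uses hypothetical function names (compile_path, path_matches).

```python
import io
import re
import xml.etree.ElementTree as ET


def compile_path(path):
    """Turn '/a//b/*' into [('child', 'a'), ('descendant', 'b'), ('child', '*')]."""
    steps = []
    for separator, name in re.findall(r"(//|/)([^/]+)", path):
        axis = "descendant" if separator == "//" else "child"
        steps.append((axis, name))
    return steps


def path_matches(steps, xml_text):
    """Run the automaton over start/end element events of the document.

    State i means the first i steps have been matched. The stack holds the
    set of active states for each currently open element.
    """
    accept = len(steps)
    stack = [{0}]
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            stack.pop()
            continue
        active = set()
        for state in stack[-1]:
            if state == accept:
                continue
            axis, name = steps[state]
            if name in ("*", element.tag):
                active.add(state + 1)   # this element satisfies the step
            if axis == "descendant":
                active.add(state)       # the step may still match deeper down
        if accept in active:
            return True
        stack.append(active)
    return False


doc = "<catalog><section><book><title>XML</title></book></section></catalog>"
deep = path_matches(compile_path("/catalog//book/title"), doc)
shallow = path_matches(compile_path("/catalog/book"), doc)
```

With one such automaton per subscription, two expressions that begin with the same steps, such as /catalog/section/book and /catalog/section/title, each repeat the work for their common prefix.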
Diao et al., “YFilter: Efficient and scalable filtering of XML documents”, in Proceedings of the International Conference on Data Engineering (ICDE), 2002, San Jose, Calif., USA, disclosed modeling all the XPath expressions in a single non-deterministic finite automaton. However, that work does not consider XPath expressions with predicates, including Boolean connections and nested paths.
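The notion of combining many path expressions into a single non-deterministic automaton, so that common prefixes are processed once, may be sketched as follows. This is a simplified illustration under assumptions, not the cited implementation: the Node class, the function names, and the sample queries are hypothetical, only the child axis, the descendant axis, and the wildcard are handled, and, consistent with the limitation noted above, predicates are not supported.

```python
import io
import re
import xml.etree.ElementTree as ET


class Node:
    """One automaton state shared by all queries with the same path prefix."""

    def __init__(self, self_loop=False):
        self.children = {}        # element name or "*" -> next Node
        self.descendant = None    # state entered without input for a "//" step
        self.self_loop = self_loop
        self.query_ids = set()    # queries accepted at this state


def add_query(root, query_id, path):
    node = root
    for separator, name in re.findall(r"(//|/)([^/]+)", path):
        if separator == "//":
            if node.descendant is None:
                node.descendant = Node(self_loop=True)
            node = node.descendant
        node = node.children.setdefault(name, Node())
    node.query_ids.add(query_id)


def closure(nodes):
    """Add the '//' state of every node, which is reachable without input."""
    result = set(nodes)
    result.update(n.descendant for n in nodes if n.descendant is not None)
    return result


def matching_queries(root, xml_text):
    matched = set()
    stack = [closure({root})]
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            stack.pop()
            continue
        following = set()
        for node in stack[-1]:
            for key in (element.tag, "*"):
                if key in node.children:
                    following.add(node.children[key])
            if node.self_loop:
                following.add(node)   # "//" may skip any number of levels
        following = closure(following)
        for node in following:
            matched |= node.query_ids
        stack.append(following)
    return matched


root = Node()
queries = {"q1": "/catalog/book", "q2": "/catalog//title",
           "q3": "/catalog/section/book", "q4": "//author"}
for query_id, path in queries.items():
    add_query(root, query_id, path)
doc = "<catalog><section><book><title>XML</title></book></section></catalog>"
hits = matching_queries(root, doc)
```

In this sketch, q1, q2, and q3 share the single state reached on the catalog element, so that step is evaluated once per document regardless of the number of queries.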
As can be seen, there is a need for a publish and subscribe system that is efficient, utilizes hierarchical structures, is capable of eliminating redundant processing, and can quickly match any incoming XML document against a very large set of profiles.