Real-time Business Process Mining (BPM) is an increasingly critical area of information technology, helping businesses to leverage their resources for maximum benefit. It is an important part of Business Process Management that allows companies to analyze their processes based on actual real-time data collected from their systems. The goal is to enable companies to understand their processes and the state of their business, and to adapt quickly to changing business conditions. BPM improves the speed of process analysis by automatically generating real-time process models from events and messages generated by the underlying systems. By linking information from these sources, it allows instant analysis of the business processes, generating interactive visual displays of how a process works and showing how specific case characteristics influence processing times.
Process Mining can dramatically accelerate a business's discovery process. A mined process model gives users a unique insight into the actual underlying process and into how work is actually flowing through the business. Existing BPM systems, however, fall short of their promise. They tend to be restricted in the kind of data they can analyse and can sometimes put restrictions on the data model itself. Furthermore, the kind of analysis that can be performed is normally restricted to an inextensible and inflexible set of functions, because these are closely tied to the data model being used.
Current technology typically requires the translation or import of data into the system before it can be analysed. This imposes severe constraints on the way the data is defined and collected, and precludes any real-time analysis of the data. Limitations are also imposed by predefined database schemas that are neither flexible nor easily extensible, which puts further limits on the kind of analysis that can be performed. Moreover, such systems usually suffer from scalability issues, not least because the database itself becomes the bottleneck when a large number of requests are executed against it. The use of databases also makes real-time updates of data very difficult.
Many enterprises have business models that are highly dependent on processes that are executed both in good time (meeting Cycle Time measures) and correctly, with minimal failures and repeats (meeting Right First Time measures). In such large organisations, in order to support the business in managing and improving processes across different systems and divisions, solutions are required that can analyse different forms of business process data in order to determine the real state of execution of the processes and to evaluate accurately the performance measures associated with them. As automation is becoming an increasingly ubiquitous feature of process execution, there is a growing need for monitoring and analysis capabilities that can, firstly, cope with the large amount of data generated during the process life spans and, secondly, provide alarms very rapidly when risks of failure are detected.
Continuous Querying
Continuous Querying is an extremely important issue in the field of real-time data processing. Apart from solutions related to traditional databases, such as Oracle's continuous query capabilities, NiagaraCQ and OpenCQ, there is very little work in the field of triple stores. In the case of a triple store defined on top of a traditional database engine, it could be possible to exploit the continuous query capabilities offered by these solutions to perform continuous queries. However, this approach requires mapping queries on the triple store into SQL queries, introducing an additional layer of computation. In our preliminary study of these systems we have identified traditional databases as a major bottleneck for the efficient handling of Resource Description Framework (RDF) graphs, making such solutions infeasible.
Triple store systems with continuous query capabilities are limited at the moment to Atlas and LarKC.
The two approaches are very different: LarKC uses C-SPARQL [2]: an extension of the SPARQL grammar designed specially for stream processing capabilities. This extension allows defining temporal intervals of execution of the query where the results are pulled at the end of each interval by the query engine.
An example of C-SPARQL is:
REGISTER STREAM AllCarsTurningFromPalmIntoOak COMPUTED EVERY 1m AS
SELECT ?car1
FROM STREAM <http://streams.org/citycameras.trdf> [RANGE 5m STEP 1m]
WHERE {
  ?camera1 c:monitors c:Oak-Avenue .
  ?camera2 c:monitors c:Palm-Street .
  ?camera1 c:placedAt ?tr_light .
  ?camera2 c:placedAt ?tr_light .
  ?camera1 t:registers ?car1 .
  ?camera2 t:registers ?car2 .
  FILTER ( timestamp(?car1) > timestamp(?car2) && ?car1 = ?car2 )
}
The query above defines a stream updated every minute on a temporal window of 5 minutes. This is, strictly speaking, not real-time processing.
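By way of illustration only, the periodic evaluation of such a window can be sketched as follows. This is a minimal sketch and not the C-SPARQL engine itself: the function names, the event representation and the choice of a half-open window (now − range, now] are assumptions made for the example.

```python
from typing import List, Tuple

# Each stream element is (timestamp_in_seconds, value); this shape is assumed.
Event = Tuple[int, str]


def window_at(events: List[Event], now: int, range_s: int) -> List[str]:
    """Return the values whose timestamp lies in the window (now - range_s, now]."""
    return [value for (ts, value) in events if now - range_s < ts <= now]


def pull_evaluations(events: List[Event], start: int, end: int,
                     range_s: int = 300, step_s: int = 60):
    """Evaluate the window once per step, as a periodic (pull) engine would."""
    results = []
    for now in range(start + step_s, end + 1, step_s):
        results.append((now, window_at(events, now, range_s)))
    return results


# Hypothetical observations: car-A is seen 10 s after the query is registered.
events = [(10, "car-A"), (70, "car-B"), (330, "car-C")]
for now, seen in pull_evaluations(events, start=0, end=360):
    print(now, seen)
```

In this sketch the observation made at 10 seconds is first reported by the evaluation at 60 seconds, which illustrates why a stream computed once per minute is not real-time in the strict sense.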
Atlas uses a different approach [3]. Queries are registered as continuous queries and the insertion of a new triple triggers the execution of these queries. This event-driven approach is more efficient in case of real-time performance requirements and optimization of the query executions.
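The event-driven principle described above can be sketched as follows. This is a simplified illustration and not the Atlas implementation: the class name, the single-triple-pattern queries and the callback interface are all assumptions, whereas Atlas evaluates chains of dependent sub-queries across network nodes.

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]
# A pattern is a triple in which None acts as a wildcard (assumed convention).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class ContinuousQueryRegistry:
    """Hypothetical registry: inserting a triple triggers the matching queries."""

    def __init__(self) -> None:
        self._queries: List[Tuple[Pattern, Callable[[Triple], None]]] = []

    def register(self, pattern: Pattern, callback: Callable[[Triple], None]) -> None:
        # Store the query once; it stays active for all later insertions.
        self._queries.append((pattern, callback))

    def insert(self, triple: Triple) -> None:
        # Evaluation is driven by the insertion itself, not by a timer.
        for pattern, callback in self._queries:
            if all(p is None or p == v for p, v in zip(pattern, triple)):
                callback(triple)


registry = ContinuousQueryRegistry()
notified: List[Triple] = []
registry.register((None, "rdf:type", "bpm:Order"), notified.append)
registry.insert(("ex:order1", "rdf:type", "bpm:Order"))   # matches, callback fires
registry.insert(("ex:order1", "bpm:status", "open"))      # no match, nothing fires
print(notified)
```

No work is performed when no triple arrives, and a matching triple produces a notification at the moment of insertion.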
Atlas defines two different algorithms: CQC and CSBV. Both algorithms define a continuous query as a chain of dependent sub-queries allocated to the nodes of the network. The algorithms rely heavily on the indexing mechanism: each triple inserted into Atlas is stored on multiple nodes (three nodes in CQC and seven nodes in CSBV). The nodes are selected by hashing the predicate, subject and object components of the triple, plus the combinations between them (in the case of CSBV). This is not an optimal approach, especially in BPM applications where, for example, there is a large number of rdf:type predicates. In Atlas, a high frequency of certain predicates will result in certain nodes being overloaded while the resources of other nodes are under-allocated.
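The load imbalance described above can be illustrated with a small sketch. It is not the Atlas code: the number of nodes, the use of SHA-1 as the hash function and the way the component combinations are formed are assumptions chosen only so that three keys (CQC-like) or seven keys (CSBV-like) are produced per triple.

```python
import hashlib
from collections import Counter
from itertools import combinations

NUM_NODES = 8  # hypothetical network size


def node_for(key: str) -> int:
    """Map an index key to a node using a stable hash (SHA-1 is an assumption)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


def index_keys(triple, with_combinations: bool = False):
    """Three keys (subject, predicate, object), or seven with all combinations."""
    keys = list(triple)
    if with_combinations:
        for size in (2, 3):
            keys.extend("|".join(combo) for combo in combinations(triple, size))
    return keys


def load_per_node(triples, with_combinations: bool = False) -> Counter:
    """Count how many index entries each node receives."""
    load = Counter()
    for triple in triples:
        for key in index_keys(triple, with_combinations):
            load[node_for(key)] += 1
    return load


# A BPM-like workload in which every triple shares the rdf:type predicate.
triples = [(f"ex:case{i}", "rdf:type", "bpm:Case") for i in range(1000)]
load = load_per_node(triples)
hot_node = node_for("rdf:type")
print(load[hot_node], sum(load.values()))
```

Because every triple carries the same predicate, the node selected by hashing rdf:type receives at least one index entry per triple, regardless of how evenly the subjects are spread over the remaining nodes.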
Another existing approach is the EP-SPARQL query language for Event Processing [4], developed as part of the ETALIS open source Event Processing platform.
This work is an extension of the SPARQL language in a similar way to C-SPARQL, in order to perform complex event processing.
The system is able to translate an ontological knowledge base into a logic program (using Prolog). This initial knowledge base is continuously extended with an incoming flow of information, translated as well into Prolog.
EP-SPARQL allows the submitting of queries that are then translated into Backward Chaining rules that are fired when new events are inserted in the knowledge base.
The most significant difference compared with C-SPARQL is that EP-SPARQL follows a push approach to the notification of new results.
Continuous Querying Related Work
Although the authors of LarKC (with the C-SPARQL definition) claim to support continuous queries, the implementation of the C-SPARQL engine in LarKC follows a different approach to the problem and is instead a solution oriented towards stream processing and analysis. The main reasons that prevent the use of an approach like the C-SPARQL query engine of LarKC in providing continuous query evaluation, particularly for BPM, are:
1. There is a conceptual difference between continuous query evaluation and C-SPARQL. C-SPARQL defines a query that is executed repeatedly at a fixed time interval specified in the query. This means that it adopts a pull approach in order to obtain new data.
Imagine a query is submitted to be executed continuously every hour: if new data matching the query is submitted to the system one and a half hours after the query was registered, this result will not be returned for another half hour, when the query is next executed.
2. C-SPARQL is also designed for processing data streams; therefore the syntax allows defining queries that are interested in a specific temporal window or slice (e.g. the last 30 minutes). As an example, we can submit a query that checks every hour the number of cars that have passed through a toll gate during the last 10 minutes. So imagine that at time x the query is executed and returns that, from time x − 10 minutes to time x, a number of cars y has passed through the gate; at time x′ = x + 1 hour the same query is executed and returns that y′ cars have passed through the gate from time x′ − 10 minutes to x′, and so on.
3. The system can also execute queries on the entire store, but that is akin to running the query after every time interval t, where all relevant results, old or new, are returned every time. This approach is not very scalable and is fundamentally different from our data-driven approach, where every continuous query invocation returns only the newest results.
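The notification delay inherent in the pull approach of item 1 can be worked through with a short sketch. The function names are hypothetical, and times are expressed in whole minutes since the query was registered.

```python
def pull_delivery_time(arrival: int, interval: int) -> int:
    """Time of the first periodic execution at or after the arrival of the data."""
    executions_needed = -(-arrival // interval)  # integer ceiling division
    return executions_needed * interval


def pull_delay(arrival: int, interval: int) -> int:
    """Minutes that matching data waits before a pull-based query reports it."""
    return pull_delivery_time(arrival, interval) - arrival


# Query executed every 60 minutes; matching data arrives after 90 minutes.
print(pull_delay(90, 60))
```

For the hourly query of the example, data arriving after one and a half hours waits 30 minutes for the execution at the two-hour mark, whereas a data-driven evaluation would be triggered by the arrival itself.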
Therefore the features of C-SPARQL are not practical for the present implementation. The main reason is the pull approach of LarKC for the continuous queries. This is a big limitation in case of real-time updates, because the client will receive results only after the query is executed again. This could be solved by defining very frequent query execution, but this approach will clearly lead to performance issues, for example, in case of infrequent updates to the data.
EP-SPARQL is an approach very similar to C-SPARQL: the two languages define extensions of SPARQL to be used in case of processing streams of information.
The ETALIS system, used to process EP-SPARQL queries, as we already pointed out, follows an approach that better fits the requirements of processing streams of information. Instead of a pull approach, the ETALIS system follows an approach by which new results are eventually generated when new information is entered in the system (push).
However, EP-SPARQL is not a solution used for continuously answering SPARQL queries once new information is entered in the system. Instead, it is a solution oriented towards complex event processing.
Also, the ETALIS system is based on a Prolog translation of the ontological knowledge base, meaning that the resulting system does not allow distributed processing so it is not scalable.
Furthermore, in order for the system to process events, the triples need to be annotated with timestamps.
The Atlas system has been briefly described above. This is a system based on Distributed Hash Tables (DHT) over a peer-to-peer or overlay network and it has several problems such as:
1. Scalability: the Atlas system is not scalable. The tests provided with the documentation report that the system has difficulty storing more than a million triples. This aspect also introduces performance issues because each query goes across several nodes before finding a result. Moreover, data is replicated: each triple inserted into Atlas is stored on multiple nodes.
2. No failover mechanism: due to the peer-to-peer nature of the approach, the triples are replicated on some nodes. However, in the case of node failure or network problems, the triples are not restored. This also compromises the execution of the continuous query.
3. Old triples need to be “remembered” for matching later on in the query.
4. The queries in Atlas are defined using RDQL, a query language that has limited expressiveness.
It is therefore an aim of the present invention to solve the problems intrinsic to Atlas technology and also to improve the performance and expressiveness of the query language.
From the above, it can be seen that current SPARQL execution engines provide limited support for sophisticated continuous querying. It is therefore an object of the present invention to provide methods and systems which are able to execute normal SPARQL queries over real-time data with no modifications.
It is a further aim of the present invention to provide methods and systems which are inherently scalable, for example by distributing processing across all available grid nodes.
It is a further aim of the present invention to provide methods and systems which have a simple interface for subscribing to notifications in order to make integration with other systems easy.
A further object of the present invention is to allow the analysis and monitoring of process execution in real-time by processing event streams.