Business Process Mining (BPM) is an increasingly important area of information technology that helps businesses make the best use of their resources. It forms part of Business Process Management and allows companies to analyze their processes on the basis of actual data collected from their systems, with the goal of enabling them to adapt quickly to changing business conditions. BPM speeds up process analysis by automatically generating real-time process models from existing system logs, event logs, database transactions, audit trail events or simple management information. By linking information from these sources, it allows instant analysis of the business processes, generating interactive visual displays that show how a process works and how specific case characteristics influence processing times.
Process Mining can dramatically accelerate a business's process discovery. A mined process model shows the actual underlying process and gives users insight into how work is actually flowing through the business.
Existing BPM systems, however, fall short of their promise. They tend to be restricted in the kind of data they can analyse and can sometimes put restrictions on the data model itself. Furthermore, the kind of analysis that can be performed is normally restricted to an inextensible and inflexible set of functions, because these are closely tied to the data model being used.
In state-of-the-art systems, most business processes are modelled using expressive representations that can be easily enacted in the enterprise. Usually, at the same time, a comprehensive set of KPIs, SLAs and other formalisms, oriented towards analyzing the company's performance and behaviour, is defined and constantly monitored.
This approach is promising and efficient for a fully automated process. However, the same cannot be said for processes involving a high number of uncontrolled variables, such as human actors, unpredictable events (e.g. weather, accidents) and other tasks (e.g. agreements in the form of contracts) that do not behave as expected (e.g. tasks that are not executed or procedures that are not followed). Such information cannot easily be captured or represented in the process model.
In such situations, if the performance is monitored by analysing the “expected” process model, the KPIs can give a distorted view of reality because they are analyzing an “ideal” process model that may actually never have existed in the organisation. Furthermore, sometimes processes are implicitly executed even if they have never been formally defined. Obtaining incorrect indications will lead to incorrect actions that can affect the overall performance of the company.
Therefore, it is important to analyse process models created from real data and real interactions, building the model from a flow of information that reflects real actions and not from a pre-defined model.
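As an illustration of building a model from recorded actions rather than from a pre-defined model, the following minimal sketch derives a directly-follows relation from an event log. The log layout (case identifier, activity, timestamp), the sample events and the function name are assumptions made for this example only. Directly-follows counting is one common process-discovery technique and is not asserted to be the method of any system discussed here.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case_id, activity, timestamp) tuples.
EVENT_LOG = [
    ("c1", "receive_order", 1),
    ("c1", "check_stock", 2),
    ("c1", "ship", 3),
    ("c2", "receive_order", 1),
    ("c2", "check_stock", 4),
    ("c2", "reorder", 5),
    ("c2", "ship", 9),
]


def directly_follows(log):
    """Count how often activity a is immediately followed by b within a case."""
    traces = defaultdict(list)
    for case_id, activity, timestamp in log:
        traces[case_id].append((timestamp, activity))

    counts = Counter()
    for events in traces.values():
        # Order each case by timestamp, then count adjacent activity pairs.
        ordered = [activity for _, activity in sorted(events)]
        for a, b in zip(ordered, ordered[1:]):
            counts[(a, b)] += 1
    return counts


dfg = directly_follows(EVENT_LOG)
```

In this sample, the pair (check_stock, reorder) appears in the mined relation even though an idealised model might only foresee check_stock followed by ship, which is the kind of deviation that monitoring against an "expected" model would miss.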
Current technology typically requires the translation or import of data into the system before it can be analysed. This imposes severe constraints on the way the data is defined and collected. Limitations are also imposed by predefined database schemas that are neither flexible, nor easily extensible. This means that new information cannot easily be captured by the system without significant modifications. The scope for analysis and information extraction is hence severely limited.
Moreover, such systems usually suffer from scalability issues, not least because the database itself becomes the bottleneck when a large number of requests are executed against it. The use of databases also makes real-time updates of data very difficult.
In the following section, we describe existing work that mainly focuses on the current generation of triple stores.
Triple Stores
The data model of a triple store is based on the Resource Description Framework (RDF) [2]. An RDF-based data model is generally more naturally suited to representing continuously evolving and unpredictable types of knowledge.
The atomic element of an RDF-based data model is a triple, which is composed of three mandatory parts: a subject, a predicate and an object. Some systems add further information to this basic set, such as the name of the graph (i.e. the namespace), timestamps and other attributes. A set of triples defines the RDF graph, which is a directed, multi-labelled graph. At its most basic, a triple store is a system that provides the ability to store an RDF graph together with a query interface to that graph, traditionally SPARQL (SPARQL Protocol and RDF Query Language) [1].
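The triple structure described above can be sketched as a very small in-memory store. This is an illustrative sketch only: the class name, method names and the "ex:" identifiers are hypothetical and do not correspond to the API of Sesame, Jena or any other product, and a wildcard pattern match stands in for a real SPARQL interface.

```python
from typing import Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TinyTripleStore:
    """Illustrative in-memory store; not the API of any real product."""

    def __init__(self) -> None:
        # A set of triples is the RDF graph; duplicates are ignored.
        self._triples: set = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in self._triples:
            if subject is not None and t.subject != subject:
                continue
            if predicate is not None and t.predicate != predicate:
                continue
            if obj is not None and t.obj != obj:
                continue
            yield t


store = TinyTripleStore()
store.add("ex:order1", "ex:handledBy", "ex:alice")
store.add("ex:order2", "ex:handledBy", "ex:bob")
store.add("ex:alice", "ex:worksIn", "ex:sales")

# All subjects that have an ex:handledBy edge.
handled = sorted(t.subject for t in store.match(predicate="ex:handledBy"))
```

Because every fact has the same three-part shape, a new kind of relationship can be added by introducing a new predicate, without changing any schema.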
A large number of triple stores are available, whether commercially, as open source or in the literature. The best known are Sesame [3] and Jena [4]. These are general-purpose platforms that usually offer a query engine interface and parse queries defined using the SPARQL standard.
The triple stores currently available can be broadly divided into centralized and distributed approaches. The centralized approaches include Sesame and Jena. However, both have also been reused and extended in some distributed approaches. Other distributed approaches include a wide variety of commercial solutions, such as Oracle Database 11g Semantic Technologies [5], BigOWLIM [6], AllegroGraph [7] and 4Store [8], which have been used in many applications.
Answering a query in a triple store means resolving a conjunctive query over the RDF graph held in the store. Several query languages are used for triple stores. The most frequently used and most expressive is SPARQL (the 1.1 extension [9] is currently under standardization by the World Wide Web Consortium (W3C)).
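To make the notion of a conjunctive query concrete, the sketch below evaluates a list of triple patterns against a small set of triples by extending variable bindings one pattern at a time (a naive nested-loop join). The data, the "?name" variable convention and the function names are assumptions for illustration; real SPARQL engines use indexes and query planning rather than this exhaustive approach.

```python
# Hypothetical triples; identifiers are illustrative only.
TRIPLES = [
    ("ex:order1", "ex:handledBy", "ex:alice"),
    ("ex:order2", "ex:handledBy", "ex:bob"),
    ("ex:alice", "ex:worksIn", "ex:sales"),
    ("ex:bob", "ex:worksIn", "ex:support"),
]


def is_var(term):
    """Terms starting with '?' are treated as variables."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Return the binding extended so that pattern matches triple, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Resolve a conjunction of triple patterns by joining pattern by pattern."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# "Which orders are handled by someone who works in sales?"
query = [("?order", "ex:handledBy", "?who"), ("?who", "ex:worksIn", "ex:sales")]
results = evaluate(query, TRIPLES)
```

The shared variable "?who" is what forces the join between the two patterns, and it is this join step that becomes costly when the matching triples reside on different machines.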
Due to the highly interconnected nature of RDF graphs, the storage of data and query processing are extremely demanding in terms of space and computational power. Therefore, centralized solutions are now showing their limits and are being progressively replaced by distributed applications that are able to scale the RDF graph, albeit at the expense of more computational power. However, distributed approaches also introduce several new problems, the main one being the execution of the join operations in answering conjunctive queries.
Distributed solutions are, at the moment, limited to peer to peer (P2P) or overlay networks (Tsc++ [10], Atlas [11]), and DHTs or MapReduce [12] (LarKC [13], SHARD [14]). The main feature of these systems is that, since they are built on top of a distributed architecture, they can be (more or less) easily extended according to the requirements of a particular application.
This is indeed a fundamental feature for applications that have to process a huge amount of data quickly, especially considering that ontological data can comprise billions of instances. However, at the moment, performance is comparable to that of centralized triple stores.
Within the P2P approaches, the main players are Tsc++, which is based on the triple space computing paradigm, and Atlas, which is based on decentralized indexing of the triples. Both approaches rely on P2P networks and are designed not as a persistent triple store but as a metadata repository for web service descriptions. Therefore, these systems address different problems from those underlying the present invention.
The LarKC and SHARD projects have developed two approaches based on distributed hash tables. LarKC stands for Large Knowledge Collider and is a publicly funded project from the European Union under the FP7 Framework. LarKC is a plug-in based framework for the processing of Semantic Web data. Processes are defined as workflows of basic operations, which are:
Identify: where do the axioms and data come from that contribute to a solution?
Transform: how to abstract that information into the forms needed by further heterogeneous components?
Select: which part of knowledge & data is required?
Decide: when is an answer “good enough” or “best possible” for a specific purpose?
Reason: what can be derived from information automatically, using deductive, non-deductive inference, etc.
Each operation in the project is defined as an interface that is implemented by a specific plug-in. The standard distribution of LarKC includes a select plug-in based on the Map/Reduce paradigm. The Map/Reduce approach is also used by the SHARD project.
Map/Reduce is a two-step approach that allows high scalability and parallelization of the query execution. The problem with Map/Reduce approaches is that there is no communication between nodes, so their performance is determined heavily by the data partitioning used. The data insertion and query processes have to make implicit assumptions about where the data is stored (the partitioning step). Since there is no communication between nodes, each node runs its query independently, without using any relevant or potentially useful information from the other nodes or data partitions. This means that, especially with big datasets, a huge amount of information is often retrieved in the intermediate steps when only a subset would have been relevant.
Accordingly, the main problems with existing peer to peer solutions include:
Scalability: peer to peer systems are not scalable and data is replicated, since each triple that is inserted in the system is stored in multiple peers. This problem is shared by the Map/Reduce approach, where a high frequency of a particular predicate, subject or object may overload some nodes.
Failover mechanisms: due to the peer to peer nature of the approach, triples are replicated in some nodes. However, if nodes fail, the triples are not recovered.
Reliance on P2P/overlay networks: nodes do not share data and all communication is via messages. This normally means that intermediate results are communicated to a common node that performs the join operation. There will typically be several join operations in a single query execution as intermediate results are merged together. In some cases, this means that network communication can become a problem.
On the other hand, the Map/Reduce approach relies heavily on the partitioning of the data. In a triple store it is possible to make “realistic” assumptions about where data will be stored based on the predicate, object and subject type of the triple, typically in that order. However, these assumptions vary drastically depending on the application. Most systems using the Map/Reduce approach will rely heavily on this partitioning step. It is also very hard to repartition data dynamically in situations where the initial partitioning assumption is no longer efficient.
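The sensitivity to partitioning can be illustrated with a small sketch that assigns triples to nodes by hashing the predicate. The hash function (CRC32), the number of nodes and the sample data are assumptions chosen for this example and do not describe how any particular system partitions its data.

```python
import zlib
from collections import Counter


def node_for(key, num_nodes):
    """Map a partitioning key to a node index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_by_predicate(triples, num_nodes):
    """Return the number of triples each node would store."""
    load = Counter()
    for _, predicate, _ in triples:
        load[node_for(predicate, num_nodes)] += 1
    return load


# Hypothetical dataset in which one predicate dominates.
TRIPLES = [(f"ex:case{i}", "rdf:type", "ex:Case") for i in range(90)]
TRIPLES += [(f"ex:case{i}", "ex:owner", f"ex:user{i}") for i in range(10)]

load = partition_by_predicate(TRIPLES, num_nodes=4)
hot_node = node_for("rdf:type", 4)
```

All 90 triples that share the frequent predicate land on the same node regardless of how many nodes are available, so a choice that suits one dataset or query type can overload a single node in another.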
The query algorithms used in Map/Reduce approaches are also heavily dependent on how data is partitioned. Any partitioning algorithm used could speed up one type of query significantly but greatly reduce the performance of another. Also, because there is no communication between nodes, each node runs its atom (also called a query clause in some systems) without relevant information from the other atom(s), which means that huge datasets are generated when only a subset would have been relevant.
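The effect of evaluating each atom in isolation can be sketched as follows. The two-node layout, the data and the helper name are hypothetical, and the map and reduce steps are simulated in a single process rather than run on Hadoop or any other middleware.

```python
# Hypothetical data held by two nodes; identifiers are illustrative only.
NODE_DATA = {
    "node_a": [(f"ex:order{i}", "ex:handledBy", f"ex:user{i}") for i in range(100)],
    "node_b": [
        ("ex:user7", "ex:worksIn", "ex:sales"),
        ("ex:user8", "ex:worksIn", "ex:support"),
    ],
}


def local_match(triples, predicate, obj=None):
    """Map step: a node evaluates one atom over its own triples only."""
    return [
        (s, o)
        for s, p, o in triples
        if p == predicate and (obj is None or o == obj)
    ]


# Map: every node evaluates every atom without knowing the other atom's results.
handled = [r for data in NODE_DATA.values() for r in local_match(data, "ex:handledBy")]
in_sales = [
    r for data in NODE_DATA.values() for r in local_match(data, "ex:worksIn", "ex:sales")
]

# Reduce: a single node joins the intermediate results on the shared variable.
sales_staff = {who for who, _ in in_sales}
answers = [(order, who) for order, who in handled if who in sales_staff]

intermediate_rows = len(handled) + len(in_sales)
```

Here 101 intermediate rows are produced and shipped to the joining node in order to obtain a single answer. Had the first atom been evaluated with knowledge that only ex:user7 works in sales, a single row would have sufficed.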
Many enterprises have business models that are highly dependent on processes that are executed both in good time (meeting Cycle Time measures) and correctly, with minimal failures and repeats (meeting Right First Time measures). In such large organisations, in order to support the business in managing and improving processes across different systems and divisions, solutions are required that can analyse different forms of business process data in order to determine the real state of execution of the processes and accurately evaluate the performance measures associated with them. As automation is becoming an increasingly ubiquitous feature of process execution, there is a growing need for monitoring and analysis capabilities that can, firstly, cope with the large amount of data generated during the process life spans and, secondly, provide alarms very rapidly when risks of failure are detected.
Thus it is an object of the present invention to provide methods and systems which allow the analysis and monitoring of process execution using the latest, most up-to-date data available. A further object of the present invention is to provide an approach which is completely data-driven and does not rely on a predefined model for information extraction.
The primary distributed computing technology used by triple stores and query engines is Hadoop and similar Map/Reduce middleware, which demand run-once semantics. Real-time data streams typically require that the data is updated and written to file between rounds of execution, and then loaded back into memory. This makes handling real-time data very difficult, if not impossible. Thus it is an object of the present invention to provide a system which can allow a query to be activated as more data becomes available without requiring such expensive data update operations.