Stream processing typically follows the pattern of continuous queries, which may be thought of in some instances as being queries that execute for a potentially indefinite amount of time on data that is generated or changes very rapidly. Such data are called streams, and streams oftentimes comprise events. Such streams often exist in real-world scenarios, e.g., as temperature readings from sensors placed in warehouses or on trucks, weather data, entrance control systems (where events are generated whenever a person enters or leaves, for instance), etc. Events may include attributes (also sometimes referred to as a payload) such as, for example, the value of temperature readings and metadata (sometimes referred to as a header or header data) such as, for example, creation date, validity period, and quality of the event. Possible events occurring in an environment typically are schematically described by so-called event types, which in some respects are somewhat comparable to table definitions in relational databases. Streams may in certain scenarios be organized in channels that in turn are implemented by an event bus. Channels and event types in this sense may be considered orthogonal concepts, e.g., in the sense that channels may comprise events of several event types, and events of the same event type might be communicated via different channels.
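By way of illustration only, the relationship among events, event types, and channels described above may be sketched as follows. The class names (EventType, Event, Channel), the field names, and the conforms check are hypothetical and are introduced solely for this sketch; they are not taken from any particular event bus or CEP product.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EventType:
    """Schematic description of an event, loosely comparable to a table definition."""
    name: str
    attributes: Dict[str, type]  # attribute name -> expected Python type (assumed schema form)

    def conforms(self, payload: Dict[str, Any]) -> bool:
        # A payload conforms when it has exactly the declared attributes with matching types.
        return payload.keys() == self.attributes.keys() and all(
            isinstance(payload[name], expected)
            for name, expected in self.attributes.items()
        )


@dataclass
class Event:
    event_type: EventType
    header: Dict[str, Any]   # metadata, e.g., creation date or validity period
    payload: Dict[str, Any]  # attributes, e.g., a temperature value


@dataclass
class Channel:
    """A channel may carry events of several event types."""
    name: str
    events: List[Event] = field(default_factory=list)

    def publish(self, event: Event) -> None:
        if not event.event_type.conforms(event.payload):
            raise ValueError(f"payload does not match event type {event.event_type.name}")
        self.events.append(event)
```

In this sketch, one Channel instance can hold events of different EventType instances, and events of one EventType can be published to different Channel instances, which mirrors the orthogonality noted above.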
Event streams are typically used in computer systems adhering to the event-driven architecture (EDA) paradigm. In such systems, several computer applications each execute on distinct computer systems and are typically interconnected by a network, such as a local area network or even the Internet. Each application typically is in charge of executing a certain processing task, which may represent a processing step in an overall process, and each application typically communicates with the other applications by exchanging events. Examples include the calculation of complex mathematical models (e.g., for weather forecasts or scientific computations) by a plurality of distributed computers, the control of an assembly line (e.g. for the manufacturing of a vehicle, wherein each assembly step is controlled by a particular application participating in the overall assembly process), etc. It is noted that a multitude of processes, potentially of different applications (and thus not necessarily of one overall process), also may be supported. Generally, events may be represented in a variety of different formats. The XML format, for instance, is one common format in which events and their associated event types may be represented.
In a Complex Event Processing (CEP) system, events may be evaluated and aggregated to form derived (or complex) events (e.g., by an engine or so-called event processing agents). A typical manner to specify such evaluation and aggregation involves using CEP queries, which oftentimes are formulated in an SQL-like query language that is enhanced by some CEP-specific clauses such as, for example, a WINDOWS or ROWS clause to define conditions that relate to the occurrence of events within streams or channels. Typically, CEP systems are used to automatically trigger some activity, e.g., an appropriate reaction to an unusual situation that is reflected by the occurrence of some event patterns. A common mechanism to trigger reactions includes querying (or having some agent(s) listening) for specific complex events on dedicated channels and executing the appropriate action when such an event is encountered.
In contrast with database systems, which run queries to analyze a certain state of the data, CEP systems perform “continuous” query execution on streams, e.g., a query is evaluated continuously and for a potentially indefinite period of time.
Thus, CEP may be thought of as a processing paradigm that describes the incremental, on-the-fly processing of event streams, typically in connection with continuous queries that are continuously evaluated over event streams.
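As a minimal sketch of such incremental, on-the-fly evaluation, the following example maintains an average over a time-based sliding window and updates it once per incoming event rather than re-scanning stored data. The names Reading and SlidingAverage, the use of seconds as the time unit, and the choice of an average as the aggregate are assumptions made purely for illustration; the sketch is not the query language of any particular CEP engine.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass(frozen=True)
class Reading:
    timestamp: float    # hypothetical header field, in seconds
    temperature: float  # hypothetical payload attribute


class SlidingAverage:
    """Average over the events of the last `window` seconds, maintained incrementally."""

    def __init__(self, window: float) -> None:
        self.window = window
        self._events: Deque[Reading] = deque()
        self._total = 0.0

    def on_event(self, event: Reading) -> float:
        # Add the new event to the window and to the running sum.
        self._events.append(event)
        self._total += event.temperature
        # Evict events that are no longer inside the window (timestamps assumed non-decreasing).
        while self._events[0].timestamp <= event.timestamp - self.window:
            expired = self._events.popleft()
            self._total -= expired.temperature
        # The newest event is always inside the window, so the deque is never empty here.
        return self._total / len(self._events)
```

Each call to on_event touches only the new event and any expired events, which is the property that distinguishes a continuous query from a periodically re-run database query.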
The newly introduced notion of “Big Data” refers to the fact that enterprises nowadays face challenging data management problems. Data is massively increasing in terms of volume, variety, and velocity. Besides the increase of common transaction-based data, other data sources emerge such as, for example, data from social media, mobile devices, sensor networks, etc. For companies striving to improve customer interaction and responsiveness, a suitable management of that big data is important. It therefore will be appreciated that their corresponding enterprise applications and analytic tasks could benefit from more efficient and insightful data access, particularly when complemented with sophisticated data analysis techniques.
Distributed grid technologies, for example, have gained importance as a way to provide efficient data access. By using multiple in-memory caches, efficient data access as well as scalability can be achieved. Recently, CEP technologies such as those outlined above have been coupled with that caching approach to allow for efficient cache searching. In some such cases, continuous SQL queries process the streams of updates on the caches, search for relevant data, and publish these search results continuously to dedicated result caches. Thus, users can directly observe the latest results by querying those result caches.
Unfortunately, however, it is believed that no meaningful consideration has been given to the statistical modeling of cache characteristics. Similarly, it is believed that no meaningful consideration has been given to the fact that such a model can be automatically updated and can keep track of latest changes in the cache characteristics.
In general, the use of data mining and statistical modeling is well-established in enterprise applications, as it allows one to capture core characteristics of data, derive important relationships, and forecast future behavior. While there are a number of existing approaches to general data mining, they do not provide a full spectrum of solutions. For example, a database system can manage large amounts of data and store them persistently. By means of queries, selected subsets of the data can be retrieved and used for further analysis. Using that approach, standard data mining algorithms are implemented on top of the database system. To get a summary of all data currently stored in the database, several statistical models can be computed, including the estimation of value distributions. However, database systems are not designed for continuous processing of incoming events. As a consequence, they are also not designed for incrementally updating statistical models in a real-time manner. Because disk access is slow, one can at best query the database periodically and compute the corresponding statistical models. As a result, decisions might be based on outdated data characteristics. Database systems also support triggers that can be fired when database operations are executed, although these triggers do not always scale well for large amounts of data streaming in at high rates.
Using a data warehouse approach, a database manages the data, which is periodically loaded into a data warehouse that conducts additional data-condensing operations. Standard mining techniques can then be applied on top of the warehouse. Unfortunately, similar to the database approach, the data warehouse approach is not suitable, as it is not always kept up-to-date. Data generally is loaded into the warehouse in a periodic fashion and then the data characteristics are computed, which is typically a very time-consuming process. Thus, statistical models can be computed, but most likely will not be up-to-date with respect to the latest trends.
As indicated above, distributed grids typically utilize multiple in-memory caches to allow for fast data access. The data being cached can originate from arbitrary sources such as, for example, databases or streaming data sources. In order to search for specific data, ad-hoc queries can be used. Ad-hoc queries typically traverse all data currently in the cache and select the data of interest. To accelerate the search, caches typically support an additional indexing of relevant attributes. Unfortunately, however, using ad-hoc queries for searching in the cache contents may be very time consuming, as indexes cannot always be leveraged, in which case the complete cache may need to be traversed. Additionally, these queries typically can only derive basic summary statistics of the cache contents such as, for example, minimum, average, and count statistics. These simple statistics may not uncover important data characteristics as can be done with more sophisticated statistical models.
A recent extension of the distributed grid approach uses Complex Event Processing to accelerate search requests. Typically, caches can provide listeners that provide notifications concerning recent cache operations. Continuous queries are registered to those listeners and incrementally process the notifications on cache operations. Each continuous query corresponds to one search request. The results in the result stream of the query are continuously inserted into or removed from an associated search result cache. Thus, the result cache contains the latest result for the current cache contents, which is the same data as if an ad-hoc query had been executed over the current cache. Because continuous SQL queries over cache update operations are used to compute search results in an incremental online manner, the search operation on the cache is very fast. However, as in the previous approach, SQL queries can only derive basic summary statistics such as those listed above. It would be desirable to exploit higher-value statistical analysis to uncover and analyze the characteristics of the stream, which cannot be done with continuous SQL queries and current techniques. Additionally, it is believed that the continuous query approach currently is limited to accelerating searches in caches, while it would be desirable to allow for other applications.
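The result-cache mechanism described above may be sketched, under simplifying assumptions, as a listener that applies each cache notification to a search result cache. The class name ResultCacheQuery, the on_notification signature, and the representation of a search request as a Python predicate are hypothetical; actual grid products expose their own listener interfaces and query languages.

```python
from typing import Any, Callable, Dict, Optional


class ResultCacheQuery:
    """One continuous query, i.e., one search request, with its associated result cache."""

    def __init__(self, predicate: Callable[[Any], bool]) -> None:
        self.predicate = predicate              # the search condition (assumed to be a predicate)
        self.result_cache: Dict[str, Any] = {}  # always reflects the current cache contents

    def on_notification(self, op: str, key: str, value: Optional[Any] = None) -> None:
        # `op` is assumed to be one of "insert", "update", or "remove".
        if op in ("insert", "update") and self.predicate(value):
            # The element matches the search request: insert or refresh it.
            self.result_cache[key] = value
        else:
            # The element was removed, or no longer matches after an update.
            self.result_cache.pop(key, None)
```

After any sequence of notifications, result_cache holds the same elements that an ad-hoc scan of the underlying cache with the same predicate would return, but each notification is handled without traversing the cache.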
Thus, it will be appreciated that there is a need in the art for the management and analysis of Big Data, as well as improved techniques for using grid technologies for caching big data.
One aspect of certain example embodiments relates to providing a meaningful live analysis of major characteristics of a cache. In addition to considering simple descriptive statistics, certain example embodiments make it possible to leverage well-defined statistical models that capture the main behavior of the cache. These models may in some instances be computed in an online manner over cache changes and therefore may automatically keep track of recent cache behavior. Such features are advantageous, as analytical models otherwise typically are derived periodically and, as a consequence, are likely to be outdated.
Another aspect of certain example embodiments relates to combining two dimensions of the cache behavior. The resulting combined model may be thought of as a compact representation of the cache behavior that captures not only the way the data in the caches behaves, but also how it evolves over time. Certain example embodiments provide a complementing visual representation of the combined model, e.g., to provide the user with an intuitive way to analyze the cache and its behavior. By setting a temporal analysis range, for example, the user may additionally or alternatively adjust the time span on which the continuous analysis is based. Thus, short-term as well as long-term tendencies advantageously can be revealed.
Certain example embodiments advantageously make it possible to identify changes in the data and take quick reactions to such changes in the data, while also enabling proactive reactions to be taken based on recent developments. Thus, enterprise applications on top of the cache may be accorded powerful analytical means to capture recent changes.
Of course, standard data mining is different from stream data mining, as the latter approach refers to mining algorithms more specifically adapted to the streaming data scenario. And while there are a number of commercially available CEP engines that are built to allow for low-latency processing of high-volume event streams, none seems to leverage stream mining on cache update streams in order to derive continuously a statistical model of the stream that describes the data distribution and the validity characteristics in different data regions in a combined manner. Accordingly, none seems to provide for the online computation of a combined distribution and validity model, or comparable technologies.
In certain example embodiments, a method of analyzing the behavior and parameters of a cache in a computer system over a temporal range of analysis is provided. Notifications indicating that respective cache operations have been performed in connection with respective elements and the cache are received over a first stream, with each said operation having an operation type, and with the operation type being designated as one of an insert, update, or remove operation for the respective element. For each received notification where a selected element attribute of interest is available therein: information regarding a key of the respective element, the respective selected element attribute of interest, the respective operation type, and respective timestamp(s) associated with the respective operation, is extracted from the respective notification; and value and validity distribution models are computed using the extracted information. The computing of the value distribution model, in connection with a given notification and an associated given element, comprises: updating a temporal buffer of inserted and not yet removed and/or updated elements to include an entry for the given element, with the temporal buffer defining a range of elements to be considered in the computing of the value distribution model; and calculating a value distribution for the selected attribute of interest based on elements in the temporal buffer. 
The computing of the validity model, in connection with a given notification and an associated given element, comprises: ignoring the given notification when the given element has an insert operation type; calculating a validity value for the given element as a difference between first and second timestamps, where (a) for remove operation types, the first timestamp indicates when the given element was removed and the second timestamp indicates when the given element was inserted, and (b) for update operation types, the first timestamp indicates when an old element was removed and the given element was inserted and the second timestamp indicates when the old element was inserted; ignoring the given notification and the given element when the validity value is greater than a window size corresponding to the temporal range of analysis; and when the validity value is less than or equal to the window size determining a temporal partition of the temporal range of analysis into which the attribute of interest associated with the given element falls, and publishing an event to a second stream, the event indicating the validity value and the determined temporal partition; and running a query on the second stream in order to derive summary statistics for validity values in the partitions.
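The validity computation described above may be sketched as follows. This is a minimal illustration under stated assumptions and is not a definitive implementation: the names Notification, ValidityEvent, ValidityModel, and summarize are hypothetical; the partitioning rule is supplied by the caller as partition_of because the text above does not fix how partitions are formed; insertion timestamps are assumed to be tracked per key from earlier notifications; and, for update operations, the partition is assumed to be determined from the attribute value of the old element whose validity has ended.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Notification:
    key: str
    op: str           # "insert", "update", or "remove"
    attribute: float  # selected attribute of interest
    timestamp: float  # time of the cache operation


@dataclass(frozen=True)
class ValidityEvent:
    partition: int
    validity: float


class ValidityModel:
    def __init__(self, window_size: float, partition_of: Callable[[float], int]) -> None:
        self.window_size = window_size
        self.partition_of = partition_of  # assumed, caller-supplied partitioning rule
        self._live: Dict[str, Tuple[float, float]] = {}  # key -> (insert time, attribute)
        self.second_stream: List[ValidityEvent] = []

    def on_notification(self, n: Notification) -> Optional[ValidityEvent]:
        previous = self._live.get(n.key)
        # Keep track of when the currently cached element for this key was inserted.
        if n.op == "remove":
            self._live.pop(n.key, None)
        else:
            self._live[n.key] = (n.timestamp, n.attribute)
        # Insert operations do not end any validity period and are ignored.
        if n.op == "insert" or previous is None:
            return None
        inserted_at, old_attribute = previous
        validity = n.timestamp - inserted_at  # first timestamp minus second timestamp
        if validity > self.window_size:
            return None  # outside the temporal range of analysis
        event = ValidityEvent(self.partition_of(old_attribute), validity)
        self.second_stream.append(event)  # publish to the second stream
        return event


def summarize(stream: Iterable[ValidityEvent]) -> Dict[int, Dict[str, float]]:
    """Stand-in for the query on the second stream: per-partition summary statistics."""
    grouped: Dict[int, List[float]] = defaultdict(list)
    for event in stream:
        grouped[event.partition].append(event.validity)
    return {p: {"count": len(v), "mean": sum(v) / len(v), "max": max(v)}
            for p, v in grouped.items()}
```

In use, a caller might pass partition_of=lambda v: int(v // 10) to group elements into bands of width 10 (an arbitrary choice for illustration), feed every cache notification to on_notification, and call summarize(model.second_stream) to obtain the count, mean, and maximum validity per partition.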
In certain example embodiments, an event processing system for analyzing the behavior and parameters of a cache in a computer system over a temporal range of analysis is provided. An event bus is configured to receive a first stream of events corresponding to respective operations on the cache, with each said event being associated with a respective element operating on the cache, and with each said operation having an operation type, and the operation type being selected from the group consisting of insert, update, and remove operation types. An event processing engine comprising processing resources includes at least one processor, the event processing engine being configured, for each event received from the first stream where the respective element has a pre-selected attribute of interest associated therewith, to (a) compute a value distribution model and (b) compute a validity distribution model. Part (a) is computed by at least: updating a temporal buffer of inserted and not yet removed and/or updated elements to include an entry for the respective element, with the temporal buffer defining a range of elements to be considered in computing the value distribution model and including at least data indicative of the attribute of interest for the elements therein; and calculating a value distribution of the attributes of interest for the elements in the temporal buffer. 
Part (b) is computed by at least: ignoring the respective event when the given element has an insert operation type; calculating a validity value for the respective element as a difference between first and second timestamps, with the first timestamp indicating when the given element was removed and the second timestamp indicating when the given element was inserted for remove operation types, and with the first timestamp indicating when an old element was removed and the given element was inserted and the second timestamp indicating when the old element was inserted for update operation types; ignoring the respective event when the validity value is greater than a window size corresponding to the temporal range of analysis; and when the validity value is less than or equal to the window size, determining a temporal partition of the temporal range of analysis into which the attribute of interest associated with the given element falls, and publishing a new event to a second stream of events, the new event indicating the validity value and the determined temporal partition. A query is run on the second stream in order to derive summary statistics for validity values in the partitions.
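Part (a), i.e., the temporal buffer and the value distribution computed over it, may be sketched in a similarly hedged manner. The sketch below uses an equal-width histogram as the value distribution, which is a simple estimator chosen for this illustration rather than an estimator prescribed by the text above. The class name ValueDistributionModel, the parameters lo, hi, and bins, and the rule that buffer entries older than the window size are evicted are likewise assumptions.

```python
from typing import Dict, List, Tuple


class ValueDistributionModel:
    def __init__(self, window_size: float, lo: float, hi: float, bins: int) -> None:
        self.window_size = window_size
        self.lo, self.hi, self.bins = lo, hi, bins
        # Temporal buffer: key -> (timestamp, attribute value) of elements that were
        # inserted and have not yet been removed or replaced by an update.
        self._buffer: Dict[str, Tuple[float, float]] = {}

    def on_notification(self, key: str, op: str, value: float, timestamp: float) -> None:
        # Drop any previous entry for this key (a removal, or the old version on update).
        self._buffer.pop(key, None)
        if op in ("insert", "update"):
            self._buffer[key] = (timestamp, value)
        # Assumed eviction rule: entries older than the window leave the range of analysis.
        horizon = timestamp - self.window_size
        for stale in [k for k, (ts, _) in self._buffer.items() if ts < horizon]:
            del self._buffer[stale]

    def distribution(self) -> List[float]:
        """Relative frequency per equal-width bin over [lo, hi); out-of-range values are skipped."""
        counts = [0] * self.bins
        width = (self.hi - self.lo) / self.bins
        for _, value in self._buffer.values():
            if self.lo <= value < self.hi:
                index = min(int((value - self.lo) / width), self.bins - 1)
                counts[index] += 1
        total = sum(counts)
        return [c / total if total else 0.0 for c in counts]
```

Because the buffer is updated on every notification, distribution() always reflects the elements currently within the range of analysis. A production system would likely update the estimate incrementally rather than recounting the buffer on each call, which this sketch does only for brevity.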
In certain example embodiments, a method of analyzing the behavior and parameters of a cache in a computer system over a temporal range of analysis is provided. Notifications indicating that respective cache operations have been performed in connection with respective elements and the cache are received over a first stream, with each said operation having an operation type, and with the operation type being designated as one of an insert, update, or remove operation for the respective element. For each received notification where a selected element attribute of interest is available therein: information regarding a key of the respective element, the respective selected element attribute of interest, the respective operation type, and/or respective timestamp(s) associated with the respective operation, is extracted from the respective notification; and a value distribution model and a validity distribution model are computed using the extracted information.
In certain example embodiments, there is provided a non-transitory computer readable storage medium tangibly storing instructions that, when executed by at least one processor of a system, perform a method as described herein.
Similarly, in certain example embodiments, there is provided a computer program comprising instructions for implementing a method as described herein; and/or an event processing system for analyzing the behavior and parameters of a cache in a computer system over a temporal range of analysis, adapted for performing a method described herein.
These aspects, features, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.