Today, companies have to deal with an ever-increasing flood of business-relevant data. Indeed, because of technological advances and high degrees of connectivity, more and more data is being produced on a daily basis. This phenomenon is spread across practically all industries.
Not only is the volume of data increasing, but so are the frequency with which it is produced and its variety. This phenomenon relates to what some have termed Big Data. Although there are a number of different definitions of Big Data, those skilled in the art understand that it generally refers to datasets so large and/or complex that traditional data processing applications are inadequate.
The amount and complexity of data being generated increases even more with increases in the number of devices, systems, and services being connected with each other, e.g., over the Internet. In this regard, the so-called Internet of Things (IoT) is further increasing the volumes of data being produced on a daily basis, as well as the variety of the produced data. The IoT refers generally to the interconnection of devices and services using the Internet. The number of connected devices emitting information has increased rapidly and is expected to continue increasing significantly. The IoT thus involves the handling of huge, heterogeneous volumes of data.
Regardless of whether it arises in the IoT or another technology-driven environment, this data at least in theory may have a high value to businesses, as it may be a part of the foundation of value-added services that can be offered to the customer. For example, a production machine may be made to automatically order items it is running short of by continuously comparing the current item consumption with the remaining capacities, a smart home for older people may detect a collapse by evaluating pressure sensors in the carpet, the churn rate of cell phone customers can be reduced by sending them coupons as compensation for current connectivity problems, etc.
In such a setup, e.g., with dynamically varying numbers of data sources providing high volumes of heterogeneous data, proper data analysis can be quite a challenging task. For example, such analyses are not linear sequences of steps leading to an a priori expected result. Instead, they tend to involve more iterative and interactive processes with different analysis methods applied to different subsets of the data. Oftentimes, more insights on the data are derived with each iteration.
In that context, visual analytics sometimes is a reasonable approach, as it can combine automated data analysis with visual analysis conducted by the user. The user interactively may, for example, explore the dataset by visually investigating the data for specific patterns and then run automated data analysis tasks over the data. This process may run in an iterative manner. While visual analytics is one use case, some systems can also programmatically refine the data without user interaction.
In many use cases, the data sources to be analyzed produce an amount of data at a frequency that generally is so high that it oftentimes is referred to as being a data stream. Stream processing typically follows the pattern of continuous queries, which may be thought of in some instances as being queries that execute for a potentially indefinite amount of time on data that is generated or changes very rapidly. Such data are called streams, and streams oftentimes comprise events. Such streams often exist in real-world scenarios, e.g., as temperature readings from sensors placed in warehouses or on trucks for logistics purposes, weather data, entrance control systems (where events are generated whenever a person enters or leaves, for instance), etc. Events may include attributes (also sometimes referred to as a payload) such as, for example, the value of temperature readings and metadata (sometimes referred to as a header or header data) such as, for example, creation date, validity period, and quality of the event. Some events may have a data portion and temporal information (e.g., plane LH123 has landed at 4:34 PM, sensor TF17 has reported a temperature of 27° Celsius at 4 PM, IoT device DAEV17 has reported 67% CPU usage at 9:11 AM, etc.). Possible events occurring in an environment typically are schematically described by so-called event types, which in some respects are somewhat comparable to table definitions in relational databases. It thus will be appreciated that event streams typically have a high data volume and a short inter-arrival time for events, and an event stream typically will be a temporally ordered sequence of events having the same type.
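The header/payload structure of an event and the notion of a temporally ordered, single-type stream can be pictured with the following minimal Python sketch. The class names, field names, and the arbitrary calendar date are illustrative assumptions and are not taken from any particular CEP product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable


@dataclass(frozen=True)
class EventHeader:
    """Metadata about an event (hypothetical field names)."""
    event_type: str       # name of the event type, akin to a table name
    created_at: datetime  # creation date
    validity: timedelta   # validity period
    quality: float        # assumed scale: 0.0 (unreliable) to 1.0 (trusted)


@dataclass(frozen=True)
class Event:
    header: EventHeader
    payload: Dict[str, Any]  # the attributes, e.g. a temperature reading


def is_well_formed_stream(stream: Iterable[Event]) -> bool:
    """True if all events share one type and are ordered by creation time."""
    events = list(stream)
    same_type = len({e.header.event_type for e in events}) <= 1
    ordered = all(
        a.header.created_at <= b.header.created_at
        for a, b in zip(events, events[1:])
    )
    return same_type and ordered


def make_reading(hour: int, minute: int, celsius: float) -> Event:
    # The date 2024-01-01 is arbitrary and only serves the illustration.
    header = EventHeader(
        "TemperatureReading",
        datetime(2024, 1, 1, hour, minute),
        timedelta(minutes=5),
        1.0,
    )
    return Event(header, {"sensor": "TF17", "celsius": celsius})


stream = [make_reading(16, 0, 27.0), make_reading(16, 5, 27.4)]
```

In this sketch the event type plays the role that a table definition plays in a relational database: it fixes which attributes a payload is expected to carry.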
This “Big Data in motion” is contrastable with “Big Data at rest.” Traditional database and data warehouse technology is not always powerful enough and is not necessarily designed to deal with these amounts of data and/or data velocities. Thus, it may be necessary or desirable to extend the processing capabilities of companies so that their applications are able to support the real-time processing of event streams.
For example, given the continual arrival of new data, one cannot store all data and then run a post hoc analysis, as sometimes is possible (but not always guaranteed) with Big Data at rest. In contrast to such a post hoc analysis, streams may need to be, and/or may benefit from being, analyzed in a real-time manner. A corresponding analytics approach may deal with the volatile, streaming nature of incoming data streams by offering a continuous exploration of the data. For the specific case of visual analytics mentioned above, while data streams in, the user can visually explore that data and run automated analysis tasks to uncover hidden knowledge. It thus is desirable to make the visualization real-time capable and allow browsing through continuously updated data. It also will be desirable to make the analysis tasks real-time capable in the sense that the results are continuously updated with new data streaming in. These demanding processing requirements basically prohibit the use of standard visual analytics techniques, even assuming that data can be accessed multiple times.
Complex Event Processing (CEP) is an approach to handling some challenges associated with processing and analyzing huge amounts of data arriving with high frequencies. As will be appreciated from the above, in this context, the arriving data is classified as a data stream. By processing the incoming events in main memory using sophisticated online algorithms, CEP systems can cope with very high data volumes (e.g., in the range of hundreds of thousands of events per second) being processed and analyzed appropriately. CEP systems are designed to receive multiple streams of data and/or events and analyze them in an incremental manner with very low (e.g., near-zero) latency. Events may be evaluated and aggregated to form derived (or complex) events (e.g., by an engine or so-called event processing agents). Event processing agents can be cascaded such that, for example, the output of one event processing agent can be the input of another event processing agent. In other words, while the data is streaming in, it may be analyzed on-the-fly, and corresponding analytical results may be forwarded to subsequent consumers. Therefore, a CEP system need not necessarily persist the data it is processing. This is advantageous because, as explained above, a data and/or event stream oftentimes is characterized by high volumes and high rates and cannot be persisted.
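The cascading of event processing agents described above can be sketched with Python generators, where each agent consumes one stream and lazily produces a derived stream without persisting the input. The agent names, event fields, and thresholds below are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator


def threshold_agent(events: Iterable[Dict], limit: float) -> Iterator[Dict]:
    """Emit a derived 'HighTemperature' event for each reading above `limit`."""
    for e in events:
        if e["celsius"] > limit:
            yield {
                "type": "HighTemperature",
                "sensor": e["sensor"],
                "celsius": e["celsius"],
            }


def burst_agent(events: Iterable[Dict], count: int) -> Iterator[Dict]:
    """Emit a complex 'Overheating' event each time a sensor has
    accumulated `count` input events; only a small counter is kept."""
    seen: Dict[str, int] = {}
    for e in events:
        seen[e["sensor"]] = seen.get(e["sensor"], 0) + 1
        if seen[e["sensor"]] == count:
            yield {"type": "Overheating", "sensor": e["sensor"]}
            seen[e["sensor"]] = 0  # start counting again for this sensor


def readings() -> Iterator[Dict]:
    # Made-up sample readings standing in for a live sensor stream.
    samples = [("TF17", 27.0), ("TF17", 31.0), ("TF18", 35.0), ("TF17", 32.5)]
    for sensor, celsius in samples:
        yield {"type": "TemperatureReading", "sensor": sensor, "celsius": celsius}


# The output of the first agent is the input of the second agent.
alerts = list(burst_agent(threshold_agent(readings(), limit=30.0), count=2))
```

Because generators pull one event at a time, no stage in this sketch needs to hold the full stream in memory, which mirrors the on-the-fly character of CEP.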
CEP in general thus may be thought of as a processing paradigm that describes the incremental, on-the-fly processing of event streams, typically in connection with continuous queries that are continuously evaluated over event streams. Moreover, CEP analysis techniques may include, for example, the ability to perform continuous queries, identify time-based relations between events by applying windowing (e.g., through XQuery or SQL), etc., with the aid of processing resources such as at least one processor and a memory. See, for example, U.S. Pat. Nos. 8,640,089 and 8,266,351, as well as U.S. Publication Nos. 2014/0078163, 2014/0025700, and 2013/0046725, the entire contents of each of which are hereby incorporated herein by reference.
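As a rough illustration of a continuous query with time-based windowing, the following Python sketch maintains an average over a sliding time window and produces an updated result for every arriving event. It is a stand-in written for this explanation, not the query syntax of any CEP engine; the class name and the numeric timestamps (in seconds) are assumptions.

```python
from collections import deque


class SlidingAverage:
    """Continuously maintained average over the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.items = deque()  # (timestamp, value) pairs inside the window
        self.total = 0.0

    def on_event(self, timestamp: float, value: float) -> float:
        """Add one event, evict expired events, and return the new average."""
        self.items.append((timestamp, value))
        self.total += value
        # Evict events that are at least `window` seconds older than this one.
        while self.items[0][0] <= timestamp - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total / len(self.items)


query = SlidingAverage(window_seconds=10)
results = [query.on_event(t, v) for t, v in [(0, 10.0), (5, 20.0), (12, 30.0)]]
```

The query never terminates on its own; it simply emits a fresh result whenever new data arrives, and the window bounds how much state must be kept.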
With CEP technology, relevant data can be extracted in time so that business applications operating on top of that technology can present analysis results with minimum latency to the user. A CEP-supported application can be connected to several event sources that continuously produce events, and such events can be analyzed and condensed by CEP analysis logic. The analysis results can be rendered for the business user (i.e., a user from a business unit, as opposed to a user from the entity's IT department, who is able to leverage dedicated business user applications that present business-relevant metrics) in a report, graphical user interface, and/or other medium.
One common analysis task involves detecting dependencies in data streams. Typically, data streams are multidimensional, i.e., they have multiple attributes. The dependencies between those attributes in this sense describe the relationship(s) the attributes have to one another. These relationships can constitute an important characteristic of the stream. For example, the probability of losing a cell phone customer because of connectivity problems has been found to have a strong negative relationship with the mobile Internet contingent offered for free; that is, the more free megabytes that are offered, the less likely it is that the customer will terminate the agreement.
Unfortunately, however, it is difficult to explore attribute dependencies in streaming data in an interactive manner, e.g., because of the amount of data involved, its continuous arrival, and its potentially dynamically variable type.
Certain example embodiments address the above and/or other concerns. For instance, certain example embodiments relate to techniques that enable the dependency structures of multidimensional data streams to be explored, e.g., even for Big Data in motion. In this regard, certain example embodiments allow a user to explore data streams and run analytic tasks over them, e.g., in order to reveal dependency structures within the data. Certain example embodiments employ a multi-functional Stream Dependency Explorer (SDE) for these and/or other purposes, e.g., as described in more detail below.
One aspect of certain example embodiments relates to statistical modeling of attribute dependencies. Given a multidimensional data stream, certain example embodiments use regression models to determine dependency structures hidden in the data. For the different pairs of attributes, certain example embodiments may compute the regression functions to build a statistical model of those dependencies. Certain example embodiments also may be used to analyze dependencies between attributes from different streams. In this regard, certain example embodiments may derive those dependency models for subsets of the stream and compare those models with the ones being computed for the complete stream. By doing so, certain example embodiments can detect changing dependency structures in the subset. To cope with the volatile nature of data streams, the regression functions may be computed with respect to different timeframes, e.g., in order to support the investigation of short-term and long-term developments of the data.
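One way such a regression function could be maintained incrementally over a stream is sketched below for the simple linear case, together with a comparison between a model for the complete stream and a model for a subset. This is a minimal sketch under stated assumptions: the class name is hypothetical, the update rule is the standard running mean and co-moment update, and the sample data is made up.

```python
class OnlineLinearRegression:
    """Simple linear regression y = intercept + slope * x, updated per event."""

    def __init__(self) -> None:
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # running sum of squared deviations of x
        self.sxy = 0.0  # running co-moment of x and y

    def update(self, x: float, y: float) -> None:
        self.n += 1
        dx = x - self.mean_x           # deviation from the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.sxx += dx * (x - self.mean_x)
        self.sxy += dx * (y - self.mean_y)

    def coefficients(self):
        """Return (slope, intercept), or None if not yet determined."""
        if self.n < 2 or self.sxx == 0.0:
            return None
        slope = self.sxy / self.sxx
        return slope, self.mean_y - slope * self.mean_x


# Made-up stream whose dependency structure changes at x = 5.
data = [(x, 2 * x + 1) for x in range(5)] + [(x, 20 - x) for x in range(5, 10)]

full_model = OnlineLinearRegression()    # model for the complete stream
subset_model = OnlineLinearRegression()  # model for the subset x >= 5
for x, y in data:
    full_model.update(x, y)
    if x >= 5:
        subset_model.update(x, y)

full_slope, _ = full_model.coefficients()
subset_slope, subset_intercept = subset_model.coefficients()
```

Here the subset model has a negative slope while the full-stream model has a positive one, which is the kind of changed dependency structure that comparing the two models is meant to surface. Models for different timeframes could be kept the same way, with one instance per timeframe.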
Another aspect of certain example embodiments relates to visualization of attribute dependencies. For the specific case of a visual analysis, certain example embodiments enable the user to select one of the attributes to enable the selected attribute to be continuously viewed, e.g., in a line chart or the like. Regression models may be computed for the selected attribute and the other attributes. In relation to (e.g., below) the plot of the selected attribute, the corresponding regression function may be displayed for each of the remaining attributes, which helps make visible the dependency between that attribute and the selected one in an intuitive manner. Plotting the regression functions for the remaining attributes may in some instances allow the user to directly grasp important dependencies between the selected and remaining attributes. The user in certain example embodiments also can select in the chart a subset of the stream, e.g., causing the regression function for the subset, as well as the regression function for the full set, to be automatically displayed. This approach may advantageously allow the user to easily detect changing dependency structures in the subset.
Another aspect of certain example embodiments relates to providing recommendations for exploration steps. Certain example embodiments may automatically highlight significant dependency structures so that the user is directed to the most important facts (or at least what are likely to be the most important facts). Additionally, or in the alternative, certain example embodiments may be configured to recommend to the user automatically detected, interesting subsets of the data, e.g., to provide reasonable starting points for further analysis. Besides that interactive approach, certain example embodiments also may be able to automatically compute regression models for successively refined subsets of the data, e.g., in order to help derive or otherwise reveal significant changes in dependency structures.
Another aspect of certain example embodiments relates to efficient online analysis of dependencies in streaming data. With data continuously streaming in, certain example embodiments provide the analytical results in an online manner. For that reason, the analytical tasks are implemented following the complex event processing (CEP) paradigm, which is designed for an online evaluation of massive data streams.
Another aspect of certain example embodiments relates to support for post hoc analysis of dependencies in streaming data. Besides the analysis of the current status of the stream, certain example embodiments may support a post hoc analysis over previous data. A history of the analytical results may be maintained so that, for example, current dependency structures can be compared with historical dependency structures.
In certain example embodiments, an information processing system is provided. An interface is configured to receive at least one selected stream of events from an event bus that receives a plurality of streams of events from a plurality of different devices, with each event in each stream in the plurality of streams having at least one attribute. A model store is backed by a non-transitory computer readable storage medium. Processing resources include at least one processor and a memory coupled thereto, and they are configured to at least: receive first user input identifying the at least one selected stream; receive second user input identifying at least two attributes of interest, with the attributes of interest being selected from the attributes of the events in the at least one selected stream; execute a continuous query on the at least one selected stream to attempt to, directly or indirectly, mathematically compute a statistical model including the at least two identified attributes of interest, the statistical model having a type; store a representation of the mathematically computed statistical model to the model store; generate for output to the user a user-interactive display including a representation of the mathematically computed statistical model, the user-interactive display being dynamically changeable based upon at least (a) an update to the mathematically computed statistical model that results from the execution of the continuous query, and (b) receipt of third user input that identifies a sub-stream of the at least one selected stream; determine whether the mathematically computed statistical model sufficiently fits for the at least two identified attributes of interest; and in response to a determination that the mathematically computed statistical model does not fit for the at least two identified attributes of interest, change the type of the statistical model to a new type so that the new type of statistical model better fits for the at least two identified attributes of interest.
In certain example embodiments, an information processing method is provided. The method comprises: using processing resources including at least one processor and a memory, receiving, over an electronic interface to an event bus that receives a plurality of streams of events from a plurality of different devices, at least one selected stream of events, each event in each stream in the plurality of streams having at least one attribute; receiving first user input identifying the at least one selected stream; receiving second user input identifying at least two attributes of interest, the attributes of interest being selected from the attributes of the events in the at least one selected stream; executing a continuous query on the at least one selected stream to attempt to, directly or indirectly, mathematically compute a statistical model including the at least two identified attributes of interest, the statistical model having a type; storing a representation of the mathematically computed statistical model to a model store that is backed by a non-transitory computer readable storage medium; generating, for output to the user, a user-interactive display including a representation of the mathematically computed statistical model, the user-interactive display being dynamically changeable based upon at least (a) an update to the mathematically computed statistical model that results from the execution of the continuous query, and (b) receipt of third user input that identifies a sub-stream of the at least one selected stream; determining whether the mathematically computed statistical model sufficiently fits for the at least two identified attributes of interest; and in response to a determination that the mathematically computed statistical model does not fit for the at least two identified attributes of interest, changing the type of the statistical model to a new type so that the new type of statistical model better fits for the at least two identified attributes of interest.
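The step of determining whether a model sufficiently fits, and changing its type when it does not, could look roughly like the following Python sketch. It assumes, purely for illustration, that goodness of fit is measured by the coefficient of determination (R²) against a fixed threshold and that the candidate types are a linear model followed by an exponential model; the source does not prescribe either choice, and all function names are hypothetical. The sketch also assumes non-constant inputs and, for the exponential type, strictly positive y values.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predicted):
    """Coefficient of determination of the predictions against ys."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return 1.0 - ss_res / ss_tot


def fit_linear(xs, ys):
    slope, intercept = fit_line(xs, ys)
    return lambda x: intercept + slope * x


def fit_exponential(xs, ys):
    # y = a * exp(b * x), fitted on log(y); requires every y > 0.
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return lambda x: math.exp(intercept + slope * x)


MODEL_TYPES = [("linear", fit_linear), ("exponential", fit_exponential)]


def select_model(xs, ys, min_r2=0.95):
    """Try the model types in order and stop at the first whose R^2 reaches
    `min_r2`; otherwise return the best-scoring type that was tried."""
    best = None
    for name, fit in MODEL_TYPES:
        predict = fit(xs, ys)
        score = r_squared(ys, [predict(x) for x in xs])
        if best is None or score > best[1]:
            best = (name, score, predict)
        if score >= min_r2:
            break
    return best


xs = list(range(10))
growth_name, growth_score, _ = select_model(xs, [2.0 * math.exp(0.5 * x) for x in xs])
line_name, line_score, _ = select_model(xs, [3.0 * x + 1.0 for x in xs])
```

For the exponentially growing sample, the linear type scores below the threshold, so the sketch switches to the exponential type; for the straight-line sample, the initial linear type is kept.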
According to certain example embodiments, the at least two identified attributes of interest may correspond to one or more independent variables and one or more dependent variables.
According to certain example embodiments, the statistical model at least initially may be a linear regression model, and/or the new type of statistical model may be a non-linear regression model.
According to certain example embodiments, the representation of the mathematically computed statistical model may be stored to the model store at a predetermined interval and/or upon an occurrence of a predetermined event that has been automatically detected by the processing resources.
According to certain example embodiments, a historical model may be retrieved from the model store in response to a user selection, the retrieved historical model having an associated original temporal resolution; and the retrieved historical model may be applied with respect to the at least two identified attributes of interest at a temporal resolution that differs from the original temporal resolution associated with the retrieved historical model and/or against data currently streaming in from the at least one selected stream.
According to certain example embodiments, the first user input may identify a plurality of selected streams. The second user input may identify at least first and second attributes of interest, with the first attribute of interest being selected from the attributes of the events in one of the plurality of selected streams and with the second attribute of interest being selected from the attributes of the events in another one of the plurality of selected streams.
According to certain example embodiments, the user-interactive display may further include a dynamically updatable visual representation of a relationship between one or more of the at least two identified attributes of interest and at least one other, different attribute of the events in the at least one selected stream.
According to certain example embodiments, whether a highly correlated relationship exists between one or more of the at least two identified attributes of interest and each of the other, different attributes of the events in the at least one selected stream may be automatically detected; and the user-interactive display may be caused to highlight each automatically detected highly correlated relationship.
According to certain example embodiments, sub-streams of the at least one selected stream of potential interest may be automatically detected; and the user may be able to view such automatically detected sub-streams. The continuous query, or another continuous query, may be executed on a selected sub-stream to attempt to, directly or indirectly, mathematically compute another statistical model including the at least two identified attributes of interest within the selected sub-stream.
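Automatic detection of highly correlated relationships might be approximated as in the following Python sketch, which computes the Pearson correlation coefficient between a selected attribute and every other attribute and flags those whose absolute value meets a threshold. The choice of Pearson correlation, the 0.8 threshold, the function names, and the sample values are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def highlight(events, selected, threshold=0.8):
    """Return {attribute: r} for every other attribute whose correlation
    with `selected` satisfies |r| >= threshold."""
    target = [e[selected] for e in events]
    flagged = {}
    for name in events[0]:
        if name == selected:
            continue
        r = pearson(target, [e[name] for e in events])
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Made-up events loosely following the cell phone churn example.
events = [
    {"churn_risk": 0.90, "free_mb": 100, "handset_age": 3},
    {"churn_risk": 0.70, "free_mb": 200, "handset_age": 1},
    {"churn_risk": 0.60, "free_mb": 300, "handset_age": 4},
    {"churn_risk": 0.35, "free_mb": 400, "handset_age": 1},
    {"churn_risk": 0.10, "free_mb": 500, "handset_age": 5},
]
flagged = highlight(events, "churn_risk")
```

With these sample values only the free data allowance is flagged, with a strongly negative coefficient, while the weakly related attribute is left unhighlighted.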
According to certain example embodiments, user input indicative of a replay request and a past time period associated with the replay request may be received; one or more historical models may be retrieved from the model store in response to the receipt of the user input indicative of the replay request, with the one or more retrieved historical models having been mathematically computed in a time period that accords with the past time period associated with the replay request; and a replay visually demonstrating how the one or more retrieved historical models developed over the past time period associated with the replay request may be generated for output.
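The replay of historical models can be pictured with the following Python sketch of a model store that keeps timestamped snapshots and returns those falling within a requested past time period. The in-memory store, the method names, and the dictionary representation of a model are hypothetical simplifications; an actual model store would be backed by persistent storage.

```python
import bisect


class ModelStore:
    """In-memory stand-in for a persistent model store (hypothetical)."""

    def __init__(self) -> None:
        self._times = []   # snapshot timestamps, kept in ascending order
        self._models = []  # model representations, parallel to _times

    def save(self, timestamp: float, model: dict) -> None:
        # Snapshots are assumed to be saved in timestamp order.
        self._times.append(timestamp)
        self._models.append(model)

    def replay(self, start: float, end: float):
        """Yield (timestamp, model) for snapshots with start <= t <= end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        for i in range(lo, hi):
            yield self._times[i], self._models[i]


store = ModelStore()
# Made-up snapshots taken every 60 seconds.
for t, slope in [(0, 2.0), (60, 1.5), (120, 0.4), (180, -1.0)]:
    store.save(t, {"type": "linear", "slope": slope, "intercept": 1.0})

frames = list(store.replay(60, 120))
```

Each returned snapshot could then be rendered in sequence to show how the model developed over the requested period.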
Corresponding non-transitory computer readable storage mediums tangibly storing instructions for performing such methods also are provided by certain example embodiments, as are corresponding computer programs.
These features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.