Business Intelligence (“BI”) generally refers to a category of software systems and applications used to improve business enterprise decision-making and governance. These software tools provide techniques for analyzing and leveraging enterprise applications and data. They are commonly applied to financial, human resource, marketing, sales, service provision, customer, and supplier analyses. More specifically, Business Intelligence tools can include: reporting and analysis tools to analyze, forecast, and present information; content delivery infrastructure systems to deliver, store, and manage reports and analytics; data warehousing systems to cleanse and consolidate information from disparate sources; integration tools to analyze and generate workflows based on enterprise systems; database management systems to organize, store, retrieve, and manage data in databases, such as relational, Online Transaction Processing (“OLTP”), and Online Analytic Processing (“OLAP”) databases; and performance management applications to provide business metrics, dashboards, and scorecards, as well as best-practice analysis techniques for gaining business insights.
Traditional BI tools have supported long-term decision planning by transforming transactional data into summaries about the organization's operations over a period of time. While this information is valuable to decision makers, it remains an after-the-fact analysis with latencies from data arrival to report production. The information needs of operational decision-making cannot be addressed entirely by traditional BI technologies. Effective operational decision-making requires little delay between the occurrence of a business event and its detection or reporting. Just-in-time, finer grained information is necessary to enable decision makers to detect opportunities or problems as they occur. BI technologies are not designed to provide just-in-time analysis.
Business Activity Monitoring (“BAM”) is the set of technologies that fills this gap. BAM technologies provide right-time or just-in-time reporting, analysis, and alerting of significant business events, accomplished by gathering data from multiple applications. Right-time differs from real-time analysis. In right-time analysis, the main goal is to signal opportunities or problems within a time frame in which decision making has a significant value. Real-time analysis requires that opportunities or problems be signaled in a pre-specified, very short time frame, even if the alert has the same decision-making value a day after the occurrence of the events that triggered it. Real-time operation, although preferred, is not essential. The goal is to analyze and signal opportunities or problems as early as possible to allow decision making to occur while the data is fresh and of significance. BAM therefore encourages proactive decision making.
Business events, transactional data or messages are modeled in BAM as “data streams”. A data stream is a sequence of time-stamped data items or tuples that have a fixed schema or structure and arrive in a given time order. A data stream S can be expressed as a sequence of pairs (s,τ), where s is a tuple belonging to the fixed schema of S and τ is a timestamp associated with the tuple. Timestamps could be explicit, i.e., assigned by data sources, requiring all data sources and query processing systems to be time synchronized, or they could be implicit, i.e., assigned on entry and representing tuple arrival time rather than tuple production time.
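The distinction between explicit and implicit timestamps can be illustrated with a minimal Python sketch. The names StreamItem and stamp, and the convention of passing None when a source supplies no timestamp, are assumptions made for illustration only and are not part of any described system.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class StreamItem:
    """One element (s, tau) of a data stream S."""
    s: Tuple[Any, ...]  # tuple conforming to the fixed schema of S
    tau: float          # timestamp associated with the tuple


def stamp(
    tuples: Iterable[Tuple[Tuple[Any, ...], Optional[float]]],
    clock: Callable[[], float] = time.time,
) -> Iterator[StreamItem]:
    """Turn (s, source_timestamp_or_None) inputs into (s, tau) pairs.

    An explicit timestamp assigned by the data source is kept as-is.
    When none is supplied, an implicit timestamp is assigned on entry,
    so tau then reflects arrival time rather than production time.
    """
    for s, source_tau in tuples:
        tau = source_tau if source_tau is not None else clock()
        yield StreamItem(s, tau)
```

Passing the clock as a parameter is a design choice of this sketch: it keeps the implicit-timestamp path testable with a fixed clock.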
The data schema defines fields and a data type for each field. The tuples within a data stream consist of values for these fields. For example, a data stream schema representing sales data may include fields such as productID, product_status, price, quantity, store_sales, storeID, city, store_type, customerID, and employeeID, among others. A data stream schema representing an employee could include fields such as employeeID, first_name, and last_name. For example, an employee data stream with the schema Se=(employeeID, first_name, last_name) may have a tuple se=(1345, “Willy”, “Loman”), and a sales data stream with the schema Ss=(employeeID, store_ID, total_sales) may have a tuple ss=(1345, 123, $10).
The nature of queries and data analysis necessary for processing these types of time-stamped data streams is usually domain specific. For example, if a BAM system is used for monitoring stocks, a significant number of user queries may focus on detecting threshold conditions. Such queries may ask whether the price of a particular stock has risen above or fallen below a given threshold. If a BAM system is used to provide just-in-time analysis of sales data, a significant number of the queries may focus on multi-dimensional analysis or on the aggregation of the sales data across a variety of fields, such as customer profile, region, product type, and so on.
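A threshold query of the kind described for stock monitoring might be sketched as follows. The function name threshold_crossings and the representation of the stream as (price, timestamp) pairs are hypothetical choices for this illustration.

```python
from typing import Iterable, Iterator, Tuple


def threshold_crossings(
    stream: Iterable[Tuple[float, float]], threshold: float
) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp, direction) whenever the price crosses the threshold.

    The stream is assumed to hold (price, timestamp) pairs in time order.
    """
    prev = None
    for price, tau in stream:
        if prev is not None:
            if prev <= threshold < price:
                yield (tau, "above")   # price rose above the threshold
            elif prev >= threshold > price:
                yield (tau, "below")   # price fell below the threshold
        prev = price
```

The generator handles one tuple at a time and keeps only the previous price, so it does not need to buffer the stream.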
Such multi-dimensional analysis may be performed with a specialized multi-dimensional data architecture, generally referred to as the “stream cube”. A stream cube consists of a number of cuboids, with each cuboid representing multi-dimensional data with unique values for measures of a set of dimensions and different abstraction levels. Dimensions are a type of data model object that represent a side of a multi-dimensional data structure. Examples of dimensions include region, store, year, customer, employee, and product line, among others. Dimensions are defined by hierarchies of abstraction levels. The region dimension, for example, may have the following abstraction levels: city, country, continent, all.
Measures are quantities as ascertained by comparison with a standard, usually denoted in units such as units sold, dollars, etc. Measures are typically used to evaluate a quantifiable component of an organization's performance. For example, measures may include return on investment, revenue, sales volume, unit sales, store sales, inventory levels, cycle times, supply chain costs, number of customers, and the like. These measures summarize the data at the varying levels of abstraction. For example, the measure sales may be aggregated over a particular store, or over all stores in a state, country, etc.
A complete d-dimensional stream cube contains a^d cuboids, where a is the number of abstraction levels for each dimension. For example, a 3-D stream cube may have three dimensions and an aggregate measure. If each dimension has only two levels of abstraction, then the cube has 2^3, or eight, possible cuboids. An example of a 3-D stream cube is illustrated in FIG. 1. Stream cube 100 has eight cuboids 105-140 representing different levels of abstraction for the dimensions A, B, and C with the aggregate measure M. Measure M could be any aggregate measure such as, for example, sum or count.
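The cuboid count can be checked by enumeration. The sketch below treats a cuboid as one choice of abstraction level per dimension. The level labels, including "*" for the fully abstracted level, are assumptions made for illustration. When dimensions have differing numbers of levels, the count generalizes to the product of the per-dimension level counts.

```python
from itertools import product
from typing import List, Sequence, Tuple


def enumerate_cuboids(levels_per_dimension: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Return every cuboid as a tuple holding one abstraction level per dimension."""
    return list(product(*levels_per_dimension))


# Three dimensions A, B, C with two abstraction levels each:
# the dimension's own level, and "*" for the fully abstracted level.
levels = [("A", "*"), ("B", "*"), ("C", "*")]
cuboids = enumerate_cuboids(levels)

base_cuboid = ("A", "B", "C")   # least abstract combination
apex_cuboid = ("*", "*", "*")   # most abstract combination
```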
Cuboid 105 is generally referred to as the “base cuboid”, as it represents the least abstract data representation or generalization. Base cuboid 105 consists of every possible combination of data values for the lowest abstraction level of each dimension with the aggregate measure M calculated for each combination. Conversely, cuboid 140 is generally referred to as the “apex cuboid,” as it represents the most abstract data representation or generalization. Apex cuboid 140 consists of one aggregate measure calculated over all the data. The other cuboids 110-135 in between base cuboid 105 and apex cuboid 140 contain measures calculated over different combinations of abstraction levels for each dimension. For example, cuboid 125 contains the measure M over the different values of dimension A, with dimensions B and C abstracted to their more general form.
Physically, each cuboid in a stream cube consists of a table that stores the respective combinations of dimensions and measures. The stream cube links up all the cuboids in a hierarchical structure. For example, suppose in stream cube 100 dimension A is a geographical dimension (e.g., country, state, city, etc.), dimension B is a product dimension (e.g., product category, product sub-category, etc.), dimension C is a store dimension (e.g., store type, etc.), and measure M is a sales measure. Base cuboid 105 consists of a table showing the sales value for all possible combinations of the geographical dimension A, product dimension B, and store dimension C. Apex cuboid 140 consists of a single value representing the total sales across the geographical, product, and store dimensions. Cuboid 125 shows the sales value for each value of the geographical dimension A, aggregated over all values of the product dimension B and the store dimension C.
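The three tables in this example can be approximated with a short sketch. The sample rows and the helper name cuboid_table are hypothetical, and the sketch stores only the combinations that actually occur in the data rather than every possible combination.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def cuboid_table(rows: Iterable[Sequence], keep: Tuple[int, ...]) -> Dict[tuple, float]:
    """Sum the sales measure grouped by the dimensions whose indexes are in `keep`.

    Each row is (geography, product, store, sales). Dimensions not listed in
    `keep` are abstracted away, that is, aggregated over all of their values.
    """
    table: Dict[tuple, float] = defaultdict(float)
    for *dims, sales in rows:
        key = tuple(dims[i] for i in keep)
        table[key] += sales
    return dict(table)


# Illustrative rows: (geography A, product B, store C, sales M)
rows = [
    ("Paris", "shoes", "outlet", 10.0),
    ("Paris", "hats", "mall", 5.0),
    ("Lyon", "shoes", "mall", 7.0),
]

base = cuboid_table(rows, keep=(0, 1, 2))  # one entry per (A, B, C) combination
by_geo = cuboid_table(rows, keep=(0,))     # sales per value of A only
apex = cuboid_table(rows, keep=())         # a single total over all data
```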
Depending on the size of the stream cube, maintaining or materializing all cuboids within the cube may be neither cost-effective nor practical. Moreover, data streams may contain data so detailed that analyzing it at the stream level does not facilitate the discovery of useful trends or patterns. Aggregating the data to a higher abstraction level is often necessary.
The stream cube may be fully materialized, with aggregate measures calculated for each cuboid, or partially materialized, with aggregate measures calculated for only a subset of cuboids. In the latter case, to find the measure of an unmaterialized cuboid, the measures of a materialized cuboid at a lower abstraction level are aggregated up to the higher abstraction level of the unmaterialized cuboid.
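Deriving a cuboid that was not materialized from a materialized cuboid at a lower abstraction level amounts to a roll-up. The following sketch assumes a distributive measure such as sum or count, which can be combined from partial aggregates. The city-to-country mapping and the sample values are illustrative only.

```python
from typing import Dict, Mapping, Tuple


def roll_up(
    cuboid: Mapping[Tuple[str, ...], float],
    mapping: Mapping[str, str],
    dim_index: int,
) -> Dict[Tuple[str, ...], float]:
    """Aggregate a materialized cuboid to a higher abstraction level.

    `mapping` sends each lower-level value of the dimension at `dim_index`
    to its parent value (for example, city -> country). Measures that land
    on the same higher-level key are summed.
    """
    result: Dict[Tuple[str, ...], float] = {}
    for key, measure in cuboid.items():
        parent = mapping[key[dim_index]]
        new_key = key[:dim_index] + (parent,) + key[dim_index + 1:]
        result[new_key] = result.get(new_key, 0) + measure
    return result


city_to_country = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}
by_city_category = {
    ("Paris", "shoes"): 10,
    ("Lyon", "shoes"): 7,
    ("Berlin", "shoes"): 4,
    ("Paris", "hats"): 5,
}
by_country_category = roll_up(by_city_category, city_to_country, dim_index=0)
```

A measure such as an average would need extra state, for example a sum and a count kept side by side, before it could be rolled up this way.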
A stream cube is said to be a data cube that is relatively stable in size. A stable stream cube may be designed by using a windowing model and setting bounds on the lowest and highest abstraction levels. A windowing model defines a time window: data tuples that fall within the window are processed, and those that fall outside it are discarded or ignored. An example of such a windowing model is commonly referred to as the “tilted time frame”. The tilted time frame registers measures of the most recent data at a finer granularity than measures of data that arrived at a more distant time.
In doing so, the tilted time frame compresses the data by gradually fading out old data. The level of granularity at which recent and past data is registered is dependent on the application domain. By integrating the tilted time frame into the stream cube, the size of the cube could be stabilized so long as the other dimensions in the cube are relatively stable with time.
An example of a tilted time frame is illustrated in FIG. 2. In tilted time frame 200, measures of data received a week ago are stored at a granularity of one day, whereas measures of data received within the last fifteen minutes are stored at one-minute granularity. For example, sales made a week ago would be counted per day, whereas sales made in the last fifteen minutes would be counted per minute. Tilted time frame 200 is divided into legs or sections, with each leg representing a group of time ranges. Each leg or section of tilted time frame 200 contains sub-cubes. A sub-cube is a stream cube that aggregates data only for a given time range. For example, the 1-hour leg 205 of tilted time frame 200 consists of four fifteen-minute sub-cubes that maintain measures for the following time intervals: (t−30 min, t−15 min], (t−45 min, t−30 min], (t−60 min, t−45 min], and (t−75 min, t−60 min]. Here t represents the current time and “(x, y]” is interval notation meaning “greater than x, and up to and including y”.
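The interval arithmetic for a leg can be sketched as follows. The function names leg_intervals and contains, and the parameterization by offset, width, and count, are assumptions of this illustration. The offset of fifteen minutes reflects that the most recent fifteen minutes are held at finer granularity in the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # interpreted as (start, end]


def leg_intervals(t: datetime, offset: timedelta, width: timedelta, count: int) -> List[Interval]:
    """Return `count` consecutive (start, end] intervals, most recent first.

    The most recent interval ends at t - offset, and each interval is
    `width` long.
    """
    intervals = []
    for k in range(count):
        end = t - offset - k * width
        intervals.append((end - width, end))
    return intervals


def contains(interval: Interval, ts: datetime) -> bool:
    """Half-open membership test: start is excluded, end is included."""
    start, end = interval
    return start < ts <= end


t = datetime(2024, 1, 1, 12, 0)  # illustrative "current time"
quarter = timedelta(minutes=15)
hour_leg = leg_intervals(t, offset=quarter, width=quarter, count=4)
```

Because the intervals are half-open, a tuple stamped exactly on a boundary falls into exactly one sub-cube.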
To keep the stream cube stable, each sub-cube is partially materialized along a subset of abstraction levels. Previous work has suggested materializing a stream cube along at least two abstraction levels, generally referred to as the “minimally-interesting layer” (“m-layer”) and the “observation layer” (“o-layer”). The m-layer represents the minimally interesting layer at which examining the data is productive. It is necessary to have such a materialized layer since it is often neither cost-effective nor practically interesting to examine the minute detail of stream data. Any cuboid below the m-layer is not materialized or computed. The o-layer represents the cuboid that is observed by most users, that is, the layer that a user takes as an observation deck, watching the changes of the current stream data by examining the slopes of changes at this layer to make decisions.
An example of a stream cube with the materialized m- and o-layers is illustrated in FIG. 3. Stream cube 300 is a 3-D data cube with dimensions region, product, and store type. The region dimension has three abstraction levels {city, country, all}, the product dimension has three abstraction levels {sub_category, category, all}, and the store type dimension has two abstraction levels {store_type, all}. Stream cube 300 has an m-layer 305 that groups measures by product sub-category and city as it aggregates the dimension store type to its highest abstraction level (i.e., all). M-layer 305 is computed by moving from cuboid 315 of (store_type, city, sub_category) to m-layer 305 by grouping or aggregating measures for different store_types.
Stream cube 300 also has an o-layer 310 that aggregates all dimensions. O-layer 310 corresponds to the apex cuboid of stream cube 300. The m-layer 305 and the o-layer 310 are always materialized and computed. All cuboids between those layers are reachable, that is, they can be computed on demand. All cuboids outside those layers, such as cuboid (store_type, city, category) 320, cannot be computed on demand.
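Whether a cuboid lies between the m-layer and the o-layer can be tested dimension by dimension. The sketch below assumes each cuboid is written as a (region, product, store type) triple of abstraction levels, an ordering chosen here for convenience, and uses the hierarchies of stream cube 300. The function names are hypothetical.

```python
from typing import Tuple

# Abstraction levels per dimension, ordered from lowest to highest.
REGION = ("city", "country", "all")
PRODUCT = ("sub_category", "category", "all")
STORE = ("store_type", "all")
HIERARCHIES = (REGION, PRODUCT, STORE)

Cuboid = Tuple[str, str, str]  # (region level, product level, store level)


def rank(cuboid: Cuboid) -> Tuple[int, ...]:
    """Map each level name to its position in that dimension's hierarchy."""
    return tuple(h.index(level) for h, level in zip(HIERARCHIES, cuboid))


def reachable(cuboid: Cuboid, m_layer: Cuboid, o_layer: Cuboid) -> bool:
    """True if, in every dimension, the cuboid is at or above the m-layer
    and at or below the o-layer."""
    c, m, o = rank(cuboid), rank(m_layer), rank(o_layer)
    return all(lo <= x <= hi for lo, x, hi in zip(m, c, o))


M_LAYER: Cuboid = ("city", "sub_category", "all")
O_LAYER: Cuboid = ("all", "all", "all")
```

Under this test, the cuboid (store_type, city, category) is not reachable, because its store type level lies below the level fixed by the m-layer.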
There are three materialization options for the remaining, intermediate cuboids: (1) on-demand materialization, in which case the intermediate cuboids can be computed on demand if desired; (2) full materialization, in which case all cuboids in the stream cube are updated upon arrival of data streams; or (3) partial materialization, in which case only a subset of the intermediate cuboids are computed along a “materialization path” between the m-layer and the o-layer. A materialization path is a sequence of cuboids C_1, ..., C_n that connects the m-layer C_1 to the o-layer C_n such that each cuboid C_i can be incrementally updated by aggregating measures in the previous cuboid C_{i-1}.
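The defining property of a materialization path can be checked mechanically. This sketch assumes the hierarchies of stream cube 300 and writes each cuboid as a (region, product, store type) triple. It treats "can be incrementally updated from the previous cuboid" as "no dimension becomes finer and at least one becomes coarser", which is an interpretation adopted for illustration.

```python
from typing import Sequence, Tuple

# Abstraction levels per dimension, ordered from lowest to highest.
HIERARCHIES = (
    ("city", "country", "all"),           # region
    ("sub_category", "category", "all"),  # product
    ("store_type", "all"),                # store type
)

Cuboid = Tuple[str, str, str]


def rank(cuboid: Cuboid) -> Tuple[int, ...]:
    """Map each level name to its position in that dimension's hierarchy."""
    return tuple(h.index(level) for h, level in zip(HIERARCHIES, cuboid))


def is_materialization_path(path: Sequence[Cuboid], m_layer: Cuboid, o_layer: Cuboid) -> bool:
    """Check that the path starts at the m-layer, ends at the o-layer, and
    that every step only generalizes the previous cuboid."""
    if not path or path[0] != m_layer or path[-1] != o_layer:
        return False
    for prev, cur in zip(path, path[1:]):
        p, c = rank(prev), rank(cur)
        generalizes = all(a <= b for a, b in zip(p, c)) and p != c
        if not generalizes:
            return False
    return True


M_LAYER: Cuboid = ("city", "sub_category", "all")
O_LAYER: Cuboid = ("all", "all", "all")

example_path = [
    M_LAYER,
    ("city", "category", "all"),
    ("country", "category", "all"),
    O_LAYER,
]
```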
Previous work has suggested that the last alternative, partial materialization, is best suited to the analysis of multi-dimensional data streams. In this case, the stream cube may be partially materialized along a materialization path that is static and computed by an expert. This materialization path, referred to as the “popular materialization path” or as the “popular drilling path”, contains the cuboids that users are most likely to request when drilling down from the o-layer to the m-layer.
The expert typically chooses this path based on his/her knowledge of the most likely requested data groupings in a particular application domain. For example, if users are more interested in examining sales by city and category compared to sales by country and sub_category, then the cuboid (city, category) is part of the popular materialization path. A popular materialization path is illustrated in stream cube 300 between m-layer 305 and o-layer 310. Intermediate cuboids 325-355 between m-layer 305 and o-layer 310 are the only cuboids that are materialized or computed in stream cube 300 between m-layer 305 and o-layer 310.
Although the popular path provides a way to partially materialize a stream cube so that the size of the cube is stabilized, its static nature prevents the stream cube from fully responding to changes in users' just-in-time requests, changes in users' access to the stream cube (e.g., not all users may be able to have access to the same portions of the stream cube), as well as changes in system conditions (e.g., memory, storage space, etc.) over the lifetime of the stream cube. For example, several factors may influence the users' requests over time, including competition-induced factors, seasonal factors, market or economic factors, and external or unexpected factors. Other factors, such as internal business decisions or new governmental policies or regulations, could also influence users' requests. These factors could be responsible for dimensional shifts in users' requests as well as shifts towards the observation and analysis of certain measures.
The popular path also prevents users from observing different cuboids according to their business needs. From a BAM perspective, users at different management levels deal with data at different abstraction levels. It is therefore unlikely that all users prefer to analyze the data only along the popular path between the m- and o-layers and drill down occasionally. Instead, users' requests are more typically scattered across the cube at different abstraction levels. For example, a regional sales manager might be interested in sales across particular stores, while a product manager might be interested in the sales of particular products. This difference in perspective makes the choice of a static materialization path detrimental to the efficiency of a stream cube within a BAM system. A static materialization path that is fixed for the lifetime of the stream cube does not accommodate the evolution of, and changes in, users' just-in-time requests.
Accordingly, it would be desirable to provide techniques for partially materializing a stream cube to account for changes in users' requests and changes in system conditions. In particular, it would be highly desirable to provide techniques to dynamically materialize the stream cube to ensure just-in-time responses.