The present invention relates to techniques for analyzing large data sets and, more specifically, to methods and apparatus for efficiently running “what if” scenarios with large, multi-dimensional data sets.
The term “enterprise software” encompasses a vast array of solutions for manipulation of business data which can be loosely organized into three categories, On-Line Transaction Processing (OLTP), data warehousing, and On-Line Analytical Processing (OLAP). OLTP relates to a class of solutions which facilitate and manage transaction-oriented applications, typically for data entry and retrieval transactions in a number of industries including, for example, banking, airlines, mail order, supermarkets, and manufacturing.
The example of an airline reservation system will serve to illustrate various characteristics of the typical OLTP system. In such a system, there are perhaps thousands of people, i.e., airline passengers, all vying for access to a shared resource, i.e., the reservation system. It is therefore an important goal of any such system that the data stored in the system be readily accessible to ensure a high degree of responsiveness. It is also important to provide locking mechanisms to ensure, for example, that when an individual reserves a resource, e.g., an airline seat, that resource is no longer available to others in the system.
These types of functionalities are typically optimized by storing information in tables which are capable of interacting such as, for example, the tables found in a relational database. In the airline reservation example, one table might correspond to a particular reservation, another table to aircraft type, yet another to flight plans, and yet another to frequent fliers. To effect a particular transaction (e.g., reserving a seat on a particular flight), data are typically read from a number of tables and updated in some subset of those tables in a manner which is typically optimized through the use of indexing and other techniques. In OLTP systems, storing of data in more than one place is disfavored, emphasizing instead a heavy reliance on joint processing of the different tables to combine data.
The manner in which OLTP systems are optimized makes them very effective for real-time transaction processing. However, they are not particularly suited to reporting functions employing aggregate queries, e.g., show all of the people who are flying on a particular flight more than twice a month. On the other hand, the second class of solutions, data warehousing, employs different data schemas, e.g., “star” or “snowflake” schemas, which are better suited to support relatively sophisticated reporting functions.
In a typical data warehousing application, the star or snowflake schema organizes the data around a central “fact” table which is linked to related “dimension” tables. Returning to the airline reservation example, the central fact table might be a reservation which has columns indicating the type of aircraft, whether the passenger is a frequent flier, and the point of origin, the destination and the date of the flight. In addition, there would be dimension tables which include most if not all of the information in the central fact table, e.g., tables for type of aircraft, flight plans, frequent fliers, etc.
It is apparent that in data warehousing systems there may be a tremendous amount of duplication of data. In the OLTP context, this duplication is not acceptable in that, if the date of a particular flight were changed, that piece of data would need to be changed in every reservation table rather than at just a single location. However, this approach to the organization of data is advantageous from a reporting perspective in that it allows the creation and maintenance of summary tables which aggregate information which correspond to queries in which a particular business might be particularly interested, e.g., passenger loads for specific routes by fiscal quarter.
As effective as data warehousing systems are for predetermined queries, they are not particularly well suited for facilitating ad hoc analysis of large data sets. That is, because data warehousing systems are highly optimized to generate static reports, they do not efficiently support analysis of the data in which the questions are not known in advance. For example, a sales manager may look at a static report and see that nation-wide sales of a specific product during a particular month were lower than expected. However, because of the static nature of the report, the reason for the shortfall may not be apparent. In such a situation, the sales manager would like to be able to drill down into the data to determine, for example, whether there are any identifiable disparities (e.g., regional, temporal, etc.) which could serve as an explanation. These types of capabilities are the domain of OLAP.
OLAP systems organize data in multiple dimensions to allow the kind of ad hoc analysis which would allow the sales manager to zero in on the data which might explain the disparity identified in the static report. That is, using an OLAP solution, the sales manager could conduct a directed search which traverses various dimensions in the data before ultimately zeroing in on the detailed data of interest. This is to be contrasted with OLTP solutions which are highly optimized for retrieving detailed data and typically very poor at providing summaries. The OLAP approach is also to be contrasted with data warehousing solutions which would be required to maintain an impracticable number of summary tables to duplicate such functionality.
OLAP systems view data as residing at the intersection of dimensions. That is, the data underlying OLAP systems are organized and stored as a multi-dimensional database (or a modified relational database) which is an instantiation of the cross-product of all of the dimensions. This allows the user to traverse hierarchies of detail along dimensions of interest in an ad hoc manner to get at specific data.
On the other hand, a significant issue with OLAP solutions relates to the fact that they are typically only optimized for batch processing (as opposed to transaction processing which is characterized by near real-time updating). That is, the multi-dimensional “data cubes” which represent the summary data of interest grow geometrically as dimensions are added or expanded. This results in a database which includes vastly more summary data than detailed data. This, in turn, is a significant obstacle to supporting interactive updating of the data. That is, due to the large amount of highly interdependent summary information in the data underlying an OLAP system, the updating of any piece of detailed data tends to be computationally expensive in that many different summaries on many different levels of the hierarchy will typically need to be invalidated and recalculated.
Thus, instead of supporting the interactive updating of data, most OLAP systems typically employ overnight batch recalculations. There are OLAP solutions which attempt to strike various compromises to at least give the appearance of interactive updating. For example, some solutions repeatedly recalculate only the new input values in main memory (rather than the full database) giving the appearance of interactive updating. Other solutions employ scripting techniques to isolate and update subsets of data between batches. Unfortunately, these approaches only partially mitigate the inefficiencies associated with updating multi-dimensional data sets.
In view of the foregoing, it is desirable to provide techniques by which large, complex data sets may be more efficiently invalidated and recalculated to reflect changes.