The present invention relates to techniques for analyzing large data sets and, more specifically, to methods and apparatus for efficiently running “what if” scenarios with large, multi-dimensional data sets.
The term “data management software” encompasses a vast array of solutions for manipulation of business data that can be loosely organized into three categories: On-Line Transaction Processing (OLTP), data warehousing, and On-Line Analytical Processing (OLAP). Each of these categories has certain advantages and drawbacks, which were discussed in the above-referenced application.
In short, OLTP relates to a class of solutions that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transactions in a number of industries including, for example, banking, airlines, mail order, supermarkets, and manufacturing. It is an important goal of an OLTP system that the data stored in the system be readily accessible to ensure a high degree of responsiveness. It is also important to provide locking mechanisms to ensure, for example, that when an individual reserves a resource, e.g., an airline seat, that resource is no longer available to others in the system. Thus, in OLTP systems, storing data in more than one place is disfavored, and such systems instead rely heavily on join processing across the different tables to combine data. OLTP systems are very effective for real-time transaction processing, but are not particularly suited to reporting functions employing aggregate queries, e.g., show all of the people who are flying on a particular flight more than twice a month.
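By way of illustration only, the locking behavior described above may be sketched as follows. The class name `SeatInventory` and its method `reserve` are hypothetical and are not taken from any particular OLTP product; an in-process lock stands in for the row-level or record-level locking a real system would use.

```python
import threading


class SeatInventory:
    """Hypothetical in-memory stand-in for an OLTP reservation table."""

    def __init__(self, seats):
        self._available = set(seats)
        self._lock = threading.Lock()  # assumption: one coarse lock for all seats

    def reserve(self, seat):
        """Return True if this caller obtained the seat, False otherwise."""
        with self._lock:
            # The check and the removal happen under the same lock, so two
            # callers can never both observe the seat as available.
            if seat in self._available:
                self._available.remove(seat)
                return True
            return False


def race_for_seat(inventory, seat, attempts):
    """Launch concurrent reservation attempts and collect each outcome."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(inventory.reserve(seat)))
        for _ in range(attempts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

However many callers compete for the same seat, at most one reservation succeeds, which is the property the locking mechanism is intended to guarantee.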
Data warehousing employs different data schemas, which are better suited to support relatively sophisticated reporting functions. However, there may be a tremendous amount of duplication of data. In the OLTP context, this duplication is not acceptable, since a change to a single piece of data would need to be duplicated in many places in the data warehouse tables instead of just a single location. On the other hand, data warehousing is advantageous from a reporting perspective in that it allows the creation and maintenance of summary tables that aggregate information corresponding to queries in which a particular business might be particularly interested, for example, passenger loads for specific routes by fiscal quarter. While data warehousing systems are highly optimized to generate static reports, they do not efficiently support analysis of the data in which the questions are not known in advance. For example, a sales manager may look at a static report and see that nation-wide sales of a specific product during a particular month were lower than expected. However, because of the static nature of the report, the reason for the shortfall may not be apparent. In such a situation, the sales manager would like to be able to drill down into the data to determine, for example, whether there are any identifiable disparities (e.g., regional, temporal, etc.) that might serve as an explanation. These types of capabilities fall within the domain of OLAP.
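The difference between a static summary and a drill-down may be sketched as follows. The sales figures, field names, and the helper `summarize` are invented solely for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows; the numbers are made up for this example.
SALES = [
    {"product": "widget", "month": "2024-02", "region": "east", "units": 118},
    {"product": "widget", "month": "2024-02", "region": "west", "units": 110},
    {"product": "widget", "month": "2024-03", "region": "east", "units": 120},
    {"product": "widget", "month": "2024-03", "region": "west", "units": 15},
]


def summarize(rows, keys):
    """Total the 'units' field, grouped by the given key fields."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["units"]
    return dict(totals)


# Static report: nation-wide units per product and month.
national = summarize(SALES, ["product", "month"])

# Drill-down: the same measure with the region dimension exposed.
by_region = summarize(SALES, ["product", "month", "region"])
```

In this invented data the national total falls from 228 to 135 units between the two months, and only the regional breakdown shows that the shortfall is confined to the west region. A data warehouse would need a separate precomputed summary table for each such grouping that a user might request.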
OLAP systems organize data to allow the kind of ad hoc analysis that would allow the sales manager to zero in on the data that might explain the disparity identified in the static report. This is to be contrasted with OLTP solutions, which are highly optimized for retrieving detailed data and typically very poor at providing summaries. The OLAP approach is also to be contrasted with data warehousing solutions, which would be required to maintain an impracticable number of summary tables to duplicate such functionality. A significant issue with OLAP solutions relates to the fact that they are typically only optimized for batch processing (as opposed to transaction processing, which is characterized by near real-time updating). Due to the large amount of highly interdependent summary information in the data underlying an OLAP system, the updating of any piece of detailed data tends to be computationally expensive in that many different summaries on many different levels of the hierarchy will typically need to be invalidated and recalculated. Thus, instead of supporting the interactive updating of data, most OLAP systems typically employ batch recalculations. There are OLAP solutions that attempt to strike various compromises to at least give the appearance of interactive updating. For example, some solutions limit the data set, or the indices on it, so that it fits in main memory, and then interactively recalculate all data values upon retrieval. Other solutions employ scripting techniques to isolate and update subsets of data between batches. Unfortunately, these approaches only partially mitigate the inefficiencies associated with updating multi-dimensional data sets. As a result, while OLAP systems are effective at the ad hoc querying of data to assist in identifying and locating issues, they are relatively ineffective at the ad hoc update or “what-if” scenario analysis needed to understand the implications of making changes to address those identified issues.
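The cost of a single detailed update can be made concrete with a small sketch. Assuming, purely for illustration, three dimensions with the hypothetical hierarchies shown below, the function `affected_summaries` enumerates every summary cell that depends on one leaf cell and would therefore need to be invalidated when that leaf changes.

```python
from itertools import product

# Hypothetical hierarchies: for each dimension, a leaf member and its ancestors,
# ordered from the nearest level to the top level.
ANCESTORS = {
    "time": {"2024-01-15": ["2024-01", "2024-Q1", "2024"]},
    "geo": {"store_17": ["seattle", "west", "all_regions"]},
    "product": {"sku_42": ["widgets", "all_products"]},
}


def affected_summaries(leaf_coords, ancestors):
    """Return every summary cell that aggregates over the given leaf cell.

    A summary cell is any combination that takes, per dimension, either the
    leaf member itself or one of its ancestors, excluding the leaf cell itself.
    """
    chains = [
        [member] + ancestors[dim][member] for dim, member in leaf_coords.items()
    ]
    cells = set(product(*chains))
    cells.discard(tuple(leaf_coords.values()))
    return cells


leaf = {"time": "2024-01-15", "geo": "store_17", "product": "sku_42"}
stale = affected_summaries(leaf, ANCESTORS)
```

With hierarchy depths of four, four, and three levels (counting the leaf level), one leaf update touches 4 × 4 × 3 − 1 = 47 summary cells. The count grows multiplicatively with each added dimension or hierarchy level, which is why the conventional systems described above fall back on batch recalculation.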
The above-referenced patent application describes a number of techniques by which large, complex data sets may be more efficiently invalidated and recalculated to reflect changes. Change logging is employed in the maintenance of summary information for large data sets, in combination with dependency checking among data blocks for different levels of hierarchical data in such data sets. As a result, the time required to update or recalculate the underlying data is closer to being a function of the number of changes made rather than, as with most OLAP solutions, a function of the size of the data set or the number of dimensions. Furthermore, the described techniques also allow the running of multiple “what if” scenarios using the same underlying data set substantially simultaneously. Different users can run these multiple peer scenarios in parallel. Alternatively, a single user may have multiple levels of scenarios, that is, child scenarios based on the results of a parent scenario. None of these capabilities is currently practicable in the conventional OLAP domain. As the complexity of the users' “what if” scenarios increases, there is an increased need for a well-structured system that supports making changes and updates and that provides richer analytics than are possible in conventional OLAP systems.
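The general idea of change logging may be sketched as follows. This is a simplified illustration under stated assumptions and is not the implementation of the above-referenced application: it uses a single dimension, additive summaries only, and a hypothetical class `SummaryCube`, and it omits the dependency checking among data blocks and the scenario handling described above.

```python
from collections import defaultdict


class SummaryCube:
    """Hypothetical single-dimension cube with additive summaries."""

    def __init__(self, leaves, parent_of):
        self.leaves = dict(leaves)        # leaf member -> value
        self.parent_of = parent_of        # member -> parent member (root absent)
        self.summaries = defaultdict(float)
        self.change_log = []              # pending (leaf, delta) entries
        # Initial full calculation: cost depends on the size of the data set.
        for leaf, value in self.leaves.items():
            for ancestor in self._ancestors(leaf):
                self.summaries[ancestor] += value

    def _ancestors(self, member):
        member = self.parent_of.get(member)
        while member is not None:
            yield member
            member = self.parent_of.get(member)

    def set_leaf(self, leaf, new_value):
        """Record the change as a delta instead of recalculating summaries."""
        self.change_log.append((leaf, new_value - self.leaves[leaf]))
        self.leaves[leaf] = new_value

    def apply_changes(self):
        """Apply logged deltas to dependent summaries; return updates performed."""
        updates = 0
        for leaf, delta in self.change_log:
            for ancestor in self._ancestors(leaf):
                self.summaries[ancestor] += delta
                updates += 1
        self.change_log.clear()
        return updates


PARENT_OF = {"s1": "east", "s2": "east", "s3": "west", "s4": "west",
             "east": "total", "west": "total"}
cube = SummaryCube({"s1": 10, "s2": 20, "s3": 30, "s4": 40}, PARENT_OF)
cube.set_leaf("s1", 15)
updates = cube.apply_changes()
```

Here one logged change results in two summary updates (the east region and the total) regardless of how many other stores exist, so the work performed tracks the number of changes rather than the size of the data set.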