1. Field of the Invention
The present invention relates generally to computer memory and more specifically to a method and apparatus for storing and retrieving multi-dimensional data, such as financial data, in computer memory such that the speed of accessing the memory is maximized and the amount of memory needed to store such data is minimized.
2. Description of the Relevant Art
Financial data is often viewed in the form of a spreadsheet containing rows and columns of figures, or data. It has become common to implement such spreadsheets on computers, so that changes to one item may be automatically reflected in any other items which use the altered item as a basis for a calculation. Before any such manipulation of data can occur, however, the data must be imported from storage or input by the user. Many companies and individuals now routinely enter their basic financial data into computers for such later retrieval and manipulation.
A spreadsheet may be thought of as a "two dimensional" array of data. For example, Company X might list income and expense accounts along the vertical axis and the months of the year along the horizontal axis, as shown in FIG. 1. Each block in the spreadsheet corresponds to a particular account and a particular month, and the amount of that account in that month, if any, is entered in that block. In this example, the list of accounts is one "dimension" and time is the other dimension. In this example, some accounts depend on other accounts; for example, "Margin" is "Sales" less "Cost of Goods Sold." One advantage of computerized spreadsheets is that once the user defines this relationship, if any of the basic data is changed, such as the entry for Sales or the entry for Cost of Goods Sold, the computer can recalculate the data which depends on the changed data, such as Margin. This saves the user the effort of changing all entries which depend on other entries.
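By way of illustration only, the dependency between Margin, Sales and Cost of Goods Sold described above may be sketched as follows. The figures and the function name `margin` are hypothetical and are not taken from FIG. 1; the sketch merely shows a dependent entry being recalculated after a basic entry is changed.

```python
# Hypothetical basic data for one month; the amounts are illustrative only.
basic = {"Sales": 1000.0, "Cost of Goods Sold": 600.0}

def margin(data):
    """Margin is defined as Sales less Cost of Goods Sold."""
    return data["Sales"] - data["Cost of Goods Sold"]

before = margin(basic)    # dependent entry computed from the basic data
basic["Sales"] = 1200.0   # the user changes one basic entry
after = margin(basic)     # the dependent entry is recalculated, not re-entered
```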
In this example, the number of potential "cells" or items of data is equal to the number of accounts times the number of time periods included on the spreadsheet. (Here there are 17 time periods, not 12, because the user wishes to summarize the accounts by quarter and year as well as month; there could be many more time periods if more than one year is to be included.) Each item of data may be considered to have two "attributes" or identifying characteristics, one indicating the account to which the indicated amounts are attributed and the second indicating the time period in which the indicated receipts or expenditures took place.
Another factor which becomes important in these applications is the ability to "consolidate" data. For example, in FIG. 1, the summaries by quarter and year mentioned above are consolidated data from the three months of each quarter or the entire year, respectively. As with the Margin example above, the data for the quarters or the year need not be independently entered, but may be calculated from the monthly data and the spreadsheet instructed to recalculate these figures after any changes to the basic monthly data.
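The consolidation of monthly data into quarters and a year may be sketched as follows. The monthly amounts and the function name `consolidate` are hypothetical; the sketch assumes simple summation, which suits an account such as Sales but may not suit every account.

```python
# Hypothetical monthly Sales figures for one year (illustrative values only).
monthly = [100.0, 110.0, 120.0, 130.0, 125.0, 135.0,
           140.0, 150.0, 145.0, 160.0, 170.0, 180.0]

def consolidate(months):
    """Return (quarterly totals, yearly total) derived from 12 monthly values."""
    quarters = [sum(months[i:i + 3]) for i in range(0, 12, 3)]
    return quarters, sum(quarters)

quarters, year = consolidate(monthly)

# 12 months + 4 quarters + 1 year gives the 17 time periods mentioned above.
periods = len(monthly) + len(quarters) + 1
```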
However, many corporations have data which has more than two dimensions. For example, Company X may have several product lines, and may wish to be able to view data showing the accounts by each product line over time, rather than, or as well as, by total accounts for the company, i.e. the total of all product lines. Thus, the product lines of Company X make up a third dimension. In turn, the value of each total account for a given time period represents the sum of that account for each of the product lines and thus is the result of consolidating the data from the different product lines.
Now the potential number of data cells is greater, and equal to the number of accounts times the number of months times the number of product lines. Each item of data now has three attributes, one indicating the account, another indicating the month, and the third indicating the product line represented by the data. This may still be somewhat manageable in terms of the storage needed.
Also, once the number of dimensions exceeds two, it is useful to be able to view the relationship between any two dimensions. That is, in this example, the user may wish to view accounts over time for any or all product lines, accounts by product line for any or all time periods, or product lines over time for any or all accounts. This data can be exhibited by a series of spreadsheets, each showing one such relationship. Thus, the spreadsheet shown in FIG. 1 shows accounts over time; however, it only shows the total accounts. While each account could be broken down by product line, as shown in FIG. 2, this greatly increases the size of the spreadsheet and makes it more difficult to find all of the entries related to, for example, the Camera product line, since one dimension, either accounts or product line, ends up being scattered across the other dimension.
Similarly, FIG. 3a shows accounts by product line. However, this is for only one time period, here January. If the user wishes to break the accounts down by time as well, again the spreadsheet becomes much larger and the entries for one dimension or the other are no longer contiguous in the spreadsheet. Again in FIG. 3b, which shows the product lines over time, only one account is shown, here Sales. To include other accounts again increases the size and complexity of the spreadsheet.
If Company X also has geographic areas, this constitutes a fourth dimension. Each item of data now has four attributes, and the total number of potential cells is the three dimensional total times the number of geographic areas. And if the company wishes to have different "scenarios," for example, to make budget forecasts and then compare the actual results to those forecasts, this is a fifth dimension, and five attributes are needed, with the number of potential cells now multiplied again, this time by the number of possible scenarios.
In each of these cases, the number of cells required of a spreadsheet to show all possible relationships between dimensions also increases dramatically. FIGS. 4a to 4d show some possible views of such a five dimensional database which a user might wish to see. For example, the front "face" of FIG. 4a is a spreadsheet showing the actual figures for sales and profits for various products as compared to the budgeted figures over time for the San Francisco market. Behind that spreadsheet are other spreadsheets showing the same information for other cities, followed by a spreadsheet showing the same information for the "West," i.e. the total for those cities. FIGS. 4b to 4d each show a similar "stack" of spreadsheets which represents a three dimensional view of the five dimensional database. Note that in each of these examples, there is some intermingling of more than two dimensions, as shown in FIG. 2. Many more possible views could be constructed from the five dimensions used here.
It is thus obvious that the number of possible data cells rapidly becomes enormous if all combinations of data are to be precalculated and ready for reporting (as is necessary to avoid long waits for consolidation and special calculation for even the simplest reports). For example, suppose that there are seven dimensions in a particular application, and that the number of items in each dimension is 10. Each data cell must have seven attributes, each attribute being one of the 10 members of each dimension, and the total number of potential data cells is thus $10 \times 10 \times 10 \times 10 \times 10 \times 10 \times 10$
or 10,000,000. Since a data cell containing a standard double precision floating point number requires 8 bytes, 80,000,000 bytes are required to reserve a place for all of the potential cells. Common practice in microcomputer spreadsheet implementation is to maintain all cells in memory, if possible, to speed access time. But since most microcomputers have less than 16 megabytes of memory, most or all of the data would have to be kept on disk if a space were reserved for each potential cell. This would slow the speed of storage and access drastically, but could be done since an 80 megabyte drive is a common fixture on personal computers today.
But suppose that instead of 10 items in each dimension, there are, respectively, 30, 50, 400, 300, 80, 10 and 50. Now the total number of potential cells is $30 \times 50 \times 400 \times 300 \times 80 \times 10 \times 50$
or 7,200,000,000,000. Again, with 8 bytes per data cell, a total of 57,600,000,000,000 bytes are required to store all of the potential cells. No currently available disk drive can hold this much data. Even with gigabyte size disk drives, over 50,000 such drives would be needed. If the dimensions have more items, or if there are more than 7 dimensions, the problem may be even worse.
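The arithmetic of the two seven-dimensional examples above may be checked with the following sketch. The function name `potential_cells` is hypothetical, and a gigabyte is assumed here to be 10**9 bytes.

```python
from math import prod

BYTES_PER_CELL = 8  # one double precision floating point number per cell

def potential_cells(members_per_dimension):
    """Number of potential cells: the product of the dimension sizes."""
    return prod(members_per_dimension)

small = potential_cells([10] * 7)
large = potential_cells([30, 50, 400, 300, 80, 10, 50])

small_bytes = small * BYTES_PER_CELL
large_bytes = large * BYTES_PER_CELL

# Assumption: one gigabyte is 10**9 bytes; ceiling division gives whole drives.
gigabyte_drives = -(-large_bytes // 10**9)
```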
Most databases which handle problems of this magnitude keep only data which actually exists, i.e. they are relational databases whose tables consist only of records that need to exist, and thus do not waste space on "potential" data records. But relational database tables are basically two-dimensional structures (a series of records each containing a fixed "field" dimension) and cannot handle higher dimensionality in any straightforward fashion. Worse, any time a specific data cell is needed, some sort of search of the records must be done whether or not an index is available. In fact, even an index must be searched for the matching attributes. Because the table records have "gaps", even if the records are organized in some regular repeating order, an offset from the beginning of the table cannot be calculated directly to find the desired record. Thus, by conserving space by keeping only the actual data, whether on disk or in memory, speed of access is drastically reduced. This is true of any data structure which has discontinuities in the attributes of adjacent blocks or records of data rather than reserving a place, with a specific length, in a specific known order for any potential data item.
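The search requirement described above may be illustrated by the following sketch, which is not drawn from any particular database product. It assumes a hypothetical table holding only existing records, kept sorted by attribute tuple, and uses a binary search as one possible form of search. Because absent records leave gaps, the position of a record cannot be computed from its attributes alone.

```python
from bisect import bisect_left

# Hypothetical sparse table: only records that actually exist are stored,
# sorted by their (account, period, product line) attribute indices.
records = [
    ((0, 0, 1), 12.5),
    ((0, 3, 2), 7.0),
    ((2, 1, 0), 40.0),
    ((4, 4, 4), 3.25),
]
keys = [attrs for attrs, _ in records]

def lookup(attrs):
    """Binary search for a record; returns None when no such record exists."""
    i = bisect_left(keys, attrs)
    if i < len(keys) and keys[i] == attrs:
        return records[i][1]
    return None

found = lookup((2, 1, 0))     # present: located by searching, not by offset
missing = lookup((1, 1, 1))   # absent: falls in a "gap" between records
```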
Existing multidimensional databases (non-relational and non-spreadsheet) which incorporate the ability to directly calculate the offset to the desired data item do so by one of two methods. One approach is to use a one-level structure, i.e. to have one data block containing all dimension combinations. The obvious drawback to this is that most of the reserved space is wasted and the number of dimensions and the number of members in each dimension are severely limited. If the application is even of medium size, operating in memory must be abandoned to use a disk, and even disk, as slow as it is, cannot offer the space required by typical corporate applications.
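The direct offset calculation used by a one-level structure may be sketched as follows. The sketch assumes a row-major ordering of the dimensions and a hypothetical three-dimensional block of 3 by 4 by 5 members; neither the ordering nor the function name `offset` is taken from any existing product.

```python
def offset(attrs, sizes):
    """Row-major offset of a cell, given one member index per dimension."""
    position = 0
    for index, size in zip(attrs, sizes):
        if not 0 <= index < size:
            raise IndexError("member index out of range")
        position = position * size + index
    return position

sizes = [3, 4, 5]                 # hypothetical members per dimension
block = [0.0] * (3 * 4 * 5)       # a place is reserved for every potential cell

cell = offset((2, 1, 0), sizes)   # computed directly; no search is required
block[cell] = 40.0
```

Every potential cell occupies a place in `block` whether or not it holds data, which is the wasted space referred to above.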
The other, more common approach uses a multi-level structure, usually having two levels. The upper level is some sort of index to existing data blocks, and the lower level is either a 1 or 2 dimensional block of data, such as a record representing a single dimension such as a time series, or a spreadsheet-like two-dimensional data block, respectively. The upper level must be searched to find the right index for a given set of attributes. In theory, the upper level may be a list of all potential combinations in a specific order so that the offset to the particular index (pointer) may be calculated from the attributes in the dimensions covered by the upper level structure, but no products using such an upper level are known. Since the potential number of combinations of the upper level attributes is often very large, it is believed that the existing products in this group resort to a sorted list which does not contain unused combinations, and therefore a search of some kind must be employed to reach the proper pointer.
Besides the loss of speed due to this search requirement, the biggest drawback to this type of design is that the number of dimensions in the "block" of data pointed to is fixed at either 1 or 2 dimensions, depending on the database. Furthermore, the specific type of dimension which forms the basic block of data is usually fixed. For example, one product with a one-dimensional data block requires that this dimension be the Time dimension. Another product which has a two-dimensional block requires that the two dimensions represent "rows" and "columns" (normally Accounts and Time, respectively). But the operations which can be performed on "rows", "columns" and the other dimensions are distinctly different and therefore limiting as to which type of attribute can be effectively and flexibly used as "rows" or "columns." For example, the "rows" dimension has available a set of calculation functions which are most appropriate for Accounts, so if Accounts are not set up as the row dimension, there is a severe limitation in performing analytical calculations typically required for Account relationships in financial applications. In the time-series oriented structure, the block dimension must be time.
However, the restriction that is most unfortunate is that the user cannot select the number of dimensions which make up the basic unit of data storage and usually cannot even select the dimensions which comprise it. This is not optimal for a number of reasons.
In multidimensional databases, as previously discussed, the major problem is sparseness of data. More often than not, the data for most potential combinations of dimensional attributes does not and will not exist. But to have the ability to directly calculate the location of a required data item, all potential combinations must be represented in the structure without discontinuities, or else the irregularity prevents the direct calculation of the offset to the desired cell.
A two level structure generally reduces this problem somewhat. If the basic unit of allocated storage (the block), when created, is always allocated to have a space for every combination of a subgroup of the dimensions, then within those blocks, at least, the offset can be directly calculated. If the upper level does not reserve a spot for every combination of the remaining dimensions, its size can be kept reasonable, although a slower search algorithm is necessary to locate a pointer in the upper level structure which gives the exact address of the block. Thus, in this structure, at least half of the procedure of locating a data item's location can be done by direct calculation, and blocks for which no data items exist need not be created.
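A two level structure of the kind just described may be sketched as follows. This is an illustrative sketch only and does not reproduce any existing product: the class name `TwoLevelStore`, the block shape of 4 accounts by 17 time periods, and the use of a sorted key list with binary search for the upper level are all assumptions made for the example.

```python
from bisect import bisect_left

BLOCK_SIZES = (4, 17)  # hypothetical block: 4 accounts x 17 time periods

class TwoLevelStore:
    def __init__(self):
        self.keys = []    # upper level: sorted keys of existing blocks only
        self.blocks = []  # lower level: dense blocks, parallel to self.keys

    def _find(self, key):
        """Search the upper level; returns (position, whether key exists)."""
        i = bisect_left(self.keys, key)
        return i, i < len(self.keys) and self.keys[i] == key

    def put(self, key, account, period, value):
        i, found = self._find(key)
        if not found:  # a block is created only when data first arrives
            self.keys.insert(i, key)
            self.blocks.insert(i, [None] * (BLOCK_SIZES[0] * BLOCK_SIZES[1]))
        # Within the block the offset is calculated directly.
        self.blocks[i][account * BLOCK_SIZES[1] + period] = value

    def get(self, key, account, period):
        i, found = self._find(key)
        if not found:
            return None
        return self.blocks[i][account * BLOCK_SIZES[1] + period]

store = TwoLevelStore()
store.put(("Cameras", "West", "Actual"), account=1, period=0, value=250.0)
```

Here the upper level is searched and only the lower level is addressed by offset, so only part of the lookup is a direct calculation.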
The failure of the existing designs which use this approach to allow the user to select how many and which dimensions make up the block leads to some problems. First, the dimension or dimensions which make up the block may be very sparse for a given user's application. For example, if the user is forced to live with Accounts and Time as the block dimensions, and (as is often the case) there are hundreds or thousands of accounts, of which only a small percentage have data for a given combination of the other dimensions, each block that is allocated is still mostly wasted space.
For example, a company may have 500 departments, 80 product lines, 1000 accounts, 12 scenarios (e.g., Budget, Actual, Variance, Forecast1, Forecast2, etc.), and in each particular department/product line/scenario combination, only 20 of the accounts may have values, on average. Yet, which 20 accounts each department uses may be any 20 of the 1000 accounts. Therefore, each block that is created is, on average, comprised of 98% missing values. As a result, many such applications are impractical with existing multidimensional databases given the hardware constraints.
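The figures in this example may be checked with the following sketch. The variable names are hypothetical; the count of department/product line/scenario combinations is simple arithmetic on the stated figures and is included only to indicate how many blocks could potentially be required.

```python
# Figures taken from the example above; variable names are illustrative.
departments, product_lines, accounts, scenarios = 500, 80, 1000, 12
accounts_with_values = 20  # average per department/product line/scenario

# Share of the account dimension of each block that holds no value.
missing_percent = 100 * (accounts - accounts_with_values) // accounts

# Number of department/product line/scenario combinations.
potential_blocks = departments * product_lines * scenarios
```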
On the other hand, a database using a one-dimensional block which is fixed as the Time dimension will often make fairly good use of the allocated space, because if there is a data value for a particular combination of attributes in June, there is usually an observation in August and the remaining months. However, that leaves all the dimensions except Time to be represented in the upper level structure/index, and in a 7 dimensional application, this is impractical because either (1) the design reserves a fixed spot in the upper level structure for the pointer to the block, which means that if 6 dimensions are forced to go into the upper level structure, it is impossibly large; or (2) the size of the upper level structure is reduced by not reserving space for each possible combination. Unfortunately, as above, if this is done, a search algorithm must be used, and with small, one dimensional Time blocks, even the number of actually existing blocks is quite large, and the search is therefore very slow. Finally, it is assumed that the usage of a Time-dimension block is fairly dense (most cells used), but that might not be the case in some applications.
There are other variations on these two approaches, but all make use of a "fixed" block dimensional composition, and most must use a search algorithm to locate the index or pointer to the block containing the desired data cell. However, experience shows, to the contrary, that no single fixed block design effectively addresses even most applications. Each application has a different number of dimensions and of members in each dimension, and most importantly, a different distribution of data density/sparseness in relation to any specific subset of dimensions in that application.