This patent application relates to a method of selecting aggregates to generate in a data warehouse.
A data warehouse generally contains large quantities of data relating to a business structure from which information is retrieved and analyzed. The Data Warehouse Toolkit by Ralph Kimball. John Wiley & Sons, Inc. ISBN 0-471-15337-0 provides an excellent background on data warehouses. One of the first steps in building a successful data warehouse is to correctly identify the different dimensions and the fact set within a business structure. This is often known as dimensional modeling. Each dimension represents a collection of unique entities that participate in the fact set independent of entities from another dimension. The fact set usually contains transactional data where each transaction (or record) is identified by a combination of entities, one from each dimension. FIG. 1 describes a star schema for a supermarket business where the star schema is the outcome of the dimension modeling process.
Each dimension is a table where each record contains a key (or a composite key) to uniquely identify each entity and a list of attributes to qualify or describe the corresponding entity (or key). Each fact record in the fact table would contain a foreign key to join to each dimension and a list of measures which represents the transactional data. The dimension table is usually not further normalized because the size of a dimension is usually much smaller than that of the fact table. Thus, the space saved by normalizing would not be that significant. Also, it is not time-effective for an OLAP query tool to join the normalized dimension tables at query run-time.
Theoretically, an OLAP tool could directly query against a data warehouse containing transactional data in the above star schema layout. However, in order to allow fast response time on high level queries, for instance, a query to get the monthly sales volume of a particular brand product for each state, pre-aggregation of data in a data warehouse is definitely required.
Pre-aggregation of data is important because it facilitates fast query response times from OLAP tools on commonly asked queries (questions). Thus, it is even more important to be able to define the right set of aggregates to generate in the data warehouse. Otherwise, OLAP tools may not be able to benefit from pre-computed aggregates.
Levels of data are specified in each dimension for aggregation purposes. Each level defines a grouping of dimension entries based on a condition. For instance, in the store dimension, a level could be specified for different states so that the level would contain one aggregate dimension record for each state where each aggregate record represents at least one store. In other words, an aggregate dimension record for a particular state would represent the aggregation of all stores that are in that state. Similarly, we could specify another city level in the store dimension to allow the creation of aggregate dimension records where each entry represents the aggregation of all stores in a particular city. Levels are also referred to as aggregation levels.
Thus, each level determines the group of aggregates generated at that particular level. The rule or condition associated with each level is used for the recognition of the containment relationship between detail dimension records and aggregated dimension records. The condition also determines the flexibility of grouping detail level dimension records into different aggregates.
Each aggregate dimension record contains a surrogate (or synthetic) key because a dimension record for it is not originally present in the input dimension table. The key is used to join the aggregate dimension records with the aggregates generated in the output fact table. In most cases, surrogate keys are also generated for input dimension records so that both aggregate and input dimension records will have the same layout. In this case, all fact aggregate records may also have one layout because they may join to each dimension aggregate or input level record with the same type of keys.
Not all attributes are valid or meaningful for an aggregate dimension record. In those cases, those attribute values are suppressed. For instance, in a store dimension, an aggregate dimension record at the state level will have its street address and city attributes suppressed because these attributes described individual stores. Suppressed attributes may be filled with a null value or a default value specified by users.
The aggregates required in the output fact data are specified by a combination of levels, one from each dimension, to be aggregated on. The combination of levels used to specify aggregation is also referred as a cross product of levels. To do a month by brand by state query in the above star schema example, the corresponding level would need to be defined in each of the dimensions and the aggregation of transactional data would need to be requested based on the cross product of the three specified levels. Users may specify a list of cross products for which they desire aggregations. The cross product of input (or detail) levels represents the input or detail fact data.
A "dimension" level will often be defined in a dimension with a single member containing all the dimension records. Using such a level in a cross-product effectively removes that dimension from the cross product. For example, if the dimension level is specified in the product dimension of the above cross product, a month by state query would be generated.
The technique described in this document relates to defining levels within a dimension.
Many OLAP tools today allow users to specify simple aggregation levels. These levels are usually based on distinct values in one or more columns of a table. This technique serves most of the needs of repetitive and simple OLAP queries. However, more advanced queries over combinations of columns or using complex selection criteria could benefit from more sophisticated pre-generated aggregates. In some cases, these more sophisticated queries are the norm.
A previous product created by the present assignee provides a consulting service which performs recognition of the containment relationship between detail dimension records and aggregate dimension records, and performs aggregation of fact data. The level rule used for recognizing aggregate dimension records is strictly based on distinct attribute values. Each level rule can have zero or more rule items. if no rule item is specified in a level rule, that means all dimension records will be aggregated into one aggregate record unconditionally. Each rule item refers to one distinct dimension attribute and the rule item can be either conditional or unconditional.
An unconditional rule item only consists of an attribute from the corresponding dimension. For instance, the Store dimension may have a level with an unconditional rule item for the State attribute. That means there will be one aggregate dimension record generated for each distinct value of State attribute in the input dimension.
A conditional rule item consists of an attribute, an equal or not equal operator, and a value. For instance, if the conditional rule item is "state equal TX" for the store dimension, then one aggregate dimension record will be generated if there are dimension records with state equals TX. If the conditional rule item is "state not equal TX", then one aggregate dimension record will be generated for each distinct value of the state attribute except for TX.
Level rules with multiple rule items would generate one aggregate dimension record for each distinct combination of attribute values from the set of rule items. Specifying multiple rule items allows for recognition of hierarchical levels without knowing the actual hierarchical order of the attributes or levels. For instance, a level rule can have two unconditional rule items: one for the state attribute and another for the city attribute. If there are two cities with the same name but from a different state, they will generate two different aggregation records appropriately. On the other hand, if the level rule has only one unconditional rule item for the city attribute, fact data of the two cities will be mistakenly aggregated into one record.
Another product providing aggregation is Microsoft's SQL Server 7.0 Beta OLAP server, code named Plato. Each level is associated with one dimensional attribute. However, all levels are specified in hierarchical order, so one aggregate record is generated for each distinct attribute value combination from the current and all the higher levels. Thus, it is similar to the unconditional level rule of the present assignee's previous product described above.
OLAP and aggregation products need a concise and flexible method for identifying aggregation levels. A common practice is to identify levels from the distinct values present in one or more columns of the dimension table. Each unique combination of the level columns represents an aggregation at that level. For instance, the level columns City and State may have a unique combination of Houston and Texas, which represents one aggregation at the City aggregation level.
The common practice is not particularly flexible for a number of reasons:
The capability is required to specify levels that can limit the aggregates generated, so that all combinations are not produced. It is important to control the growth of the data warehouse, and generating little-used aggregates or aggregates that do not involve much data anyway is expensive and wastes space. As an example, if only one store is in Austin, Tex., the user should be able to prevent aggregate records for Austin from being added to the dimension. In this manner, the user can tune which aggregates are really needed to speed up OLAP analysis. Previous products provided selection capability to an extent, but their use was very cumbersome.
The capability is required to specify levels that are not easily derived from the dimension attributes. For example, given State, City, and Population attributes, we might want aggregates for a state's small towns. Although it is easy to create a level involving City, State, and Population&lt;50000 (easy assuming you have expression-based level identification), this produces individual aggregate dimension records like Navasota, Tex., 23450 and Paris, Tex., 34982. There is no immediate way to get a single aggregate for All Cities in State whose Population&lt;50000. This means that the OLAP tool would resort to querying against more rows.
The capability is also required to group levels so that the cross product of levels (described earlier) is easier to specify succinctly. If you have only ten levels per dimension for three dimensions, all combinations result in a cross product of levels that is 1,000 entries long.
The capability is further required to specify levels for which the attributes are inadequate. As an example, if attributes were available for Male Population, Female Population, Male Average Age, and Female Average Age but a condition was wanted around "Average Age&gt;50, " an "average age" attribute is not directly available to create this level, even though there is in theory sufficient information to establish the level. Some products solve this problem by providing "derived attributes" which are not maintained in the dimension tables permanently, but are generated for aggregation purposes. This can work well in certain situations, but can cause problems in systems that track attributes over time. For example, synthetic keys representing level members in the dimension tables are often stored permanently in the dimension tables. Synthetic keys associated with certain attributes never change, and aggregates generated at different times will have the same key values. The fact that the aggregates maintain the same keys can be very useful. However, if derived attributes are used, which are inherently not persistent between aggregations, aggregates generated at different times will inevitably have different keys for the same attribute value. For this reason, many architectures avoid the use of derived attributes, thereby maintaining unchanging synthetic keys, but the advantages afforded by such derived attributes are often still desired.