This patent application relates to a methodology for populating data marts.
A data mart is a database, or collection of databases, designed to help managers make strategic decisions about their business. Whereas a data warehouse combines databases across an entire enterprise, data marts are usually smaller and focus on a particular subject or department. Often, data marts are subsets of larger data warehouses. Planning and Designing The Data Warehouse edited by Ramon Barquin and Herb Edelstein. Prentice Hall PTR. ISBN 0-13-255746-0 describes definitions and the usage of data marts. The Data Warehouse Toolkit by Ralph Kimball. John Wiley & Sons, Inc. ISBN 0-471-15337-0 provides a good description of the background and concepts of data warehousing.
One of the first steps in building a successful data mart is to correctly identify the different dimensions and the fact set within a business structure. This is often known as dimension modeling. Each dimension represents a collection of unique entities that participate in the fact set independent of entities from another dimension. The fact set usually contains transactional data where each transaction (or record) is identified by a combination of entities one from each dimension. FIG. 1 shows a star schema for a supermarket business where the star schema is the outcome of the dimension modeling process.
Each dimension is a table where each record contains a key (or a composite key) that uniquely identifies each entity and a list of attributes to qualify or describe the corresponding entity (or key). Each fact record in the fact table would contain a foreign key to allow it to join to each dimension and a list of measures which represents the transactional data. The dimension table is usually not further normalized because the size of a dimension is usually much smaller than that of the fact table; thus, the space saved by normalizing would not be that significant. Also, it is not time-effective for an OLAP query tool to join the normalized dimension tables at query run-time.
Theoretically, an OLAP tool could directly query against a data mart which contains transactional data in the above star schema layout. However, in order to allow fast response time on high level queries, for instance, a query to get the monthly sales volume of a particular brand product for each state, pre-aggregation of data in a data warehouse is required.
Levels of data are specified in each dimension for the purpose of aggregation. Each level defines a grouping of dimension entries based on a condition. For instance, in the store dimension of FIG. 1, a state level could be specified which would contain one aggregated dimension record for each state having at least one store. In other words, each aggregated dimension record for a particular state would represent the aggregation of data from all stores that are in that state. Similarly, we could specify a city level in the store dimension to allow the creation of aggregated dimension records where each entry represents the aggregation of all stores in a particular city.
A level referred to as the input (or detail) level is the lowest level and contains the same number of records as the input dimensional data. Levels specified by users for aggregation purpose are referred to as aggregate levels. Each aggregate level will contain a different number of records depending on the level condition. For instance, the state level will probably contain less records than the city level. Each (input and aggregated) level is uniquely identified by a level code. The level code is generally represented as an integer for efficiency.
The aggregation required in the output fact data is specified by a combination of levels, one from each dimension. The combination of levels used to specify aggregation is also referred as a cross product of levels. To do a month by brand by state query in the above star schema example, the corresponding level would need to be defined in each of the dimensions and it would need to be specified that aggregation is required of the transactional data based on the cross product of the three specified levels. Users may specify a list of cross products for which they desire aggregation. The cross product of input levels represent the input or detail fact data. There could be one or more aggregation expressions associated with each input fact measure to be aggregated. Some common aggregation expressions includes: max, min, average, sum, count, weighted average, and fill with constant value for aggregated data.
An "all values" level is normally also provided in each dimension which has a single member containing all dimension entries. This level is used in cross products when the corresponding dimension is not involved in the cross product, so that a single format for cross products can be defined each containing a level from every dimension. Adding an "all entries" level into a cross product has no effect on the result of the aggregation and acts as a dummy entry in the cross product. For example, if the star schema of FIG. 1 were being used, and a city by month cross product was required, the product "all entries" level would be incorporated in the cross product. Effectively, the data for all products is aggregated into a city by month aggregate record.
After the process of aggregation, a data mart is generated. Each data mart contains a set of dimension and fact tables. Each dimension in a data mart corresponds to one dimension in the source star schema. The fact table in a data mart contains aggregated data for a list of specified cross products.
A previous product provided by the present assignee is a consulting service which does recognition of containment relationships between detail dimension records and aggregated dimension records, aggregation of fact data, and distribution of the aggregated dimension and fact data.
The aforementioned product has a basic mechanism of output distribution. It can distribute different levels of detail and aggregated dimension records to different output dimension files/tables and distribute different aggregate level cross products to different output fact files/tables. However, it does not truly support the concept of populating multiple data marts.
Only one target (output) star schema configuration is allowed. It does not support the concept of having data marts that are of different star schema configurations. Thus, the different output dimension or fact tables generated are bound to the same output target star schema configuration.
Likewise, the aforementioned product doesn't allow the generation of different aggregates in different target fact tables. All fact tables have to have the same type of aggregates. However, aggregates that are of interest to one group of users may not be of interest to another group of users. Thus, allowing different aggregates can both save resources wasted on processing unnecessary data and limit the scope of data accessible by users.
Furthermore, dimension tables are not tied to any particular fact table. Although the product can generate multiple dimension and fact tables, the concept of a data mart is not enforced to ensure that the correct dimension tables are joined to the correct fact table. In addition, users can specify a dimension table to contain levels that are not related to the cross products of its corresponding fact table. This could be very error prone on database systems that check for referential integrity between dimension and fact tables.
Many techniques exist today that do not have enough flexibility and efficiency to generate multiple data marts in a single aggregation generation process. The following are the reasons why such flexibility and capability is required:
1) It is often not desirable to generate a single data mart that has all the aggregates needed by all user groups because it would contain unnecessary data to other user groups. Moreover, the set of generated aggregates could be too large to be duplicated or shared among different user groups at different physical locations. PA0 2) Small and specialized data marts contain smaller amounts of data and a less complex view of the data to the targeted user group. Querying against specialized data marts is also more efficient because of the smaller or scoped set of aggregated data. PA0 3) Data marts are often required to have different target star schema configurations and aggregates because different user groups are interested in a different set of dimension attributes or fact aggregates. Moreover, different user groups often look at the data in different ways. PA0 4) Data partitioning is very essential in saving storage resource. The ability to have different fact tables sharing the same dimension tables is commonly needed. An example is having two data marts: one for actual sales and one for predicted sales. Both data marts could share the same dimension tables; however, each data mart will need to have its own fact table containing its own view of the fact data. PA0 5) Running aggregation generation multiple times to generate specialized data marts is too expensive. The overhead of doing aggregation multiple times is prohibitively high because it involves reading the input fact file and analyzing the input fact containment in different aggregation buckets multiple times.
Data marts contain published data used by OLAP tools for effective high-level queries. Aggregated data in data marts is mainly used for decision support for effective business management. However, different user groups may require different sets of aggregated data from the same set of transactional data. For instance, the west region marketing group may only care about data for the west region; thus, it would be effective to build different data marts for different geographic subsets from one set of national sales data. On the other hand, users may also require data that is of interest among all regions like the sum of nationwide sales figures. In this case, data marts may also contain some overlapped aggregated data.