1. Field
The present disclosure relates to data warehouses and, in one particular example, to improved processes for managing data warehouses.
2. Related Art
Data warehouses are large repositories that integrate data from many different sources and are commonly used to store data for purposes of reporting and data analysis. In traditional data warehousing, entities and their attributes are mapped into dimensions, where a dimension refers to a data element that categorizes each item in a data set into a non-overlapping region. For example, as applied to a sales receipt, possible dimensions may include “Customer,” “Date,” and “Product.” Dimensions provide filtering, grouping, and labeling and are needed to slice or aggregate data in various ways (e.g., per region, per salesperson, per language, per item category, etc.).
Conventional data warehouses may store various types of data, such as measures (e.g., properties in a database on which calculations can be made, such as a quantity of items sold or another value that changes over time), their changes over time, and dimensions of interest, in data structures called fact tables. The fact tables provide the values that act as independent variables for analyzing dimensional attributes. Dimensions in this model are constructed from the attributes of interest as well as the changes to their values over time. To manage data warehouses storing data in fact tables, various data management techniques, such as a star schema, may be used. A star schema generally refers to a simple form of schema in a data warehouse that includes one or more fact tables that may reference any number of dimension tables. For example, a star schema for a data warehouse may include fact tables, each holding a measure and the identifiers for the dimensions, together with a set of tables describing each dimension. While generally effective, these data warehouses are fairly difficult to modify. For example, due to the complexity of building a data warehouse, the propagation of schema changes is limited because a change impacts both the target repository and the data pipeline that has been used to construct the warehouse by integrating data from multiple sources.
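By way of illustration only, the star schema arrangement described above may be sketched as follows. The sketch uses Python's built-in sqlite3 module; the table names (fact_sales, dim_customer, dim_product, dim_date), the column names, and the sample rows are hypothetical and are not taken from any particular warehouse.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and the dimension tables it references.

    All table and column names here are illustrative assumptions.
    """
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, quarter TEXT);
        -- The fact table holds the measure (quantity) and one identifier per dimension.
        CREATE TABLE fact_sales (
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            product_id  INTEGER REFERENCES dim_product(product_id),
            date_id     INTEGER REFERENCES dim_date(date_id),
            quantity    INTEGER
        );
        """
    )


def load_sample_rows(conn):
    """Insert a few made-up rows so the aggregate query has something to read."""
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Acme", "West"), (2, "Globex", "East")],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(1, "2024-01-15", "2024-Q1"), (2, "2024-04-02", "2024-Q2")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 3), (1, 2, 1, 2), (2, 1, 2, 7)],
    )


def quantity_by_region(conn):
    """Aggregate the measure per region by joining the fact table to a dimension."""
    rows = conn.execute(
        """
        SELECT c.region, SUM(f.quantity)
        FROM fact_sales AS f
        JOIN dim_customer AS c ON f.customer_id = c.customer_id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()
    return dict(rows)
```

In this sketch, adding a new attribute to a dimension would require altering the dimension table and every load step that populates it, which is one way to see why schema changes in such a warehouse tend to be costly.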
As data warehouses are being used to store larger amounts of data that change over time, it is becoming increasingly important to have proper data retention mechanisms to retain relevant data while deleting or archiving older, less relevant data. In some systems, data warehouses are physically separated into partitions based on a time period (e.g., by day, month, quarter, or year). A common data retention mechanism used in these systems is to simply delete older partitions. While this results in predictable data retention, it may cause older yet still relevant data to be archived or deleted.
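By way of illustration only, the partition-based retention mechanism described above may be sketched as follows. The sketch assumes monthly partitions represented as a Python dictionary keyed by (year, month); the function name drop_old_partitions and the keep_months parameter are hypothetical and do not correspond to any particular product.

```python
from datetime import date


def month_index(year, month):
    """Map a (year, month) pair to a single increasing integer."""
    return year * 12 + (month - 1)


def drop_old_partitions(partitions, today, keep_months):
    """Keep only the most recent `keep_months` monthly partitions.

    `partitions` maps (year, month) to the rows stored in that partition.
    Returns the surviving partitions and the sorted keys of those dropped.
    The decision looks only at the partition's age, never at its contents.
    """
    cutoff = month_index(today.year, today.month) - keep_months
    kept = {
        key: rows
        for key, rows in partitions.items()
        if month_index(*key) > cutoff
    }
    dropped = sorted(key for key in partitions if key not in kept)
    return kept, dropped


if __name__ == "__main__":
    partitions = {
        (2024, 1): ["row for a customer who is still active"],
        (2024, 4): ["recent row"],
        (2024, 6): ["current row"],
    }
    kept, dropped = drop_old_partitions(partitions, date(2024, 6, 15), keep_months=3)
    print("kept:", sorted(kept))
    print("dropped:", dropped)
```

Because the sketch decides solely on partition age, the January partition is dropped even though it holds a row that may still be relevant, which is the shortcoming noted above.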
Improved systems and processes for managing data warehouses are desired.