Technical Field
The embodiments herein generally relate to data warehouse, and, more particularly, to an automated system and method for defining a data warehouse schema for a source database.
Description of the Related Art
Data aggregation relates to any process in which information is gathered, arranged, and/or expressed in the form of a summary for purposes such as statistical analysis. In one example relating to data aggregation, the detailed activities of a business may be converted by data aggregation into a high-level view of the business, which can then be used by business managers and non-technical users for evaluation purposes. There are many purposes for data aggregation. One specific purpose of data aggregation is to acquire information about particular groups, based on specific variables, such as age, profession, or income. Creating a data warehouse with historical data stored in a star schema is one method of supporting data aggregation.
Star schema is a database design that may be used for modeling data warehouses to enable analytical querying of numeric data. The star schema may relate to dimensional data, which describes how data is commonly aggregated, and/or fact or event data, which describes individual transactions. Accordingly, the star schema may include two types of database tables (e.g., facts tables and dimension tables). Facts tables contain aggregated numeric and additive values, generally referred to as measurements. Dimension tables contain values for different perspectives by which the measurements can be interpreted.
In one example relating to the star schema, a facts table may contain columns that store aggregated sales amounts of a company. Additionally, associated dimension tables may contain columns that may be used to query the sales amounts by time, location, and customer. In accordance with this example, a star schema may include one facts table (i.e., aggregated sales amounts) and three dimension tables (i.e., time, location, and customer). Each dimension table may contain detailed information about one dimension (e.g., city, state, and country name fields in the location table). Additionally, the facts table presented in this example may contain three foreign key columns (one for each dimension table's primary key), and one or more columns for storing numeric measurements (e.g., the sales amount). A primary key may be an attribute or group of attributes that uniquely identifies a tuple (i.e., a set of attribute values pertaining to a given item in a database) within a database or table. For example, a client table might have a client number as its primary key. A foreign key, on the other hand, may be a table field that is not a key in its current table, but is a primary key in another database table.
The design of a star schema may require knowledge of data relationships and underlying semantics, which may be obtained by analysis of the data. Accordingly, star schema design is typically resource intensive. Some tasks related to determining a star schema may utilize manual analysis or human input to explain data semantics and relationships. It may be difficult to determine which data fields can be placed into the facts or dimension tables and to determine whether there exist any hierarchical or functional relationships among the fields that are chosen for dimension tables. Additionally, even aspects that do not utilize human input may require specific tools and applications that assume the existence of semantic and relational knowledge about the data. Traditionally, as part of developing a warehouse, ETL consultants define the dimensional model by figuring out the facts and dimensions, suggesting measures and hierarchies and mapping the facts and dimensions to the source schemas. This can be an expensive process both in terms of the cost of ETL consultants and the development time required, and are expensive to maintain as well given that source schemas evolve over time.
Accordingly, there remains a need for a faster and less expensive system to provide suggestions for defining a warehouse schema based on an analysis of a source database.