Technical Field
The embodiments herein generally relate to data warehousing, and, more particularly, to an automated system and method for gathering input for defining, refining or validating star schema for a source database.
Description of the Related Art
Data warehouses are becoming more and more important to businesses. The term “data warehouse” is generally used to describe a database containing data gathered from a variety of sources (e.g., existing databases). Data that needs to be analyzed for warehousing is typically captured in normalized schemas that are more amenable for recording modifications (e.g., inserts/updates/deletes) as opposed to in star schemas that are more amenable for analysis. Warehousing requires data to be described in a dimensional model, which includes star schemas with a central fact table linked to dimension tables.
For example, if a company is interested in tracking sales, it tracks Sales as a fact and associated dimensions like Time (e.g., for tracking when each sale occurred), Customer (e.g., to whom the sale was made), Product (e.g., the items sold), Geography (e.g., where the sale occurred), Salesman (e.g., the employee who executed the sale), and so on. On base facts, expressions such as TotalSales (=SalesPrice*Quantity), and Profit (=SalesPrice−Cost) are defined. These expressions are called measures or metrics.
Dimensions are additionally associated with hierarchies like ‘Year→Quarter→Month→Week→Day’ for Time and ‘Country→State→City’ for Geography. A dimensional model facilitates analysis of facts and measures by dimensions and allows drilling up and down a dimension hierarchy as needed. So, the company may ask questions analyzing measures by dimension, (e.g., Profit by Country, or Total Sales by Salesperson), or drill up and down a dimension hierarchy (e.g., go up from Sales by Quarter to Sales by Year, or go down to Sales by Month), or look across multiple dimensions (e.g., Quarterly Sales by Country). The readability of the dimensional model, and the predictable performance stemming from its simplicity has made it very popular for analytics related applications.
For defining star schemas, one may need to understand the source model enough to establish the main entities, keys for the entities, relationships between and within entities, update frequency of entities and workflows. One may also need to identify dedicated entities dealing with aspects such as unit of measure conversion and currency conversion. Currently, the most common data warehouse development practice is to begin development with a business analyst or a data analyst to gather requirements for a data warehouse, and then to pass those requirements to a development team that then develops and codes a data warehouse design for the target environment. Once the data warehouse design is completed, experienced and expensive consultants develop the Extract, Transform, and Load processes based on the requirements. Finally, the data warehouse goes through testing, scheduling, and maintenance cycles in order to ensure proper operation. Significant time, resources, costs, and risks are involved in such data warehouse projects and many companies, may not be able to afford this custom design approach to developing and building data warehouses.
Accordingly, there is a need for technology that can gather the main entities, keys for the entities, relationships between and within entities, etc., to understand the source model that is less labor intensive and less expensive.