The present invention relates generally to data profiling during extract-transform-load (ETL) processes, and more particularly, to data quality monitoring by running data quality rules and comparing the results against previous data quality results to determine whether or not data quality has changed.
In data integration projects for master data management (MDM) systems, data warehousing (DW) systems, business application consolidation systems and the like, data quality issues are identified using data profiling techniques and/or data cleansing approaches in ETL processes. These issues are identified so that only high-quality data is loaded during the initial load phase of these systems. However, when data quality degrades in business applications, data quality issues become a cost factor for enterprises and may even break the business processes entirely. Periodically measuring the data quality using data quality rules is one way to detect the rate of degradation and/or the change of data quality over time. Periodic measuring may also be used as a prompt for an action if certain minimal required data quality key performance indicators (KPIs) are no longer being met. For MDM systems, a data governance (DG) program is established alongside the deployment of the MDM system to control the creation, maintenance and use of master data and master data quality throughout its life cycle. Data stewards working in the data governance organization apply data profiling measurements periodically in order to control compliance with data quality KPIs for the master data. A measurement is often done using semantic rules, which are one of many data profiling techniques. Data quality monitoring includes defining data quality KPIs, creating semantic rules, creating a first baseline measurement during the initial load, periodically executing the semantic rules, and comparing the results against the baseline result.
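By way of illustration only, the monitoring cycle described above (semantic rules, a baseline measurement, periodic re-execution and comparison against the baseline and the KPIs) may be sketched as follows. This is a minimal sketch under stated assumptions and not an implementation taken from any particular system: the rule names, the record fields `email` and `age`, the helper names `measure` and `compare_to_baseline`, and the KPI threshold of 0.9 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass(frozen=True)
class SemanticRule:
    """A named predicate that a record either satisfies or violates."""
    name: str
    check: Callable[[Record], bool]


def measure(records: List[Record], rules: List[SemanticRule]) -> Dict[str, float]:
    """Return, for each rule, the fraction of records that satisfy it."""
    results: Dict[str, float] = {}
    for rule in rules:
        passed = sum(1 for record in records if rule.check(record))
        # An empty data set is treated as fully compliant (an assumption).
        results[rule.name] = passed / len(records) if records else 1.0
    return results


def compare_to_baseline(
    baseline: Dict[str, float],
    current: Dict[str, float],
    min_kpi: Dict[str, float],
) -> Dict[str, Dict[str, object]]:
    """Report the change against the baseline and whether each KPI is still met."""
    report: Dict[str, Dict[str, object]] = {}
    for name, base_score in baseline.items():
        score = current[name]
        report[name] = {
            "delta": score - base_score,
            "kpi_met": score >= min_kpi[name],
        }
    return report


# Hypothetical semantic rules for illustration.
rules = [
    SemanticRule("email_present", lambda r: bool(r.get("email"))),
    SemanticRule(
        "age_in_range",
        lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    ),
]

# Hypothetical data as loaded during the initial load phase.
initial_load = [
    {"email": "a@example.com", "age": 34},
    {"email": "b@example.com", "age": 51},
    {"email": "c@example.com", "age": 27},
    {"email": "d@example.com", "age": 63},
]

# Hypothetical data at a later measurement, after some degradation.
later_extract = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 51},
    {"email": "c@example.com", "age": 130},
    {"email": "d@example.com", "age": 63},
]

min_kpi = {"email_present": 0.9, "age_in_range": 0.9}  # assumed thresholds

baseline = measure(initial_load, rules)
current = measure(later_extract, rules)
report = compare_to_baseline(baseline, current, min_kpi)
```

In this sketch every measurement re-evaluates every record in scope, so its cost grows with the total data volume rather than with the amount of data that has changed since the previous measurement.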
Data profiling and data quality monitoring are input/output (I/O) intensive and time-consuming operations. Therefore, for data quality profiling and data quality monitoring, data is typically extracted into a staging area in order to avoid performance degradation of an application due to the additional I/O requirements of data extraction. In some instances, applications do not allow direct access to the underlying database without using an application-specific mechanism. Another reason that the data is extracted into a staging area is to avoid functional issues for the application due to structured query language (SQL) statement concurrency issues, which are caused by conflicts between SQL statements created by the application and SQL statements created by the data profiling tool when both operate in the application database at the same time.
The initial full data extraction required for systems such as DW or business application consolidation often requires a full weekend, which may cause performance degradation of the application for an extended period of time due to the increased I/O requirements. For some systems it may be possible to periodically perform the extract over a weekend. For applications that operate continuously, such as e-commerce systems, or for other critical systems, finding a suitable time to perform the data extract may be difficult. If data quality monitoring is not done, degradation of data quality will remain undetected until the business processes break down or other business issues arise.
Currently known data quality monitoring techniques process all of the data that is in scope for the baseline measurement each time an additional measurement is performed. At the same time, the volume of data is constantly growing and the time window between two measurements is shrinking. As a result, one measurement might not be complete by the time the next one is scheduled to begin, which makes data quality monitoring difficult to perform.