The present disclosure relates generally to business intelligence and, more particularly, to determining reliability of data reports.
Most large enterprises invest in a data warehouse to consolidate critical data. Such a data warehouse is used to facilitate reporting, analysis and decision-making systems. The data warehouse is fed from the operational systems of the enterprise, which are used to process day-to-day transactions. Once in the data warehouse, the information is then moved to domain-specific data marts and is available from there for analytical reporting. The reports help the enterprise and external regulators see trends, risk exposure and other data.
The extraction of data from operational systems and its placement into the data warehouse is usually done using an Extract, Transform and Load (ETL) tool, an example of such a tool being IBM® InfoSphere® DataStage®. The movement of data from the warehouse to a data mart is done with a similar tool. The reports are designed and run using a data reporting tool, an example of such a tool being IBM® Cognos® Enterprise.
In certain scenarios, developing the warehouse, populating it, moving the data to a mart and then creating the necessary reports is a large and complex project. In many cases, dozens of developers are needed to develop, test and maintain the ETL code that is needed to produce the final reports. Also associated with the project are analysts, data stewards, data modelers, enterprise architects and project managers. These, combined with the ETL and other developers, result in very large teams that are dedicated to the reporting project.
The flow and transformation of information from the operational systems to the reports via the warehouse and marts is very complex. The data may flow through reporting layers, OLAP layers, data marts, data warehouses, staging databases, intermediate files, file transfers, ETL processes and operational data stores. Within the enterprise no single person may be able to understand this flow in its entirety.
Consider a report that needs to be delivered to government regulators: the enterprise needs to provide associated information that convinces the regulators that the results are indeed accurate and reliable. Since no single person may understand the data flow in its entirety, it is exceedingly challenging for an enterprise to validate the entire data flow and therefore the report's accuracy and reliability. Doing so requires validating every step of the data lifecycle, including verifying that the ETL code is moving and transforming the data as designed, verifying that the code is accessing and aggregating the data as designed, and verifying that the data sources used throughout the flow do not have any quality issues.
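By way of illustration only, the step-by-step validation described above may be pictured as a walk over a lineage graph, starting at the report and following each dependency back toward the operational systems. The following Python sketch is not taken from the disclosure. The asset names, the `LINEAGE` and `VALIDATED` mappings, and both function names are hypothetical and are assumed solely for the example.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from.
LINEAGE = {
    "regulatory_report": ["risk_mart"],
    "risk_mart": ["warehouse"],
    "warehouse": ["staging_db"],
    "staging_db": ["orders_system", "payments_system"],
    "orders_system": [],
    "payments_system": [],
}

# Hypothetical record of which assets have passed validation.
VALIDATED = {
    "regulatory_report": True,
    "risk_mart": True,
    "warehouse": True,
    "staging_db": False,
    "orders_system": True,
    "payments_system": True,
}


def upstream_assets(lineage, report):
    """Return every asset the report depends on, directly or indirectly,
    in breadth-first order (nearest dependencies first)."""
    ordered = []
    visited = set()
    queue = deque(lineage.get(report, []))
    while queue:
        asset = queue.popleft()
        if asset in visited:
            continue  # an asset may be reachable along several paths
        visited.add(asset)
        ordered.append(asset)
        queue.extend(lineage.get(asset, []))
    return ordered


def unvalidated_steps(lineage, validated, report):
    """List the report and upstream assets that have not been validated.
    An asset with no validation record is treated as unvalidated."""
    chain = [report] + upstream_assets(lineage, report)
    return [asset for asset in chain if not validated.get(asset, False)]


if __name__ == "__main__":
    print(unvalidated_steps(LINEAGE, VALIDATED, "regulatory_report"))
```

Under these assumed inputs, a single unvalidated staging database is enough to leave the report's reliability unestablished, even though every other step in the flow has been checked.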
Accordingly, data quality issues reduce the reliability of reports, and every enterprise has data quality issues to some extent. Decision-makers reading the reports therefore need to know how reliable the report data is.