In recent years, large amounts of data have been accumulated by organisations at different levels. With the ever-increasing volume of data, and driven by digital transformation and the adoption of Internet of Things (IoT) and Social Media, Analytics and Cloud (SMAC) technologies, most organisations are trending towards consolidation of data from various data sources, such as real-time and batch data sources, into singular stores. In today's digital era, data is acquired from various sources, such as databases, live feeds, or click-stream data. The acquired data is stored in its native form in a storage repository or data lake. The data lake has the potential to transform business by providing a singular repository for all types of data, such as structured and unstructured data, and internal and external data. The availability of such a singular repository may enable business analysts and data science teams to mine and exploit all the data that is scattered across a multitude of operational systems, data warehouses, and data marts. However, efficiently integrating different types of data sources is a troublesome, extremely error-prone, and challenging process today. Often, organisations employ only basic checks, or even no checks, to ensure that the quality of data entering from upstream sources is good.
Existing technologies perform data acquisition and data quality monitoring on structured data or data from relational databases, which may be sequential and whose quality can be assessed by normalizing the data. However, data acquisition cannot be performed on heterogeneous data sources when the type, nature, or structure of the data is not known. For example, the existing techniques do not work when large volumes of received data streams contain a mixture of structured data, semi-structured data, quasi-structured data, and unstructured data. Also, most data quality measuring methods in the existing scenario focus only on structured databases or relational databases. Often, root-cause analysis is performed only after bad results are discovered. This technique is extremely expensive, cumbersome, or even impossible given the volume and speed with which data is pushed into data lakes.
The information disclosed in this background of the disclosure section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.