A blockchain may be used as a public ledger to store any type of information. Although primarily used for financial transactions, a blockchain can store any type of information, including assets (e.g., products, packages, services, status, etc.), and it stores that information securely in its immutable ledger.
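As a rough illustration of why such a ledger is described as immutable, the following minimal sketch links each record to the hash of the record before it, so that a later change to any stored record can be detected. It is a single-process toy under stated assumptions, not a description of any particular blockchain: the names block_hash, append_block and verify_chain and the sample asset payloads are hypothetical, and a real blockchain additionally relies on replication and consensus among peers.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for "no previous block"


def block_hash(index, payload, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block whose hash covers the previous block's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": block_hash(index, payload, prev_hash),
        }
    )


def verify_chain(chain):
    """Return True only if every block still matches its recorded hash."""
    prev_hash = GENESIS_HASH
    for position, block in enumerate(chain):
        if block["index"] != position or block["prev_hash"] != prev_hash:
            return False
        recomputed = block_hash(block["index"], block["payload"], block["prev_hash"])
        if block["hash"] != recomputed:
            return False
        prev_hash = block["hash"]
    return True


# Hypothetical asset records, used only for illustration.
ledger = []
append_block(ledger, {"asset": "package-17", "status": "shipped"})
append_block(ledger, {"asset": "package-17", "status": "delivered"})
intact_before = verify_chain(ledger)

# Editing a stored record in place breaks the hash check.
ledger[0]["payload"]["status"] = "lost"
intact_after = verify_chain(ledger)
```

In this sketch the chain verifies before the edit and fails verification afterwards, because the stored hash of the first block no longer matches its contents.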
Data analytics engines are becoming increasingly popular in the enterprise. Cognitive solutions rely on a wide portfolio of analytics tools, all of which are based on large amounts of data. As a result, data is becoming more valuable. However, as the value of data continues to increase, so does the need to protect the data and to prove that the insights being derived from the data are valid. Many enterprises offer APIs and services that require users to send their data with the goal of providing insights. However, data trust models increasingly require proof of data provenance: where the data comes from, whether it originates from trusted or untrusted sources, and what services affected the data en route to its final destination. This may be referred to as data analytics provenance. It may also include determining what algorithms and transformations were used to derive the insights/results being sent back to the client. Concerns regarding the data and the software stack used with the data require end-to-end provenance of the analytical data results.
In a traditional data flow for a traditional analytics engine, there are several data sources, which may be trusted or untrusted sensors/agents. The data is sent to an ingestion portal through the Internet/world-wide-web, where the data is transformed and ingested into a data store (e.g., DB2, HDFS, etc.). Next, the analytics engine queries the data and applies machine learning/data mining algorithms that yield reports, insights, etc. This process is insecure, since new attacks have emerged that try to pollute/manipulate the data insights that result from those analytics engines. Such attacks have led to the use of adversarial machine learning, where classifiers are trained to ignore, detect, or withstand such attacks against the algorithms. In adversarial machine learning, most of the attacks attempt to have malicious data processed or to tamper with the data that is being analyzed, which results in a demand for data provenance in order to establish some form of trust for auditing purposes. Adversarial machine learning works on the premise that attackers may tamper with the data, so there must be ways to protect the data, to use different classifiers together to withstand attacks, and to build secure algorithms.
One key way of preventing attacks on analytics engines is identifying malicious data sources by tracing data paths. Data provenance may include logging specific results, and the logged data may be stored in a centralized database. Simple tags are generated at each point to record when data was provided to the system, where the data was stored, what algorithms were used to process the data, and what results are associated with each algorithm, such as expected results. However, concerns over data integrity still exist, since data tampering can happen when the data is stored in a centralized database. Even if the database is distributed, data can still be tampered with and compromised. Similarly, most provenance schemes only store checkpoints or simple metadata that reflect what data changes/transformations have occurred.
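The tagging scheme described above, and its weakness, can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation of any particular provenance system: the ProvenanceTag fields, the log_tag helper, the in-memory list standing in for the centralized database, and the sample sensor and algorithm names are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class ProvenanceTag:
    """A simple tag recorded at one point in the data path (hypothetical schema)."""
    stage: str                 # e.g. "ingest", "store", "process"
    detail: str                # source, storage location, or algorithm name
    expected_result: str = ""  # optional expected outcome of an algorithm


# Stand-in for a centralized database table of provenance tags.
central_log = []


def log_tag(stage, detail, expected_result=""):
    """Record one provenance tag in the centralized log."""
    central_log.append(asdict(ProvenanceTag(stage, detail, expected_result)))


# Tags for one hypothetical data path.
log_tag("ingest", "sensor-A via ingestion portal")
log_tag("store", "HDFS")
log_tag("process", "k-means clustering", expected_result="3 clusters")

# Anyone with write access to the central store can rewrite history.
original_source = central_log[0]["detail"]
central_log[0]["detail"] = "sensor-B via ingestion portal"

# The log is still well formed; nothing in it reveals the edit.
stages = [tag["stage"] for tag in central_log]
```

After the in-place edit the log still contains three well-formed tags in the same order, which is the integrity concern raised above: plain tags in a mutable store carry no evidence of whether they have been altered.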