Big data involves collecting and analyzing large-scale, complex datasets at high velocity. Big data may involve datasets of such vast scale that spotting trends or outcomes requires advanced application of analytic data science or knowledge processing (e.g., artificial intelligence). Big data may include training machine learning algorithms, including neural network models, to predict or classify data. At the outset of a big data analysis process, data processing systems must identify which data to analyze to serve a particular analysis goal or topic (i.e., a desired outcome). To meet an analysis goal, data processing systems may face challenges distinguishing datasets relevant to that goal from irrelevant datasets. Further, data processing systems may be unable to distinguish actual data from synthetic data. Problems may arise from data inundation, data redundancy, unknown data lineage, and/or unknown data properties. That is, big data faces challenges arising from collecting too much data, which become costly to analyze, collecting overlapping or redundant data, and collecting unclassified, unmapped, or unlabeled data.
In the field of big data, individual or institutional data processing systems may collect large amounts of data from communication systems, user devices, health systems, transactional systems, transportation systems, medical systems, biological systems, climate systems, environmental systems, educational systems, demographic monitoring systems, water systems, government systems, or other systems. These data may address data analysis goals in science, engineering, human health, demographics, finance, business, medicine, human behavior, education, governance, regulation, environmental management, or other topics. Data processing systems may collect these data continuously or periodically. For example, a data processing system may collect hourly weather data, user device data, and demographic data to determine optimal traffic control patterns in a region. In addition, data processing systems may acquire discrete blocks of data from third parties (i.e., data dumps). For example, a merchant may purchase datasets that include hundreds of millions of transaction records and seek to identify consumer trends related to particular products.
Problems accompany such large-scale data collection efforts. Data processing systems may redundantly gather the same data multiple times from the same or different sources. Further, data processing systems may gather data that produce no additional benefit for data analysis goals. In many cases, received data may be unlabeled, with an unknown data schema or other unknown data properties. A data processing system may not receive information indicating whether received data comprise actual data or synthetic data. As data processing systems or human data managers change, the systems or human managers may lose or forget properties of the datasets that the systems collect. In some cases, the need for these data may change as data analysis goals change, so that the amount or frequency of collected data may no longer be appropriate.
Conventional approaches to big data analysis involve applying machine learning models or statistical models to received datasets for data prediction or data classification. For example, big data may involve predicting or classifying data using neural network models (e.g., recurrent neural networks, convolutional neural networks), feedforward models, deep learning models (e.g., long short-term memory models), random forest models, regression models, or other models. However, these approaches often do not include upfront data management approaches to address the data collection and data analysis inefficiencies noted above. Instead, faced with data inundation, data redundancy, unknown data lineage, and/or unknown data properties, conventional approaches typically just retrain models with each newly received dataset.
Thus, conventional approaches lead to wastefully escalating computations. Conventional approaches often fail to identify datasets relevant to a data analysis goal prior to analysis. In some cases, conventional systems cannot determine which datasets are related (connected), which overlap, which comprise actual data, or which comprise synthetic data. As a result, conventional data processing systems do not distinguish useful data collection efforts from wasteful ones, leading to inefficient resource use during data collection.
This data collection inefficiency leads to downstream data analysis inefficiencies. Conventional data processing systems waste computing resources analyzing data that are not useful to address a particular need or that are redundant (i.e., analyzing data that the system already analyzed). Conventional systems may discard or ignore data because the data lack labels or have an unknown data schema, thereby wasting computing resources. Thus, conventional systems may waste valuable computing resources by collecting data that no longer serve any data analysis goal or by collecting useful information that goes unanalyzed. Alternatively, conventional systems may collect data sub-optimally, i.e., these systems may fail to recognize that an increase in the amount or frequency of data collected may better serve a data analysis goal.
Therefore, in view of the shortcomings and problems with existing methods, there is a need for improved data processing systems and methods for processing big data characterized by data redundancy, unknown data lineage, and/or unknown data properties. New approaches to data processing are needed that efficiently collect and analyze data by identifying connected datasets, distinguishing actual data from synthetic data, and identifying data lineage.