The present disclosure relates generally to data preparation and analysis. More particularly, techniques are disclosed for performing similarity metric analysis and data enrichment using knowledge sources.
Before “big data” systems can analyze data to provide useful results, the data needs to be added to the big data system and formatted such that it can be analyzed. This data onboarding presents a challenge for current cloud and “big data” systems. Typically, data being added to a big data system is noisy (e.g., the data is formatted incorrectly, is erroneous or outdated, includes duplicates, etc.). When the data is analyzed (e.g., for reporting, predictive modeling, etc.), the poor signal-to-noise ratio of the data means the results are not useful. As a result, current solutions require substantial manual processes to clean and curate the data and/or the analyzed results. However, these manual processes cannot scale. As the amount of data being added and analyzed increases, the manual processes become impossible to implement.
Big data systems may be implemented to analyze data to identify other, similarly related data. Processing large volumes of data becomes a challenge. Further, the structure of the data being analyzed, or the lack thereof, may pose greater challenges for determining the content and relationships of the data.
Machine learning may be implemented to analyze the data. For example, unsupervised machine learning may be implemented using a data analysis tool (e.g., Word2Vec) to determine similarities among data; however, unsupervised machine learning may not be able to provide information indicating a group or category associated with closely related data. Thus, unsupervised learning may be unable to determine a genus or category of a set of species (e.g., terms) that are closely related. On the other hand, supervised machine learning based on a curated knowledge source (e.g., YAGO, from the Max Planck Institute for Informatics) may provide better results for determining a group or a category for data. However, supervised learning may also provide inconsistent and/or incomplete results. Data provided by a curated knowledge source may be sparse, and its quality may depend on the curator. Categories identified using supervised learning may not correctly categorize similarly related data. Multiple knowledge sources may implement different categorizations, such that the sources can be difficult to merge. Analyzing data to determine similarities and relationships may also become more burdensome due to misspellings of terms in the data, since similar data may not be easily identified when it contains misspellings.
Certain embodiments of the present invention address these and other challenges.