Integrating large amounts of data into an existing schema creates significant challenges. These challenges are compounded when the incoming datasets arrive in numerous formats, from numerous sources, and with disparate content. Because accuracy is a major consideration, it is important to avoid misclassifying an “apple” in the incoming dataset as an “orange” in the existing schema. Such misclassification can happen where the metadata of the incoming dataset, such as the headers of columns that contain data, does not “match” the metadata in the existing schema. For example, the existing schema might call an entity name “Institution Name”, but an incoming dataset may call it “Organization”, creating ambiguity as to how the incoming data should be classified. Inaccuracies can arise even if the headers match. This can occur where the incoming dataset and the existing schema both use the header “Institution Name”, but the nature of the content in fact differs (e.g., one refers to a parent hospital system, the other to a single hospital).
Moreover, a static schema, such as one developed through a central data warehouse initiative to organize a fixed number of data sources, cannot readily evolve. This limits the system's ability to adapt to fast-changing developments, to scale to meet an ever-growing amount of data, and to expand its intelligence by self-learning over multiple iterations. In addition, a static schema constrains the broadening of the “vocabulary” of attributes, which would otherwise yield more powerful analytics. Inaccuracies that arise can also compromise fundamental data integrity, rendering the output of downstream algorithms that depend on this data potentially unreliable.
What is needed is a system and method that can recognize and integrate attributes of incoming datasets, enabling schemas to operate with flexibility and evolve with maximum accuracy and data integrity.