The present disclosure relates generally to the field of data alignment. More specifically, the present disclosure relates to systems and methods for aligning data models or schemas (e.g., of different databases) so that the information stored under those data models (schemas) may be quickly used in conjunction with other available data once that information becomes available. In some embodiments, this use of the data contained in multiple databases may be further facilitated by the creation of an ontology representing the knowledge inherent in the results of the semantic merger of the relational structure and elements of each of the database models.
The data model of a relational database, alternatively known as a database schema, may be expressed in a metadata language that defines the database structure, including the definitions of tables for organizing rows of data, where the columns of each table (also known as table fields) define the constituent data elements for each row of data in the table. Each table often has a primary key comprising a logical subset of table fields that is used to index into and sequence the data rows of the table. Relations between tables are defined using foreign keys in the referencing table that correspond to primary keys in the referenced table. The database schema may thus define both the internal structure of the database tables holding rows of data and the relations between them.
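By way of illustration only, the schema structure described above (tables, fields, primary keys, and foreign keys) can be sketched as a small in-memory model. The class names `ForeignKey`, `Table`, and `Schema`, the `validate` method, and the example tables `practitioner` and `office_hours` are hypothetical and chosen for this sketch; they are not taken from the disclosure or from any particular database product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ForeignKey:
    """Fields in the referencing table that point at another table's primary key."""
    fields: tuple[str, ...]
    referenced_table: str
    referenced_fields: tuple[str, ...]


@dataclass
class Table:
    name: str
    fields: dict[str, str]            # field name -> data type
    primary_key: tuple[str, ...]      # logical subset of the field names
    foreign_keys: list[ForeignKey] = field(default_factory=list)


@dataclass
class Schema:
    tables: dict[str, Table]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the schema is consistent."""
        problems = []
        for table in self.tables.values():
            # Every primary key field must be a field of the table.
            for name in table.primary_key:
                if name not in table.fields:
                    problems.append(f"{table.name}: primary key field {name!r} is not defined")
            for fk in table.foreign_keys:
                # Every foreign key field must be a field of the referencing table.
                for name in fk.fields:
                    if name not in table.fields:
                        problems.append(f"{table.name}: foreign key field {name!r} is not defined")
                # The foreign key must match the primary key of the referenced table.
                target = self.tables.get(fk.referenced_table)
                if target is None:
                    problems.append(f"{table.name}: referenced table {fk.referenced_table!r} is not defined")
                elif fk.referenced_fields != target.primary_key:
                    problems.append(f"{table.name}: foreign key does not match primary key of {target.name}")
        return problems


# Hypothetical example: practitioners and their office hours.
practitioner = Table(
    name="practitioner",
    fields={"practitioner_id": "INTEGER", "name": "TEXT", "specialty": "TEXT"},
    primary_key=("practitioner_id",),
)
office_hours = Table(
    name="office_hours",
    fields={"practitioner_id": "INTEGER", "weekday": "TEXT", "opens": "TEXT", "closes": "TEXT"},
    primary_key=("practitioner_id", "weekday"),
    foreign_keys=[ForeignKey(("practitioner_id",), "practitioner", ("practitioner_id",))],
)
schema = Schema(tables={"practitioner": practitioner, "office_hours": office_hours})
```

In this sketch the relation between the two tables is carried entirely by the foreign key on `office_hours`, so removing the `practitioner` table from the schema would cause `validate` to report a dangling reference.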
Methods to align multiple database schemas into an overall schema may merge schema elements that are common (e.g., by determining equivalent semantic relationships between the different schemas, their elements, and their structure, and representing the aligned results in a machine-interpretable form). If there are no common elements, then there is no alignment, only aggregation.
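The distinction between alignment and aggregation can be illustrated with a minimal sketch. It assumes, purely for illustration, that two schema elements are treated as common when their names match after normalization; this name matching is a simple stand-in for the determination of semantic equivalence and is not the method of the present disclosure. The function names `normalize`, `common_elements`, and `combine`, and the example schemas, are hypothetical.

```python
def normalize(name: str) -> str:
    """Lowercase a name and drop separators, so 'Office_Hours' matches 'officehours'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def common_elements(schema_a, schema_b):
    """Return pairs of (table, field) elements whose normalized field names match."""
    # Index the fields of schema_b by normalized name for direct lookup.
    index_b = {}
    for table, fields in schema_b.items():
        for name in fields:
            index_b.setdefault(normalize(name), []).append((table, name))
    matches = []
    for table, fields in schema_a.items():
        for name in fields:
            for other in index_b.get(normalize(name), []):
                matches.append(((table, name), other))
    return matches


def combine(schema_a, schema_b):
    """Classify the combination as 'alignment' if any elements are common, else 'aggregation'."""
    matches = common_elements(schema_a, schema_b)
    kind = "alignment" if matches else "aggregation"
    return kind, matches


# Hypothetical schemas, each given as table name -> list of field names.
clinic = {"practitioner": ["practitioner_id", "name", "specialty"]}
directory = {"Provider": ["PractitionerID", "Specialty", "phone"]}
inventory = {"parts": ["part_no", "qty"]}

aligned_kind, aligned_matches = combine(clinic, directory)
aggregated_kind, aggregated_matches = combine(clinic, inventory)
```

Here `clinic` and `directory` share two elements and are therefore aligned, whereas `clinic` and `inventory` share none and can only be aggregated side by side. A practical system would likely replace `normalize` with a richer notion of equivalence, since differently named elements may carry the same meaning.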
The names used for the database schema elements may have an implied semantic meaning related to a domain of knowledge and contextual purpose. For example, a database and associated schema may contain data about health care practitioners, their specialties, qualifications, and office hours, with correspondingly named fields within the schema table definitions, as appropriate. To persons familiar with the field, the names or vocabulary used for the tables, fields, and relational foreign keys are likely to correspond to terms understood in the domain of knowledge for healthcare providers. Database schemas may have no explicit, machine-interpretable semantics defined; rather, other contextual information (specification documents, help files, etc.) may be used to develop an understanding of the meaning of the schema elements (e.g., table names, field names, relational names, etc.).
There are many operational domains where an alignment across databases would provide information about relationships between data that are not present in any single database. Data alignment methods described herein may help to uncover these relationships. Analysts often have a large number of sizable data sets that they need to search to discover information about relationships across data sets. In order for these relationships to be discovered, the data sets need to be aligned. For example, a location stored as latitude/longitude coordinates in one data set may need to be aligned with a location stored as a street address in another data set. Quickly aligning new data sets is further complicated by the fact that, in various applications, new data sets are added on a regular basis and there is often no way to automatically align them with the existing data. Rather, new data sets must often be manually aligned with existing data (e.g., by a human analyst manually inspecting the new data sets and determining how the new data should be aligned with the existing data). Such manual alignment is a time-consuming process that requires substantial resources (e.g., human resources) and delays the availability of potentially valuable new information for use in conjunction with previously available data.