This description relates to processing related datasets.
A dataset is a collection of data that is stored, for example, in a data storage system hosted on any number of physical storage media (e.g., stored in a database hosted on one or more servers). Properties of a dataset such as its structure and storage location(s) can be described, for example, by an entity such as a file or other form of object (e.g., an object stored in an object-oriented database). In some cases, an entity describing a particular dataset (e.g., a file) also stores the data in that dataset. In other cases, an entity describing a particular dataset (e.g., an object pointing to a database table) does not necessarily store all of the data in that dataset, but can be used to locate the data stored in one or more locations in a data storage system.
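As a minimal sketch of the two cases above, a describing entity might be modeled as follows. The class name `DatasetDescriptor` and all of its attributes are hypothetical and chosen only for illustration; the description does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetDescriptor:
    """Hypothetical entity describing a dataset's structure and location(s)."""
    name: str
    field_names: List[str]
    # Where the data can be found (e.g., table names or file paths).
    locations: List[str] = field(default_factory=list)
    # Populated only when the entity itself also stores the data.
    inline_rows: Optional[List[Tuple]] = None

    def stores_data(self) -> bool:
        """Return True if the entity holds the data rather than pointing to it."""
        return self.inline_rows is not None


# An entity that also stores the data (analogous to a file).
file_like = DatasetDescriptor(
    name="customers",
    field_names=["customer_id", "name"],
    inline_rows=[(1, "Ada"), (2, "Grace")],
)

# An entity that only points at the data (analogous to an object
# referring to a database table); the location string is made up.
pointer_like = DatasetDescriptor(
    name="transactions",
    field_names=["txn_id", "customer_id"],
    locations=["db://example/transactions"],
)
```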
The data in a dataset may be organized using any of a variety of structures including a record structure that provides individual records with values for respective fields (also called “attributes” or “columns”), including possibly a null value (e.g., indicating that a field is empty). For example, the records can correspond to rows in a database table of a database system, or rows in a spreadsheet or other flat file. To access records stored in a given format, a data processing system typically starts with some initial format information describing characteristics such as the names of fields, the order of fields in a record, the number of bits that represent a field value, and the type of a field value (e.g., string, signed/unsigned integer). In some circumstances, the record format or other structural information of the dataset may not be known initially and may be determined after analysis of the data.
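The role of initial format information can be illustrated with a short sketch that uses Python's standard `struct` module to read fixed-width binary records. The field names, widths, byte order, and the convention of treating an empty string as a null value are all assumptions made for this example, not details taken from the description.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    struct_code: str  # struct format code giving the width and type of the field


# Hypothetical record format: field names, their order, bit widths, and types.
RECORD_FORMAT = [
    FieldSpec("customer_id", "I"),  # unsigned 32-bit integer
    FieldSpec("balance", "i"),      # signed 32-bit integer
    FieldSpec("name", "8s"),        # 8-byte string, padded with zero bytes
]


def parse_records(data, fields):
    """Decode fixed-width records from `data` using the given format information."""
    # "<" selects little-endian byte order with standard sizes and no padding.
    layout = struct.Struct("<" + "".join(f.struct_code for f in fields))
    records = []
    for values in layout.iter_unpack(data):
        record = {}
        for spec, value in zip(fields, values):
            if isinstance(value, bytes):
                # Assumption: an all-padding string field represents a null value.
                value = value.rstrip(b"\x00").decode("ascii") or None
            record[spec.name] = value
        records.append(record)
    return records


raw = struct.pack("<Ii8s", 7, -25, b"Ada") + struct.pack("<Ii8s", 8, 100, b"")
rows = parse_records(raw, RECORD_FORMAT)
```

Without `RECORD_FORMAT`, the same 32 bytes could not be split into records or fields, which is why a system that lacks format information may first need to analyze the data to infer it.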
Datasets can be related to each other in any of a variety of ways. For example, a first dataset corresponding to a first table in a database can include a field that has a primary key/foreign key relationship to a field of a second table in the database. The primary key field in the first table may include values that uniquely identify rows in the first table (e.g., customer ID values uniquely identifying rows corresponding to different customers). The second table (e.g., a table whose rows correspond to transactions) may contain a foreign key field that corresponds to the primary key field in the first table, and the rows in the second table may use one of those unique values in the foreign key field to identify each of the one or more rows that represent transactions made by a given customer. Preserving referential integrity among multiple datasets can include preserving relationships between different fields, including foreign key/primary key relationships, or other relationships for which a value in a field of one dataset depends on a value in a field of another dataset.
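The primary key/foreign key relationship described above can be checked with a short sketch such as the following. The function name, the row representation (a list of dictionaries), and the choice to allow a null foreign key are assumptions made for illustration only.

```python
def check_referential_integrity(parent_rows, child_rows, primary_key, foreign_key):
    """Return (duplicate primary key values, child rows with no matching parent)."""
    seen = set()
    duplicates = set()
    for row in parent_rows:
        key = row[primary_key]
        if key in seen:
            # A primary key value must identify exactly one parent row.
            duplicates.add(key)
        seen.add(key)

    # Assumption: a null (None) foreign key is permitted and is not an orphan.
    orphans = [
        row for row in child_rows
        if row[foreign_key] is not None and row[foreign_key] not in seen
    ]
    return duplicates, orphans


customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"txn_id": 10, "customer_id": 1},
    {"txn_id": 11, "customer_id": 3},     # refers to a customer that does not exist
    {"txn_id": 12, "customer_id": None},  # null foreign key
]

duplicates, orphans = check_referential_integrity(
    customers, transactions, "customer_id", "customer_id"
)
```

Referential integrity is preserved in this sense when both returned collections are empty. In the sample data, the transaction referring to customer 3 is reported as an orphan.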