Record-oriented architectures have long been a tenet of data processing. Record structures are found in programming languages and in databases alike: data fields that are logically related are physically stored in the same record. The physical database is composed of a set of files, each file usually containing records of the same type.
The design of a database schema, which represents the logical level, is usually performed through the mapping of a conceptual schema (such as an Entity-Relationship schema) to a logical schema (e.g. a relational schema). In some models, and notably in relational models, files composed by records, which are in turn composed of data fields, are visible at the logical level, which is the level at which transactions on the database operate. Any change to the logical schema requires the modification of the transactions that operate on the portion of the logical schema that was modified.
The logical level is independent of transactions and describes only data and their relationships with no concern for performance. The physical level, instead, is dependent on the actual use of the database and hence on the transactions run on the database, their frequency of execution, their priority, etc. It is obviously desirable, and in fact a much stressed advantage of relational database systems, that the logical and physical levels be independent, so that tuning the physical level in order to improve performance does not require any changes to the transactions defined on the database.
We will use the term data clustering to denote the organization of data fields into records, and consequently in files. Data clustering has extremely relevant, if overlooked, consequences on performance when record structures are used to store data in secondary memory. In fact, the record size determines the access cost either directly, in the case of sequential or index-sequential scans, or indirectly, as a parameter that influences index height, for index structures.
Clustering of data into records is indeed a physical design choice, and different but logically equivalent clustering strategies produce different access costs. As an example consider that current logical design practices, such as mapping conceptual E-R schemata into relational schemata, encourage the creation of jumbo relations to represent entities: all single-valued attributes of an entity are stored in the same relation. Most transactions on the database only need a fraction of the attributes stored in such relations, but have to access the entire record anyway and as a consequence pay a potentially high overhead caused by useless data.
The best (i.e. less expensive in terms of computer resources) clustering scheme for a relation depends on the actual transactions run on the database, their frequency and their criticality. Therefore, data clustering is a decision to be taken at the physical level, rather than at the logical level as it currently happens.