Composite data files are known in many standards and used in a plurality of business domains. These data items comprise payload data, which is unique for each composite file, and metadata, where some metadata can be common across more than one composite file.
As a first example, medical images are typically stored in the well-known DICOM format, according to the DICOM standard. These files contain the payload data, i.e. the actual image as pixel data and other attributes related to the image, and several kinds of metadata, in particular demographic data of the patient, study attributes and series attributes. All medical image composite files of one series redundantly contain the same series attributes as metadata. If a study consists of more than one series, all files of that study contain the same patient demographic attributes.
A second example is sound data, which can be stored in the equally well-known MP3 format. Such files contain the payload data, for example a song as an MP3-encoded stream, and several kinds of metadata, such as composer, album, performing artist, publishing year, etc.
Storing data items in composite files is very useful, since those data items can be copied from one place to another without breaking consistency: the files are self-consistent. However, to manage the files, applications need a fast way to query or navigate the hierarchy of the composite files, e.g. to query all series of a particular patient, find all series of a study, or find all songs by a particular composer published last year. For better manageability, the typical approach is to use a database and to store the metadata there in a suitable form. This way, applications can use the database for browsing and management purposes, and access the files only when the payload data is needed.
This approach has proven itself and has long been best practice. However, it has some as yet unsolved limitations and drawbacks.
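The database approach described above can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module and a single hypothetical table, composite_file, holding one row per file with its patient, study and series identifiers. The table layout and the function names are assumptions made for illustration only and are not taken from the DICOM standard or from any particular product.

```python
import sqlite3


def build_index(records):
    """Create an in-memory metadata index.

    `records` is an iterable of (patient_id, study_uid, series_uid, path)
    tuples; this flat layout is a hypothetical simplification.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE composite_file ("
        " patient_id TEXT, study_uid TEXT, series_uid TEXT,"
        " path TEXT PRIMARY KEY)"
    )
    conn.executemany(
        "INSERT INTO composite_file VALUES (?, ?, ?, ?)", records
    )
    conn.commit()
    return conn


def series_of_patient(conn, patient_id):
    """Return all series of one patient without opening any file."""
    rows = conn.execute(
        "SELECT DISTINCT series_uid FROM composite_file"
        " WHERE patient_id = ? ORDER BY series_uid",
        (patient_id,),
    )
    return [row[0] for row in rows]


def files_of_series(conn, series_uid):
    """Return the file paths to open when the payload is actually needed."""
    rows = conn.execute(
        "SELECT path FROM composite_file"
        " WHERE series_uid = ? ORDER BY path",
        (series_uid,),
    )
    return [row[0] for row in rows]
```

In this sketch, browsing queries such as series_of_patient touch only the database, while files_of_series yields the paths that an application would open to read the payload data.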
First of all, updating metadata in the files is slow or (depending on the format) even impossible. When some metadata are changed by an application, those changes are first committed to the database and then, depending on the implementation, possibly written to the composite files. This second step is typically slow and very often not possible at all, for example in the DICOM standard, where the whole file would need to be rewritten.
Thus, the composite files alone do not always reflect the up-to-date metadata.
A further disadvantage is the slow rebuild of a database. The composite files must be parsed in order to extract the metadata from them, which is a slow process, especially when many files are involved. This scenario can be of interest when attaching a new database to an existing file archive, or in disaster recovery situations when the database was lost.
Disaster recoverability is, generally speaking, complex and costly. Backing up the composite files in a safe place is not sufficient; the database must also be backed up, because the composite files might not be up to date and rebuilding the database after a disaster might be slow (both explained above).
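The cost of such a rebuild can be seen in a minimal sketch. It assumes a hypothetical toy composite format (one line of JSON metadata followed by the raw payload), chosen only to keep the example self-contained; real formats such as DICOM or MP3 require a dedicated parser. The function names are likewise illustrative assumptions.

```python
import json
import os


def write_composite(path, metadata, payload):
    """Write a toy composite file: one JSON metadata line, then the payload."""
    with open(path, "wb") as f:
        f.write(json.dumps(metadata).encode("utf-8") + b"\n")
        f.write(payload)


def read_metadata(path):
    """Parse only the metadata line of a toy composite file."""
    with open(path, "rb") as f:
        return json.loads(f.readline().decode("utf-8"))


def rebuild_index(root):
    """Rebuild a path -> metadata index by visiting every file under root.

    Every file has to be opened and parsed once, so the run time grows
    with the number of files in the archive.
    """
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[path] = read_metadata(path)
    return index
```

Even in this simplified form, rebuild_index performs one open and one parse per file. Note also that it can only recover the metadata as stored in the files, which may be older than what the lost database contained.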
Additionally, a distributed system, which would provide access to the data from different geographical locations, is very complex and costly since the solution must consider both database and file system access.
A further disadvantage of the known systems is poor scalability, because the costs for large databases are high. These databases store the complete hierarchy information down to the filenames of the composite files.
Finally, applications which typically only need access to the composite files might deal with “out of date” information in the file system, so they always need to access the database to get the most up-to-date metadata for the composite files they use.