Certain types of data can be stored as two corresponding sets of data files. The first set includes object content files that contain the object content itself, while the second, corresponding set includes index files related to the object content files. The index files can also include metadata associated with the object content files.
It is desirable to ensure that the object content files and the index files are consistent with, and correspond to, one another. In other words, it is desirable to know if an index file exists in the second set of data files when there is no corresponding object content file in the first set of data files. Likewise, it is desirable to determine if an object content file exists that has no corresponding index file in the second set of data files. Either situation can be an indication of corrupted data.
In order to determine if either scenario exists, and in an attempt to ensure that the sets of data files are consistent and correspond with one another, the first set of data files may be compared with the second set of data files. One way to compare the two sets is to perform a type of data join operation that identifies data files present in only one of the two sets, e.g., an anti join operation on the two sets of data files. However, there may be billions or even trillions of data files, so such an anti join operation may take weeks or even months to complete. Such an operation might also require allocation of large amounts of memory, which might require garbage collection in certain configurations. Garbage collection can utilize large amounts of CPU time and, therefore, negatively affect performance.
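By way of illustration only, the anti join comparison described above can be sketched in Python as follows. The sketch assumes that each object content file and its index file share a common key (for example, a file name without its extension); that assumption, the function name anti_join, and the sample keys are hypothetical and are not taken from any particular implementation. The sketch also holds both key sets in memory at once, which is the kind of allocation that becomes impractical at the scale of billions of files.

```python
from typing import Iterable, List, Tuple


def anti_join(
    content_keys: Iterable[str], index_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return the keys that appear in only one of the two sets.

    The first list holds keys of object content files that have no
    index file; the second holds keys of index files that have no
    object content file. Keys are a hypothetical shared identifier.
    """
    content = set(content_keys)  # entire first set held in memory
    index = set(index_keys)      # entire second set held in memory
    content_without_index = sorted(content - index)
    index_without_content = sorted(index - content)
    return content_without_index, index_without_content


# Hypothetical sample keys: "a" lacks an index file, "d" lacks content.
content_only, index_only = anti_join(["a", "b", "c"], ["b", "c", "d"])
```

A non-empty result in either list would correspond to one of the two inconsistent situations described above.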
Additionally, some data joins and anti joins may be performed where the data files are partitioned into subsets of data files. Such partitioning may be performed based upon a partitioning key. However, if the key is very large relative to the number of files, then partitioning the data files may not result in sufficiently small subsets of data files on which to perform data join or anti join operations.
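A partitioned anti join of the kind mentioned above might be sketched as follows, again for illustration only. The sketch assumes a hash-based partitioning key derived from a shared file key; the function names, the choice of SHA-256, and the partition count are all hypothetical. Because matching keys always hash to the same partition, each partition can be compared independently of the others.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key to a partition number using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition(keys: Iterable[str], num_partitions: int) -> Dict[int, Set[str]]:
    """Group keys into subsets by partition number."""
    buckets: Dict[int, Set[str]] = defaultdict(set)
    for key in keys:
        buckets[partition_of(key, num_partitions)].add(key)
    return buckets


def partitioned_anti_join(
    content_keys: Iterable[str],
    index_keys: Iterable[str],
    num_partitions: int = 4,
) -> Tuple[List[str], List[str]]:
    """Anti join performed one partition at a time."""
    content_parts = partition(content_keys, num_partitions)
    index_parts = partition(index_keys, num_partitions)
    content_without_index: Set[str] = set()
    index_without_content: Set[str] = set()
    for p in range(num_partitions):
        content = content_parts.get(p, set())
        index = index_parts.get(p, set())
        content_without_index |= content - index
        index_without_content |= index - content
    return sorted(content_without_index), sorted(index_without_content)
```

In this sketch the size of each subset depends on how evenly the partitioning key spreads the files, so a poorly suited key can still leave subsets too large to process comfortably.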
The disclosure made herein is presented with respect to these and other considerations.