Technical Field
This invention relates to file metadata and data storage systems, and more specifically to the automated creation of metatags associated with objects in an object storage system.
Related Art
There are a variety of generally known data storage systems, including object-based storage systems, commonly referred to as object stores. Object stores are massively scalable, well beyond traditional file system storage devices, in both raw capacity and the number of storable items (objects). Object stores include redundancy and scalability mechanisms that are entirely software-based, which allows object stores to run on commodity, non-specialized hardware with high reliability and consistent performance. Further, object stores allow each object to contain both the data (the sequence of bytes representing the object contents) and metadata (a set of attributes describing the data), making it easier to search objects and locate specific contents than in traditional file systems. These properties make object stores highly flexible and desirable platforms for a variety of needs where storage requirements are largely unbounded, or where unstructured data is collected and may later be accessed for arbitrary purposes.
A large amount of useful metadata is automatically generated and exists within files of numerous content types. For example, standard office documents frequently contain properties such as Title, Subject, Author, Company, etc., and JPEG images contain information on the capturing apparatus and image properties (e.g., JPEGs frequently contain both EXIF and XMP metadata information). There are systems (both proprietary and open-source) for extracting known metadata information from numerous content types. These systems are designed to parse files for known metadata locations and values and to output the raw metadata from each file.
A metatag is a metadata key paired with a value (written as key=value, key/value, or name/value). Applying metatags to objects greatly increases the ability to search and use the content in the object store. However, in all but the smallest of cases, manually applying metatag information to each object passed into the storage repository is infeasible. Under most circumstances, a manual process would be prohibitively time consuming and highly susceptible to error.
Since there is no overarching standard for metadata naming conventions, the raw output of metadata extraction from a plurality of content types often contains related information under different key names. For example, one file type may designate an “Author”, while another may designate a “Creator”, even though the values contain the same information. Similarly, the same file may contain more than one metadata key with the same or similar (and potentially conflicting) values (e.g., JPEGs frequently contain three different metadata keys to convey the F-Stop, with different formatting of the value for each key).
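The naming problem described above can be illustrated with a minimal sketch. It assumes a hand-written alias table that maps raw extracted key names onto a single canonical key. The table contents, the canonical key names, and the function name normalize_metadata are all hypothetical and chosen only for illustration; they are not taken from any particular extraction system.

```python
# Hypothetical alias table: raw key name (lowercased) -> canonical key.
# The entries are illustrative assumptions, not a real standard.
KEY_ALIASES = {
    "author": "author",
    "creator": "author",
    "fnumber": "f_stop",
    "aperturevalue": "f_stop",
}


def normalize_metadata(raw):
    """Collapse raw extracted metadata into canonical metatags.

    Keys are matched case-insensitively. Keys missing from the alias
    table are dropped. The first value seen for a canonical key is kept;
    later values that differ are reported as conflicts.
    """
    tags = {}
    conflicts = {}
    for key, value in raw.items():
        canonical = KEY_ALIASES.get(key.strip().lower())
        if canonical is None:
            continue  # unknown key: no canonical name to store it under
        value = str(value).strip()
        if canonical not in tags:
            tags[canonical] = value
        elif tags[canonical] != value:
            conflicts.setdefault(canonical, []).append(value)
    return tags, conflicts
```

With this table, a file exposing "Author" and a file exposing "Creator" both yield the same canonical "author" metatag, while a single file carrying two differently formatted F-Stop values yields one "f_stop" metatag plus a recorded conflict that a later step could resolve.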
Compounding these problems are restrictions imposed by the object store on metatags, such as the allowable number of metatags per object, the permitted values, and the size of each key and value; these restrictions vary between object store technologies. Therefore, there is a need for methods and systems that can overcome the challenges of associating useful metadata, in a normalized fashion, with large and varying collections of unstructured data having a plurality of content types as they are deposited into an object store.
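The effect of such store-imposed restrictions can be sketched as follows. The limit fields, the TagLimits and fit_to_limits names, and the policy of skipping oversized keys and truncating oversized values are assumptions made for illustration; real object stores define their own limits (often in bytes rather than characters) and a different policy may be preferable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagLimits:
    """Hypothetical per-object metatag limits for one object store."""
    max_tags: int
    max_key_len: int    # measured in characters here, as a simplification
    max_value_len: int  # measured in characters here, as a simplification


def fit_to_limits(tags, limits):
    """Return a copy of tags that respects the given limits.

    Tags are considered in insertion order. Keys that are too long are
    skipped, values that are too long are truncated, and processing
    stops once the maximum number of tags has been accepted.
    """
    fitted = {}
    for key, value in tags.items():
        if len(fitted) >= limits.max_tags:
            break
        if len(key) > limits.max_key_len:
            continue  # a truncated key would no longer be searchable by name
        fitted[key] = value[:limits.max_value_len]
    return fitted
```

Because the limits are passed in as data, the same normalized metatags could be fitted to different object stores by supplying a different TagLimits instance for each.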