To counter the exponential growth of data, organizations are leveraging Enterprise Content Management (ECM) content repositories to archive the data that holds value and to reduce the cost of running their businesses. For example, emails are archived in the ECM content repositories to move the disk storage demands from the email servers to the ECM content repositories. As part of this archival process, emails of high business value are identified and suitable record policies are applied (e.g., an email from a CEO about an acquisition is flagged with a hold policy for 10 years). In addition to emails, ECM content repositories are leveraged to store and access data from collaborative enterprise file shares or various servers. Leveraging ECM content repositories to maintain data from email servers, other servers, file shares, case management applications, etc., creates a problem of organizing the data for easy and quick discovery. The problem is magnified because the data from the above-mentioned silos is most often unstructured (e.g., text documents or files, presentations, spreadsheets, videos, audio, etc.) with only very basic native metadata (e.g., author, time of creation, location, file name, size).
ECM content repositories rely on metadata, including categorization or taxonomy metadata, to provide an organizing structure for content, such as documents, and to make the documents easy for humans to find, whether by searching the metadata or by browsing a taxonomy tree or category tree. The categorization and taxonomy metadata may be described as information that places content in a category or classification. This metadata is normally assigned or “attached” to a document at the time the document is ingested (i.e., processed by the ECM) or placed in a content repository, or at the time the document is moved from one logical location (e.g., a folder) in a content repository to another logical location. A content item may be located in more than one folder, but the content item is stored in the content repository only once, and each folder is metadata associated with that content item. For humans, though, a folder is a way to navigate through the repository, to find things by browsing, and to group like content items.
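The relationship between a content item and its folders can be illustrated with a minimal sketch. The class and method names below (ContentItem, Repository, ingest, file_in, move, browse) are hypothetical and are not taken from any particular ECM product; the sketch only assumes that folder membership is kept as metadata on an item that is stored once.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """A single stored item; folder membership is metadata, not a copy."""
    item_id: str
    data: bytes
    folders: set = field(default_factory=set)


class Repository:
    """Hypothetical in-memory repository keyed by item identifier."""

    def __init__(self):
        self._items = {}

    def ingest(self, item_id, data, folder):
        # Metadata (here, the folder) is attached at ingestion time.
        item = ContentItem(item_id, data, {folder})
        self._items[item_id] = item
        return item

    def file_in(self, item_id, folder):
        # Filing in a second folder adds metadata; the data is not duplicated.
        self._items[item_id].folders.add(folder)

    def move(self, item_id, source, target):
        # Moving between logical locations only rewrites the metadata.
        folders = self._items[item_id].folders
        folders.discard(source)
        folders.add(target)

    def browse(self, folder):
        # Browsing a folder is a lookup over the folder metadata.
        return sorted(
            item.item_id
            for item in self._items.values()
            if folder in item.folders
        )

    def stored_count(self):
        return len(self._items)
```

In this sketch, an item filed in two folders appears when either folder is browsed, while stored_count() still reports one stored item.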
A variety of techniques may be used to assign or attach the metadata. Some metadata may be learned automatically from the document itself, based on document properties. Other metadata may be derived, manually (by humans), or automatically (by systems which analyze text), based on the content of the document. Still other metadata may be assigned either automatically or manually based on external factors.
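As a rough illustration of the first two techniques, the sketch below reads native metadata from file properties and derives a category from document text. The function names, the metadata keys, and the keyword table are assumptions made for illustration; simple keyword matching stands in for whatever text analysis a real system would use.

```python
import os
from datetime import datetime, timezone


def native_metadata(path):
    """Metadata learned automatically from the document's own properties."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "size": info.st_size,
        "modified": modified.isoformat(),
    }


# Hypothetical keyword-to-category table; a real system would use
# a far richer text analysis than substring matching.
KEYWORD_RULES = {
    "acquisition": "Mergers and Acquisitions",
    "invoice": "Finance",
}


def derive_categories(text, rules=KEYWORD_RULES):
    """Metadata derived automatically from the content of the document."""
    lowered = text.lower()
    return sorted(
        {category for keyword, category in rules.items() if keyword in lowered}
    )
```

Metadata assigned based on external factors is not shown, because it depends on information that lies outside the document itself.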
When metadata is assigned manually, errors and omissions may occur, resulting in improperly filed or categorized documents. These errors and omissions can occur for a variety of reasons, such as: humans not wanting to perform the additional task of assigning metadata, humans being inconsistent on a judgment call, and users being improperly trained.
Additionally, as all systems change over time, documents which were originally assigned to one category might be better placed at a later date in a different category, due to additions to the system, changes, and general “drift” of the data model used for representing metadata.
Enterprises may employ a variety of techniques to ensure compliance with metadata standards and proper assignment of metadata. For application-level enforcement, the application requires metadata to be selected, assigned, or entered for a document before the document can be submitted to a repository, but the user selecting the metadata may select any valid value or set of values, which may not be accurate. For workflow-level enforcement, a workflow is invoked to perform a “check” or “quality control” of metadata assignment, and this relies on humans accurately assigning metadata. For automatic assignment based on technology, humans are not involved in categorizing the documents or assigning the metadata, but drift may still occur.
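The limitation of application-level enforcement can be sketched as follows. The category list, the submit function, and the dictionary used as a repository are all hypothetical; the sketch only assumes that the application checks for the presence and validity of a value, not for its accuracy.

```python
# Hypothetical set of values the application accepts as valid.
VALID_CATEGORIES = {"Finance", "Legal", "Human Resources"}


class MetadataError(ValueError):
    """Raised when a document is submitted without acceptable metadata."""


def submit(repository, name, content, category=None):
    """Store a document only if a valid category has been supplied."""
    if category is None:
        raise MetadataError("a category is required before submission")
    if category not in VALID_CATEGORIES:
        raise MetadataError(f"unknown category: {category!r}")
    # The check ends here: nothing compares the category to the content.
    repository[name] = {"content": content, "category": category}
    return repository[name]
```

Under these assumptions, submitting an employment contract with the category "Finance" succeeds, because "Finance" is a valid value even though it is not an accurate one.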