An information management platform may support a variety of document-related functions such as archiving, data protection and electronic discovery, and for these purposes such a platform may employ a searchable document repository. Document repositories generally track documents from different data sources (e.g., electronic mail, a file system, or shared document locations such as SharePoint or Confluence). Regardless of the source, documents can in general be thought of as comprising structured attributes (or “metadata”) and unstructured content. An email, for example, has structured attributes such as Sent Time, Sent By, and Sent To, along with unstructured content such as the body of the email (the message) or attachments to the email. Searches on the repository can be directed to the structured attributes, the unstructured content, or both (so-called mixed searches).
In one traditional architecture for document repositories, structured attributes are stored in a database and the unstructured content is stored in a conventional file system. The content is indexed using a full text engine to make it searchable. Many modern databases (such as SQL Server or Oracle) provide full text engines. Additionally, there are standalone full text engines such as Lucene, FAST and dtSearch. In the traditional architecture, metadata-only searches are served by the database alone. A content search requires using the full text indexes and then augmenting the results via a database “join” operation to return the metadata attributes of the qualifying documents. Mixed searches require both a search on the database for the structured metadata part and a search on the full text indexes for the content part, followed by an intersection of the results done via “joins” in the database.
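The mixed-search flow described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database as the metadata store and a plain Python dictionary as a stand-in for the full text engine. The table name `documents`, the column names, the sample data, and the helper functions `build_index`, `content_search`, and `mixed_search` are all hypothetical and chosen only for illustration; they do not correspond to any particular product.

```python
import sqlite3

# Hypothetical metadata store: one row of structured attributes per document.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, sent_by TEXT, sent_time TEXT)"
)
db.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "alice@example.com", "2020-01-05"),
        (2, "bob@example.com", "2020-01-06"),
        (3, "alice@example.com", "2020-02-10"),
    ],
)

# Hypothetical unstructured content, keyed by document id.
contents = {
    1: "quarterly budget review",
    2: "budget approval needed",
    3: "team lunch on friday",
}


def build_index(contents):
    """Build a toy inverted index (term -> set of doc ids).

    This stands in for a real full text engine and is an assumption
    of the sketch, not a description of any specific engine.
    """
    index = {}
    for doc_id, text in contents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


full_text_index = build_index(contents)


def content_search(term):
    """Content part of the search: ids of documents containing the term."""
    return full_text_index.get(term.lower(), set())


def mixed_search(term, sent_by):
    """Intersect full text hits with a metadata predicate via a database join."""
    hits = content_search(term)
    if not hits:
        return []
    # Load the full text hits into a temporary table so the database can join on them.
    db.execute("CREATE TEMP TABLE IF NOT EXISTS hits (doc_id INTEGER PRIMARY KEY)")
    db.execute("DELETE FROM hits")
    db.executemany("INSERT INTO hits VALUES (?)", [(d,) for d in hits])
    return db.execute(
        "SELECT d.doc_id, d.sent_by, d.sent_time "
        "FROM documents d JOIN hits h ON h.doc_id = d.doc_id "
        "WHERE d.sent_by = ? ORDER BY d.doc_id",
        (sent_by,),
    ).fetchall()
```

In this sketch, `mixed_search("budget", "alice@example.com")` first obtains the documents containing the term from the index and then relies on the join to keep only those whose metadata matches the sender. Every content or mixed search therefore touches both the index and the database.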
It is known to utilize so-called “single instancing” of content, such that identical content appearing in multiple distinct documents is stored and indexed only once. As an example, an email with an attachment may be sent to 100 recipients whose documents are tracked in a document repository. The repository saves the emails as 101 logical documents (one for each recipient and one for the sender), but with content single instancing, the body and attachment (which are the same for all 101 copies) are each stored only once and indexed only once, and the single instance of each is linked to the 101 logically separate documents.
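The 101-document example can be sketched as follows. The sketch assumes that identical content is detected by comparing SHA-256 digests of the content bytes; the text above does not specify how identity is determined, so this choice is an assumption. The class `SingleInstanceStore`, its attributes, and the sample addresses are hypothetical, and the `index_calls` counter merely stands in for a call to a full text engine.

```python
import hashlib


class SingleInstanceStore:
    """Toy repository that stores and indexes identical content only once."""

    def __init__(self):
        self.blobs = {}       # content digest -> content bytes (stored once)
        self.documents = {}   # doc_id -> owner and links to shared content
        self.index_calls = 0  # counts how often content is "indexed"

    def _store_content(self, data):
        # Assumption: a SHA-256 digest identifies identical content.
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data
            self.index_calls += 1  # stand-in for full text indexing
        return key

    def add_document(self, doc_id, owner, parts):
        # Each logical document keeps only links to the shared content.
        keys = [self._store_content(p) for p in parts]
        self.documents[doc_id] = {"owner": owner, "content_keys": keys}


store = SingleInstanceStore()
body = b"Please see the attached report."
attachment = b"hypothetical attachment bytes"

# One logical document for the sender plus one for each of 100 recipients.
owners = ["sender@example.com"] + [f"recipient{i}@example.com" for i in range(100)]
for doc_id, owner in enumerate(owners):
    store.add_document(doc_id, owner, [body, attachment])
```

After the loop, `store.documents` holds 101 logical documents, while `store.blobs` holds only two entries (the body and the attachment), each indexed once and linked from every logical document.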