As data volume grows, the ability to search the data effectively becomes increasingly critical. One problem is that the index needed to support searches of the data tends to be very large and to require considerable time and computing resources to create. Another problem is that in many environments (e.g., data protection or backup systems, version control systems, email servers, etc.), the data being indexed contains a great deal of similarity, so that the search results tend to be cluttered with similar data.
One of the most popular indexes is the inverted index, as shown in FIG. 1. An inverted index (also referred to as a postings file or inverted file) is an index structure that stores a mapping from words to their locations in a document or a set of documents, allowing full-text search.
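For illustration only, the word-to-location mapping of an inverted index can be sketched as follows. This is a minimal sketch and not the structure of FIG. 1: the function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to a posting list of (file_id, position) pairs.

    `documents` is a dict of file_id -> text. Tokenization by whitespace
    and lowercasing is an assumption made for this sketch.
    """
    index = defaultdict(list)
    for file_id, text in sorted(documents.items()):
        for position, word in enumerate(text.lower().split()):
            index[word].append((file_id, position))
    return dict(index)


def search(index, word):
    """Return the sorted IDs of the files that contain `word`."""
    return sorted({file_id for file_id, _ in index.get(word.lower(), [])})


# Hypothetical sample data.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}
index = build_inverted_index(documents)
print(index["quick"])        # [(1, 1), (3, 0), (3, 1)]
print(search(index, "dog"))  # [2, 3]
```

Each posting records both the file and the position within it, which is what allows a full-text search to report where a word occurs and not merely whether it occurs.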
One conventional approach to reducing the size of the index is to encode the file IDs in the posting lists using a smaller number of bits. The resulting index is, however, still very large. This approach does not leverage any similarity in the data and is an orthogonal and complementary approach to the current invention. Another conventional approach is to detect and skip identical files so that only one copy of a file is indexed. This, however, has no effect on near-identical files.
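The two conventional approaches above can be sketched as follows. The text does not specify an encoding, so gap encoding followed by variable-byte encoding is assumed here as one common way of storing file IDs in fewer bits. Likewise, detecting identical files by a SHA-256 hash of their content is an assumption. All function names and sample data are hypothetical.

```python
import hashlib


def encode_posting_list(file_ids):
    """Gap-encode sorted file IDs, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for file_id in sorted(file_ids):
        gap = file_id - previous
        previous = file_id
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)  # final byte of this gap, continuation flag clear
    return bytes(out)


def decode_posting_list(data):
    """Invert encode_posting_list and return the sorted file IDs."""
    file_ids, current, shift, previous = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += current
            file_ids.append(previous)
            current, shift = 0, 0
    return file_ids


def unique_files(files):
    """Keep one file name per distinct content hash (exact duplicates only)."""
    seen = {}
    for name, content in files.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        seen.setdefault(digest, name)
    return sorted(seen.values())


# Hypothetical sample data.
ids = [1000, 1003, 1010, 5000]
encoded = encode_posting_list(ids)
print(len(encoded), decode_posting_list(encoded) == ids)

files = {"a.txt": "hello world", "b.txt": "hello world", "c.txt": "hello world!"}
print(unique_files(files))  # ['a.txt', 'c.txt']
```

In the second example the exact duplicate `b.txt` is skipped, but `c.txt`, which differs by a single character, is still indexed in full. This illustrates why skipping identical files has no effect on near-identical files.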
A more recent approach is to partition original files into virtual sub files and to index the virtual sub files. By carefully tracking how the sub files map to the original files, it is possible to preserve the traditional query semantics during query processing. Such an approach requires significant changes to a query engine. Because of the need to quickly map sub files to original files during query processing, assumptions have to be made about the similarity model and/or restrictions have to be placed on the types of queries that can be handled. For example, one assumption may be that similar files share content in a top-down hierarchical manner, and one restriction may be that proximity queries are disallowed. In addition, the mapping must be arranged so carefully that the index cannot be incrementally updated. Because the traditional query semantics are preserved during query processing, this approach does not address the cluttering of query results with similar data.
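The sub file approach can be illustrated with a deliberately simplified sketch. It assumes that each line of a file is one virtual sub file and that only single-word queries are supported. Neither assumption comes from the text above, and the names `partition`, `build_subfile_index`, and `search` are hypothetical.

```python
from collections import defaultdict


def partition(text):
    """Split a file into virtual sub files; one per non-empty line (assumed)."""
    return [line for line in text.splitlines() if line.strip()]


def build_subfile_index(files):
    """Index each distinct sub file once and track which files contain it."""
    subfile_ids = {}            # sub file content -> sub file ID
    owners = defaultdict(set)   # sub file ID -> original files containing it
    index = defaultdict(set)    # word -> IDs of sub files containing the word
    for name, text in files.items():
        for chunk in partition(text):
            if chunk not in subfile_ids:
                subfile_ids[chunk] = len(subfile_ids)
                for word in chunk.lower().split():
                    index[word].add(subfile_ids[chunk])
            owners[subfile_ids[chunk]].add(name)
    return index, owners


def search(index, owners, word):
    """Map matching sub files back to the original files that contain them."""
    result = set()
    for sub_id in index.get(word.lower(), ()):
        result |= owners[sub_id]
    return sorted(result)


# Hypothetical sample data: two similar versions of one file.
files = {
    "v1.txt": "shared header\nalpha release notes",
    "v2.txt": "shared header\nbeta release notes",
}
index, owners = build_subfile_index(files)
print(len(owners))                      # 3 distinct sub files are indexed
print(search(index, owners, "header"))  # ['v1.txt', 'v2.txt']
print(search(index, owners, "beta"))    # ['v2.txt']
```

The shared line is indexed only once, yet a query for `header` still returns both versions, because the mapping back to original files preserves the traditional query semantics. The sketch also handles only single-term lookups. A query whose terms fall in different sub files of the same original file would need additional mapping logic, which is consistent with the query-engine changes and query restrictions noted above.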