The present invention relates generally to the field of searching documents. More specifically, the present invention is related to a method for performing search queries on a document level based on block-level indexes.
It often happens that multiple documents comprise sections, portions or components with identical content. For example, an email may be replied to or forwarded many times, and every reply or forwarded email contains the originally sent email. The same occurs in social media posts, e.g. in wikis, blogs, etc. In compound documents, such as emails with attachments or ZIP files, whole sub-documents (e.g. the attachments or the files in a ZIP) may be shared by many different top-level documents (e.g. different emails or ZIP files).
When indexing multiple documents comprising portions of identical content, specifically in full-text indexing, the duplicated portions of content are indexed multiple times, leading to redundant information within the index and a large index size.
A technology providing a non-redundant index representation for duplicated sections in documents would be beneficial because it would save both the computing capacity otherwise wasted on reanalyzing and indexing identical sections and the repository space (i.e. any kind of persistent storage, e.g. hard disk) needed for storing the index. On the other hand, such a technology has to provide means for searching documents as usual on a document level, and not on a block level, i.e. oriented towards search and retrieval of portions of documents.
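The trade-off described above can be illustrated with a minimal sketch. It is not the claimed method, only an illustration under stated assumptions: the class name `BlockIndex`, the use of a SHA-256 content hash as block identifier, whitespace tokenization, and the AND semantics of `search` are all hypothetical choices made for this example. Each distinct block is analyzed once, while queries are still answered with document identifiers.

```python
import hashlib
from collections import defaultdict


class BlockIndex:
    """Toy inverted index that stores each distinct block only once (illustrative)."""

    def __init__(self):
        self.postings = defaultdict(set)    # term -> ids of blocks containing the term
        self.block_docs = defaultdict(set)  # block id -> ids of documents containing the block
        self.blocks_analyzed = 0            # number of blocks actually tokenized

    def add_document(self, doc_id, blocks):
        for text in blocks:
            # Assumption: identical content is detected via a content hash.
            block_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if block_id not in self.block_docs:
                # First occurrence of this content: analyze and index it.
                self.blocks_analyzed += 1
                for term in text.lower().split():
                    self.postings[term].add(block_id)
            # Always record that this document contains the block.
            self.block_docs[block_id].add(doc_id)

    def docs_for_term(self, term):
        """Resolve a single term to the documents whose blocks contain it."""
        docs = set()
        for block_id in self.postings.get(term.lower(), ()):
            docs |= self.block_docs[block_id]
        return docs

    def search(self, query):
        """Document-level AND query: terms may match in different blocks."""
        terms = query.split()
        if not terms:
            return set()
        result = self.docs_for_term(terms[0])
        for term in terms[1:]:
            result &= self.docs_for_term(term)
        return result


# Three emails sharing the same originally sent text.
original = "quarterly budget meeting moved to friday"
index = BlockIndex()
index.add_document("mail-1", [original])
index.add_document("mail-2", ["thanks works for me", original])
index.add_document("mail-3", ["forwarding for your information", original])

print(index.blocks_analyzed)                   # 3 distinct blocks analyzed, 5 blocks submitted
print(sorted(index.search("budget friday")))   # all three emails
print(sorted(index.search("thanks budget")))   # only mail-2
```

Note that the last query matches `mail-2` even though its two terms occur in different blocks; the intersection is taken over documents, not over blocks, which is what keeps the search on a document level.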
Hence, there is a need for an efficient and user-friendly method for performing search queries against documents, specifically text documents, that contain identical portions, specifically text portions.