Some backup streams are block-based backups. These blocks are structured data sets that consist of a metadata portion and a data portion. Content-based anchoring does not work well with block-based backups because the anchors can span block boundaries. For block backups, block-based anchoring is used to align anchor points at the block boundaries, which yields the most effective deduplication for such backups. If the blocks are larger than the maximum supported segment size of a file system, hybrid anchoring/chunking may be used. In hybrid chunking, anchoring is done on the block boundaries, with content-based anchoring in between. Hybrid chunking ensures that deduplication opportunities are not lost within a block, in addition to those at the block boundaries.
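Hybrid chunking as described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the block size, the anchor mask, and the toy hash standing in for a proper rolling hash (such as a Rabin fingerprint) are all assumptions for the example.

```python
BLOCK_SIZE = 8192        # assumed block size of the backup stream
ANCHOR_MASK = 0x1FFF     # content anchor when hash & mask == 0 (assumed)

def weak_hash(window: bytes) -> int:
    # Toy stand-in for a rolling hash such as a Rabin fingerprint.
    h = 0
    for b in window:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def hybrid_anchors(stream: bytes, window: int = 48) -> list[int]:
    """Return anchor offsets: every block boundary (block-based
    anchoring) plus any content-based anchors found in between."""
    anchors = set(range(0, len(stream), BLOCK_SIZE))  # block boundaries
    for off in range(window, len(stream)):
        if off % BLOCK_SIZE == 0:
            continue  # already anchored at the block boundary
        if weak_hash(stream[off - window:off]) & ANCHOR_MASK == 0:
            anchors.add(off)  # content-based anchor inside the block
    return sorted(anchors)
```

Because every block boundary is always an anchor, unchanged blocks deduplicate against earlier backups, while the content-based anchors inside each block catch duplicate data that does not line up with the boundaries.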
For some data blocks, such as Oracle® data blocks, the metadata portion can change more frequently than the data portion, which can degrade deduplication of database backups. When that happens, the metadata needs to be removed before anchoring is done. In such cases, the metadata portion of these data blocks is treated as a marker by the file system of a storage system. A marker refers to a metadata portion inside a data stream, which may be introduced by backup applications; these markers change even when the data portion of the stream has not changed. Anchoring is done on the data portion of the block after removing the metadata portion, and the metadata is stored separately inside the file system.
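The metadata separation step might be sketched as below. The fixed per-block layout and the `META_SIZE` and `BLOCK_SIZE` values are assumptions for illustration only; real block formats carry their own header layouts.

```python
META_SIZE = 16           # assumed size of the metadata (marker) portion
BLOCK_SIZE = 8192        # assumed total block size

def split_stream(stream: bytes) -> tuple[list[bytes], bytes]:
    """Split each block into its metadata and data portions.
    Anchoring/deduplication is then performed only on the concatenated
    data portions, while the metadata portions are stored separately."""
    metas, datas = [], []
    for off in range(0, len(stream), BLOCK_SIZE):
        block = stream[off:off + BLOCK_SIZE]
        metas.append(block[:META_SIZE])   # marker, stored separately
        datas.append(block[META_SIZE:])   # data portion, anchored
    return metas, b"".join(datas)
```

With this split, two backups whose blocks carry identical data but updated metadata still produce an identical data stream for anchoring, so the frequently changing markers no longer break deduplication.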
In a conventional scheme, a byte pattern is used to first find a marker. Once the marker is detected, it is removed and block-based or hybrid anchoring is performed. In this scheme, markers are found first, and boundaries are then detected relative to the marker pattern location. Detecting the marker first using a pattern has the side effect of wrongly detecting markers when the pattern is weak. For example, for a particular data stream such as an Oracle® data stream, the pattern used to determine whether a marker is present is weak and can therefore produce false positives. Frequent false positives can hurt deduplication of the data stream. Another stream can be misidentified as an Oracle® stream, hurting deduplication due to forced anchors at presumed block boundaries. An Oracle® stream containing a false-positive marker pattern in the data portion of a block can likewise result in forced anchoring and hurt deduplication.
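The false-positive risk of the conventional pattern-first scheme can be illustrated with a small sketch. The two-byte signature here is hypothetical; the point is that a short, weak pattern can occur by chance inside the data portion, and the scanner cannot tell such an occurrence apart from a real marker.

```python
MARKER_PATTERN = b"\xA5\x5A"   # assumed weak marker signature

def find_marker_offsets(stream: bytes) -> list[int]:
    """Return every offset where the pattern matches. With a weak
    pattern, some matches are false positives: data bytes that happen
    to look like a marker, triggering forced anchors at presumed
    block boundaries."""
    hits, start = [], 0
    while (i := stream.find(MARKER_PATTERN, start)) != -1:
        hits.append(i)
        start = i + 1
    return hits
```

In the test below, only the match at offset 0 is a real marker; the second match sits inside the data portion, yet the pattern-first scheme would force an anchor there as well.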